When you think of choosing a database for your application or website, which ones feature on your list? Would they be MySQL, Oracle, PostgreSQL, or MS SQL? Have you noticed what’s common among all these databases? Yes, they all are RDBMSs, and use SQL to interact with data. What if you wanted to look beyond RDBMSs and SQL, to something that’s new, unconventional, and provides much better performance and scalability?
It has always bothered me that whenever I start reading books on database management systems, they always presume the relational model of data, and the fact that each and every type of data has to be stored in a table, while NoSQL databases are introduced in the “Other Databases” section.
Don’t you think that’s unfair? We need a more flexible approach: rather than trying to find a way to fit the data into a database, we should be trying to find a way to match the database to the data. Thus, our choice should essentially depend on the type of data we are trying to store in the database.
You might think that NoSQL means “anti-SQL”; in fact, it stands for “not only SQL”. Its intention is not to oppose SQL, but to provide a viable alternative in terms of storage and the ways of interacting with a database. Many NoSQL databases also provide an SQL-like query interface. Actually, both kinds of databases are complementary; they are built to solve different problems, and can easily coexist with each other.
The most common misconceptions about NoSQL are:
- “NoSQL databases are immune to security issues such as SQL injection attacks.” In fact, since most NoSQL databases are still not mature enough, security is still a big issue with them.
- “Switch over to NoSQL if you find RDBMSs or SQL hard to use.” Chances are, if you find conventional databases difficult, NoSQL will prove to be an even bigger headache — so you’d better not jump into uncharted territory!
Traditionally, a database system was expected to satisfy the four ACID properties (Atomicity, Consistency, Isolation and Durability), without which it was not considered good. NoSQL databases depart from this concept in favour of the more recent CAP Theorem, or Brewer’s Theorem, formulated by Eric Brewer in 2000. This theorem names three basic properties, Consistency, Availability and Partition Tolerance, and postulates that a distributed database can guarantee at most two of them at the same time.
Many distributed NoSQL databases deal with this trade-off by using what is known as eventual consistency. This is a more relaxed form of consistency, wherein all copies of the data agree after a sufficient period of time, instead of agreement being guaranteed immediately after every write. Doing this improves availability and scalability to a large extent. Some people have started calling this paradigm BASE (Basically Available, Soft state, Eventual consistency).
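To make eventual consistency a little more concrete, here is a minimal sketch in Python. It is not how any particular NoSQL database does it: the `Replica` class, the `sync` function and the last-write-wins rule based on timestamps are all assumptions I have made purely for illustration.

```python
class Replica:
    """A toy replica (hypothetical) that stores key -> (timestamp, value)."""

    def __init__(self, name):
        self.name = name
        self.data = {}

    def write(self, key, value, timestamp):
        # Assumed conflict rule: last write wins, judged by timestamp.
        current = self.data.get(key)
        if current is None or timestamp > current[0]:
            self.data[key] = (timestamp, value)

    def read(self, key):
        entry = self.data.get(key)
        return None if entry is None else entry[1]


def sync(a, b):
    """Background exchange of entries so that both replicas converge."""
    for key, (ts, value) in list(a.data.items()):
        b.write(key, value, ts)
    for key, (ts, value) in list(b.data.items()):
        a.write(key, value, ts)


r1, r2 = Replica("r1"), Replica("r2")
r1.write("status", "online", timestamp=1)  # accepted by r1 only
stale = r2.read("status")                  # r2 has not seen the write yet
sync(r1, r2)                               # replication catches up later
fresh = r2.read("status")                  # both replicas now agree
```

The write is accepted immediately by one node (good for availability), and a reader on the other node may briefly see old data until the replicas have exchanged their entries.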
You might argue that RDBMSs are mature, robust, reliable and time-tested. But, did you know that the first paper on the relational model appeared back in 1970? In my opinion, that’s pretty old! NoSQL databases are the future, and are bound to be the ascendant technology in the upcoming years. Not only do they leverage the modern concepts of cloud and distributed computing effectively, but also revolutionise the way in which we have been thinking about DBMSs.
NoSQL databases are also being used by some of the largest Internet firms in their software architecture. Yet, you will rarely hear about a particular firm having switched over completely; what firms have been doing lately is using different databases for different types of applications and features. Take Facebook, for example: apart from using MySQL as a database, it has used Apache Cassandra as a storage system for the reverse indexes behind inbox search in Facebook messages, while also employing Apache Hadoop for other purposes.
Here’s what’s on the Facebook Engineering blog, “Apache Hadoop is being used in three broad types of systems: as a warehouse for Web analytics, as storage for a distributed database, and for MySQL database backups.”
Now, let us discuss the common features that NoSQL databases have, and why they are getting more popular these days:
- Big data: When it comes to handling big data (a huge number of read-write cycles, a massive number of users, and data measured in petabytes), NoSQL databases are built to cope.
- Schema-less: Most NoSQL databases are schema-less, and very flexible. They provide many choices when it comes to constructing a schema, and hence objects can easily be mapped into them. You can get rid of techniques like normalisation and complex joins!
- Programmer-friendly: These databases have the advantage of providing simple APIs in most major programming languages, which reduces the need for complex ORM frameworks. Even where a client library isn’t available for a particular programming language, data can often be accessed over HTTP via a simple RESTful API, using XML or JSON.
- Availability: Since most NoSQL databases are distributed, they provide easy replication of data, and failure of one node does not affect the availability of data — it only becomes a minor hindrance in terms of performance.
- Highly scalable: This is the major reason these databases are all the rage these days, because they do not require a dedicated high-performance database server. In fact, they can easily be run on a cluster of commodity hardware; scaling out is just a matter of adding a new node.
- Low latency: Latency of the order of a few milliseconds can be achieved using these databases, although it also depends on the amount of data that can be loaded into memory. However, since we might mostly be dealing with a cluster of data servers, I don’t think memory would be a problem.
NoSQL data models
Some of the major NoSQL databases can be differentiated into the following types:
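- Document stores
- Hierarchical databases
- Network (graph) databases
- Column-oriented databases
- Object-oriented databases
- Key-value stores
- Triple stores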
Nowadays, data is no longer as elementary as simple rows or columns; it is often represented in the form of XML or JSON on the Web, since these technologies are highly portable, compact and standardised. So, instead of trying to map these XML or JSON documents into a relational form, it makes much more sense to use some of the document stores already available in the market.
As you can guess, these databases are schema-free, since there is no predefined format for an XML/JSON document, and each document is independent of the others.
Sample use-cases for document stores include CRM, Web-related data, real-time data, etc. Some of the most popular implementations include databases like MongoDB, CouchDB and RavenDB. In practice, databases like MongoDB have been used by websites like Foursquare, bit.ly, SourceForge, etc, to store their data in production environments.
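The schema-free nature of a document store can be sketched in a few lines of Python. This does not use the API of MongoDB, CouchDB or RavenDB; the `collection` list, the sample customer documents and the `find` helper are hypothetical names invented for this illustration.

```python
import json

# Two JSON documents in the same (hypothetical) "customers" collection.
# There is no fixed schema: the second document has fields the first lacks.
collection = [
    json.loads('{"_id": 1, "name": "Asha", "email": "asha@example.com"}'),
    json.loads('{"_id": 2, "name": "Ravi", "phones": ["555-0100"],'
               ' "address": {"city": "Pune"}}'),
]


def find(docs, **criteria):
    """Return the documents whose top-level fields equal all given criteria."""
    return [d for d in docs
            if all(d.get(field) == value for field, value in criteria.items())]


matches = find(collection, name="Ravi")
city = matches[0]["address"]["city"]  # nested data travels with the document
```

Notice that nothing had to be declared before storing the second document with its extra `phones` and `address` fields; in a relational design, those would typically need new columns or separate tables.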
Hierarchical databases, as is obvious from the term, store data that is hierarchically related, in the form of a tree or a parent-child relationship. In terms of the relational model, this may be described as a 1:N relationship. More recently, geospatial databases have used a hierarchical model to store location information, which is inherently hierarchical (in the same sense that continents are made up of countries, which comprise states, and then come cities, and so on), although the actual algorithms vary.
Features like geotagging and geolocation are getting increasingly popular these days. A geospatial database specialises in storing geographical information that is spatially relevant, and can be used in a Geographical Information System. Some examples of these databases include PostGIS, Oracle Spatial, etc. Some other implementations of hierarchical databases include the IMS database developed by IBM — and, of course, the Windows registry.
The most popular form of the network database is the graph database, used to store data that can easily be represented in the form of a graph using graph theory. This type of data typically has a potential to grow exponentially. These databases can be used to store data that changes frequently; representing it in the form of tables would seriously undermine the ways in which the data can be queried.
A graph database can implement an inbuilt graph engine that can deliver very high performance when it comes to traversing graph data. A practical example of a graph database is FlockDB, developed and used by Twitter to implement a database of its users, and represent a graph of who follows whom. It uses the Gizzard framework to query the database up to 10,000 times a second.
A general technique to query a graph is to start from an arbitrary or specified node, and then traverse the graph in a depth-first or breadth-first fashion, following the relationships that match the specified criteria. Major graph databases allow the developer to use simple APIs to do such work, and make it a trivial task. For example, they may allow us to make queries like, “Does Rohan know someone who is a lawyer by profession?”
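The traversal described above can be sketched as a plain breadth-first search in Python. This is only an illustration of the idea, not the API of any graph database; the `knows` and `profession` dictionaries, the names other than Rohan, and the `knows_someone_with` function are all made up for the example.

```python
from collections import deque

# Hypothetical data: who knows whom, and each person's profession.
knows = {
    "Rohan": ["Meera", "Arjun"],
    "Meera": ["Kabir"],
    "Arjun": [],
    "Kabir": [],
}
profession = {
    "Rohan": "engineer",
    "Meera": "doctor",
    "Arjun": "teacher",
    "Kabir": "lawyer",
}


def knows_someone_with(start, wanted, max_depth):
    """Breadth-first search along 'knows' edges, at most max_depth hops away."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        person, depth = queue.popleft()
        if person != start and profession.get(person) == wanted:
            return person
        if depth < max_depth:
            for friend in knows.get(person, []):
                if friend not in seen:
                    seen.add(friend)
                    queue.append((friend, depth + 1))
    return None


direct = knows_someone_with("Rohan", "lawyer", max_depth=1)    # friends only
indirect = knows_someone_with("Rohan", "lawyer", max_depth=2)  # friends of friends
```

With this sample data, Rohan does not know a lawyer directly, but he reaches one through Meera. A real graph database would run this kind of traversal inside its own engine instead of in application code.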
Some of the most popular graph databases include Neo4j, HyperGraphDB, etc. These can be used for many applications, of which the most popular one is (of course!) social networking, while others include complex graph analysis, security, genetics, etc.
Column-oriented NoSQL databases were originally inspired by Google’s research paper on BigTable, the distributed storage system that Google uses internally along with the Google File System as the distributed filesystem. The paper has led to various open source interpretations of the data model. Some of the most popular ones, which also lead the NoSQL bandwagon, are Hadoop HBase, Apache Cassandra, HyperTable, etc.
Here, rather than being stored in rigid table-like rows and columns, data is stored more like a sparse matrix (think of a sparsely filled Excel sheet), where only the column families are defined up front, and the columns themselves can be defined dynamically.
Conceptually, it is stored like a three-dimensional array, where one dimension is the row identifier (a primary key); the second is the combination of the column family and the column identifier; and the third is the timestamp, wherein several revisions of the same data are stacked on top of each other. Column-oriented databases like Cassandra are being used by Facebook, Digg, Reddit, Cisco WebEx, etc.
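The three dimensions described above can be sketched with nested Python dictionaries. This is a toy model of the idea only, not the storage format or API of BigTable, HBase or Cassandra; the `ColumnFamilyTable` class, its methods and the sample rows are hypothetical.

```python
class ColumnFamilyTable:
    """Toy model: row key -> "family:column" -> timestamp -> value."""

    def __init__(self, column_families):
        # Only the column families are fixed up front.
        self.column_families = set(column_families)
        self.rows = {}

    def put(self, row_key, family, column, value, timestamp):
        if family not in self.column_families:
            raise KeyError(f"unknown column family: {family}")
        # Columns are created on the fly, so rows can be sparse.
        cell = self.rows.setdefault(row_key, {}).setdefault(f"{family}:{column}", {})
        cell[timestamp] = value

    def get(self, row_key, family, column, timestamp=None):
        versions = self.rows.get(row_key, {}).get(f"{family}:{column}", {})
        if not versions:
            return None
        if timestamp is None:
            return versions[max(versions)]  # newest revision
        older = [ts for ts in versions if ts <= timestamp]
        return versions[max(older)] if older else None


table = ColumnFamilyTable(["profile"])
table.put("user1", "profile", "city", "Delhi", timestamp=1)
table.put("user1", "profile", "city", "Mumbai", timestamp=2)
table.put("user2", "profile", "nickname", "shash", timestamp=1)  # only this row has it

latest = table.get("user1", "profile", "city")
earlier = table.get("user1", "profile", "city", timestamp=1)
```

Here `user1` and `user2` have completely different columns inside the same column family, and the older revision of `city` remains readable by asking for an earlier timestamp.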
Although object-oriented databases are quite “old”, having been around for quite a few years, they can still be counted as NoSQL databases, because they depart from the traditional model of relational data. These databases intend to reduce the “impedance mismatch” between object-oriented programming languages and the data persistence layer, by applying the object-oriented concept to the database itself. This allows the storage of data, in the form of objects, to be highly transparent.
These databases use object identifiers to uniquely identify each object; an object’s identifier remains the same, even if all the data within the object has changed. Some of the most popular object-oriented databases are db4o, Versant, Objectivity, NEO, etc. These have rarely been used for production or Web-scale purposes, and are usually seen in research environments.
As opposed to column-oriented databases, key-value stores are based more on Amazon’s Dynamo research paper and on distributed hash tables. When I talk about key-value stores, I mean simple ones without many frills, although technically, databases like HBase, Cassandra, etc., might also come under this umbrella.
Here, the data model has been simplified to the extent that it just contains a set of global key-value pairs, wherein every value has a unique key associated with it. This key is used to access data when and where required.
As for the “value”, the database is usually not concerned with what’s being stored, and simply stores the data as an opaque blob. The result is that we achieve utter simplicity, along with great performance, and a database that is highly scalable.
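In code, the whole data model boils down to something like the following Python sketch. It is an in-memory toy, not the interface of Project Voldemort, Redis or any other product; the `KeyValueStore` class, the `session:42` key and the session contents are hypothetical.

```python
import json


class KeyValueStore:
    """Toy key-value store: unique keys mapped to opaque blobs of bytes."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # The store never inspects the value; it only insists on raw bytes.
        if not isinstance(value, bytes):
            raise TypeError("value must be bytes")
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)


store = KeyValueStore()

# The application decides how to encode its data; JSON is just one choice.
blob = json.dumps({"user": "shashi", "cart": [3, 7]}).encode("utf-8")
store.put("session:42", blob)

raw = store.get("session:42")              # the store hands back the same bytes
session = json.loads(raw.decode("utf-8"))  # only the application decodes them
```

Because the store only ever looks at the key, there is nothing to query inside the value; that limitation is exactly what keeps the model simple to partition across many machines.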
Some key-value stores that continue to attract me include Project Voldemort, Tokyo Cabinet, Redis, GT.M, etc. It is also worth mentioning that Project Voldemort was open-sourced by LinkedIn, which uses it in its production environment.
Triple stores are also a type of network database, but differ in that they store triples of data in the form of subject-predicate-object, where the predicate determines the relation between the subject and object, while being a part of the data itself.
For example, let’s look at, “Shashi lives in Mumbai.” Here “Shashi” is the subject, “Mumbai” is the object, and “lives in” specifies the relation between Shashi and Mumbai, so it is the predicate. This type of data typically appears while building semantic Web applications, and RDF triples are used to represent such data.
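Here is a minimal Python sketch of how such triples could be stored and matched by pattern. It is not the API of Jena, Sesame or any other triple store, and it does not implement RDF or SPARQL; the `TripleStore` class, its `match` method and the extra sample facts are assumptions made for illustration.

```python
class TripleStore:
    """Toy store holding (subject, predicate, object) tuples."""

    def __init__(self):
        self.triples = set()

    def add(self, subject, predicate, obj):
        self.triples.add((subject, predicate, obj))

    def match(self, subject=None, predicate=None, obj=None):
        """Return the triples that fit the pattern; None acts as a wildcard."""
        return sorted(
            t for t in self.triples
            if (subject is None or t[0] == subject)
            and (predicate is None or t[1] == predicate)
            and (obj is None or t[2] == obj)
        )


store = TripleStore()
store.add("Shashi", "lives in", "Mumbai")
store.add("Rohan", "lives in", "Delhi")   # hypothetical extra fact
store.add("Shashi", "knows", "Rohan")     # hypothetical extra fact

in_mumbai = store.match(predicate="lives in", obj="Mumbai")
about_shashi = store.match(subject="Shashi")
```

Since the predicate is part of the data, a new kind of relation such as “knows” needs no schema change; it is just another triple.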
Here, you may say that this type of data can easily be modelled in an RDBMS. I agree, but let me point out that it will prove to be very inefficient. Besides, some of the analysis that can be done using graph theory will not be available in the relational model. Also, mapping some of the semantic queries into SQL is difficult. Some of the popular triple stores that I have encountered include Jena, Sesame, Virtuoso, AllegroGraph, etc.
Here are some of the links related to this article that I found to be interesting — check them out!
- Codd, E.F. (1970). “A Relational Model of Data for Large Shared Data Banks” [PDF]
- Facebook Engineering’s notes
- “BigTable: A Distributed Storage System for Structured Data”
- “Dynamo: Amazon’s Highly Available Key-value Store” [PDF]
- Project Voldemort: Scaling simple storage at LinkedIn
- A list of NoSQL Databases