Persistence of data is what makes application software usable. Data persistence has come a long way from simple file based storage to the latest sophisticated distributed databases. This article explores an efficient database system called MonetDB, the primary feature of which is column based storage. MonetDB also comes loaded with support for various languages such as Python, Ruby, R and PHP.
Just as the physical world is made up of atoms, the cyber world is made up of data. The effectiveness with which the data is stored and retrieved defines the quality and usability of software applications. Persistence of data in the earlier days was carried out by simply putting the data in flat files. In contrast, modern day databases are very sophisticated in terms of their architecture, data handling efficiency, etc. The world of database management systems is loaded with plenty of choices. This article explores an interesting option called MonetDB.
Most of the database management systems are row-major systems. They work on the assumption that all values of a row are fetched at a time. However, there are instances when we need to do some aggregation operations on the values of columns. In such circumstances, it would be better if the values of columns are stored together in the memory block. Column based databases enable the retrieval of the necessary and related data with efficient disk access. Such databases are better suited for online analytical processing or OLAP (e.g., data warehouses) workloads.
Some of the column based databases (Figure 1) are listed below:
- Apache Kudu
- Apache Druid
Though there are many similarities between the above options, each of them has some unique features.
MonetDB was developed by a team at Centrum Wiskunde & Informatica (CWI), Netherlands. Initially, it was called Monet after the famous French painter, Claude Monet. Later it was updated to MonetDB. It is written in C language and supports various platforms. The latest stable release was in April 2019. The main features (Figure 2) of MonetDB are listed below:
- Column store database kernel
- High performance system
- Multi-core parallel execution
- Support for different query languages (this is achieved through its proprietary algebraic language called MonetDB Assembly Language or MAL)
- Extensible database system
- Support for a broad palette of application domains with the integration of external libraries such as PCRE, Raptor, libxml, Geos, etc.
- Open source
MonetDB has a three-layer architecture. The top layer is for providing an SQL interface, the middle layer holds optimisers for MonetDB Assembly Language (MAL), and the bottom-most layer is the database kernel for providing access to the Binary Association Tables.
Detailed installation instructions for various platforms are available in the official documentation site, https://www.monetdb.org/Downloads. Though Windows 32-bit and 64-bit installers are given, the official documentation recommends Linux 64-bit binaries for scenarios in which the tables are expected to grow larger than 2GB disk space. If you wish to compile from the source, follow the instructions at https://www.monetdb.org/Developers/SourceCompile.
In the Linux environment, start the MonetDB daemon monetdbd. This daemon is controlled by the monetdb application. A simple session demonstration is shown in the following code segment:
shell> monetdbd create /path/to/mydbfarm shell> monetdbd start /path/to/mydbfarm shell> monetdb create voc shell> monetdb release voc shell> mclient -u monetdb -d voc password:<monetdb> Welcome to mclient, the MonetDB/SQL interactive terminal (Oct2014-SP1) Database: MonetDB v11.19.7 (Oct2014-SP1), ‘mapi:monetdb://localhost:50000/voc’ Type \q to quit, \? for a list of available commands auto commit mode: on sql>SELECT ‘hello world’; +---------------+ | single_value | +===============+ | hello world | +---------------+ 1 tuple (0.530ms)
The following code uses the VOC data set, which is available at https://www.monetdb.org/sites/default/files/voc_dump.zip. It’s a compressed file consisting of SQL statements. After the extraction of its contents, it is loaded into MonetDB with the help of mclient. The official documentation suggests that before loading the VOC data set, it is better to create a new user. The user creation and schema setting is shown below:
shell> mclient -u monetdb -d voc password:<monetdb> sql>CREATE USER “voc” WITH PASSWORD ‘voc’ NAME ‘VOC Explorer’ SCHEMA “sys”; sql>CREATE SCHEMA “voc” AUTHORIZATION “voc”; sql>ALTER USER “voc” SET SCHEMA “voc”;
A test table is created as shown below:
sql>CREATE TABLE test (id int, data varchar(30)); operation successful (0.722ms)
voc_dump.sql can be imported with the use of the following code:
shell> mclient -u voc -d voc voc_dump.sql password:<voc>
The VOC data set has data for about 8000 voyages. You may check the data that you loaded using the steps mentioned earlier, with the following SELECT statement.
sql>SELECT count(*) FROM voyages; +------+ | L1 | +======+ | 8131 | +------+
Various operations can be carried out with the loaded data set. A sample is given below:
sql>SELECT chamber, CAST(AVG(invoice) AS integer) AS average FROM invoices WHERE invoice IS NOT NULL GROUP BY chamber ORDER BY average DESC; +---------+---------+ | chamber | average | +=========+=========+ | A | 282996 | | Z | 259300 | | H | 150182 | | R | 149628 | | D | 149522 | | E | 149518 | | null | 83309 | +---------+---------+
Detailed instructions on all the SQL operations are provided in the official documentation.
MonetDB has support for various extensions.
- GeoSpatial: MonetDB has an interface to the simple feature specification of the Open GeoSpatial Consortium. This enables you to develop GIS based applications (https://www.monetdb.org/Documentation/Extensions/GIS).
- LifeScience: MonetDB has the SAM/BAM module, which is among the prominent standards for managing the DNA sequence alignment data (https://www.monetdb.org/bam).
- Data vaults: The latest databases are not restricted only to tabular data. They are often required to handle various files as well. MonetDB data vaults are database-linked external files (https://www.monetdb.org/Documentation/Extensions/DataVaults).
MonetDB is loaded with Python, Perl, Ruby, PHP, JDBC and ODBC interface libraries. The important point to be noted is that these interfaces are native implementations and hence do not require the installation of MonetDB client/server code. Apart from this, there is a MonetDB.R connector (http://monetr.r-forge.r-project.org/), which makes the interaction with MonetDB from R simple and elegant.
This article only explores support. MonetDB has a native Python client API. The support is provided for Python 2.7 and 3.3. This API is compatible with Python DBAPI 2.0.
The PymonetDB can be installed with pip using the following command:
pip install pymonetdb
The interaction from Python is very simple, as shown below:
> # import the SQL module > import pymonetdb > > # set up a connection. arguments below are the defaults > connection = pymonetdb.connect(username=”monetdb”, password=”monetdb”, hostname=”localhost”, database=”demo”) > > # create a cursor > cursor = connection.cursor() > > # increase the rows fetched to increase performance (optional) > cursor.arraysize = 100 > > # execute a query (return the number of rows to fetch) > cursor.execute(‘SELECT * FROM tables’) 26 > > # fetch only one row > cursor.fetchone() (1062, ‘schemas’, 1061, None, 0, True, 0, 0) > > # fetch the remaining rows > cursor.fetchall() [(1067, ‘types’, 1061, None, 0, True, 0, 0), (1076, ‘functions’, 1061, None, 0, True, 0, 0), (1085, ‘args’, 1061, None, 0, True, 0, 0), (1093, ‘sequences’, 1061, None, 0, True, 0, 0), (1103, ‘dependencies’, 1061, None, 0, True, 0, 0), (1107, ‘connections’, 1061, None, 0, True, 0, 0), (1116, ‘_tables’, 1061, None, 0, True, 0, 0), ... (4141, ‘user_role’, 1061, None, 0, True, 0, 0), (4144, ‘auths’, 1061, None, 0, True, 0, 0), (4148, ‘privileges’, 1061, None, 0, True, 0, 0)] > > # Show the table meta data > cursor.description [(‘id’, ‘int’, 4, 4, None, None, None), (‘name’, ‘varchar’, 12, 12, None, None, None), (‘schema_id’, ‘int’, 4, 4, None, None, None), (‘query’, ‘varchar’, 168, 168, None, None, None), (‘type’, ‘smallint’, 1, 1, None, None, None), (‘system’, ‘boolean’, 5, 5, None, None, None), (‘commit_action’, ‘smallint’, 1, 1, None, None, None), (‘temporary’, ‘tinyint’, 1, 1, None, None, None)]
The MAPI library provides the lowest level interface. Sample code to interact through MAPI is shown below:
> from pymonetdb import mapi > mapi_connection = mapi.Connection() > mapi_connection.connect(hostname=”localhost”, port=50000, username=”monetdb”, password=”monetdb”, database=”demo”, language=”sql”, unix_socket=None, connect_timeout=-1) > mapi_connection.cmd(“sSELECT * FROM tables;”)
To summarise, MonetDB is a time tested choice and is worth exploring if you are interested in column based databases.