JuliaDB: An Analytical Database that Performs Better than its Counterparts

0
44

Julia is a programming language that produces code which runs considerably faster than its counterparts. This article introduces JuliaDB – an analytical database that is entirely created in Julia. The various advantages of JuliaDB include JIT compilation, powerful parallelism and a super-fast CSV parser. So let’s explore JuliaDB further.

Julia is a powerful programming language. The programming languages can be broadly grouped into two categories — those that are easy to build programs with but the code created runs comparatively slower (e.g., Python), and those with which building programs requires more time but the resultant code runs considerably faster (e.g., C++). Julia is aimed at combining the benefits of both these programming worlds. If you are new to Julia, there have been earlier articles (five, to be precise) on the subject in Open Source For You itself, which can be referred to at https://www.opensourceforu.com/?s=Julia.

Figure 1: JuliaDB – Major features

JuliaDB is an analytical database. When compared with similar options such as Pandas, the features provided by JuliaDB are better. At the same time, JuliaDB is also comparatively faster.
The major features of JuliaDB are listed below:

  • Just-in-time compilation
  • Superior parallelism
  • It can store many types of data.
  • The UDF (user defined functions) are faster, as they are also JIT compiled.
  • It has a super-fast CSV parser that can handle large files without a lag.

Comparing the performance of JuliaDB with other analytical databases
As per the official website of JuliaDB, the performance with respect to large CSV files is better. Figure 2 shows a comparison between JuliaDB and Pandas (source: https://juliadb.org/ ).

Figure 2: JuliaDB vs Pandas

The design goals of JuliaDB are as follows:

  • To have the capability to load multi-dimensional data sets faster with incremental support.
  • The capability to index data and perform tasks such as aggregate, sort, filter and join.
  • The capability to store the results and load them when required in a faster and more efficient way.
  • The capability to use the parallelism to the maximum extent possible.
Feature JuliaDB Pandas xts TimeArrays
Distributed computing Yes
Data larger than memory Yes
Multiple indexes Yes
Index type(s) Any Built-ins Time Time
Value type(s) Any Built-ins Built-ins Any
Compiled UDFs Yes Yes

JuliaDB is built as an analytical database tool with all of the objectives mentioned earlier. The greatest advantage of JuliaDB is that it has been built all along in Julia itself, giving users the ability to compose queries with Julia code.

The fundamental components of JuliaDB are IndexedTable and NDParse data structures. Typed columns based storage makes the aggregation operations very fast.

Getting started with JuliaDB
Similar to any other package in Julia, JuliaDB can be added to your system with the following commands:

using Pkg
Pkg.add(“JuliaDB”)

using JuliaDB

A simple way to build a table is to use the following command:

t = table(rand(Bool, 10), rand(10))

A sample output is as shown below:

julia> t = table(rand(Bool, 10), rand(10))
Table with 10 rows, 2 columns:
1 2
───────────────
true 0.94025
true 0.99059
false 0.135366
true 0.201484
true 0.454071
true 0.462337
false 0.299111
true 0.73977
true 0.485396
false 0.169812

Here, two columns have been built. The first column carries a Boolean value and the second one a random number.

Getting and loading data
As previously stated, JuliaDB is very good at handling CSV files. You can download the hflights.csv from https://raw.githubusercontent.com/piever/JuliaDBTutorial/master/hflights.csv. Many examples in this article are based on this CSV file.
This CSV can be loaded with the following code snippet:

using JuliaDB
flights = loadtable(“hflights.csv”)

On successful loading of this table, you will get an output as shown in Figure 3.

Figure 3: Loadtable screenshot

It lists all the 21 columns and their data types.

Extracting data matching a condition
The rows that meet a particular condition can be extracted with the filter function. A sample filter function is shown below:

filter(i -> (i.Month == 10) && (i.DayofMonth == 22), flights)

Obviously, this function matches rows where the month is 10 and DayofMonth is 22. You can combine conditions as shown below:

filter(i -> (i.UniqueCarrier == “AA”) || (i.UniqueCarrier == “UA”), flights)

Selecting specific columns
Specific columns can be selected with column names. A sample Select function and its output are shown below:

julia> select(flights, (:FlightNum, :Origin, :Dest))
Table with 227496 rows, 3 columns:
FlightNum Origin Dest
──────────────────
428 “IAH” “DFW”
428 “IAH” “DFW”
428 “IAH” “DFW”
428 “IAH” “DFW”
428 “IAH” “DFW”
428 “IAH” “DFW”
428 “IAH” “DFW”
428 “IAH” “DFW”
428 “IAH” “DFW”
428 “IAH” “DFW”
428 “IAH” “DFW”
428 “IAH” “DFW”
428 “IAH” “DFW”
⋮
1231 “HOU” “SAT”
124 “HOU” “STL”
280 “HOU” “STL”
782 “HOU” “STL”
1050 “HOU” “STL”
201 “HOU” “TPA”
471 “HOU” “TPA”
1191 “HOU” “TPA”
1674 “HOU” “TPA”
127 “HOU” “TUL”
621 “HOU” “TUL”
1597 “HOU” “TUL”

Nesting and piping
When you want to carry out multiple operations one after another, it can be done in two ways — by nesting and/or piping.

In the following code snippet, we are nesting by filtering for departure delays of more than 40 (!ismissing is used to detect missing values and select two columns).

filter(i -> !ismissing(i.DepDelay > 40), select(flights, (:UniqueCarrier, :DepDelay)))

Piping can be achieved with the Lazy package.

import Lazy
Lazy.@as x flights begin
select(x, (:UniqueCarrier, :DepDelay))
filter(i -> !ismissing(i.DepDelay > 60), x)
end

Sorting
Sorting can be carried out with the Sort function, as shown below:

julia> sort(flights, :FlightNum, select = (:FlightNum, :Origin, :Dest))
Table with 227496 rows, 3 columns:
FlightNum Origin Dest

────────────────────────
1 “IAH” “HNL”
1 “IAH” “HNL”
1 “IAH” “HNL”
1 “IAH” “HNL”
1 “IAH” “HNL”
1 “IAH” “HNL”
1 “IAH” “HNL”
1 “IAH” “HNL”
1 “IAH” “HNL”
1 “IAH” “HNL”
1 “IAH” “HNL”
1 “IAH” “HNL”
1 “IAH” “HNL”
⋮
7037 “IAH” “IAD”
7037 “IAH” “IAD”
7037 “IAH” “IAD”
7037 “IAH” “IAD”
7037 “IAH” “IAD”
7231 “IAH” “IAD”
7240 “IAH” “IAD”
7240 “IAH” “IAD”
7240 “IAH” “IAD”
7240 “IAH” “IAD”
7290 “IAH” “IAD”
7290 “IAH” “IAD”

Row-wise operations
You can use map() to apply some operations to every row. In the following example, the distance is divided by airtime and it’s multiplied by 60.

speed = map(i -> i.Distance / i.AirTime * 60, flights)

Adding a new column
A new column can be added to the existing data set with the help of the transform function:

flights = transform(flights, :Speed => speed)

This adds ‘Speed’ as the 22nd column to the existing flight’s data set.
Let us suppose that for each destination, you want to find the count of the total flights and the number of distinct planes. This can be achieved with a groupby as shown below:

groupby((flight_count = length, plane_count = length∘union), flights, :Dest, select = :TailNum)

The output is shown in Figure 4.

Figure 4: Groupby output

Visualisation
One of the greatest advantages of Julia is the availability of professional grade visualisation packages, an example of which is shown below:

using StatsPlots
gr(fmt = :png) # choose the fast GR backend
@df flights scatter(:DepDelay, :ArrDelay, group = :Distance .> 1000, layout = 2, legend = :topleft)

OnlineStats
OnlineStats is a popular package to compute statistics, one observation at a time. It is based on the parallel algorithms. OnlineStats can be integrated with JuliaDB in a seamless manner. Its biggest advantage is that it can handle streaming and Big Data that does not fit in memory. Figure 6 illustrates the computation of statistics in parallel, in JuliaDB.

Figure 5: Visualisation

Integration of OnlineStats is facilitated with reduce() and groupreduce(). Code examples are as shown below:

julia> using JuliaDB, OnlineStats

julia> t = table(1:100, rand(Bool, 100), randn(100));

julia> reduce(Mean(), t; select = 3)
Mean: n=100 | value=-0.0439736

For groupreduce(), the code is:

julia> grp = groupreduce(Mean(), t, 2; select=3)
Table with 2 rows, 2 columns:
1 2
───────────────────

false Mean: n=42 | value=-0.0488427
true Mean: n=58 | value=-0.0404477
Figure 6: JuliaDB parallel computation

To summarise, JuliaDB has various features to process data. With the inherent speed advantage of Julia and the support for parallelism, JuliaDB’s performance is superior to that of its counterparts. If you are a data science enthusiast, you may find JuliaDB very useful.

LEAVE A REPLY

Please enter your comment!
Please enter your name here