It’s a Saturday morning, as I feed these words into a word processor. I have a browser running in the background, with a page on eBay open in one tab, and Engadget in another. The sheer simplicity of browsing through the product catalogues on eBay, or reading the latest and greatest tech news on Engadget, is living proof of the power of the database. The information age, the social networking age, the collaborative age, or whatever “age” you may want to call it, all revolve around data.
Shouldn’t the information age be revolving around, well, information? At this point, I would like to highlight a key difference between information and data. Information is processed data. From a philosophical standpoint, it has been wonderfully put by Gene Bellinger, Durval Castro and Anthony Mills in their article, aptly titled, “Data, Information, Knowledge, and Wisdom”; the cumulative order would be Data < Information < Knowledge < Understanding < Wisdom.
As far as computing goes in day-to-day contexts, the buck stops at information databases, which, as the word suggests, deal with what underlies information, i.e., data. Consider the product page on eBay. All you see is information, the cohesive form, which consists of the product images, the description, the price, information about the seller, etc. The data is beneath this processed presentation in the browser.
In this example, the individual components are data — like the words in the description. The words, by themselves, can’t stand independently on the page. Only when they are in context with one another, and with other data, do they convey something that makes sense. When they are processed, they turn into information.
The domain of databases is vast, and fills volumes and volumes of mammoth publications. The scope of this article is to give you a taste and understanding of databases, how to go about designing one, and I will conclude with designing an actual but hypothetical database, and implementing it.
So, what are databases?
Databases are structures that hold data. The software that enables the flow of data through these structures is called a database management system or DBMS. The most widely used system of DBMS is something called RDBMS or Relational DBMS. This simply means that the data is stored in tables; moreover, whatever relationships that exist within the data are stored within tables.
To put it simply, there are three parts that make a database:
- Tables: We all know what tables are — a matrix of rows and columns. In databases, it’s the same. Each row is a record, or a unit of data. A record (row) can have several columns or fields. Each field is like an attribute of that record.
- Queries: A query is a question posed to the database, to retrieve a specific set of records, based on conditions supplied in the query.
- Views: These are virtual tables, or (a set of) stored queries.
At a physical level, the data is stored in data files specific to the DBMS. Examples of modern-day RDBMSs that are widely used include Oracle, MySQL, etc. Oracle is the largest commercially available RDBMS, and MySQL (earlier acquired by Sun, and subsequently by Oracle) is a free and open source RDBMS that is very well-known.
Some other (important) concepts
A database is accessed, read from and written to, using an interface that allows applications to store and retrieve data. This interface is called an interface driver or database driver. There is usually more than one driver for an RDBMS. ODBC (Open DataBase Connectivity) is a popular database driver. So, if you write a program that needs to store and read data, you would use the API (Application Programming Interface — a collection of functions) provided by ODBC to do so. Similarly, there is JDBC (Java DataBase Connectivity) and others.
Let’s suppose you are making a payment online, and the power goes out at the least opportune moment, say, just when you hit the Pay button (as we all know, Murphy’s Law never fails us; it’s like gravity, I guess). Did the money get deducted from your bank account? If it did, did the merchant receive it? If so, does the system know that it happened successfully?
There are so many steps involved after you push that Pay button on Amazon, eBay or any e-commerce site; like the various messages exchanged between the participating banks, the merchant and the buyer (you). Once the power comes back, you can see what happened: the money has gone from your account, but the order status is Unpaid. The seller says he didn’t receive the money.
This is where the concept of a transaction comes in. If one step in the multitude of steps fails, then the action won’t happen. It’s an all-or-nothing scenario. Either your purchase will be complete, with all steps successful, or nothing happens. There is no in-between state. So, in our example, the bank will roll back the changes, and the money will reappear (depending, of course, on the bank’s time policies).
It’s about ACID
ACID stands for atomicity, consistency, isolation and durability. We have just seen what transactions are all about. The concept of ACID is nothing but a set of properties that define a successful execution of a transaction.
The all-or-nothing attribute of the transaction is called atomicity. The data must be valid for the transaction to be successful; this is the consistency of the transaction. In the above example, it’s a guaranteed possibility that there will be many buyers making purchases on the site. The RDBMS must take care that each transaction is isolated from the rest, to prevent data with errors. This is the isolation of the transaction.
And finally, we have durability, which is an assurance that the RDBMS makes that when the transaction completes successfully, it stays that way—irrespective of any form of failure that may occur once the transaction completes. It’s also a way to make sure that the transaction is rolled back if it has failed intermediately.
Getting in touch with the data: SQL
The theory part is fundamentally in place now. But how does one actually communicate with an RDBMS? This is done by what’s called SQL (Structured Query Language; pronounced “sequel”). SQL is pretty easy to understand, since it’s like broken English. I’ll give you an example: There is a table that stores the names, addresses, cities, countries and phone numbers of college students. I want the RDBMS to show me all students who are from, say, Mumbai. This would be the SQL statement for that query:
SELECT * FROM STUDENT_MASTER WHERE city='Mumbai';
Here, the asterisk (
*) is synonymous to all — it’s a wild-card character. The query will return all rows where the city is Mumbai. This SQL statement can be passed on to the RDBMS via an API function call in code, or through an application like phpMyAdmin. Upon receiving the query, the system returns the records as per the query. Suppose we want only names of the students from Mumbai, and not whole records (we don’t want their addresses, phone numbers or countries of origin), the query would then become:
SELECT name FROM STUDENT_MASTER WHERE city='Mumbai';
Note that the word Mumbai is in quotes because it is text. Numbers won’t have quotes in a query. And the statement must end in a semi-colon.
For the next part of this article, I will be using MySQL to create the database. Here’s a quick way to go about installing and using it:
- Windows users: Download MySQL from its website (it’s called the MySQL Community Server). Just run the installer, and you should be okay.
- Linux users: Use the package manager and install
mysql; alternatively, you can build it from source, if you wish to.
You can start MySQL through the command prompt (Windows) or terminal (Linux), by running
mysql -u root -p. It will ask you for a password; keep it blank since the default is blank. You can change it later.
Types of SQL statements
SQL statements can be classified into the following four categories:
- DDL (Data Definition Language): This class of statement is used to create/destroy or define/change database components. Examples:
- DML (Data Manipulation Language): As the name suggests, these statements manipulate the data itself, and the views related to it. Examples:
- DCL (Data Control Language): These statements control access to the data. There are only two of them:
- TCL (Transaction Control Language): Transactions are handled using these statements. Examples:
Designing a database
Here’s where we design a database for some simple data. Let’s say you want to make a database to store the names, addresses, phone numbers and birthdays of your friends and relatives. The first step is to see what kind of data you want to store:
- Phone Numbers
Let’s begin by creating the database in MySQL:
mysql> CREATE DATABASE mycontacts; Query OK, 1 row affected (0.00 sec) mysql> USE mycontacts; Database changed
USE command tells MySQL which database you want to be working on in this session. You can list the available databases using the following command:
mysql> show databases; +--------------------+ | Database | +--------------------+ | information_schema | | mycontacts | | mysql | | phpmyadmin | +--------------------+ 4 rows in set (0.05 sec)
Now let’s go table by table, and see what columns to put into each. These are the columns I put in:
- Phone Numbers:
So, we can now make four tables:
BIRTHDAY_MASTER. The MySQL command for making
NAMES_MASTER would be:
mysql> CREATE TABLE NAMES_MASTER ( contact_id INT, name VARCHAR(50) ); Query OK, 0 rows affected (0.14 sec)
Other tables can be made similarly.
INT are data-types.
INT stands for integer.
VARCHAR is a variable character. It’s a dynamically changing field, whose size depends on the entered data. The
VARCHAR(50) means the field can have a maximum size of fifty characters. Other supported data-types are listed in the MySQL documentation.
Now let’s explore the concept of primary and secondary keys. If you notice, the field
contact_id is common to all tables. This is to relate the tables by that common field. Here, the
NAMES_MASTER is the primary key, and the rest are secondary keys. In other words,
contact_id is the reference for all the data in each table (as the secondary key), where it finally relates to the name of the contact (as primary key).
Once a table is made, you can populate it using a tool called phpMyAdmin. Doing justice to using phpMyAdmin would require another article; hence, I won’t cover this aspect here. You can also populate using the
INSERT into table_name (column1, column2...) values (value1, value2...);
At the end of this exercise, you will have a database ready to use. This, frankly, is just the starting point in designing a database. I hope it proved helpful to you in understanding what goes on beneath the surface in cyberspace, at least when it comes to managing data.