Better Queries with MySQL, Part 3: The MyISAM Storage Engine

Better queries with MySQL

MyISAM is MySQL’s default storage engine, and is probably most commonly used by new adopters of MySQL. The objective of this article is to discuss how the MyISAM storage engine works, to enable developers to write better SQL queries.

Before we jump into the complex detailing of MyISAM, it’s time to learn some of its important characteristics:

  • MyISAM stores each table in two files — one containing data, and the other, indexes. The extensions of these two files’ names are .MYD and .MYI respectively. Using the DATA DIRECTORY and INDEX DIRECTORY options to the CREATE TABLE query, you can put the above mentioned two files in different directories. These two directories, put on different disks, can offer significant performance improvement. This technique is often used in case of very heavily loaded databases, and is supported by many databases.
  • The data stored in these files is platform-independent — or, more technically put, is not dependent on the endianness of a machine. In other words, you can simply copy (literally) data and index files from an Intel x86/x64 machine to PowerPC or Sun SPARC.
  • MyISAM doesn’t support transactions. Only table locks are available. Row-level locks, which are a must for MVCC capabilities (transactions, to put it in simple words) are not available in MyISAM.
  • All the readers (with SELECT family of queries) have to obtain read locks. Multiple users can acquire read locks at the same time; hence, the common term shared locks is frequently used.
  • All the writers have to obtain exclusive locks. In simple words, two users cannot perform writes (INSERT, UPDATE, DELETE family of operations) at the same time. This is very important to remember. A write operation must finish very quickly. Otherwise, in case of a database with a heavy load, there will be a long queue of connections waiting for exclusive (write) lock. In general, write operations are not time-consuming; however, the indexing strategy plays a role here. More on this later.
  • One great feature offered by MyISAM is called concurrent inserts. These allow the insertion of new records while SELECT-family queries are running. However, as insertion requires an exclusive lock, only one user can perform concurrent inserts.
  • MyISAM supports full-text index.
  • MyISAM supports compressed tables. When a table is compressed using the myisampack utility, individual rows are compressed and not the entire table as a single entity. Please remember that compressed tables are read-only. When a SELECT query is issued on a compressed table, rows are decompressed based on the requirement. A query causing a full-table scan can be a disaster in such a case; however, a well-indexed table performs pretty well. Compressed tables are a great utility when distributing a database on space-constrained media like CDs.
  • MyISAM permits NULL values in indexed columns.
  • Each character (CHAR/VARCHAR) column can have a different character set. For instance, in one column you may store ASCII character values, and in the other you may store UTF-8 or UTF-16 values.

Understanding the B-tree

Anybody dealing with a respectable database-management system must have heard the term “B-tree”. Some systems and databases also use the notion of B+tree. What is it and how does it fit into the picture of database-management systems? Here is what Wikipedia says on the topic:

In computer science, a B-tree is a tree data structure that keeps data sorted and allows searches, sequential access, insertions, and deletions in logarithmic amortised time. The B-tree is a generalisation of a binary search tree, in that a node can have more than two children. Unlike self-balancing binary search trees, the B-tree is optimised for systems that read and write large blocks of data. It is commonly used in databases and filesystems.

Pictorial representation of a B-tree

Figure 1: Pictorial representation of a B-tree

Figure 1 shows a very simple B-tree. Notice the difference between conventional binary search trees and a B-tree. A B-tree is not “binary”, that is, each node having 2 child nodes. Essentially, a B-tree allows storing data in a manner so that searching records, inserting new ones, and deleting existing ones is very fast.

Now, let’s come back to the question about what the relationship between MyISAM and a B-tree is, and how you can write better queries by getting details on this subject.

The B-tree index

When you create an index in a MyISAM table, its type is B-tree. Figure 2 shows how MyISAM stores the index(es) in the .MYI file and data in the .MYDfile.

A B-tree index in MyISAM

Figure 2: A B-tree index in MyISAM

Don’t be bothered if this appears a bit daunting. The following is a description of how data and indexes are stored in case of B-tree indexes, in MyISAM tables:

  1. On top, we have the root node of the B-tree. This is not specific to MySQL or MyISAM. Every tree has to have a top-level/root node. Traversal starts from this node in most cases. In B-tree terminology, this is called a non-leaf node. Leaf nodes are those that have no child nodes.
  2. Below the root node, we have leaf nodes that are connected with the root node. The link between the root node and the immediate children is shown as “pointers”; however, it is not necessary that these be C/C++ style pointers.
  3. Figure 2 shows a B-tree with two levels; however, in a real case, the tree may be much more complex, with multiple levels. This representation is to give you a basic idea about how B-tree works in the case of MyISAM.
  4. In this figure, three leaf nodes are shown. Each leaf node holds the keys of a different range: 1 to 40, 41 to 80, and 81 to 100. Please note that the set of ranges is purely arbitrary, and meant purely for discussion purposes. This has no connection with how MyISAM breaks down the B-tree.
  5. Below the leaf nodes are the actual data pages — i.e., the table data. These are connected with leaf nodes on the basis of key values. This is how searches on a well-indexed table get results at lightning speed.
  6. Please note that the link between leaf nodes and data pages, which is marked as “Pointer to Data Pages” isn’t a C/C++ style pointer. Instead, leaf nodes hold enough information required to access the exact position of a row (or rows) in the .MYD file. In other words, data can be read/updated/deleted with the least required file I/O.
  7. To make this a little simpler, let’s take an example query: SELECT * FROM SOME_TABLE WHERE SOME_COLUMN = 15;
    • It is assumed that the column SOME_COLUMN is indexed.
    • From the root node, follow the left channel. This is based on simple maths, 15 < 40. This lands you at the left-most leaf node.
    • In the left node, you will find the key with the value 15.
    • Once you find the key, you’ll know if there is any data in the .MYD file with SOME_COLUMN = 15.
    • If so, retrieve the information required to reach that exact location in the .MYD file, to read the row in question.

After a close look at Figure 2, you can answer a question that is often a matter of debate — or rather, confusion. Does MyISAM support clustered indexes? The answer is, “NO!”

In case of a clustered index, data is stored in a sorted order, which is based on the keys/indexes. This isn’t how MyISAM works. Data isn’t stored in a sorted order. Instead, data is stored in the order in which it is inserted. It is the indexes that are sorted, and these are not stored with data — MyISAM stores data and indexes separately. However, indexes have enough information to point at the exact location of the data in the .MYD file.

Please note that in case a MyISAM table contains multiple indexes, all the indexes are stored in the same .MYI file.

Please bear in mind that even though B-tree provides a facility for very fast lookup of indexes (and thus the data), everything comes at a cost. You must have heard that indexes improve the speed of searches (SELECT family of queries); however, insertions and deletions are slowed down. This is because whenever data is inserted/deleted, MySQL has to alter the indexes in the .MYI file. In case there are no indexes, then there is almost no need to deal with the .MYI part of a table. In this case, one can experience the fastest insertions, however, at the cost of extremely bad performance of SELECT and UPDATE queries.

Practically, no application relies on this approach. Indexes must be well-thought-out to avoid the cost of B-tree-driven index management while doing insertions/deletions, and yet getting the best results in case of searches. Whenever dealing with indexes, keep index selectivity in mind.

Index selectivity

Though not talked of very often, index selectivity is a very critical issue, and when attended properly, can answer two of the most common questions: why isn’t MySQL using my index, and why isn’t MySQL performing well even when I have properly indexed my table.

Index selectivity, in the simplest words, describes how different the values of a column are. This is assuming that the index contains only one column. In case there are multiple columns in an index, selectivity describes how different the values of those columns are.

Selectivity is a number from 0 to 1 — or think of it as a percentage. A value of 1, or 100 per cent, means that each value in the column is unique. This happens with UNIQUE and PRIMARY keys, although non-unique fields may have a selectivity of 1. This depends on the nature of values in the column(s) in question.

SELECTIVITY = NUMBER OF DISTINCT RECORDS / TOTAL NUMBER OF RECORDS

This implies that UNIQUE and PRIMARY indexes always have a selectivity of 1. The higher the selectivity, the faster the searches are going to be. For example, consider the following table:

Column name Data type Indexed?
email_address VARCHAR (100) Yes, Primary
password VARCHAR (100) No
country_code INT Yes

We have two indexes in this table — email_address and country_code — and 10,000 records. Now let’s try to figure out the selectivity:

  • email_address: Being the primary key, all values in this column must be unique. Hence selectivity is 1, an extremely good index.
  • country_code: Although there are many countries in this world, I am assuming that you have users coming from 100 countries. Selectivity = 100/10000 = 0.01 or 1 per cent. This is too low to serve as a good index.

Lower selectivity is treated by MySQL as an expensive operation. Higher selectivity is treated as an inexpensive operation.

Please note that the correct/scientific way to calculate selectivity is to use a production-class database, and find the distinct rows using a COUNT (DISTINCT <COLUMN>)-like query. If you assume that the number of distinct values in a column do not always work well, avoid heuristics as much as possible.

MySQL has a cost-based optimiser. This means that MySQL calculates the costs of different ways of performing a query, and then chooses the cheapest one. In order to calculate the exact cost, the optimiser would actually have to run the query. If MySQL finds the selectivity of an index (using the above formula) for each query, this will, by all means, kill the machine. So an estimate is taken. This cost-based optimiser uses selectivity when it decides whether or not to use an index. MySQL may not use your index if the selectivity is too low! This means that even when you have an index, MySQL may perform a full-table scan!

An index like country_code, which we discussed earlier, will never be used by MySQL. Another major problem is that it will result in wasteful consumption of CPU time during insertions and deletions too. So, pay attention when you create indexes. Every INSERT operation will cause multiple B-tree operations to adjust the country code to keep the B-tree balanced; yet, there’s no use for this while searching the data.

References
  • Karthik India

    Nice post

All published articles are released under Creative Commons Attribution-NonCommercial 3.0 Unported License, unless otherwise noted.
Open Source For You is powered by WordPress, which gladly sits on top of a CentOS-based LEMP stack.

Creative Commons License.