CodeSport (December 2011)

Compiler optimisation

In this month’s column, we will continue our discussion of compiler optimisations, and focus on compiler analyses and optimisations that require inter-procedural analysis and inter-procedural code transformation.

In last month’s column, we posed a code snippet (shown below), and asked our readers how it could be rewritten to reduce the data cache misses associated with that code region.

int **two_dim_array = (int **) calloc(N, sizeof(int *));
for (int j = 0; j < N; j++)
    two_dim_array[j] = (int *) calloc(N, sizeof(int));
... // read in input values for the elements of two_dim_array[i][j]
for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
        result += two_dim_array[i][j];

Thanks to all those who sent in their solutions to this question. Congrats to our reader Prashanth Shah for getting the correct solution.

The problem with the above code snippet is that adjacent rows of the array are not contiguous, even though the elements within a single row are. For instance, if N is 100, the elements two_dim_array[0][98] and two_dim_array[0][99] are contiguous, but they are not contiguous with two_dim_array[1][0], the first element of the next row. Hence, every new iteration of the outer for loop (the i loop in the code above) can incur a compulsory cache miss.

Therefore, in order to avoid this cache miss, and to enable optimisations like data prefetching, which reads ahead to elements that are about to be accessed, we need to change the allocation pattern of the original array. Instead of allocating each row independently, the compiler can perform an optimisation known as the “malloc combine transformation”, wherein it combines the two calloc calls in the code snippet above and allocates the two-dimensional array two_dim_array[N][N] as one contiguous block, as shown below:

int **two_dim_array = (int **) calloc(N, sizeof(int *));
int *temp = (int *) calloc(N * N, sizeof(int));
for (int j = 0; j < N; j++)
    two_dim_array[j] = &temp[j * N];
... // read in input values for the elements of two_dim_array[i][j]
for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
        result += two_dim_array[i][j];

Inter-procedural analysis and optimisation

Now, let us turn our attention to compiler analyses and optimisations that require the compiler to look at code occurring in more than one function, and, in many cases, in all the functions present in the application.

Typically, the compiler is invoked on a .c file that contains code for multiple functions, and an application consists of multiple such .c files. Inter-procedural analysis and optimisation can happen at both the intra-file and the inter-file level. Can you think of a very common compiler optimisation that requires the compiler to examine more than one function?

Well, the answer is the optimisation known as inlining, where the compiler replaces a call site in a function with the complete code of the called function. For instance, if the function foo contains a call to the function bar, the compiler examines the code of both foo and bar and, in certain cases, decides to inline the function bar at the call site in foo.

Inlining is a complex transformation, since indiscriminate inlining can cause considerable code bloat. Can you think of the reasons why the compiler performs inlining? It has two benefits. First, it reduces the function call overheads. Performing a call involves certain overheads, such as passing the parameters, storing the return address and saving registers. Inlining avoids these overheads. Second, inlining increases the code that is visible to the compiler for performing optimisations. This is what makes inlining a powerful tool for improving the performance of a given application.

Remember that the compiler performs most of the classical optimisations we have discussed in the previous column, at a function level. Therefore, inlining allows the compiler to examine a larger region of code for potential optimisation opportunities.

For example, consider the following piece of code:

int A[N];
void foo(void)
{
      ...
      for (int i = 0; i < N; i++)
          ... = A[i];
      bar(N);
      ...
}
void bar(int K)
{
      for (int j = 0; j < K; j++)
          ... = A[j];
      ...
}

In the above code snippet, the for loop in function foo has the same iteration space as the for loop in function bar, and bar is invoked right after the for loop in foo.

Both these for loops iterate over the array A, so it would be beneficial if the compiler could fuse them by performing the loop fusion transformation. However, the two loops occur in two different functions, and almost all classical scalar and loop transformations happen within a single function unit. Hence, the compiler cannot apply loop fusion to these two for loops as they stand.

However, once the compiler has performed inter-procedural analysis and transformation, it can inline the function bar into the function foo. The compiler performs loop nest optimisations downstream of the inlining phase, and as part of that phase on function foo, it can now apply the standard loop fusion transformation to the two for loops. Thus, inlining enables downstream optimisations, thereby considerably aiding performance.

While inlining is a well-known inter-procedural transformation, there are a number of lesser-known but quite powerful inter-procedural transformations. One of them is known as inter-procedural constant propagation. Most of us would be familiar with local constant propagation. Consider the following piece of code:

int foo(void)
{
     ...  /* do something */
     int i = 20;
     int result = i * 10;
     return result;
}

In the above code snippet, the compiler can perform classical constant propagation and return the result as 200. However, consider the following code snippet:

int foo(int i)
{
     ...  /* do something which does not modify i */
     int result = i * 10;
     return result;
}
int bar(void)
{
     int i = 20;
     int res = foo(i);
     return res;
}

In the above code snippet, it is not possible to perform classical constant propagation within the function foo, since the variable i is a formal parameter of foo, and without inter-procedural analysis, the compiler cannot determine whether that formal parameter is a constant.

However, when inter-procedural constant propagation is enabled as part of the inter-procedural analysis (IPA) phase of the compiler, the compiler creates a summary for each call site. It recognises that every call site of foo in the above code snippet passes the argument value 20. Hence, it can propagate this constant value into the called function foo, and then perform classical constant propagation within foo.

Some of the other popular inter-procedural transformations are inter-procedural range propagation, procedure boundary elimination, function cloning, partial inlining and structure layout transformations. We will discuss a few of them in next month’s column.

My ‘must-read book’ for this month

This month’s recommendation comes from one of our readers, K Sunil, who suggests we read a classic to learn more about compiler optimisations. The book is The Compiler Design Handbook, edited by Prof Y N Srikant and Prof Priti Shankar. It is a compilation of articles on different compiler optimisations, covering both theoretical and practical aspects. Thank you, Sunil, for the recommendation.

While this is a must-read for anyone wanting to become a compiler engineer, it is probably not the best choice for those taking their first course in compilers. For a beginner’s introduction to compiler optimisations, I would suggest readers consult the classic ‘dragon book’, namely, Compilers: Principles, Techniques, and Tools by Aho, Sethi and Ullman. But if you are interested in delving into many of the state-of-the-art compiler optimisations, then Sunil’s recommendation of The Compiler Design Handbook is a must-have on your bookshelf.

The chapters on memory hierarchy optimisations and register allocation make for interesting reading. The book contains a number of references to many of the classical compiler papers.

If you have a favourite programming book/article that you think is a must-read for every programmer, please do send me a note with the book’s name, and a short write-up on why you think it is useful, so I can mention it in this column. This would help many readers who want to improve their coding skills.

If you have any favourite programming puzzles that you would like to discuss on this forum, please send them to me, along with your solutions and feedback, at sandyasm_AT_yahoo_DOT_com. Till we meet again next month, happy programming, and here’s wishing you the very best!

