Code Level Optimization Techniques

Some of the techniques described below ("arithmetic operators", "jump table to replace if…else if", "faster for loops") make the code harder to debug, less portable, and less maintainable. Attempt these techniques only when there is no other way to optimize.

Use pass by reference for user defined type

Passing a parameter by value results in the complete parameter being copied onto the stack. Replace pass by value with a const reference to avoid this copy. Returning a large object by value has the same cost; use an OUT parameter of the function instead.
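A minimal sketch of the idea (the `Customer` type and the function names here are illustrative, not from the original text):

```cpp
#include <cstddef>
#include <string>
#include <vector>

struct Customer {
    std::string name;
    std::vector<int> orders;
};

// Pass by value: the whole Customer (string, vector) is copied on every call.
std::size_t orderCountByValue(Customer c) { return c.orders.size(); }

// Pass by const reference: no copy, and the callee cannot modify the argument.
std::size_t orderCount(const Customer& c) { return c.orders.size(); }

// Returning a large object by value may also copy; an OUT parameter avoids it.
void buildGreeting(const Customer& c, std::string& out) {
    out = "Hello, " + c.name;
}
```

Note that modern C++ compilers usually apply return-value optimization, so returning by value is often free today; the OUT-parameter style mainly matters on older compilers or when reusing a buffer across repeated calls.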

Pre-compute quantities to speed up run-time calculations

Avoid computing constant values inside loops. Though most compilers perform this optimization, it is still better to handle it in our code rather than rely on the compiler.

for (int i=0; i <100; i++) {
    offset = offset + strlen(str);
}
should be replaced with

int size = strlen(str);
for (int i=0; i <100; i++) {
    offset = offset + size;
}

Avoid system calls in time-critical sections of the code

System calls are costly because they involve switching from user mode into kernel mode. When two processes need to communicate, shared memory is better than message queues or pipes, because once it is set up, shared memory does not incur any system calls to send and receive data.
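A minimal sketch of the shared-memory idea, using an anonymous `mmap()` region shared across `fork()` (a simplification; a real application would add synchronization such as a semaphore, and the value exchanged here is arbitrary):

```cpp
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

// Parent and child exchange a value through shared memory: after the
// region is set up, reads and writes are plain memory accesses -- no
// system call per message, unlike pipes or message queues.
int sharedMemoryDemo() {
    void* mem = mmap(nullptr, sizeof(int), PROT_READ | PROT_WRITE,
                     MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) return -1;
    int* shared = static_cast<int*>(mem);
    *shared = 0;
    pid_t pid = fork();
    if (pid == 0) {            // child: write with no system call at all
        *shared = 42;
        _exit(0);
    }
    waitpid(pid, nullptr, 0);  // parent waits, then reads the value
    int result = *shared;
    munmap(mem, sizeof(int));
    return result;
}
```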

Lazy / Minimal Evaluation 

Lazy or delayed evaluation is the technique of delaying a computation until the result of the computation is actually needed for that code path. 

For example, when expressions are combined with the logical AND or OR operator, the expression is evaluated from left to right and evaluation stops as soon as the result is known. We can save CPU cycles by placing the sub-expressions in the right order.

Consider the code example

if (strcmp(str, "SYSTEM") && flag == 1)
{
   //…
}

In the above code segment, the strcmp() call is comparatively expensive, and in cases where flag != 1 it still performs the string comparison, wasting CPU cycles. The efficient version is

if (flag == 1 && strcmp(str, "SYSTEM"))
{
   //…
}

The same applies to the logical OR operator when flag == 1. In general, for expressions using operator &&, make the condition that is most likely to be false the leftmost condition. For expressions using operator ||, make the condition that is most likely to be true the leftmost condition.

Frequently executed cases should come first in a switch or if…else if statement so that the number of comparisons is reduced for the common case.

The following switch statement is inefficient: the integer type, which is used most of the time, is placed in the last case, preceded by the string type, while the double type, seldom used in most applications, is placed first.

switch(dataType) {
  case typeDouble: { doDoubleAction(); break; }
  case typeDate: { doDateAction(); break; }
  case typeShort: { doShortAction(); break; }
  case typeString: { doStringAction(); break; }
  case typeTimeStamp: { doTimeStampAction(); break; }
  case typeInt: { doIntAction(); break; }
  default: { doDefaultAction(); break; }
}
This can be made efficient by placing the frequently used data types first, followed by the seldom-used ones.

switch(dataType) {
  case typeInt: {…; break; }
  case typeString: {…; break; }
  case typeShort: {…; break; }
  case typeDouble: {…; break; }
  case typeTimeStamp: {…; break; }
  case typeDate: {…; break; }
  default: {…; break; }
}
A much more elegant way is to implement a jump table using function pointers. Consider the following code

typedef void (*functs)();
functs JumpTable[] = { doIntAction, doStringAction, doShortAction /* etc. */ };

Place your function pointers in the same sequence as the DataType enum. The above JumpTable assumes that DataType is defined as follows

enum DataType {
    typeInt = 0,
    typeString,
    typeShort,
    /* other types */
};

To call the appropriate implementation, just use the following statement

JumpTable[dataType]();

Now the compare operations are replaced with an array-indexing operation, which is much faster. Moreover, the dispatch time is nearly the same for every data type, unlike the if…else if and switch() constructs.
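Putting the fragments above together, a complete sketch (the handler bodies are stand-ins; here they just record which action ran):

```cpp
#include <string>

std::string lastAction;  // records which handler ran, for illustration only

void doIntAction()    { lastAction = "int"; }
void doStringAction() { lastAction = "string"; }
void doShortAction()  { lastAction = "short"; }

enum DataType { typeInt = 0, typeString, typeShort };

typedef void (*functs)();
// Entries must follow the enum order exactly.
functs JumpTable[] = { doIntAction, doStringAction, doShortAction };

void dispatch(DataType dataType) {
    JumpTable[dataType]();  // one indexed load plus an indirect call
}
```

One caveat: dataType must be validated to stay within the table bounds, since an out-of-range index is undefined behavior.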

Minimize local variables

If the number of local variables in a function is small, the compiler will be able to fit them into registers, thereby avoiding access to memory (the stack). If no local variables need to be saved on the stack, the compiler need not set up and restore the frame pointer.

Reduce number of parameters

Function calls with a large number of parameters may be expensive due to the parameter pushes onto the stack on each call. This applies even when a struct is passed by value.

Declare local functions as static

If a function is used only within its file, declare it static; if it is small enough, the compiler may then inline it without having to maintain an external copy.

Avoid using global variables in performance critical code

Global variables are never allocated to registers. Global variables can be changed by assigning them indirectly using a pointer, or by a function call. Hence, the compiler cannot cache the value of a global variable in a register, resulting in extra loads and stores when globals are used. 

Pointer chains

Pointer chains are frequently used to access information in structures. For example, a common code sequence is:

typedef struct { int x, y, color; } Point;

typedef struct { Point *pos; int something; } Pixel;

void initPixel(Pixel *p)
{
    p->pos->x = 0;
    p->pos->y = 0;
    p->pos->color = 0;
}

However, this code must reload p->pos for each assignment. A better version would cache p->pos in a local variable:

void initPixel(Pixel *p)
{
    Point *pos = p->pos;
    pos->x = 0;
    pos->y = 0;
    pos->color = 0;
}

Another way is to embed the Point structure in the Pixel structure, avoiding the pointer completely. Some compilers perform this caching optimization by default.
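The embedded version looks like this (mirroring the earlier example; the `something` field is carried over unchanged):

```cpp
typedef struct { int x, y, color; } Point;

// Point is stored inline, so p->pos.x is a single offset from p --
// there is no second pointer to load, and locality improves because
// the Point fields sit in the same allocation as the Pixel.
typedef struct {
    Point pos;
    int something;
} Pixel;

void initPixel(Pixel *p) {
    p->pos.x = 0;
    p->pos.y = 0;
    p->pos.color = 0;
}
```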

Replace frequent read/write with mmap

If the application makes a lot of read()/write() calls, you should seriously consider converting it to use mmap(). A mapping avoids the data copy between kernel and user space, and avoids the kernel-mode switch for executing each read() or write() system call. You can call msync() (or fsync() on the file descriptor) once to flush all the data to the disk file after you have written your data into the mapped memory area.
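A hedged sketch of the pattern (POSIX; the function name is illustrative and error handling is trimmed to the essentials):

```cpp
#include <cstddef>
#include <cstring>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

// Map a file, modify it as plain memory, and flush once at the end,
// instead of issuing a write() system call for every chunk.
bool writeViaMmap(const char* path, const char* data, std::size_t len) {
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return false;
    if (ftruncate(fd, len) != 0) { close(fd); return false; }
    void* map = mmap(nullptr, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) { close(fd); return false; }
    std::memcpy(map, data, len);  // plain memory write, no write() syscall
    msync(map, len, MS_SYNC);     // flush the dirty pages once at the end
    munmap(map, len);
    close(fd);
    return true;
}
```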

Place elements together, which are accessed together

When you declare a structure, order the data members so that the most frequently accessed elements come first.

For example, if page is accessed many times, hasSpace a little less often, and freeNode rarely, then declare the structure as follows.

struct Node {
    int page;
    int hasSpace;
    int freeNode;
    /* other fields */
};

This improves performance because on an L1 or L2 cache miss, the hardware does not fetch only the required bytes; it brings one full cache line of data into the appropriate cache. When page is fetched, hasSpace and freeNode are cached along with it if they share its cache line. Cache lines are typically 64 bytes on modern processors, so adjacent small fields usually travel together.
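The layout can be checked with offsetof; assuming 4-byte ints, the three hot fields occupy the first 12 bytes and therefore land in one cache line:

```cpp
#include <cstddef>

struct Node {
    int page;      // hottest field first
    int hasSpace;  // accessed a little less often
    int freeNode;  // rarely touched
    /* other fields */
};

// With 4-byte int, page sits at offset 0, hasSpace at 4, freeNode at 8,
// so a miss on page pulls the other two into the cache as a side effect.
static_assert(offsetof(Node, page) == 0, "hot field should come first");
```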

Arithmetic operators

Replace multiplication and division by shift and add operations:

X * 2 can be replaced with X + X or X << 1

X * N can be replaced with X << I where N is 2^I

X / N can be replaced with X >> I where N is 2^I (safe only when X is unsigned or known to be non-negative)
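A short sketch; the division-by-shift form is restricted to unsigned values here because `>>` on a negative signed int rounds toward negative infinity, while `/` rounds toward zero:

```cpp
// 8 == 2^3, so x * 8 becomes a left shift by 3.
unsigned timesEight(unsigned x)   { return x << 3; }

// 4 == 2^2, so x / 4 becomes a right shift by 2 (unsigned only).
unsigned divideByFour(unsigned x) { return x >> 2; }

// x * 2 expressed as an add; valid for signed values too.
int timesTwo(int x)               { return x + x; }
```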

Faster for() loops 

for( i=0; i<10; i++){ … } 

If the code inside the loop does not care about the order of the loop counter, we can do this instead:

for( i=10; i--; ) { … }

This runs faster because i-- as the test condition is quicker to process: decrementing sets the processor's zero flag, so the loop effectively asks "Is i non-zero? If so, decrement it and continue." In the original loop, the processor must compare i against 10 (a subtraction) on every iteration: "Subtract i from 10. Is the result non-zero? If so, increment i and continue."
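A sketch of the count-down form; it is only valid when the loop body does not depend on iteration order (summation, as here, does not):

```cpp
// Counting down to zero lets the decrement itself serve as the loop
// test, avoiding a separate compare against the upper limit.
int sumCountdown(const int* a, int n) {
    int total = 0;
    for (int i = n; i--; )   // body runs with i = n-1 down to 0
        total += a[i];
    return total;
}
```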

Prefer int over char and short

If you have an integer value that can fit in a byte, you should still consider using an int to hold it. When you use a char in arithmetic, the compiler first promotes the value to int, performs the operation, and then converts the result back to char. Using int increases memory usage slightly but saves those conversion cycles.

Advise OS on how it should manage memory pages

Using the madvise() system call, we can tell the operating system how swapping and prefetching should be handled for a memory range. If you access data sequentially, the OS can prefetch pages and have them ready before the program references them. If the program accesses some memory region frequently, it can be locked into physical memory with the mlock() system call, avoiding page-outs to disk.
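A hedged sketch using posix_madvise(), the portable spelling of the call (the region is mapped here because the advice must apply to page-aligned memory):

```cpp
#include <cstddef>
#include <sys/mman.h>

// Map an anonymous region and hint sequential access so the kernel can
// read ahead; mlock() could similarly pin hot pages in physical memory.
bool adviseSequentialDemo(std::size_t len) {
    void* buf = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) return false;
    bool ok = (posix_madvise(buf, len, POSIX_MADV_SEQUENTIAL) == 0);
    munmap(buf, len);
    return ok;
}
```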
