Design Level Optimization Techniques

Replace algorithms
Most of the time, programs run slowly because of the algorithm chosen to implement the functionality. There are two ways to measure the performance of an algorithm: theoretically, by analyzing standard algorithms (sorting, searching, etc.), and practically, by measuring the time taken by the custom algorithm.
The most common metric for expressing time complexity is Big O notation. It gives a theoretical estimate of the number of steps needed by an algorithm for an input of length 'n'.
For finding an item in an unsorted list, the time complexity is O(n): it takes at most 'n' steps (traversals) to search for an element in a list that contains 'n' elements. For example, if searching through a list of 1,000 elements takes one second, searching through a list of 2,000 elements will take two seconds.
Following are some replacements where a faster algorithm takes the place of its slower counterpart:

  • Replace linear list traversal O(n) with binary search O(log n)
  • Replace linked list traversal O(n) with hashing O(1)
  • Replace bubble or insertion sort O(n^2) with quicksort O(n log n)

More complex algorithms and data structures perform well when there are many elements, whereas simple algorithms are more suitable for small amounts of data. In such cases, we can implement both algorithms in the system and pick the appropriate one based on the number of elements, as in the sketch below.
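As a rough sketch of this approach (the cut-off of 32 elements is an assumed value that should be tuned by measurement), the hybrid sort below uses insertion sort for small arrays and the standard library qsort() for larger ones:

#include <stdlib.h>

#define SMALL_ARRAY_THRESHOLD 32  // assumed cut-off; tune by measurement

static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

static void insertion_sort(int *a, size_t n)
{
    for (size_t i = 1; i < n; i++) {
        int key = a[i];
        size_t j = i;
        while (j > 0 && a[j - 1] > key) {
            a[j] = a[j - 1];
            j--;
        }
        a[j] = key;
    }
}

// Pick the algorithm based on the number of elements
void hybrid_sort(int *a, size_t n)
{
    if (n <= SMALL_ARRAY_THRESHOLD)
        insertion_sort(a, n);             // simple algorithm wins for small inputs
    else
        qsort(a, n, sizeof(int), cmp_int); // O(n log n) for large inputs
}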

Parallel/Distributed computing
Parallel computing divides a large task into smaller ones, which are then executed in parallel by multiple threads or processes on multiple processors. If the subdivided tasks are interdependent, the task sequence needs to be synchronized to yield a correct result, as in the sketch below.
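The sketch below illustrates this with POSIX threads: a large array sum is divided into equal chunks that run in parallel, and the partial results are combined after the threads are joined (the array contents, size, and thread count are illustrative assumptions):

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4        // assumed thread count
#define N 1000000         // assumed problem size

static int data[N];

struct chunk { size_t lo, hi; long long sum; };

static void *partial_sum(void *arg)
{
    struct chunk *c = arg;
    c->sum = 0;
    for (size_t i = c->lo; i < c->hi; i++)
        c->sum += data[i];
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    struct chunk chunks[NTHREADS];
    long long total = 0;

    for (int i = 0; i < N; i++)
        data[i] = 1;

    // Divide the large task into smaller independent sub-tasks
    for (int t = 0; t < NTHREADS; t++) {
        chunks[t].lo = (size_t)t * N / NTHREADS;
        chunks[t].hi = (size_t)(t + 1) * N / NTHREADS;
        pthread_create(&tid[t], NULL, partial_sum, &chunks[t]);
    }

    // Synchronize: combine the partial results after all threads finish
    for (int t = 0; t < NTHREADS; t++) {
        pthread_join(tid[t], NULL);
        total += chunks[t].sum;
    }
    printf("total = %lld\n", total);
    return 0;
}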
Distributed computing is similar to parallel computing, except that instead of multiple processors, many computers are used to solve the problem. The inherent advantages of distributed computing are scalability (it can handle more requests by adding more machines) and high availability (redundant machines take over the computation in case of failure).

Use Efficient data structures
The data structure should be chosen based on the data access requirements. For sequential access a list is better, and for random access based on an index an array is better. If we store customer information in memory as a list, then when a customer record needs to be looked up by customer ID, we have to traverse the list, comparing each element with the required ID till we find the element. This list shall be replaced with a hash table, which avoids the traversal so that the required information can be found faster.
If we need to get a range of customer IDs, say from 100 to 200, then with a list we have to traverse the whole list, checking whether each element's value is within the specified range. This list shall be replaced with a binary tree so that range retrieval is faster. The table below summarizes which data structure should be used for each data access pattern, followed by a sketch of the hash table lookup.

Array      – Random access by index over a fixed number of elements
List       – Sequential access over an unbounded number of elements, with more insertions than lookups or deletions
Hash Table – Random access by key over an unbounded number of elements
Tree       – Find the largest element, browse a range in order, etc., by key
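
As an illustration of the hash table row above, here is a minimal sketch of looking up customer records by ID through a small chained hash table (the bucket count and record layout are assumptions for the example):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 257  // assumed table size; a prime reduces clustering

struct customer {
    int id;
    char name[32];
    struct customer *next;  // chain for IDs that hash to the same bucket
};

static struct customer *buckets[NBUCKETS];

static void put(int id, const char *name)
{
    struct customer *c = malloc(sizeof *c);
    c->id = id;
    strncpy(c->name, name, sizeof c->name - 1);
    c->name[sizeof c->name - 1] = '\0';
    c->next = buckets[id % NBUCKETS];
    buckets[id % NBUCKETS] = c;
}

// O(1) expected lookup instead of traversing a whole list
static struct customer *find(int id)
{
    for (struct customer *c = buckets[id % NBUCKETS]; c; c = c->next)
        if (c->id == id)
            return c;
    return NULL;
}

int main(void)
{
    put(100, "Alice");
    put(357, "Bob");  // 357 % 257 == 100: collides with Alice's bucket
    struct customer *c = find(357);
    if (c)
        printf("%d -> %s\n", c->id, c->name);
    return 0;
}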

Reduce Contention
When multi-threaded programs share data, access to that data is synchronized through either a mutex or a semaphore. When multiple threads try to grab the lock simultaneously to access the shared data, all but one are blocked; only one thread at a time is allowed to take the lock and change the data structure.
Once that thread completes, the next thread queued for the mutex is granted the lock. If this contention increases in your program, it means that even though you have multiple threads, they are all waiting for the lock and perform their operations on the shared resource sequentially, one after another. Below are some techniques to reduce lock contention:

  • Keep the critical section as small as possible (the critical section is the code between the mutex acquire and the mutex release). Keep only the code that changes the shared data inside the critical section and move all other operations outside it.
  • When many threads read the shared data and rarely change its value, use read/write locks. These allow multiple threads to read simultaneously, while only one thread at a time is allowed to write exclusively (see the sketch after this list).
  • Reduce the granularity of the lock. Instead of locking the whole list, lock only the element in the list being changed. This way, other threads are allowed to access the other elements of the list.
  • If you are using semaphores for synchronization, you can detect contention by fetching statistics with the semctl() function. These include the time of the last semaphore operation, the PID that performed it, the current value of the semaphore, the number of processes waiting on the semaphore, etc.
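
Below is a minimal sketch of the read/write lock technique using the POSIX pthread_rwlock API; the shared integer stands in for any read-mostly shared data:

#include <pthread.h>

// Shared value: read often, written rarely
static int shared_value;
static pthread_rwlock_t lock = PTHREAD_RWLOCK_INITIALIZER;

int read_value(void)
{
    int v;
    pthread_rwlock_rdlock(&lock);   // many readers may hold this at once
    v = shared_value;               // keep the critical section small
    pthread_rwlock_unlock(&lock);
    return v;
}

void write_value(int v)
{
    pthread_rwlock_wrlock(&lock);   // writers get exclusive access
    shared_value = v;
    pthread_rwlock_unlock(&lock);
}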

Partial Evaluation
An application may use a generic function in a very specific way and pay a performance penalty for functionality it does not need. If this generic function is called many times from our program, the overall performance of the program will degrade. This shall be overcome by replacing the generic function with a specialized function that suits only our need, so that it runs faster. This technique is called partial evaluation.

malloc() and free() are generic allocators, designed to allow simultaneous allocation from multiple threads. This is implemented internally with the help of either a mutex or a semaphore, which incurs a computational cost.
If you are allocating many items from a single thread, it is better to write your own custom allocator to improve performance. malloc() is also designed to allocate arbitrary sizes of memory; if your program allocates many fixed-size items, a custom allocator for fixed-size items will drastically improve performance, as in the sketch below.
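
Below is a minimal sketch of such a custom allocator: a single-threaded, fixed-size pool with a free list (the item size and pool capacity are assumed values). Because it takes no lock, it must only be used from one thread:

#include <stddef.h>

#define ITEM_SIZE  64    // assumed fixed object size
#define POOL_ITEMS 1024  // assumed pool capacity

union slot {
    union slot *next;            // free-list link while the slot is free
    char payload[ITEM_SIZE];
};

static union slot pool[POOL_ITEMS];
static union slot *free_list;
static size_t used;

// O(1) allocation: pop from the free list, or carve from the pool
void *fixed_alloc(void)
{
    if (free_list) {
        union slot *s = free_list;
        free_list = s->next;
        return s;
    }
    if (used < POOL_ITEMS)
        return &pool[used++];
    return NULL;  // pool exhausted
}

// O(1) deallocation: push back onto the free list; no locking needed
void fixed_free(void *p)
{
    union slot *s = p;
    s->next = free_list;
    free_list = s;
}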

Cache frequently accessed data which does not change
A cache is a duplicate of original values, stored elsewhere or computed earlier, kept close to where the computation happens because the original data is expensive to fetch. Once the data is stored in the cache, subsequent fetches access the cached copy rather than re-fetching the original data. For example, data stored on disk or in a database shall be cached if it is accessed frequently, to improve performance. If that frequently accessed data can be modified by external systems, then the caching mechanism should deal with this to avoid stale reads from the cache.
When configuration parameters are stored in a file and are used many times in many places in a program, then instead of opening the file and reading the appropriate name/value pair each time, we shall read the file once, store the data in memory, and serve subsequent configuration reads from memory without involving any disk I/O, as in the sketch below.
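
Below is a minimal sketch of such a configuration cache; the "name=value" file format and the file name app.conf are assumptions for the example:

#include <stdio.h>
#include <string.h>

#define MAX_PARAMS 128

static struct { char name[64]; char value[64]; } params[MAX_PARAMS];
static int nparams = -1;   // -1: configuration file not read yet

static void load_config(const char *path)
{
    char line[160];
    FILE *fp = fopen(path, "r");
    nparams = 0;
    if (!fp)
        return;
    // One "name=value" pair per line (assumed file format)
    while (nparams < MAX_PARAMS && fgets(line, sizeof line, fp)) {
        char *eq = strchr(line, '=');
        if (!eq)
            continue;
        *eq = '\0';
        char *val = eq + 1;
        val[strcspn(val, "\n")] = '\0';
        snprintf(params[nparams].name, sizeof params[nparams].name, "%s", line);
        snprintf(params[nparams].value, sizeof params[nparams].value, "%s", val);
        nparams++;
    }
    fclose(fp);
}

// First call reads the file; later calls are pure memory lookups
const char *get_config(const char *name)
{
    if (nparams < 0)
        load_config("app.conf");   // assumed file name
    for (int i = 0; i < nparams; i++)
        if (strcmp(params[i].name, name) == 0)
            return params[i].value;
    return NULL;
}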

Perform time-consuming operations asynchronously
In a function, if some operation takes a lot of time (for example, writing a log message to disk), it can be done asynchronously by using a dedicated thread or process (a log writer process) to perform that job. The function returns immediately after it instructs the log writer process to initiate the file write. One common issue with this approach is that the last few messages will be lost in case of a crash.
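
Below is a minimal sketch of this technique using a dedicated POSIX logger thread and a bounded in-memory queue; callers enqueue a message and return immediately, while the logger thread performs the slow disk write (the queue size and message length are assumed values):

#include <pthread.h>
#include <stdio.h>
#include <string.h>

#define QSIZE  64    // assumed queue capacity
#define MSGLEN 128   // assumed maximum message length

static char queue[QSIZE][MSGLEN];
static int head, tail, count;
static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t nonempty = PTHREAD_COND_INITIALIZER;

// Called by worker threads: enqueue the message and return immediately
void log_async(const char *msg)
{
    pthread_mutex_lock(&qlock);
    if (count < QSIZE) {  // drop the message if the queue is full
        snprintf(queue[tail], MSGLEN, "%s", msg);
        tail = (tail + 1) % QSIZE;
        count++;
        pthread_cond_signal(&nonempty);
    }
    pthread_mutex_unlock(&qlock);
}

// Dedicated log writer thread: performs the slow disk write
// Start with: pthread_create(&t, NULL, log_writer, fp);
void *log_writer(void *arg)
{
    FILE *fp = arg;
    char msg[MSGLEN];
    for (;;) {
        pthread_mutex_lock(&qlock);
        while (count == 0)
            pthread_cond_wait(&nonempty, &qlock);
        memcpy(msg, queue[head], MSGLEN);
        head = (head + 1) % QSIZE;
        count--;
        pthread_mutex_unlock(&qlock);
        fprintf(fp, "%s\n", msg);  // slow I/O happens outside the lock
        fflush(fp);
    }
    return NULL;
}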

Use macros to turn on/off debug messages
Prefer macros to runtime condition checks for enabling debugging or logging. Debug messages are mostly used during the development of the program and are usually disabled in production systems. Macros do not incur any runtime overhead, which makes them a good candidate for implementing debug messages.

#include <stdio.h>

int debug = 0;

int function()
{
    //do something here
    if (debug)  // runtime check executes on every call
        printf("This is a debug message\n");
    //do something here
    return 0;
}
shall be replaced with
#include <stdio.h>

#define DEBUG

int function()
{
    //do something here
#ifdef DEBUG    // resolved at compile time; no runtime cost when disabled
    printf("This is a debug message\n");
#endif
    //do something here
    return 0;
}