Alignment
In the ancient world as in the modern one, some events are seen as important, and as such a special significant is associated with them. Both science and psydo-science alike have established the importance of correct alignment.
Myan Calendar
Stonehenge
Nazca Lines
Egyptian Pyramids
Egyptian Astrology
Computer science is no exception. In modern processors memory alignment is important for performance reasons and for correctness. The exact implications of alignment vary with processor architecture, but in general accessing data in memory which is aligned to a N byte boundary where N is the machine word size is faster than accessing that which is not aligned. In older processors which were created when memory was limited alignment was not enforced becuase the performance gain was not as important as saving space. In 64 bit processors alignment is given priority over space saving to maximuse performance.
Becuase alignment is architecture dependent many implementations of system routines such as those found in the standard C runtime library are implemented with deliberately varying behaviour. On Intel x86 processors alignment is not enforced, nor is it critical in all cases. Despite this fact the standard C routines still check for alignment so as to avoid issues with page boundaries and portability.
It is often claimed that accessing unaligned memory even on an x86 architecture incurs a massive performance hit. Recent investigations that i have performed have led me to believe that this is a gross generalisation. In fact i have found that in most cases reading 4 bytes of unaligned data in a single operation is still faster than reading 4 bytes individually.
This contradicts many of the claims made about alignment performance, but can be explained on x86 architectures becuase the processor actually does the alignment correction in hardware. Therefore accessing 2 blocks of memory as happens during an unaligned 4 byte access is still faster than accesing 4 bytes individually.
The following code will perfom up to 200% faster than strcmp() in the general case due to the fact that strcmp() performes a single byte comparison whenever either parameter falls on an unaligned address.
It is also highly optimised in the way it checks for null characters in either string. Unfortunately i can't confirm if this code is safe in all circumstances though it does compile and run succesfully on both windows and linux. There is potential danger particularly when either string falls on a page boundary in memory, an occurance which is statistically unlikely but non-the less possible.
A similar algoirthm can however be employed as a replacement for memcmp() where the length is known without the uncertainty. Such code performs much faster than memcmp() in the general case. Following from this principle there is also an algorithm for rapid conversion from characters to numbers, i.e. atoi(), which can consume 4 bytes at a time. This is somewhat more complicated but again involved masking in order to caplitalise on the values stored in the cpu registers and cache.
Where does this leave us ?
Given that even with these optimisations aligned access still results in better performance by a factor of 2 it makes sense to ensure that as much memory is aligned as possible, particularly if you are using the standard C routines.
Considder the implications of the following code.
Myan Calendar
Stonehenge
Nazca Lines
Egyptian Pyramids
Egyptian Astrology
Computer science is no exception. In modern processors memory alignment is important for performance reasons and for correctness. The exact implications of alignment vary with processor architecture, but in general accessing data in memory which is aligned to a N byte boundary where N is the machine word size is faster than accessing that which is not aligned. In older processors which were created when memory was limited alignment was not enforced becuase the performance gain was not as important as saving space. In 64 bit processors alignment is given priority over space saving to maximuse performance.
Becuase alignment is architecture dependent many implementations of system routines such as those found in the standard C runtime library are implemented with deliberately varying behaviour. On Intel x86 processors alignment is not enforced, nor is it critical in all cases. Despite this fact the standard C routines still check for alignment so as to avoid issues with page boundaries and portability.
It is often claimed that accessing unaligned memory even on an x86 architecture incurs a massive performance hit. Recent investigations that i have performed have led me to believe that this is a gross generalisation. In fact i have found that in most cases reading 4 bytes of unaligned data in a single operation is still faster than reading 4 bytes individually.
This contradicts many of the claims made about alignment performance, but can be explained on x86 architectures becuase the processor actually does the alignment correction in hardware. Therefore accessing 2 blocks of memory as happens during an unaligned 4 byte access is still faster than accesing 4 bytes individually.
The following code will perfom up to 200% faster than strcmp() in the general case due to the fact that strcmp() performes a single byte comparison whenever either parameter falls on an unaligned address.
int Compare(char * left, char * right)
{
if (left == right) return 0;
while (*(long*)left == *(long*)right)
{
long mask = *(long*)left;
if ( (mask&0xFF000000)==0 ||
(mask&0x00FF0000)==0 ||
(mask&0x0000FF00)==0 ||
(mask&0x000000FF)==0 ) return 0;
left+=4;
right+=4;
}
while(*left == *right)
{
if (*left==0)return 0;
++left;
++right;
}
return *left-*right;
}
It is also highly optimised in the way it checks for null characters in either string. Unfortunately i can't confirm if this code is safe in all circumstances though it does compile and run succesfully on both windows and linux. There is potential danger particularly when either string falls on a page boundary in memory, an occurance which is statistically unlikely but non-the less possible.
A similar algoirthm can however be employed as a replacement for memcmp() where the length is known without the uncertainty. Such code performs much faster than memcmp() in the general case. Following from this principle there is also an algorithm for rapid conversion from characters to numbers, i.e. atoi(), which can consume 4 bytes at a time. This is somewhat more complicated but again involved masking in order to caplitalise on the values stored in the cpu registers and cache.
Where does this leave us ?
Given that even with these optimisations aligned access still results in better performance by a factor of 2 it makes sense to ensure that as much memory is aligned as possible, particularly if you are using the standard C routines.
Considder the implications of the following code.
inline void * operator new (size_t size)
{
char * address = (char *) malloc(size+4);
return (void *) (address + (((unsigned long)address)&0x00000003));
}
<< Home