A custom memcmp() implementation that beats the standard library
So, for a very long time the question of how memcmp(), memcpy() and memset() can operate so fast was on my mind, with no definitive answer. Now, after making a guess and implementing it, I finally have one. I did this on Windows, using the MSVC compiler. I do not know how the actual functions are implemented, but it must be very close to this:
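The original code listing did not survive here, so below is a minimal sketch of the approach the post describes: compare memory in 64-bit chunks, then finish the remainder byte by byte. The function name and structure are my assumptions, not the author's exact code.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical reconstruction: walk both buffers in 8-byte chunks,
   then fall back to a byte loop for the tail (or for the first
   differing chunk, to determine ordering like memcmp does). */
int my_memcmp(const void *a, const void *b, size_t n)
{
    const unsigned char *pa = (const unsigned char *)a;
    const unsigned char *pb = (const unsigned char *)b;
    size_t i = 0;

    /* Bulk pass: one 64-bit comparison covers 8 bytes at a time. */
    for (; i + 8 <= n; i += 8) {
        uint64_t va, vb;
        memcpy(&va, pa + i, 8);  /* memcpy avoids unaligned-access UB */
        memcpy(&vb, pb + i, 8);
        if (va != vb)
            break;               /* byte loop below finds the order */
    }

    /* Tail (or first differing chunk): byte-by-byte, like memcmp. */
    for (; i < n; ++i) {
        if (pa[i] != pb[i])
            return pa[i] < pb[i] ? -1 : 1;
    }
    return 0;
}
```

Loading through memcpy() rather than casting to uint64_t* keeps the sketch free of unaligned-access undefined behavior; compilers typically fold it into a single mov.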
Here are the results of the profiling tests: as you can see, it runs at the same speed for large regions of memory and is faster for small regions (which is actually 90% of the use case for string comparisons). Now, initially I used this variant of the first loop:
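The loop itself is not shown here, but a pointer-incrementing first variant of the kind the post goes on to criticize might look like this (a sketch with assumed names; it only tests chunk equality, leaving byte ordering to a tail loop):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch of the original pointer-based loop: two
   separate pointers advance on every iteration, which costs two
   increment instructions per pass. */
int equal_ptr_loop(const uint64_t *a, const uint64_t *b, size_t count)
{
    const uint64_t *end = a + count;
    while (a != end) {
        if (*a != *b)
            return 0;  /* chunks differ */
        ++a;           /* pointer increment 1 */
        ++b;           /* pointer increment 2 */
    }
    return 1;          /* all chunks equal */
}
```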
However, I noticed that my implementation was consistently 30-35% slower than the standard memcmp(). To understand why, I looked at the generated assembly: I had forced the compiler to produce fairly inefficient code that incremented two memory pointers on every iteration, wasting two instructions. When I changed the loop, look what I got: by using an index instead of pointers, I allowed the compiler to emit a very efficient machine instruction: mov rax, QWORD PTR ?membo@@3PADA[rdx+rcx*8]. The address rdx+rcx*8 is computed inside a single instruction!

I also tried using __m128i vector registers to compare 16 bytes at a time, but for memory sizes below 1 MB the vector operations seem to carry too much overhead compared to plain 64-bit integers. Perhaps for very large memory chunks, 1 GB or more, they would be faster.

If you have ever wondered how the standard memory manipulation functions are implemented, I hope this post furthers your comprehension of the low-level basics that practically every program relies on.
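The index-based rewrite the post describes might be sketched as follows (names are my assumptions). With a single loop counter, the compiler can fold base + i*8 into one scaled addressing mode, such as the mov rax, QWORD PTR [rdx+rcx*8] form quoted above:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch of the index-based loop: only one counter is
   incremented, and each load uses base + i*8 addressing, which
   x86-64 encodes in a single instruction. */
int equal_index_loop(const uint64_t *a, const uint64_t *b, size_t count)
{
    for (size_t i = 0; i < count; ++i) {
        if (a[i] != b[i])  /* both loads use scaled addressing */
            return 0;
    }
    return 1;
}
```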