Continuing where I left off from my previous post, my task was clear: make string copy better! When I initially picked this function, what I didn’t realize was that it was clearly a good candidate for optimization, so good in fact that a couple of years ago someone had already gone ahead and made an AArch64-specific version. There are actually quite a few functions already optimized for this architecture (nowhere near the number of x86_64-optimized ones), and they live in a folder called sysdeps/aarch64. In this folder, I found an assembly file called strcpy.S which contained AArch64-specific optimizations. I won’t be going through the whole file, but here is a link to it for reference:
So was all hope lost? Was that it for me and the possibility of making this function better? Was this version as good as it got? Initially I thought yes, but upon further inspection I found that this version, while obviously better than the naive implementation, did not make use of SIMD to accomplish the copying. This was made pretty obvious simply by looking at the registers declared at the top of the file.
```
/* Locals and temporaries.  */
#define src		x2
#define dst		x3
#define data1		x4
#define data1w		w4
#define data2		x5
#define data2w		w5
#define has_nul1	x6
#define has_nul2	x7
#define tmp1		x8
#define tmp2		x9
#define tmp3		x10
#define tmp4		x11
#define zeroones	x12
#define data1a		x13
#define data2a		x14
#define pos		x15
#define len		x16
#define to_align	x17
```
What do you know, no vector registers in sight! This opened the door to my potential idea of vectorizing the code. This was the part that I was interested in optimizing:
```
	.p2align 6
	/* Aligning here ensures that the entry code and main loop all lies
	   within one 64-byte cache line.  */
L(bulk_entry):
	sub	to_align, to_align, #16
	stp	data1, data2, [dstin]
	sub	src, srcin, to_align
	sub	dst, dstin, to_align
	b	L(entry_no_page_cross)

	/* The inner loop deals with two Dwords at a time.  This has a
	   slightly higher start-up cost, but we should win quite quickly,
	   especially on cores with a high number of issue slots per cycle,
	   as we get much better parallelism out of the operations.  */
L(main_loop):
	stp	data1, data2, [dst], #16
L(entry_no_page_cross):
	ldp	data1, data2, [src], #16
	sub	tmp1, data1, zeroones
	orr	tmp2, data1, #REP8_7f
	sub	tmp3, data2, zeroones
	orr	tmp4, data2, #REP8_7f
	bic	has_nul1, tmp1, tmp2
	bics	has_nul2, tmp3, tmp4
	ccmp	has_nul1, #0, #0, eq	/* NZCV = 0000  */
	b.eq	L(main_loop)

	/* Since we know we are copying at least 16 bytes, the fastest way
	   to deal with the tail is to determine the location of the
	   trailing NUL, then (re)copy the 16 bytes leading up to that.  */
	cmp	has_nul1, #0
```
But there was also another potential optimization, one that was actually pointed out by my professor, who is working on a similar optimization in strlen.
```
/* NUL detection works on the principle that (X - 1) & (~X) & 0x80
   (=> (X - 1) & ~(X | 0x7f)) is non-zero iff a byte is zero, and
   can be done in parallel across the entire word.  */

#define REP8_01 0x0101010101010101
#define REP8_7f 0x7f7f7f7f7f7f7f7f
#define REP8_80 0x8080808080808080
```
Looking at the above defines, we see that the author has used an interesting NUL detection trick: subtracting 1 from each byte of the word underflows only the bytes that were zero, setting their high bit, and masking with the complement of the original word filters out bytes that already had the high bit set. I still need to further clarify the exact mechanics behind this detection system, but it seemed fairly apparent that there was probably a more intuitive way to do this.
So looking at the potential optimizations above, with the help of my professor, we came up with some pseudo code for SIMD copying:
Loop:
1. Load 16 bytes at a time into a vector register.
2. Find the minimum value in the vector. If it is zero, you know you've reached the null byte somewhere in that vector, so branch and finish the copy byte by byte. If the minimum isn't zero, you know you haven't reached the end of the string and it's safe to continue.
3. Store the 16 bytes into the destination pointer and loop back to load another 16 bytes.
Looking at the pseudo code, we can see the potential speedup, as well as the fact that this method seems more intuitive for null detection. But that’s all conjecture until we actually show some empirical evidence that SIMD copying is faster. The actual assembly implementation will have to wait till the next blog post!