SPO600 Project – Strcpy – is it already optimized?

Continuing where I left off from my previous post, my task was clear: make string copy better! When I initially picked this function, what I didn’t realize that this function was clearly a good candidate for optimization, so good in fact that a couple of years ago, someone had already gone and ahead and made an Aarch64 specific version. There are actually quite a few functions already optimized for this architecture (no where near the number of x86_64 optimized ones) that were placed in a folder called sysdeps/aarch64. In this folder, I found an assembly file called strcpy.S which contained aarch64 specific optimizations. I won’t be going through the whole file, but here is a link to it for reference:

https://github.com/bminor/glibc/blob/master/sysdeps/aarch64/strcpy.S

So was all hope lost? Was that it for me and the possibility for making this function better? Was this version as good as it got? Initially I thought yes, but upon further inspection, I did in fact find that this version, despite being obviously better than the naive implementation, did lack utilization of SIMD to accomplish the copying. This was actually made pretty obvious by simply looking at the registers that were declared at the top of the file.

/* Locals and temporaries.  */
#define src		x2
#define dst		x3
#define data1		x4
#define data1w		w4
#define data2		x5
#define data2w		w5
#define has_nul1	x6
#define has_nul2	x7
#define tmp1		x8
#define tmp2		x9
#define tmp3		x10
#define tmp4		x11
#define zeroones	x12
#define data1a		x13
#define data2a		x14
#define pos		x15
#define len		x16
#define to_align	x17

What do you know, no vector registers in sight! This opened the door to my potential idea of vectorizing the code. This was the part that I was interesting in optimizing:

	.p2align 6
	/* Aligning here ensures that the entry code and main loop all lies
	   within one 64-byte cache line.  */
L(bulk_entry):
	sub	to_align, to_align, #16
	stp	data1, data2, [dstin]
	sub	src, srcin, to_align
	sub	dst, dstin, to_align
	b	L(entry_no_page_cross)

	/* The inner loop deals with two Dwords at a time.  This has a
	   slightly higher start-up cost, but we should win quite quickly,
	   especially on cores with a high number of issue slots per
	   cycle, as we get much better parallelism out of the operations.  */
L(main_loop):
	stp	data1, data2, [dst], #16
L(entry_no_page_cross):
	ldp	data1, data2, [src], #16
	sub	tmp1, data1, zeroones
	orr	tmp2, data1, #REP8_7f
	sub	tmp3, data2, zeroones
	orr	tmp4, data2, #REP8_7f
	bic	has_nul1, tmp1, tmp2
	bics	has_nul2, tmp3, tmp4
	ccmp	has_nul1, #0, #0, eq	/* NZCV = 0000  */
	b.eq	L(main_loop)

	/* Since we know we are copying at least 16 bytes, the fastest way
	   to deal with the tail is to determine the location of the
	   trailing NUL, then (re)copy the 16 bytes leading up to that.  */
	cmp	has_nul1, #0

But there was also another potential optimization that could be done that was actually pointed out by my professor, who is working on a similar function in strlen.

/* NUL detection works on the principle that (X - 1) & (~X) & 0x80
   (=> (X - 1) & ~(X | 0x7f)) is non-zero iff a byte is zero, and
   can be done in parallel across the entire word.  */

#define REP8_01 0x0101010101010101
#define REP8_7f 0x7f7f7f7f7f7f7f7f
#define REP8_80 0x8080808080808080

Looking at the above defines, we see that the author has actually used an interesting null detection system, where if the character is not null, it will overflow. I still need to further clarify the exact mechanics behind this detection system, but it seemed fairly apparent that there was probably an easier way to do this.

So looking at the potential optimizations above, with the help of my professor, we came up with some pseudo code to go ahead and allow for SIMD copying:

Loop:
     1. load 16 bytes at a time into a vector register
     2. Find the minimum value in the vector: if it is zero, then you know 
     you've reached the nullbyte somewhere in that vector, so you branch 
     and begin copying byte by byte. If the lowest value isn't zero, 
     then you know you haven't reached the end of the string and 
     it's safe to continue loop to store another 16 bytes.

Store back into pointer to destination array once entire loop was complete.

Looking at the pseudo code, we can see the potential speed up as well as the fact that his method seems more intuitive in null detection. But that’s all conjecture until we actually show some empirical evidence that SIMD copying is faster. The actually assembly implementation will have to wait till the next blog post!

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s