SPO600 – Strcpy Optimization Implementation

So the last time I checked in, I had some pseudocode for my potential strcpy optimization. As stated previously, in order to implement this optimization for AArch64, I had to delve into the assembly file strcpy.S, located in the sysdeps/aarch64 directory. The code I was interested in optimizing is listed below:

        .p2align 6
        /* Aligning here ensures that the entry code and main loop all lies
           within one 64-byte cache line.  */
L(bulk_entry):
        sub     to_align, to_align, #16
        stp     data1, data2, [dstin]
        sub     src, srcin, to_align
        sub     dst, dstin, to_align
        b       L(entry_no_page_cross)

        /* The inner loop deals with two Dwords at a time.  This has a
           slightly higher start-up cost, but we should win quite quickly,
           especially on cores with a high number of issue slots per
           cycle, as we get much better parallelism out of the operations.  */
L(main_loop):
        stp     data1, data2, [dst], #16
L(entry_no_page_cross):
        ldp     data1, data2, [src], #16
        sub     tmp1, data1, zeroones
        orr     tmp2, data1, #REP8_7f
        sub     tmp3, data2, zeroones
        orr     tmp4, data2, #REP8_7f
        bic     has_nul1, tmp1, tmp2
        bics    has_nul2, tmp3, tmp4
        ccmp    has_nul1, #0, #0, eq    /* NZCV = 0000  */
        b.eq    L(main_loop)

        /* Since we know we are copying at least 16 bytes, the fastest way
           to deal with the tail is to determine the location of the
           trailing NUL, then (re)copy the 16 bytes leading up to that.  */
        cmp     has_nul1, #0

From my understanding, after alignment, the entry_no_page_cross loop loads 16 bytes (two doublewords) at a time, and if it doesn't find a NUL byte, it branches back to main_loop, which stores those 16 bytes before loading the next chunk.

So I went ahead and changed a couple of things to allow for the use of vector registers, as well as a new null-detection methodology.

        .p2align 6
        /* Aligning here ensures that the entry code and main loop all lies
           within one 64-byte cache line.  */
L(bulk_entry):
        sub     to_align, to_align, #16
        stp     data1, data2, [dstin]
        sub     src, srcin, to_align
        sub     dst, dstin, to_align

L(vector_entry):
        ld1     {v0.16b}, [src], #16    /* Load 16 bytes into a vector register.  */
        uminv   b3, v0.16b              /* Find the minimum byte in the vector.  */
        umov    w10, v3.b[0]            /* Move it to a 32-bit register so it
                                           can be used with cmp.  */
        cmp     w10, #0                 /* A zero minimum means a NUL byte is
                                           present; finish byte by byte.  */
        b.eq    L(byte_copy)

L(vector_store):
        st1     {v0.16b}, [dst], #16    /* Post-increment dst so the next store
                                           doesn't overwrite this one.  */
        b       L(vector_entry)

L(byte_copy):
        sub     src, src, #16           /* Rewind to the start of the chunk
                                           containing the NUL.  */
L(byte_loop):
        ldrb    w1, [src], #1
        strb    w1, [dst], #1
        cmp     w1, #0
        b.ne    L(byte_loop)
        ret

So in order to actually start the compilation, there was a bit of preamble that had to be done:

mkdir $HOME/src
cd $HOME/src
git clone git://sourceware.org/git/glibc.git
mkdir -p $HOME/build/glibc
cd $HOME/build/glibc
$HOME/src/glibc/configure --prefix=/usr
make

After running the make command, my optimized version compiled successfully (after a few iterations and tweaking)! My next post will be about how I went about testing my new version using the included testrun script.
