SPO600 Project Phase 3 – False Hope

EDIT 1:30PM:

Unfortunately, upon further inspection, my suspicions about the difference between using the testrun script and running the code without it proved to be correct. When the unoptimized version was tested using the script, its timings were roughly the same as my optimized version. These results are shown in the final part of this blog entry.

———-

It has been quite some time since I last checked in. With my final project for PRJ666 complete, along with exams, I’ve finally had some time to sit and look at my code in peace. After mulling over the code for a while, I realized that the fix was actually quite simple.

In my last iteration, the code looked something like this:

        .p2align 6
        /* Aligning here ensures that the entry code and main loop all lies
           within one 64-byte cache line.  */
L(bulk_entry):
        sub     to_align, to_align, #16
        stp     data1, data2, [dstin]
        sub     src, srcin, to_align
        sub     dst, dstin, to_align

L(vector_entry):
        ld1     {v0.16b}, [src], #16
        uminv   B3, v0.16b
        umov    w10, v3.16b[0]
        cmp     w10, #0
        b.eq    L(byte_entry)

L(vector_store):
        st1     {v0.16b}, [dst]
        b       L(vector_entry)

L(byte_entry):
        sub     src, src, #16
        b       L(byte_copy)

L(byte_copy):
        ldrb    w1, [src], #1
        strb    w1, [dst], #1
        cmp     w1, #0
        b.ne    L(byte_copy)

The fix was to add a post-index offset of #16 to the st1 instruction in the vector_store branch, so that dst advances by 16 bytes after each store instead of every chunk being written to the same address. The code now looks like this:

https://github.com/jdesmond91/spo600-glibc/blob/strcpy/sysdeps/aarch64/strcpy.S 

        .p2align 6
        /* Aligning here ensures that the entry code and main loop all lies
           within one 64-byte cache line.  */
L(bulk_entry):
        sub     to_align, to_align, #16
        stp     data1, data2, [dstin]
        sub     src, srcin, to_align
        sub     dst, dstin, to_align

L(vector_entry):
        ld1     {v0.16b}, [src], #16     
        uminv   B3, v0.16b
        umov    w10, v3.16b[0] 
        cmp     w10, #0
        b.eq    L(byte_entry)

L(vector_store):
        st1     {v0.16b}, [dst], #16      
        b       L(vector_entry)

L(byte_entry):
        sub     src, src, #16
        b       L(byte_copy)

L(byte_copy):
        ldrb    w1, [src], #1 
        strb    w1, [dst], #1
        cmp     w1, #0
        b.ne    L(byte_copy)

With this new change, my optimized string copy now works perfectly.
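To show why that one operand mattered, here is the difference expressed in plain C pointer terms. This is only a hypothetical illustration (the function names are mine, not part of glibc): without the post-index writeback, dst never moves, so every 16-byte chunk overwrites the previous one.

#include <string.h>

/* Hypothetical illustration only -- not the actual glibc code. */

/* Old behaviour: st1 {v0.16b}, [dst] leaves the base register unchanged,
   so every 16-byte chunk loaded from src is stored over the previous one. */
void copy_chunks_broken(char *dst, const char *src, size_t nchunks)
{
    for (size_t i = 0; i < nchunks; i++) {
        memcpy(dst, src, 16);   /* dst never advances */
        src += 16;              /* ld1 ... [src], #16 did advance src */
    }
}

/* New behaviour: st1 {v0.16b}, [dst], #16 stores the chunk and then
   advances dst, so consecutive chunks land back to back. */
void copy_chunks_fixed(char *dst, const char *src, size_t nchunks)
{
    for (size_t i = 0; i < nchunks; i++) {
        memcpy(dst, src, 16);
        src += 16;
        dst += 16;              /* post-index writeback */
    }
}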

Here is the source code for my basic tester:

#include <stdio.h>
#include <string.h>

int main(void)
{
   char src[300];
   char dest[300];

   memset(dest, '\0', sizeof(dest));   /* clear the destination buffer */
   strcpy(src, "This is a basic string copy tester that will simply test the functionality without measuring time");
   strcpy(dest, src);                  /* the call under test */
   printf("Final copied string : %s\n", dest);
   if (strcmp(src, dest) == 0) {
        printf("The two strings are the same and the copying worked correctly\n");
   }
   return 0;
}

Here are the expected results:
[Screenshot: Simple_expected_new]

Here are the observed results:
[Screenshot: Simple_observed_working]

As you can see, my new and improved function works as intended. So let’s try something more rigorous to see whether there is a real improvement.

Here is the source code for my benchmark tester:

#include <stdio.h>
#include <string.h>
#include <time.h>
#include <stdlib.h>

/* Fill dest with `length` random alphanumeric characters plus a
   terminating null, so dest must have room for length + 1 bytes.  */
void rand_str(char *dest, size_t length) {
    char charset[] = "0123456789"
                     "abcdefghijklmnopqrstuvwxyz"
                     "ABCDEFGHIJKLMNOPQRSTUVWXYZ";

    while (length-- > 0) {
        /* Dividing by RAND_MAX + 1.0 keeps the index strictly below
           sizeof charset - 1, so the trailing '\0' is never picked.  */
        size_t index = (double) rand() / ((double) RAND_MAX + 1.0) * (sizeof charset - 1);
        *dest++ = charset[index];
    }
    *dest = '\0';
}

int main(void)
{
   struct timespec tstart = {0,0}, tend = {0,0};
   printf("entering test\n");

   int bytes = (1024*1024);
   char *src2 = (char *) malloc(bytes);
   char *dest2 = (char *) malloc(bytes);

   /* bytes - 1 random characters plus the terminator fill the buffer exactly. */
   rand_str(src2, bytes - 1);
   //printf("%s\n\n", src2);

   clock_gettime(CLOCK_MONOTONIC, &tstart);
   strcpy(dest2, src2);
   clock_gettime(CLOCK_MONOTONIC, &tend);

   printf("string copy took about %.5f seconds\n",
           ((double)tend.tv_sec + 1.0e-9*tend.tv_nsec) -
           ((double)tstart.tv_sec + 1.0e-9*tstart.tv_nsec));

   //printf("%s\n\n", dest2);
   if (strcmp(src2, dest2) == 0) {
        printf("copied strings are the same\n");
   }

   free(src2);
   free(dest2);
   return 0;
}

This more advanced tester allocates two 1MB buffers, src2 and dest2, then calls the rand_str function to fill src2 with random alphanumeric characters. It then calls strcpy to copy src2 into dest2 while measuring the elapsed time with clock_gettime, prints the measured time, and finally calls strcmp to verify that the two strings are identical.

Here is the expected output using the regular string copy function:

[Screenshot: Complex_expected]

Here is the observed output using my optimized version:

[Screenshot: Complex_observed]

As you can see, the functionality works as intended, since the string compare check printed its confirmation line. On top of this, the original string copy took about 0.00116 seconds, while my new optimized version took about 0.00027 seconds, which is approximately a 4.3x speedup (0.00116 / 0.00027 ≈ 4.3).

Initially, I allocated a 1MB string by setting the bytes variable to 1024*1024.
I then changed the tester to allocate a larger, 4MB string by setting the bytes variable to 2048*2048.

Upon testing, here are the results of the normal string copy:

[Screenshot: Complex_expected_2mb]

And here is my optimized version:

[Screenshot: Complex_observed_2mb]

As you can see, even with the larger string the copying functionality still worked correctly, and the speedup held up, albeit not quite as large. Here the optimized version was about 3.44x faster than the unoptimized one, which is still very significant!

At last, my optimized function was working as intended. Just to reiterate what my optimization actually was, here is the basic outline:

1. load 16 bytes into a vector register
2. find the minimum byte value across the vector register (UMINV)
3. move that minimum into a general-purpose register (w10) so it can be used with the cmp instruction
4. compare against zero: if a null byte was found, copy the remaining bytes one at a time;
   otherwise keep copying 16 bytes at a time

So the two key points are that we are copying 16 bytes at a time AND that we are simplifying the existing null detection, which previously took a lot of instructions. The UMINV instruction finds the lowest byte value across a vector register in one step; if any of the 16 bytes is a null byte, that minimum will be 0.
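To make the idea concrete, here is a rough C analogue of that loop. This is only an illustrative sketch (the function name and structure are mine, not glibc’s), and note that in portable C, reading 16 bytes past the terminator like this would be out of bounds; the assembly can do it safely because the aligned 16-byte loads never cross a page boundary.

#include <stddef.h>

/* Illustrative sketch only -- not the actual glibc implementation. */
static char *copy_16_at_a_time(char *dst, const char *src)
{
    char *d = dst;

    for (;;) {
        /* UMINV analogue: find the minimum byte value in the next 16 bytes. */
        unsigned char min = 255;
        for (size_t i = 0; i < 16; i++)
            if ((unsigned char) src[i] < min)
                min = (unsigned char) src[i];

        /* cmp/b.eq analogue: a minimum of 0 means the chunk contains '\0'. */
        if (min == 0)
            break;

        /* st1 ... [dst], #16 analogue: store the whole chunk, then advance. */
        for (size_t i = 0; i < 16; i++)
            d[i] = src[i];
        src += 16;
        d += 16;
    }

    /* byte_copy analogue: copy the tail one byte at a time, including '\0'. */
    while ((*d++ = *src++) != '\0')
        ;
    return dst;
}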

The interesting thing to note is that I only optimized one section of the assembly file. There are many other branches that still use the old null detection scheme. Due to time constraints I was unfortunately not able to change those other places, but perhaps someone will take up the mantle after me and convert the rest. As you can see, a change in even one branch produced a huge improvement in execution time, so I think there is still a lot of room for improvement.

The other thing to note is that I was only able to test on our little-endian system, so I have no idea whether this change will behave correctly on a big-endian system. Further testing is clearly required, as I suspect the results may differ between running the function natively and running it through the testrun script. Since I am not fully confident in my testing results, I will not be submitting a pull request.

EDIT 1:30PM:

As stated at the top of the post, my suspicions have been confirmed. Here are the results of testing the unoptimized version already in glibc through the testrun script:

Complex_expected_testrun

Just to reiterate, here are the results of my version of the function:

Complex_observed_testrun

As you can see, the timings are basically the same. It is unfortunate that my optimization did not make a measurable difference. Perhaps applying the new null detection scheme to the other parts of the assembly file would yield a speedup, but for now, this is where I close my work on this project.

—————–

As I said, I would love for future students to continue where I left off, so I am leaving my code here:

https://github.com/jdesmond91/spo600-glibc/blob/strcpy/sysdeps/aarch64/strcpy.S

This has been a really interesting project and I will write some thoughts about that as well as the course in general in my next and final blog post.
