SPO600 – Final Thoughts

On the first day of class, our professor told us that this would be a very challenging course, and boy did it live up to those expectations. Completing this project, and the course in general, proved to be quite demanding, but more importantly, it forced me to think about programming in a way I never had before.

I had always heard that assembly language was very difficult, but when we were first actually tasked with constructing a simple for loop, it was completely different from what I was used to. So already I could tell I was in for a ride. But as the weeks progressed and our tasks got more complex, it became easier and easier. And yet the thought that we had to go and optimize a function in glibc did not fill me with ease.

I would say picking a function was just as hard as the optimization itself. As stated in my previous entries, looking at the assembly file for strcpy was very unnerving. But I sat down with my professor and we realized that there was actually a lot that could be done, especially with null detection and the use of vector registers. Although in the end I wasn’t able to improve my function, more important was the thinking involved in such a task. All of a sudden coding became about more than just making the program run; it became about making the program run efficiently. This thinking was in line with a lot of my other courses as well: this semester I learned about data structures and picking the most appropriate one based on your needs, as well as basic parallel programming, where we learned to use the GPU and the CUDA API to perform simple operations much faster than on a CPU.

So although it was at times a very frustrating experience, I definitely feel like I came out of this project with a whole new understanding of what it means to write well-written code. Definitely not an easy course, but I would highly recommend it to all future students.

SPO600 Project Phase 3 – False Hope

EDIT 1:30PM:

Upon further inspection, it is unfortunate to say, but my suspicions about using the testrun script versus running the code without it proved to be correct. When the unoptimized version was tested using the script, the timings were roughly the same as my optimized version’s. I have shown these results in the final part of this blog entry.

———-

So the last time I checked in was quite some time ago. With my final project for PRJ666 complete, along with exams, I’ve finally had some time to sit and look at my code in peace. After some time mulling over the code, I finally realized that the fix was actually quite simple.

In my last iteration, the code looked something like this:

        .p2align 6
        /* Aligning here ensures that the entry code and main loop all lies
           within one 64-byte cache line.  */
L(bulk_entry):
        sub     to_align, to_align, #16
        stp     data1, data2, [dstin]
        sub     src, srcin, to_align
        sub     dst, dstin, to_align

L(vector_entry):
        ld1     {v0.16b}, [src], #16
        uminv   B3, v0.16b
        umov    w10, v3.16b[0]
        cmp     w10, #0
        b.eq    L(byte_entry)

L(vector_store):
        st1     {v0.16b}, [dst]
        b       L(vector_entry)

L(byte_entry):
        sub     src, src, #16
        b       L(byte_copy)

L(byte_copy):
        ldrb    w1, [src], #1
        strb    w1, [dst], #1
        cmp     w1, #0
        b.ne    L(byte_copy)

The fix was to change the st1 in the vector_store branch to the post-indexed form, st1 {v0.16b}, [dst], #16, so that dst advances by 16 bytes after every store (much like *dst++ in C). The code now looks like this:

https://github.com/jdesmond91/spo600-glibc/blob/strcpy/sysdeps/aarch64/strcpy.S 

        .p2align 6
        /* Aligning here ensures that the entry code and main loop all lies
           within one 64-byte cache line.  */
L(bulk_entry):
        sub     to_align, to_align, #16
        stp     data1, data2, [dstin]
        sub     src, srcin, to_align
        sub     dst, dstin, to_align

L(vector_entry):
        ld1     {v0.16b}, [src], #16     
        uminv   B3, v0.16b
        umov    w10, v3.16b[0] 
        cmp     w10, #0
        b.eq    L(byte_entry)

L(vector_store):
        st1     {v0.16b}, [dst], #16      
        b       L(vector_entry)

L(byte_entry):
        sub     src, src, #16
        b       L(byte_copy)

L(byte_copy):
        ldrb    w1, [src], #1 
        strb    w1, [dst], #1
        cmp     w1, #0
        b.ne    L(byte_copy)

With this new change, my optimized string copy now works perfectly.

Here is the source code for my basic tester:

#include <stdio.h>
#include <string.h>

int main()
{
   char src[300];
   char dest[300];
  
   memset(dest, '\0', sizeof(dest));
   strcpy(src, "This is a basic string copy tester that will simply test the functionality without measuring time");
   strcpy(dest, src);
   printf("Final copied string : %s\n", dest);
   if (strcmp(src, dest) == 0) {
        printf("The two strings are the same and the copying worked correctly\n");
   }
   return(0);
}

Here are the expected results:
[Screenshot: Simple_expected_new]

Here are the observed results:
[Screenshot: Simple_observed_working]

As you can see, my new and improved function works as intended. So let’s try something more rigorous to truly see whether or not there’s an improvement.

Here is the source code for my benchmark tester:

#include <stdio.h>
#include <string.h>
#include <time.h>
#include <stdlib.h>
 
void rand_str(char *dest, size_t length) {
    char charset[] = "0123456789"
                     "abcdefghijklmnopqrstuvwxyz"
                     "ABCDEFGHIJKLMNOPQRSTUVWXYZ";

    while (length-- > 0) {
        /* scale into [0, 61] so the terminating NUL in charset can never
           be picked (rand() == RAND_MAX would otherwise index the NUL) */
        size_t index = (size_t) ((double) rand() / ((double) RAND_MAX + 1)
                                 * (sizeof charset - 1));
        *dest++ = charset[index];
    }
    *dest = '\0';
}

int main()
{
   struct timespec tstart={0,0}, tend={0,0};
   printf("entering test\n"); 

   char *src2;
   int bytes = (1024*1024);
   src2 = (char *) malloc(bytes);

   char *dest2;
   dest2 = (char *) malloc(bytes);   

   rand_str(src2, bytes - 1); /* rand_str writes length + 1 bytes including the NUL */
   //printf("%s\n\n", src2);
   
   clock_gettime(CLOCK_MONOTONIC, &tstart);
   strcpy(dest2, src2);
   clock_gettime(CLOCK_MONOTONIC, &tend);

   printf("string copy took about %.5f seconds\n",
           ((double)tend.tv_sec + 1.0e-9*tend.tv_nsec) - 
           ((double)tstart.tv_sec + 1.0e-9*tstart.tv_nsec));

   //printf("%s\n\n", dest2);
   if(strcmp(src2, dest2) == 0) {
        printf("copied strings are the same\n");
   }
   return(0);
}

Essentially, this more advanced tester allocates two 1MB char buffers, src2 and dest2. It then calls the rand_str function, which fills src2 with random alphanumeric characters. I then call strcpy to copy src2 into dest2 while measuring the time. It prints the measured time and, lastly, calls strcmp to ensure that the two strings are the same.
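For reference, here is roughly how I compile a tester and run it against the freshly built glibc rather than the installed system one (the file name benchmark.c is just a placeholder, and the build directory is the one from my earlier compilation post):

gcc -g benchmark.c -o benchmark
$HOME/build/glibc/testrun.sh ./benchmark

Running ./benchmark directly would instead use the system glibc at run time, which is exactly the testrun-versus-native difference this post keeps coming back to.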

Here is the expected output using the regular string copy function:

[Screenshot: Complex_expected]

Here is the observed output using my optimized version:

[Screenshot: Complex_observed]

As you can see, the functionality is working as intended, as the string compare check printed its line. On top of this, you can see that the original string copy took about 0.00116 seconds, while my new optimized version took about 0.00027 seconds. That is approximately a 4.3x speedup!

Initially, I allocated a 1MB string by setting the bytes variable to 1024*1024.
I then changed the tester to allocate a larger string by setting the bytes variable to 2048*2048 (which works out to 4MB).

Upon testing, here are the results of the normal string copy:

[Screenshot: Complex_expected_2mb]

And here is my optimized version:

[Screenshot: Complex_observed_2mb]

As you can see, even with the larger string the copying functionality still worked correctly AND the speedup was consistent, albeit not as large. Here the speedup was about 3.44x over the unoptimized version, which is still very significant!

At last my optimized function was working as intended. Just to reiterate what my optimization actually was, here is the basic outline:

1. load 16 bytes into a vector register
2. find the minimum byte value in the vector register
3. move it to a 32-bit register to use with the cmp instruction
4. cmp, and if null is found then copy one byte at a time,
   else continue copying 16 bytes at a time

So the two key points here are that we are actually copying 16 bytes at a time AND we are simplifying the existing null detection, which took A LOT of instructions. By using the UMINV instruction, we can easily find the lowest byte value in a vector register, which will be 0 exactly when the register contains a null byte.
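For anyone more comfortable in C, here is a minimal sketch of the same null check using NEON intrinsics (my own illustration, not code from the project; vminvq_u8 is the intrinsic form of UMINV):

#include <arm_neon.h>

/* Non-zero if any of the 16 bytes at p is a NUL.
   Assumes at least 16 readable bytes at p. */
static inline int chunk_has_nul(const unsigned char *p)
{
    uint8x16_t v = vld1q_u8(p);   /* ld1   {v0.16b}, [src] */
    return vminvq_u8(v) == 0;     /* uminv b3, v0.16b: the minimum byte
                                     is 0 exactly when a NUL is present */
}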

The interesting thing to note is that I actually only optimized one section of the assembly file. There are many other branches in it that still use the old null detection system. Due to time constraints I was unfortunately not able to change those as well, but perhaps someone will take up the mantle after me and change the rest. As you can see, even a change in one branch had huge effects on execution time, so I think there is still a lot of room for improvement.

The other thing to note is that I was only able to test on our little-endian system; I have no idea whether or not this improvement will work on a big-endian system. Clearly further testing is required, as I think there may be some differences when running the function natively versus through the testrun script. As I am not fully confident in my testing results, I will not be submitting a pull request.

EDIT 1:30PM:

As stated at the top of the post, my suspicions have been confirmed. Here are the results of testing the unoptimized version already in glibc using the testrun script:

[Screenshot: Complex_expected_testrun]

Just to reiterate, here are the results of my version of the function:

[Screenshot: Complex_observed_testrun]

As you can see, the timings are basically the same. It is unfortunate that my optimizations did not make a difference. Perhaps adding the new null detection algorithm to the other parts of the assembly file would increase the speedup, but as of right now, this is where I close my work on this project.

—————–

But as I said, I would love for future students to continue where I left off, and as such I will be leaving my code here:

https://github.com/jdesmond91/spo600-glibc/blob/strcpy/sysdeps/aarch64/strcpy.S

This has been a really interesting project, and I will share some thoughts about it, as well as the course in general, in my next and final blog post.

SPO600 – Strcpy Optimization Testing

So in my previous post I discussed the implementation of my optimization, and having compiled it successfully, I was ready to move on to testing.

The first test I did was a very simple strcpy test:

#include <stdio.h>
#include <string.h>

int main()
{
   char src[40];
   char dest[100];
  
   memset(dest, '\0', sizeof(dest));
   strcpy(src, "Very simple test");
   strcpy(dest, src);

   printf("Final copied string : %s\n", dest);
   
   return(0);
}

Compiled using the following command:

gcc -g simple.c -o simple

The expected output would be:

Final copied string : Very simple test

Using the testrun script in the build directory, I ran the following command:

./testrun.sh simple

Bad news! The output I got was:

Final copied string : Very sim

I will need some time to hack away at this and will edit this post once I’m done.

EDIT:

So after some time of debugging and testing a variety of things, I’ve changed the code to look like this:

BEFORE

        .p2align 6
        /* Aligning here ensures that the entry code and main loop all lies
           within one 64-byte cache line.  */
L(bulk_entry):
        sub     to_align, to_align, #16
        stp     data1, data2, [dstin]
        sub     src, srcin, to_align
        sub     dst, dstin, to_align

L(vector_entry):
        ld1     {v0.16b}, [src], #16    /*load 16 bytes into vector register*/ 
        uminv   B3, v0.16b              /*find the minimum value in the vector register*/
        umov    w10, v3.16b[0]          /*move it to a 32 bit register to use                 
                                          with cmp instruction*/
        cmp     w10, #0                 /*if null is found then enter byte by                
                                          byte copy*/
        b.eq    L(byte_copy)

L(vector_store):
        st1     {v0.16b}, [dst]      
        b       L(vector_entry)

L(byte_copy):
        ldrb    w1, [src], #1
        strb    w1, [dst], #1
        cmp     w1, #0
        b.eq    L(byte_copy)

AFTER

        .p2align 6
        /* Aligning here ensures that the entry code and main loop all lies
           within one 64-byte cache line.  */
L(bulk_entry):
        sub     to_align, to_align, #16
        stp     data1, data2, [dstin]
        sub     src, srcin, to_align
        sub     dst, dstin, to_align

L(vector_entry):
        ld1     {v0.16b}, [src], #16
        uminv   B3, v0.16b
        umov    w10, v3.16b[0]
        cmp     w10, #0
        b.eq    L(byte_entry)

L(vector_store):
        st1     {v0.16b}, [dst]
        b       L(vector_entry)

L(byte_entry):
        sub     src, src, #16
        b       L(byte_copy)

L(byte_copy):
        ldrb    w1, [src], #1
        strb    w1, [dst], #1
        cmp     w1, #0
        b.ne    L(byte_copy)

Things to note:
1. I added another branch called byte_entry so that the src address can be moved back 16 bytes (undoing the post-increment already performed by the ld1 instruction) before initiating the byte-by-byte copy.
2. The branch at the end of byte_copy changed from b.eq to b.ne, so the loop keeps copying until the NUL terminator has been written.
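As a C analogy of these changes (my own sketch, just to show the pointer arithmetic), the byte_entry/byte_copy path amounts to the following:

/* src has already been advanced 16 bytes past the chunk in which the NUL
   was detected, so rewind first, then copy up to and including it */
static char *byte_copy_sketch(char *dst, const char *src)
{
    char c;
    src -= 16;               /* sub  src, src, #16  (the new byte_entry) */
    do {
        c = *src++;          /* ldrb w1, [src], #1 */
        *dst++ = c;          /* strb w1, [dst], #1 */
    } while (c != '\0');     /* cmp  w1, #0 ; b.ne L(byte_copy) */
    return dst;
}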

Expected Output:

[Screenshot: Simple_expected]

Observed Output:

[Screenshot: Simple_observed]

So far so good, but I’ve encountered an issue when the string is longer.

Expected Output:

[Screenshot: Simple_longer_expected]

Observed Output:

[Screenshot: Simple_longer_observed]

The string is getting cut off, but I am unsure why. I will continue to hack away, but more than likely I will require some outside guidance. Hopefully my next post will be more fruitful!

SPO600 – Strcpy Optimization Implementation

So the last time I checked in, I had some pseudo code for my potential optimization of strcpy. As stated previously, in order to implement this optimization for AArch64, I had to delve into the assembly file strcpy.S, located in the sysdeps/aarch64 directory. The code that I was interested in optimizing is listed below:

	.p2align 6
	/* Aligning here ensures that the entry code and main loop all lies
	   within one 64-byte cache line.  */
L(bulk_entry):
	sub	to_align, to_align, #16
	stp	data1, data2, [dstin]
	sub	src, srcin, to_align
	sub	dst, dstin, to_align
	b	L(entry_no_page_cross)

	/* The inner loop deals with two Dwords at a time.  This has a
	   slightly higher start-up cost, but we should win quite quickly,
	   especially on cores with a high number of issue slots per
	   cycle, as we get much better parallelism out of the operations.  */
L(main_loop):
	stp	data1, data2, [dst], #16
L(entry_no_page_cross):
	ldp	data1, data2, [src], #16
	sub	tmp1, data1, zeroones
	orr	tmp2, data1, #REP8_7f
	sub	tmp3, data2, zeroones
	orr	tmp4, data2, #REP8_7f
	bic	has_nul1, tmp1, tmp2
	bics	has_nul2, tmp3, tmp4
	ccmp	has_nul1, #0, #0, eq	/* NZCV = 0000  */
	b.eq	L(main_loop)

	/* Since we know we are copying at least 16 bytes, the fastest way
	   to deal with the tail is to determine the location of the
	   trailing NUL, then (re)copy the 16 bytes leading up to that.  */
	cmp	has_nul1, #0

From my understanding, after alignment, the loop at entry_no_page_cross loads 16 bytes at a time, and if it doesn’t find a null byte it goes ahead and stores those 16 bytes.
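To make sure I understood it, here is a loose C rendering of that main loop (my own sketch, for illustration only; the real code also handles alignment, page crossing, and the tail):

#include <stdint.h>
#include <string.h>

#define REP8_01 0x0101010101010101ULL   /* the zeroones register */
#define REP8_7f 0x7f7f7f7f7f7f7f7fULL

/* copy 16 bytes per iteration until either 8-byte word contains a NUL */
static void bulk_copy_sketch(char *dst, const char *src)
{
    uint64_t data1, data2;
    for (;;) {
        memcpy(&data1, src, 8);      /* ldp data1, data2, [src], #16 */
        memcpy(&data2, src + 8, 8);
        src += 16;
        if (((data1 - REP8_01) & ~(data1 | REP8_7f)) != 0 ||
            ((data2 - REP8_01) & ~(data2 | REP8_7f)) != 0)
            break;                   /* a NUL is somewhere in these 16 bytes */
        memcpy(dst, &data1, 8);      /* stp data1, data2, [dst], #16 */
        memcpy(dst + 8, &data2, 8);
        dst += 16;
    }
    /* tail handling (locating the NUL and re-copying) omitted */
}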

So I went ahead and changed a couple of things to allow for the use of vector registers as well as a new null detection methodology.

        .p2align 6
        /* Aligning here ensures that the entry code and main loop all lies
           within one 64-byte cache line.  */
L(bulk_entry):
        sub     to_align, to_align, #16
        stp     data1, data2, [dstin]
        sub     src, srcin, to_align
        sub     dst, dstin, to_align

L(vector_entry):
        ld1     {v0.16b}, [src], #16    /*load 16 bytes into vector register*/ 
        uminv   B3, v0.16b              /*find the minimum value in the vector register*/
        umov    w10, v3.16b[0]          /*move it to a 32 bit register to use                 
                                          with cmp instruction*/
        cmp     w10, #0                 /*if null is found then enter byte by                
                                          byte copy*/
        b.eq    L(byte_copy)

L(vector_store):
        st1     {v0.16b}, [dst]      
        b       L(vector_entry)

L(byte_copy):
        ldrb    w1, [src], #1
        strb    w1, [dst], #1
        cmp     w1, #0
        b.ne    L(byte_copy)

So in order to actually commence the compilation, there was a bit of preamble to be done:

mkdir $HOME/src
cd $HOME/src
git clone git://sourceware.org/git/glibc.git
mkdir -p $HOME/build/glibc
cd $HOME/build/glibc
$HOME/src/glibc/configure --prefix=/usr
make

After running the make command, my optimized version compiled successfully (after a few iterations and tweaking)! My next post will be about how I went about testing my new version using the included testrun script, which runs a binary against the freshly built glibc without installing it.

SPO600 Project – Strcpy – is it already optimized?

Continuing where I left off from my previous post, my task was clear: make string copy better! When I initially picked this function, what I didn’t realize was that it was clearly a good candidate for optimization, so good in fact that a couple of years ago someone had already gone ahead and made an AArch64-specific version. There are actually quite a few functions already optimized for this architecture (nowhere near the number of x86_64-optimized ones), placed in a folder called sysdeps/aarch64. In this folder, I found an assembly file called strcpy.S which contained the AArch64-specific optimizations. I won’t be going through the whole file, but here is a link to it for reference:

https://github.com/bminor/glibc/blob/master/sysdeps/aarch64/strcpy.S

So was all hope lost? Was that it for me and the possibility of making this function better? Was this version as good as it got? Initially I thought yes, but upon further inspection I found that this version, despite being obviously better than the naive implementation, did not make use of SIMD to accomplish the copying. This was actually made pretty obvious simply by looking at the registers declared at the top of the file.

/* Locals and temporaries.  */
#define src		x2
#define dst		x3
#define data1		x4
#define data1w		w4
#define data2		x5
#define data2w		w5
#define has_nul1	x6
#define has_nul2	x7
#define tmp1		x8
#define tmp2		x9
#define tmp3		x10
#define tmp4		x11
#define zeroones	x12
#define data1a		x13
#define data2a		x14
#define pos		x15
#define len		x16
#define to_align	x17

What do you know, no vector registers in sight! This opened the door to my potential idea of vectorizing the code. This was the part that I was interested in optimizing:

	.p2align 6
	/* Aligning here ensures that the entry code and main loop all lies
	   within one 64-byte cache line.  */
L(bulk_entry):
	sub	to_align, to_align, #16
	stp	data1, data2, [dstin]
	sub	src, srcin, to_align
	sub	dst, dstin, to_align
	b	L(entry_no_page_cross)

	/* The inner loop deals with two Dwords at a time.  This has a
	   slightly higher start-up cost, but we should win quite quickly,
	   especially on cores with a high number of issue slots per
	   cycle, as we get much better parallelism out of the operations.  */
L(main_loop):
	stp	data1, data2, [dst], #16
L(entry_no_page_cross):
	ldp	data1, data2, [src], #16
	sub	tmp1, data1, zeroones
	orr	tmp2, data1, #REP8_7f
	sub	tmp3, data2, zeroones
	orr	tmp4, data2, #REP8_7f
	bic	has_nul1, tmp1, tmp2
	bics	has_nul2, tmp3, tmp4
	ccmp	has_nul1, #0, #0, eq	/* NZCV = 0000  */
	b.eq	L(main_loop)

	/* Since we know we are copying at least 16 bytes, the fastest way
	   to deal with the tail is to determine the location of the
	   trailing NUL, then (re)copy the 16 bytes leading up to that.  */
	cmp	has_nul1, #0

But there was also another potential optimization, one that was actually pointed out by my professor, who was working on a similar improvement for strlen.

/* NUL detection works on the principle that (X - 1) & (~X) & 0x80
   (=> (X - 1) & ~(X | 0x7f)) is non-zero iff a byte is zero, and
   can be done in parallel across the entire word.  */

#define REP8_01 0x0101010101010101
#define REP8_7f 0x7f7f7f7f7f7f7f7f
#define REP8_80 0x8080808080808080

Looking at the defines above, we can see that the author has used an interesting null detection system: for each byte X, the expression (X - 1) & ~X & 0x80 is non-zero only when X is zero, and the REP8 constants let that check run across all eight bytes of a 64-bit word at once. Even so, it seemed fairly apparent that there was probably a simpler way to do this.
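To convince myself of the mechanics, here is a small C sketch of that word-at-a-time check (my own illustration, following the comment above):

#include <stdint.h>

#define REP8_01 0x0101010101010101ULL
#define REP8_7f 0x7f7f7f7f7f7f7f7fULL

/* Non-zero iff one of the eight bytes in w is zero: a zero byte wraps to
   0xff when 1 is subtracted, setting its top bit, while the ~(w | 0x7f...)
   mask discards any byte whose top bit was already set to begin with. */
static inline int word_has_nul(uint64_t w)
{
    return ((w - REP8_01) & ~(w | REP8_7f)) != 0;
}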

So, looking at the potential optimizations above and with the help of my professor, we came up with some pseudo code to allow for SIMD copying:

Loop:
     1. load 16 bytes at a time into a vector register
     2. Find the minimum value in the vector: if it is zero, then you know 
     you've reached the nullbyte somewhere in that vector, so you branch 
     and begin copying byte by byte. If the lowest value isn't zero, 
     then you know you haven't reached the end of the string and 
     it's safe to continue loop to store another 16 bytes.

Store each full 16-byte chunk back into the destination array as the loop repeats.

Looking at the pseudo code, we can see the potential speedup, as well as the fact that this method seems more intuitive for null detection. But that’s all conjecture until we actually show some empirical evidence that SIMD copying is faster. The actual assembly implementation will have to wait until the next blog post!

SPO600 Project – Choosing a glibc function

After several weeks of working with assembler, that time of the semester is upon us where we must take what we’ve learned and apply that theory to something practical. We have been tasked with picking a glibc function and optimizing it for AArch64. Whether we use assembly or pure C is totally up to us. We have been given nearly absolute freedom, where normally we would be guided by strict instructions. As liberating as that is, I would still say it is a daunting task. Not often do we get to work on something as widely used as the glibc functions.

With so many functions to choose from, I would say actually picking one is just as hard as the implementation, which is the major reason it took me so long to get this post going. But after a lot of research, I think I’ve picked a decent function. Now whether or not I can actually go ahead and optimize it is a whole other blog post.

So which function have I picked, you ask? I’ve gone with strcpy, which, as the name suggests, is the function that copies strings. So why this function over the rest? Well, when I really sat down to think about the best way to make something better, it made sense to pick a function whose instructions would be repeated many times. Even by simply looking at a portion of the naive implementation, we can see that there is basically a copy operation performed over and over:

  do
    {
      c = *s++;
      s[off] = c;
    }
  while (c != '\0');

So I thought: after learning so much about vectorization and SIMD, why not take these repeated copy operations and do them in parallel? Good in theory, but the actual execution is quite challenging. I will discuss my thought processes and experiences in further detail over the next few blog posts!

SPO600 – Lab 6 – Vectorization Lab

In this lab, we were asked to do the following tasks:

  1. Write a short program that creates two 1000-element integer arrays and fills them with random numbers, then sums those two arrays to a third array, and finally sums the third array to a long int and prints the result.
  2. Compile the program on AArch64 in such a way that auto-vectorization would be enabled.
  3. Annotate the emitted code (i.e., obtain a disassembly via objdump -d and add comments to the instructions explaining what the code does).

Before I begin, I want to briefly touch on what it means to auto-vectorize something. Normally, as displayed in the code below, we would iterate through both arrays at the same time and perform the addition one index at a time. With auto-vectorization, the compiler instead uses SIMD registers to process several elements per instruction (four 32-bit ints fit in one 128-bit NEON register), which can significantly increase efficiency.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#define MAX 1000

int main() {

    //initialize three arrays of 1000 elements
    int array1[MAX];
    int array2[MAX];
    int sum[MAX];

    //variable to store the total of the sum array
    long int total = 0;

    //seed pseudo-random values
    srand(time(NULL));

    int i;
    for (i = 0; i < MAX; i++) {

        //generate random values from 0-9
        array1[i] = rand() % 10;
        array2[i] = rand() % 10;

        //store the result in the sum array
        sum[i] = array1[i] + array2[i];

        //add to the running total
        total += sum[i];
    }

    printf("Total Sum: %ld\n", total);
    return 0;
}

In order to tell the compiler to enable auto-vectorization (which is included in the -O3 optimization level), we specify the -O3 option when compiling:

gcc -O3 -o lab6 lab6.c

Then we use the following command to view the object dump:

objdump -d lab6

Below is the output, which I have annotated so that we can understand what’s happening under the hood:

0000000000400550 <main>:
 // -- prologue: save registers and set up the stack frame -- //
 400550: a9bd7bfd stp x29, x30, [sp,#-48]! //save the frame pointer and link register
 400554: 910003fd mov x29, sp //set up the frame pointer
 400558: d2800000 mov x0, #0x0 //NULL argument for time()
 40055c: a9025bf5 stp x21, x22, [sp,#32] //save callee-saved registers used below
 400560: a90153f3 stp x19, x20, [sp,#16] //save callee-saved registers used below

 //seed the generator with srand(time(NULL))
 400564: 97ffffdf bl 4004e0 <time@plt>
 400568: 97fffff2 bl 400530 <srand@plt>

 40056c: 52807d13 mov w19, #0x3e8 // #1000 // loop counter = MAX
 400570: d2800015 mov x21, #0x0 // #0 // total = 0
 400574: 52800156 mov w22, #0xa // #10 // divisor for the % 10

 // -- start of the loop -- //

 //two rand() calls per iteration
 400578: 97ffffe2 bl 400500 <rand@plt>
 40057c: 2a0003f4 mov w20, w0 //save the first rand() result
 400580: 97ffffe0 bl 400500 <rand@plt>
 400584: 1ad60e83 sdiv w3, w20, w22 //w3 = rand1 / 10
 400588: 1ad60c02 sdiv w2, w0, w22 //w2 = rand2 / 10

 //rand % 10 computed as rand - (rand / 10) * 10
 40058c: 0b030863 add w3, w3, w3, lsl #2 //w3 = (rand1 / 10) * 5
 400590: 0b020842 add w2, w2, w2, lsl #2 //w2 = (rand2 / 10) * 5
 400594: 4b030694 sub w20, w20, w3, lsl #1 //w20 = rand1 % 10, i.e. array1[i]
 400598: 4b020400 sub w0, w0, w2, lsl #1 //w0 = rand2 % 10, i.e. array2[i]
 40059c: 0b000280 add w0, w20, w0 //array1[i] + array2[i], i.e. sum[i]
 4005a0: 71000673 subs w19, w19, #0x1 //decrement the loop counter
 4005a4: 8b20c2b5 add x21, x21, w0, sxtw //total += sum[i]
 4005a8: 54fffe81 b.ne 400578 <main+0x28> //loop until the counter reaches zero

 // -- end of the loop -- //

 // -- print the result and restore registers -- //
 4005ac: 90000000 adrp x0, 400000 //page address of the format string
 4005b0: aa1503e1 mov x1, x21 //pass total to printf
 4005b4: 911f4000 add x0, x0, #0x7d0 //complete the format string address

 4005b8: 97ffffe2 bl 400540 <printf@plt>
 4005bc: 2a1303e0 mov w0, w19 //return value (w19 is 0 after the loop)
 4005c0: a9425bf5 ldp x21, x22, [sp,#32] //restore callee-saved registers
 4005c4: a94153f3 ldp x19, x20, [sp,#16]
 4005c8: a8c37bfd ldp x29, x30, [sp],#48 //restore the frame pointer and link register
 4005cc: d65f03c0 ret

Having gone through the assembly, the surprising takeaway is that there are no vector instructions in sight: because each iteration calls rand(), the compiler cannot vectorize this loop, and it has even optimized the arrays away entirely, keeping only the running total in a register. The STP/LDP pairs at the start and end are simply saving and restoring registers. To actually see auto-vectorization in the emitted code, the random fill and the summation need to be in separate loops, as in the variant below.
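Here is that variant (my own suggestion, not part of the lab instructions). With -O3, GCC can vectorize the addition and reduction loops because they no longer contain any function calls; the fill loop still cannot be vectorized:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#define MAX 1000

int main() {
    int array1[MAX];
    int array2[MAX];
    int sum[MAX];
    long int total = 0;
    int i;

    srand(time(NULL));

    //this loop cannot be vectorized because of the rand() calls
    for (i = 0; i < MAX; i++) {
        array1[i] = rand() % 10;
        array2[i] = rand() % 10;
    }

    //element-wise addition: a prime candidate for SIMD
    for (i = 0; i < MAX; i++) {
        sum[i] = array1[i] + array2[i];
    }

    //reduction into a single total: also vectorizable
    for (i = 0; i < MAX; i++) {
        total += sum[i];
    }

    printf("Total Sum: %ld\n", total);
    return 0;
}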

The last part of this lab was to propose a solution that would use inline assembler to vectorize our previous sound sample lab. Here is the rough thought process (a sketch follows the list):

  1. Utilize LD1 to load the samples into a vector register
  2. Use DUP to duplicate the volume factor across a vector register
  3. Use SQDMULH to multiply the samples by the volume factor and keep the high half
  4. Use ST1 to store the results back into the array
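And here is a rough sketch of how that outline might look with GCC extended inline assembler (my own illustration, untested: the function name, the assumption that volume is a Q15 fixed-point factor, and processing eight 16-bit samples per iteration are all mine; a real version would also handle the leftover samples when n is not a multiple of eight):

#include <stdint.h>
#include <stddef.h>

void scale_samples(int16_t *samples, size_t n, int16_t volume /* Q15 */)
{
    size_t i;
    for (i = 0; i + 8 <= n; i += 8) {
        __asm__ volatile(
            "ld1     {v0.8h}, [%0]        \n\t" /* 1. load 8 samples           */
            "dup     v1.8h, %w1           \n\t" /* 2. broadcast the volume     */
            "sqdmulh v0.8h, v0.8h, v1.8h  \n\t" /* 3. saturating doubling
                                                      multiply, keep high half */
            "st1     {v0.8h}, [%0]        \n\t" /* 4. store the scaled samples */
            :
            : "r"(samples + i), "r"(volume)
            : "v0", "v1", "memory");
    }
}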