SPO600 – Lab 6 – Vectorization Lab

In this lab, we were first asked to complete the following tasks:

  1. Write a short program that creates two 1000-element integer arrays and fills them with random numbers, then sums those two arrays to a third array, and finally sums the third array to a long int and prints the result.
  2. Compile the program on AArch64 in such a way that auto vectorization is enabled.
  3. Annotate the emitted code (i.e., obtain a disassembly via objdump -d and add comments to the instructions explaining what the code does).

Before I begin, I want to briefly touch on what it means to auto vectorize something. Normally, as displayed in the code below, we would iterate through both arrays at the same time and do the addition one index at a time. With auto vectorization, the compiler uses SIMD instructions to add several elements at once (a 128-bit AArch64 vector register holds four 32-bit ints), which can significantly increase throughput.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define MAX 1000

int main() {

 //initialize three arrays of 1000 elements
 int array1[MAX];
 int array2[MAX];
 int sum[MAX];

 //variable to store total of the sum array
 long int total = 0;

 //seed pseudo random values
 srand(time(NULL));

 int i;
 for (i = 0; i < MAX; i++) {

 //generate random values in the range 0-9
 array1[i] = rand() % 10;
 array2[i] = rand() % 10;

 //store the result in the sum array
 sum[i] = array1[i] + array2[i];

 //add to the running total
 total += sum[i];
 }

 //total is a long int, so it is printed with %ld
 printf("Total Sum: %ld\n", total);
 return 0;
}
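
One thing I noticed about the program above is that rand() is called in the same loop as the addition, and a loop that calls an opaque library function gives the compiler very little to vectorize. Below is a minimal sketch of an alternative layout, assuming it is acceptable for the lab to split the work into separate loops. The function names (fill_random, add_arrays, sum_array) are my own and are not part of the lab; the assert calls are only there as basic sanity checks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Fill an array with pseudo-random values in the range 0-9.
   rand() is a library call, so this loop is expected to stay scalar. */
void fill_random(int *dst, size_t n) {
    assert(dst != NULL);
    for (size_t i = 0; i < n; i++) {
        dst[i] = rand() % 10;
    }
}

/* Element-wise addition with no function calls in the loop body.
   restrict tells the compiler that the three arrays do not overlap. */
void add_arrays(const int *restrict a, const int *restrict b,
                int *restrict out, size_t n) {
    assert(a != NULL && b != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        out[i] = a[i] + b[i];
    }
}

/* Reduce an int array into a long total. */
long sum_array(const int *a, size_t n) {
    assert(a != NULL);
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        total += a[i];
    }
    return total;
}
```

In main, these would be called as fill_random(array1, MAX), fill_random(array2, MAX), add_arrays(array1, array2, sum, MAX) and total = sum_array(sum, MAX). Whether the second and third loops actually get vectorized depends on the compiler version and flags, so it is still worth checking the disassembly for instructions that operate on v registers.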

In order to tell the compiler to use auto vectorization rather than the “normal” way, we have to specify the -O3 option when compiling (in GCC, -O3 turns on -ftree-vectorize):

gcc -O3 -o lab6 lab6.c

Then we use the following command to view the object dump:

objdump -d lab6

Below is the output, which I have annotated so that we can understand what’s happening under the hood:

0000000000400550 <main>:
 // -- function prologue -- //
 400550: a9bd7bfd stp x29, x30, [sp,#-48]! //allocate 48 bytes of stack and save the frame pointer (x29) and link register (x30)
 400554: 910003fd mov x29, sp //set the frame pointer to the new top of the stack
 400558: d2800000 mov x0, #0x0 //first argument for time(): NULL
 40055c: a9025bf5 stp x21, x22, [sp,#32] //save callee-saved registers x21 and x22
 400560: a90153f3 stp x19, x20, [sp,#16] //save callee-saved registers x19 and x20

 //setting time(NULL) to be the seed
 400564: 97ffffdf bl 4004e0 <time@plt> //call time(NULL), result comes back in x0
 400568: 97fffff2 bl 400530 <srand@plt> //call srand() with that result as the seed

 40056c: 52807d13 mov w19, #0x3e8 // #1000 //loop counter, starts at MAX (1000) and counts down
 400570: d2800015 mov x21, #0x0 // #0 //total = 0
 400574: 52800156 mov w22, #0xa // #10 //divisor used for the % 10

 // -- start of the loop -- //

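 //generating two random numbers and reducing each to the range 0-9 (the arrays themselves have been optimized away)
 400578: 97ffffe2 bl 400500 <rand@plt> //first rand() call
 40057c: 2a0003f4 mov w20, w0 //keep the first random value in w20
 400580: 97ffffe0 bl 400500 <rand@plt> //second rand() call, result in w0
 400584: 1ad60e83 sdiv w3, w20, w22 //w3 = first value / 10
 400588: 1ad60c02 sdiv w2, w0, w22 //w2 = second value / 10

 40058c: 0b030863 add w3, w3, w3, lsl #2 //w3 = w3 * 5
 400590: 0b020842 add w2, w2, w2, lsl #2 //w2 = w2 * 5
 400594: 4b030694 sub w20, w20, w3, lsl #1 //w20 = first value - w3 * 2, i.e. first value % 10
 400598: 4b020400 sub w0, w0, w2, lsl #1 //w0 = second value - w2 * 2, i.e. second value % 10
 40059c: 0b000280 add w0, w20, w0 //array1[i] + array2[i], the value of sum[i]
 4005a0: 71000673 subs w19, w19, #0x1 //decrement the loop counter and set the flags
 4005a4: 8b20c2b5 add x21, x21, w0, sxtw //sign-extend sum[i] to 64 bits and add it to total
 4005a8: 54fffe81 b.ne 400578 <main+0x28> //loop again if the counter is not yet zero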
 // -- end of the loop -- //

 // -- Print the result -- //
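 4005ac: 90000000 adrp x0, 400000 //load the page address that contains the format string
 4005b0: aa1503e1 mov x1, x21 //second argument for printf: total
 4005b4: 911f4000 add x0, x0, #0x7d0 //first argument for printf: address of the format string

 4005b8: 97ffffe2 bl 400540 <printf@plt> //call printf
 4005bc: 2a1303e0 mov w0, w19 //return value: w19 is 0 once the loop has finished
 4005c0: a9425bf5 ldp x21, x22, [sp,#32] //restore callee-saved registers x21 and x22
 4005c4: a94153f3 ldp x19, x20, [sp,#16] //restore callee-saved registers x19 and x20
 4005c8: a8c37bfd ldp x29, x30, [sp],#48 //restore the frame pointer and link register, release the 48 bytes of stack
 4005cc: d65f03c0 ret //return to the caller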

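Having gone through the assembly, what stands out is that this build contains no vector instructions at all. The STP instructions at the top only save registers to the stack, and the loop works on one pair of random values at a time in ordinary w registers. The compiler has also optimized the three arrays away and simply keeps a running total in x21. My best guess is that the rand() calls inside the loop are what prevent vectorization here, so the random fill would have to be separated from the addition before any SIMD instructions show up. There are still a couple of assembler instructions that I am unclear about.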

The last part of this lab was to propose a solution that would use inline assembler to vectorize our previous sound sample lab. In order to achieve this, here is the rough thought process:

  1. Use LD1 to load the samples from memory into a vector register
  2. Use DUP to duplicate the volume factor into every lane of a second vector register
  3. Use SQDMULH to multiply each sample by the volume factor and keep the high half of the doubled result
  4. Use ST1 to store the results back into an array
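
To check the arithmetic in the steps above before writing any inline assembler, here is a minimal scalar sketch in C of what I expect each 16-bit lane to compute. It is not the inline assembler itself. It assumes the samples are int16_t, that the volume factor lies between 0.0 and 1.0, and that the factor is converted to Q15 fixed point before being duplicated. The names volume_to_q15, sqdmulh_lane and scale_samples are hypothetical helpers of my own.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Convert a volume factor (assumed to be in 0.0-1.0) to Q15 fixed point.
   1.0 would be 32768, which does not fit in int16_t, so it is clamped. */
int16_t volume_to_q15(float volume) {
    if (volume < 0.0f) volume = 0.0f;
    if (volume > 1.0f) volume = 1.0f;
    long q = (long)(volume * 32768.0f);
    if (q > INT16_MAX) q = INT16_MAX;
    return (int16_t)q;
}

/* Scalar model of one 16-bit SQDMULH lane:
   double the product, keep the high 16 bits, saturate to int16_t. */
int16_t sqdmulh_lane(int16_t a, int16_t b) {
    int64_t doubled = 2 * (int64_t)a * (int64_t)b;
    int64_t high = doubled / 65536;          /* truncates toward zero */
    if (doubled % 65536 < 0) high -= 1;      /* adjust to round down instead */
    if (high > INT16_MAX) high = INT16_MAX;  /* only -32768 * -32768 overflows */
    return (int16_t)high;
}

/* Apply the volume factor to every sample in place, one "lane" at a time. */
void scale_samples(int16_t *samples, size_t n, float volume) {
    assert(samples != NULL);
    int16_t factor = volume_to_q15(volume);
    for (size_t i = 0; i < n; i++) {
        samples[i] = sqdmulh_lane(samples[i], factor);
    }
}
```

For example, a factor of 0.5 becomes 16384 in Q15, and a sample of 1000 gives 2 * 1000 * 16384 = 32768000, whose high half is 500. The doubling in SQDMULH is what makes a Q15 factor line up with the top 16 bits, so no extra shift should be needed after the multiply.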
