2. Base

2.1 Copying Data

For this task, we have been given a C++ driver that makes calls to C and assembly functions. The first C function copies 7 32-bit integers from one array to another. The second C function does the same, but for a variable number of integers that is passed as the first argument. The two assembly functions are supposed to provide the same functionality as the C functions; however, they had not been implemented yet.

For context, the C functions are as follows:

copy_c.c
#include <stdint.h>

#ifdef __cplusplus
extern "C" {
#endif

void copy_c_0( int32_t * __restrict a,
               int32_t * __restrict b ) {
  b[0] = a[0];
  b[1] = a[1];
  b[2] = a[2];
  b[3] = a[3];
  b[4] = a[4];
  b[5] = a[5];
  b[6] = a[6];
}

void copy_c_1( int64_t              n,
               int32_t * __restrict a,
               int32_t * __restrict b ) {
  for( uint64_t i = 0; i < n; ++i ) {
    b[i] = a[i];
  }
}

#ifdef __cplusplus
}
#endif

Task 2.1.1 & 2.1.2

Given the C functions, our task is to implement the assembly functions so that they match the C functions in functionality, using only base instructions. For the first function, we simply load and store the 7 32-bit integers from the source to the destination using the ldr and str instructions. We use the given base addresses and an immediate offset that is incremented by 4 for each 32-bit integer (4 bytes = 32 bits).

For the second function, we need a loop to copy the values from the source to the destination. We use two registers to keep track of the number of elements copied and the current byte offset. Each loop iteration copies one 32-bit integer, then increments the element count by 1 and the byte offset by 4 (since each integer is 4 bytes). Next, we use the cmp instruction to check whether we have copied the specified number of integers. If not, we branch back to the beginning of the loop and repeat the process; otherwise, we exit the loop and return from the function.

copy_asm.s
    .text
    .type copy_asm_0, %function
    .global copy_asm_0
    /*
    * Copies an array of 7 32-bit integers from one
    * location to another.
    *
    * @param x0: address of the source array.
    * @param x1: address of the destination array.
    */
copy_asm_0:
    // save frame pointer and link register
    stp x29, x30, [sp, #-16]!
    // update frame pointer to current stack pointer
    mov x29, sp

    // b[0] = a[0]
    ldr w2, [x0]
    str w2, [x1]
    // b[1] = a[1]
    ldr w2, [x0, #4]
    str w2, [x1, #4]
    // b[2] = a[2]
    ldr w2, [x0, #8]
    str w2, [x1, #8]
    // b[3] = a[3]
    ldr w2, [x0, #12]
    str w2, [x1, #12]
    // b[4] = a[4]
    ldr w2, [x0, #16]
    str w2, [x1, #16]
    // b[5] = a[5]
    ldr w2, [x0, #20]
    str w2, [x1, #20]
    // b[6] = a[6]
    ldr w2, [x0, #24]
    str w2, [x1, #24]

    // restore frame pointer and link register
    ldp x29, x30, [sp], #16

    ret


    .text
    .type copy_asm_1, %function
    .global copy_asm_1
    /*
    * Copies an array of n 32-bit integers from one
    * location to another.
    *
    * @param x0: number of elements to copy.
    * @param x1: address of the source array.
    * @param x2: address of the destination array.
    */
copy_asm_1:
    // save frame pointer and link register
    stp x29, x30, [sp, #-16]!
    // update frame pointer to current stack pointer
    mov x29, sp
    // number of elements copied
    mov x3, #0
    // byte offset for array
    mov x4, #0
loop:
    // b[i] = a[i]
    ldr w5, [x1, x4]
    str w5, [x2, x4]
    // increment the number of elements copied
    add x3, x3, #1
    // increment the byte offset
    add x4, x4, #4
    // check if we have copied n elements
    cmp x3, x0
    // if not, loop again
    blt loop
    // restore frame pointer and link register
    ldp x29, x30, [sp], #16

    ret
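
The verification driver itself was provided with the assignment and is not reproduced here. As a rough illustration of how the two assembly functions can be checked against a reference, a minimal stand-alone sketch is shown below; the array contents, the memcmp-based check, and the exact output format are assumptions modeled on the driver output shown afterwards (the real driver additionally checks copy_c_0 and copy_c_1).

sketch of a minimal driver
#include <cstdint>
#include <cstring>
#include <iostream>

// declarations matching the parameter descriptions in copy_asm.s
extern "C" {
  void copy_asm_0( int32_t * a, int32_t * b );
  void copy_asm_1( int64_t n, int32_t * a, int32_t * b );
}

int main() {
  int32_t l_src[7] = { 1, 2, 3, 4, 5, 6, 7 };
  int32_t l_dst[7] = { 0 };

  // fixed-size copy
  copy_asm_0( l_src, l_dst );
  std::cout << "copy_asm_0: "
            << ( std::memcmp( l_src, l_dst, sizeof( l_src ) ) == 0 ? "copy succeeded" : "copy failed" )
            << "\n";

  // variable-size copy
  std::memset( l_dst, 0, sizeof( l_dst ) );
  copy_asm_1( 7, l_src, l_dst );
  std::cout << "copy_asm_1: "
            << ( std::memcmp( l_src, l_dst, sizeof( l_src ) ) == 0 ? "copy succeeded" : "copy failed" )
            << "\n";

  return 0;
}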

After compiling and running the driver, we can see that the assembly functions work as expected. The output of the driver is as follows:

$ ./copy_driver
copy_c_0: copy succeeded
copy_c_1: copy succeeded
copy_asm_0: copy succeeded
copy_asm_1: copy succeeded

2.2 Instruction Throughput and Latency

For this task, we were supposed to write a micro-benchmarking script to measure the throughput and the latency of the ADD (shifted register) and the MUL instructions.

2.2.1 Throughput

To measure the throughput of the instructions, we developed an assembly function for each of the two instructions. The idea behind measuring throughput is that consecutive instructions must not depend on the results of the instructions before them, so the core can execute as many of them per cycle as its execution units allow. The loop that was executed several times for the ADD (shifted register) and the MUL instruction was:

loop in add_instr.s
loop:
    add x1, x27, x28
    add x2, x27, x28
    add x3, x27, x28
    add x4, x27, x28
    add x5, x27, x28

    add x6, x27, x28
    add x7, x27, x28
    add x8, x27, x28
    add x9, x27, x28
    add x10, x27, x28

    add x11, x27, x28
    add x12, x27, x28
    add x13, x27, x28
    add x14, x27, x28
    add x15, x27, x28

    add x16, x27, x28
    add x17, x27, x28
    add x19, x27, x28
    add x20, x27, x28
    add x21, x27, x28

    add x22, x27, x28
    add x23, x27, x28
    add x24, x27, x28
    add x25, x27, x28
    add x26, x27, x28

    subs x0, x0, #1          // decrement loop counter
    b.gt loop                // branch if greater than zero
loop in mul_instr.s
loop:
    mul x1, x27, x28
    mul x2, x27, x28
    mul x3, x27, x28
    mul x4, x27, x28
    mul x5, x27, x28

    mul x6, x27, x28
    mul x7, x27, x28
    mul x8, x27, x28
    mul x9, x27, x28
    mul x10, x27, x28

    mul x11, x27, x28
    mul x12, x27, x28
    mul x13, x27, x28
    mul x14, x27, x28
    mul x15, x27, x28

    mul x16, x27, x28
    mul x17, x27, x28
    mul x19, x27, x28
    mul x20, x27, x28
    mul x21, x27, x28

    mul x22, x27, x28
    mul x23, x27, x28
    mul x24, x27, x28
    mul x25, x27, x28
    mul x26, x27, x28

    subs x0, x0, #1          // decrement loop counter
    b.gt loop                // branch if greater than zero

2.2.2 Latency

To measure the latency of the two instructions, we developed two more assembly functions. The idea behind measuring latency is to have consecutive instructions that each depend on the result of the previously executed instruction, so they cannot overlap. That meant we had to slightly adjust the loops from the throughput programs, resulting in the following loops for the ADD (shifted register) and the MUL instruction:

loop in add_lat_instr.s
loop:
    // repeat the following block 5 times
    .rept 5
    add x1, x1, x2
    add x1, x1, x2
    add x1, x1, x2
    add x1, x1, x2
    add x1, x1, x2
    .endr

    subs x0, x0, #1          // decrement loop counter
    b.gt loop                // branch if greater than zero
loop in mul_lat_instr.s
loop:
    // repeat the following block 5 times
    .rept 5
    mul x1, x1, x2
    mul x1, x1, x2
    mul x1, x1, x2
    mul x1, x1, x2
    mul x1, x1, x2
    .endr

    subs x0, x0, #1          // decrement loop counter
    b.gt loop                // branch if greater than zero

2.2.3 Results

To test our assembly functions, we implemented a microbenchmark in C++ that would (1) call these functions several times, (2) measure the time it took to complete these computations, and (3) calculate the GOPS obtained from these computations.

time measurement for add_instr in microbench.cpp
        auto l_start_time = std::chrono::high_resolution_clock::now();
        add_instr( n );
        auto l_end_time = std::chrono::high_resolution_clock::now();
        elapsedTime = std::chrono::duration_cast<std::chrono::microseconds>(l_end_time - l_start_time).count() / 1e6;
GOPS calculation for add_instr in microbench.cpp
    double totalOps = n * 25;
    double opsPerSec = totalOps / elapsedTime;
    double gops = opsPerSec / 1e9;
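
For completeness, a self-contained sketch of such a harness is given below, shown only for the ADD throughput kernel; the iteration count, the extern "C" signature of add_instr, and the output wording are assumptions based on our setup (the actual microbench.cpp times the MUL and the two latency kernels in the same way).

sketch of a self-contained benchmark harness
#include <chrono>
#include <cstdint>
#include <iostream>

// assembly kernel: executes n loop iterations, each containing 25 independent ADDs
// (signature assumed; x0 carries the loop counter)
extern "C" void add_instr( uint64_t n );

int main() {
  // assumed iteration count; the value used for the reported results is not shown here
  uint64_t n = 1500000000;

  auto l_start_time = std::chrono::high_resolution_clock::now();
  add_instr( n );
  auto l_end_time = std::chrono::high_resolution_clock::now();

  // elapsed time in seconds
  double elapsedTime = std::chrono::duration_cast<std::chrono::microseconds>( l_end_time - l_start_time ).count() / 1e6;

  // 25 ADD instructions per loop iteration
  double totalOps  = static_cast<double>( n ) * 25;
  double opsPerSec = totalOps / elapsedTime;
  double gops      = opsPerSec / 1e9;

  std::cout << "Total time (s):   " << elapsedTime << "\n"
            << "Instructions per Second:   " << opsPerSec << "\n"
            << "Estimated GOPS:   " << gops << " GigaOps/sec\n";

  return 0;
}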

To compile and execute the microbenchmark we simply used:

$ g++ microbench.cpp add_instr.s add_lat_instr.s mul_instr.s mul_lat_instr.s -o microbench.o
$ ./microbench.o

When executing our benchmarking script we obtained the following results:

results
Benchmarking ADD throughput ...
-----------------------------------------------
Measuring throughput for Instruction
Total time (s):   1.29126
Instructions per Second:   2.90414e+10
Estimated GOPS:   29.0414 GigaOps/sec
-----------------------------------------------

Benchmarking MUL throughput ...
-----------------------------------------------
Measuring throughput for Instruction
Total time (s):   2.82838
Instructions per Second:   1.32584e+10
Estimated GOPS:   13.2584 GigaOps/sec
-----------------------------------------------

Benchmarking ADD latency ...
-----------------------------------------------
Measuring latency for Instruction
Total time (s):   0.856261
Instructions per Second:   4.37951e+09
Estimated GOPS:   4.37951 GigaOps/sec
-----------------------------------------------

Benchmarking MUL latency ...
-----------------------------------------------
Measuring latency for Instruction
Total time (s):   2.5642
Instructions per Second:   1.46244e+09
Estimated GOPS:   1.46244 GigaOps/sec
-----------------------------------------------

To understand the results, we did two things: (1) we looked up the clock speed of the machine, which is about 4.4 GHz, and (2) we consulted the ARM Neoverse V2 Software Optimization Guide for the base instructions.

The guide specifies the following theoretical instruction throughput per cycle:

  • ADD: 6 instructions per cycle

  • MUL: 2 instructions per cycle

and the latency (until a result is available) of:

  • ADD: 1 clock cycle

  • MUL: 2 clock cycles
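
As a reference point, multiplying the throughput figures from the guide by the quoted 4.4 GHz clock speed gives the peak rates one would expect to measure:

\[6 \text{ instr/cycle} \times 4.4 \text{ GHz} = 26.4 \text{ GOPS (ADD)}\]
\[2 \text{ instr/cycle} \times 4.4 \text{ GHz} = 8.8 \text{ GOPS (MUL)}\]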

Dividing the measured GOPS by the instructions per cycle from the guide, we can in turn estimate the clock speed:

\[\frac{29.0414 \text{ GOPS}}{6} = 4.84 \text{ GHz}\]
\[\frac{13.2584 \text{ GOPS}}{2} = 6.63 \text{ GHz}\]

For the ADD instruction, this aligns reasonably well with the known clock speed of 4.4 GHz. For the MUL instruction, on the other hand, the implied clock speed is higher than expected. It is not clear why exactly this is the case, so to verify the correctness of our code, we ran it on an M3 Pro chip with a clock speed of 4.05 GHz. The results for ADD and MUL there were:

\[\frac{26.195 \text{ GOPS}}{6} = 4.37 \text{ GHz}\]
\[\frac{8.08751 \text{ GOPS}}{2} = 4.04 \text{ GHz}\]

We see that the calculated clock speeds are close to the expected clock speed of 4.05 GHz in both cases, which indicates that our code is correct.

For the latency we can make a similar calculation:

\[\frac{4.4 \text{ GHz}}{4.37951 \text{ GOPS}} \approx 1 \text{ clock cycle per instruction}\]
\[\frac{4.4 \text{ GHz}}{1.46244 \text{ GOPS}} \approx 3 \text{ clock cycles per instruction}\]

This shows that our result for the ADD instruction matches the latency given in the guide, whereas the measured latency of the MUL instruction is about one cycle higher than expected.