2. Base
2.1 Copying Data
For this task, we have been given a C++ driver that makes calls to C and assembly functions. The first C function copies 7 32-bit integers from one array to another. The second C function does the same, but for a variable number of integers that is passed as the first argument. The two assembly functions are supposed to provide the same functionality as the C functions; however, they have not been implemented yet.
For context, the C functions are as follows:
#include <stdint.h>

#ifdef __cplusplus
extern "C" {
#endif

void copy_c_0( int32_t * __restrict a,
               int32_t * __restrict b ) {
  b[0] = a[0];
  b[1] = a[1];
  b[2] = a[2];
  b[3] = a[3];
  b[4] = a[4];
  b[5] = a[5];
  b[6] = a[6];
}

void copy_c_1( int64_t n,
               int32_t * __restrict a,
               int32_t * __restrict b ) {
  for( uint64_t i = 0; i < n; ++i ) {
    b[i] = a[i];
  }
}

#ifdef __cplusplus
}
#endif
Task 2.1.1 & 2.1.2
Given the C functions, our task is to implement assembly functions that match them in functionality using only base instructions. For the first function, we simply load and store the 7 32-bit integers from the source to the destination using the ldr and str instructions. We use the given addresses and an immediate offset that increases by 4 for each 32-bit integer (32 bits = 4 bytes).
For the second function, we need a loop to copy the values from the source to the destination. We use two registers to keep track of the number of elements copied and the current byte offset. If the element count n is zero, we skip the loop entirely. Otherwise, each loop iteration copies one 32-bit integer, then increments the number of copied elements by 1 and the byte offset by 4 (since each integer is 4 bytes). Next, we use the cmp instruction to check whether we have copied the specified number of integers. If not, we branch back to the beginning of the loop and repeat the process; if we have, we exit the loop and return from the function.
    .text
    .type copy_asm_0, %function
    .global copy_asm_0
/*
 * Copies an array of 7 32-bit integers from one
 * location to another.
 *
 * @param x0: address of the source array.
 * @param x1: address of the destination array.
 */
copy_asm_0:
    // save frame pointer and link register
    stp x29, x30, [sp, #-16]!
    // update frame pointer to current stack pointer
    mov x29, sp

    // b[0] = a[0]
    ldr w2, [x0]
    str w2, [x1]
    // b[1] = a[1]
    ldr w2, [x0, #4]
    str w2, [x1, #4]
    // b[2] = a[2]
    ldr w2, [x0, #8]
    str w2, [x1, #8]
    // b[3] = a[3]
    ldr w2, [x0, #12]
    str w2, [x1, #12]
    // b[4] = a[4]
    ldr w2, [x0, #16]
    str w2, [x1, #16]
    // b[5] = a[5]
    ldr w2, [x0, #20]
    str w2, [x1, #20]
    // b[6] = a[6]
    ldr w2, [x0, #24]
    str w2, [x1, #24]

    // restore frame pointer and link register
    ldp x29, x30, [sp], #16

    ret

    .text
    .type copy_asm_1, %function
    .global copy_asm_1
/*
 * Copies an array of n 32-bit integers from one
 * location to another.
 *
 * @param x0: number of elements to copy.
 * @param x1: address of the source array.
 * @param x2: address of the destination array.
 */
copy_asm_1:
    // save frame pointer and link register
    stp x29, x30, [sp, #-16]!
    // update frame pointer to current stack pointer
    mov x29, sp
    // nothing to copy if n == 0
    cbz x0, done
    // number of elements copied
    mov x3, #0
    // byte offset for array
    mov x4, #0
loop:
    // b[i] = a[i]
    ldr w5, [x1, x4]
    str w5, [x2, x4]
    // increment the number of elements copied
    add x3, x3, #1
    // increment the byte offset
    add x4, x4, #4
    // check if we have copied n elements
    cmp x3, x0
    // if not, loop again
    blt loop
done:
    // restore frame pointer and link register
    ldp x29, x30, [sp], #16

    ret
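To build the driver, the C++ driver, the C functions, and the assembly functions are compiled and linked together. A command along the following lines can be used; the exact source file names are assumptions, as they are determined by the provided driver:
$ g++ copy_driver.cpp copy_c.c copy_asm.s -o copy_driver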
After compiling and running the driver, we can see that the assembly functions work as expected. The output of the driver is as follows:
$ ./copy_driver
copy_c_0: copy succeeded
copy_c_1: copy succeeded
copy_asm_0: copy succeeded
copy_asm_1: copy succeeded
2.2 Instruction Throughput and Latency
For this task, we wrote micro-benchmarks to measure the throughput and the latency of the ADD (shifted register) and the MUL instruction.
2.2.1 Throughput
To measure the throughput of the two instructions, we developed an assembly function for each of them. The idea behind the throughput measurement is that consecutive instructions must not depend on each other's results, so that the core can issue and execute them in parallel. The loop that is executed several times for the ADD (shifted register) instruction is:
loop:
    add x1, x27, x28
    add x2, x27, x28
    add x3, x27, x28
    add x4, x27, x28
    add x5, x27, x28

    add x6, x27, x28
    add x7, x27, x28
    add x8, x27, x28
    add x9, x27, x28
    add x10, x27, x28

    add x11, x27, x28
    add x12, x27, x28
    add x13, x27, x28
    add x14, x27, x28
    add x15, x27, x28

    add x16, x27, x28
    add x17, x27, x28
    add x19, x27, x28
    add x20, x27, x28
    add x21, x27, x28

    add x22, x27, x28
    add x23, x27, x28
    add x24, x27, x28
    add x25, x27, x28
    add x26, x27, x28

    subs x0, x0, #1 // decrement loop counter
    b.gt loop       // branch if greater than zero
The corresponding loop for the MUL instruction is:
loop:
    mul x1, x27, x28
    mul x2, x27, x28
    mul x3, x27, x28
    mul x4, x27, x28
    mul x5, x27, x28

    mul x6, x27, x28
    mul x7, x27, x28
    mul x8, x27, x28
    mul x9, x27, x28
    mul x10, x27, x28

    mul x11, x27, x28
    mul x12, x27, x28
    mul x13, x27, x28
    mul x14, x27, x28
    mul x15, x27, x28

    mul x16, x27, x28
    mul x17, x27, x28
    mul x19, x27, x28
    mul x20, x27, x28
    mul x21, x27, x28

    mul x22, x27, x28
    mul x23, x27, x28
    mul x24, x27, x28
    mul x25, x27, x28
    mul x26, x27, x28

    subs x0, x0, #1 // decrement loop counter
    b.gt loop       // branch if greater than zero
2.2.2 Latency
To measure the latency of the two instructions, we developed two more assembly functions. The idea behind the latency measurement is to chain consecutive instructions so that each one depends on the result of the previous instruction. This meant slightly adjusting the loops from the throughput programs, resulting in the following loop for the ADD (shifted register) instruction:
loop:
    // repeat the following block 5 times
    .rept 5
    add x1, x1, x2
    add x1, x1, x2
    add x1, x1, x2
    add x1, x1, x2
    add x1, x1, x2
    .endr

    subs x0, x0, #1 // decrement loop counter
    b.gt loop       // branch if greater than zero
The corresponding loop for the MUL instruction is:
loop:
    // repeat the following block 5 times
    .rept 5
    mul x1, x1, x2
    mul x1, x1, x2
    mul x1, x1, x2
    mul x1, x1, x2
    mul x1, x1, x2
    .endr

    subs x0, x0, #1 // decrement loop counter
    b.gt loop       // branch if greater than zero
2.2.3 Results
To test our assembly functions, we implemented a microbenchmark in C++ that
1. calls these functions several times,
2. measures the time it takes to complete these computations, and
3. calculates the GOPS achieved by these computations.
The core of the timing code is:
auto l_start_time = std::chrono::high_resolution_clock::now();
add_instr( n );
auto l_end_time = std::chrono::high_resolution_clock::now();
double elapsedTime = std::chrono::duration_cast<std::chrono::microseconds>( l_end_time - l_start_time ).count() / 1e6;
// each loop iteration executes 25 instructions
double totalOps = n * 25;
double opsPerSec = totalOps / elapsedTime;
double gops = opsPerSec / 1e9;
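For context, a minimal self-contained sketch of such a timing harness is shown below. The extern "C" declaration of add_instr and the iteration count are assumptions; the actual microbenchmark repeats this pattern for all four kernels.
#include <chrono>
#include <cstdint>
#include <iostream>

// assembly kernel with C linkage: executes n loop iterations
// with 25 independent ADD instructions each
extern "C" void add_instr( int64_t n );

int main() {
  int64_t n = 100000000; // assumed number of loop iterations

  auto l_start_time = std::chrono::high_resolution_clock::now();
  add_instr( n );
  auto l_end_time = std::chrono::high_resolution_clock::now();

  double elapsedTime = std::chrono::duration_cast<std::chrono::microseconds>( l_end_time - l_start_time ).count() / 1e6;
  double totalOps = static_cast<double>( n ) * 25; // 25 instructions per iteration
  double opsPerSec = totalOps / elapsedTime;
  double gops = opsPerSec / 1e9;

  std::cout << "Total time (s): " << elapsedTime << "\n"
            << "Instructions per Second: " << opsPerSec << "\n"
            << "Estimated GOPS: " << gops << " GigaOps/sec" << std::endl;
  return 0;
}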
To compile and execute the microbenchmark we simply used:
$ g++ microbench.cpp add_instr.s add_lat_instr.s mul_instr.s mul_lat_instr.s -o microbench.o
$ ./microbench.o
When executing our benchmarking script we obtained the following results:
Benchmarking ADD throughput ...
-----------------------------------------------
Measuring throughput for Instruction
Total time (s): 1.29126
Instructions per Second: 2.90414e+10
Estimated GOPS: 29.0414 GigaOps/sec
-----------------------------------------------
Benchmarking MUL throughput ...
-----------------------------------------------
Measuring throughput for Instruction
Total time (s): 2.82838
Instructions per Second: 1.32584e+10
Estimated GOPS: 13.2584 GigaOps/sec
-----------------------------------------------
Benchmarking ADD latency ...
-----------------------------------------------
Measuring latency for Instruction
Total time (s): 0.856261
Instructions per Second: 4.37951e+09
Estimated GOPS: 4.37951 GigaOps/sec
-----------------------------------------------
Benchmarking MUL latency ...
-----------------------------------------------
Measuring latency for Instruction
Total time (s): 2.5642
Instructions per Second: 1.46244e+09
Estimated GOPS: 1.46244 GigaOps/sec
-----------------------------------------------
To understand the results, we did two things:
1. we looked up the clock speed of the machine, which is about 4.4 GHz, and
2. we consulted the ARM Neoverse V2 Software Optimization Guide for the base instructions.
The guide specifies the following theoretical instruction throughput per cycle:
ADD: 6 instructions per cycle
MUL: 2 instructions per cycle
and the latency (until a result is available) of:
ADD: 1 clock cycle
MUL: 2 clock cycles
Looking at these numbers, we can estimate the clock speed implied by our measurements as the measured throughput divided by the theoretical instructions per cycle: for ADD, 29.04 GigaOps/s / 6 ≈ 4.84 GHz, and for MUL, 13.26 GigaOps/s / 2 ≈ 6.63 GHz.
For the ADD instruction, this is reasonably close to the known clock speed of 4.4 GHz. For the MUL instruction, on the other hand, the implied clock speed is higher than expected. It is not clear why exactly this is the case, but in order to verify the correctness of our code, we ran it on an M3 Pro chip with a clock speed of 4.05 GHz and repeated the calculation for ADD and MUL.
We see that the calculated clock speeds are close to the expected clock speed of 4.05 GHz in both cases, which indicates that our code is correct.
For the latency we can make a similar calculation: dividing the clock speed by the measured serial instruction rate gives the latency in cycles, i.e. about 4.4 GHz / 4.38 GigaOps/s ≈ 1.0 cycles for ADD and 4.4 GHz / 1.46 GigaOps/s ≈ 3.0 cycles for MUL.
This shows that our result for the ADD instruction matches the value from the guide, whereas the measured latency for the MUL instruction is somewhat higher than the 2 cycles stated in the guide.
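To make the arithmetic behind these estimates explicit, the following short C++ sketch reproduces the calculations from the measured values reported above; the 4.4 GHz reference clock and the per-cycle figures are taken from the optimization guide as cited above.
#include <iostream>

int main() {
  // measured throughput and latency rates from the benchmark output (GigaOps/sec)
  double add_tp  = 29.0414;
  double mul_tp  = 13.2584;
  double add_lat = 4.37951;
  double mul_lat = 1.46244;

  // theoretical per-cycle throughput from the Neoverse V2 optimization guide
  double add_ipc = 6.0;
  double mul_ipc = 2.0;
  // reference clock speed of the test machine (GHz)
  double clock_ghz = 4.4;

  // implied clock speed = measured throughput / instructions per cycle
  std::cout << "ADD implied clock (GHz): " << add_tp / add_ipc << "\n";
  std::cout << "MUL implied clock (GHz): " << mul_tp / mul_ipc << "\n";

  // measured latency in cycles = clock speed / measured serial instruction rate
  std::cout << "ADD latency (cycles): " << clock_ghz / add_lat << "\n";
  std::cout << "MUL latency (cycles): " << clock_ghz / mul_lat << std::endl;
  return 0;
}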