2. Base

2.1 Copying Data

For this task, we have been given a C++ driver that makes calls to C and assembly functions. The first C function copies 7 32-bit integers from one array to another. The second C function does the same, but for a variable number of integers that is passed as the first argument. The two assembly functions are supposed to provide the same functionality as the C functions; however, they had not been implemented yet.

For context, the C functions are as follows:

copy_c.c
#include <stdint.h>

#ifdef __cplusplus
extern "C" {
#endif

void copy_c_0( int32_t * __restrict a,
               int32_t * __restrict b ) {
  b[0] = a[0];
  b[1] = a[1];
  b[2] = a[2];
  b[3] = a[3];
  b[4] = a[4];
  b[5] = a[5];
  b[6] = a[6];
}

void copy_c_1( int64_t              n,
               int32_t * __restrict a,
               int32_t * __restrict b ) {
  for( uint64_t i = 0; i < n; ++i ) {
    b[i] = a[i];
  }
}

#ifdef __cplusplus
}
#endif

Task 2.1.1 & 2.1.2

Given the C functions, our task is to implement the assembly functions so that they match the C functions in functionality, using only base instructions. For the first function, we simply load and store the 7 32-bit integers from the source to the destination using the ldr and str instructions. We use the given base addresses and an immediate offset that is incremented by 4 for each 32-bit integer (4 bytes = 32 bits).

For the second function, we need a loop to copy the values from the source to the destination. We use two registers to keep track of the number of elements copied and the current byte offset. Each loop iteration copies one 32-bit integer, then increments the element count by 1 and the byte offset by 4 (since each integer is 4 bytes). Next, we use the cmp instruction to check whether we have copied the specified number of integers. If not, we branch back to the beginning of the loop and repeat the process; otherwise, we exit the loop and return from the function.

copy_asm.s
    .text
    .type copy_asm_0, %function
    .global copy_asm_0
    /*
    * Copies an array of 7 32-bit integers from one
    * location to another.
    *
    * @param x0: address of the source array.
    * @param x1: address of the destination array.
    */
copy_asm_0:
    // save frame pointer and link register
    stp x29, x30, [sp, #-16]!
    // update frame pointer to current stack pointer
    mov x29, sp

    // b[0] = a[0]
    ldr w2, [x0]
    str w2, [x1]
    // b[1] = a[1]
    ldr w2, [x0, #4]
    str w2, [x1, #4]
    // b[2] = a[2]
    ldr w2, [x0, #8]
    str w2, [x1, #8]
    // b[3] = a[3]
    ldr w2, [x0, #12]
    str w2, [x1, #12]
    // b[4] = a[4]
    ldr w2, [x0, #16]
    str w2, [x1, #16]
    // b[5] = a[5]
    ldr w2, [x0, #20]
    str w2, [x1, #20]
    // b[6] = a[6]
    ldr w2, [x0, #24]
    str w2, [x1, #24]

    // restore frame pointer and link register
    ldp x29, x30, [sp], #16

    ret


    .text
    .type copy_asm_1, %function
    .global copy_asm_1
    /*
    * Copies an array of n 32-bit integers from one
    * location to another.
    *
    * @param x0: number of elements to copy.
    * @param x1: address of the source array.
    * @param x2: address of the destination array.
    */
copy_asm_1:
    // save frame pointer and link register
    stp x29, x30, [sp, #-16]!
    // update frame pointer to current stack pointer
    mov x29, sp
    // number of elements copied
    mov x3, #0
    // byte offset for array
    mov x4, #0
loop:
    // b[i] = a[i]
    ldr w5, [x1, x4]
    str w5, [x2, x4]
    // increment the number of elements copied
    add x3, x3, #1
    // increment the byte offset
    add x4, x4, #4
    // check if we have copied n elements
    cmp x3, x0
    // if not, loop again
    blt loop
    // restore frame pointer and link register
    ldp x29, x30, [sp], #16

    ret
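
The verification driver itself was provided with the assignment and is not reproduced here. As a rough illustration of how the two assembly functions can be checked against a reference, a minimal stand-alone sketch is shown below; the array contents, the memcmp-based check, and the exact output format are assumptions modeled on the driver output shown afterwards (the real driver additionally checks copy_c_0 and copy_c_1).

sketch of a minimal driver
#include <cstdint>
#include <cstring>
#include <iostream>

// declarations matching the parameter descriptions in copy_asm.s
extern "C" {
  void copy_asm_0( int32_t * a, int32_t * b );
  void copy_asm_1( int64_t n, int32_t * a, int32_t * b );
}

int main() {
  int32_t l_src[7] = { 1, 2, 3, 4, 5, 6, 7 };
  int32_t l_dst[7] = { 0 };

  // fixed-size copy
  copy_asm_0( l_src, l_dst );
  std::cout << "copy_asm_0: "
            << ( std::memcmp( l_src, l_dst, sizeof( l_src ) ) == 0 ? "copy succeeded" : "copy failed" )
            << "\n";

  // variable-size copy
  std::memset( l_dst, 0, sizeof( l_dst ) );
  copy_asm_1( 7, l_src, l_dst );
  std::cout << "copy_asm_1: "
            << ( std::memcmp( l_src, l_dst, sizeof( l_src ) ) == 0 ? "copy succeeded" : "copy failed" )
            << "\n";

  return 0;
}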

After compiling and running the driver, we can see that the assembly functions work as expected. The output of the driver is as follows:

$ ./copy_driver
copy_c_0: copy succeeded
copy_c_1: copy succeeded
copy_asm_0: copy succeeded
copy_asm_1: copy succeeded

2.2 Instruction Throughput and Latency

For this task, we were supposed to write a micro-benchmarking script to measure the throughput and the latency of the ADD (shifted register) and the MUL instructions.

2.2.1 Throughput

To measure the throughput of the instructions, we developed an assembly function for each of the two instructions. The idea behind measuring throughput is that consecutive instructions must not depend on the results of the instructions before them, so the core can execute as many of them per cycle as its execution units allow. The loop that was executed several times for the ADD (shifted register) and the MUL instruction was:

loop in add_instr.s
loop:
    add x1, x27, x28
    add x2, x27, x28
    add x3, x27, x28
    add x4, x27, x28
    add x5, x27, x28

    add x6, x27, x28
    add x7, x27, x28
    add x8, x27, x28
    add x9, x27, x28
    add x10, x27, x28

    add x11, x27, x28
    add x12, x27, x28
    add x13, x27, x28
    add x14, x27, x28
    add x15, x27, x28

    add x16, x27, x28
    add x17, x27, x28
    add x19, x27, x28
    add x20, x27, x28
    add x21, x27, x28

    add x22, x27, x28
    add x23, x27, x28
    add x24, x27, x28
    add x25, x27, x28
    add x26, x27, x28

    subs x0, x0, #1          // decrement loop counter
    b.gt loop                // branch if greater than zero
loop in mul_instr.s
loop:
    mul x1, x27, x28
    mul x2, x27, x28
    mul x3, x27, x28
    mul x4, x27, x28
    mul x5, x27, x28

    mul x6, x27, x28
    mul x7, x27, x28
    mul x8, x27, x28
    mul x9, x27, x28
    mul x10, x27, x28

    mul x11, x27, x28
    mul x12, x27, x28
    mul x13, x27, x28
    mul x14, x27, x28
    mul x15, x27, x28

    mul x16, x27, x28
    mul x17, x27, x28
    mul x19, x27, x28
    mul x20, x27, x28
    mul x21, x27, x28

    mul x22, x27, x28
    mul x23, x27, x28
    mul x24, x27, x28
    mul x25, x27, x28
    mul x26, x27, x28

    subs x0, x0, #1          // decrement loop counter
    b.gt loop                // branch if greater than zero

2.2.2 Latency

To measure the latency of the two instructions, we developed two more assembly functions. The idea behind measuring latency is to have consecutive instructions that each depend on the result of the previously executed instruction, so they cannot overlap. That meant we had to slightly adjust the loops from the throughput programs, resulting in the following loops for the ADD (shifted register) and the MUL instruction:

loop in add_lat_instr.s
loop:
    // repeat the following block 5 times
    .rept 5
    add x1, x1, x2
    add x1, x1, x2
    add x1, x1, x2
    add x1, x1, x2
    add x1, x1, x2
    .endr

    subs x0, x0, #1          // decrement loop counter
    b.gt loop                // branch if greater than zero
loop in mul_lat_instr.s
loop:
    // repeat the following block 5 times
    .rept 5
    mul x1, x1, x2
    mul x1, x1, x2
    mul x1, x1, x2
    mul x1, x1, x2
    mul x1, x1, x2
    .endr

    subs x0, x0, #1          // decrement loop counter
    b.gt loop                // branch if greater than zero

2.2.3 Results

To test our assembly functions, we implemented a microbenchmark in C++ that would (1) call these functions several times, (2) measure the time it took to complete these computations, and (3) calculate the GOPS obtained from these computations.

time measurement for add_instr in microbench.cpp
        auto l_start_time = std::chrono::high_resolution_clock::now();
        add_instr( n );
        auto l_end_time = std::chrono::high_resolution_clock::now();
        elapsedTime = std::chrono::duration_cast<std::chrono::microseconds>(l_end_time - l_start_time).count() / 1e6;
GOPS calculation for add_instr in microbench.cpp
    double totalOps = n * 25;
    double opsPerSec = totalOps / elapsedTime;
    double gops = opsPerSec / 1e9;
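
For completeness, a self-contained sketch of such a harness is given below, shown only for the ADD throughput kernel; the iteration count, the extern "C" signature of add_instr, and the output wording are assumptions based on our setup (the actual microbench.cpp times the MUL and the two latency kernels in the same way).

sketch of a self-contained benchmark harness
#include <chrono>
#include <cstdint>
#include <iostream>

// assembly kernel: executes n loop iterations, each containing 25 independent ADDs
// (signature assumed; x0 carries the loop counter)
extern "C" void add_instr( uint64_t n );

int main() {
  // assumed iteration count; the value used for the reported results is not shown here
  uint64_t n = 1500000000;

  auto l_start_time = std::chrono::high_resolution_clock::now();
  add_instr( n );
  auto l_end_time = std::chrono::high_resolution_clock::now();

  // elapsed time in seconds
  double elapsedTime = std::chrono::duration_cast<std::chrono::microseconds>( l_end_time - l_start_time ).count() / 1e6;

  // 25 ADD instructions per loop iteration
  double totalOps  = static_cast<double>( n ) * 25;
  double opsPerSec = totalOps / elapsedTime;
  double gops      = opsPerSec / 1e9;

  std::cout << "Total time (s):   " << elapsedTime << "\n"
            << "Instructions per Second:   " << opsPerSec << "\n"
            << "Estimated GOPS:   " << gops << " GigaOps/sec\n";

  return 0;
}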

To compile and execute the microbenchmark we simply used:

$ g++ microbench.cpp add_instr.s add_lat_instr.s mul_instr.s mul_lat_instr.s -o microbench.o
$ ./microbench.o

When executing our benchmarking script we obtained the following results:

results
Benchmarking ADD throughput ...
-----------------------------------------------
Measuring throughput for Instruction
Total time (s):   1.29126
Instructions per Second:   2.90414e+10
Estimated GOPS:   29.0414 GigaOps/sec
-----------------------------------------------

Benchmarking MUL throughput ...
-----------------------------------------------
Measuring throughput for Instruction
Total time (s):   2.82838
Instructions per Second:   1.32584e+10
Estimated GOPS:   13.2584 GigaOps/sec
-----------------------------------------------

Benchmarking ADD latency ...
-----------------------------------------------
Measuring latency for Instruction
Total time (s):   0.856261
Instructions per Second:   4.37951e+09
Estimated GOPS:   4.37951 GigaOps/sec
-----------------------------------------------

Benchmarking MUL latency ...
-----------------------------------------------
Measuring latency for Instruction
Total time (s):   2.5642
Instructions per Second:   1.46244e+09
Estimated GOPS:   1.46244 GigaOps/sec
-----------------------------------------------

To understand the results, we did two things: (1) we looked up the clock speed of the machine, which is about 4.4 GHz, and (2) we consulted the ARM Neoverse V2 Software Optimization Guide for the base instructions.

The guide specifies the following theoretical instruction throughput per cycle:

  • ADD: 6 instructions per cycle

  • MUL: 2 instructions per cycle

and the latency (until a result is available) of:

  • ADD: 1 clock cycle

  • MUL: 2 clock cycles
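
As a reference point, multiplying the throughput figures from the guide by the quoted 4.4 GHz clock speed gives the peak rates one would expect to measure:

\[6 \text{ instr/cycle} \times 4.4 \text{ GHz} = 26.4 \text{ GOPS (ADD)}\]
\[2 \text{ instr/cycle} \times 4.4 \text{ GHz} = 8.8 \text{ GOPS (MUL)}\]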

Dividing the measured GOPS by the instructions per cycle from the guide, we can in turn estimate the clock speed:

\[\frac{29.0414 \text{ GOPS}}{6} = 4.84 \text{ GHz}\]
\[\frac{13.2584 \text{ GOPS}}{2} = 6.63 \text{ GHz}\]

For the ADD instruction, this aligns reasonably well with the known clock speed of 4.4 GHz. For the MUL instruction, on the other hand, the implied clock speed is higher than expected. It is not clear why exactly this is the case, so to verify the correctness of our code, we ran it on an M3 Pro chip with a clock speed of 4.05 GHz. The results for ADD and MUL there were:

\[\frac{26.195 \text{ GOPS}}{6} = 4.37 \text{ GHz}\]
\[\frac{8.08751 \text{ GOPS}}{2} = 4.04 \text{ GHz}\]

We see that the calculated clock speeds are close to the expected clock speed of 4.05 GHz in both cases, which indicates that our code is correct.

For the latency we can make a similar calculation:

\[\frac{4.4 \text{ GHz}}{4.37951 \text{ GOPS}} \approx 1 \text{ clock cycle per instruction}\]
\[\frac{4.4 \text{ GHz}}{1.46244 \text{ GOPS}} \approx 3 \text{ clock cycles per instruction}\]

This shows that our result for the ADD instruction matches the latency given in the guide, whereas the measured latency of the MUL instruction is about one cycle higher than expected.