3. Neon

3.1 Execution Throughput and Latency

For this task, we benchmarked the execution throughput and latency of FP32 Neon instructions. Specifically, we looked at:

  1. FMLA (vector) instruction

  2. FMADD (scalar) instruction

3.1.1 Throughput

As a first step, we compared the throughput of:

  1. FMLA (vector) with arrangement specifier 4S

  2. FMLA (vector) with arrangement specifier 2S

  3. FMADD (scalar), single-precision variant

To compare the throughput, we created three assembly programs, each executing a large number of these operations. To get meaningful results, we made sure to avoid dependencies between the source and destination registers of consecutive operations. The instruction sequences that we ended up with were:

FMLA (vector) with arrangement specifier 4S
loop:
    .rept 100
    fmla  v0.4s,  v8.4s, v16.4s
    fmla  v1.4s,  v9.4s, v17.4s
    fmla  v2.4s, v10.4s, v18.4s
    fmla  v3.4s, v11.4s, v19.4s
    fmla  v4.4s, v12.4s, v20.4s

    fmla  v5.4s, v13.4s, v21.4s
    fmla  v6.4s, v14.4s, v22.4s
    fmla  v7.4s, v15.4s, v23.4s
    fmla  v8.4s, v16.4s, v24.4s
    fmla  v9.4s, v17.4s, v25.4s

    fmla v10.4s, v18.4s, v26.4s
    fmla v11.4s, v19.4s, v27.4s
    fmla v12.4s, v20.4s, v28.4s
    fmla v13.4s, v21.4s, v29.4s
    fmla v14.4s, v22.4s, v30.4s

    fmla v15.4s, v23.4s, v31.4s
    fmla v16.4s, v24.4s,  v0.4s
    fmla v17.4s, v25.4s,  v1.4s
    fmla v18.4s, v26.4s,  v2.4s
    fmla v19.4s, v27.4s,  v3.4s

    fmla v20.4s, v28.4s,  v4.4s
    fmla v21.4s, v29.4s,  v5.4s
    fmla v22.4s, v30.4s,  v6.4s
    fmla v23.4s, v31.4s,  v7.4s
    fmla v24.4s,  v0.4s,  v8.4s

    fmla v25.4s,  v1.4s,  v9.4s
    fmla v26.4s,  v2.4s, v10.4s
    fmla v27.4s,  v3.4s, v11.4s
    fmla v28.4s,  v4.4s, v12.4s
    fmla v29.4s,  v5.4s, v13.4s

    fmla v30.4s,  v6.4s, v14.4s
    fmla v31.4s,  v7.4s, v15.4s
    .endr
FMLA (vector) with arrangement specifier 2S
loop:
    .rept 100
    fmla  v0.2s,  v8.2s, v16.2s
    fmla  v1.2s,  v9.2s, v17.2s
    fmla  v2.2s, v10.2s, v18.2s
    fmla  v3.2s, v11.2s, v19.2s
    fmla  v4.2s, v12.2s, v20.2s

    fmla  v5.2s, v13.2s, v21.2s
    fmla  v6.2s, v14.2s, v22.2s
    fmla  v7.2s, v15.2s, v23.2s
    fmla  v8.2s, v16.2s, v24.2s
    fmla  v9.2s, v17.2s, v25.2s

    fmla v10.2s, v18.2s, v26.2s
    fmla v11.2s, v19.2s, v27.2s
    fmla v12.2s, v20.2s, v28.2s
    fmla v13.2s, v21.2s, v29.2s
    fmla v14.2s, v22.2s, v30.2s

    fmla v15.2s, v23.2s, v31.2s
    fmla v16.2s, v24.2s,  v0.2s
    fmla v17.2s, v25.2s,  v1.2s
    fmla v18.2s, v26.2s,  v2.2s
    fmla v19.2s, v27.2s,  v3.2s

    fmla v20.2s, v28.2s,  v4.2s
    fmla v21.2s, v29.2s,  v5.2s
    fmla v22.2s, v30.2s,  v6.2s
    fmla v23.2s, v31.2s,  v7.2s
    fmla v24.2s,  v0.2s,  v8.2s

    fmla v25.2s,  v1.2s,  v9.2s
    fmla v26.2s,  v2.2s, v10.2s
    fmla v27.2s,  v3.2s, v11.2s
    fmla v28.2s,  v4.2s, v12.2s
    fmla v29.2s,  v5.2s, v13.2s

    fmla v30.2s,  v6.2s, v14.2s
    fmla v31.2s,  v7.2s, v15.2s
    .endr
FMADD (scalar), single-precision variant
loop:
    .rept 100
    fmadd  s0,  s8, s16, s24
    fmadd  s1,  s9, s17, s25
    fmadd  s2, s10, s18, s26
    fmadd  s3, s11, s19, s27
    fmadd  s4, s12, s20, s28

    fmadd  s5, s13, s21, s29
    fmadd  s6, s14, s22, s30
    fmadd  s7, s15, s23, s31
    fmadd  s8, s16, s24, s0
    fmadd  s9, s17, s25, s1

    fmadd s10, s18, s26, s2
    fmadd s11, s19, s27, s3
    fmadd s12, s20, s28, s4
    fmadd s13, s21, s29, s5
    fmadd s14, s22, s30, s6

    fmadd s15, s23, s31, s7
    fmadd s16, s24,  s0, s8
    fmadd s17, s25,  s1, s9
    fmadd s18, s26,  s2, s10
    fmadd s19, s27,  s3, s11

    fmadd s20, s28,  s4, s12
    fmadd s21, s29,  s5, s13
    fmadd s22, s30,  s6, s14
    fmadd s23, s31,  s7, s15
    fmadd s24,  s0,  s8, s16

    fmadd s25,  s1,  s9, s17
    fmadd s26,  s2, s10, s18
    fmadd s27,  s3, s11, s19
    fmadd s28,  s4, s12, s20
    fmadd s29,  s5, s13, s21

    fmadd s30,  s6, s14, s22
    fmadd s31,  s7, s15, s23
    .endr

    subs x0, x0, #1
    b.gt loop

In order to measure the throughput of these instructions, we developed a C++ microbenchmark. For each instruction we first performed a warm-up, measured the execution time, counted the executed operations, and then calculated the GFLOPs.

Example benchmark for FMLA (vector) with arrangement specifier 4S
        // Warmup
        fmla_4s_instr( 100, g_4s_registers );

        auto l_start_time = std::chrono::high_resolution_clock::now();
        fmla_4s_instr( n, g_4s_registers );
        auto l_end_time = std::chrono::high_resolution_clock::now();
        elapsedTime = std::chrono::duration_cast<std::chrono::microseconds>( l_end_time - l_start_time ).count() / 1e6;

        // per FMLA: 4 Muls, 4 Adds
        // 32 fmla
        // rept 100
        // n: loop iterations
        totalOps = (2 * 4) * 32 * 100 * n;

For the 2S and the FMADD (scalar) variants, we only adjusted the operation counts accordingly:

Calculations for FMLA (vector) with arrangement specifier 2S
        // per FMLA: 2 Muls, 2 Adds
        // 32 fmla
        // rept 100
        // n: loop iterations
        totalOps = (2 * 2) * 32 * 100 * n;
Calculations for FMADD (scalar), single-precision variant
        // per FMADD: 1 Mul, 1 Add
        // 32 fmadd
        // rept 100
        // n: loop iterations
        totalOps = (2 * 1) * 32 * 100 * n;

For this benchmarking task we obtained the following results:

Throughput results for the three instructions
Benchmarking FMLA 4s throughput ...
-----------------------------------------------
Measuring throughput for FMLA_4sInstruction
Total time (s):   1.96706
Instructions per Second:   1.30144e+11
Estimated GOPS:   130.144 GFLOPs/sec
-----------------------------------------------

Benchmarking FMLA 2s throughput ...
-----------------------------------------------
Measuring throughput for FMLA_2sInstruction
Total time (s):   2.53647
Instructions per Second:   5.04638e+10
Estimated GOPS:   50.4638 GFLOPs/sec
-----------------------------------------------

Benchmarking FMADD throughput ...
-----------------------------------------------
Measuring throughput for FMADDInstruction
Total time (s):   3.52918
Instructions per Second:   1.81345e+10
Estimated GOPS:   18.1345 GFLOPs/sec
-----------------------------------------------

It can be seen that the FMLA (vector) instruction with arrangement specifier 4S performs approximately 2.6 times as many floating-point operations per second as the variant with arrangement specifier 2S. The 2S variant in turn performs approximately 2.8 times as many floating-point operations per second as the FMADD (scalar) instruction.

This shows that leveraging data-level parallelism through vector instructions yields a much higher throughput than using only scalar operations.

3.1.2 Latency

To measure the execution latency for FMLA (vector) instructions with arrangement specifier 4S, we examined two scenarios:

  1. Each instruction depends on the destination register and one of the source registers of the previous instruction

fmla instructions with dependencies on the destination register and one of the source registers
    fmla v0.4s, v0.4s,  v1.4s
    fmla v0.4s, v0.4s,  v2.4s
    fmla v0.4s, v0.4s,  v3.4s
    fmla v0.4s, v0.4s,  v4.4s
  2. Each instruction depends only on the destination register of the previous instruction

fmla instructions with dependencies on the destination register
    fmla v0.4s,  v1.4s,  v9.4s
    fmla v0.4s,  v2.4s, v10.4s
    fmla v0.4s,  v3.4s, v11.4s
    fmla v0.4s,  v4.4s, v12.4s

Both files contain 32 fmla instructions, which are executed 100 times. The results of our benchmark are shown below:

Latency results for the two scenarios
Benchmarking FMLA 4s source register latency ...
-----------------------------------------------
Measuring latency for FMLA_SourceInstruction
Total time (s):   3.30277
Instructions per Second:   1.16266e+10
Estimated GOPS:   11.6266 GFLOPs/sec
-----------------------------------------------

Benchmarking FMLA 4s destination register latency ...
-----------------------------------------------
Measuring latency for FMLA_DestinationInstruction
Total time (s):   3.30207
Instructions per Second:   1.16291e+10
Estimated GOPS:   11.6291 GFLOPs/sec
-----------------------------------------------

We can see that both scenarios have similar results, which is why we computed the latency only for the first scenario.

We measured \(1.16266 \times 10^{10}\) floating-point operations per second. Note that each FMLA (vector) instruction with arrangement specifier 4S performs 8 of these operations, so the dependent chain executes \(1.16266 \times 10^{10} / 8 \approx 1.45 \times 10^{9}\) FMLA instructions per second. The latency of a single instruction is therefore approximately \(\frac{1}{1.45 \times 10^{9}} \approx 6.9 \times 10^{-10}\) seconds. Using a known clock frequency of 4.4 GHz, this corresponds to \(6.9 \times 10^{-10} \times 4.4 \times 10^{9} \approx 3\) cycles.

3.2 Microkernel

For the second task we implemented a microkernel that executes a matrix multiplication for matrices with the following dimensions:

  1. Matrix A: 16 x 1

  2. Matrix B: 1 x 6

  3. Matrix C: 16 x 6

3.2.1 Neon Microkernel

We developed three different versions of this microkernel in order to optimize its performance.

In the first version we:

  1. Load matrix A (16 x 1)

  2. Load three columns (1 x 1) of matrix B

  3. Load matrix C (16 x 6)

In the second version we:

  1. Load matrix A (16 x 1)

  2. Load one column of matrix B

  3. Load matrix C (16 x 6)

In the third version we:

  1. Load matrix A (16 x 1)

  2. Load one column of matrix B

  3. Load one column of matrix C (16 x 1)

3.2.2 Testing and Benchmarking

To test and compare our versions with one another we:

  1. implemented a microkernel that would give us a visual indication of the results

  2. implemented a test using Catch2 to test the correctness of our implementations

  3. implemented a microbenchmark that would calculate the GFLOPs for each of the three versions

The GFLOPs were derived from the measured execution time:

Time measurement for the GFLOPs calculation
                              16 );
        }
        auto l_end_time = std::chrono::high_resolution_clock::now();
        elapsedTime = std::chrono::duration_cast<std::chrono::microseconds>( l_end_time - l_start_time ).count() / 1e6;
    }

For each version we performed 50,000 warm-up iterations to guarantee comparable results across executions of the benchmark. Using this approach we obtained the following results:

Benchmarking results for the three microkernel versions
Benchmarking V1 Matmul throughput ...
-----------------------------------------------
Measuring throughput for Instruction
Total time (s):   2.93595
Instructions per Second:   3.26981e+10
Estimated GFLOPS:   32.6981 GFLOPS/sec
-----------------------------------------------

Benchmarking V2 Matmul throughput ...
-----------------------------------------------
Measuring throughput for Instruction
Total time (s):   2.90291
Instructions per Second:   3.30702e+10
Estimated GFLOPS:   33.0702 GFLOPS/sec
-----------------------------------------------

Benchmarking V3 Matmul throughput ...
-----------------------------------------------
Measuring throughput for Instruction
Total time (s):   2.79132
Instructions per Second:   3.43923e+10
Estimated GFLOPS:   34.3923 GFLOPS/sec
-----------------------------------------------

The GFLOPs results indicate that each version performed slightly better than the previous one, with a difference of about 1.7 GFLOPs between our best and our worst approach.

3.3 Loops

In this task, we had to add loops to the matrix multiplication kernel which we wrote in the previous task. The goal was to enable the 16x6x1 kernel to be used for larger matrices.

The first step was to implement a loop in the K dimension, resulting in a 16x6x64 kernel. The loading and storing of matrix C was left unchanged. The relevant code is shown below:

Looping matmul_16_6_1 over K dimension
    //  K loop counter
    mov x6, #64
    // set start of A
    mov x7, x0
    // set start of B
    mov x8, x1
    // init row count of B
    mov x9, #0

_k_loop:
    // load column of A
    ldp q24, q25, [x7] // 4 + 4 values
    ldp q26, q27, [x7, #32] // 4 + 4 values

    // B: COLUMN 0
    ldr s29, [x8]
    fmla v0.4s, v24.4s, v29.s[0]
    fmla v1.4s, v25.4s, v29.s[0]
    fmla v2.4s, v26.4s, v29.s[0]
    fmla v3.4s, v27.4s, v29.s[0]
    // B: COLUMN 1
    add x8, x8, x4
    ldr s29, [x8]
    fmla v4.4s, v24.4s, v29.s[0]
    fmla v5.4s, v25.4s, v29.s[0]
    fmla v6.4s, v26.4s, v29.s[0]
    fmla v7.4s, v27.4s, v29.s[0]
    // B: COLUMN 2
    add x8, x8, x4
    ldr s29, [x8]
    fmla  v8.4s, v24.4s, v29.s[0]
    fmla  v9.4s, v25.4s, v29.s[0]
    fmla v10.4s, v26.4s, v29.s[0]
    fmla v11.4s, v27.4s, v29.s[0]
    // B: COLUMN 3
    add x8, x8, x4
    ldr s29, [x8]
    fmla v12.4s, v24.4s, v29.s[0]
    fmla v13.4s, v25.4s, v29.s[0]
    fmla v14.4s, v26.4s, v29.s[0]
    fmla v15.4s, v27.4s, v29.s[0]
    // B: COLUMN 4
    add x8, x8, x4
    ldr s29, [x8]
    fmla v16.4s, v24.4s, v29.s[0]
    fmla v17.4s, v25.4s, v29.s[0]
    fmla v18.4s, v26.4s, v29.s[0]
    fmla v19.4s, v27.4s, v29.s[0]
    // B: COLUMN 5
    add x8, x8, x4
    ldr s29, [x8]
    fmla v20.4s, v24.4s, v29.s[0]
    fmla v21.4s, v25.4s, v29.s[0]
    fmla v22.4s, v26.4s, v29.s[0]
    fmla v23.4s, v27.4s, v29.s[0]

    // move to next column of A
    add x7, x7, x3
    // move to next row of B
    mov x8, x1
    add x9, x9, #4
    add x8, x8, x9

    // decrement loop counter
    sub x6, x6, #1
    // check if loop counter is zero
    cbnz x6, _k_loop

The matmul_16_6_1 kernel mostly stayed the same, except that in each K iteration we now need to adjust the pointers to the input matrices A and B. At the end of each iteration, we move the pointer to A to the next column by adding the given stride. For B, we need to move the pointer to the next row. Therefore, we jump by 4 bytes (since we are using 32-bit floats) from the starting address of B. To keep jumping to the next row in each iteration, we accumulate the offset of 4 bytes in register x9.

The second step was to implement a loop in the M dimension, resulting in a 64x6x64 kernel. To keep the code examples shorter, we exclude the K loop from the code snippets. The relevant code is shown below:

First part of looping over M dimension
    // save base matrix pointers
    mov x7, x0 // A
    mov x8, x1 // B
    mov x9, x2 // C

    // M loop counter
    mov x11, #4 // 64/16 = 4 blocks

_m_loop:
// ------------------------------------------
// START matmul_16_6_64
// ------------------------------------------

    // LOAD MATRIX C
    mov x12, x9
    // first column
    ldp q0, q1, [x12]
    ldp q2, q3, [x12, #32]
    // second column
    add x12, x12, x5
    ldp q4, q5, [x12]
    ldp q6, q7, [x12, #32]
    // third column
    add x12, x12, x5
    ldp q8, q9, [x12]
    ldp q10, q11, [x12, #32]
    // fourth column
    add x12, x12, x5
    ldp q12, q13, [x12]
    ldp q14, q15, [x12, #32]
    // fifth column
    add x12, x12, x5
    ldp q16, q17, [x12]
    ldp q18, q19, [x12, #32]
    // sixth column
    add x12, x12, x5
    ldp q20, q21, [x12]
    ldp q22, q23, [x12, #32]

    // K loop counter
    mov x14, #64
    // set start of A
    mov x15, x7
    // set start of B
    mov x16, x8
    // init row count of B
    mov x17, #0
_k_loop:
Second part of looping over M dimension
    // check if loop counter is zero
    cbnz x14, _k_loop

    // STORE MATRIX C
    mov x12, x9
    // first column
    stp q0, q1, [x12]
    stp q2, q3, [x12, #32]
    // second column
    add x12, x12, x5
    stp q4, q5, [x12]
    stp q6, q7, [x12, #32]
    // third column
    add x12, x12, x5
    stp q8, q9, [x12]
    stp q10, q11, [x12, #32]
    // fourth column
    add x12, x12, x5
    stp q12, q13, [x12]
    stp q14, q15, [x12, #32]
    // fifth column
    add x12, x12, x5
    stp q16, q17, [x12]
    stp q18, q19, [x12, #32]
    // sixth column
    add x12, x12, x5
    stp q20, q21, [x12]
    stp q22, q23, [x12, #32]

// ------------------------------------------
// END matmul_16_6_64
// ------------------------------------------

    // increase A and C pointers for next block
    add x7, x7, #16*4
    add x9, x9, #16*4

    // decrement m loop counter
    sub x11, x11, #1
    // check if loop counter is zero
    cbnz x11, _m_loop

The M loop needs only 4 iterations, since we extend the kernel from 16 to 64 in the M dimension by dividing it into 4 blocks of 16 elements. At the end of each M iteration, we move the pointers of A and C to the next block. We jump by 16 elements in the M dimension, which means adding 16 * 4 bytes to the pointers of A and C.

The last step was to implement a loop in the N dimension, resulting in a 64x48x64 kernel. The relevant code is shown below:

First part of looping over N dimension
    // set base matrix pointers
    mov x20, x1 // B
    mov x21, x2 // C

    // N loop counter
    mov x19, #8 // 48/6 = 8 blocks

_n_loop:

    // M loop counter
    mov x11, #4 // 64/16 = 4 blocks

    // set matrix pointers
    mov x7, x0 // A
    mov x8, x20 // B
    mov x9, x21 // C

_m_loop:
Second part of looping over N dimension
    // decrement m loop counter
    sub x11, x11, #1
    // check if loop counter is zero
    cbnz x11, _m_loop
// END M LOOP

    // increase B and C pointers for next block
    // (jump 6 columns) 6*x4, 6*x5
    add x20, x20, x22
    add x21, x21, x23

    // decrement n loop counter
    sub x19, x19, #1
    // check if loop counter is zero
    cbnz x19, _n_loop
// END N LOOP

Since we are extending the kernel from 6 to 48 in the N dimension, we divide the N dimension into 8 blocks of 6 elements, so the loop has 8 iterations. In each N iteration, it is important to first reset the pointer of A to its original address. After each iteration, we move the pointers of B and C to the next block, i.e. forward by 6 columns, by adding 6 times the respective stride of B and C to each pointer.

3.3.1 Testing and Benchmarking

For all three kernels we have written unit tests. To execute the tests, one first needs to compile the code by invoking make from within the src/submissions/03_neon/03_loops directory. This will create an executable that can be run with ./build/test.

We also calculated the GFLOPs for each of these matrix multiplications. To calculate them we followed the simple formula:

\[M \cdot N \cdot K \cdot \text{Ops Per FMA} = M \cdot N \cdot K \cdot 2\]

The results that we obtained were:

GFLOPs calculations for MatMuls
Benchmarking Matmul_16_6_64 throughput ...
-----------------------------------------------
Measuring throughput for Instruction
Total time (s):   1.89248
Instructions per Second:   1.29861e+11
Estimated GFLOPS:   129.861 GFLOPS/sec
-----------------------------------------------

Benchmarking Matmul_64_6_64 Matmul throughput ...
-----------------------------------------------
Measuring throughput for Instruction
Total time (s):   1.84635
Instructions per Second:   1.33106e+11
Estimated GFLOPS:   133.106 GFLOPS/sec
-----------------------------------------------

Benchmarking Matmul_64_48_64 Matmul throughput ...
-----------------------------------------------
Measuring throughput for Instruction
Total time (s):   1.49743
Instructions per Second:   1.31297e+11
Estimated GFLOPS:   131.297 GFLOPS/sec
-----------------------------------------------

Our results indicate that the number of GFLOPs is very consistent, even when scaling the size of our matrix.

3.4 SIMD Lanes

For this task we were supposed to create two kernels that remain functional even when the M dimension is not a multiple of 4. We created several versions for both:

  1. the M=14, N=6, K=64 case, and

  2. the M=15, N=6, K=64 case

3.4.1 Matmul_14_6_64

For the case M=14 we considered four different versions:

Our first approach was to use two loops. The first loop calculates a (12 x 64) block of matrix C; that is, we load the first 12 elements of each column of matrix A. The second loop then calculates the remaining (2 x 64) block of matrix C.

Second loop for the (2 x 64) matrix calculation
_k2_loop:
    // load column of A
    ldr d24, [x7]   // 2 values

    // B: COLUMN 0
    ldr s29, [x8]
    fmla v3.2s, v24.2s, v29.s[0]

    // B: COLUMN 1
    add x8, x8, x4
    ldr s29, [x8]
    fmla v7.2s, v24.2s, v29.s[0]

    // B: COLUMN 2
    add x8, x8, x4
    ldr s29, [x8]
    fmla v11.2s, v24.2s, v29.s[0]

    // B: COLUMN 3
    add x8, x8, x4
    ldr s29, [x8]
    fmla v15.2s, v24.2s, v29.s[0]

    // B: COLUMN 4
    add x8, x8, x4
    ldr s29, [x8]
    fmla v19.2s, v24.2s, v29.s[0]

    // B: COLUMN 5
    add x8, x8, x4
    ldr s29, [x8]
    fmla v23.2s, v24.2s, v29.s[0]

    // move to next column of A
    add x7, x7, x3
    // move to next row of B
    mov x8, x1
    add x9, x9, #4
    add x8, x8, x9

    // decrement loop counter
    sub x6, x6, #1
    // check if loop counter is zero
    cbnz x6, _k2_loop

The second approach was to use a single loop. We load the whole matrix C, and each column of matrix A using one ldp qXX, qXX, [x7], one ldr qXX, [x7, #32], and one ldr dXX, [x7, #48] instruction.

Calculate matrix C with a single loop using three load instructions
_k1_loop:
    // load column of A
    ldp q24, q25, [x7] // 4 + 4 values
    ldr q26, [x7, #32] // 4 values
    ldr d27, [x7, #48] // 2 values

    // B: COLUMN 0
    ldr s29, [x8]
    fmla v0.4s, v24.4s, v29.s[0]
    fmla v1.4s, v25.4s, v29.s[0]
    fmla v2.4s, v26.4s, v29.s[0]
    fmla v3.2s, v27.2s, v29.s[0]
    // B: COLUMN 1
    add x8, x8, x4
    ldr s29, [x8]
    fmla v4.4s, v24.4s, v29.s[0]
    fmla v5.4s, v25.4s, v29.s[0]
    fmla v6.4s, v26.4s, v29.s[0]
    fmla v7.2s, v27.2s, v29.s[0]
    // B: COLUMN 2
    add x8, x8, x4
    ldr s29, [x8]
    fmla  v8.4s, v24.4s, v29.s[0]
    fmla  v9.4s, v25.4s, v29.s[0]
    fmla v10.4s, v26.4s, v29.s[0]
    fmla v11.2s, v27.2s, v29.s[0]
    // B: COLUMN 3
    add x8, x8, x4
    ldr s29, [x8]
    fmla v12.4s, v24.4s, v29.s[0]
    fmla v13.4s, v25.4s, v29.s[0]
    fmla v14.4s, v26.4s, v29.s[0]
    fmla v15.2s, v27.2s, v29.s[0]
    // B: COLUMN 4
    add x8, x8, x4
    ldr s29, [x8]
    fmla v16.4s, v24.4s, v29.s[0]
    fmla v17.4s, v25.4s, v29.s[0]
    fmla v18.4s, v26.4s, v29.s[0]
    fmla v19.2s, v27.2s, v29.s[0]
    // B: COLUMN 5
    add x8, x8, x4
    ldr s29, [x8]
    fmla v20.4s, v24.4s, v29.s[0]
    fmla v21.4s, v25.4s, v29.s[0]
    fmla v22.4s, v26.4s, v29.s[0]
    fmla v23.2s, v27.2s, v29.s[0]

    // move to next column of A
    add x7, x7, x3
    // move to next row of B
    mov x8, x1
    add x9, x9, #4
    add x8, x8, x9

    // decrement loop counter
    sub x6, x6, #1
    // check if loop counter is zero
    cbnz x6, _k1_loop

Our third approach was again a single loop, but this time we load each column of matrix A using two ldp qXX, qXX instructions and then zero the last two lanes with mov v27.s[2], wzr and mov v27.s[3], wzr.

Calculate matrix C with a single loop using ldp loads
_k1_loop:
    // load column of A
    ldp q24, q25, [x7] // 4 + 4 values
    ldp q26, q27, [x7, #32] // 4 + 4 values - reads 2 elements past the column (out-of-bounds read)
    mov v27.s[2], wzr
    mov v27.s[3], wzr

    // B: COLUMN 0
    ldr s29, [x8]

    fmla v0.4s, v24.4s, v29.s[0]
    fmla v1.4s, v25.4s, v29.s[0]
    fmla v2.4s, v26.4s, v29.s[0]
    fmla v3.4s, v27.4s, v29.s[0]
    // B: COLUMN 1
    add x8, x8, x4
    ldr s29, [x8]
    fmla v4.4s, v24.4s, v29.s[0]
    fmla v5.4s, v25.4s, v29.s[0]
    fmla v6.4s, v26.4s, v29.s[0]
    fmla v7.4s, v27.4s, v29.s[0]
    // B: COLUMN 2
    add x8, x8, x4
    ldr s29, [x8]
    fmla  v8.4s, v24.4s, v29.s[0]
    fmla  v9.4s, v25.4s, v29.s[0]
    fmla v10.4s, v26.4s, v29.s[0]
    fmla v11.4s, v27.4s, v29.s[0]
    // B: COLUMN 3
    add x8, x8, x4
    ldr s29, [x8]
    fmla v12.4s, v24.4s, v29.s[0]
    fmla v13.4s, v25.4s, v29.s[0]
    fmla v14.4s, v26.4s, v29.s[0]
    fmla v15.4s, v27.4s, v29.s[0]
    // B: COLUMN 4
    add x8, x8, x4
    ldr s29, [x8]
    fmla v16.4s, v24.4s, v29.s[0]
    fmla v17.4s, v25.4s, v29.s[0]
    fmla v18.4s, v26.4s, v29.s[0]
    fmla v19.4s, v27.4s, v29.s[0]
    // B: COLUMN 5
    add x8, x8, x4
    ldr s29, [x8]
    fmla v20.4s, v24.4s, v29.s[0]
    fmla v21.4s, v25.4s, v29.s[0]
    fmla v22.4s, v26.4s, v29.s[0]
    fmla v23.4s, v27.4s, v29.s[0]

    // move to next column of A
    add x7, x7, x3
    // move to next row of B
    mov x8, x1
    add x9, x9, #4
    add x8, x8, x9

    // decrement loop counter
    sub x6, x6, #1
    // check if loop counter is zero
    cbnz x6, _k1_loop

In our fourth approach we simply copied the second version and changed the loads for matrices A and C, using ld1 instead of ldp.

Calculate matrix C with a single loop and ld1 loads
_k1_loop:
    // load column of A
    ld1 {v24.4s-v27.4s}, [x7]
    ldr d27, [x7, #48] // 2 values

    // B: COLUMN 0
    ldr s29, [x8]
    fmla v0.4s, v24.4s, v29.s[0]
    fmla v1.4s, v25.4s, v29.s[0]
    fmla v2.4s, v26.4s, v29.s[0]
    fmla v3.2s, v27.2s, v29.s[0]
    // B: COLUMN 1
    add x8, x8, x4
    ldr s29, [x8]
    fmla v4.4s, v24.4s, v29.s[0]
    fmla v5.4s, v25.4s, v29.s[0]
    fmla v6.4s, v26.4s, v29.s[0]
    fmla v7.2s, v27.2s, v29.s[0]
    // B: COLUMN 2
    add x8, x8, x4
    ldr s29, [x8]
    fmla  v8.4s, v24.4s, v29.s[0]
    fmla  v9.4s, v25.4s, v29.s[0]
    fmla v10.4s, v26.4s, v29.s[0]
    fmla v11.2s, v27.2s, v29.s[0]
    // B: COLUMN 3
    add x8, x8, x4
    ldr s29, [x8]
    fmla v12.4s, v24.4s, v29.s[0]
    fmla v13.4s, v25.4s, v29.s[0]
    fmla v14.4s, v26.4s, v29.s[0]
    fmla v15.2s, v27.2s, v29.s[0]
    // B: COLUMN 4
    add x8, x8, x4
    ldr s29, [x8]
    fmla v16.4s, v24.4s, v29.s[0]
    fmla v17.4s, v25.4s, v29.s[0]
    fmla v18.4s, v26.4s, v29.s[0]
    fmla v19.2s, v27.2s, v29.s[0]
    // B: COLUMN 5
    add x8, x8, x4
    ldr s29, [x8]
    fmla v20.4s, v24.4s, v29.s[0]
    fmla v21.4s, v25.4s, v29.s[0]
    fmla v22.4s, v26.4s, v29.s[0]
    fmla v23.2s, v27.2s, v29.s[0]

    // move to next column of A
    add x7, x7, x3
    // move to next row of B
    mov x8, x1
    add x9, x9, #4
    add x8, x8, x9

    // decrement loop counter
    sub x6, x6, #1
    // check if loop counter is zero
    cbnz x6, _k1_loop

When benchmarking our approaches we obtained the following results:

Benchmarking results for matmul_14_6_64 approaches
Benchmarking V1_Matmul_14_6_64 throughput ...
-----------------------------------------------
Measuring throughput for Instruction
Total time (s):   2.54314
Instructions per Second:   8.45568e+10
Estimated GFLOPS:   84.5568 GFLOPS/sec
-----------------------------------------------

Benchmarking V2_Matmul_14_6_64 throughput ...
-----------------------------------------------
Measuring throughput for Instruction
Total time (s):   1.88243
Instructions per Second:   1.14235e+11
Estimated GFLOPS:   114.235 GFLOPS/sec
-----------------------------------------------

Benchmarking V3_Matmul_14_6_64 throughput ...
-----------------------------------------------
Measuring throughput for Instruction
Total time (s):   2.08544
Instructions per Second:   1.03115e+11
Estimated GFLOPS:   103.115 GFLOPS/sec
-----------------------------------------------

Benchmarking V4_Matmul_14_6_64 throughput ...
-----------------------------------------------
Measuring throughput for Instruction
Total time (s):   1.89444
Instructions per Second:   1.13511e+11
Estimated GFLOPS:   113.511 GFLOPS/sec
-----------------------------------------------

The results indicate that the version with three load instructions performed best, roughly 10 GFLOPs ahead of the lane-zeroing variant. The switch from ldp to ld1, however, did not show any significant change in the number of GFLOPs.

3.4.2 Matmul_15_6_64

For the case M=15 we considered three different versions:

For our first approach we again used two loops. As before, the first loop calculates a (12 x 64) block of matrix C, and the second loop calculates the remaining (3 x 64) block.

Second loop for the (3 x 64) matrix calculation
_k2_loop:
    // load column of A
    ldr d24, [x7]   // 2 values
    ldr s25, [x7, #8]

    // B: COLUMN 0
    ldr s29, [x8]
    fmla v0.2s, v24.2s, v29.s[0]
    fmadd s1, s25, s29, s1
    // B: COLUMN 1
    add x8, x8, x4
    ldr s29, [x8]
    fmla v2.2s, v24.2s, v29.s[0]
    fmadd s3, s25, s29, s3
    // B: COLUMN 2
    add x8, x8, x4
    ldr s29, [x8]
    fmla v4.2s, v24.2s, v29.s[0]
    fmadd s5, s25, s29, s5
    // B: COLUMN 3
    add x8, x8, x4
    ldr s29, [x8]
    fmla v6.2s, v24.2s, v29.s[0]
    fmadd s7, s25, s29, s7
    // B: COLUMN 4
    add x8, x8, x4
    ldr s29, [x8]
    fmla v8.2s, v24.2s, v29.s[0]
    fmadd s9, s25, s29, s9
    // B: COLUMN 5
    add x8, x8, x4
    ldr s29, [x8]
    fmla v10.2s, v24.2s, v29.s[0]
    fmadd s11, s25, s29, s11

    // move to next column of A
    add x7, x7, x3
    // move to next row of B
    mov x8, x1
    add x9, x9, #4
    add x8, x8, x9

    // decrement loop counter
    sub x6, x6, #1
    // check if loop counter is zero
    cbnz x6, _k2_loop

In the second approach we use one loop. We load matrix A column-wise using two ldp qXX, qXX, [x7] instructions and then set the last element to zero using mov v27.s[3], wzr.

Calculate matrix C with a single loop using ldp loads
_k1_loop:
    // load column of A
    ldp q24, q25, [x7] // 4 + 4 values
    ldp q26, q27, [x7, #32] // 4 + 4 values - reads 1 element past the column (out-of-bounds read)
    mov v27.s[3], wzr

    // B: COLUMN 0
    ldr s29, [x8]

    fmla v0.4s, v24.4s, v29.s[0]
    fmla v1.4s, v25.4s, v29.s[0]
    fmla v2.4s, v26.4s, v29.s[0]
    fmla v3.4s, v27.4s, v29.s[0]
    // B: COLUMN 1
    add x8, x8, x4
    ldr s29, [x8]
    fmla v4.4s, v24.4s, v29.s[0]
    fmla v5.4s, v25.4s, v29.s[0]
    fmla v6.4s, v26.4s, v29.s[0]
    fmla v7.4s, v27.4s, v29.s[0]
    // B: COLUMN 2
    add x8, x8, x4
    ldr s29, [x8]
    fmla  v8.4s, v24.4s, v29.s[0]
    fmla  v9.4s, v25.4s, v29.s[0]
    fmla v10.4s, v26.4s, v29.s[0]
    fmla v11.4s, v27.4s, v29.s[0]
    // B: COLUMN 3
    add x8, x8, x4
    ldr s29, [x8]
    fmla v12.4s, v24.4s, v29.s[0]
    fmla v13.4s, v25.4s, v29.s[0]
    fmla v14.4s, v26.4s, v29.s[0]
    fmla v15.4s, v27.4s, v29.s[0]
    // B: COLUMN 4
    add x8, x8, x4
    ldr s29, [x8]
    fmla v16.4s, v24.4s, v29.s[0]
    fmla v17.4s, v25.4s, v29.s[0]
    fmla v18.4s, v26.4s, v29.s[0]
    fmla v19.4s, v27.4s, v29.s[0]
    // B: COLUMN 5
    add x8, x8, x4
    ldr s29, [x8]
    fmla v20.4s, v24.4s, v29.s[0]
    fmla v21.4s, v25.4s, v29.s[0]
    fmla v22.4s, v26.4s, v29.s[0]
    fmla v23.4s, v27.4s, v29.s[0]

    // move to next column of A
    add x7, x7, x3
    // move to next row of B
    mov x8, x1
    add x9, x9, #4
    add x8, x8, x9

    // decrement loop counter
    sub x6, x6, #1
    // check if loop counter is zero
    cbnz x6, _k1_loop

In the third approach we changed the load instructions from ldp to ld1, which loads all four registers with a single instruction.

Calculate matrix C with a single loop using ld1 loads
 81_k1_loop:
 82    // load column of A
 83    ld1 {v24.4s-v27.4s}, [x7]
 84    mov v27.s[3], wzr
 85
 86    // B: COLUMN 0
 87    ldr s29, [x8]
 88    
 89    fmla v0.4s, v24.4s, v29.s[0]
 90    fmla v1.4s, v25.4s, v29.s[0]
 91    fmla v2.4s, v26.4s, v29.s[0]
 92    fmla v3.4s, v27.4s, v29.s[0]
 93    // B: COLUMN 1
 94    add x8, x8, x4
 95    ldr s29, [x8]
 96    fmla v4.4s, v24.4s, v29.s[0]
 97    fmla v5.4s, v25.4s, v29.s[0]
 98    fmla v6.4s, v26.4s, v29.s[0]
 99    fmla v7.4s, v27.4s, v29.s[0]
100    // B: COLUMN 2
101    add x8, x8, x4
102    ldr s29, [x8]
103    fmla  v8.4s, v24.4s, v29.s[0]
104    fmla  v9.4s, v25.4s, v29.s[0]
105    fmla v10.4s, v26.4s, v29.s[0]
106    fmla v11.4s, v27.4s, v29.s[0]
107    // B: COLUMN 3
108    add x8, x8, x4
109    ldr s29, [x8]
110    fmla v12.4s, v24.4s, v29.s[0]
111    fmla v13.4s, v25.4s, v29.s[0]
112    fmla v14.4s, v26.4s, v29.s[0]
113    fmla v15.4s, v27.4s, v29.s[0]
114    // B: COLUMN 4
115    add x8, x8, x4
116    ldr s29, [x8]
117    fmla v16.4s, v24.4s, v29.s[0]
118    fmla v17.4s, v25.4s, v29.s[0]
119    fmla v18.4s, v26.4s, v29.s[0]
120    fmla v19.4s, v27.4s, v29.s[0]
121    // B: COLUMN 5
122    add x8, x8, x4
123    ldr s29, [x8]
124    fmla v20.4s, v24.4s, v29.s[0]
125    fmla v21.4s, v25.4s, v29.s[0]
126    fmla v22.4s, v26.4s, v29.s[0]
127    fmla v23.4s, v27.4s, v29.s[0]
128
129    // move to next column of A
130    add x7, x7, x3
131    // move to next row of B
132    mov x8, x1
133    add x9, x9, #4
134    add x8, x8, x9
135
136    // decrement loop counter
137    sub x6, x6, #1
138    // check if loop counter is zero
139    cbnz x6, _k1_loop

Again, we performed some benchmarks:

Benchmarking results for matmul_15_6_64 approaches
41Benchmarking V1_Matmul_15_6_64 throughput ...
42-----------------------------------------------
43Measuring throughput for Instruction
44Total time (s):   2.80305
45Instructions per Second:   8.21963e+10
46Estimated GFLOPS:   82.1963 GFLOPS/sec
47-----------------------------------------------
48
49Benchmarking V2_Matmul_15_6_64 throughput ...
50-----------------------------------------------
51Measuring throughput for Instruction
52Total time (s):   2.18569
53Instructions per Second:   1.05413e+11
54Estimated GFLOPS:   105.413 GFLOPS/sec
55-----------------------------------------------
56
57Benchmarking V3_Matmul_15_6_64 throughput ...
58-----------------------------------------------
59Measuring throughput for Instruction
60Total time (s):   2.18658
61Instructions per Second:   1.0537e+11
62Estimated GFLOPS:   105.37 GFLOPS/sec
63-----------------------------------------------

Similar to the benchmarks for matmul_14_6_64, the single-loop approach performs much better than the two-loop approach. This time we even gain about 23 GFLOPS with it.

3.4.3 Generic Approach

As a proof of concept we also implemented a generic version of the matmul_14_6_64 and matmul_15_6_64 kernels that works for any M > 0. The idea is to write specific kernels for M = 1, 2, ..., 8. We divide M by 8 (a right shift by 3) to get the number of full blocks and loop the M = 8 kernel that many times; in effect, we split the M dimension into blocks of 8 rows and compute each block with a matmul_8_6_64 kernel. Any remainder lies between 1 and 7 and is handled by the corresponding specific kernel, selected via a jump table.

We also benchmarked the performance of this generic kernel:

Benchmarking results for matmul_M_6_64 (M = 14) approach
33Benchmarking Matmul_M_6_64 M=14 throughput ...
34-----------------------------------------------
35Measuring throughput for Instruction
36Total time (s):   2.4066
37Instructions per Second:   8.93543e+10
38Estimated GFLOPS:   89.3543 GFLOPS/sec
39-----------------------------------------------
Benchmarking results for matmul_M_6_64 (M = 15) approach
65Benchmarking Matmul_M_6_64 M=15 throughput ...
66-----------------------------------------------
67Measuring throughput for Instruction
68Total time (s):   3.68161
69Instructions per Second:   6.25814e+10
70Estimated GFLOPS:   62.5814 GFLOPS/sec
71-----------------------------------------------

Compared to our other approaches, the obtained GFLOPS are noticeably worse: we lose about 30 GFLOPS against our best matmul_14_6_64 kernel and about 40 GFLOPS against our best matmul_15_6_64 kernel.

3.5 Accumulator Block Shapes

In this task we were supposed to implement a microkernel that computes C += AB for M = 64, N = 64 and K = 64. Recalling our matmul_64_48_64 kernel, we only need to change the N dimension to 64. That kernel uses matmul_16_6_64 internally, which we changed to matmul_16_4_64: with N = 4 the N dimension divides into 16 blocks of 4 columns. N = 8 was not suitable, because with M = 16 each accumulator column occupies four vector registers, so eight columns would consume all 32 SIMD registers and leave none for the operands. We do not think it is necessary to show the code for this kernel, as it is very similar to the matmul_64_48_64 kernel: we only removed the logic for 2 of the 6 columns and increased the loop counter constant.

Benchmarking this kernel we obtained the following results:

Benchmarking results for matmul_64_64_64 approaches
 1Benchmarking V1 Matmul throughput ...
 2-----------------------------------------------
 3Measuring throughput for Instruction
 4Total time (s):   1.27442
 5Instructions per Second:   1.23418e+11
 6Estimated GFLOPS:   123.418 GFLOPS/sec
 7-----------------------------------------------
 8
 9Benchmarking V2 Matmul throughput ...
10-----------------------------------------------
11Measuring throughput for Instruction
12Total time (s):   1.246
13Instructions per Second:   1.26233e+11
14Estimated GFLOPS:   126.233 GFLOPS/sec
15-----------------------------------------------

V1 is the first version, obtained by converting our best-performing matmul_64_48_64 kernel. Trying to squeeze out more performance, we made some minor changes to the computation of the strides (as shown below) and removed loads and stores of callee-saved registers that were not used. This resulted in a performance increase of about 2-3 GFLOPS in V2 across multiple runs.

Naive stride calculations
39    // multiply strides with float size
40    mov x6, #4
41    mul x3, x3, x6 // lda
42    mul x4, x4, x6 // ldb
43    mul x5, x5, x6 // ldc
44
45    mov x6, #4
46    mul x22, x4, x6 // ldb * 4 columns
47    mul x23, x5, x6 // ldc * 4 columns
Optimized stride calculations
37    // multiply strides with float size
38    // *4 = lsl #2
39    lsl x3, x3, #2 // lda
40    lsl x4, x4, #2 // ldb
41    lsl x5, x5, #2 // ldc
42
43    lsl x22, x4, #2 // ldb * 4 columns
44    lsl x23, x5, #2 // ldc * 4 columns

3.6 Batch-Reduce GEMM

Based on the previous tasks, we now implement a batch-reduce GEMM kernel. The goal is a kernel that computes \(C+=\sum_i A_i B_i\) for M=64, N=48 and K=64 matrices. The kernel should handle batches of input matrices; for now we only implement a batch size of 16.

Similar to the previous tasks we implemented several versions of this kernel to optimize the performance.

In our first version we simply wrapped our matmul_64_48_64 kernel from the loops task in a loop over the 16 batch entries. Key points that we needed to consider were the following:

Setting the batch counter
56    // Batch counter
57    mov x24, #16
58
59_n_batch:
Jumping to the next matrix A and B in the batch
230    // next A matrix
231    add x0, x0, x6 // A
232    mov x8, x0     // A
233
234    // next B matrix
235    add x1, x1, x7 // B
236    mov x20, x1    // B
237
238    // restore Pointer for matrix C
239    mov x21, x2    // C
240    mov x10, x21   // C
241
242    sub x24, x24, #1
243
244    cbnz x24, _n_batch

In our second version we made some optimizations to the kernel. The changes we made were:

Replacing MULs with LSLs
39    // multiply strides with float size
40    lsl x3, x3, #2 // lda in bytes
41    lsl x4, x4, #2 // ldb in bytes
42    lsl x5, x5, #2 // ldc in bytes
43    lsl x6, x6, #2 // br_stride_a in bytes
44    lsl x7, x7, #2 // br_stride_b in bytes
Replacing all LDPs with LD1s and STPs with ST1s
78    // LOAD MATRIX C
79    mov x12, x10
80    // first column
81    ld1 {v0.4s, v1.4s, v2.4s, v3.4s}, [x12]
82    // second column
83    add x12, x12, x5
84    ld1 {v4.4s, v5.4s, v6.4s, v7.4s}, [x12]
85    // third column
86    add x12, x12, x5
87    ld1 {v8.4s, v9.4s, v10.4s, v11.4s}, [x12]
88    // fourth column
89    add x12, x12, x5
90    ld1 {v12.4s, v13.4s, v14.4s, v15.4s}, [x12]
91    // fifth column
92    add x12, x12, x5
93    ld1 {v16.4s, v17.4s, v18.4s, v19.4s}, [x12]
94    // sixth column
95    add x12, x12, x5
96    ld1 {v20.4s, v21.4s, v22.4s, v23.4s}, [x12]

These optimizations resulted in a performance increase of about 3-4 GFLOPS.

Benchmarking results for the batch-reduce GEMM kernels
 1Benchmarking V1_Matmul_64_48_64_16 throughput ...
 2-----------------------------------------------
 3Measuring throughput for Instruction
 4Total time (s):   1.22446
 5Instructions per Second:   1.28453e+11
 6Estimated GFLOPS:   128.453 GFLOPS/sec
 7-----------------------------------------------
 8
 9Benchmarking V2_Matmul_64_48_64_16 throughput ...
10-----------------------------------------------
11Measuring throughput for Instruction
12Total time (s):   1.19946
13Instructions per Second:   1.31131e+11
14Estimated GFLOPS:   131.131 GFLOPS/sec

3.7 Transposition

In this task we looked at how to transpose a given 8x8 matrix in assembly. Our approach was to first consider the 4x4 case.

3.7.1 Transposition Implementation

We first load all 4 columns of matrix A using ldr qX, [x0], so that the whole matrix is held in registers. The second step is to transpose the matrix:

trans_4_4 implementation
56    /*
57    * Part 2.1:
58    * Transpose 4x4 block.
59    */
60    trn1 v4.4s, v0.4s, v2.4s
61    trn1 v5.4s, v1.4s, v3.4s
62
63    trn2 v6.4s, v0.4s, v2.4s
64    trn2 v7.4s, v1.4s, v3.4s
65
66    /*
67    * Part 2.2:
68    * Transpose 4x4 block.
69    */
70    zip1 v8.4s, v4.4s, v5.4s    // B "column" 0
71    zip1 v9.4s, v6.4s, v7.4s    // B "column" 1
72
73    zip2 v10.4s, v4.4s, v5.4s   // B "column" 2
74    zip2 v11.4s, v6.4s, v7.4s   // B "column" 3

The idea of trn1 and trn2 is to rearrange the elements of the columns in such a way that zip1 and zip2 can then assemble the rows of the transposed matrix.

Now, looking at the 8x8 matrix our initial approach would be very simple:

Matrix divided in 4 quadrants

We separate the transposition into 4 subtasks: each of the four 4x4 submatrices is transposed using our trans_4_4 kernel.

  1. The upper left matrix (in the image A) would be transposed and stored at the loading position.

  2. The upper right matrix (in the image B) would be transposed and stored at the “loading” position of the bottom left matrix.

  3. The bottom left matrix (in the image C) would be transposed and stored at the “loading” position of the upper right matrix.

Loading, transposition and storing of the upper-right and bottom-left matrices
 89    /*
 90    * Part 1.2:
 91    * Load 4x4 block of A (Left bottom, top right).
 92    */
 93    mov x4, x0 // A
 94    mov x5, x1 // B
 95
 96    add x4, x4, #16
 97    add x5, x5, #16
 98
 99    ldr q0, [x4]
100    add x4, x4, x2
101    ldr q1, [x4]
102    add x4, x4, x2
103    ldr q2, [x4]
104    add x4, x4, x2
105    ldr q3, [x4]
106
107    // Right top
108    mov x4, x0 // A
109    mov x5, x1 // B
110
111    add x4, x4, #128
112    add x5, x5, #128
113    
114    ldr q12, [x4]
115    add x4, x4, x2
116    ldr q13, [x4]
117    add x4, x4, x2
118    ldr q14, [x4]
119    add x4, x4, x2
120    ldr q15, [x4]
121
122    /*
123    * Part 2.1:
124    * Transpose 4x4 block.
125    */
126    // Left Bottom
127    trn1 v4.4s, v0.4s, v2.4s
128    trn1 v5.4s, v1.4s, v3.4s
129
130    trn2 v6.4s, v0.4s, v2.4s
131    trn2 v7.4s, v1.4s, v3.4s
132
133    // Right Top
134    trn1 v16.4s, v12.4s, v14.4s
135    trn1 v17.4s, v13.4s, v15.4s
136
137    trn2 v18.4s, v12.4s, v14.4s
138    trn2 v19.4s, v13.4s, v15.4s
139
140    /*
141    * Part 2.2:
142    * Transpose 4x4 block.
143    */
144    // Left Bottom
145    zip1 v8.4s, v4.4s, v5.4s    
146    zip1 v9.4s, v6.4s, v7.4s    
147
148    zip2 v10.4s, v4.4s, v5.4s   
149    zip2 v11.4s, v6.4s, v7.4s   
150
151    // Right Top
152    zip1 v20.4s, v16.4s, v17.4s 
153    zip1 v21.4s, v18.4s, v19.4s 
154
155    zip2 v22.4s, v16.4s, v17.4s 
156    zip2 v23.4s, v18.4s, v19.4s
157
158    /*
159    * Part 3:
160    * Store 4x4 block of Submatrix A''' into A''.
161    */
162    // Left Bottom (values from right top)
163    mov x5, x1
164    add x5, x5, #16
165
166    str q20, [x5]
167    add x5, x5, x3
168    str q21, [x5]
169    add x5, x5, x3
170    str q22, [x5]
171    add x5, x5, x3
172    str q23, [x5]
173
174    // Right top (values from left bottom)
175    mov x5, x1
176    add x5, x5, #128
177
178    str q8, [x5]
179    add x5, x5, x3
180    str q9, [x5]
181    add x5, x5, x3
182    str q10, [x5]
183    add x5, x5, x3
184    str q11, [x5]
  4. The bottom right matrix (in the image D) would be transposed and stored at the loading position.

To optimize our first version, we removed the PCS saves and restores for all registers that we did not use, and rewrote the kernel in a more compact, loop-based form.

Optimized second transposition kernel
 40    mov x9, #2 // n loop
 41
 42n_loop:
 43
 44    mov x6, #2 // m loop
 45
 46m_loop:
 47    /*
 48     * Part 1:
 49     * Transpose 4x4 block.
 50     */
 51    mov x7, x4
 52    mov x8, x5
 53
 54    ldr q0, [x7]
 55    add x7, x7, x2
 56
 57    ldr q1, [x7]
 58    add x7, x7, x2
 59
 60    ldr q2, [x7]
 61    add x7, x7, x2
 62
 63    ldr q3, [x7]
 64
 65    /*
 66    * Part 2.1:
 67    * Transpose 4x4 block.
 68    */
 69    trn1 v4.4s, v0.4s, v2.4s
 70    trn1 v5.4s, v1.4s, v3.4s
 71
 72    trn2 v6.4s, v0.4s, v2.4s
 73    trn2 v7.4s, v1.4s, v3.4s
 74
 75    /*
 76    * Part 2.2:
 77    * Transpose 4x4 block.
 78    */
 79    zip1 v16.4s, v4.4s, v5.4s
 80    zip1 v17.4s, v6.4s, v7.4s
 81
 82    zip2 v18.4s, v4.4s, v5.4s
 83    zip2 v19.4s, v6.4s, v7.4s
 84
 85    /*
 86    * Part 3:
 87    * Store 4x4 block of A into B.
 88    */
 89    str q16, [x8]
 90    add x8, x8, x3
 91
 92    str q17, [x8]
 93    add x8, x8, x3
 94
 95    str q18, [x8]
 96    add x8, x8, x3
 97
 98    str q19, [x8]
 99
100    // Jump 4 rows in A
101    add x4, x4, x25
102
103    // Jump 4 columns in B
104    add x5, x5, x27
105
106    sub x6, x6, #1
107    cbnz x6, m_loop
108
109
110    // Restore Pointer for A and B
111    mov x4, x0
112    mov x5, x1
113
114    add x12, x12, x26
115    add x13, x13, x25
116
117    add x4, x4, x12
118    add x5, x5, x13
119
120    sub x9, x9, #1
121    cbnz x9, n_loop

3.7.2 Performance Measuring

We also wanted to know the performance of our transposition kernel, which in our case is determined by the loading and storing of elements, so we measure it as memory throughput.

trans_8_8 performance in GiB/s
 1Benchmarking trans_neon_8_8 performance ...
 2-----------------------------------------------
 3Measuring throughput for transposition in GiB/s
 4Total time (s):   1.26545
 5Data movements per Second:   8.09199e+10
 6Estimated GiB/s:   80.9199 GiB/s
 7-----------------------------------------------
 8
 9Benchmarking v2_trans_neon_8_8 performance ...
10-----------------------------------------------
11Measuring throughput for transposition in GiB/s
12Total time (s):   0.902975
13Data movements per Second:   1.13403e+11
14Estimated GiB/s:   113.403 GiB/s
15-----------------------------------------------

Our results show that our first version transfers about 80 GiB/s. Our optimized version increases this by roughly 33 GiB/s.