3. Neon

3.1 Execution Throughput and Latency

For this task, we benchmarked the execution throughput and latency of FP32 Neon instructions. Specifically, we looked at:

  1. FMLA (vector) instruction

  2. FMADD (scalar) instruction

3.1.1 Throughput

As a first step, we compared the throughput of:

  1. FMLA (vector) with arrangement specifier 4S

  2. FMLA (vector) with arrangement specifier 2S

  3. FMADD (scalar), single-precision variant

To compare the throughput, we created three assembly programs, each executing a large number of these operations. To get meaningful results, we made sure to avoid dependencies between the source and destination registers of consecutive operations. The instruction sequences that we ended up with were:

FMLA (vector) with arrangement specifier 4S
loop:
    .rept 100
    fmla  v0.4s,  v8.4s, v16.4s
    fmla  v1.4s,  v9.4s, v17.4s
    fmla  v2.4s, v10.4s, v18.4s
    fmla  v3.4s, v11.4s, v19.4s
    fmla  v4.4s, v12.4s, v20.4s

    fmla  v5.4s, v13.4s, v21.4s
    fmla  v6.4s, v14.4s, v22.4s
    fmla  v7.4s, v15.4s, v23.4s
    fmla  v8.4s, v16.4s, v24.4s
    fmla  v9.4s, v17.4s, v25.4s

    fmla v10.4s, v18.4s, v26.4s
    fmla v11.4s, v19.4s, v27.4s
    fmla v12.4s, v20.4s, v28.4s
    fmla v13.4s, v21.4s, v29.4s
    fmla v14.4s, v22.4s, v30.4s

    fmla v15.4s, v23.4s, v31.4s
    fmla v16.4s, v24.4s,  v0.4s
    fmla v17.4s, v25.4s,  v1.4s
    fmla v18.4s, v26.4s,  v2.4s
    fmla v19.4s, v27.4s,  v3.4s

    fmla v20.4s, v28.4s,  v4.4s
    fmla v21.4s, v29.4s,  v5.4s
    fmla v22.4s, v30.4s,  v6.4s
    fmla v23.4s, v31.4s,  v7.4s
    fmla v24.4s,  v0.4s,  v8.4s

    fmla v25.4s,  v1.4s,  v9.4s
    fmla v26.4s,  v2.4s, v10.4s
    fmla v27.4s,  v3.4s, v11.4s
    fmla v28.4s,  v4.4s, v12.4s
    fmla v29.4s,  v5.4s, v13.4s

    fmla v30.4s,  v6.4s, v14.4s
    fmla v31.4s,  v7.4s, v15.4s
    .endr
FMLA (vector) with arrangement specifier 2S
loop:
    .rept 100
    fmla  v0.2s,  v8.2s, v16.2s
    fmla  v1.2s,  v9.2s, v17.2s
    fmla  v2.2s, v10.2s, v18.2s
    fmla  v3.2s, v11.2s, v19.2s
    fmla  v4.2s, v12.2s, v20.2s

    fmla  v5.2s, v13.2s, v21.2s
    fmla  v6.2s, v14.2s, v22.2s
    fmla  v7.2s, v15.2s, v23.2s
    fmla  v8.2s, v16.2s, v24.2s
    fmla  v9.2s, v17.2s, v25.2s

    fmla v10.2s, v18.2s, v26.2s
    fmla v11.2s, v19.2s, v27.2s
    fmla v12.2s, v20.2s, v28.2s
    fmla v13.2s, v21.2s, v29.2s
    fmla v14.2s, v22.2s, v30.2s

    fmla v15.2s, v23.2s, v31.2s
    fmla v16.2s, v24.2s,  v0.2s
    fmla v17.2s, v25.2s,  v1.2s
    fmla v18.2s, v26.2s,  v2.2s
    fmla v19.2s, v27.2s,  v3.2s

    fmla v20.2s, v28.2s,  v4.2s
    fmla v21.2s, v29.2s,  v5.2s
    fmla v22.2s, v30.2s,  v6.2s
    fmla v23.2s, v31.2s,  v7.2s
    fmla v24.2s,  v0.2s,  v8.2s

    fmla v25.2s,  v1.2s,  v9.2s
    fmla v26.2s,  v2.2s, v10.2s
    fmla v27.2s,  v3.2s, v11.2s
    fmla v28.2s,  v4.2s, v12.2s
    fmla v29.2s,  v5.2s, v13.2s

    fmla v30.2s,  v6.2s, v14.2s
    fmla v31.2s,  v7.2s, v15.2s
    .endr
FMADD (scalar), single-precision variant
loop:
    .rept 100
    fmadd  s0,  s8, s16, s24
    fmadd  s1,  s9, s17, s25
    fmadd  s2, s10, s18, s26
    fmadd  s3, s11, s19, s27
    fmadd  s4, s12, s20, s28

    fmadd  s5, s13, s21, s29
    fmadd  s6, s14, s22, s30
    fmadd  s7, s15, s23, s31
    fmadd  s8, s16, s24, s0
    fmadd  s9, s17, s25, s1

    fmadd s10, s18, s26, s2
    fmadd s11, s19, s27, s3
    fmadd s12, s20, s28, s4
    fmadd s13, s21, s29, s5
    fmadd s14, s22, s30, s6

    fmadd s15, s23, s31, s7
    fmadd s16, s24,  s0, s8
    fmadd s17, s25,  s1, s9
    fmadd s18, s26,  s2, s10
    fmadd s19, s27,  s3, s11

    fmadd s20, s28,  s4, s12
    fmadd s21, s29,  s5, s13
    fmadd s22, s30,  s6, s14
    fmadd s23, s31,  s7, s15
    fmadd s24,  s0,  s8, s16

    fmadd s25,  s1,  s9, s17
    fmadd s26,  s2, s10, s18
    fmadd s27,  s3, s11, s19
    fmadd s28,  s4, s12, s20
    fmadd s29,  s5, s13, s21

    fmadd s30,  s6, s14, s22
    fmadd s31,  s7, s15, s23
    .endr

    subs x0, x0, #1
    b.gt loop

In order to measure the throughput of these instructions, we developed a C++ microbenchmark. For each instruction we first performed a warm-up, measured the execution time, counted the executed operations, and then calculated the GFLOPs.

Example benchmark for FMLA (vector) with arrangement specifier 4S
        // Warmup
        fmla_4s_instr( 100, g_4s_registers );

        auto l_start_time = std::chrono::high_resolution_clock::now();
        fmla_4s_instr( n, g_4s_registers );
        auto l_end_time = std::chrono::high_resolution_clock::now();
        elapsedTime = std::chrono::duration_cast<std::chrono::microseconds>( l_end_time - l_start_time ).count() / 1e6;

        // per FMLA: 4 Muls, 4 Adds
        // 32 fmla
        // rept 100
        // n: loop iterations
        totalOps = (2 * 4) * 32 * 100 * n;

For the 2S and the FMADD (scalar) variants, we only adjusted the operation counts accordingly:

Calculations for FMLA (vector) with arrangement specifier 2S
        // per FMLA: 2 Muls, 2 Adds
        // 32 fmla
        // rept 100
        // n: loop iterations
        totalOps = (2 * 2) * 32 * 100 * n;
Calculations for FMADD (scalar), single-precision variant
        // per FMADD: 1 Mul, 1 Add
        // 32 fmadd
        // rept 100
        // n: loop iterations
        totalOps = (2 * 1) * 32 * 100 * n;

For this benchmarking task we obtained the following results:

Throughput results for the three instructions
Benchmarking FMLA 4s throughput ...
-----------------------------------------------
Measuring throughput for FMLA_4sInstruction
Total time (s):   1.96706
Instructions per Second:   1.30144e+11
Estimated GOPS:   130.144 GFLOPs/sec
-----------------------------------------------

Benchmarking FMLA 2s throughput ...
-----------------------------------------------
Measuring throughput for FMLA_2sInstruction
Total time (s):   2.53647
Instructions per Second:   5.04638e+10
Estimated GOPS:   50.4638 GFLOPs/sec
-----------------------------------------------

Benchmarking FMADD throughput ...
-----------------------------------------------
Measuring throughput for FMADDInstruction
Total time (s):   3.52918
Instructions per Second:   1.81345e+10
Estimated GOPS:   18.1345 GFLOPs/sec
-----------------------------------------------

It can be seen that the FMLA (vector) instruction with arrangement specifier 4S performs approximately 2.6 times as many floating-point operations per second as the variant with arrangement specifier 2S. The 2S variant in turn performs approximately 2.8 times as many floating-point operations per second as the FMADD (scalar) instruction.

This shows that leveraging data-level parallelism through vector instructions yields a much higher throughput than using only scalar operations.

3.1.2 Latency

To measure the execution latency for FMLA (vector) instructions with arrangement specifier 4S, we examined two scenarios:

  1. Each instruction depends on the destination register and one of the source registers of the previous instruction

fmla instructions with dependencies on the destination register and one of the source registers
    fmla v0.4s, v0.4s,  v1.4s
    fmla v0.4s, v0.4s,  v2.4s
    fmla v0.4s, v0.4s,  v3.4s
    fmla v0.4s, v0.4s,  v4.4s
  2. Each instruction depends only on the destination register of the previous instruction

fmla instructions with dependencies on the destination register
    fmla v0.4s,  v1.4s,  v9.4s
    fmla v0.4s,  v2.4s, v10.4s
    fmla v0.4s,  v3.4s, v11.4s
    fmla v0.4s,  v4.4s, v12.4s

Both files contain 32 fmla instructions, which are executed 100 times. The results of our benchmark are shown below:

Latency results for the two scenarios
Benchmarking FMLA 4s source register latency ...
-----------------------------------------------
Measuring latency for FMLA_SourceInstruction
Total time (s):   3.30277
Instructions per Second:   1.16266e+10
Estimated GOPS:   11.6266 GFLOPs/sec
-----------------------------------------------

Benchmarking FMLA 4s destination register latency ...
-----------------------------------------------
Measuring latency for FMLA_DestinationInstruction
Total time (s):   3.30207
Instructions per Second:   1.16291e+10
Estimated GOPS:   11.6291 GFLOPs/sec
-----------------------------------------------

We can see that both scenarios have similar results, which is why we computed the latency only for the first scenario.

We measured \(1.16266 \times 10^{10}\) floating-point operations per second. Note that each FMLA (vector) instruction with arrangement specifier 4S performs 8 of these operations, so the dependent chain executes \(1.16266 \times 10^{10} / 8 \approx 1.45 \times 10^{9}\) FMLA instructions per second. The latency of a single instruction is therefore approximately \(\frac{1}{1.45 \times 10^{9}} \approx 6.9 \times 10^{-10}\) seconds. Using a known clock frequency of 4.4 GHz, this corresponds to \(6.9 \times 10^{-10} \times 4.4 \times 10^{9} \approx 3\) cycles.

3.2 Microkernel

For the second task we implemented a microkernel that executes a matrix multiplication for matrices with the following dimensions:

  1. Matrix A: 16 x 1

  2. Matrix B: 1 x 6

  3. Matrix C: 16 x 6

3.2.1 Neon Microkernel

We developed three different versions of this microkernel in order to optimize its performance.

In the first version we:

  1. Load matrix A (16 x 1)

  2. Load three columns (1 x 1) of matrix B

  3. Load matrix C (16 x 6)

In the second version we:

  1. Load matrix A (16 x 1)

  2. Load one column of matrix B

  3. Load matrix C (16 x 6)

In the third version we:

  1. Load matrix A (16 x 1)

  2. Load one column of matrix B

  3. Load one column of matrix C (16 x 1)

3.2.2 Testing and Benchmarking

To test and compare our versions with one another we:

  1. implemented a microkernel that would give us a visual indication of the results

  2. implemented a test using Catch2 to test the correctness of our implementations

  3. implemented a microbenchmark that would calculate the GFLOPs for each of the three versions

The GFLOPs were derived from the measured execution time:

Time measurement for the GFLOPs calculation
                              16 );
        }
        auto l_end_time = std::chrono::high_resolution_clock::now();
        elapsedTime = std::chrono::duration_cast<std::chrono::microseconds>( l_end_time - l_start_time ).count() / 1e6;
    }

For each version we performed 50,000 warm-up iterations to guarantee comparable results across executions of the benchmark. Using this approach we obtained the following results:

Benchmarking results for the three microkernel versions
Benchmarking V1 Matmul throughput ...
-----------------------------------------------
Measuring throughput for Instruction
Total time (s):   2.93595
Instructions per Second:   3.26981e+10
Estimated GFLOPS:   32.6981 GFLOPS/sec
-----------------------------------------------

Benchmarking V2 Matmul throughput ...
-----------------------------------------------
Measuring throughput for Instruction
Total time (s):   2.90291
Instructions per Second:   3.30702e+10
Estimated GFLOPS:   33.0702 GFLOPS/sec
-----------------------------------------------

Benchmarking V3 Matmul throughput ...
-----------------------------------------------
Measuring throughput for Instruction
Total time (s):   2.79132
Instructions per Second:   3.43923e+10
Estimated GFLOPS:   34.3923 GFLOPS/sec
-----------------------------------------------

The GFLOPs results indicate that each version performed slightly better than the previous one, with a difference of about 1.7 GFLOPs between our best and our worst approach.

3.3 Loops

In this task, we had to add loops to the matrix multiplication kernel which we wrote in the previous task. The goal was to enable the 16x6x1 kernel to be used for larger matrices.

The first step was to implement a loop in the K dimension, resulting in a 16x6x64 kernel. The loading and storing of matrix C was left unchanged. The relevant code is shown below:

Looping matmul_16_6_1 over K dimension
    //  K loop counter
    mov x6, #64
    // set start of A
    mov x7, x0
    // set start of B
    mov x8, x1
    // init row count of B
    mov x9, #0

_k_loop:
    // load column of A
    ldp q24, q25, [x7] // 4 + 4 values
    ldp q26, q27, [x7, #32] // 4 + 4 values

    // B: COLUMN 0
    ldr s29, [x8]
    fmla v0.4s, v24.4s, v29.s[0]
    fmla v1.4s, v25.4s, v29.s[0]
    fmla v2.4s, v26.4s, v29.s[0]
    fmla v3.4s, v27.4s, v29.s[0]
    // B: COLUMN 1
    add x8, x8, x4
    ldr s29, [x8]
    fmla v4.4s, v24.4s, v29.s[0]
    fmla v5.4s, v25.4s, v29.s[0]
    fmla v6.4s, v26.4s, v29.s[0]
    fmla v7.4s, v27.4s, v29.s[0]
    // B: COLUMN 2
    add x8, x8, x4
    ldr s29, [x8]
    fmla  v8.4s, v24.4s, v29.s[0]
    fmla  v9.4s, v25.4s, v29.s[0]
    fmla v10.4s, v26.4s, v29.s[0]
    fmla v11.4s, v27.4s, v29.s[0]
    // B: COLUMN 3
    add x8, x8, x4
    ldr s29, [x8]
    fmla v12.4s, v24.4s, v29.s[0]
    fmla v13.4s, v25.4s, v29.s[0]
    fmla v14.4s, v26.4s, v29.s[0]
    fmla v15.4s, v27.4s, v29.s[0]
    // B: COLUMN 4
    add x8, x8, x4
    ldr s29, [x8]
    fmla v16.4s, v24.4s, v29.s[0]
    fmla v17.4s, v25.4s, v29.s[0]
    fmla v18.4s, v26.4s, v29.s[0]
    fmla v19.4s, v27.4s, v29.s[0]
    // B: COLUMN 5
    add x8, x8, x4
    ldr s29, [x8]
    fmla v20.4s, v24.4s, v29.s[0]
    fmla v21.4s, v25.4s, v29.s[0]
    fmla v22.4s, v26.4s, v29.s[0]
    fmla v23.4s, v27.4s, v29.s[0]

    // move to next column of A
    add x7, x7, x3
    // move to next row of B
    mov x8, x1
    add x9, x9, #4
    add x8, x8, x9

    // decrement loop counter
    sub x6, x6, #1
    // check if loop counter is zero
    cbnz x6, _k_loop

The matmul_16_6_1 kernel mostly stayed the same, except that in each K iteration we now need to adjust the pointers to the input matrices A and B. At the end of each iteration, we move the pointer to A to the next column by adding the given stride. For B, we need to move the pointer to the next row. Therefore, we jump by 4 bytes (since we are using 32-bit floats) from the starting address of B. To keep jumping to the next row in each iteration, we accumulate the offset of 4 bytes in register x9.

The second step was to implement a loop in the M dimension, resulting in a 64x6x64 kernel. To keep the code examples shorter, we exclude the K loop from the code snippets. The relevant code is shown below:

First part of looping over M dimension
    // save base matrix pointers
    mov x7, x0 // A
    mov x8, x1 // B
    mov x9, x2 // C

    // M loop counter
    mov x11, #4 // 64/16 = 4 blocks

_m_loop:
// ------------------------------------------
// START matmul_16_6_64
// ------------------------------------------

    // LOAD MATRIX C
    mov x12, x9
    // first column
    ldp q0, q1, [x12]
    ldp q2, q3, [x12, #32]
    // second column
    add x12, x12, x5
    ldp q4, q5, [x12]
    ldp q6, q7, [x12, #32]
    // third column
    add x12, x12, x5
    ldp q8, q9, [x12]
    ldp q10, q11, [x12, #32]
    // fourth column
    add x12, x12, x5
    ldp q12, q13, [x12]
    ldp q14, q15, [x12, #32]
    // fifth column
    add x12, x12, x5
    ldp q16, q17, [x12]
    ldp q18, q19, [x12, #32]
    // sixth column
    add x12, x12, x5
    ldp q20, q21, [x12]
    ldp q22, q23, [x12, #32]

    // K loop counter
    mov x14, #64
    // set start of A
    mov x15, x7
    // set start of B
    mov x16, x8
    // init row count of B
    mov x17, #0
_k_loop:
Second part of looping over M dimension
    // check if loop counter is zero
    cbnz x14, _k_loop

    // STORE MATRIX C
    mov x12, x9
    // first column
    stp q0, q1, [x12]
    stp q2, q3, [x12, #32]
    // second column
    add x12, x12, x5
    stp q4, q5, [x12]
    stp q6, q7, [x12, #32]
    // third column
    add x12, x12, x5
    stp q8, q9, [x12]
    stp q10, q11, [x12, #32]
    // fourth column
    add x12, x12, x5
    stp q12, q13, [x12]
    stp q14, q15, [x12, #32]
    // fifth column
    add x12, x12, x5
    stp q16, q17, [x12]
    stp q18, q19, [x12, #32]
    // sixth column
    add x12, x12, x5
    stp q20, q21, [x12]
    stp q22, q23, [x12, #32]

// ------------------------------------------
// END matmul_16_6_64
// ------------------------------------------

    // increase A and C pointers for next block
    add x7, x7, #16*4
    add x9, x9, #16*4

    // decrement m loop counter
    sub x11, x11, #1
    // check if loop counter is zero
    cbnz x11, _m_loop

The M loop needs only 4 iterations, since we extend the kernel from 16 to 64 in the M dimension by dividing it into 4 blocks of 16 elements. At the end of each M iteration, we move the pointers of A and C to the next block. We jump by 16 elements in the M dimension, which means adding 16 * 4 bytes to the pointers of A and C.

The last step was to implement a loop in the N dimension, resulting in a 64x48x64 kernel. The relevant code is shown below:

First part of looping over N dimension
    // set base matrix pointers
    mov x20, x1 // B
    mov x21, x2 // C

    // N loop counter
    mov x19, #8 // 48/6 = 8 blocks

_n_loop:

    // M loop counter
    mov x11, #4 // 64/16 = 4 blocks

    // set matrix pointers
    mov x7, x0 // A
    mov x8, x20 // B
    mov x9, x21 // C

_m_loop:
Second part of looping over N dimension
    // decrement m loop counter
    sub x11, x11, #1
    // check if loop counter is zero
    cbnz x11, _m_loop
// END M LOOP

    // increase B and C pointers for next block
    // (jump 6 columns) 6*x4, 6*x5
    add x20, x20, x22
    add x21, x21, x23

    // decrement n loop counter
    sub x19, x19, #1
    // check if loop counter is zero
    cbnz x19, _n_loop
// END N LOOP

Since we are extending the kernel from 6 to 48 in the N dimension, we divide the N dimension into 8 blocks of 6 elements, so the loop has 8 iterations. In each N iteration, it is important to first reset the pointer of A to its original address. After each iteration, we move the pointers of B and C to the next block, i.e. forward by 6 columns, by adding 6 times the respective stride of B and C to each pointer.

3.3.1 Testing and Benchmarking

For all three kernels we have written unit tests. To execute the tests, one first needs to compile the code by invoking make from within the src/submissions/03_neon/03_loops directory. This will create an executable that can be run with ./build/test.

We also calculated the GFLOPs for each of these matrix multiplications. To calculate them we followed the simple formula:

\[M \cdot N \cdot K \cdot \text{Ops Per FMA} = M \cdot N \cdot K \cdot 2\]

The results that we obtained were:

GFLOPs calculations for MatMuls
Benchmarking Matmul_16_6_64 throughput ...
-----------------------------------------------
Measuring throughput for Instruction
Total time (s):   1.89248
Instructions per Second:   1.29861e+11
Estimated GFLOPS:   129.861 GFLOPS/sec
-----------------------------------------------

Benchmarking Matmul_64_6_64 Matmul throughput ...
-----------------------------------------------
Measuring throughput for Instruction
Total time (s):   1.84635
Instructions per Second:   1.33106e+11
Estimated GFLOPS:   133.106 GFLOPS/sec
-----------------------------------------------

Benchmarking Matmul_64_48_64 Matmul throughput ...
-----------------------------------------------
Measuring throughput for Instruction
Total time (s):   1.49743
Instructions per Second:   1.31297e+11
Estimated GFLOPS:   131.297 GFLOPS/sec
-----------------------------------------------

Our results indicate that the number of GFLOPs is very consistent, even when scaling the size of our matrix.

3.4 SIMD Lanes

For this task we were supposed to create two kernels that remain functional even when the M dimension is not a multiple of 4. We created several versions for both:

  1. the M=14, N=6, K=64 case, and

  2. the M=15, N=6, K=64 case

3.4.1 Matmul_14_6_64

For the case M=14 we considered four different versions:

Our first approach was to use two loops. The first loop calculates a (12 x 64) block of matrix C; that is, we load the first 12 elements of each column of matrix A. The second loop then calculates the remaining (2 x 64) block of matrix C.

Second loop for the (2 x 64) matrix calculation
_k2_loop:
    // load column of A
    ldr d24, [x7]   // 2 values

    // B: COLUMN 0
    ldr s29, [x8]
    fmla v3.2s, v24.2s, v29.s[0]

    // B: COLUMN 1
    add x8, x8, x4
    ldr s29, [x8]
    fmla v7.2s, v24.2s, v29.s[0]

    // B: COLUMN 2
    add x8, x8, x4
    ldr s29, [x8]
    fmla v11.2s, v24.2s, v29.s[0]

    // B: COLUMN 3
    add x8, x8, x4
    ldr s29, [x8]
    fmla v15.2s, v24.2s, v29.s[0]

    // B: COLUMN 4
    add x8, x8, x4
    ldr s29, [x8]
    fmla v19.2s, v24.2s, v29.s[0]

    // B: COLUMN 5
    add x8, x8, x4
    ldr s29, [x8]
    fmla v23.2s, v24.2s, v29.s[0]

    // move to next column of A
    add x7, x7, x3
    // move to next row of B
    mov x8, x1
    add x9, x9, #4
    add x8, x8, x9

    // decrement loop counter
    sub x6, x6, #1
    // check if loop counter is zero
    cbnz x6, _k2_loop

The second approach was to use a single loop. We load the whole matrix C, and each column of matrix A using one ldp qXX, qXX, [x7], one ldr qXX, [x7, #32], and one ldr dXX, [x7, #48] instruction.

Calculate matrix C with a single loop using three load instructions
_k1_loop:
    // load column of A
    ldp q24, q25, [x7] // 4 + 4 values
    ldr q26, [x7, #32] // 4 values
    ldr d27, [x7, #48] // 2 values

    // B: COLUMN 0
    ldr s29, [x8]
    fmla v0.4s, v24.4s, v29.s[0]
    fmla v1.4s, v25.4s, v29.s[0]
    fmla v2.4s, v26.4s, v29.s[0]
    fmla v3.2s, v27.2s, v29.s[0]
    // B: COLUMN 1
    add x8, x8, x4
    ldr s29, [x8]
    fmla v4.4s, v24.4s, v29.s[0]
    fmla v5.4s, v25.4s, v29.s[0]
    fmla v6.4s, v26.4s, v29.s[0]
    fmla v7.2s, v27.2s, v29.s[0]
    // B: COLUMN 2
    add x8, x8, x4
    ldr s29, [x8]
    fmla  v8.4s, v24.4s, v29.s[0]
    fmla  v9.4s, v25.4s, v29.s[0]
    fmla v10.4s, v26.4s, v29.s[0]
    fmla v11.2s, v27.2s, v29.s[0]
    // B: COLUMN 3
    add x8, x8, x4
    ldr s29, [x8]
    fmla v12.4s, v24.4s, v29.s[0]
    fmla v13.4s, v25.4s, v29.s[0]
    fmla v14.4s, v26.4s, v29.s[0]
    fmla v15.2s, v27.2s, v29.s[0]
    // B: COLUMN 4
    add x8, x8, x4
    ldr s29, [x8]
    fmla v16.4s, v24.4s, v29.s[0]
    fmla v17.4s, v25.4s, v29.s[0]
    fmla v18.4s, v26.4s, v29.s[0]
    fmla v19.2s, v27.2s, v29.s[0]
    // B: COLUMN 5
    add x8, x8, x4
    ldr s29, [x8]
    fmla v20.4s, v24.4s, v29.s[0]
    fmla v21.4s, v25.4s, v29.s[0]
    fmla v22.4s, v26.4s, v29.s[0]
    fmla v23.2s, v27.2s, v29.s[0]

    // move to next column of A
    add x7, x7, x3
    // move to next row of B
    mov x8, x1
    add x9, x9, #4
    add x8, x8, x9

    // decrement loop counter
    sub x6, x6, #1
    // check if loop counter is zero
    cbnz x6, _k1_loop

Our third approach was again a single loop, but this time we load each column of matrix A using two ldp qXX, qXX instructions and then zero the last two lanes with mov v27.s[2], wzr and mov v27.s[3], wzr.

Calculate matrix C with a single loop using ldp loads
_k1_loop:
    // load column of A
    ldp q24, q25, [x7] // 4 + 4 values
    ldp q26, q27, [x7, #32] // 4 + 4 values - reads 2 elements past the column (out-of-bounds read)
    mov v27.s[2], wzr
    mov v27.s[3], wzr

    // B: COLUMN 0
    ldr s29, [x8]

    fmla v0.4s, v24.4s, v29.s[0]
    fmla v1.4s, v25.4s, v29.s[0]
    fmla v2.4s, v26.4s, v29.s[0]
    fmla v3.4s, v27.4s, v29.s[0]
    // B: COLUMN 1
    add x8, x8, x4
    ldr s29, [x8]
    fmla v4.4s, v24.4s, v29.s[0]
    fmla v5.4s, v25.4s, v29.s[0]
    fmla v6.4s, v26.4s, v29.s[0]
    fmla v7.4s, v27.4s, v29.s[0]
    // B: COLUMN 2
    add x8, x8, x4
    ldr s29, [x8]
    fmla  v8.4s, v24.4s, v29.s[0]
    fmla  v9.4s, v25.4s, v29.s[0]
    fmla v10.4s, v26.4s, v29.s[0]
    fmla v11.4s, v27.4s, v29.s[0]
    // B: COLUMN 3
    add x8, x8, x4
    ldr s29, [x8]
    fmla v12.4s, v24.4s, v29.s[0]
    fmla v13.4s, v25.4s, v29.s[0]
    fmla v14.4s, v26.4s, v29.s[0]
    fmla v15.4s, v27.4s, v29.s[0]
    // B: COLUMN 4
    add x8, x8, x4
    ldr s29, [x8]
    fmla v16.4s, v24.4s, v29.s[0]
    fmla v17.4s, v25.4s, v29.s[0]
    fmla v18.4s, v26.4s, v29.s[0]
    fmla v19.4s, v27.4s, v29.s[0]
    // B: COLUMN 5
    add x8, x8, x4
    ldr s29, [x8]
    fmla v20.4s, v24.4s, v29.s[0]
    fmla v21.4s, v25.4s, v29.s[0]
    fmla v22.4s, v26.4s, v29.s[0]
    fmla v23.4s, v27.4s, v29.s[0]

    // move to next column of A
    add x7, x7, x3
    // move to next row of B
    mov x8, x1
    add x9, x9, #4
    add x8, x8, x9

    // decrement loop counter
    sub x6, x6, #1
    // check if loop counter is zero
    cbnz x6, _k1_loop

In our fourth approach we simply copied the second version and changed the loads for matrices A and C, using ld1 instead of ldp.

Calculate matrix C with a single loop and ld1 loads
_k1_loop:
    // load column of A
    ld1 {v24.4s-v27.4s}, [x7]
    ldr d27, [x7, #48] // 2 values

    // B: COLUMN 0
    ldr s29, [x8]
    fmla v0.4s, v24.4s, v29.s[0]
    fmla v1.4s, v25.4s, v29.s[0]
    fmla v2.4s, v26.4s, v29.s[0]
    fmla v3.2s, v27.2s, v29.s[0]
    // B: COLUMN 1
    add x8, x8, x4
    ldr s29, [x8]
    fmla v4.4s, v24.4s, v29.s[0]
    fmla v5.4s, v25.4s, v29.s[0]
    fmla v6.4s, v26.4s, v29.s[0]
    fmla v7.2s, v27.2s, v29.s[0]
    // B: COLUMN 2
    add x8, x8, x4
    ldr s29, [x8]
    fmla  v8.4s, v24.4s, v29.s[0]
    fmla  v9.4s, v25.4s, v29.s[0]
    fmla v10.4s, v26.4s, v29.s[0]
    fmla v11.2s, v27.2s, v29.s[0]
    // B: COLUMN 3
    add x8, x8, x4
    ldr s29, [x8]
    fmla v12.4s, v24.4s, v29.s[0]
    fmla v13.4s, v25.4s, v29.s[0]
    fmla v14.4s, v26.4s, v29.s[0]
    fmla v15.2s, v27.2s, v29.s[0]
    // B: COLUMN 4
    add x8, x8, x4
    ldr s29, [x8]
    fmla v16.4s, v24.4s, v29.s[0]
    fmla v17.4s, v25.4s, v29.s[0]
    fmla v18.4s, v26.4s, v29.s[0]
    fmla v19.2s, v27.2s, v29.s[0]
    // B: COLUMN 5
    add x8, x8, x4
    ldr s29, [x8]
    fmla v20.4s, v24.4s, v29.s[0]
    fmla v21.4s, v25.4s, v29.s[0]
    fmla v22.4s, v26.4s, v29.s[0]
    fmla v23.2s, v27.2s, v29.s[0]

    // move to next column of A
    add x7, x7, x3
    // move to next row of B
    mov x8, x1
    add x9, x9, #4
    add x8, x8, x9

    // decrement loop counter
    sub x6, x6, #1
    // check if loop counter is zero
    cbnz x6, _k1_loop

When benchmarking our approaches we obtained the following results:

Benchmarking results for matmul_14_6_64 approaches
Benchmarking V1_Matmul_14_6_64 throughput ...
-----------------------------------------------
Measuring throughput for Instruction
Total time (s):   2.54314
Instructions per Second:   8.45568e+10
Estimated GFLOPS:   84.5568 GFLOPS/sec
-----------------------------------------------

Benchmarking V2_Matmul_14_6_64 throughput ...
-----------------------------------------------
Measuring throughput for Instruction
Total time (s):   1.88243
Instructions per Second:   1.14235e+11
Estimated GFLOPS:   114.235 GFLOPS/sec
-----------------------------------------------

Benchmarking V3_Matmul_14_6_64 throughput ...
-----------------------------------------------
Measuring throughput for Instruction
Total time (s):   2.08544
Instructions per Second:   1.03115e+11
Estimated GFLOPS:   103.115 GFLOPS/sec
-----------------------------------------------

Benchmarking V4_Matmul_14_6_64 throughput ...
-----------------------------------------------
Measuring throughput for Instruction
Total time (s):   1.89444
Instructions per Second:   1.13511e+11
Estimated GFLOPS:   113.511 GFLOPS/sec
-----------------------------------------------

The results indicate that the version with three load instructions performed best, roughly 10 GFLOPs ahead of the lane-zeroing variant. The switch from ldp to ld1, however, did not show any significant change in the number of GFLOPs.

3.4.2 Matmul_15_6_64

For the case M=15 we considered three different versions:

For our first approach we again used two loops. As before, the first loop calculates a (12 x 64) block of matrix C, and the second loop calculates the remaining (3 x 64) block.

Second loop for the (3 x 64) matrix calculation
_k2_loop:
    // load column of A
    ldr d24, [x7]   // 2 values
    ldr s25, [x7, #8]

    // B: COLUMN 0
    ldr s29, [x8]
    fmla v0.2s, v24.2s, v29.s[0]
    fmadd s1, s25, s29, s1
    // B: COLUMN 1
    add x8, x8, x4
    ldr s29, [x8]
    fmla v2.2s, v24.2s, v29.s[0]
    fmadd s3, s25, s29, s3
    // B: COLUMN 2
    add x8, x8, x4
    ldr s29, [x8]
    fmla v4.2s, v24.2s, v29.s[0]
    fmadd s5, s25, s29, s5
    // B: COLUMN 3
    add x8, x8, x4
    ldr s29, [x8]
    fmla v6.2s, v24.2s, v29.s[0]
    fmadd s7, s25, s29, s7
    // B: COLUMN 4
    add x8, x8, x4
    ldr s29, [x8]
    fmla v8.2s, v24.2s, v29.s[0]
    fmadd s9, s25, s29, s9
    // B: COLUMN 5
    add x8, x8, x4
    ldr s29, [x8]
    fmla v10.2s, v24.2s, v29.s[0]
    fmadd s11, s25, s29, s11

    // move to next column of A
    add x7, x7, x3
    // move to next row of B
    mov x8, x1
    add x9, x9, #4
    add x8, x8, x9

    // decrement loop counter
    sub x6, x6, #1
    // check if loop counter is zero
    cbnz x6, _k2_loop

In the second approach we use one loop. We load matrix A column-wise using two ldp qXX, qXX, [x7] instructions and then set the last element to zero using mov v27.s[3], wzr.

Calculate matrix C with a single loop using ldp loads
_k1_loop:
    // load column of A
    ldp q24, q25, [x7] // 4 + 4 values
    ldp q26, q27, [x7, #32] // 4 + 4 values - reads 1 element past the column (out-of-bounds read)
    mov v27.s[3], wzr

    // B: COLUMN 0
    ldr s29, [x8]

    fmla v0.4s, v24.4s, v29.s[0]
    fmla v1.4s, v25.4s, v29.s[0]
    fmla v2.4s, v26.4s, v29.s[0]
    fmla v3.4s, v27.4s, v29.s[0]
    // B: COLUMN 1
    add x8, x8, x4
    ldr s29, [x8]
    fmla v4.4s, v24.4s, v29.s[0]
    fmla v5.4s, v25.4s, v29.s[0]
    fmla v6.4s, v26.4s, v29.s[0]
    fmla v7.4s, v27.4s, v29.s[0]
    // B: COLUMN 2
    add x8, x8, x4
    ldr s29, [x8]
    fmla  v8.4s, v24.4s, v29.s[0]
    fmla  v9.4s, v25.4s, v29.s[0]
    fmla v10.4s, v26.4s, v29.s[0]
    fmla v11.4s, v27.4s, v29.s[0]
    // B: COLUMN 3
    add x8, x8, x4
    ldr s29, [x8]
    fmla v12.4s, v24.4s, v29.s[0]
    fmla v13.4s, v25.4s, v29.s[0]
    fmla v14.4s, v26.4s, v29.s[0]
    fmla v15.4s, v27.4s, v29.s[0]
    // B: COLUMN 4
    add x8, x8, x4
    ldr s29, [x8]
    fmla v16.4s, v24.4s, v29.s[0]
    fmla v17.4s, v25.4s, v29.s[0]
    fmla v18.4s, v26.4s, v29.s[0]
    fmla v19.4s, v27.4s, v29.s[0]
    // B: COLUMN 5
    add x8, x8, x4
    ldr s29, [x8]
    fmla v20.4s, v24.4s, v29.s[0]
    fmla v21.4s, v25.4s, v29.s[0]
    fmla v22.4s, v26.4s, v29.s[0]
    fmla v23.4s, v27.4s, v29.s[0]

    // move to next column of A
    add x7, x7, x3
    // move to next row of B
    mov x8, x1
    add x9, x9, #4
    add x8, x8, x9

    // decrement loop counter
    sub x6, x6, #1
    // check if loop counter is zero
    cbnz x6, _k1_loop

In the third approach we changed the load instructions from ldp to ld1, which loads all four registers with a single instruction.

Calculate matrix C with a single loop using ld1 loads
 81_k1_loop:
 82    // load column of A
 83    ld1 {v24.4s-v27.4s}, [x7]
 84    mov v27.s[3], wzr
 85
 86    // B: COLUMN 0
 87    ldr s29, [x8]
 88    
 89    fmla v0.4s, v24.4s, v29.s[0]
 90    fmla v1.4s, v25.4s, v29.s[0]
 91    fmla v2.4s, v26.4s, v29.s[0]
 92    fmla v3.4s, v27.4s, v29.s[0]
 93    // B: COLUMN 1
 94    add x8, x8, x4
 95    ldr s29, [x8]
 96    fmla v4.4s, v24.4s, v29.s[0]
 97    fmla v5.4s, v25.4s, v29.s[0]
 98    fmla v6.4s, v26.4s, v29.s[0]
 99    fmla v7.4s, v27.4s, v29.s[0]
100    // B: COLUMN 2
101    add x8, x8, x4
102    ldr s29, [x8]
103    fmla  v8.4s, v24.4s, v29.s[0]
104    fmla  v9.4s, v25.4s, v29.s[0]
105    fmla v10.4s, v26.4s, v29.s[0]
106    fmla v11.4s, v27.4s, v29.s[0]
107    // B: COLUMN 3
108    add x8, x8, x4
109    ldr s29, [x8]
110    fmla v12.4s, v24.4s, v29.s[0]
111    fmla v13.4s, v25.4s, v29.s[0]
112    fmla v14.4s, v26.4s, v29.s[0]
113    fmla v15.4s, v27.4s, v29.s[0]
114    // B: COLUMN 4
115    add x8, x8, x4
116    ldr s29, [x8]
117    fmla v16.4s, v24.4s, v29.s[0]
118    fmla v17.4s, v25.4s, v29.s[0]
119    fmla v18.4s, v26.4s, v29.s[0]
120    fmla v19.4s, v27.4s, v29.s[0]
121    // B: COLUMN 5
122    add x8, x8, x4
123    ldr s29, [x8]
124    fmla v20.4s, v24.4s, v29.s[0]
125    fmla v21.4s, v25.4s, v29.s[0]
126    fmla v22.4s, v26.4s, v29.s[0]
127    fmla v23.4s, v27.4s, v29.s[0]
128
129    // move to next column of A
130    add x7, x7, x3
131    // move to next row of B
132    mov x8, x1
133    add x9, x9, #4
134    add x8, x8, x9
135
136    // decrement loop counter
137    sub x6, x6, #1
138    // check if loop counter is zero
139    cbnz x6, _k1_loop

Again, we performed some benchmarks:

Benchmarking results for matmul_15_6_64 approaches
41Benchmarking V1_Matmul_15_6_64 throughput ...
42-----------------------------------------------
43Measuring throughput for Instruction
44Total time (s):   2.80305
45Instructions per Second:   8.21963e+10
46Estimated GFLOPS:   82.1963 GFLOPS/sec
47-----------------------------------------------
48
49Benchmarking V2_Matmul_15_6_64 throughput ...
50-----------------------------------------------
51Measuring throughput for Instruction
52Total time (s):   2.18569
53Instructions per Second:   1.05413e+11
54Estimated GFLOPS:   105.413 GFLOPS/sec
55-----------------------------------------------
56
57Benchmarking V3_Matmul_15_6_64 throughput ...
58-----------------------------------------------
59Measuring throughput for Instruction
60Total time (s):   2.18658
61Instructions per Second:   1.0537e+11
62Estimated GFLOPS:   105.37 GFLOPS/sec
63-----------------------------------------------

Similar to the benchmarks for matmul_14_6_64, the single-loop approach performs much better than the two-loop approach. This time we even gain about 23 GFLOPS with it.

3.4.3 Generic Approach

As a proof of concept we also implemented a generic version of the matmul_14_6_64 and matmul_15_6_64 kernels that works for any M > 0. The idea is to write specific kernels for M = 1, 2, ..., 8. We divide M by 8 (a right shift by 3) to get the number of full blocks and loop the M = 8 kernel that many times; in effect, we split the M dimension into blocks of 8 rows and compute each block with a matmul_8_6_64 kernel. Any remainder lies between 1 and 7 and is handled by the corresponding specific kernel, selected via a jump table.

We also benchmarked the performance of this generic kernel:

Benchmarking results for matmul_M_6_64 (M = 14) approach
33Benchmarking Matmul_M_6_64 M=14 throughput ...
34-----------------------------------------------
35Measuring throughput for Instruction
36Total time (s):   2.4066
37Instructions per Second:   8.93543e+10
38Estimated GFLOPS:   89.3543 GFLOPS/sec
39-----------------------------------------------
Benchmarking results for matmul_M_6_64 (M = 15) approach
65Benchmarking Matmul_M_6_64 M=15 throughput ...
66-----------------------------------------------
67Measuring throughput for Instruction
68Total time (s):   3.68161
69Instructions per Second:   6.25814e+10
70Estimated GFLOPS:   62.5814 GFLOPS/sec
71-----------------------------------------------

Compared to our other approaches, the obtained GFLOPS are noticeably worse: we lose about 30 GFLOPS against our best matmul_14_6_64 kernel and about 40 GFLOPS against our best matmul_15_6_64 kernel.

3.5 Accumulator Block Shapes

In this task we were supposed to implement a microkernel that computes C += AB for M = 64, N = 64 and K = 64. Recalling our matmul_64_48_64 kernel, we only need to change the N dimension to 64. That kernel uses matmul_16_6_64 internally, which we changed to matmul_16_4_64: with N = 4 the N dimension divides into 16 blocks of 4 columns. N = 8 was not suitable, because with M = 16 each accumulator column occupies four vector registers, so eight columns would consume all 32 SIMD registers and leave none for the operands. We do not think it is necessary to show the code for this kernel, as it is very similar to the matmul_64_48_64 kernel: we only removed the logic for 2 of the 6 columns and increased the loop counter constant.

Benchmarking this kernel we obtained the following results:

Benchmarking results for matmul_64_64_64 approaches
 1Benchmarking V1 Matmul throughput ...
 2-----------------------------------------------
 3Measuring throughput for Instruction
 4Total time (s):   1.27442
 5Instructions per Second:   1.23418e+11
 6Estimated GFLOPS:   123.418 GFLOPS/sec
 7-----------------------------------------------
 8
 9Benchmarking V2 Matmul throughput ...
10-----------------------------------------------
11Measuring throughput for Instruction
12Total time (s):   1.246
13Instructions per Second:   1.26233e+11
14Estimated GFLOPS:   126.233 GFLOPS/sec
15-----------------------------------------------

V1 is the first version, obtained by converting our best-performing matmul_64_48_64 kernel. Trying to squeeze out more performance, we made some minor changes to the computation of the strides (as shown below) and removed loads and stores of callee-saved registers that were not used. This resulted in a performance increase of about 2-3 GFLOPS in V2 across multiple runs.

Naive stride calculations
39    // multiply strides with float size
40    mov x6, #4
41    mul x3, x3, x6 // lda
42    mul x4, x4, x6 // ldb
43    mul x5, x5, x6 // ldc
44
45    mov x6, #4
46    mul x22, x4, x6 // ldb * 4 columns
47    mul x23, x5, x6 // ldc * 4 columns
Optimized stride calculations
37    // multiply strides with float size
38    // *4 = lsl #2
39    lsl x3, x3, #2 // lda
40    lsl x4, x4, #2 // ldb
41    lsl x5, x5, #2 // ldc
42
43    lsl x22, x4, #2 // ldb * 4 columns
44    lsl x23, x5, #2 // ldc * 4 columns

3.6 Batch-Reduce GEMM

Based on the previous tasks, we now implement a batch-reduce GEMM kernel. The goal is a kernel that computes \(C+=\sum_i A_i B_i\) for M=64, N=48 and K=64 matrices. The kernel should handle batches of input matrices; for now we only implement a batch size of 16.

Similar to the previous tasks we implemented several versions of this kernel to optimize the performance.

In our first version we simply wrapped our matmul_64_48_64 kernel from the loops task in a loop over the 16 batch entries. Key points that we needed to consider were the following:

Setting the batch counter
56    // Batch counter
57    mov x24, #16
58
59_n_batch:
Jumping to the next matrix A and B in the batch
230    // next A matrix
231    add x0, x0, x6 // A
232    mov x8, x0     // A
233
234    // next B matrix
235    add x1, x1, x7 // B
236    mov x20, x1    // B
237
238    // restore Pointer for matrix C
239    mov x21, x2    // C
240    mov x10, x21   // C
241
242    sub x24, x24, #1
243
244    cbnz x24, _n_batch

In our second version we made some optimizations to the kernel. The changes we made were:

Replacing MULs with LSLs
39    // multiply strides with float size
40    lsl x3, x3, #2 // lda in bytes
41    lsl x4, x4, #2 // ldb in bytes
42    lsl x5, x5, #2 // ldc in bytes
43    lsl x6, x6, #2 // br_stride_a in bytes
44    lsl x7, x7, #2 // br_stride_b in bytes
Replacing all LDPs with LD1s and STPs with ST1s
78    // LOAD MATRIX C
79    mov x12, x10
80    // first column
81    ld1 {v0.4s, v1.4s, v2.4s, v3.4s}, [x12]
82    // second column
83    add x12, x12, x5
84    ld1 {v4.4s, v5.4s, v6.4s, v7.4s}, [x12]
85    // third column
86    add x12, x12, x5
87    ld1 {v8.4s, v9.4s, v10.4s, v11.4s}, [x12]
88    // fourth column
89    add x12, x12, x5
90    ld1 {v12.4s, v13.4s, v14.4s, v15.4s}, [x12]
91    // fifth column
92    add x12, x12, x5
93    ld1 {v16.4s, v17.4s, v18.4s, v19.4s}, [x12]
94    // sixth column
95    add x12, x12, x5
96    ld1 {v20.4s, v21.4s, v22.4s, v23.4s}, [x12]

These optimizations resulted in a performance increase of about 3-4 GFLOPS.

Benchmarking results for the batch-reduce GEMM kernels
 1Benchmarking V1_Matmul_64_48_64_16 throughput ...
 2-----------------------------------------------
 3Measuring throughput for Instruction
 4Total time (s):   1.22446
 5Instructions per Second:   1.28453e+11
 6Estimated GFLOPS:   128.453 GFLOPS/sec
 7-----------------------------------------------
 8
 9Benchmarking V2_Matmul_64_48_64_16 throughput ...
10-----------------------------------------------
11Measuring throughput for Instruction
12Total time (s):   1.19946
13Instructions per Second:   1.31131e+11
14Estimated GFLOPS:   131.131 GFLOPS/sec

3.7 Transposition

In this task we looked at how to transpose a given 8x8 matrix in assembly. Our approach was to first consider the 4x4 case.

3.7.1 Transposition Implementation

We first load all 4 columns of matrix A using ldr qX, [x0], so that the whole matrix is held in registers. The second step is to transpose the matrix:

trans_4_4 implementation
56    /*
57    * Part 2.1:
58    * Transpose 4x4 block.
59    */
60    trn1 v4.4s, v0.4s, v2.4s
61    trn1 v5.4s, v1.4s, v3.4s
62
63    trn2 v6.4s, v0.4s, v2.4s
64    trn2 v7.4s, v1.4s, v3.4s
65
66    /*
67    * Part 2.2:
68    * Transpose 4x4 block.
69    */
70    zip1 v8.4s, v4.4s, v5.4s    // B "column" 0
71    zip1 v9.4s, v6.4s, v7.4s    // B "column" 1
72
73    zip2 v10.4s, v4.4s, v5.4s   // B "column" 2
74    zip2 v11.4s, v6.4s, v7.4s   // B "column" 3

The idea of trn1 and trn2 is to rearrange the elements of the columns in such a way that zip1 and zip2 can then assemble the rows of the transposed matrix.

Now, looking at the 8x8 matrix our initial approach would be very simple:

Matrix divided in 4 quadrants

We separate the transposition into 4 subtasks: each of the four 4x4 submatrices is transposed using our trans_4_4 kernel.

  1. The upper left matrix (in the image A) would be transposed and stored at the loading position.

  2. The upper right matrix (in the image B) would be transposed and stored at the “loading” position of the bottom left matrix.

  3. The bottom left matrix (in the image C) would be transposed and stored at the “loading” position of the upper right matrix.

Loading, transposition and storing of the upper-right and bottom-left matrices
 89    /*
 90    * Part 1.2:
 91    * Load 4x4 block of A (Left bottom, top right).
 92    */
 93    mov x4, x0 // A
 94    mov x5, x1 // B
 95
 96    add x4, x4, #16
 97    add x5, x5, #16
 98
 99    ldr q0, [x4]
100    add x4, x4, x2
101    ldr q1, [x4]
102    add x4, x4, x2
103    ldr q2, [x4]
104    add x4, x4, x2
105    ldr q3, [x4]
106
107    // Right top
108    mov x4, x0 // A
109    mov x5, x1 // B
110
111    add x4, x4, #128
112    add x5, x5, #128
113    
114    ldr q12, [x4]
115    add x4, x4, x2
116    ldr q13, [x4]
117    add x4, x4, x2
118    ldr q14, [x4]
119    add x4, x4, x2
120    ldr q15, [x4]
121
122    /*
123    * Part 2.1:
124    * Transpose 4x4 block.
125    */
126    // Left Bottom
127    trn1 v4.4s, v0.4s, v2.4s
128    trn1 v5.4s, v1.4s, v3.4s
129
130    trn2 v6.4s, v0.4s, v2.4s
131    trn2 v7.4s, v1.4s, v3.4s
132
133    // Right Top
134    trn1 v16.4s, v12.4s, v14.4s
135    trn1 v17.4s, v13.4s, v15.4s
136
137    trn2 v18.4s, v12.4s, v14.4s
138    trn2 v19.4s, v13.4s, v15.4s
139
140    /*
141    * Part 2.2:
142    * Transpose 4x4 block.
143    */
144    // Left Bottom
145    zip1 v8.4s, v4.4s, v5.4s    
146    zip1 v9.4s, v6.4s, v7.4s    
147
148    zip2 v10.4s, v4.4s, v5.4s   
149    zip2 v11.4s, v6.4s, v7.4s   
150
151    // Right Top
152    zip1 v20.4s, v16.4s, v17.4s 
153    zip1 v21.4s, v18.4s, v19.4s 
154
155    zip2 v22.4s, v16.4s, v17.4s 
156    zip2 v23.4s, v18.4s, v19.4s
157
158    /*
159    * Part 3:
160    * Store 4x4 block of Submatrix A''' into A''.
161    */
162    // Left Bottom (values from right top)
163    mov x5, x1
164    add x5, x5, #16
165
166    str q20, [x5]
167    add x5, x5, x3
168    str q21, [x5]
169    add x5, x5, x3
170    str q22, [x5]
171    add x5, x5, x3
172    str q23, [x5]
173
174    // Right top (values from left bottom)
175    mov x5, x1
176    add x5, x5, #128
177
178    str q8, [x5]
179    add x5, x5, x3
180    str q9, [x5]
181    add x5, x5, x3
182    str q10, [x5]
183    add x5, x5, x3
184    str q11, [x5]
  4. The bottom right matrix (in the image D) would be transposed and stored at the loading position.

To optimize our first version, we removed the PCS saves and restores for all registers that we did not use, and rewrote the kernel in a more compact, loop-based form.

Optimized second transposition kernel
 40    mov x9, #2 // n loop
 41
 42n_loop:
 43
 44    mov x6, #2 // m loop
 45
 46m_loop:
 47    /*
 48     * Part 1:
 49     * Transpose 4x4 block.
 50     */
 51    mov x7, x4
 52    mov x8, x5
 53
 54    ldr q0, [x7]
 55    add x7, x7, x2
 56
 57    ldr q1, [x7]
 58    add x7, x7, x2
 59
 60    ldr q2, [x7]
 61    add x7, x7, x2
 62
 63    ldr q3, [x7]
 64
 65    /*
 66    * Part 2.1:
 67    * Transpose 4x4 block.
 68    */
 69    trn1 v4.4s, v0.4s, v2.4s
 70    trn1 v5.4s, v1.4s, v3.4s
 71
 72    trn2 v6.4s, v0.4s, v2.4s
 73    trn2 v7.4s, v1.4s, v3.4s
 74
 75    /*
 76    * Part 2.2:
 77    * Transpose 4x4 block.
 78    */
 79    zip1 v16.4s, v4.4s, v5.4s
 80    zip1 v17.4s, v6.4s, v7.4s
 81
 82    zip2 v18.4s, v4.4s, v5.4s
 83    zip2 v19.4s, v6.4s, v7.4s
 84
 85    /*
 86    * Part 3:
 87    * Store 4x4 block of A into B.
 88    */
 89    str q16, [x8]
 90    add x8, x8, x3
 91
 92    str q17, [x8]
 93    add x8, x8, x3
 94
 95    str q18, [x8]
 96    add x8, x8, x3
 97
 98    str q19, [x8]
 99
100    // Jump 4 rows in A
101    add x4, x4, x25
102
103    // Jump 4 columns in B
104    add x5, x5, x27
105
106    sub x6, x6, #1
107    cbnz x6, m_loop
108
109
110    // Restore Pointer for A and B
111    mov x4, x0
112    mov x5, x1
113
114    add x12, x12, x26
115    add x13, x13, x25
116
117    add x4, x4, x12
118    add x5, x5, x13
119
120    sub x9, x9, #1
121    cbnz x9, n_loop

3.7.2 Performance Measuring

We also wanted to know the performance of our transposition kernel, which in our case is determined by the loading and storing of elements, so we measure it as memory throughput.

trans_8_8 performance in GiB/s
 1Benchmarking trans_neon_8_8 performance ...
 2-----------------------------------------------
 3Measuring throughput for transposition in GiB/s
 4Total time (s):   1.26545
 5Data movements per Second:   8.09199e+10
 6Estimated GiB/s:   80.9199 GiB/s
 7-----------------------------------------------
 8
 9Benchmarking v2_trans_neon_8_8 performance ...
10-----------------------------------------------
11Measuring throughput for transposition in GiB/s
12Total time (s):   0.902975
13Data movements per Second:   1.13403e+11
14Estimated GiB/s:   113.403 GiB/s
15-----------------------------------------------

Our results show that our first version transfers about 80 GiB/s. Our optimized version increases this by roughly 33 GiB/s.