3. Neon
3.1 Execution Throughput and Latency
For this task, we benchmarked the execution throughput and latency of FP32 Neon instructions. Specifically, we looked at:
the FMLA (vector) instruction
the FMADD (scalar) instruction
3.1.1 Throughput
As a first step we compared the throughput of:
FMLA (vector) with arrangement specifier 4S
FMLA (vector) with arrangement specifier 2S
FMADD (scalar), single-precision variant
To compare the throughput, we created three assembly programs that each execute a large number of these operations. To get meaningful results, we made sure to avoid any dependencies between the source and destination registers of consecutive operations. The calculations that we ended up with were:
4S
loop:
    .rept 100
    fmla v0.4s, v8.4s, v16.4s
    fmla v1.4s, v9.4s, v17.4s
    fmla v2.4s, v10.4s, v18.4s
    fmla v3.4s, v11.4s, v19.4s
    fmla v4.4s, v12.4s, v20.4s

    fmla v5.4s, v13.4s, v21.4s
    fmla v6.4s, v14.4s, v22.4s
    fmla v7.4s, v15.4s, v23.4s
    fmla v8.4s, v16.4s, v24.4s
    fmla v9.4s, v17.4s, v25.4s

    fmla v10.4s, v18.4s, v26.4s
    fmla v11.4s, v19.4s, v27.4s
    fmla v12.4s, v20.4s, v28.4s
    fmla v13.4s, v21.4s, v29.4s
    fmla v14.4s, v22.4s, v30.4s

    fmla v15.4s, v23.4s, v31.4s
    fmla v16.4s, v24.4s, v0.4s
    fmla v17.4s, v25.4s, v1.4s
    fmla v18.4s, v26.4s, v2.4s
    fmla v19.4s, v27.4s, v3.4s

    fmla v20.4s, v28.4s, v4.4s
    fmla v21.4s, v29.4s, v5.4s
    fmla v22.4s, v30.4s, v6.4s
    fmla v23.4s, v31.4s, v7.4s
    fmla v24.4s, v0.4s, v8.4s

    fmla v25.4s, v1.4s, v9.4s
    fmla v26.4s, v2.4s, v10.4s
    fmla v27.4s, v3.4s, v11.4s
    fmla v28.4s, v4.4s, v12.4s
    fmla v29.4s, v5.4s, v13.4s

    fmla v30.4s, v6.4s, v14.4s
    fmla v31.4s, v7.4s, v15.4s
    .endr
2S
loop:
    .rept 100
    fmla v0.2s, v8.2s, v16.2s
    fmla v1.2s, v9.2s, v17.2s
    fmla v2.2s, v10.2s, v18.2s
    fmla v3.2s, v11.2s, v19.2s
    fmla v4.2s, v12.2s, v20.2s

    fmla v5.2s, v13.2s, v21.2s
    fmla v6.2s, v14.2s, v22.2s
    fmla v7.2s, v15.2s, v23.2s
    fmla v8.2s, v16.2s, v24.2s
    fmla v9.2s, v17.2s, v25.2s

    fmla v10.2s, v18.2s, v26.2s
    fmla v11.2s, v19.2s, v27.2s
    fmla v12.2s, v20.2s, v28.2s
    fmla v13.2s, v21.2s, v29.2s
    fmla v14.2s, v22.2s, v30.2s

    fmla v15.2s, v23.2s, v31.2s
    fmla v16.2s, v24.2s, v0.2s
    fmla v17.2s, v25.2s, v1.2s
    fmla v18.2s, v26.2s, v2.2s
    fmla v19.2s, v27.2s, v3.2s

    fmla v20.2s, v28.2s, v4.2s
    fmla v21.2s, v29.2s, v5.2s
    fmla v22.2s, v30.2s, v6.2s
    fmla v23.2s, v31.2s, v7.2s
    fmla v24.2s, v0.2s, v8.2s

    fmla v25.2s, v1.2s, v9.2s
    fmla v26.2s, v2.2s, v10.2s
    fmla v27.2s, v3.2s, v11.2s
    fmla v28.2s, v4.2s, v12.2s
    fmla v29.2s, v5.2s, v13.2s

    fmla v30.2s, v6.2s, v14.2s
    fmla v31.2s, v7.2s, v15.2s
    .endr
FMADD
loop:
    .rept 100
    fmadd s0, s8, s16, s24
    fmadd s1, s9, s17, s25
    fmadd s2, s10, s18, s26
    fmadd s3, s11, s19, s27
    fmadd s4, s12, s20, s28

    fmadd s5, s13, s21, s29
    fmadd s6, s14, s22, s30
    fmadd s7, s15, s23, s31
    fmadd s8, s16, s24, s0
    fmadd s9, s17, s25, s1

    fmadd s10, s18, s26, s2
    fmadd s11, s19, s27, s3
    fmadd s12, s20, s28, s4
    fmadd s13, s21, s29, s5
    fmadd s14, s22, s30, s6

    fmadd s15, s23, s31, s7
    fmadd s16, s24, s0, s8
    fmadd s17, s25, s1, s9
    fmadd s18, s26, s2, s10
    fmadd s19, s27, s3, s11

    fmadd s20, s28, s4, s12
    fmadd s21, s29, s5, s13
    fmadd s22, s30, s6, s14
    fmadd s23, s31, s7, s15
    fmadd s24, s0, s8, s16

    fmadd s25, s1, s9, s17
    fmadd s26, s2, s10, s18
    fmadd s27, s3, s11, s19
    fmadd s28, s4, s12, s20
    fmadd s29, s5, s13, s21

    fmadd s30, s6, s14, s22
    fmadd s31, s7, s15, s23
    .endr

    subs x0, x0, #1
    b.gt loop
To measure the throughput of these instructions, we developed a C++ microbenchmark. For each instruction we first performed a warm-up, then measured the time, counted the executed operations, and calculated the GFLOPs.
4S
// Warmup
fmla_4s_instr( 100, g_4s_registers );

auto l_start_time = std::chrono::high_resolution_clock::now();
fmla_4s_instr( n, g_4s_registers );
auto l_end_time = std::chrono::high_resolution_clock::now();
elapsedTime = std::chrono::duration_cast<std::chrono::microseconds>( l_end_time - l_start_time ).count() / 1e6;

// per FMLA: 4 Muls, 4 Adds
// 32 fmla
// rept 100
// n: loop iterations
totalOps = (2 * 4) * 32 * 100 * n;
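The reported values then follow directly from these two quantities (a minimal sketch using the variables above; note that the benchmark prints the floating-point operation rate under the label "Instructions per Second"):

double opsPerSecond = totalOps / elapsedTime; // printed as "Instructions per Second"
double gflops       = opsPerSecond / 1e9;     // printed as "Estimated GOPS"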
For the 2S and the FMADD (scalar) instructions, we simply adjusted the operation counts accordingly:
2S
// per FMLA: 2 Muls, 2 Adds
// 32 fmla
// rept 100
// n: loop iterations
totalOps = (2 * 2) * 32 * 100 * n;
FMADD
// per FMADD: 1 Mul, 1 Add
// 32 fmadd
// rept 100
// n: loop iterations
totalOps = (2 * 1) * 32 * 100 * n;
For this benchmarking task we obtained the following results:
Benchmarking FMLA 4s throughput ...
-----------------------------------------------
Measuring throughput for FMLA_4sInstruction
Total time (s): 1.96706
Instructions per Second: 1.30144e+11
Estimated GOPS: 130.144 GFLOPs/sec
-----------------------------------------------

Benchmarking FMLA 2s throughput ...
-----------------------------------------------
Measuring throughput for FMLA_2sInstruction
Total time (s): 2.53647
Instructions per Second: 5.04638e+10
Estimated GOPS: 50.4638 GFLOPs/sec
-----------------------------------------------

Benchmarking FMADD throughput ...
-----------------------------------------------
Measuring throughput for FMADDInstruction
Total time (s): 3.52918
Instructions per Second: 1.81345e+10
Estimated GOPS: 18.1345 GFLOPs/sec
-----------------------------------------------
It can be seen that the FMLA (vector) instruction with arrangement specifier 4S performs approximately 2.6 times as many floating-point operations per second as the FMLA (vector) instruction with arrangement specifier 2S. Further, the FMLA (vector) instruction with arrangement specifier 2S performs approximately 2.8 times as many floating-point operations per second as the FMADD (scalar) instruction. This shows that leveraging data-level parallelism (vector instructions) can yield a much higher throughput than using only scalar operations.
3.1.2 Latency
To measure the execution latency of FMLA (vector) instructions with arrangement specifier 4S, we examined two scenarios:
Each instruction depends on the destination register and one of the source registers of the previous instruction
fmla v0.4s, v0.4s, v1.4s
fmla v0.4s, v0.4s, v2.4s
fmla v0.4s, v0.4s, v3.4s
fmla v0.4s, v0.4s, v4.4s
Each instruction depends only on the destination register of the previous instruction
fmla v0.4s, v1.4s, v9.4s
fmla v0.4s, v2.4s, v10.4s
fmla v0.4s, v3.4s, v11.4s
fmla v0.4s, v4.4s, v12.4s
Both files contain 32 fmla instructions each, which are executed 100 times. The results of our benchmark are shown below:
Benchmarking FMLA 4s source register latency ...
-----------------------------------------------
Measuring latency for FMLA_SourceInstruction
Total time (s): 3.30277
Instructions per Second: 1.16266e+10
Estimated GOPS: 11.6266 GFLOPs/sec
-----------------------------------------------

Benchmarking FMLA 4s destination register latency ...
-----------------------------------------------
Measuring latency for FMLA_DestinationInstruction
Total time (s): 3.30207
Instructions per Second: 1.16291e+10
Estimated GOPS: 11.6291 GFLOPs/sec
-----------------------------------------------
We can see that both scenarios produce similar results, which is why we computed the latency only for the first scenario.
We measured \(1.16266 \times 10^{10}\) floating-point operations per second. Since each FMLA (vector) instruction with arrangement specifier 4S performs 8 floating-point operations, this corresponds to \(1.16266 \times 10^{10} / 8 \approx 1.45 \times 10^{9}\) instructions per second. Because the instructions form a single dependency chain, the latency of the FMLA (vector) instruction with arrangement specifier 4S is approximately \(\frac{1}{1.45 \times 10^{9}} \approx 6.9 \times 10^{-10}\) seconds. Using a known clock frequency of 4.4 GHz, this amounts to \(6.9 \times 10^{-10} \times 4.4 \times 10^{9} \approx 3\) cycles.
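In summary, combining the measured operation rate with the 8 FLOPs performed per 4S FMLA:

\[ \text{latency} \approx \frac{f_{\text{clk}}}{\text{instructions per second}} = \frac{4.4 \times 10^{9}}{1.16266 \times 10^{10} / 8} \approx 3.0 \ \text{cycles} \]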
3.2 Microkernel
For the second task, we implemented a microkernel that executes a matrix multiplication for matrices with the following dimensions:
Matrix A: 16 x 1
Matrix B: 1 x 6
Matrix C: 16 x 6
3.2.1 Neon Microkernel
We developed three different versions of this microkernel in order to optimize its performance.
In the first version we:
Load matrix A (16 x 1)
Load three columns (1 x 1) of matrix B
Load matrix C (16 x 6)
In the second version we:
Load matrix A (16 x 1)
Load one column of matrix B
Load matrix C (16 x 6)
In the third version we:
Load matrix A (16 x 1)
Load one column of matrix B
Load one column of matrix C (16 x 1)
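For reference, all three versions compute the same operation; in scalar C++ it corresponds to the following loop nest (a sketch, assuming column-major matrices with leading dimensions ldb and ldc; since A has only one column, its leading dimension is not needed):

#include <cstdint>

// C (16 x 6) += A (16 x 1) * B (1 x 6), column-major (sketch)
void matmul_16_6_1_ref( float const * i_a, float const * i_b, float * io_c,
                        int64_t i_ldb, int64_t i_ldc ) {
  for ( int64_t l_n = 0; l_n < 6; l_n++ ) {
    for ( int64_t l_m = 0; l_m < 16; l_m++ ) {
      io_c[ l_m + l_n * i_ldc ] += i_a[ l_m ] * i_b[ l_n * i_ldb ];
    }
  }
}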
3.2.2 Testing and Benchmarking
To test and compare our versions with one another we:
implemented a microkernel that would give us a visual indication of the results
implemented a test using Catch2 to test the correctness of our implementations
implemented a microbenchmark that would calculate the GFLOPs for each of the three versions
The GFLOPs were calculated using the following formula, where \(n\) is the number of benchmark iterations: \(\text{GFLOPS} = \frac{2 \cdot 16 \cdot 6 \cdot 1 \cdot n}{\text{elapsedTime} \cdot 10^{9}}\). As in the throughput benchmark, elapsedTime was measured with std::chrono::high_resolution_clock around the loop of kernel invocations.
For each version we performed 50,000 iterations as a warm-up to guarantee similar results for each execution of the benchmark.
Using this approach we obtained the following results:
Benchmarking V1 Matmul throughput ...
-----------------------------------------------
Measuring throughput for Instruction
Total time (s): 2.93595
Instructions per Second: 3.26981e+10
Estimated GFLOPS: 32.6981 GFLOPS/sec
-----------------------------------------------
Benchmarking V2 Matmul throughput ...
-----------------------------------------------
Measuring throughput for Instruction
Total time (s): 2.90291
Instructions per Second: 3.30702e+10
Estimated GFLOPS: 33.0702 GFLOPS/sec
-----------------------------------------------
Benchmarking V3 Matmul throughput ...
-----------------------------------------------
Measuring throughput for Instruction
Total time (s): 2.79132
Instructions per Second: 3.43923e+10
Estimated GFLOPS: 34.3923 GFLOPS/sec
-----------------------------------------------
The GFLOPs results indicate that each version performed slightly better than the previous one, with a difference of about 1.7 GFLOPs between our best and our worst approach.
3.3 Loops
In this task, we had to add loops to the matrix multiplication kernel which we wrote in the previous task. The goal was to enable the 16x6x1 kernel to be used for larger matrices.
The first step was to implement a loop in the K dimension, resulting in a 16x6x64 kernel. The loading and storing of matrix C was left unchanged. The relevant code is shown below:
// K loop counter
mov x6, #64
// set start of A
mov x7, x0
// set start of B
mov x8, x1
// init row count of B
mov x9, #0

_k_loop:
    // load column of A
    ldp q24, q25, [x7] // 4 + 4 values
    ldp q26, q27, [x7, #32] // 4 + 4 values

    // B: COLUMN 0
    ldr s29, [x8]
    fmla v0.4s, v24.4s, v29.s[0]
    fmla v1.4s, v25.4s, v29.s[0]
    fmla v2.4s, v26.4s, v29.s[0]
    fmla v3.4s, v27.4s, v29.s[0]
    // B: COLUMN 1
    add x8, x8, x4
    ldr s29, [x8]
    fmla v4.4s, v24.4s, v29.s[0]
    fmla v5.4s, v25.4s, v29.s[0]
    fmla v6.4s, v26.4s, v29.s[0]
    fmla v7.4s, v27.4s, v29.s[0]
    // B: COLUMN 2
    add x8, x8, x4
    ldr s29, [x8]
    fmla v8.4s, v24.4s, v29.s[0]
    fmla v9.4s, v25.4s, v29.s[0]
    fmla v10.4s, v26.4s, v29.s[0]
    fmla v11.4s, v27.4s, v29.s[0]
    // B: COLUMN 3
    add x8, x8, x4
    ldr s29, [x8]
    fmla v12.4s, v24.4s, v29.s[0]
    fmla v13.4s, v25.4s, v29.s[0]
    fmla v14.4s, v26.4s, v29.s[0]
    fmla v15.4s, v27.4s, v29.s[0]
    // B: COLUMN 4
    add x8, x8, x4
    ldr s29, [x8]
    fmla v16.4s, v24.4s, v29.s[0]
    fmla v17.4s, v25.4s, v29.s[0]
    fmla v18.4s, v26.4s, v29.s[0]
    fmla v19.4s, v27.4s, v29.s[0]
    // B: COLUMN 5
    add x8, x8, x4
    ldr s29, [x8]
    fmla v20.4s, v24.4s, v29.s[0]
    fmla v21.4s, v25.4s, v29.s[0]
    fmla v22.4s, v26.4s, v29.s[0]
    fmla v23.4s, v27.4s, v29.s[0]

    // move to next column of A
    add x7, x7, x3
    // move to next row of B
    mov x8, x1
    add x9, x9, #4
    add x8, x8, x9

    // decrement loop counter
    sub x6, x6, #1
    // check if loop counter is zero
    cbnz x6, _k_loop
The matmul_16_6_1 kernel mostly stayed the same, except that for each K iteration we now need to adjust the pointers to the input matrices A and B. At the end of each iteration, we move the pointer to A to the next column by adding the given stride. In B, we need to move the pointer to the next row. Therefore, we jump by 4 bytes (since we are using 32-bit floats) from the starting address of B. To keep jumping to the next row in each iteration, we accumulate the offset of 4 bytes in the register x9.
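The computation performed by the resulting 16x6x64 kernel corresponds to this scalar reference (a sketch, assuming column-major layout with element-count strides):

#include <cstdint>

// C (16 x 6) += A (16 x 64) * B (64 x 6), column-major (sketch)
void matmul_16_6_64_ref( float const * i_a, float const * i_b, float * io_c,
                         int64_t i_lda, int64_t i_ldb, int64_t i_ldc ) {
  for ( int64_t l_k = 0; l_k < 64; l_k++ ) {     // K loop, as in the assembly
    for ( int64_t l_n = 0; l_n < 6; l_n++ ) {    // the 6 columns of B
      for ( int64_t l_m = 0; l_m < 16; l_m++ ) { // 16 rows, i.e. 4 vectors of 4
        io_c[ l_m + l_n * i_ldc ] += i_a[ l_m + l_k * i_lda ] * i_b[ l_k + l_n * i_ldb ];
      }
    }
  }
}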
The second step was to implement a loop in the M dimension, resulting in a 64x6x64 kernel. To keep the code examples shorter, we exclude the K loop from the code snippets. The relevant code is shown below:
// save base matrix pointers
mov x7, x0 // A
mov x8, x1 // B
mov x9, x2 // C

// M loop counter
mov x11, #4 // 64/16 = 4 blocks

_m_loop:
// ------------------------------------------
// START matmul_16_6_64
// ------------------------------------------

    // LOAD MATRIX C
    mov x12, x9
    // first column
    ldp q0, q1, [x12]
    ldp q2, q3, [x12, #32]
    // second column
    add x12, x12, x5
    ldp q4, q5, [x12]
    ldp q6, q7, [x12, #32]
    // third column
    add x12, x12, x5
    ldp q8, q9, [x12]
    ldp q10, q11, [x12, #32]
    // fourth column
    add x12, x12, x5
    ldp q12, q13, [x12]
    ldp q14, q15, [x12, #32]
    // fifth column
    add x12, x12, x5
    ldp q16, q17, [x12]
    ldp q18, q19, [x12, #32]
    // sixth column
    add x12, x12, x5
    ldp q20, q21, [x12]
    ldp q22, q23, [x12, #32]

    // K loop counter
    mov x14, #64
    // set start of A
    mov x15, x7
    // set start of B
    mov x16, x8
    // init row count of B
    mov x17, #0
_k_loop:
    // ... K loop body as shown above ...
    // check if loop counter is zero
    cbnz x14, _k_loop

    // STORE MATRIX C
    mov x12, x9
    // first column
    stp q0, q1, [x12]
    stp q2, q3, [x12, #32]
    // second column
    add x12, x12, x5
    stp q4, q5, [x12]
    stp q6, q7, [x12, #32]
    // third column
    add x12, x12, x5
    stp q8, q9, [x12]
    stp q10, q11, [x12, #32]
    // fourth column
    add x12, x12, x5
    stp q12, q13, [x12]
    stp q14, q15, [x12, #32]
    // fifth column
    add x12, x12, x5
    stp q16, q17, [x12]
    stp q18, q19, [x12, #32]
    // sixth column
    add x12, x12, x5
    stp q20, q21, [x12]
    stp q22, q23, [x12, #32]

// ------------------------------------------
// END matmul_16_6_64
// ------------------------------------------

    // increase A and C pointers for next block
    add x7, x7, #16*4
    add x9, x9, #16*4

    // decrement m loop counter
    sub x11, x11, #1
    // check if loop counter is zero
    cbnz x11, _m_loop
The M loop needs only 4 iterations, since we extend the kernel from 16 to 64 in the M dimension by dividing the M dimension into 4 blocks of 16 elements. At the end of each M iteration, we move the pointers of A and C to the next block: we jump by 16 elements in the M dimension, which means adding 16 * 4 bytes to the pointers of A and C.
The last step was to implement a loop in the N dimension, resulting in a 64x48x64 kernel. The relevant code is shown below:
// set base matrix pointers
mov x20, x1 // B
mov x21, x2 // C

// N loop counter
mov x19, #8 // 48/6 = 8 blocks

_n_loop:

    // M loop counter
    mov x11, #4 // 64/16 = 4 blocks

    // set matrix pointers
    mov x7, x0 // A
    mov x8, x20 // B
    mov x9, x21 // C

_m_loop:
    // ... M loop body as shown above ...

    // decrement m loop counter
    sub x11, x11, #1
    // check if loop counter is zero
    cbnz x11, _m_loop
// END M LOOP

    // increase B and C pointers for next block
    // (jump 6 columns) 6*x4, 6*x5
    add x20, x20, x22
    add x21, x21, x23

    // decrement n loop counter
    sub x19, x19, #1
    // check if loop counter is zero
    cbnz x19, _n_loop
// END N LOOP
Since we are extending the kernel from 6 to 48 in the N dimension, we need to divide the N dimension into 8 blocks of 6 elements, so the loop has 8 iterations. At the start of each N iteration, it is important to first reset the pointer to A to its original address. After each iteration, we move the pointers of B and C to the next block: we jump by 6 columns in the N dimension, i.e., we add 6 times the respective stride of B and C to the pointers. The overall blocking is sketched below.
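Putting the three loops together, the structure of the 64x48x64 kernel can be summarized as follows (a sketch in C++; matmul_16_6_64 is a hypothetical stand-in for the inner block, including the load of C, the K loop and the store of C, with element-count strides):

#include <cstdint>

extern "C" void matmul_16_6_64( float const *, float const *, float *,
                                int64_t, int64_t, int64_t );

// blocking of the 64x48x64 kernel (sketch)
void matmul_64_48_64_blocked( float const * i_a, float const * i_b, float * io_c,
                              int64_t i_lda, int64_t i_ldb, int64_t i_ldc ) {
  for ( int64_t l_nb = 0; l_nb < 8; l_nb++ ) {   // _n_loop: 48 / 6 = 8 blocks
    for ( int64_t l_mb = 0; l_mb < 4; l_mb++ ) { // _m_loop: 64 / 16 = 4 blocks
      matmul_16_6_64( i_a + l_mb * 16,
                      i_b + l_nb * 6 * i_ldb,
                      io_c + l_mb * 16 + l_nb * 6 * i_ldc,
                      i_lda, i_ldb, i_ldc );
    }
  }
}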
3.3.1 Testing and Benchmarking
For all three kernels we have written unit tests. To execute the tests, one first needs to compile the code by invoking make from within the src/submissions/03_neon/03_loops directory. This creates an executable that can be run with ./build/test.
We also calculated the GFLOPs for each of these matrix multiplications. To calculate them we followed the simple formula \(\text{GFLOPS} = \frac{2 \cdot M \cdot N \cdot K \cdot n}{\text{elapsedTime} \cdot 10^{9}}\), with \(n\) the number of benchmark iterations.
The results that we obtained were:
Benchmarking Matmul_16_6_64 throughput ...
-----------------------------------------------
Measuring throughput for Instruction
Total time (s): 1.89248
Instructions per Second: 1.29861e+11
Estimated GFLOPS: 129.861 GFLOPS/sec
-----------------------------------------------

Benchmarking Matmul_64_6_64 Matmul throughput ...
-----------------------------------------------
Measuring throughput for Instruction
Total time (s): 1.84635
Instructions per Second: 1.33106e+11
Estimated GFLOPS: 133.106 GFLOPS/sec
-----------------------------------------------

Benchmarking Matmul_64_48_64 Matmul throughput ...
-----------------------------------------------
Measuring throughput for Instruction
Total time (s): 1.49743
Instructions per Second: 1.31297e+11
Estimated GFLOPS: 131.297 GFLOPS/sec
-----------------------------------------------
Our results indicate that the number of GFLOPs is very consistent, even when scaling the size of our matrices.
3.4 SIMD Lanes
For this task we were supposed to create two kernels that still function even if the M dimension is not a multiple of 4. We created several versions for both cases:
M=14, N=6, K=64
M=15, N=6, K=64
3.4.1 Matmul_14_6_64
For the case M=14 we considered four different versions:
Our first approach was to use two loops. The first loop was used to calculate a (12 x 64) block of matrix C. That means, we would load 12 column elements of matrix A. The second loop was then used to calculate the remaining (2 x 64) block of matrix C.
_k2_loop:
    // load column of A
    ldr d24, [x7] // 2 values

    // B: COLUMN 0
    ldr s29, [x8]
    fmla v3.2s, v24.2s, v29.s[0]

    // B: COLUMN 1
    add x8, x8, x4
    ldr s29, [x8]
    fmla v7.2s, v24.2s, v29.s[0]

    // B: COLUMN 2
    add x8, x8, x4
    ldr s29, [x8]
    fmla v11.2s, v24.2s, v29.s[0]

    // B: COLUMN 3
    add x8, x8, x4
    ldr s29, [x8]
    fmla v15.2s, v24.2s, v29.s[0]

    // B: COLUMN 4
    add x8, x8, x4
    ldr s29, [x8]
    fmla v19.2s, v24.2s, v29.s[0]

    // B: COLUMN 5
    add x8, x8, x4
    ldr s29, [x8]
    fmla v23.2s, v24.2s, v29.s[0]

    // move to next column of A
    add x7, x7, x3
    // move to next row of B
    mov x8, x1
    add x9, x9, #4
    add x8, x8, x9

    // decrement loop counter
    sub x6, x6, #1
    // check if loop counter is zero
    cbnz x6, _k2_loop
The second approach was to use a single loop. We would load the whole matrix C, and matrix A column-wise using one ldp qXX, qXX, [x7], one ldr qXX, [x7, #32] and one ldr dXX, [x7, #48] instruction.
_k1_loop:
    // load column of A
    ldp q24, q25, [x7] // 4 + 4 values
    ldr q26, [x7, #32] // 4 values
    ldr d27, [x7, #48] // 2 values

    // B: COLUMN 0
    ldr s29, [x8]
    fmla v0.4s, v24.4s, v29.s[0]
    fmla v1.4s, v25.4s, v29.s[0]
    fmla v2.4s, v26.4s, v29.s[0]
    fmla v3.2s, v27.2s, v29.s[0]
    // B: COLUMN 1
    add x8, x8, x4
    ldr s29, [x8]
    fmla v4.4s, v24.4s, v29.s[0]
    fmla v5.4s, v25.4s, v29.s[0]
    fmla v6.4s, v26.4s, v29.s[0]
    fmla v7.2s, v27.2s, v29.s[0]
    // B: COLUMN 2
    add x8, x8, x4
    ldr s29, [x8]
    fmla v8.4s, v24.4s, v29.s[0]
    fmla v9.4s, v25.4s, v29.s[0]
    fmla v10.4s, v26.4s, v29.s[0]
    fmla v11.2s, v27.2s, v29.s[0]
    // B: COLUMN 3
    add x8, x8, x4
    ldr s29, [x8]
    fmla v12.4s, v24.4s, v29.s[0]
    fmla v13.4s, v25.4s, v29.s[0]
    fmla v14.4s, v26.4s, v29.s[0]
    fmla v15.2s, v27.2s, v29.s[0]
    // B: COLUMN 4
    add x8, x8, x4
    ldr s29, [x8]
    fmla v16.4s, v24.4s, v29.s[0]
    fmla v17.4s, v25.4s, v29.s[0]
    fmla v18.4s, v26.4s, v29.s[0]
    fmla v19.2s, v27.2s, v29.s[0]
    // B: COLUMN 5
    add x8, x8, x4
    ldr s29, [x8]
    fmla v20.4s, v24.4s, v29.s[0]
    fmla v21.4s, v25.4s, v29.s[0]
    fmla v22.4s, v26.4s, v29.s[0]
    fmla v23.2s, v27.2s, v29.s[0]

    // move to next column of A
    add x7, x7, x3
    // move to next row of B
    mov x8, x1
    add x9, x9, #4
    add x8, x8, x9

    // decrement loop counter
    sub x6, x6, #1
    // check if loop counter is zero
    cbnz x6, _k1_loop
Our third approach was again to use a single loop, but this time we load the column of matrix A using two ldp qXX, qXX, [x7] instructions and then set the last two elements to zero using mov v27.s[2], wzr and mov v27.s[3], wzr.
ldp loads
_k1_loop:
    // load column of A
    ldp q24, q25, [x7] // 4 + 4 values
    ldp q26, q27, [x7, #32] // 4 + 4 values - possible out-of-bounds read
    mov v27.s[2], wzr
    mov v27.s[3], wzr

    // B: COLUMN 0
    ldr s29, [x8]

    fmla v0.4s, v24.4s, v29.s[0]
    fmla v1.4s, v25.4s, v29.s[0]
    fmla v2.4s, v26.4s, v29.s[0]
    fmla v3.4s, v27.4s, v29.s[0]
    // B: COLUMN 1
    add x8, x8, x4
    ldr s29, [x8]
    fmla v4.4s, v24.4s, v29.s[0]
    fmla v5.4s, v25.4s, v29.s[0]
    fmla v6.4s, v26.4s, v29.s[0]
    fmla v7.4s, v27.4s, v29.s[0]
    // B: COLUMN 2
    add x8, x8, x4
    ldr s29, [x8]
    fmla v8.4s, v24.4s, v29.s[0]
    fmla v9.4s, v25.4s, v29.s[0]
    fmla v10.4s, v26.4s, v29.s[0]
    fmla v11.4s, v27.4s, v29.s[0]
    // B: COLUMN 3
    add x8, x8, x4
    ldr s29, [x8]
    fmla v12.4s, v24.4s, v29.s[0]
    fmla v13.4s, v25.4s, v29.s[0]
    fmla v14.4s, v26.4s, v29.s[0]
    fmla v15.4s, v27.4s, v29.s[0]
    // B: COLUMN 4
    add x8, x8, x4
    ldr s29, [x8]
    fmla v16.4s, v24.4s, v29.s[0]
    fmla v17.4s, v25.4s, v29.s[0]
    fmla v18.4s, v26.4s, v29.s[0]
    fmla v19.4s, v27.4s, v29.s[0]
    // B: COLUMN 5
    add x8, x8, x4
    ldr s29, [x8]
    fmla v20.4s, v24.4s, v29.s[0]
    fmla v21.4s, v25.4s, v29.s[0]
    fmla v22.4s, v26.4s, v29.s[0]
    fmla v23.4s, v27.4s, v29.s[0]

    // move to next column of A
    add x7, x7, x3
    // move to next row of B
    mov x8, x1
    add x9, x9, #4
    add x8, x8, x9

    // decrement loop counter
    sub x6, x6, #1
    // check if loop counter is zero
    cbnz x6, _k1_loop
In our fourth approach we simply copied the second version and changed our loads for matrix A and C, using ld1 instead of ldp.
ld1 loads
_k1_loop:
    // load column of A
    ld1 {v24.4s-v27.4s}, [x7] // 16 values - reads 2 past the column
    ldr d27, [x7, #48] // 2 values

    // B: COLUMN 0
    ldr s29, [x8]
    fmla v0.4s, v24.4s, v29.s[0]
    fmla v1.4s, v25.4s, v29.s[0]
    fmla v2.4s, v26.4s, v29.s[0]
    fmla v3.2s, v27.2s, v29.s[0]
    // B: COLUMN 1
    add x8, x8, x4
    ldr s29, [x8]
    fmla v4.4s, v24.4s, v29.s[0]
    fmla v5.4s, v25.4s, v29.s[0]
    fmla v6.4s, v26.4s, v29.s[0]
    fmla v7.2s, v27.2s, v29.s[0]
    // B: COLUMN 2
    add x8, x8, x4
    ldr s29, [x8]
    fmla v8.4s, v24.4s, v29.s[0]
    fmla v9.4s, v25.4s, v29.s[0]
    fmla v10.4s, v26.4s, v29.s[0]
    fmla v11.2s, v27.2s, v29.s[0]
    // B: COLUMN 3
    add x8, x8, x4
    ldr s29, [x8]
    fmla v12.4s, v24.4s, v29.s[0]
    fmla v13.4s, v25.4s, v29.s[0]
    fmla v14.4s, v26.4s, v29.s[0]
    fmla v15.2s, v27.2s, v29.s[0]
    // B: COLUMN 4
    add x8, x8, x4
    ldr s29, [x8]
    fmla v16.4s, v24.4s, v29.s[0]
    fmla v17.4s, v25.4s, v29.s[0]
    fmla v18.4s, v26.4s, v29.s[0]
    fmla v19.2s, v27.2s, v29.s[0]
    // B: COLUMN 5
    add x8, x8, x4
    ldr s29, [x8]
    fmla v20.4s, v24.4s, v29.s[0]
    fmla v21.4s, v25.4s, v29.s[0]
    fmla v22.4s, v26.4s, v29.s[0]
    fmla v23.2s, v27.2s, v29.s[0]

    // move to next column of A
    add x7, x7, x3
    // move to next row of B
    mov x8, x1
    add x9, x9, #4
    add x8, x8, x9

    // decrement loop counter
    sub x6, x6, #1
    // check if loop counter is zero
    cbnz x6, _k1_loop
When benchmarking our approaches we obtained the following results:
Benchmarking V1_Matmul_14_6_64 throughput ...
-----------------------------------------------
Measuring throughput for Instruction
Total time (s): 2.54314
Instructions per Second: 8.45568e+10
Estimated GFLOPS: 84.5568 GFLOPS/sec
-----------------------------------------------

Benchmarking V2_Matmul_14_6_64 throughput ...
-----------------------------------------------
Measuring throughput for Instruction
Total time (s): 1.88243
Instructions per Second: 1.14235e+11
Estimated GFLOPS: 114.235 GFLOPS/sec
-----------------------------------------------

Benchmarking V3_Matmul_14_6_64 throughput ...
-----------------------------------------------
Measuring throughput for Instruction
Total time (s): 2.08544
Instructions per Second: 1.03115e+11
Estimated GFLOPS: 103.115 GFLOPS/sec
-----------------------------------------------

Benchmarking V4_Matmul_14_6_64 throughput ...
-----------------------------------------------
Measuring throughput for Instruction
Total time (s): 1.89444
Instructions per Second: 1.13511e+11
Estimated GFLOPS: 113.511 GFLOPS/sec
-----------------------------------------------
The results indicate that the version with the three different loads (V2) performed best, about 30 GFLOPs above the two-loop version and about 10 GFLOPs above the variant that zeroes the upper lanes (V3). The switch from ldp to ld1, however, did not show any significant change in the number of GFLOPs.
3.4.2 Matmul_15_6_64
For the case M=15 we considered three different versions:
For our first approach we again considered two loops. Again, the first loop was used to calculate a (12 x 64) block of matrix C. The second loop was then used to calculate the remaining (3 x 64) block of matrix C.
_k2_loop:
    // load column of A
    ldr d24, [x7] // 2 values
    ldr s25, [x7, #8]

    // B: COLUMN 0
    ldr s29, [x8]
    fmla v0.2s, v24.2s, v29.s[0]
    fmadd s1, s25, s29, s1
    // B: COLUMN 1
    add x8, x8, x4
    ldr s29, [x8]
    fmla v2.2s, v24.2s, v29.s[0]
    fmadd s3, s25, s29, s3
    // B: COLUMN 2
    add x8, x8, x4
    ldr s29, [x8]
    fmla v4.2s, v24.2s, v29.s[0]
    fmadd s5, s25, s29, s5
    // B: COLUMN 3
    add x8, x8, x4
    ldr s29, [x8]
    fmla v6.2s, v24.2s, v29.s[0]
    fmadd s7, s25, s29, s7
    // B: COLUMN 4
    add x8, x8, x4
    ldr s29, [x8]
    fmla v8.2s, v24.2s, v29.s[0]
    fmadd s9, s25, s29, s9
    // B: COLUMN 5
    add x8, x8, x4
    ldr s29, [x8]
    fmla v10.2s, v24.2s, v29.s[0]
    fmadd s11, s25, s29, s11

    // move to next column of A
    add x7, x7, x3
    // move to next row of B
    mov x8, x1
    add x9, x9, #4
    add x8, x8, x9

    // decrement loop counter
    sub x6, x6, #1
    // check if loop counter is zero
    cbnz x6, _k2_loop
In the second approach we use one loop. We load matrix A column-wise using two ldp qXX, qXX, [x7] instructions and then set the last element to zero using mov v27.s[3], wzr.
ldp loads
_k1_loop:
    // load column of A
    ldp q24, q25, [x7] // 4 + 4 values
    ldp q26, q27, [x7, #32] // 4 + 4 values - possible out-of-bounds read
    mov v27.s[3], wzr

    // B: COLUMN 0
    ldr s29, [x8]

    fmla v0.4s, v24.4s, v29.s[0]
    fmla v1.4s, v25.4s, v29.s[0]
    fmla v2.4s, v26.4s, v29.s[0]
    fmla v3.4s, v27.4s, v29.s[0]
    // B: COLUMN 1
    add x8, x8, x4
    ldr s29, [x8]
    fmla v4.4s, v24.4s, v29.s[0]
    fmla v5.4s, v25.4s, v29.s[0]
    fmla v6.4s, v26.4s, v29.s[0]
    fmla v7.4s, v27.4s, v29.s[0]
    // B: COLUMN 2
    add x8, x8, x4
    ldr s29, [x8]
    fmla v8.4s, v24.4s, v29.s[0]
    fmla v9.4s, v25.4s, v29.s[0]
    fmla v10.4s, v26.4s, v29.s[0]
    fmla v11.4s, v27.4s, v29.s[0]
    // B: COLUMN 3
    add x8, x8, x4
    ldr s29, [x8]
    fmla v12.4s, v24.4s, v29.s[0]
    fmla v13.4s, v25.4s, v29.s[0]
    fmla v14.4s, v26.4s, v29.s[0]
    fmla v15.4s, v27.4s, v29.s[0]
    // B: COLUMN 4
    add x8, x8, x4
    ldr s29, [x8]
    fmla v16.4s, v24.4s, v29.s[0]
    fmla v17.4s, v25.4s, v29.s[0]
    fmla v18.4s, v26.4s, v29.s[0]
    fmla v19.4s, v27.4s, v29.s[0]
    // B: COLUMN 5
    add x8, x8, x4
    ldr s29, [x8]
    fmla v20.4s, v24.4s, v29.s[0]
    fmla v21.4s, v25.4s, v29.s[0]
    fmla v22.4s, v26.4s, v29.s[0]
    fmla v23.4s, v27.4s, v29.s[0]

    // move to next column of A
    add x7, x7, x3
    // move to next row of B
    mov x8, x1
    add x9, x9, #4
    add x8, x8, x9

    // decrement loop counter
    sub x6, x6, #1
    // check if loop counter is zero
    cbnz x6, _k1_loop
In the third approach we again changed the load instructions from ldp to ld1.
ld1 loads
_k1_loop:
    // load column of A
    ld1 {v24.4s-v27.4s}, [x7] // 16 values - reads 1 past the column
    mov v27.s[3], wzr

    // B: COLUMN 0
    ldr s29, [x8]

    fmla v0.4s, v24.4s, v29.s[0]
    fmla v1.4s, v25.4s, v29.s[0]
    fmla v2.4s, v26.4s, v29.s[0]
    fmla v3.4s, v27.4s, v29.s[0]
    // B: COLUMN 1
    add x8, x8, x4
    ldr s29, [x8]
    fmla v4.4s, v24.4s, v29.s[0]
    fmla v5.4s, v25.4s, v29.s[0]
    fmla v6.4s, v26.4s, v29.s[0]
    fmla v7.4s, v27.4s, v29.s[0]
    // B: COLUMN 2
    add x8, x8, x4
    ldr s29, [x8]
    fmla v8.4s, v24.4s, v29.s[0]
    fmla v9.4s, v25.4s, v29.s[0]
    fmla v10.4s, v26.4s, v29.s[0]
    fmla v11.4s, v27.4s, v29.s[0]
    // B: COLUMN 3
    add x8, x8, x4
    ldr s29, [x8]
    fmla v12.4s, v24.4s, v29.s[0]
    fmla v13.4s, v25.4s, v29.s[0]
    fmla v14.4s, v26.4s, v29.s[0]
    fmla v15.4s, v27.4s, v29.s[0]
    // B: COLUMN 4
    add x8, x8, x4
    ldr s29, [x8]
    fmla v16.4s, v24.4s, v29.s[0]
    fmla v17.4s, v25.4s, v29.s[0]
    fmla v18.4s, v26.4s, v29.s[0]
    fmla v19.4s, v27.4s, v29.s[0]
    // B: COLUMN 5
    add x8, x8, x4
    ldr s29, [x8]
    fmla v20.4s, v24.4s, v29.s[0]
    fmla v21.4s, v25.4s, v29.s[0]
    fmla v22.4s, v26.4s, v29.s[0]
    fmla v23.4s, v27.4s, v29.s[0]

    // move to next column of A
    add x7, x7, x3
    // move to next row of B
    mov x8, x1
    add x9, x9, #4
    add x8, x8, x9

    // decrement loop counter
    sub x6, x6, #1
    // check if loop counter is zero
    cbnz x6, _k1_loop
Again, we performed some benchmarks:
Benchmarking V1_Matmul_15_6_64 throughput ...
-----------------------------------------------
Measuring throughput for Instruction
Total time (s): 2.80305
Instructions per Second: 8.21963e+10
Estimated GFLOPS: 82.1963 GFLOPS/sec
-----------------------------------------------

Benchmarking V2_Matmul_15_6_64 throughput ...
-----------------------------------------------
Measuring throughput for Instruction
Total time (s): 2.18569
Instructions per Second: 1.05413e+11
Estimated GFLOPS: 105.413 GFLOPS/sec
-----------------------------------------------

Benchmarking V3_Matmul_15_6_64 throughput ...
-----------------------------------------------
Measuring throughput for Instruction
Total time (s): 2.18658
Instructions per Second: 1.0537e+11
Estimated GFLOPS: 105.37 GFLOPS/sec
-----------------------------------------------
Similar to the benchmarks for matmul_14_6_64, the single-loop approaches perform much better than the two-loop approach; this time we even gain about 23 GFLOPs.
3.4.3 Generic Approach
Simply as a proof of concept, we also implemented a generic approach for the matmul_14_6_64 and matmul_15_6_64 kernels. This kernel works for any M > 0. The idea is to write specific kernels for M = 1, 2, ..., 8. We then divide M by 8 (shift right by 3) and use the quotient to loop around the M = 8 kernel. Basically, we split the M dimension into blocks of 8 elements and compute the result using a matmul_8_6_64 kernel. If there is a remainder, it lies between 1 and 7, and we handle it with the corresponding specific kernel, selected via a jump table; a sketch of this dispatch is shown below.
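A C++ sketch of the dispatch (the function names and the kernel table are hypothetical stand-ins for the actual assembly implementation, which uses a jump table):

#include <cstdint>

// specialized kernels, implemented in assembly
extern "C" void matmul_8_6_64( float const *, float const *, float *,
                               int64_t, int64_t, int64_t );
using kernel_t = void (*)( float const *, float const *, float *,
                           int64_t, int64_t, int64_t );
// index r selects the kernel for an r-row remainder, r = 1..7
extern kernel_t g_remainder_kernels[8];

void matmul_m_6_64( float const * i_a, float const * i_b, float * io_c,
                    int64_t i_lda, int64_t i_ldb, int64_t i_ldc, int64_t i_m ) {
  int64_t l_blocks = i_m >> 3; // M / 8 full blocks of 8 rows
  int64_t l_rem    = i_m & 7;  // remainder, 0..7
  for ( int64_t l_i = 0; l_i < l_blocks; l_i++ ) {
    matmul_8_6_64( i_a + l_i * 8, i_b, io_c + l_i * 8, i_lda, i_ldb, i_ldc );
  }
  if ( l_rem > 0 ) {
    g_remainder_kernels[ l_rem ]( i_a + l_blocks * 8, i_b,
                                  io_c + l_blocks * 8, i_lda, i_ldb, i_ldc );
  }
}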
We also benchmarked the performance of this generic kernel:
matmul_M_6_64 (M = 14):
Benchmarking Matmul_M_6_64 M=14 throughput ...
-----------------------------------------------
Measuring throughput for Instruction
Total time (s): 2.4066
Instructions per Second: 8.93543e+10
Estimated GFLOPS: 89.3543 GFLOPS/sec
-----------------------------------------------

matmul_M_6_64 (M = 15):
Benchmarking Matmul_M_6_64 M=15 throughput ...
-----------------------------------------------
Measuring throughput for Instruction
Total time (s): 3.68161
Instructions per Second: 6.25814e+10
Estimated GFLOPS: 62.5814 GFLOPS/sec
-----------------------------------------------
Compared to our other approaches, the obtained GFLOPs are somewhat worse: we lose about 25 GFLOPs against our best approach for matmul_14_6_64 and about 43 GFLOPs against our best approach for matmul_15_6_64.
3.5 Accumulator Block Shapes
In this task we were supposed to implement a microkernel that computes C+=AB for M=64, N=64 and K=64. Recalling our matmul_64_48_64 kernel, we only need to change the N dimension to 64. This kernel internally uses the matmul_16_6_64 kernel, which we changed to matmul_16_4_64. Changing N from 6 to 4 allows us to divide the N dimension into 16 blocks of 4 elements. N = 8 was not suitable, as we ran into issues with the number of available SIMD registers. We do not think it is necessary to show the code for this kernel, as it is very similar to the matmul_64_48_64 kernel. The only difference is that we removed the logic for 2 of the 6 columns and increased the loop counter constant. The register budget behind this choice is sketched below.
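A quick register count explains the choice (our own reasoning, following the accumulator layout of our kernels): a 16 x N accumulator block occupies \(4N\) vector registers, plus 4 registers for the loaded column of A and 1 for the broadcast element of B:

\[ r(N) = 4N + 5, \qquad r(4) = 21, \quad r(6) = 29, \quad r(8) = 37 > 32 \]

With N = 8 we would exceed the 32 available Neon registers, while N = 4 (and N = 6) fits.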
Benchmarking this kernel we obtained the following results:
Benchmarking V1 Matmul throughput ...
-----------------------------------------------
Measuring throughput for Instruction
Total time (s): 1.27442
Instructions per Second: 1.23418e+11
Estimated GFLOPS: 123.418 GFLOPS/sec
-----------------------------------------------

Benchmarking V2 Matmul throughput ...
-----------------------------------------------
Measuring throughput for Instruction
Total time (s): 1.246
Instructions per Second: 1.26233e+11
Estimated GFLOPS: 126.233 GFLOPS/sec
-----------------------------------------------
V1 is the first version which we obtained by converting our best performing matmul_64_48_64
kernel. Trying to squeeze out more performance, we made some minor changes to the computations of the strides (as shown below). We also removed loads and stores of callee-saved registers that were not used. This resulted in a performance increase of about 2-3 GFLOPs in V2 across multiple runs.
V1:
// multiply strides with float size
mov x6, #4
mul x3, x3, x6 // lda
mul x4, x4, x6 // ldb
mul x5, x5, x6 // ldc

mov x6, #4
mul x22, x4, x6 // ldb * 4 columns
mul x23, x5, x6 // ldc * 4 columns

V2:
// multiply strides with float size
// *4 = lsl #2
lsl x3, x3, #2 // lda
lsl x4, x4, #2 // ldb
lsl x5, x5, #2 // ldc

lsl x22, x4, #2 // ldb * 4 columns
lsl x23, x5, #2 // ldc * 4 columns
3.6 Batch-Reduce GEMM
Based on the previous tasks, we are now implementing a batch-reduce GEMM kernel. The goal is to implement a kernel that computes \(C+=\sum_i A_i B_i\) for M=64, N=48 and K=64 matrices. The kernel should be able to handle batches of matrices. For now we are only implementing the case where the batch size is 16.
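For reference, the computation corresponds to the following scalar loop nest (a sketch; br_stride_a and br_stride_b are the strides between consecutive batch matrices, here in elements rather than bytes):

#include <cstdint>

// C (64 x 48) += sum_i A_i (64 x 64) * B_i (64 x 48), column-major (sketch)
void brgemm_ref( float const * i_a, float const * i_b, float * io_c,
                 int64_t i_lda, int64_t i_ldb, int64_t i_ldc,
                 int64_t i_br_stride_a, int64_t i_br_stride_b,
                 int64_t i_batch ) {
  for ( int64_t l_i = 0; l_i < i_batch; l_i++ ) {
    for ( int64_t l_n = 0; l_n < 48; l_n++ ) {
      for ( int64_t l_m = 0; l_m < 64; l_m++ ) {
        for ( int64_t l_k = 0; l_k < 64; l_k++ ) {
          io_c[ l_m + l_n * i_ldc ] +=
              i_a[ l_i * i_br_stride_a + l_m + l_k * i_lda ] *
              i_b[ l_i * i_br_stride_b + l_k + l_n * i_ldb ];
        }
      }
    }
  }
}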
Similar to the previous tasks we implemented several versions of this kernel to optimize the performance.
In our first version we simply used our matmul_64_48_64 kernel from our loops task and looped 16 times around that kernel. The key points we needed to consider were the batch loop itself and the pointer adjustments at the end of each batch iteration:
// Batch counter
mov x24, #16

_n_batch:
    // ... matmul_64_48_64 kernel body ...

    // next A matrix
    add x0, x0, x6 // A
    mov x8, x0 // A

    // next B matrix
    add x1, x1, x7 // B
    mov x20, x1 // B

    // restore pointer for matrix C
    mov x21, x2 // C
    mov x10, x21 // C

    sub x24, x24, #1

    cbnz x24, _n_batch
In our second version we made some optimizations to the kernel. The changes we made were:

Replacing MUL's with LSL's:
// multiply strides with float size
lsl x3, x3, #2 // lda in bytes
lsl x4, x4, #2 // ldb in bytes
lsl x5, x5, #2 // ldc in bytes
lsl x6, x6, #2 // br_stride_a in bytes
lsl x7, x7, #2 // br_stride_b in bytes

Replacing LDP's with LD1's and STP's with ST1's:
// LOAD MATRIX C
mov x12, x10
// first column
ld1 {v0.4s, v1.4s, v2.4s, v3.4s}, [x12]
// second column
add x12, x12, x5
ld1 {v4.4s, v5.4s, v6.4s, v7.4s}, [x12]
// third column
add x12, x12, x5
ld1 {v8.4s, v9.4s, v10.4s, v11.4s}, [x12]
// fourth column
add x12, x12, x5
ld1 {v12.4s, v13.4s, v14.4s, v15.4s}, [x12]
// fifth column
add x12, x12, x5
ld1 {v16.4s, v17.4s, v18.4s, v19.4s}, [x12]
// sixth column
add x12, x12, x5
ld1 {v20.4s, v21.4s, v22.4s, v23.4s}, [x12]
These optimizations resulted in a performance increase of about 3 GFLOPs.
Benchmarking V1_Matmul_64_48_64_16 throughput ...
-----------------------------------------------
Measuring throughput for Instruction
Total time (s): 1.22446
Instructions per Second: 1.28453e+11
Estimated GFLOPS: 128.453 GFLOPS/sec
-----------------------------------------------

Benchmarking V2_Matmul_64_48_64_16 throughput ...
-----------------------------------------------
Measuring throughput for Instruction
Total time (s): 1.19946
Instructions per Second: 1.31131e+11
Estimated GFLOPS: 131.131 GFLOPS/sec
-----------------------------------------------
3.7 Transposition
In this task we looked at how to transpose a given 8x8 matrix using assembly. Our approach was to first look at the 4x4 case.
3.7.1 Transposition Implementation
We first load all 4 columns of matrix A using ldr qX, [x0] to have the whole matrix in our registers. The second step is to transpose the matrix:
trans_4_4 implementation
/*
 * Part 2.1:
 * Transpose 4x4 block.
 */
trn1 v4.4s, v0.4s, v2.4s
trn1 v5.4s, v1.4s, v3.4s

trn2 v6.4s, v0.4s, v2.4s
trn2 v7.4s, v1.4s, v3.4s

/*
 * Part 2.2:
 * Transpose 4x4 block.
 */
zip1 v8.4s, v4.4s, v5.4s // B "column" 0
zip1 v9.4s, v6.4s, v7.4s // B "column" 1

zip2 v10.4s, v4.4s, v5.4s // B "column" 2
zip2 v11.4s, v6.4s, v7.4s // B "column" 3
The idea of trn1 and trn2 is to rearrange the elements of the column pairs in such a way that we can then leverage the new structure using zip1 and zip2.
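The same 4x4 transpose can also be written with Neon intrinsics, which makes the data movement easier to follow (a sketch; the function name is ours, and the four inputs correspond to the four loaded columns in v0-v3):

#include <arm_neon.h>

// Transpose a 4x4 block given as four column vectors (sketch).
// i_c0..i_c3 hold columns 0..3 of A; o_rows receives rows 0..3 of A,
// i.e. the columns of the transposed matrix B.
void trans_4_4_intrinsics( float32x4_t i_c0, float32x4_t i_c1,
                           float32x4_t i_c2, float32x4_t i_c3,
                           float32x4_t o_rows[4] ) {
  // trn1/trn2 interleave the even/odd lanes of the column pairs
  float32x4_t l_t0 = vtrn1q_f32( i_c0, i_c2 ); // A00 A02 A20 A22
  float32x4_t l_t1 = vtrn1q_f32( i_c1, i_c3 ); // A01 A03 A21 A23
  float32x4_t l_t2 = vtrn2q_f32( i_c0, i_c2 ); // A10 A12 A30 A32
  float32x4_t l_t3 = vtrn2q_f32( i_c1, i_c3 ); // A11 A13 A31 A33
  // zip1/zip2 merge the halves into the rows of A
  o_rows[0] = vzip1q_f32( l_t0, l_t1 ); // A00 A01 A02 A03
  o_rows[1] = vzip1q_f32( l_t2, l_t3 ); // A10 A11 A12 A13
  o_rows[2] = vzip2q_f32( l_t0, l_t1 ); // A20 A21 A22 A23
  o_rows[3] = vzip2q_f32( l_t2, l_t3 ); // A30 A31 A32 A33
}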
Now, looking at the 8x8 matrix, our initial approach is very simple:

(Figure: the 8x8 matrix divided into four 4x4 blocks, labeled A, B, C and D.)

We separate the transposition task into 4 subtasks. Each of the four 4x4 submatrices is transposed using our trans_4_4 kernel:
The upper left matrix (block A in the figure) is transposed and stored at its loading position.
The upper right matrix (block B in the figure) is transposed and stored at the “loading” position of the bottom left matrix.
The bottom left matrix (block C in the figure) is transposed and stored at the “loading” position of the upper right matrix.
/*
 * Part 1.2:
 * Load 4x4 block of A (Left bottom, top right).
 */
mov x4, x0 // A
mov x5, x1 // B

add x4, x4, #16
add x5, x5, #16

ldr q0, [x4]
add x4, x4, x2
ldr q1, [x4]
add x4, x4, x2
ldr q2, [x4]
add x4, x4, x2
ldr q3, [x4]

// Right top
mov x4, x0 // A
mov x5, x1 // B

add x4, x4, #128
add x5, x5, #128

ldr q12, [x4]
add x4, x4, x2
ldr q13, [x4]
add x4, x4, x2
ldr q14, [x4]
add x4, x4, x2
ldr q15, [x4]

/*
 * Part 2.1:
 * Transpose 4x4 block.
 */
// Left Bottom
trn1 v4.4s, v0.4s, v2.4s
trn1 v5.4s, v1.4s, v3.4s

trn2 v6.4s, v0.4s, v2.4s
trn2 v7.4s, v1.4s, v3.4s

// Right Top
trn1 v16.4s, v12.4s, v14.4s
trn1 v17.4s, v13.4s, v15.4s

trn2 v18.4s, v12.4s, v14.4s
trn2 v19.4s, v13.4s, v15.4s

/*
 * Part 2.2:
 * Transpose 4x4 block.
 */
// Left Bottom
zip1 v8.4s, v4.4s, v5.4s
zip1 v9.4s, v6.4s, v7.4s

zip2 v10.4s, v4.4s, v5.4s
zip2 v11.4s, v6.4s, v7.4s

// Right Top
zip1 v20.4s, v16.4s, v17.4s
zip1 v21.4s, v18.4s, v19.4s

zip2 v22.4s, v16.4s, v17.4s
zip2 v23.4s, v18.4s, v19.4s

/*
 * Part 3:
 * Store 4x4 block of Submatrix A''' into A''.
 */
// Left Bottom (values from right top)
mov x5, x1
add x5, x5, #16

str q20, [x5]
add x5, x5, x3
str q21, [x5]
add x5, x5, x3
str q22, [x5]
add x5, x5, x3
str q23, [x5]

// Right top (values from left bottom)
mov x5, x1
add x5, x5, #128

str q8, [x5]
add x5, x5, x3
str q9, [x5]
add x5, x5, x3
str q10, [x5]
add x5, x5, x3
str q11, [x5]
The bottom right matrix (block D in the figure) is transposed and stored at its loading position.
To optimize our first version we removed the procedure-call-standard (PCS) saves and restores for all registers that we did not use. We also wrote our kernel in a more compact manner:
mov x9, #2 // n loop

n_loop:

    mov x6, #2 // m loop

m_loop:
    /*
     * Part 1:
     * Load 4x4 block of A.
     */
    mov x7, x4
    mov x8, x5

    ldr q0, [x7]
    add x7, x7, x2

    ldr q1, [x7]
    add x7, x7, x2

    ldr q2, [x7]
    add x7, x7, x2

    ldr q3, [x7]

    /*
     * Part 2.1:
     * Transpose 4x4 block.
     */
    trn1 v4.4s, v0.4s, v2.4s
    trn1 v5.4s, v1.4s, v3.4s

    trn2 v6.4s, v0.4s, v2.4s
    trn2 v7.4s, v1.4s, v3.4s

    /*
     * Part 2.2:
     * Transpose 4x4 block.
     */
    zip1 v16.4s, v4.4s, v5.4s
    zip1 v17.4s, v6.4s, v7.4s

    zip2 v18.4s, v4.4s, v5.4s
    zip2 v19.4s, v6.4s, v7.4s

    /*
     * Part 3:
     * Store 4x4 block of A into B.
     */
    str q16, [x8]
    add x8, x8, x3

    str q17, [x8]
    add x8, x8, x3

    str q18, [x8]
    add x8, x8, x3

    str q19, [x8]

    // Jump 4 rows in A
    add x4, x4, x25

    // Jump 4 columns in B
    add x5, x5, x27

    sub x6, x6, #1
    cbnz x6, m_loop


    // Restore pointers for A and B
    mov x4, x0
    mov x5, x1

    add x12, x12, x26
    add x13, x13, x25

    add x4, x4, x12
    add x5, x5, x13

    sub x9, x9, #1
    cbnz x9, n_loop
3.7.2 Performance Measuring
We also wanted to know the performance of our transposition kernel, which in our case is determined by the loading and storing of elements.
trans_8_8 performance in GiB/s:
Benchmarking trans_neon_8_8 performance ...
-----------------------------------------------
Measuring throughput for transposition in GiB/s
Total time (s): 1.26545
Data movements per Second: 8.09199e+10
Estimated GiB/s: 80.9199 GiB/s
-----------------------------------------------

Benchmarking v2_trans_neon_8_8 performance ...
-----------------------------------------------
Measuring throughput for transposition in GiB/s
Total time (s): 0.902975
Data movements per Second: 1.13403e+11
Estimated GiB/s: 113.403 GiB/s
-----------------------------------------------
Our results show that for our first version we transfer about 80 GiB/s. Our optimized version increases this performance by roughly 33 GiB/s.