Description
Feature
For example, when subtracting two mathematical vectors with 3 elements each, like so:
function u0:0(i64 sret, i64, i64) system_v {
block0(v0: i64, v1: i64, v2: i64):
v3 = load.f64 notrap v1
v4 = load.f64 notrap v2
v6 = load.f64 notrap v1+8
v7 = load.f64 notrap v2+8
v9 = load.f64 notrap v1+16
v10 = load.f64 notrap v2+16
v5 = fsub v3, v4
store notrap v5, v0
v8 = fsub v6, v7
store notrap v8, v0+8
v11 = fsub v9, v10
store notrap v11, v0+16
return
}
Cranelift generates 6 load instructions followed by 3 pairs of sub + store:
0000000000000000 <sub>:
0: 55 push rbp
1: 48 89 e5 mov rbp,rsp
4: f2 0f 10 3e movsd xmm7,QWORD PTR [rsi]
8: f2 0f 10 2a movsd xmm5,QWORD PTR [rdx]
c: f2 0f 10 46 08 movsd xmm0,QWORD PTR [rsi+0x8]
11: f2 0f 10 72 08 movsd xmm6,QWORD PTR [rdx+0x8]
16: f2 0f 10 4e 10 movsd xmm1,QWORD PTR [rsi+0x10]
1b: f2 0f 10 52 10 movsd xmm2,QWORD PTR [rdx+0x10]
20: f2 0f 5c fd subsd xmm7,xmm5
24: f2 0f 11 3f movsd QWORD PTR [rdi],xmm7
28: f2 0f 5c c6 subsd xmm0,xmm6
2c: f2 0f 11 47 08 movsd QWORD PTR [rdi+0x8],xmm0
31: f2 0f 5c ca subsd xmm1,xmm2
35: f2 0f 11 4f 10 movsd QWORD PTR [rdi+0x10],xmm1
3a: 48 89 f8 mov rax,rdi
3d: 48 89 ec mov rsp,rbp
40: 5d pop rbp
41: c3 ret
while LLVM is able to sink half the loads into the subsd instructions themselves, even with -O0:
sub:
mov rax, rdi
movsd xmm2, qword ptr [rsi]
subsd xmm2, qword ptr [rdx]
movsd xmm1, qword ptr [rsi + 8]
subsd xmm1, qword ptr [rdx + 8]
movsd xmm0, qword ptr [rsi + 16]
subsd xmm0, qword ptr [rdx + 16]
movsd qword ptr [rdi], xmm2
movsd qword ptr [rdi + 8], xmm1
movsd qword ptr [rdi + 16], xmm0
ret
Cranelift is not entirely incapable of load sinking, as seen for a dot product, where it does sink a single load:
function u0:0(i64, i64) -> f64 system_v {
block0(v0: i64, v1: i64):
v3 = load.f64 notrap v0
v4 = load.f64 notrap v1
v6 = load.f64 notrap v0+8
v7 = load.f64 notrap v1+8
v10 = load.f64 notrap v0+16
v11 = load.f64 notrap v1+16
v5 = fmul v3, v4
v8 = fmul v6, v7
v9 = fadd v5, v8
v12 = fmul v10, v11
v13 = fadd v9, v12
return v13
}
0000000000000000 <dot>:
0: 55 push rbp
1: 48 89 e5 mov rbp,rsp
4: f2 0f 10 07 movsd xmm0,QWORD PTR [rdi]
8: f2 0f 10 2e movsd xmm5,QWORD PTR [rsi]
c: f2 0f 10 4f 08 movsd xmm1,QWORD PTR [rdi+0x8]
11: f2 0f 10 76 08 movsd xmm6,QWORD PTR [rsi+0x8]
16: f2 0f 10 57 10 movsd xmm2,QWORD PTR [rdi+0x10]
1b: f2 0f 59 c5 mulsd xmm0,xmm5
1f: f2 0f 59 ce mulsd xmm1,xmm6
23: f2 0f 58 c1 addsd xmm0,xmm1
27: f2 0f 59 56 10 mulsd xmm2,QWORD PTR [rsi+0x10]
2c: f2 0f 58 c2 addsd xmm0,xmm2
30: 48 89 ec mov rsp,rbp
33: 5d pop rbp
34: c3 ret
but again, even LLVM at -O0 will sink all 3 possible loads:
dot:
mov qword ptr [rsp - 8], rdi
movsd xmm0, qword ptr [rdi]
mulsd xmm0, qword ptr [rsi]
movsd xmm1, qword ptr [rdi + 8]
mulsd xmm1, qword ptr [rsi + 8]
addsd xmm0, xmm1
movsd xmm1, qword ptr [rdi + 16]
mulsd xmm1, qword ptr [rsi + 16]
addsd xmm0, xmm1
ret
These examples are taken from https://github.com/ebobby/simple-raytracer/blob/496b6164b9f16250f99b91327da8f01acc1e3534/src/vector.rs compiled with both cg_clif (-Copt-level=3) and cg_llvm (-Copt-level=0).
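For readers who do not want to open the link, here is a minimal sketch of the kind of Rust source that lowers to the two functions above. It is not copied from the linked vector.rs: the struct name `Vector`, its field names, and the method signatures are assumptions chosen to match the three f64 loads per operand seen in the IR.

```rust
// Hypothetical stand-in for the linked vector.rs; names and layout are assumptions.
#[derive(Clone, Copy, Debug, PartialEq)]
#[repr(C)]
pub struct Vector {
    pub x: f64, // offset 0
    pub y: f64, // offset 8
    pub z: f64, // offset 16
}

impl Vector {
    /// Component-wise subtraction: three loads per operand, three fsub, three stores.
    pub fn sub(&self, other: &Vector) -> Vector {
        Vector {
            x: self.x - other.x,
            y: self.y - other.y,
            z: self.z - other.z,
        }
    }

    /// Dot product: three fmul and two fadd, evaluated left to right.
    pub fn dot(&self, other: &Vector) -> f64 {
        self.x * other.x + self.y * other.y + self.z * other.z
    }
}

fn main() {
    let a = Vector { x: 4.0, y: 5.0, z: 6.0 };
    let b = Vector { x: 1.0, y: 2.0, z: 3.0 };
    println!("{:?}", a.sub(&b));
    println!("{}", a.dot(&b));
}
```

In both methods the right-hand operand of every arithmetic operation comes straight from memory, which is what makes it a candidate for being folded into subsd or mulsd as a memory operand.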
Benefit
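Improves runtime performance: folding a load into the instruction that consumes it removes a separate movsd and frees up a register.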
Implementation
I think this is caused by get_value_as_source_or_const considering loads as having side-effects even when they are notrap.
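To illustrate that hypothesis, here is a toy model of the sinking decision. It is not Cranelift's code: `Inst`, `Kind`, `is_barrier` and `count_sunk_loads` are all hypothetical names, and the rule is simplified to "a load can be folded into its user if it is the user's second operand and no side-effecting instruction sits between the two". Single-use checks and aliasing are ignored. The two blocks mirror the instruction order of the CLIF listings above.

```rust
// Toy model only: these types are hypothetical and are not Cranelift's data structures.
#[derive(Clone, Copy, PartialEq)]
enum Kind {
    Load,
    Store,
    Arith,
}

struct Inst {
    kind: Kind,
    result: Option<u32>, // value defined by this instruction, if any
    args: Vec<u32>,      // values used by this instruction
}

fn load(result: u32) -> Inst {
    Inst { kind: Kind::Load, result: Some(result), args: vec![] }
}

fn arith(result: u32, lhs: u32, rhs: u32) -> Inst {
    Inst { kind: Kind::Arith, result: Some(result), args: vec![lhs, rhs] }
}

fn store(value: u32) -> Inst {
    Inst { kind: Kind::Store, result: None, args: vec![value] }
}

/// Stores always block sinking; loads block it only under the strict rule.
fn is_barrier(inst: &Inst, loads_are_effects: bool) -> bool {
    match inst.kind {
        Kind::Store => true,
        Kind::Load => loads_are_effects,
        Kind::Arith => false,
    }
}

/// Counts arithmetic instructions whose second operand is a load with no
/// barrier between the load and the arithmetic instruction.
fn count_sunk_loads(block: &[Inst], loads_are_effects: bool) -> usize {
    let mut sunk = 0;
    for (user_idx, user) in block.iter().enumerate() {
        if user.kind != Kind::Arith {
            continue;
        }
        let rhs = user.args[1];
        let def_idx = block[..user_idx]
            .iter()
            .position(|i| i.kind == Kind::Load && i.result == Some(rhs));
        if let Some(def_idx) = def_idx {
            let blocked = block[def_idx + 1..user_idx]
                .iter()
                .any(|i| is_barrier(i, loads_are_effects));
            if !blocked {
                sunk += 1;
            }
        }
    }
    sunk
}

/// Same instruction order as the dot product listing.
fn dot_block() -> Vec<Inst> {
    vec![
        load(3), load(4), load(6), load(7), load(10), load(11),
        arith(5, 3, 4), arith(8, 6, 7), arith(9, 5, 8),
        arith(12, 10, 11), arith(13, 9, 12),
    ]
}

/// Same instruction order as the vector subtraction listing.
fn sub_block() -> Vec<Inst> {
    vec![
        load(3), load(4), load(6), load(7), load(9), load(10),
        arith(5, 3, 4), store(5), arith(8, 6, 7), store(8),
        arith(11, 9, 10), store(11),
    ]
}

fn main() {
    println!("dot, loads are barriers: {}", count_sunk_loads(&dot_block(), true));
    println!("dot, loads are not barriers: {}", count_sunk_loads(&dot_block(), false));
    println!("sub, loads are barriers: {}", count_sunk_loads(&sub_block(), true));
    println!("sub, loads are not barriers: {}", count_sunk_loads(&sub_block(), false));
}
```

With loads treated as barriers, the toy model sinks one load for the dot product and none for the subtraction, which matches the Cranelift output above. With notrap loads no longer treated as barriers, it sinks all three loads for the dot product. For the subtraction it still sinks only the first one, because in this model the interleaved stores block the other two.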
Alternatives
TODO: What are the alternative implementation approaches or alternative ways to
solve the problem that this feature would solve? How do these alternatives
compare to this proposal?