Support load sinking across notrap loads #12033

@bjorn3

Feature

For example, when subtracting two 3-element mathematical vectors like so:

function u0:0(i64 sret, i64, i64) system_v {
block0(v0: i64, v1: i64, v2: i64):
    v3 = load.f64 notrap v1
    v4 = load.f64 notrap v2
    v6 = load.f64 notrap v1+8
    v7 = load.f64 notrap v2+8
    v9 = load.f64 notrap v1+16
    v10 = load.f64 notrap v2+16
    v5 = fsub v3, v4
    store notrap v5, v0
    v8 = fsub v6, v7
    store notrap v8, v0+8
    v11 = fsub v9, v10
    store notrap v11, v0+16
    return
}

Cranelift generates six separate load instructions, followed by three subsd + store pairs:

0000000000000000 <sub>:
   0:   55                      push   rbp
   1:   48 89 e5                mov    rbp,rsp
   4:   f2 0f 10 3e             movsd  xmm7,QWORD PTR [rsi]
   8:   f2 0f 10 2a             movsd  xmm5,QWORD PTR [rdx]
   c:   f2 0f 10 46 08          movsd  xmm0,QWORD PTR [rsi+0x8]
  11:   f2 0f 10 72 08          movsd  xmm6,QWORD PTR [rdx+0x8]
  16:   f2 0f 10 4e 10          movsd  xmm1,QWORD PTR [rsi+0x10]
  1b:   f2 0f 10 52 10          movsd  xmm2,QWORD PTR [rdx+0x10]
  20:   f2 0f 5c fd             subsd  xmm7,xmm5
  24:   f2 0f 11 3f             movsd  QWORD PTR [rdi],xmm7
  28:   f2 0f 5c c6             subsd  xmm0,xmm6
  2c:   f2 0f 11 47 08          movsd  QWORD PTR [rdi+0x8],xmm0
  31:   f2 0f 5c ca             subsd  xmm1,xmm2
  35:   f2 0f 11 4f 10          movsd  QWORD PTR [rdi+0x10],xmm1
  3a:   48 89 f8                mov    rax,rdi
  3d:   48 89 ec                mov    rsp,rbp
  40:   5d                      pop    rbp
  41:   c3                      ret

while LLVM is able to sink half of the loads into the subsd instructions themselves, even at -O0:

  sub:
        mov     rax, rdi
        movsd   xmm2, qword ptr [rsi]
        subsd   xmm2, qword ptr [rdx]
        movsd   xmm1, qword ptr [rsi + 8]
        subsd   xmm1, qword ptr [rdx + 8]
        movsd   xmm0, qword ptr [rsi + 16]
        subsd   xmm0, qword ptr [rdx + 16]
        movsd   qword ptr [rdi], xmm2
        movsd   qword ptr [rdi + 8], xmm1
        movsd   qword ptr [rdi + 16], xmm0
        ret

Cranelift is not entirely incapable of load sinking: for a dot product it does sink a single load (the final mulsd takes its second operand from memory):

function u0:0(i64, i64) -> f64 system_v {
block0(v0: i64, v1: i64):
    v3 = load.f64 notrap v0
    v4 = load.f64 notrap v1
    v6 = load.f64 notrap v0+8
    v7 = load.f64 notrap v1+8
    v10 = load.f64 notrap v0+16
    v11 = load.f64 notrap v1+16
    v5 = fmul v3, v4
    v8 = fmul v6, v7
    v9 = fadd v5, v8
    v12 = fmul v10, v11
    v13 = fadd v9, v12
    return v13
}

0000000000000000 <dot>:
   0:   55                      push   rbp
   1:   48 89 e5                mov    rbp,rsp
   4:   f2 0f 10 07             movsd  xmm0,QWORD PTR [rdi]
   8:   f2 0f 10 2e             movsd  xmm5,QWORD PTR [rsi]
   c:   f2 0f 10 4f 08          movsd  xmm1,QWORD PTR [rdi+0x8]
  11:   f2 0f 10 76 08          movsd  xmm6,QWORD PTR [rsi+0x8]
  16:   f2 0f 10 57 10          movsd  xmm2,QWORD PTR [rdi+0x10]
  1b:   f2 0f 59 c5             mulsd  xmm0,xmm5
  1f:   f2 0f 59 ce             mulsd  xmm1,xmm6
  23:   f2 0f 58 c1             addsd  xmm0,xmm1
  27:   f2 0f 59 56 10          mulsd  xmm2,QWORD PTR [rsi+0x10]
  2c:   f2 0f 58 c2             addsd  xmm0,xmm2
  30:   48 89 ec                mov    rsp,rbp
  33:   5d                      pop    rbp
  34:   c3                      ret

but again, even LLVM at -O0 sinks all three possible loads:

dot:
        mov     qword ptr [rsp - 8], rdi
        movsd   xmm0, qword ptr [rdi]
        mulsd   xmm0, qword ptr [rsi]
        movsd   xmm1, qword ptr [rdi + 8]
        mulsd   xmm1, qword ptr [rsi + 8]
        addsd   xmm0, xmm1
        movsd   xmm1, qword ptr [rdi + 16]
        mulsd   xmm1, qword ptr [rsi + 16]
        addsd   xmm0, xmm1
        ret

These examples are taken from https://github.com/ebobby/simple-raytracer/blob/496b6164b9f16250f99b91327da8f01acc1e3534/src/vector.rs compiled with both cg_clif (-Copt-level=3) and cg_llvm (-Copt-level=0).

Benefit

Improves runtime performance: each sunk load removes one standalone movsd and frees the register it would otherwise occupy.

Implementation

I think this is caused by get_value_as_source_or_const considering loads as having side-effects even when they are notrap.
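
As a rough illustration of that hypothesis, here is a toy model in Rust. It is not Cranelift's actual code, and every name in it (`Inst`, `Policy`, `sinkable_loads`, `dot_block`, `sub_block`) is hypothetical. It assumes that lowering assigns each instruction a side-effect "colour" and only merges a load into its user when no colour change lies between them. It also assumes each load has a single use and that only the right-hand operand can become a memory operand.

```rust
// Toy model of colour-based load sinking. NOT Cranelift's implementation;
// all types and functions here are hypothetical and for illustration only.

#[derive(Clone, Copy)]
enum Inst {
    /// A load; `notrap` mirrors the CLIF flag of the same name.
    Load { notrap: bool },
    /// A store, which always counts as a side effect.
    Store,
    /// A pure arithmetic op; `mem_operand` is the index of the load that
    /// could be folded in as its memory operand, if there is one.
    Arith { mem_operand: Option<usize> },
}

#[derive(Clone, Copy)]
enum Policy {
    /// Every load starts a new colour (the behaviour suspected above).
    AllLoadsAreEffects,
    /// Only loads that may trap start a new colour.
    NotrapLoadsArePure,
}

fn bumps_color(inst: Inst, policy: Policy) -> bool {
    match (inst, policy) {
        (Inst::Store, _) => true,
        (Inst::Load { .. }, Policy::AllLoadsAreEffects) => true,
        (Inst::Load { notrap }, Policy::NotrapLoadsArePure) => !notrap,
        (Inst::Arith { .. }, _) => false,
    }
}

/// Returns the indices of the loads that may be merged into their user.
fn sinkable_loads(block: &[Inst], policy: Policy) -> Vec<usize> {
    // entry[i] is the colour before instruction i, exit[i] the colour after it.
    let mut entry = Vec::with_capacity(block.len());
    let mut exit = Vec::with_capacity(block.len());
    let mut color = 0u32;
    for &inst in block {
        entry.push(color);
        if bumps_color(inst, policy) {
            color += 1;
        }
        exit.push(color);
    }

    let mut sinkable = Vec::new();
    for (user, &inst) in block.iter().enumerate() {
        if let Inst::Arith { mem_operand: Some(load) } = inst {
            // Equal colours mean no side effect sits between load and user.
            if exit[load] == entry[user] {
                sinkable.push(load);
            }
        }
    }
    sinkable
}

/// The dot product from this issue, reduced to the toy model.
fn dot_block() -> Vec<Inst> {
    let load = Inst::Load { notrap: true };
    vec![
        load, load, load, load, load, load,   // 0..=5: v3, v4, v6, v7, v10, v11
        Inst::Arith { mem_operand: Some(1) }, // 6: v5 = fmul v3, v4
        Inst::Arith { mem_operand: Some(3) }, // 7: v8 = fmul v6, v7
        Inst::Arith { mem_operand: None },    // 8: v9 = fadd v5, v8
        Inst::Arith { mem_operand: Some(5) }, // 9: v12 = fmul v10, v11
        Inst::Arith { mem_operand: None },    // 10: v13 = fadd v9, v12
    ]
}

/// The vector subtraction from this issue, reduced to the toy model.
fn sub_block() -> Vec<Inst> {
    let load = Inst::Load { notrap: true };
    vec![
        load, load, load, load, load, load,   // 0..=5: v3, v4, v6, v7, v9, v10
        Inst::Arith { mem_operand: Some(1) }, // 6: v5 = fsub v3, v4
        Inst::Store,                          // 7: store v5
        Inst::Arith { mem_operand: Some(3) }, // 8: v8 = fsub v6, v7
        Inst::Store,                          // 9: store v8
        Inst::Arith { mem_operand: Some(5) }, // 10: v11 = fsub v9, v10
        Inst::Store,                          // 11: store v11
    ]
}
```

Under `Policy::AllLoadsAreEffects` the model reports only load 5 (v11) as sinkable for `dot_block()` and nothing for `sub_block()`, which matches the disassembly above. Under `Policy::NotrapLoadsArePure` all three right-hand loads of the dot product become sinkable. For `sub_block()` only the first pair does in this model, because the intervening stores still separate the later loads from their users; a store may alias the loaded address, so that part would need more than a change to how notrap loads are classified.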

Alternatives

TODO: What are the alternative implementation approaches or alternative ways to
solve the problem that this feature would solve? How do these alternatives
compare to this proposal?
