Skip to content

Conversation

@delneg
Copy link
Contributor

@delneg delneg commented May 29, 2022

Hi,
I've studied version10 of Topological Sort a bit and I see that there's some operations which are done once per loop, and I had an idea that they can be parallelized.
Parallelization via Tasks can bring a lot of overhead, but probably one could use SIMD in this case for primitive operations.

I've added an example of Edge.BatchAdd, and some comments - although I can't get it fully working but I wanted to share the general idea

P.S. Also, .NET 7.0 is needed because they added ShiftRight / ShiftLeft only in this new version

@matthewcrews
Copy link
Owner

This is super interesting! I'll need to look over this and see what the performance difference will be. Thank you for the ideas!

@delneg
Copy link
Contributor Author

delneg commented May 31, 2022

Also, added some micro-optimizations here https://github.com/delneg/FSharpPerformance/tree/wip/optimize-v10

|     Method |     Mean |    Error |   StdDev |   Median |   Gen 0 | Allocated |
|----------- |---------:|---------:|---------:|---------:|--------:|----------:|
| Version_10 | 750.6 μs | 14.75 μs | 29.45 μs | 739.2 μs | 28.3203 |    234 KB |
| Version_11 | 724.6 μs | 10.36 μs |  9.69 μs | 721.7 μs | 24.4141 |    203 KB |

It's a little bit faster for me & 31kb less memory hungry

@matthewcrews
Copy link
Owner

Interesting, on my machine it's slower. I have an AMD 3900x which is a Zen 2. Are you running on an Intel?

// * Summary *

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.22000
AMD Ryzen 9 3900X, 1 CPU, 24 logical and 12 physical cores
.NET SDK=7.0.100-preview.4.22252.9
  [Host]     : .NET 7.0.0 (7.0.22.22904), X64 RyuJIT DEBUG
  DefaultJob : .NET 7.0.0 (7.0.22.22904), X64 RyuJIT
Method Mean Error StdDev Gen 0 Allocated
Version_10 522.7 us 7.37 us 6.90 us 28.3203 234 KB
Version_11 662.1 us 12.82 us 14.77 us 24.4141 203 KB

It is interesting to see the performance difference between NET6 and NET7. Version_10 is ~50us faster with NET7

@delneg
Copy link
Contributor Author

delneg commented Jun 1, 2022

Yep, I'm testing on an x64 macOS

BenchmarkDotNet=v0.13.1, OS=macOS Big Sur 11.4
Intel Core i9-9980HK CPU 2.40GHz, 1 CPU, 16 logical and 8 physical cores
.NET SDK=7.0.100-preview.5.22267.11
  [Host]     : .NET 7.0.0 (7.0.22.26611), X64 RyuJIT DEBUG
  DefaultJob : .NET 7.0.0 (7.0.22.26611), X64 RyuJIT

|     Method |     Mean |    Error |   StdDev |   Gen 0 | Allocated |
|----------- |---------:|---------:|---------:|--------:|----------:|
| Version_10 | 845.0 us | 15.81 us | 15.52 us | 28.3203 |    234 KB |
| Version_11 | 771.5 us | 11.80 us |  9.85 us | 24.4141 |    203 KB |

I've also tried it on an M1 Mac for interesting results:

BenchmarkDotNet=v0.13.1, OS=macOS Monterey 12.1
Apple M1, 1 CPU, 8 logical and 8 physical cores
.NET SDK=7.0.100-preview.4.22252.9
  [Host]     : .NET 7.0.0 (7.0.22.22904), Arm64 RyuJIT DEBUG
  DefaultJob : .NET 7.0.0 (7.0.22.22904), Arm64 RyuJIT


|     Method |     Mean |   Error |  StdDev |   Gen 0 | Allocated |
|----------- |---------:|--------:|--------:|--------:|----------:|
| Version_10 | 475.9 us | 0.98 us | 0.92 us | 38.0859 |    234 KB |
| Version_11 | 475.4 us | 1.92 us | 1.80 us | 32.7148 |    203 KB |

Both with .NET 7, both after commenting out value option ValuesCount caching (last commit just pushed) - commented it out because it was actually making it slower.

So, maybe what I did was actually a No-Op after all

@matthewcrews
Copy link
Owner

Super interesting. This is why I've been tempted to build an Intel machine along with my AMD. I knew these performance differences were possible but that seems pretty dramatic. I wonder what the root cause is for the performance difference between our different CPUs.

@delneg
Copy link
Contributor Author

delneg commented Jun 1, 2022

Well, arm64 is very different from x64 for sure and got many improvements in .net7 so no wonder
Also I'm testing on mac and not on windows so it may also affect the performance due to differences in .net runtime performance
Anyway, for your use case of <500us it's safe to say that you could buy an M1 Mac and have your one million graphs sorted in time :)

P.S. I'd say it's tempting to have x86_64, ARM64, and RISC-V platforms for testing, not only amd and intel
Also maybe get your hands onto some used cloud-grade cpu's (like the ones amazon uses or smth like that) if you're running your workloads in the cloud

@delneg
Copy link
Contributor Author

delneg commented Jun 1, 2022

It seems that I've managed to make it faster by making Graph a Struct and passing it by inref

BenchmarkDotNet=v0.13.1, OS=macOS Big Sur 11.4 (20F71) [Darwin 20.5.0]
Intel Core i9-9980HK CPU 2.40GHz, 1 CPU, 16 logical and 8 physical cores
.NET SDK=7.0.100-preview.5.22267.11
  [Host]     : .NET 7.0.0 (7.0.22.26611), X64 RyuJIT DEBUG
  DefaultJob : .NET 7.0.0 (7.0.22.26611), X64 RyuJIT


|     Method |     Mean |    Error |   StdDev |   Gen 0 |  Gen 1 | Allocated |
|----------- |---------:|---------:|---------:|--------:|-------:|----------:|
| Version_10 | 808.6 us | 10.51 us |  9.83 us | 28.3203 |      - |    234 KB |
| Version_11 | 699.5 us | 10.69 us | 10.00 us | 21.4844 | 0.9766 |    180 KB |
BenchmarkDotNet=v0.13.1, OS=macOS Monterey 12.1 (21C52) [Darwin 21.2.0]
Apple M1, 1 CPU, 8 logical and 8 physical cores
.NET SDK=7.0.100-preview.4.22252.9
  [Host]     : .NET 7.0.0 (7.0.22.22904), Arm64 RyuJIT DEBUG
  DefaultJob : .NET 7.0.0 (7.0.22.22904), Arm64 RyuJIT


|     Method |     Mean |   Error |  StdDev |   Gen 0 | Allocated |
|----------- |---------:|--------:|--------:|--------:|----------:|
| Version_10 | 477.9 us | 0.76 us | 0.67 us | 38.0859 |    234 KB |
| Version_11 | 467.0 us | 0.95 us | 0.89 us | 29.2969 |    180 KB |

Update: By moving all the EdgeTracker code to static inline members & passing around remainingEdges array with Span directly, I was able to get this (M1 here because it's more consistent)

BenchmarkDotNet=v0.13.1, OS=macOS Monterey 12.1 (21C52) [Darwin 21.2.0]
Apple M1, 1 CPU, 8 logical and 8 physical cores
.NET SDK=7.0.100-preview.4.22252.9
  [Host]     : .NET 7.0.0 (7.0.22.22904), Arm64 RyuJIT DEBUG
  DefaultJob : .NET 7.0.0 (7.0.22.22904), Arm64 RyuJIT


|     Method |     Mean |   Error |  StdDev |   Gen 0 | Allocated |
|----------- |---------:|--------:|--------:|--------:|----------:|
| Version_10 | 477.6 us | 0.78 us | 0.61 us | 38.0859 |    234 KB |
| Version_11 | 448.7 us | 1.05 us | 0.93 us | 29.2969 |    180 KB |

@delneg delneg mentioned this pull request Jun 3, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants