Branchless programming is a fascinating area.
You have used -O3, so the compiler may also be vectorizing parts of the code. I am curious how much of the speed-up comes from AVX/SIMD (i.e., how much speed-up avoiding branches "alone" yields).
You can take a look at this - it's fast even without vector operations, as long as you avoid the branches that are often predicted incorrectly.
https://easylang.online/blog/branchless
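To make the point concrete, here is a minimal sketch in C of a branch-free compare-and-swap. The name sort2 is borrowed from this thread, but the signature and body are my own assumption, not the code from the linked post. It uses only scalar integer operations, so any gain would come from avoiding a mispredicted branch rather than from SIMD.

```c
#include <stdint.h>

/* Hypothetical helper (not the original author's code):
 * orders two values so that *a <= *b without using an if. */
static void sort2(int32_t *a, int32_t *b) {
    int32_t x = *a;
    int32_t y = *b;
    /* (x > y) is 1 or 0; negating gives all-ones or all-zeros. */
    int32_t mask = -(int32_t)(x > y);
    /* diff is x ^ y when a swap is needed, otherwise 0. */
    int32_t diff = (x ^ y) & mask;
    *a = x ^ diff;  /* becomes y if swapping, stays x otherwise */
    *b = y ^ diff;  /* becomes x if swapping, stays y otherwise */
}
```

Calling sort2(&p, &q) with p = 7 and q = 3 leaves p = 3 and q = 7. Whether the compiler actually emits branch-free machine code for this (or for a plain if) depends on the compiler and flags, so it is worth checking the generated assembly.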
In line 423 of the optimised code there's a typo: "sort2(e,i)" should be "sort2(i,e)".
That should give the same result.