I built one a few years ago. I never published it or anything but I figure it's worth a discussion point here.
1. Multiply is a primitive that seems to mix the most bits in the fewest clock ticks.
2. Multiply's weakness is that the bottom bits have almost no mixing at all. This is easily fixed with bit-reverse, which is implemented in one clock tick on both NVidia and AMD GPUs. X86 can use bswap instead. ARM has bit reverse.
3. Finding the best 32-bit constant to multiply is the hard part. Until you remember that GPUs can search the entire 32-bit space in less than a miliseconds. (4-Billion numbers across 2GHz and about 4000 shaders so like 8000x 32-bit space traversals per second).
4. The RNG is therefore defined as #3, what kind of searches and heuristics do you optimize for?
5. I chose to calculate the pop count(change of bits) and optimized for the largest number of popcnt==16 bitflips, which is a weird definition of the avalanche effect but hey I needed something. Popcnt is also a one-click tick instruction.
6. The remaining effort is an exercise left for the reader. But it took me like 2 days, it's not hard.
7. EDIT: Remember the lessons from XORShift. Have a simple counter (ex: state++) combined with a reversible hash-like function (ex: return hash(state++);). The above discussion is mostly about the hash. State++ function is anything that trivially cycles the 32-bit or 64-bit space for maximum cycle length. I find that state+=(odd number) is a superior mixer that still trivially traverses the whole space. Use above discussion to search for optimal constant.
ARM NEON is kinda misrepresented here - it supports both bitshifts by non-constants, and proper 32×32→32-bit multiplies; AVX2 also supports both of those.
I built one a few years ago. I never published it or anything but I figure it's worth a discussion point here.
1. Multiply is a primitive that seems to mix the most bits in the fewest clock ticks.
2. Multiply's weakness is that the bottom bits have almost no mixing at all. This is easily fixed with bit-reverse, which is implemented in one clock tick on both NVidia and AMD GPUs. X86 can use bswap instead. ARM has bit reverse.
3. Finding the best 32-bit constant to multiply is the hard part. Until you remember that GPUs can search the entire 32-bit space in less than a miliseconds. (4-Billion numbers across 2GHz and about 4000 shaders so like 8000x 32-bit space traversals per second).
4. The RNG is therefore defined as #3, what kind of searches and heuristics do you optimize for?
5. I chose to calculate the pop count(change of bits) and optimized for the largest number of popcnt==16 bitflips, which is a weird definition of the avalanche effect but hey I needed something. Popcnt is also a one-click tick instruction.
6. The remaining effort is an exercise left for the reader. But it took me like 2 days, it's not hard.
7. EDIT: Remember the lessons from XORShift. Have a simple counter (ex: state++) combined with a reversible hash-like function (ex: return hash(state++);). The above discussion is mostly about the hash. State++ function is anything that trivially cycles the 32-bit or 64-bit space for maximum cycle length. I find that state+=(odd number) is a superior mixer that still trivially traverses the whole space. Use above discussion to search for optimal constant.
ARM NEON is kinda misrepresented here - it supports both bitshifts by non-constants, and proper 32×32→32-bit multiplies; AVX2 also supports both of those.