We're open-sourcing a header-only CUDA library that brings high-performance Cuckoo filters to modern GPUs.
You can read the full arXiv preprint here: https://arxiv.org/abs/2603.15486
The Problem:
Approximate Membership Query (AMQ) data structures like Bloom filters are heavily used to filter out non-members before expensive I/O operations, but they often don't support deleting items.
While dynamic alternatives like Cuckoo filters solve this, porting them to GPUs has historically been a nightmare.
The eviction chains cause massive thread divergence and random memory accesses, bottlenecking GPU throughput.
Our Solution:
Instead of fighting the random access pattern to optimise for memory locality at all cost (like the GPU Quotient filter and Two-Choice filter do), we embraced it to fully saturate the massive global memory bandwidth of modern hardware (primary testbed was an NVIDIA GH200). To the best of our knowledge this is the first general-purpose Cuckoo filter library that successfully uses the GPU for every operation without sacrificing performance too much.
We used:
1. A lock-free architecture built on atomic compare-and-swap (CAS).
2. Vectorised operations to process large chunks of a bucket in parallel without branching.
3. A novel BFS eviction heuristic that bounds eviction chains under heavy load.
Through this, we're seeing insertion and deletion throughputs up to 378x and 258x higher than the state-of-the-art GPU Quotient Filter, and up to a 350x speedup over an optimised multi-threaded CPU implementation.
It even rivals NVIDIA's own append-only Blocked Bloom filter for query throughput.
I'd love to hear your thoughts! And of course answer any questions that might come up.
Hi HN, I'm Tim, the primary author of Cuckoo-GPU.
We're open-sourcing a header-only CUDA library that brings high-performance Cuckoo filters to modern GPUs. You can read the full arXiv preprint here: https://arxiv.org/abs/2603.15486
The Problem:
Approximate Membership Query (AMQ) data structures like Bloom filters are heavily used to filter out non-members before expensive I/O operations, but they often don't support deleting items. While dynamic alternatives like Cuckoo filters solve this, porting them to GPUs has historically been a nightmare. The eviction chains cause massive thread divergence and random memory accesses, bottlenecking GPU throughput.
Our Solution:
Instead of fighting the random access pattern to optimise for memory locality at all cost (like the GPU Quotient filter and Two-Choice filter do), we embraced it to fully saturate the massive global memory bandwidth of modern hardware (primary testbed was an NVIDIA GH200). To the best of our knowledge this is the first general-purpose Cuckoo filter library that successfully uses the GPU for every operation without sacrificing performance too much.
We used:
1. A lock-free architecture built on atomic compare-and-swap (CAS).
2. Vectorised operations to process large chunks of a bucket in parallel without branching.
3. A novel BFS eviction heuristic that bounds eviction chains under heavy load.
Through this, we're seeing insertion and deletion throughputs up to 378x and 258x higher than the state-of-the-art GPU Quotient Filter, and up to a 350x speedup over an optimised multi-threaded CPU implementation. It even rivals NVIDIA's own append-only Blocked Bloom filter for query throughput.
I'd love to hear your thoughts! And of course answer any questions that might come up.