I benchmarked 6 TPC-H analytical queries on Apple M4 across three execution paths: DuckDB SQL, NumPy CPU kernels, and MLX GPU kernels. The goal was to quantify whether unified memory
actually matters for GPU-accelerated analytics.
What I found:
- MLX GPU kernels are 1.3x-3.1x faster than identical NumPy CPU kernels on compute-heavy queries (Q1, Q6). The advantage scales with data size. (Q6-style sketch after this list.)
- DuckDB's optimized SQL engine beats hand-written GPU kernels on every standard TPC-H query. A C++ vectorized engine with a query optimizer is a different class of performance than
Python-orchestrated GPU kernels.
- A custom GPU-favorable query (pure parallel arithmetic, no joins) showed MLX beating DuckDB by 1.6x and NumPy by 16x -- confirming the GPU wins when the workload fits.
- If the M4 GPU were behind a PCIe 4.0 bus, data transfer would add 10-36% overhead. Unified memory eliminates this entirely. (Back-of-envelope after the list.)
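For a sense of what "identical kernels" means above, here is a simplified Q6-style pair. Column names, the date encoding, and the constants are placeholders, not the exact kernels from the repo:

    import numpy as np
    import mlx.core as mx

    def q6_numpy(qty, price, disc, ship):
        # Q6 shape: filter on four columns, then sum price * discount
        mask = (ship >= 8766) & (ship < 9131) & (disc >= 0.05) & (disc <= 0.07) & (qty < 24)
        return np.sum(price[mask] * disc[mask])

    def q6_mlx(qty, price, disc, ship):
        # Same predicate on the GPU; MLX has no boolean indexing,
        # so rows that fail the filter are zeroed out instead of dropped
        mask = mx.logical_and(
            mx.logical_and(ship >= 8766, ship < 9131),
            mx.logical_and(mx.logical_and(disc >= 0.05, disc <= 0.07), qty < 24))
        revenue = mx.sum(mx.where(mask, price * disc, 0.0))
        mx.eval(revenue)  # MLX is lazy; force the computation before reading the result
        return revenue.item()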
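The PCIe figure is just bandwidth arithmetic: bytes moved over the bus divided by effective bandwidth, compared against kernel time. Illustrative numbers below, not the paper's measurements:

    # ~25 GB/s usable on PCIe 4.0 x16 is an assumption for this sketch
    PCIE4_EFFECTIVE_BW = 25e9

    def pcie_overhead(bytes_moved, kernel_seconds):
        transfer_seconds = bytes_moved / PCIE4_EFFECTIVE_BW
        return transfer_seconds / kernel_seconds

    # e.g. four float32 columns over 6M rows (~96 MB) against a 25 ms kernel
    print(pcie_overhead(4 * 4 * 6_000_000, 0.025))  # ~0.15, i.e. ~15% added time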
Honest takeaway: Unified memory removes the transfer bottleneck, but the engine's software stack matters more than the hardware for typical analytical queries. GPU analytics needs
workloads heavy on parallel arithmetic and light on joins to beat an optimized CPU engine.
MLX limitations I worked around: no boolean indexing (used an overflow-bin pattern, sketched below), float32 only (~0.08% precision loss over millions of rows), and mx.array(numpy) makes a copy, not a zero-copy view.
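The overflow-bin trick, roughly (a simplified sketch with hypothetical names, not the exact kernel): route rows that fail a filter into one extra bin instead of dropping them, then discard that bin after aggregating.

    import mlx.core as mx

    def grouped_sum(keys, values, mask, n_groups):
        # Rows failing the filter can't be dropped without boolean indexing,
        # so they are sent to bin index n_groups (the overflow bin)
        bins = mx.where(mask, keys, n_groups)
        out = mx.zeros(n_groups + 1, dtype=values.dtype)
        out = out.at[bins].add(mx.where(mask, values, 0.0))
        return out[:n_groups]  # drop the overflow bin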
Full paper: https://github.com/sadopc/unified-db-2/blob/main/PAPER.md
All code is MIT-licensed and runs end-to-end with one command.