Show HN: OS Megakernel that matches M5 Max Tok/W at 2x the Throughput on RTX 3090

(github.com)

4 points | by GreenGames 5 hours ago

1 comment

emanuele-em 4 hours ago

Really cool to see someone actually prove that the NVIDIA vs Apple efficiency gap is mostly a software problem. A 2020 GPU matching M5 Max tok/J at 1.8x the throughput just by fusing all 24 layers into one persistent kernel is a strong result. The DVFS sweep losing only 5% between 420W and 220W is surprising. Have you looked at what this would take on Hopper with TMA?