Really cool to see someone actually prove that the NVIDIA vs Apple efficiency gap is mostly a software problem. A 2020 GPU matching M5 Max tok/J at 1.8x the throughput just by fusing all 24 layers into one persistent kernel is a strong result. The DVFS sweep losing only 5% between 420W and 220W is surprising. Have you looked at what this would take on Hopper with TMA?