Bonsai 1.7B ternary model hits 442T/s on M4 Max

Hacker News·2w·hhuytho

A new quantized model achieves notably fast inference on consumer hardware by using ternary weights instead of standard formats. For indie developers building on-device AI, this kind of performance on a laptop processor could meaningfully reduce infrastructure costs.