Running a 27B LLM at 80 tokens/sec on dual GPUs shows accessible inference scaling

Hacker News·1w·iMil

iMil benchmarked Qwen 3.6 27B on an RTX 5080 paired with an RTX 3090, reaching a throughput of 80 tokens/sec. For makers building AI features on modest hardware, this suggests you don't need enterprise clusters: combining an older and a newer GPU can reach useful inference speeds without a proportional increase in cost.
