A 30B Qwen Model Walks Into a Raspberry Pi…
and Runs in Real Time
For this release, we optimize for what people actually experience when they run a model: fast, high-quality responses on a specific target device.
We use ShapeLearn, our bitlength-learning method, to choose weight datatypes for Qwen3-30B-A3B-Instruct-2507 that maximize performance in terms of tokens per second (TPS) and output quality, with one practical constraint: the model must fit comfortably in the available memory. Once it fits, making the file smaller isn't a goal in itself. We only shrink further when doing so also improves the tradeoff people actually care about: speed vs. quality.
Approaching bitlength learning this way matters because in llama.cpp, "fewer bits" doesn't automatically mean "more speed." Different quantization formats can trigger different kernels and overheads, and on some GPUs, going lower-bit can even get slower, despite using less memory.
Bottom line: treat memory as a budget to meet, then optimize what matters most: TPS and quality.
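To make "memory as a budget" concrete, here is a rough back-of-the-envelope sketch. The weight footprint is roughly parameter count × bits per weight; the parameter count and the headroom reserved for KV cache, activations, and the OS are illustrative assumptions, not our exact sizing.

```python
def weights_gib(params_b: float, bpw: float) -> float:
    """Approximate weight footprint in GiB: params (in billions) * bits-per-weight / 8."""
    return params_b * 1e9 * bpw / 8 / 2**30

def fits(params_b: float, bpw: float, device_gib: float, headroom_gib: float = 1.5) -> bool:
    """Crude fit check: weights plus headroom (KV cache, activations, OS) under device memory."""
    return weights_gib(params_b, bpw) + headroom_gib <= device_gib

# Illustrative: a ~30B-parameter model against a 16 GiB Raspberry Pi 5.
for bpw in (2.70, 3.25, 3.92, 4.67):
    print(f"{bpw:.2f} BPW -> ~{weights_gib(30, bpw):.1f} GiB weights, "
          f"fits in 16 GiB: {fits(30, bpw, 16)}")
```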
TL;DR
Yes, this 30B Qwen3 runs on a Raspberry Pi. On a Pi 5 (16GB), Q3_K_S-2.70bpw [KQ-2] hits 8.03 TPS at 2.70 BPW and maintains 94.18% of BF16 quality. It genuinely feels real-time. More broadly, the same pattern shows up everywhere else: ByteShape models give you a better TPS/quality tradeoff than the alternatives (here we look at Unsloth and MagicQuant).
CPUs
On CPUs, reducing footprint via shorter bitlengths affects the TPS/accuracy tradeoff the way one would expect: once the model fits, a smaller footprint tends to increase TPS in a fairly monotonic way. If datatypes are selected correctly, you can trade a bit of quality for speed predictably, which makes it much easier to pick a point on the curve that matches your constraints (see the sketch below).
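Concretely, "picking a point on the curve" boils down to keeping only the configurations that are not dominated on both axes, then choosing along that frontier. A minimal sketch, with placeholder entries standing in for measured data:

```python
from typing import NamedTuple

class Point(NamedTuple):
    name: str
    tps: float      # measured tokens per second on the target device
    quality: float  # normalized accuracy vs. BF16, in percent

def pareto_front(points: list[Point]) -> list[Point]:
    """Drop every point that another point beats (or matches) on both TPS and quality."""
    return [p for p in points
            if not any(q.tps >= p.tps and q.quality >= p.quality and q != p
                       for q in points)]

# Placeholder measurements -- substitute your own.
candidates = [
    Point("A-2.70bpw", 8.03, 94.18),
    Point("B-3.25bpw", 6.68, 97.97),
    Point("C-3.38bpw", 5.03, 97.78),  # dominated by B: slower and less accurate
]
print(pareto_front(candidates))  # keeps A and B; C is never the right pick
```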
We'll start with the most memory-constrained CPU case (Raspberry Pi 5 16GB), where "fits in RAM" is the limiting factor, then move to an Intel i7 with 64GB, where everything fits.
Raspberry Pi 5
The figure below shows TPS vs. normalized accuracy for the models that fit in RAM on the Raspberry Pi 5 16GB.
Notably, sustaining 8.5 TPS at 92%+ baseline accuracy with a 30B model on a Raspberry Pi reshapes expectations for Pi-class systems. Overall, the trend shows that ShapeLearn consistently produces better models, with ByteShape trending up and to the right of Unsloth, achieving higher tokens per second at the same quality, or higher quality at the same throughput.
We highlight choices for two primary objectives: accuracy or response time.
- Optimizing for response time while maintaining accuracy: For interactive, on-device use, perceived responsiveness is driven by how quickly text appears, not peak throughput. In practice, generation feels real-time once it reaches roughly 8 TPS, comfortably above typical reading speed (see the quick arithmetic at the end of this subsection). In this Raspberry Pi real-time regime, Q3_K_S-2.70bpw [KQ-2] (2.70 BPW, 8.03 TPS, 94.18% accuracy) is our go-to recommendation: it crosses the real-time threshold while maintaining high accuracy. Compared to Unsloth models at similar quality, ByteShape achieves real-time performance at lower BPW and higher TPS, making it the more efficient choice for interactive edge deployment.
- Accuracy above all: The table below lists the models that achieve the highest accuracy while still being able to run on a Raspberry Pi. Within this set, ByteShape models make the best use of the available resources to maximize accuracy, occupying the lowest-error rows (~1.1–1.3% relative error, ~98.8% accuracy), while the strongest Unsloth entries remain around 2.1–2.2% error (~97.9% accuracy). Compared to Unsloth's UD-Q3_K_XL [8], ByteShape achieves up to a 1.87× lower error rate while still operating at ~5–6 TPS, comfortably within typical throughput norms on a Raspberry Pi, making it the better choice when accuracy is the priority.

Even when prioritizing maximum speed with some reduction in accuracy, Q3_K_S-3.25bpw [KQ-5] offers a better tradeoff: more accurate, smaller, and faster than the fastest Unsloth model.
| Model | Relative Error | BPW | TPS |
|---|---|---|---|
| Q4_K_S-3.92bpw [KQ-7] | 1.14% | 3.92 | 5.30 |
| Q4_K_S-3.61bpw [KQ-6] | 1.25% | 3.61 | 5.94 |
| Q3_K_S-3.25bpw [KQ-5] | 2.03% | 3.25 | 6.68 |
| UD-IQ3_XXS [6] | 2.22% | 3.38 | 5.03 |
| UD-Q3_K_XL [8] | 2.13% | 3.62 | 6.28 |
Many other Unsloth and MagicQuant models (some of ours too!) are not in this chart. We compare them in other sections, but they're not applicable in the Raspberry Pi case. They simply don't fit!
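Before moving on, a quick sanity check on the ~8 TPS "real-time" threshold used above. Assuming roughly 0.75 words per token for English text (a rule-of-thumb figure, not something we measured here), 8 tokens per second works out to about 360 words per minute, comfortably above typical silent-reading speeds:

```python
# Why ~8 TPS feels real-time: compare generation speed to reading speed.
tps = 8.0
words_per_token = 0.75          # assumed rule of thumb for English text
wpm = tps * words_per_token * 60
print(f"{tps:.0f} tok/s ~= {wpm:.0f} words/min")  # ~360 wpm, faster than most people read
```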
Intel i7
Next, we move to the Intel i7 with 64GB RAM. The figure below shows TPS vs normalized accuracy for all models.
Overall, ByteShape models outperform both Unsloth and MagicQuant, delivering higher quality at comparable throughput using fewer bits per parameter. Only ByteShape offers models that run in the 26+ TPS range, extending performance well beyond the other methods.
Highlights:
- Quality-first: At the high-accuracy end of the table, IQ4_XS-4.67bpw [KQ-9] achieves the lowest relative error (0.25%), outperforming the best-running Unsloth models (Q6_K [20] and Q5_K_M [18], whose relative errors are 0.36% and 0.44%). Compared directly, ByteShape delivers up to a 1.44× lower error rate with higher throughput than Q6_K [20], and a 1.76× lower error rate at essentially the same speed as Q5_K_M [18]. MagicQuant mxfp4 [3] trails in this regime, with both higher error and lower TPS.
- Balanced point: In the mid-accuracy, high-throughput region, Q3_K_S-3.25bpw [KQ-5] combines ~98% accuracy with 23.1 TPS at just 3.25 BPW, offering the best overall balance in the table. Matching or exceeding this accuracy with Unsloth (IQ4_XS [10]) requires higher BPW and lower TPS, while choosing an Unsloth model closer in speed (Q3_K_S [7]) incurs a 1.73× higher error rate. MagicQuant does not offer a competitive model in this range; its fastest entry (IQ4_NL [2]) is behind both ByteShape and Unsloth in accuracy and throughput.
Takeaway: Across both quality-first and balanced settings, ByteShape consistently converts the available bit budget into either higher accuracy or higher TPS, and is the only approach that simultaneously covers the high-quality and 26+ TPS balanced-performance regions in this comparison.
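For readers switching between the plots (normalized accuracy) and the tables (relative error): the two are complements of each other, and the "N× lower error" figures above are simply ratios of relative errors. A quick sketch:

```python
def rel_error(normalized_accuracy_pct: float) -> float:
    """Relative error is the complement of normalized accuracy (both in percent)."""
    return 100.0 - normalized_accuracy_pct

# A model at 99.75% of BF16 quality has a 0.25% relative error, and
# "1.44x lower error" is just the ratio of two relative errors.
print(rel_error(99.75))       # 0.25
print(round(0.36 / 0.25, 2))  # 1.44 -> Q6_K [20] vs. IQ4_XS-4.67bpw [KQ-9]
```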
GPUs: RTX5090/32GB and RTX4080/16GB
On GPUs, performance depends as much on kernel choice as on raw memory footprint. For matmul/matvec, llama.cpp's quantization-specific GPU decode paths incur very different overheads, so fewer bits per weight do not reliably translate to higher TPS. Instead, TPS often peaks at quantization-specific sweet spots. Pushing BPW lower can even increase VRAM traffic and instruction count, hurting performance rather than improving it. We dig into this behavior in more detail right after the GPU results section, where the kernel-level tradeoffs become more apparent.
We evaluate on two GPUs: an RTX 5090 (32 GB), which can run models above 4 BPW and typically reach the fastest sweet spots, and an RTX 4080 (16 GB), where >4 BPW models do not fit, forcing different trade-offs and making the device-optimized curve easier to see.
RTX 5090 (32GB of VRAM)
Let's start with the 5090, which has enough VRAM to support all of the quantized models. The figure below shows TPS vs normalized accuracy.
Two things stand out immediately:
First, this GPU shows a clear ~4-bit sweet spot: several ~4b models cluster at very high TPS with nearly identical quality. Examples include Unsloth Q4_0 [12], Unsloth IQ4_XS [10], IQ4_XS-3.87bpw [IQ-6], and MagicQuant iq4_nl-EHQKOUD-IQ4NL [1], all running around ~302–303 TPS at ~98.4–98.9% accuracy. Within this tight cluster, Unsloth edges slightly ahead in throughput and quality.
Second, outside of that sweet spot, the tradeoff becomes much more uneven:
- Many other Unsloth and MagicQuant models show significantly lower TPS, regardless of whether they are quantized more or less aggressively.
- Past the ~4b region, only ByteShape continues to increase TPS with a more predictable reduction in quality.
Accuracy-critical workloads: when output quality is paramount, ByteShape delivers the most accurate model on the 5090: IQ4_XS-4.67bpw [IQ-8] (4.67 BPW, 272.98 TPS, 99.75% accuracy). It surpasses Unsloth Q6_K [20] (6.57 BPW, 264.88 TPS, 99.64% accuracy) while using fewer bits and achieving slightly higher throughput, and it clearly outperforms MagicQuant mxfp4_moe-H-B16-EUR-IQ4NL-KO-Q5K-QD-Q6K [3] (5.46 BPW, 240.42 TPS, 99.32% accuracy) in both accuracy and speed, making it the strongest choice when accuracy is a task-critical deployment requirement.
Practical takeaway. If your GPU has enough VRAM to run a strong ~4b model that already meets your speed and accuracy requirements, that cluster is an excellent default. The curve becomes more interesting when task-critical deployment constraints demand higher accuracy or smaller models, for example under tighter memory budgets or in constrained environments (as we'll see on the 4080).
RTX 4080 (16GB of VRAM)
Next, let's move to a more accessible GPU, especially in these memory-challenged times. The biggest stumbling block for the 4080 is its 16GB of VRAM, which is not sufficient to support the "magical" ~4b quantizations for a 30B model. How convenient! This "avoids" the 5090's ~4b sweet spot and forces a more "real-world" comparison under a hard VRAM budget. The figure below shows TPS versus normalized accuracy for all models that fit on the 4080.
On the RTX 4080, ByteShape consistently outperforms Unsloth under the same 16 GB VRAM constraint, delivering a better TPS–quality tradeoff.
In particular, ByteShape's highest-quality model that fits, IQ4_XS-3.87bpw [IQ-6] (3.87 BPW, 214.81 TPS, 98.66% accuracy), delivers:
- a 1.59× lower error rate and 9.4% higher TPS vs. Unsloth Q3_K_XL [8] (3.62 BPW, 196.42 TPS, 97.87% accuracy).
- a 2.54× lower error rate at the same TPS vs. Unsloth IQ2_M [2] (2.84 BPW, 214.79 TPS, 96.59% accuracy).
As we move to higher throughput, ByteShape maintains accuracy, while Unsloth's error rate falls off a cliff.
The Elephant in the Room: When 3 Bits Is Not Just 3 Bits
There is an inconvenient truth hiding in these results. On several setups, the ~4 BPW quantizations are already flying, and pushing quantization harder does not make things faster; it just manages to be smaller and slower at the same time.
Reducing the size of data doesn't automatically speed things up. While using fewer bits to store each number seems like it should reduce memory traffic and speed up computation, GPUs don't work that way. NVIDIA GPUs process work in fixed groups of 32 threads called "warps," which move through instructions together in near lock-step. The GPU hardware is optimized for specific data formats, memory access patterns, and operations that the chip's circuits are physically designed to handle efficiently. When your workload matches these "golden paths", you get peak performance. Step outside them, and you hit slowdowns. This isn't a design flaw, it's a deliberate tradeoff. Supporting more flexibility would require additional circuitry: more wires, more transistors, more complexity. That extra hardware consumes more power and adds latency to every operation, whether a program needs that flexibility or not.
Here are a few examples of relevant hardware "quirks": VRAM is read in aligned 32-byte blocks, so reading one byte or 32 bytes consumes the same memory bandwidth. Both on-chip and off-chip memories can also suffer contention depending on how data is laid out, meaning that a warp's accesses may complete in a single step or, in the worst case, be serialized into 32 steps. And of course, decoding quantized values before computation can require extra instructions, with the cost depending on the quantization scheme.
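One way to build intuition for the aligned 32-byte reads is to count how many 32-byte segments a warp's loads touch. The model below is deliberately idealized (real GPUs layer caches and other machinery on top), but it shows how the same 128 bytes of useful data can cost anywhere from 4 to 32 memory transactions depending on layout:

```python
def transactions(addresses: list[int], segment_bytes: int = 32) -> int:
    """Idealized model: number of distinct aligned 32-byte segments touched by a warp's loads."""
    return len({addr // segment_bytes for addr in addresses})

warp = range(32)                          # 32 threads, each loading 4 bytes
coalesced = [tid * 4 for tid in warp]     # contiguous: 128 bytes -> 4 segments
strided = [tid * 128 for tid in warp]     # 128-byte stride: 32 segments, 8x the traffic

print(transactions(coalesced), transactions(strided))  # 4 32
```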
These hardware quirks explain the behavior we observe: 4-bit kernels use VRAM bandwidth more efficiently than 3- or 2-bit kernels and require fewer decode steps before computation. At the same time, 4-bit kernels exploit subword parallelism just as effectively as lower-bit kernels, and all rely primarily on dynamic caches rather than shared memory to take advantage of data reuse when possible.
So why hasn't llama.cpp been optimized to deliver peak speed for every bit-length? Our understanding is that llama.cpp prioritizes portable, space-efficient quantization that can run across a wide range of hardware. That design goal limits how aggressively backends can reshape data layouts or reorder computation in ways that might help one GPU or one bit-width.
A key example is its choice to store quantized weights in fixed blocks of 256 values. Each block is self-contained (it carries everything needed to decode it) and sits at a simple, predictable offset in the tensor, which makes the format easy to implement and fast to locate.
The tradeoff is that GPUs often need to decode many blocks in parallel to keep their wide compute units busy. With many independent 256-value blocks, those parallel decodes can translate into more scattered or fragmented VRAM reads and extra decode overhead, reducing bandwidth efficiency, especially for some lower-bit formats.
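To make the block format concrete: each quantization type packs 256 values into a fixed number of bytes, so block i of a tensor row sits at offset i × block_bytes. The per-block byte counts below are the commonly cited figures for the two formats compared in the next paragraph; treat them as assumptions rather than a spec reference.

```python
# Approximate per-block layouts: 256 values per block, fixed bytes per block.
BLOCK_VALUES = 256
BLOCK_BYTES = {
    "iq4_xs": 136,   # ~4.25 bits per weight
    "iq3_xxs": 98,   # ~3.06 bits per weight
}

def block_offset(dtype: str, block_index: int) -> int:
    """Byte offset of a block inside a quantized tensor row: simple and predictable."""
    return block_index * BLOCK_BYTES[dtype]

for dtype, nbytes in BLOCK_BYTES.items():
    bpw = nbytes * 8 / BLOCK_VALUES
    # A 768-element row (as in the matmul example below) spans 3 independent blocks.
    print(f"{dtype}: {bpw:.2f} BPW, block 2 starts at byte {block_offset(dtype, 2)}")
```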
A case in point on the RTX 5090: a matrix multiply [256, 768] × [768, 2048] takes ~54µs with the iq4_xs datatype, but ~62µs with iq3_xxs (mul_mat_q()+mul_mat_q_stream_k_fixup()). In other words, cutting nearly 1.2 bits per weight (a reduction of more than 25% in weight footprint) leads to a ~13% slowdown, directly hurting user experience.
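Plugging in the numbers (using the commonly cited ~4.25 and ~3.06 BPW for these two formats as an assumption, and the timings quoted above): the weights shrink by roughly 28%, while throughput drops by roughly 13%.

```python
# Smaller, yet slower: footprint vs. measured kernel time for the matmul above.
bpw_iq4_xs, bpw_iq3_xxs = 4.25, 3.0625    # assumed nominal bits per weight
t_iq4_xs_us, t_iq3_xxs_us = 54.0, 62.0    # measured kernel times quoted above

footprint_cut = 1 - bpw_iq3_xxs / bpw_iq4_xs
throughput_drop = 1 - t_iq4_xs_us / t_iq3_xxs_us
print(f"~{footprint_cut:.0%} smaller weights, ~{throughput_drop:.0%} lower throughput")
```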
An excellent reminder that bitlength learning matters: Heuristics can get us part of the way, but not all the way. ShapeLearn makes deliberate, per-tensor datatype choices that improve speed without sacrificing accuracy.
Methodology (brief recap)
If you're wondering how we are scoring these points, the full methodology is discussed in our previous blog post. This post is intentionally focused on the curves and device tradeoffs, so here is the quick version.
For each quantized variant, we measure throughput (TPS) on the target device and compute a single normalized quality score relative to the BF16 baseline, using the same evaluation harness and prompts as the methodology post. The quality score aggregates standard benchmarks (MMLU, GSM8K, IFEval, LiveCodeBench V4) into one number so you can compare points directly. In other words, every dot in the plots answers two questions: how fast does it run on this device, and how much quality does it retain compared to BF16, with memory fit as the first constraint.
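For intuition only (the exact aggregation is spelled out in the methodology post), the normalization is conceptually along these lines: score each benchmark relative to BF16, then average. The equal-weight averaging and the scores below are illustrative assumptions:

```python
def normalized_quality(quant: dict[str, float], bf16: dict[str, float]) -> float:
    """Illustrative aggregation: mean per-benchmark score relative to BF16, in percent."""
    ratios = [quant[bench] / bf16[bench] for bench in bf16]
    return 100.0 * sum(ratios) / len(ratios)

# Made-up scores, purely to show the shape of the computation.
bf16_scores  = {"MMLU": 0.80, "GSM8K": 0.90, "IFEval": 0.85, "LiveCodeBench V4": 0.45}
quant_scores = {"MMLU": 0.79, "GSM8K": 0.88, "IFEval": 0.84, "LiveCodeBench V4": 0.44}
print(f"{normalized_quality(quant_scores, bf16_scores):.2f}% of BF16")
```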
We also want to thank everyone for the many excellent suggestions on our recent Reddit post for improving and extending this evaluation strategy; we're actively working through them. Right now, the main bottleneck is evaluation, not bitlength learning/quantization. Careful evaluation is essential to clearly communicate the strengths of each model.
Wrapping up
First, thank you for your tenacity. You made it through all of this without giving up. We are sincerely flattered!
The takeaway is simple: treat memory as a constraint, not a goal. Once a model fits on your device, what matters is the tradeoff curve: TPS versus quality. Across CPUs and GPUs, ByteShape consistently lands on the better side of that curve, delivering either more speed at the same quality or higher quality at the same speed.
If you're deploying on a Raspberry Pi 5 (16 GB) and want a genuinely interactive experience, start with Q3_K_S-2.70bpw [KQ-2]. On larger CPUs or GPUs, you can move up the curve toward higher-quality points with little loss in throughput; the same rule applies: fit first, then optimize the tradeoff.
We'll keep releasing more device-targeted variants (and more plots). If your system can't run a 30B model smoothly, don't blame the model or the silicon. Blame the datatypes.