Blackwell Picks Favorites:
Qwen 3.5 35B A3B
Here's our next ByteShape release: Qwen 3.5 35B A3B. This time it is a Mixture-of-Experts model, not a dense one like the 9B, and the hardware story flips almost entirely.
With the 9B, GPUs were fairly agreeable while CPUs had strong, diverging preferences. For this 35B model it is the reverse: CPUs are surprisingly consistent, while GPUs are much pickier about which quantized models run best on each card.
That is why we are presenting models a bit differently this time. In the GPU charts, we highlight the models that are the best match for each specific card and gray out the ones that are not really the right choice. There is no single GPU pick that works best everywhere, but once you know your hardware, the right options become fairly clear.
We also have a step-by-step tutorial on how to run our models locally with OpenCode.
And… as always, be careful with any tool you choose to grant access to.
TL;DR
On CPUs, the picture is clean. ByteShape models trace a very consistent speed versus quality frontier across the i7, Ryzen 9 5900X, Ultra 7, and even the Raspberry Pi 5, so the recommendations barely change from one system to another.
On GPUs, device-specific optimization matters a lot more. The 40-series cards clearly like one set of models, while the Blackwell cards prefer others.
For the impatient:
- RTX 4080: GPU-4 is the obvious balanced pick.
- RTX 4090: GPU-5 would be our recommendation for best speed vs. accuracy, while GPU-6 practically matches baseline accuracy.
- RTX 5090 / RTX Pro 6000 Blackwell: GPU-5 remains our recommendation.
- CPUs: CPU-5 is our recommendation for all CPUs, with CPU-6 if you want to stay as close to baseline as possible.
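The "frontier" language used throughout this post can be made precise: a model sits on the speed-versus-quality frontier when no other model beats it on both tokens-per-second and accuracy at once. A minimal sketch of that check, using four rows from the Ryzen 9 5900X table below:

```python
def pareto_frontier(points):
    """Return names of points not dominated in (TPS, accuracy): a point is on
    the frontier if no other point is both at-least-as-fast and at-least-as-
    accurate, with a strict improvement on at least one axis."""
    frontier = []
    for name, tps, acc in points:
        dominated = any(
            (t > tps and a >= acc) or (t >= tps and a > acc)
            for n, t, a in points if n != name
        )
        if not dominated:
            frontier.append(name)
    return frontier

# (name, TPS, accuracy) taken from the Ryzen 9 5900X table below
points = [
    ("CPU-5", 9.07, 0.9969),
    ("CPU-6", 8.57, 0.9981),
    ("Q4_K_L", 8.68, 0.9956),
    ("IQ4_NL", 8.25, 0.9870),
]
print(pareto_frontier(points))  # → ['CPU-5', 'CPU-6']
```

Q4_K_L drops out because CPU-5 is both faster and more accurate; CPU-5 and CPU-6 survive because neither dominates the other.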
GPUs
Unlike the usual pattern, where the same ByteShape models tend to perform best across all GPUs, Blackwell shows a much stronger preference for specific datatypes. The charts make this clear, so we highlight the models that best match what Blackwell favors, while still releasing models that remain strong choices on older GPUs as well.
RTX 4090 (24 GB)
| # | Model | Acc | TPS | BPW |
|---|---|---|---|---|
| ByteShape | ||||
| GPU-1 | 2.17bpw | 0.8826 | 183.02 | 2.17 |
| GPU-2 | 2.73bpw | 0.9387 | 180.49 | 2.73 |
| GPU-3 | 2.89bpw | 0.9641 | 176.35 | 2.89 |
| GPU-4 | 3.40bpw | 0.9849 | 168.98 | 3.40 |
| GPU-5 | 4.06bpw | 0.9969 | 164.77 | 4.06 |
| GPU-6 | 4.12bpw | 0.9981 | 156.44 | 4.12 |
| Unsloth | ||||
| 1 | IQ2_XXS | 0.9191 | 167.54 | 2.46 |
| 2 | IQ2_M | 0.9468 | 167.34 | 2.63 |
| 3 | Q2_K_XL | 0.9501 | 166.66 | 2.80 |
| 4 | IQ3_XXS | 0.9671 | 157.59 | 3.02 |
| 5 | IQ3_S | 0.9654 | 157.65 | 3.13 |
| 6 | Q3_K_S | 0.9863 | 148.82 | 3.52 |
| 7 | Q3_K_M | 0.9817 | 150.90 | 3.77 |
| 8 | Q3_K_XL | 0.9843 | 150.95 | 3.83 |
| 9 | IQ4_XS | 0.9814 | 150.24 | 4.03 |
| 10 | IQ4_NL | 0.9870 | 151.58 | 4.11 |
| 11 | Q4_K_L | 0.9956 | 159.96 | 4.66 |
| 12 | Q4_K_S | 0.9861 | 149.69 | 4.77 |
| 13 | MXFP4_MOE | 0.9823 | 147.13 | 4.98 |
| 14 | Q4_K_M | 0.9884 | 149.31 | 5.08 |
| 15 | Q4_K_XL | 0.9875 | 146.56 | 5.13 |
The 4090 is actually one of the clearest charts here.
GPU-6 is the conservative, practically same-as-baseline accuracy pick. GPU-5, though, holds essentially baseline quality while adding a meaningful speed boost.
Below those, GPU-4 remains a solid balance pick, and GPU-3 is the faster, more aggressive pick if you are willing to give up more quality and free space for more context.
So on the 4090:
- GPU-6 if you want the safest near-baseline choice,
- GPU-5 if you want the more exciting high-end pick,
- GPU-4 if you want a more classic speed/quality balance.
RTX 4080 (16 GB)
| # | Model | Acc | TPS | BPW |
|---|---|---|---|---|
| ByteShape | ||||
| GPU-1 | 2.17bpw | 0.8826 | 160.49 | 2.17 |
| GPU-2 | 2.73bpw | 0.9387 | 157.17 | 2.73 |
| GPU-3 | 2.89bpw | 0.9641 | 154.14 | 2.89 |
| GPU-4 | 3.40bpw | 0.9849 | 145.95 | 3.40 |
| Unsloth | ||||
| 1 | IQ2_XXS | 0.9191 | 145.23 | 2.46 |
| 2 | IQ2_M | 0.9468 | 144.14 | 2.63 |
| 3 | Q2_K_XL | 0.9501 | 143.66 | 2.80 |
| 4 | IQ3_XXS | 0.9671 | 134.83 | 3.02 |
| 5 | IQ3_S | 0.9654 | 134.84 | 3.13 |
| 6 | Q3_K_S | 0.9863 | 125.95 | 3.52 |
Similar to the 4090, but with less VRAM and therefore fewer options.
GPU-4 is the clear recommendation: it sits around the mid-140s TPS while still landing at roughly 98.5% of baseline quality. That is a very clean trade-off with a maximum of 16K context length.
If you want more speed or bigger context, GPU-3 is the faster option, pushing into the mid-150s TPS range while staying around 96.4% quality. After that, GPU-2 and GPU-1 get increasingly aggressive, and by that point the quality trade-off is much harder to justify unless your priority is simply maximum throughput, or a smaller model footprint to allow for bigger context windows.
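The BPW column translates almost directly into file size: parameters times bits-per-weight, divided by eight bits per byte. A rough sketch, assuming a nominal 35B parameter count and ignoring GGUF metadata overhead (so real files run slightly larger):

```python
def gguf_size_gb(n_params: float, bpw: float) -> float:
    """Approximate quantized model file size in GB:
    parameters x bits-per-weight / 8 bits per byte / 1e9 bytes per GB.
    Ignores GGUF metadata and tokenizer overhead."""
    return n_params * bpw / 8 / 1e9

# Nominal 35B parameters; the exact count differs slightly in practice
for name, bpw in [("GPU-3", 2.89), ("GPU-4", 3.40)]:
    print(f"{name}: ~{gguf_size_gb(35e9, bpw):.1f} GB")
# → GPU-3: ~12.6 GB
# → GPU-4: ~14.9 GB
```

That gap of roughly 2 GB is the VRAM you can reclaim for KV cache, which is why the lower-BPW picks buy you longer context on a 16 GB card.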
So for the 4080, the recommendation is straightforward: GPU-4, with GPU-3 as the faster, bigger-context alternative.
RTX 5090 (32 GB)
| # | Model | Acc | TPS | BPW |
|---|---|---|---|---|
| ByteShape | ||||
| GPU-1 | 2.17bpw | 0.8826 | 202.97 | 2.17 |
| GPU-2 | 2.73bpw | 0.9387 | 194.92 | 2.73 |
| GPU-3 | 2.89bpw | 0.9641 | 197.28 | 2.89 |
| GPU-4 | 3.40bpw | 0.9849 | 187.82 | 3.40 |
| GPU-5 | 4.06bpw | 0.9969 | 196.91 | 4.06 |
| GPU-6 | 4.12bpw | 0.9981 | 190.53 | 4.12 |
| Unsloth | ||||
| 1 | IQ2_XXS | 0.9191 | 193.02 | 2.46 |
| 2 | IQ2_M | 0.9468 | 192.95 | 2.63 |
| 3 | Q2_K_XL | 0.9501 | 192.78 | 2.80 |
| 4 | IQ3_XXS | 0.9671 | 183.84 | 3.02 |
| 5 | IQ3_S | 0.9654 | 181.40 | 3.13 |
| 6 | Q3_K_S | 0.9863 | 178.10 | 3.52 |
| 7 | Q3_K_M | 0.9817 | 181.96 | 3.77 |
| 8 | Q3_K_XL | 0.9843 | 181.82 | 3.83 |
| 9 | IQ4_XS | 0.9814 | 182.02 | 4.03 |
| 10 | IQ4_NL | 0.9870 | 180.71 | 4.11 |
| 11 | Q4_K_L | 0.9956 | 190.59 | 4.66 |
| 12 | Q4_K_S | 0.9861 | 180.82 | 4.77 |
| 13 | MXFP4_MOE | 0.9823 | 180.43 | 4.98 |
| 14 | Q4_K_M | 0.9884 | 182.30 | 5.08 |
| 15 | Q4_K_XL | 0.9875 | 179.57 | 5.13 |
| 16 | Q5_K_S | 0.9862 | 178.65 | 5.73 |
| 17 | Q5_K_M | 0.9875 | 177.91 | 6.06 |
| 18 | Q5_K_XL | 0.9877 | 177.69 | 6.09 |
| 19 | Q6_K_S | 0.9922 | 181.63 | 6.58 |
| 20 | Q6_K | 0.9965 | 179.47 | 6.66 |
| 21 | Q6_K_XL | 0.9967 | 173.75 | 7.40 |
Reminder: Grayed-out models are shown for completeness. We do not recommend them for this GPU, but include them for comparison with the older GPUs, where they perform best.
This is where Blackwell's preferences really start to stand out. In particular, on Blackwell with llama.cpp kernels, IQ datatypes perform better, while K-quant (Q*_K) datatypes are less competitive. As a result, GPU-4 and GPU-2 do not fall on the ideal quality-performance curve, because most of their layers are dominated by K-quant datatypes.
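The quant names in the tables encode their family: IQ* entries are llama.cpp's i-quants, while Q*_K entries are the K-quants. A hypothetical helper (the classification rule is ours, inferred from the naming convention, not part of llama.cpp) to tell them apart:

```python
def quant_family(name: str) -> str:
    """Classify a llama.cpp quant name by family based on its spelling:
    IQ* -> i-quant, Q*_K* -> k-quant, everything else -> other."""
    if name.startswith("IQ"):
        return "i-quant"
    if name.startswith("Q") and "_K" in name:
        return "k-quant"
    return "other"

for n in ["IQ3_XXS", "Q3_K_S", "Q8_0", "MXFP4_MOE"]:
    print(n, "->", quant_family(n))
```

This is only a naming heuristic; what actually matters for Blackwell is which datatypes dominate a model's layers, which mixed quants like ours do not expose in the filename.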
On Blackwell, GPU-5 looks excellent. It gets very close to baseline quality while pushing close to the 200 TPS range.
GPU-6 is still there as the high-quality anchor, and it is also fast. But if you are looking at the 5090 chart and asking "what is the new thing here?", the answer is clearly GPU-5.
So for the 5090: GPU-5 is our recommendation, with GPU-6 if you want to stay closest to baseline quality.
RTX Pro 6000 Blackwell (96 GB)
| # | Model | Acc | TPS | BPW |
|---|---|---|---|---|
| ByteShape | ||||
| GPU-1 | 2.17bpw | 0.8826 | 202.77 | 2.17 |
| GPU-2 | 2.73bpw | 0.9387 | 193.76 | 2.73 |
| GPU-3 | 2.89bpw | 0.9641 | 196.72 | 2.89 |
| GPU-4 | 3.40bpw | 0.9849 | 187.31 | 3.40 |
| GPU-5 | 4.06bpw | 0.9969 | 194.42 | 4.06 |
| GPU-6 | 4.12bpw | 0.9981 | 186.67 | 4.12 |
| Unsloth | ||||
| 1 | IQ2_XXS | 0.9191 | 191.71 | 2.46 |
| 2 | IQ2_M | 0.9468 | 192.25 | 2.63 |
| 3 | Q2_K_XL | 0.9501 | 191.27 | 2.80 |
| 4 | IQ3_XXS | 0.9671 | 183.36 | 3.02 |
| 5 | IQ3_S | 0.9654 | 181.24 | 3.13 |
| 6 | Q3_K_S | 0.9863 | 177.25 | 3.52 |
| 7 | Q3_K_M | 0.9817 | 181.00 | 3.77 |
| 8 | Q3_K_XL | 0.9843 | 179.91 | 3.83 |
| 9 | IQ4_XS | 0.9814 | 180.14 | 4.03 |
| 10 | IQ4_NL | 0.9870 | 180.16 | 4.11 |
| 11 | Q4_K_L | 0.9956 | 188.66 | 4.66 |
| 12 | Q4_K_S | 0.9861 | 178.55 | 4.77 |
| 13 | MXFP4_MOE | 0.9823 | 177.43 | 4.98 |
| 14 | Q4_K_M | 0.9884 | 178.87 | 5.08 |
| 15 | Q4_K_XL | 0.9875 | 176.18 | 5.13 |
| 16 | Q5_K_S | 0.9862 | 176.10 | 5.73 |
| 17 | Q5_K_M | 0.9875 | 175.33 | 6.06 |
| 18 | Q5_K_XL | 0.9877 | 174.72 | 6.09 |
| 19 | Q6_K_S | 0.9922 | 179.47 | 6.58 |
| 20 | Q6_K | 0.9965 | 177.41 | 6.66 |
| 21 | Q6_K_XL | 0.9967 | 171.02 | 7.40 |
| 22 | Q8_0 | 0.9966 | 170.09 | 8.52 |
Reminder: Grayed-out models are shown for completeness. We do not recommend them for this GPU, but include them for comparison with the older GPUs, where they perform best.
Very similar story to the 5090, which is exactly what you would expect.
Again, GPU-5 looks unusually strong here. It stays essentially on top of the quality chart while delivering a large speed advantage, and once again it looks especially well aligned with Blackwell.
GPU-6 is the safe pick if your top priority is hugging the baseline as closely as possible. GPU-4 is still fine. But the interesting result is the same as on the 5090: GPU-5 is the one that makes you stop and look twice.
So, déjà vu, for the RTX Pro 6000 Blackwell (workstation): GPU-5 is again our recommendation, with GPU-6 as the quality-first pick.
CPUs
Picking the right quant for CPUs is more straightforward.
Intel Core i7 12700KF
| # | Model | Acc | TPS | BPW |
|---|---|---|---|---|
| ByteShape | ||||
| CPU-1 | 2.69bpw | 0.9299 | 11.83 | 2.69 |
| CPU-2 | 2.89bpw | 0.9641 | 10.67 | 2.89 |
| CPU-3 | 3.40bpw | 0.9849 | 10.18 | 3.40 |
| CPU-4 | 3.51bpw | 0.9858 | 9.63 | 3.51 |
| CPU-5 | 4.06bpw | 0.9969 | 9.02 | 4.06 |
| CPU-6 | 4.12bpw | 0.9981 | 8.23 | 4.12 |
| Unsloth | ||||
| 1 | IQ2_XXS | 0.9191 | 10.08 | 2.46 |
| 2 | IQ2_M | 0.9468 | 9.92 | 2.63 |
| 3 | Q2_K_XL | 0.9501 | 9.88 | 2.80 |
| 4 | IQ3_XXS | 0.9671 | 9.04 | 3.02 |
| 5 | IQ3_S | 0.9654 | 8.91 | 3.13 |
| 6 | Q3_K_S | 0.9863 | 8.04 | 3.52 |
| 7 | Q3_K_M | 0.9817 | 8.05 | 3.77 |
| 8 | Q3_K_XL | 0.9843 | 8.05 | 3.83 |
| 9 | IQ4_XS | 0.9814 | 7.95 | 4.03 |
| 10 | IQ4_NL | 0.9870 | 7.91 | 4.11 |
| 11 | Q4_K_L | 0.9956 | 8.39 | 4.66 |
| 12 | Q4_K_S | 0.9861 | 7.84 | 4.77 |
| 13 | MXFP4_MOE | 0.9823 | 7.52 | 4.98 |
| 14 | Q4_K_M | 0.9884 | 7.76 | 5.08 |
| 15 | Q4_K_XL | 0.9875 | 7.49 | 5.13 |
| 16 | Q5_K_S | 0.9862 | 7.39 | 5.73 |
| 17 | Q5_K_M | 0.9875 | 7.26 | 6.06 |
| 18 | Q5_K_XL | 0.9877 | 7.22 | 6.09 |
| 19 | Q6_K_S | 0.9922 | 7.92 | 6.58 |
| 20 | Q6_K | 0.9965 | 7.53 | 6.66 |
| 21 | Q6_K_XL | 0.9967 | 6.57 | 7.40 |
| 22 | Q8_0 | 0.9966 | 6.74 | 8.52 |
CPU-6 is the near-baseline option. CPU-5 is probably the default pick for most people: still very close to baseline quality, but with a bit more speed. CPU-4 is the more aggressive/balanced option, and after that we get into more noticeable quality vs. speed trade-offs.
So, for the i7:
- CPU-5 is the default recommendation,
- CPU-6 if you want the highest quality,
- CPU-4 if you want more speed without getting too adventurous.
Ryzen 9 5900X
| # | Model | Acc | TPS | BPW |
|---|---|---|---|---|
| ByteShape | ||||
| CPU-1 | 2.69bpw | 0.9299 | 10.87 | 2.69 |
| CPU-2 | 2.89bpw | 0.9641 | 10.19 | 2.89 |
| CPU-3 | 3.40bpw | 0.9849 | 9.94 | 3.40 |
| CPU-4 | 3.51bpw | 0.9858 | 9.51 | 3.51 |
| CPU-5 | 4.06bpw | 0.9969 | 9.07 | 4.06 |
| CPU-6 | 4.12bpw | 0.9981 | 8.57 | 4.12 |
| Unsloth | ||||
| 1 | IQ2_XXS | 0.9191 | 9.74 | 2.46 |
| 2 | IQ2_M | 0.9468 | 9.66 | 2.63 |
| 3 | Q2_K_XL | 0.9501 | 9.60 | 2.80 |
| 4 | IQ3_XXS | 0.9671 | 9.06 | 3.02 |
| 5 | IQ3_S | 0.9654 | 8.99 | 3.13 |
| 6 | Q3_K_S | 0.9863 | 8.30 | 3.52 |
| 7 | Q3_K_M | 0.9817 | 8.36 | 3.77 |
| 8 | Q3_K_XL | 0.9843 | 8.42 | 3.83 |
| 9 | IQ4_XS | 0.9814 | 8.27 | 4.03 |
| 10 | IQ4_NL | 0.9870 | 8.25 | 4.11 |
| 11 | Q4_K_L | 0.9956 | 8.68 | 4.66 |
| 12 | Q4_K_S | 0.9861 | 8.36 | 4.77 |
| 13 | MXFP4_MOE | 0.9823 | 8.00 | 4.98 |
| 14 | Q4_K_M | 0.9884 | 8.18 | 5.08 |
| 15 | Q4_K_XL | 0.9875 | 7.97 | 5.13 |
| 16 | Q5_K_S | 0.9862 | 7.83 | 5.73 |
| 17 | Q5_K_M | 0.9875 | 7.79 | 6.06 |
| 18 | Q5_K_XL | 0.9877 | 7.80 | 6.09 |
| 19 | Q6_K_S | 0.9922 | 8.33 | 6.58 |
| 20 | Q6_K | 0.9965 | 8.01 | 6.66 |
| 21 | Q6_K_XL | 0.9967 | 7.27 | 7.40 |
| 22 | Q8_0 | 0.9966 | 7.42 | 8.52 |
Almost the same story as the i7, which is refreshing.
The frontier is clean, the recommendations are clean, and the overall trade-off pattern looks very familiar. Again, CPU-5 remains a great default choice, CPU-6 is the quality-first choice, and CPU-4 is the more speed-oriented alternative.
Ultra 7 265KF
| # | Model | Acc | TPS | BPW |
|---|---|---|---|---|
| ByteShape | ||||
| CPU-1 | 2.69bpw | 0.9299 | 14.00 | 2.69 |
| CPU-2 | 2.89bpw | 0.9641 | 13.63 | 2.89 |
| CPU-3 | 3.40bpw | 0.9849 | 12.90 | 3.40 |
| CPU-4 | 3.51bpw | 0.9858 | 12.68 | 3.51 |
| CPU-5 | 4.06bpw | 0.9969 | 12.27 | 4.06 |
| CPU-6 | 4.12bpw | 0.9981 | 11.79 | 4.12 |
| Unsloth | ||||
| 1 | IQ2_XXS | 0.9191 | 12.71 | 2.46 |
| 2 | IQ2_M | 0.9468 | 12.73 | 2.63 |
| 3 | Q2_K_XL | 0.9501 | 12.12 | 2.80 |
| 4 | IQ3_XXS | 0.9671 | 12.00 | 3.02 |
| 5 | IQ3_S | 0.9654 | 11.42 | 3.13 |
| 6 | Q3_K_S | 0.9863 | 10.60 | 3.52 |
| 7 | Q3_K_M | 0.9817 | 10.29 | 3.77 |
| 8 | Q3_K_XL | 0.9843 | 11.00 | 3.83 |
| 9 | IQ4_XS | 0.9814 | 10.52 | 4.03 |
| 10 | IQ4_NL | 0.9870 | 10.40 | 4.11 |
| 11 | Q4_K_L | 0.9956 | 11.41 | 4.66 |
| 12 | Q4_K_S | 0.9861 | 10.58 | 4.77 |
| 13 | MXFP4_MOE | 0.9823 | 10.37 | 4.98 |
| 14 | Q4_K_M | 0.9884 | 10.52 | 5.08 |
| 15 | Q4_K_XL | 0.9875 | 10.37 | 5.13 |
| 16 | Q5_K_S | 0.9862 | 10.39 | 5.73 |
| 17 | Q5_K_M | 0.9875 | 10.26 | 6.06 |
| 18 | Q5_K_XL | 0.9877 | 10.55 | 6.09 |
| 19 | Q6_K_S | 0.9922 | 11.16 | 6.58 |
| 20 | Q6_K | 0.9965 | 10.87 | 6.66 |
| 21 | Q6_K_XL | 0.9967 | 9.18 | 7.40 |
| 22 | Q8_0 | 0.9966 | 9.74 | 8.52 |
The Ultra 7 is just good news all around.
Here again, the same ByteShape models sit on the frontier, but the absolute speeds are better. Sound familiar? CPU-5 is a great default, CPU-6 is the near-baseline pick, and CPU-4 or CPU-3 are there if you want to push harder on speed.
Raspberry Pi 5 (16 GB)
| # | Model | Acc | TPS | BPW |
|---|---|---|---|---|
| ByteShape | ||||
| CPU-1 | 2.69bpw | 0.9299 | 3.29 | 2.69 |
| CPU-2 | 2.89bpw | 0.9641 | 3.04 | 2.89 |
| CPU-3 | 3.40bpw | 0.9849 | 2.96 | 3.40 |
| CPU-4 | 3.51bpw | 0.9858 | 2.84 | 3.51 |
| Unsloth | ||||
| 1 | IQ2_XXS | 0.9191 | 2.80 | 2.46 |
| 2 | IQ2_M | 0.9468 | 2.74 | 2.63 |
| 3 | Q2_K_XL | 0.9501 | 2.73 | 2.80 |
| 4 | IQ3_XXS | 0.9671 | 2.58 | 3.02 |
| 5 | IQ3_S | 0.9654 | 2.53 | 3.13 |
| 6 | Q3_K_S | 0.9863 | 2.37 | 3.52 |
It's alive, barely, but still kicking. For Pi deployments, you may want to look at Qwen3-Coder instead, which is nearly 3x faster.
CPU-4 and CPU-3 are worth a closer look, landing near 3 TPS while keeping quality in a range that is still genuinely usable.
Benchmarking Methodology
Generating our models takes little time. What takes disproportionately longer is evaluating every reported model across the following seven benchmarks:
- BFCL_V3 for tool calling
- GSM8K_V for vision + math
- LiveCodeBench V6 and HumanEval for coding
- GSM8K for math
- IFEVAL for instruction following
- MMLU for general knowledge
The reported score is the mean across these benchmarks, with each benchmark normalized to the original model's score.
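Concretely, each benchmark score is divided by the unquantized baseline's score on that benchmark, and those ratios are averaged. A minimal sketch with hypothetical benchmark numbers (these are for illustration only, not our measured scores):

```python
def normalized_mean(quant_scores: dict, baseline_scores: dict) -> float:
    """Average of per-benchmark ratios (quantized / baseline).
    A value of 1.0 means no measurable quality loss versus the original model."""
    ratios = [quant_scores[b] / baseline_scores[b] for b in baseline_scores]
    return sum(ratios) / len(ratios)

# Hypothetical scores, for illustration only
baseline = {"GSM8K": 0.90, "IFEVAL": 0.80, "MMLU": 0.75}
quant = {"GSM8K": 0.88, "IFEVAL": 0.80, "MMLU": 0.72}
print(round(normalized_mean(quant, baseline), 4))  # → 0.9793
```

Normalizing per benchmark before averaging keeps an easy benchmark (high absolute scores) from drowning out regressions on a hard one; this is why the Acc column reads as a fraction of baseline quality rather than a raw score.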
We evaluated GSM8K_V in both instruct and thinking modes and treated them as separate entries in the average. In practice, we observe that the relative performance between modes remains consistent.
All evaluations were run with llama.cpp b8204.