Blackwell Picks Favorites:
Qwen 3.5 35B A3B
Here's our next ByteShape release: Qwen 3.5 35B A3B. This time it is a Mixture-of-Experts model, not a dense one like the 9B, and the hardware story flips almost entirely.
With the 9B, GPUs were fairly agreeable while CPUs had strong, diverging preferences. For this 35B model it is the reverse: CPUs are surprisingly consistent, while GPUs are much pickier about which quantized models run best on each card.
That is why we are presenting models a bit differently this time. In the GPU charts, we highlight the models that are the best match for each specific card and gray out the ones that are not really the right choice. There is no single GPU pick that works best everywhere, but once you know your hardware, the right options become fairly clear.
We also have a step-by-step tutorial on how to run our models locally with OpenCode.
And… as always, be careful with any tool you choose to grant access to.
TL;DR
On CPUs, the picture is clean. ByteShape models trace a very consistent speed versus quality frontier across the i7, Ryzen 9 5900X, Ultra 7, and even the Raspberry Pi 5, so the recommendations barely change from one system to another.
On GPUs, device-specific optimization matters a lot more. The 40-series cards clearly like one set of models, while the Blackwell cards prefer others.
For the impatient:
- RTX 4080: GPU-4 is the obvious balanced pick.
- RTX 4090: GPU-5 would be our recommendation for best speed vs. accuracy, while GPU-6 practically matches baseline accuracy.
- RTX 5090 / RTX Pro 6000 Blackwell: GPU-5 remains our recommendation.
- CPUs: CPU-5 is our recommendation for all CPUs, with CPU-6 if you want to stay as close to baseline as possible.
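The "frontier" language used throughout this post can be made precise: a model sits on the speed-versus-quality frontier when no other model beats it on both tokens-per-second and accuracy at once. A minimal sketch of that check, using four rows from the Ryzen 9 5900X table below:

```python
def pareto_frontier(points):
    """Return names of points not dominated in (TPS, accuracy): a point is on
    the frontier if no other point is both at-least-as-fast and at-least-as-
    accurate, with a strict improvement on at least one axis."""
    frontier = []
    for name, tps, acc in points:
        dominated = any(
            (t > tps and a >= acc) or (t >= tps and a > acc)
            for n, t, a in points if n != name
        )
        if not dominated:
            frontier.append(name)
    return frontier

# (name, TPS, accuracy) taken from the Ryzen 9 5900X table below
points = [
    ("CPU-5", 9.07, 0.9969),
    ("CPU-6", 8.57, 0.9981),
    ("Q4_K_L", 8.68, 0.9956),
    ("IQ4_NL", 8.25, 0.9870),
]
print(pareto_frontier(points))  # → ['CPU-5', 'CPU-6']
```

Q4_K_L drops out because CPU-5 is both faster and more accurate; CPU-5 and CPU-6 survive because neither dominates the other.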
GPUs
Unlike the usual pattern, where the same ByteShape models tend to perform best across all GPUs, Blackwell shows a much stronger preference for specific datatypes. The charts make this clear, so we highlight the models that best match what Blackwell favors, while still releasing models that remain strong choices on older GPUs as well.
RTX 4090 (24 GB)
| # | Model | Acc | TPS | BPW |
|---|---|---|---|---|
| ByteShape | ||||
| GPU-1 | 2.17bpw | 0.8826 | 183.02 | 2.17 |
| GPU-2 | 2.73bpw | 0.9387 | 180.49 | 2.73 |
| GPU-3 | 2.89bpw | 0.9641 | 176.35 | 2.89 |
| GPU-4 | 3.40bpw | 0.9849 | 168.98 | 3.40 |
| GPU-5 | 4.06bpw | 0.9969 | 164.77 | 4.06 |
| GPU-6 | 4.12bpw | 0.9981 | 156.44 | 4.12 |
| Unsloth | ||||
| 1 | IQ2_XXS | 0.9191 | 167.54 | 2.46 |
| 2 | IQ2_M | 0.9468 | 167.34 | 2.63 |
| 3 | Q2_K_XL | 0.9501 | 166.66 | 2.80 |
| 4 | IQ3_XXS | 0.9671 | 157.59 | 3.02 |
| 5 | IQ3_S | 0.9654 | 157.65 | 3.13 |
| 6 | Q3_K_S | 0.9863 | 148.82 | 3.52 |
| 7 | Q3_K_M | 0.9817 | 150.90 | 3.77 |
| 8 | Q3_K_XL | 0.9843 | 150.95 | 3.83 |
| 9 | IQ4_XS | 0.9814 | 150.24 | 4.03 |
| 10 | IQ4_NL | 0.9870 | 151.58 | 4.11 |
| 11 | Q4_K_L | 0.9956 | 159.96 | 4.66 |
| 12 | Q4_K_S | 0.9861 | 149.69 | 4.77 |
| 13 | MXFP4_MOE | 0.9823 | 147.13 | 4.98 |
| 14 | Q4_K_M | 0.9884 | 149.31 | 5.08 |
| 15 | Q4_K_XL | 0.9875 | 146.56 | 5.13 |
The 4090 is actually one of the clearest charts here.
GPU-6 is the conservative, practically same-as-baseline accuracy pick. GPU-5, though, holds essentially baseline quality while adding a meaningful speed boost.
Below those, GPU-4 remains a solid balance pick, and GPU-3 is the faster, more aggressive pick if you are willing to give up more quality and free space for more context.
So on the 4090:
- GPU-6 if you want the safest near-baseline choice,
- GPU-5 if you want the more exciting high-end pick,
- GPU-4 if you want a more classic speed/quality balance.
RTX 4080 (16 GB)
| # | Model | Acc | TPS | BPW |
|---|---|---|---|---|
| ByteShape | ||||
| GPU-1 | 2.17bpw | 0.8826 | 160.49 | 2.17 |
| GPU-2 | 2.73bpw | 0.9387 | 157.17 | 2.73 |
| GPU-3 | 2.89bpw | 0.9641 | 154.14 | 2.89 |
| GPU-4 | 3.40bpw | 0.9849 | 145.95 | 3.40 |
| Unsloth | ||||
| 1 | IQ2_XXS | 0.9191 | 145.23 | 2.46 |
| 2 | IQ2_M | 0.9468 | 144.14 | 2.63 |
| 3 | Q2_K_XL | 0.9501 | 143.66 | 2.80 |
| 4 | IQ3_XXS | 0.9671 | 134.83 | 3.02 |
| 5 | IQ3_S | 0.9654 | 134.84 | 3.13 |
| 6 | Q3_K_S | 0.9863 | 125.95 | 3.52 |
Similar to the 4090, but with less VRAM and therefore fewer options.
GPU-4 is the clear recommendation: it sits around the mid-140s TPS while still landing at roughly 98.5% of baseline quality. That is a very clean trade-off with a maximum of 16K context length.
If you want more speed or bigger context, GPU-3 is the faster option, pushing into the mid-150s TPS range while staying around 96.4% quality. After that, GPU-2 and GPU-1 get increasingly aggressive, and by that point the quality trade-off is much harder to justify unless your priority is simply maximum throughput, or a smaller model footprint to allow for bigger context windows.
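The BPW column translates almost directly into file size: parameters times bits-per-weight, divided by eight bits per byte. A rough sketch, assuming a nominal 35B parameter count and ignoring GGUF metadata overhead (so real files run slightly larger):

```python
def gguf_size_gb(n_params: float, bpw: float) -> float:
    """Approximate quantized model file size in GB:
    parameters x bits-per-weight / 8 bits per byte / 1e9 bytes per GB.
    Ignores GGUF metadata and tokenizer overhead."""
    return n_params * bpw / 8 / 1e9

# Nominal 35B parameters; the exact count differs slightly in practice
for name, bpw in [("GPU-3", 2.89), ("GPU-4", 3.40)]:
    print(f"{name}: ~{gguf_size_gb(35e9, bpw):.1f} GB")
# → GPU-3: ~12.6 GB
# → GPU-4: ~14.9 GB
```

That gap of roughly 2 GB is the VRAM you can reclaim for KV cache, which is why the lower-BPW picks buy you longer context on a 16 GB card.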
So for the 4080, the recommendation is straightforward: GPU-4, with GPU-3 as the faster, bigger-context alternative.
RTX 5090 (32 GB)
| # | Model | Acc | TPS | BPW |
|---|---|---|---|---|
| ByteShape | ||||
| GPU-1 | 2.17bpw | 0.8826 | 202.97 | 2.17 |
| GPU-2 | 2.73bpw | 0.9387 | 194.92 | 2.73 |
| GPU-3 | 2.89bpw | 0.9641 | 197.28 | 2.89 |
| GPU-4 | 3.40bpw | 0.9849 | 187.82 | 3.40 |
| GPU-5 | 4.06bpw | 0.9969 | 196.91 | 4.06 |
| GPU-6 | 4.12bpw | 0.9981 | 190.53 | 4.12 |
| Unsloth | ||||
| 1 | IQ2_XXS | 0.9191 | 193.02 | 2.46 |
| 2 | IQ2_M | 0.9468 | 192.95 | 2.63 |
| 3 | Q2_K_XL | 0.9501 | 192.78 | 2.80 |
| 4 | IQ3_XXS | 0.9671 | 183.84 | 3.02 |
| 5 | IQ3_S | 0.9654 | 181.40 | 3.13 |
| 6 | Q3_K_S | 0.9863 | 178.10 | 3.52 |
| 7 | Q3_K_M | 0.9817 | 181.96 | 3.77 |
| 8 | Q3_K_XL | 0.9843 | 181.82 | 3.83 |
| 9 | IQ4_XS | 0.9814 | 182.02 | 4.03 |
| 10 | IQ4_NL | 0.9870 | 180.71 | 4.11 |
| 11 | Q4_K_L | 0.9956 | 190.59 | 4.66 |
| 12 | Q4_K_S | 0.9861 | 180.82 | 4.77 |
| 13 | MXFP4_MOE | 0.9823 | 180.43 | 4.98 |
| 14 | Q4_K_M | 0.9884 | 182.30 | 5.08 |
| 15 | Q4_K_XL | 0.9875 | 179.57 | 5.13 |
| 16 | Q5_K_S | 0.9862 | 178.65 | 5.73 |
| 17 | Q5_K_M | 0.9875 | 177.91 | 6.06 |
| 18 | Q5_K_XL | 0.9877 | 177.69 | 6.09 |
| 19 | Q6_K_S | 0.9922 | 181.63 | 6.58 |
| 20 | Q6_K | 0.9965 | 179.47 | 6.66 |
| 21 | Q6_K_XL | 0.9967 | 173.75 | 7.40 |
Reminder: Grayed-out models are shown for completeness. We do not recommend them for this GPU, but include them for comparison with the older GPUs, where they perform best.
This is where Blackwell's preferences really start to stand out. In particular, on Blackwell with llama.cpp kernels, IQ datatypes perform better, while K-quant (Q*_K) datatypes are less competitive. As a result, GPU-4 and GPU-2 do not fall on the ideal quality-performance curve, because most of their layers are dominated by K-quant datatypes.
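The quant names in the tables encode their family: IQ* entries are llama.cpp's i-quants, while Q*_K entries are the K-quants. A hypothetical helper (the classification rule is ours, inferred from the naming convention, not part of llama.cpp) to tell them apart:

```python
def quant_family(name: str) -> str:
    """Classify a llama.cpp quant name by family based on its spelling:
    IQ* -> i-quant, Q*_K* -> k-quant, everything else -> other."""
    if name.startswith("IQ"):
        return "i-quant"
    if name.startswith("Q") and "_K" in name:
        return "k-quant"
    return "other"

for n in ["IQ3_XXS", "Q3_K_S", "Q8_0", "MXFP4_MOE"]:
    print(n, "->", quant_family(n))
```

This is only a naming heuristic; what actually matters for Blackwell is which datatypes dominate a model's layers, which mixed quants like ours do not expose in the filename.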
On Blackwell, GPU-5 looks excellent. It gets very close to baseline quality while pushing close to the 200 TPS range.
GPU-6 is still there as the high-quality anchor, and it is also fast. But if you are looking at the 5090 chart and asking "what is the new thing here?", the answer is clearly GPU-5.
So for the 5090: GPU-5 is our recommendation, with GPU-6 if you want to stay closest to baseline quality.
RTX Pro 6000 Blackwell (96 GB)
| # | Model | Acc | TPS | BPW |
|---|---|---|---|---|
| ByteShape | ||||
| GPU-1 | 2.17bpw | 0.8826 | 202.77 | 2.17 |
| GPU-2 | 2.73bpw | 0.9387 | 193.76 | 2.73 |
| GPU-3 | 2.89bpw | 0.9641 | 196.72 | 2.89 |
| GPU-4 | 3.40bpw | 0.9849 | 187.31 | 3.40 |
| GPU-5 | 4.06bpw | 0.9969 | 194.42 | 4.06 |
| GPU-6 | 4.12bpw | 0.9981 | 186.67 | 4.12 |
| Unsloth | ||||
| 1 | IQ2_XXS | 0.9191 | 191.71 | 2.46 |
| 2 | IQ2_M | 0.9468 | 192.25 | 2.63 |
| 3 | Q2_K_XL | 0.9501 | 191.27 | 2.80 |
| 4 | IQ3_XXS | 0.9671 | 183.36 | 3.02 |
| 5 | IQ3_S | 0.9654 | 181.24 | 3.13 |
| 6 | Q3_K_S | 0.9863 | 177.25 | 3.52 |
| 7 | Q3_K_M | 0.9817 | 181.00 | 3.77 |
| 8 | Q3_K_XL | 0.9843 | 179.91 | 3.83 |
| 9 | IQ4_XS | 0.9814 | 180.14 | 4.03 |
| 10 | IQ4_NL | 0.9870 | 180.16 | 4.11 |
| 11 | Q4_K_L | 0.9956 | 188.66 | 4.66 |
| 12 | Q4_K_S | 0.9861 | 178.55 | 4.77 |
| 13 | MXFP4_MOE | 0.9823 | 177.43 | 4.98 |
| 14 | Q4_K_M | 0.9884 | 178.87 | 5.08 |
| 15 | Q4_K_XL | 0.9875 | 176.18 | 5.13 |
| 16 | Q5_K_S | 0.9862 | 176.10 | 5.73 |
| 17 | Q5_K_M | 0.9875 | 175.33 | 6.06 |
| 18 | Q5_K_XL | 0.9877 | 174.72 | 6.09 |
| 19 | Q6_K_S | 0.9922 | 179.47 | 6.58 |
| 20 | Q6_K | 0.9965 | 177.41 | 6.66 |
| 21 | Q6_K_XL | 0.9967 | 171.02 | 7.40 |
| 22 | Q8_0 | 0.9966 | 170.09 | 8.52 |
Reminder: Grayed-out models are shown for completeness. We do not recommend them for this GPU, but include them for comparison with the older GPUs, where they perform best.
Very similar story to the 5090, which is exactly what you would expect.
Again, GPU-5 looks unusually strong here. It stays essentially on top of the quality chart while delivering a large speed advantage, and once again it looks especially well aligned with Blackwell.
GPU-6 is the safe pick if your top priority is hugging the baseline as closely as possible. GPU-4 is still fine. But the interesting result is the same as on the 5090: GPU-5 is the one that makes you stop and look twice.
So, déjà vu, for the RTX Pro 6000 Blackwell (workstation): GPU-5 is again our recommendation, with GPU-6 as the quality-first pick.
CPUs
Picking the right quant for CPUs is more straightforward.
Intel Core i7 12700KF
| # | Model | Acc | TPS | BPW |
|---|---|---|---|---|
| ByteShape | ||||
| CPU-1 | 2.69bpw | 0.9299 | 11.83 | 2.69 |
| CPU-2 | 2.89bpw | 0.9641 | 10.67 | 2.89 |
| CPU-3 | 3.40bpw | 0.9849 | 10.18 | 3.40 |
| CPU-4 | 3.51bpw | 0.9858 | 9.63 | 3.51 |
| CPU-5 | 4.06bpw | 0.9969 | 9.02 | 4.06 |
| CPU-6 | 4.12bpw | 0.9981 | 8.23 | 4.12 |
| Unsloth | ||||
| 1 | IQ2_XXS | 0.9191 | 10.08 | 2.46 |
| 2 | IQ2_M | 0.9468 | 9.92 | 2.63 |
| 3 | Q2_K_XL | 0.9501 | 9.88 | 2.80 |
| 4 | IQ3_XXS | 0.9671 | 9.04 | 3.02 |
| 5 | IQ3_S | 0.9654 | 8.91 | 3.13 |
| 6 | Q3_K_S | 0.9863 | 8.04 | 3.52 |
| 7 | Q3_K_M | 0.9817 | 8.05 | 3.77 |
| 8 | Q3_K_XL | 0.9843 | 8.05 | 3.83 |
| 9 | IQ4_XS | 0.9814 | 7.95 | 4.03 |
| 10 | IQ4_NL | 0.9870 | 7.91 | 4.11 |
| 11 | Q4_K_L | 0.9956 | 8.39 | 4.66 |
| 12 | Q4_K_S | 0.9861 | 7.84 | 4.77 |
| 13 | MXFP4_MOE | 0.9823 | 7.52 | 4.98 |
| 14 | Q4_K_M | 0.9884 | 7.76 | 5.08 |
| 15 | Q4_K_XL | 0.9875 | 7.49 | 5.13 |
| 16 | Q5_K_S | 0.9862 | 7.39 | 5.73 |
| 17 | Q5_K_M | 0.9875 | 7.26 | 6.06 |
| 18 | Q5_K_XL | 0.9877 | 7.22 | 6.09 |
| 19 | Q6_K_S | 0.9922 | 7.92 | 6.58 |
| 20 | Q6_K | 0.9965 | 7.53 | 6.66 |
| 21 | Q6_K_XL | 0.9967 | 6.57 | 7.40 |
| 22 | Q8_0 | 0.9966 | 6.74 | 8.52 |
CPU-6 is the near-baseline option. CPU-5 is probably the default pick for most people: still very close to baseline quality, but with a bit more speed. CPU-4 is the more aggressive/balanced option, and after that we get into more noticeable quality vs. speed trade-offs.
So, for the i7:
- CPU-5 is the default recommendation,
- CPU-6 if you want the highest quality,
- CPU-4 if you want more speed without getting too adventurous.
Ryzen 9 5900X
| # | Model | Acc | TPS | BPW |
|---|---|---|---|---|
| ByteShape | ||||
| CPU-1 | 2.69bpw | 0.9299 | 10.87 | 2.69 |
| CPU-2 | 2.89bpw | 0.9641 | 10.19 | 2.89 |
| CPU-3 | 3.40bpw | 0.9849 | 9.94 | 3.40 |
| CPU-4 | 3.51bpw | 0.9858 | 9.51 | 3.51 |
| CPU-5 | 4.06bpw | 0.9969 | 9.07 | 4.06 |
| CPU-6 | 4.12bpw | 0.9981 | 8.57 | 4.12 |
| Unsloth | ||||
| 1 | IQ2_XXS | 0.9191 | 9.74 | 2.46 |
| 2 | IQ2_M | 0.9468 | 9.66 | 2.63 |
| 3 | Q2_K_XL | 0.9501 | 9.60 | 2.80 |
| 4 | IQ3_XXS | 0.9671 | 9.06 | 3.02 |
| 5 | IQ3_S | 0.9654 | 8.99 | 3.13 |
| 6 | Q3_K_S | 0.9863 | 8.30 | 3.52 |
| 7 | Q3_K_M | 0.9817 | 8.36 | 3.77 |
| 8 | Q3_K_XL | 0.9843 | 8.42 | 3.83 |
| 9 | IQ4_XS | 0.9814 | 8.27 | 4.03 |
| 10 | IQ4_NL | 0.9870 | 8.25 | 4.11 |
| 11 | Q4_K_L | 0.9956 | 8.68 | 4.66 |
| 12 | Q4_K_S | 0.9861 | 8.36 | 4.77 |
| 13 | MXFP4_MOE | 0.9823 | 8.00 | 4.98 |
| 14 | Q4_K_M | 0.9884 | 8.18 | 5.08 |
| 15 | Q4_K_XL | 0.9875 | 7.97 | 5.13 |
| 16 | Q5_K_S | 0.9862 | 7.83 | 5.73 |
| 17 | Q5_K_M | 0.9875 | 7.79 | 6.06 |
| 18 | Q5_K_XL | 0.9877 | 7.80 | 6.09 |
| 19 | Q6_K_S | 0.9922 | 8.33 | 6.58 |
| 20 | Q6_K | 0.9965 | 8.01 | 6.66 |
| 21 | Q6_K_XL | 0.9967 | 7.27 | 7.40 |
| 22 | Q8_0 | 0.9966 | 7.42 | 8.52 |
Almost the same story as the i7, which is refreshing.
The frontier is clean, the recommendations are clean, and the overall trade-off pattern looks very familiar. Again, CPU-5 remains a great default choice, CPU-6 is the quality-first choice, and CPU-4 is the more speed-oriented alternative.
Ultra 7 265KF
| # | Model | Acc | TPS | BPW |
|---|---|---|---|---|
| ByteShape | ||||
| CPU-1 | 2.69bpw | 0.9299 | 14.00 | 2.69 |
| CPU-2 | 2.89bpw | 0.9641 | 13.63 | 2.89 |
| CPU-3 | 3.40bpw | 0.9849 | 12.90 | 3.40 |
| CPU-4 | 3.51bpw | 0.9858 | 12.68 | 3.51 |
| CPU-5 | 4.06bpw | 0.9969 | 12.27 | 4.06 |
| CPU-6 | 4.12bpw | 0.9981 | 11.79 | 4.12 |
| Unsloth | ||||
| 1 | IQ2_XXS | 0.9191 | 12.71 | 2.46 |
| 2 | IQ2_M | 0.9468 | 12.73 | 2.63 |
| 3 | Q2_K_XL | 0.9501 | 12.12 | 2.80 |
| 4 | IQ3_XXS | 0.9671 | 12.00 | 3.02 |
| 5 | IQ3_S | 0.9654 | 11.42 | 3.13 |
| 6 | Q3_K_S | 0.9863 | 10.60 | 3.52 |
| 7 | Q3_K_M | 0.9817 | 10.29 | 3.77 |
| 8 | Q3_K_XL | 0.9843 | 11.00 | 3.83 |
| 9 | IQ4_XS | 0.9814 | 10.52 | 4.03 |
| 10 | IQ4_NL | 0.9870 | 10.40 | 4.11 |
| 11 | Q4_K_L | 0.9956 | 11.41 | 4.66 |
| 12 | Q4_K_S | 0.9861 | 10.58 | 4.77 |
| 13 | MXFP4_MOE | 0.9823 | 10.37 | 4.98 |
| 14 | Q4_K_M | 0.9884 | 10.52 | 5.08 |
| 15 | Q4_K_XL | 0.9875 | 10.37 | 5.13 |
| 16 | Q5_K_S | 0.9862 | 10.39 | 5.73 |
| 17 | Q5_K_M | 0.9875 | 10.26 | 6.06 |
| 18 | Q5_K_XL | 0.9877 | 10.55 | 6.09 |
| 19 | Q6_K_S | 0.9922 | 11.16 | 6.58 |
| 20 | Q6_K | 0.9965 | 10.87 | 6.66 |
| 21 | Q6_K_XL | 0.9967 | 9.18 | 7.40 |
| 22 | Q8_0 | 0.9966 | 9.74 | 8.52 |
The Ultra 7 is just good news all around.
Here again, the same ByteShape models sit on the frontier, but the absolute speeds are better. Sound familiar? CPU-5 is a great default, CPU-6 is the near-baseline pick, and CPU-4 or CPU-3 are there if you want to push harder on speed.
Raspberry Pi 5 (16 GB)
| # | Model | Acc | TPS | BPW |
|---|---|---|---|---|
| ByteShape | ||||
| CPU-1 | 2.69bpw | 0.9299 | 3.29 | 2.69 |
| CPU-2 | 2.89bpw | 0.9641 | 3.04 | 2.89 |
| CPU-3 | 3.40bpw | 0.9849 | 2.96 | 3.40 |
| CPU-4 | 3.51bpw | 0.9858 | 2.84 | 3.51 |
| Unsloth | ||||
| 1 | IQ2_XXS | 0.9191 | 2.80 | 2.46 |
| 2 | IQ2_M | 0.9468 | 2.74 | 2.63 |
| 3 | Q2_K_XL | 0.9501 | 2.73 | 2.80 |
| 4 | IQ3_XXS | 0.9671 | 2.58 | 3.02 |
| 5 | IQ3_S | 0.9654 | 2.53 | 3.13 |
| 6 | Q3_K_S | 0.9863 | 2.37 | 3.52 |
It's alive, barely, but still kicking. For Pi deployments, you may want to look at Qwen3-Coder instead, which is nearly 3x faster.
CPU-4 and CPU-3 are worth a closer look, landing near 3 TPS while keeping quality in a range that is still genuinely usable.
Benchmarking Methodology
Generating our models takes little time. What takes disproportionately longer is evaluating every reported model across the following seven benchmarks:
- BFCL_V3 for tool calling
- GSM8K_V for vision + math
- LiveCodeBench V6 and HumanEval for coding
- GSM8K for math
- IFEVAL for instruction following
- MMLU for general knowledge
The reported score is the mean across these benchmarks, with each benchmark normalized to the original model's score.
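Concretely, each benchmark score is divided by the unquantized baseline's score on that benchmark, and those ratios are averaged. A minimal sketch with hypothetical benchmark numbers (these are for illustration only, not our measured scores):

```python
def normalized_mean(quant_scores: dict, baseline_scores: dict) -> float:
    """Average of per-benchmark ratios (quantized / baseline).
    A value of 1.0 means no measurable quality loss versus the original model."""
    ratios = [quant_scores[b] / baseline_scores[b] for b in baseline_scores]
    return sum(ratios) / len(ratios)

# Hypothetical scores, for illustration only
baseline = {"GSM8K": 0.90, "IFEVAL": 0.80, "MMLU": 0.75}
quant = {"GSM8K": 0.88, "IFEVAL": 0.80, "MMLU": 0.72}
print(round(normalized_mean(quant, baseline), 4))  # → 0.9793
```

Normalizing per benchmark before averaging keeps an easy benchmark (high absolute scores) from drowning out regressions on a hard one; this is why the Acc column reads as a fraction of baseline quality rather than a raw score.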
We evaluated GSM8K_V in both instruct and thinking modes and treated them as separate entries in the average. In practice, we observe that the relative performance between modes remains consistent.
All evaluations were run with llama.cpp b8204.