Blackwell Picks Favorites:
Qwen 3.5 35B A3B

Published by ByteShape Team • 10 April 2026 • Last updated 12 April 20261. Added benchmarking for RTX 5060Ti
2. Added two more GPU models and updated GPU plots

Here's our next ByteShape release: Qwen 3.5 35B A3B. With the 9B model, behaviour across GPUs was fairly consistent while CPUs had differing preferences. For this 35B model, it is almost the reverse: CPUs are surprisingly consistent, while GPUs are much pickier about which quantized models run best on each card.

That is why we are presenting models a bit differently this time. In the GPU charts, we highlight the models that are the best match for each specific card and gray out the ones that are not really the right choice. There is no single GPU pick that works best everywhere, but once you know your hardware, the right options become fairly clear.

We also have a step-by-step tutorial on how to run our models locally with OpenCode.

⚠️

And… as always, … be careful with any tool you choose to grant access to.

TL;DR

On CPUs, the picture is clean. ByteShape models trace a very consistent speed versus quality frontier across the i7, Ryzen 9 5900X, Ultra 7, and even the Raspberry Pi 5, so the recommendations barely change from one system to another.

On GPUs, device-specific optimization matters a lot more.

RTX4090/ RTX 5090/ RTX 6000 Blackwell: GPU-7 is the clear choice offering all-around excellent balance of token and prompt processing speed and near baseline quality.
RTX 4080/ RTX5060Ti: GPU-5 is an excellent choice balancing quality and speed. If you need a smaller model to support longer context lengths, GPU-4 is also a great option.
CPUs: CPU-5 is our top recommendation for maximum quality. If you want faster prompt processing and token generation while still maintaining high output quality, CPU-4 is a strong alternative.

GPUs

The graphs for this release provide more nuanced guidance on model selection for two main reasons:

Unlike the usual pattern, where the same ByteShape models perform well across GPUs, Blackwell shows a much stronger preference for specific datatypes.
We typically recommend models that balance quality with both prompt processing and token generation speed. In this release, however, we add a bit more flavour with some models intentionally prioritizing output quality and token generation speed over prompt processing. Since prompt processing is usually about 20× faster than token generation, trading some prompt processing for better generation speed and higher quality may be a practical choice (these models are shown with a striped infill in the plots).

You will see three types of markers:

Solid markers: Strong all-around choices with our preferred balance of prompt processing speed, token generation speed, and quality. Recommended.
Greyed-out markers: Models included for comparison that perform better on other GPUs.
Striped infill markers: Recommended when token generation speed matters more than prompt processing speed.

ℹ️

Note to the interested reader:

Prompt processing — the phase where the model ingests and processes a large batch of input tokens at once, such as when you submit a long prompt or context.
Token generation — the phase where the model produces output one token at a time.

Examples:

Prompt:	"Write a 5000 word funny story"	→ Token generation vastly outweighs prompt processing.
Prompt:	"Read 'Les Misérables' and summarize the ending in two sentences."	→ Prompt processing will most likely dominate.

RTX 4090 (24 GB)

RTX 4090: tokens per second vs quality with model size as bubble size — RTX 4090: Tokens per second vs quality (bubble size = model footprint) Tap *Show Legend* below for model details.

RTX 4090: Tokens per second vs quality (bubble size = model footprint) Hover over the bubbles, or click Show Legend below, for model details.

Show Legend

#	Model	Acc	TPS	BPW
ByteShape
GPU-1	IQ2_S-2.17bpw	0.8826	183.02	2.17
GPU-2	Q3_K_S-2.73bpw	0.9387	180.49	2.73
GPU-3	Q3_K_S-2.89bpw	0.9641	176.35	2.89
GPU-4	IQ3_S-3.01bpw	0.9770	171.83	3.01
GPU-5	IQ3_S-3.26bpw	0.9840	162.86	3.26
GPU-6	Q3_K_S-3.40bpw	0.9849	168.98	3.40
GPU-7	IQ4_XS-4.06bpw	0.9969	164.77	4.06
GPU-8	IQ4_XS-4.12bpw	0.9981	156.44	4.12
Unsloth
A	UD-IQ2_XXS	0.9191	167.54	2.46
B	UD-IQ2_M	0.9468	167.34	2.63
C	UD-Q2_K_XL	0.9501	166.66	2.80
D	UD-IQ3_XXS	0.9671	157.59	3.02
E	UD-IQ3_S	0.9654	157.65	3.13
F	Q3_K_S	0.9863	148.82	3.52
G	Q3_K_M	0.9817	150.90	3.77
H	UD-Q3_K_XL	0.9843	150.95	3.83
I	UD-IQ4_XS	0.9814	150.24	4.03
J	UD-IQ4_NL	0.9870	151.58	4.11
K	UD-Q4_K_L	0.9956	159.96	4.66
L	Q4_K_S	0.9861	149.69	4.77
M	MXFP4_MOE	0.9823	147.13	4.98
N	Q4_K_M	0.9884	149.31	5.08
O	UD-Q4_K_XL	0.9875	146.56	5.13

ℹ️

Reminder:

Greyed-out models are shown for completeness. We do not recommend them for this GPU but include them for comparison to the other GPUs where those grayed-out models perform best.
Striped-infill models are recommended when token generation speed matters more to you than prompt processing.

The 4090 is actually one of the clearest charts here.

GPU-8 is the conservative, practically same-as-baseline quality pick. GPU-7, though, holds essentially baseline quality while adding a meaningful speed boost.

Below those, GPU-6 boosts token generation at the expense of prompt processing and maintains competitive quality. Beyond that, you have several clear choices that trade-off quality for further increases in speed.

So on the 4090:

GPU-7 is our recommendation with near-baseline quality and excellent all-around speed,
GPU-8 if you want the safest near-baseline choice,
GPU-4 if you need a further boost in speed.

RTX 4080 (16 GB)

RTX 4080: tokens per second vs quality with model size as bubble size — RTX 4080: Tokens per second vs quality (bubble size = model footprint) Tap *Show Legend* below for model details.

RTX 4080: Tokens per second vs quality (bubble size = model footprint) Hover over the bubbles, or click Show Legend below, for model details.

Show Legend

#	Model	Acc	TPS	BPW
ByteShape
GPU-1	IQ2_S-2.17bpw	0.8826	160.49	2.17
GPU-2	Q3_K_S-2.73bpw	0.9387	157.17	2.73
GPU-3	Q3_K_S-2.89bpw	0.9641	154.14	2.89
GPU-4	IQ3_S-3.01bpw	0.9770	149.03	3.01
GPU-5	IQ3_S-3.26bpw	0.9840	140.35	3.26
GPU-6	Q3_K_S-3.40bpw	0.9849	145.95	3.40
Unsloth
A	UD-IQ2_XXS	0.9191	145.23	2.46
B	UD-IQ2_M	0.9468	144.14	2.63
C	UD-Q2_K_XL	0.9501	143.66	2.80
D	UD-IQ3_XXS	0.9671	134.83	3.02
E	UD-IQ3_S	0.9654	134.84	3.13
F	Q3_K_S	0.9863	125.95	3.52

ℹ️

Reminder:

Striped-infill models are recommended when token generation speed matters more to you than prompt processing.

With less VRAM, there are fewer viable options and a more nuanced choice is in order.

GPU-5 is the clear recommendation: it delivers 140+ TPS and 98%+ of baseline quality. That is a very clean trade-off with a maximum of 16K context length.

For larger context, GPU-4 also delivers balanced improvements in speed and 97.7% quality with 32K context.

So for the 4080, the recommendation is straightforward as both models below offer near baseline quality:

GPU-5 for the highest output quality.
GPU-4 best overall balance of quality, size and speed.

RTX 5090 (32 GB)

RTX 5090: tokens per second vs quality with model size as bubble size — RTX 5090: Tokens per second vs quality (bubble size = model footprint) Tap *Show Legend* below for model details.

RTX 5090: Tokens per second vs quality (bubble size = model footprint) Hover over the bubbles, or click Show Legend below, for model details.

Show Legend

#	Model	Acc	TPS	BPW
ByteShape
GPU-1	IQ2_S-2.17bpw	0.8826	202.97	2.17
GPU-2	Q3_K_S-2.73bpw	0.9387	194.92	2.73
GPU-3	Q3_K_S-2.89bpw	0.9641	197.28	2.89
GPU-4	IQ3_S-3.01bpw	0.9770	193.94	3.01
GPU-5	IQ3_S-3.26bpw	0.9840	189.83	3.26
GPU-6	Q3_K_S-3.40bpw	0.9849	187.82	3.40
GPU-7	IQ4_XS-4.06bpw	0.9969	196.91	4.06
GPU-8	IQ4_XS-4.12bpw	0.9981	190.53	4.12
Unsloth
A	UD-IQ2_XXS	0.9191	193.02	2.46
B	UD-IQ2_M	0.9468	192.95	2.63
C	UD-Q2_K_XL	0.9501	192.78	2.80
D	UD-IQ3_XXS	0.9671	183.84	3.02
E	UD-IQ3_S	0.9654	181.40	3.13
F	Q3_K_S	0.9863	178.10	3.52
G	Q3_K_M	0.9817	181.96	3.77
H	UD-Q3_K_XL	0.9843	181.82	3.83
I	UD-IQ4_XS	0.9814	182.02	4.03
J	UD-IQ4_NL	0.9870	180.71	4.11
K	UD-Q4_K_L	0.9956	190.59	4.66
L	Q4_K_S	0.9861	180.82	4.77
M	MXFP4_MOE	0.9823	180.43	4.98
N	Q4_K_M	0.9884	182.30	5.08
O	UD-Q4_K_XL	0.9875	179.57	5.13
P	Q5_K_S	0.9862	178.65	5.73
Q	Q5_K_M	0.9875	177.91	6.06
R	UD-Q5_K_XL	0.9877	177.69	6.09
S	UD-Q6_K_S	0.9922	181.63	6.58
T	Q6_K	0.9965	179.47	6.66
U	UD-Q6_K_XL	0.9967	173.75	7.40

ℹ️

Reminder:

Greyed-out models are shown for completeness. We do not recommend them for this GPU but include them for comparison to the other GPUs where those grayed-out models perform best.
Striped-infill models are recommended when token generation speed matters more to you than prompt processing.

This is where Blackwell's preferences really start to stand out. On an RTX 5090 with 32 GB of VRAM, GPU-7 is a clear winner delivering the best overall experience, combining top-tier quality with the fastest speed.

RTX 5060Ti (16 GB)

RTX 5060Ti: tokens per second vs quality with model size as bubble size — RTX 5060Ti: Tokens per second vs quality (bubble size = model footprint) Tap *Show Legend* below for model details.

RTX 5060Ti: Tokens per second vs quality (bubble size = model footprint) Hover over the bubbles, or click Show Legend below, for model details.

Show Legend

#	Model	Acc	TPS	BPW
ByteShape
GPU-1	IQ2_S-2.17bpw	0.8826	103.25	2.17
GPU-2	Q3_K_S-2.73bpw	0.9387	96.60	2.73
GPU-3	Q3_K_S-2.89bpw	0.9641	100.42	2.89
GPU-4	IQ3_S-3.01bpw	0.9770	98.18	3.01
GPU-5	IQ3_S-3.26bpw	0.9840	92.40	3.26
GPU-6	Q3_K_S-3.40bpw	0.9849	89.64	3.40
Unsloth
A	UD-IQ2_XXS	0.9191	95.79	2.46
B	UD-IQ2_M	0.9468	95.61	2.63
C	UD-Q2_K_XL	0.9501	95.41	2.80
D	UD-IQ3_XXS	0.9671	88.72	3.02
E	UD-IQ3_S	0.9654	87.29	3.13
F	Q3_K_S	0.9863	80.94	3.52

ℹ️

Reminder:

Greyed-out models are shown for completeness. We do not recommend them for this GPU but include them for comparison to the other GPUs where those grayed-out models perform best.
Striped-infill models are recommended when token generation speed matters more to you than prompt processing.

With 16GB there are fewer choices. GPU-5 and GPU-4 offer well balanced speed and quality trade-offs with GPU-5 emphasizing quality whereas GPU-4 delivering a ~5% boost in speed with less than ~1% reduction in quality. GPU-6 fits but it's slower than GPU-5 in token generation and prompt processing speed and its quality is not meaningfully higher. Not recommended here.

Our recommendation:

GPU-5 balanced performance and quality trade-off.

RTX Pro 6000 Blackwell Workstation (96 GB)

RTX Pro 6000 Blackwell workstation: tokens per second vs quality with model size as bubble size — RTX Pro 6000 Blackwell workstation: Tokens per second vs quality (bubble size = model footprint) Tap *Show Legend* below for model details.

RTX Pro 6000 Blackwell workstation: Tokens per second vs quality (bubble size = model footprint) Hover over the bubbles, or click Show Legend below, for model details.

Show Legend

#	Model	Acc	TPS	BPW
ByteShape
GPU-1	IQ2_S-2.17bpw	0.8826	202.77	2.17
GPU-2	Q3_K_S-2.73bpw	0.9387	193.76	2.73
GPU-3	Q3_K_S-2.89bpw	0.9641	196.72	2.89
GPU-4	IQ3_S-3.01bpw	0.9770	195.41	3.01
GPU-5	IQ3_S-3.26bpw	0.9840	188.44	3.26
GPU-6	Q3_K_S-3.40bpw	0.9849	187.31	3.40
GPU-7	IQ4_XS-4.06bpw	0.9969	194.42	4.06
GPU-8	IQ4_XS-4.12bpw	0.9981	186.67	4.12
Unsloth
A	UD-IQ2_XXS	0.9191	191.71	2.46
B	UD-IQ2_M	0.9468	192.25	2.63
C	UD-Q2_K_XL	0.9501	191.27	2.80
D	UD-IQ3_XXS	0.9671	183.36	3.02
E	UD-IQ3_S	0.9654	181.24	3.13
F	Q3_K_S	0.9863	177.25	3.52
G	Q3_K_M	0.9817	181.00	3.77
H	UD-Q3_K_XL	0.9843	179.91	3.83
I	UD-IQ4_XS	0.9814	180.14	4.03
J	UD-IQ4_NL	0.9870	180.16	4.11
K	UD-Q4_K_L	0.9956	188.66	4.66
L	Q4_K_S	0.9861	178.55	4.77
M	MXFP4_MOE	0.9823	177.43	4.98
N	Q4_K_M	0.9884	178.87	5.08
O	UD-Q4_K_XL	0.9875	176.18	5.13
P	Q5_K_S	0.9862	176.10	5.73
Q	Q5_K_M	0.9875	175.33	6.06
R	UD-Q5_K_XL	0.9877	174.72	6.09
S	UD-Q6_K_S	0.9922	179.47	6.58
T	Q6_K	0.9965	177.41	6.66
U	UD-Q6_K_XL	0.9967	171.02	7.40
V	Q8_0	0.9966	170.09	8.52

ℹ️

Reminder:

Greyed-out models are shown for completeness. We do not recommend them for this GPU but include them for comparison to the other GPUs where those grayed-out models perform best.
Striped-infill models are recommended when token generation speed matters more to you than prompt processing.

Very similar story to the 5090.

So, for the RTX6000 Pro Blackwell (workstation):

GPU-7 is excellent speed and quality.

CPUs

Picking the right quant for CPU proves more straightforward.

Intel Core i7 12700KF

Intel Core i7 12700KF: Tokens per second vs quality (bubble size = model footprint) Hover over the bubbles, or click Show Legend below, for model details.

Show Legend

#	Model	Acc	TPS	BPW
ByteShape
CPU-1	Q3_K_S-2.69bpw	0.9299	11.83	2.69
CPU-2	Q3_K_S-2.89bpw	0.9641	10.67	2.89
CPU-3	Q3_K_S-3.40bpw	0.9849	10.18	3.40
CPU-4	Q4_K_S-3.51bpw	0.9858	9.63	3.51
CPU-5	IQ4_XS-4.06bpw	0.9969	9.02	4.06
CPU-6	IQ4_XS-4.12bpw	0.9981	8.23	4.12
Unsloth
A	UD-IQ2_XXS	0.9191	10.08	2.46
B	UD-IQ2_M	0.9468	9.92	2.63
C	UD-Q2_K_XL	0.9501	9.88	2.80
D	UD-IQ3_XXS	0.9671	9.04	3.02
E	UD-IQ3_S	0.9654	8.91	3.13
F	Q3_K_S	0.9863	8.04	3.52
G	Q3_K_M	0.9817	8.05	3.77
H	UD-Q3_K_XL	0.9843	8.05	3.83
I	UD-IQ4_XS	0.9814	7.95	4.03
J	UD-IQ4_NL	0.9870	7.91	4.11
K	UD-Q4_K_L	0.9956	8.39	4.66
L	Q4_K_S	0.9861	7.84	4.77
M	MXFP4_MOE	0.9823	7.52	4.98
N	Q4_K_M	0.9884	7.76	5.08
O	UD-Q4_K_XL	0.9875	7.49	5.13
P	Q5_K_S	0.9862	7.39	5.73
Q	Q5_K_M	0.9875	7.26	6.06
R	UD-Q5_K_XL	0.9877	7.22	6.09
S	UD-Q6_K_S	0.9922	7.92	6.58
T	Q6_K	0.9965	7.53	6.66
U	UD-Q6_K_XL	0.9967	6.57	7.40
V	Q8_0	0.9966	6.74	8.52

CPU-6 is the near-baseline option. CPU-5 is probably the default pick for most people: still very close to baseline quality, but with a bit more speed. CPU-4 is the more aggressive/balanced option, and after that we get into more noticeable quality vs. speed trade-offs.

So, for the i7:

CPU-5 is the default recommendation,
CPU-6 if you want the highest quality,
CPU-4 if you want more speed without getting too adventurous.

Ryzen 9 5900X

Ryzen 9 5900X: Tokens per second vs quality (bubble size = model footprint) Hover over the bubbles, or click Show Legend below, for model details.

Show Legend

#	Model	Acc	TPS	BPW
ByteShape
CPU-1	Q3_K_S-2.69bpw	0.9299	10.87	2.69
CPU-2	Q3_K_S-2.89bpw	0.9641	10.19	2.89
CPU-3	Q3_K_S-3.40bpw	0.9849	9.94	3.40
CPU-4	Q4_K_S-3.51bpw	0.9858	9.51	3.51
CPU-5	IQ4_XS-4.06bpw	0.9969	9.07	4.06
CPU-6	IQ4_XS-4.12bpw	0.9981	8.57	4.12
Unsloth
A	UD-IQ2_XXS	0.9191	9.74	2.46
B	UD-IQ2_M	0.9468	9.66	2.63
C	UD-Q2_K_XL	0.9501	9.60	2.80
D	UD-IQ3_XXS	0.9671	9.06	3.02
E	UD-IQ3_S	0.9654	8.99	3.13
F	Q3_K_S	0.9863	8.30	3.52
G	Q3_K_M	0.9817	8.36	3.77
H	UD-Q3_K_XL	0.9843	8.42	3.83
I	UD-IQ4_XS	0.9814	8.27	4.03
J	UD-IQ4_NL	0.9870	8.25	4.11
K	UD-Q4_K_L	0.9956	8.68	4.66
L	Q4_K_S	0.9861	8.36	4.77
M	MXFP4_MOE	0.9823	8.00	4.98
N	Q4_K_M	0.9884	8.18	5.08
O	UD-Q4_K_XL	0.9875	7.97	5.13
P	Q5_K_S	0.9862	7.83	5.73
Q	Q5_K_M	0.9875	7.79	6.06
R	UD-Q5_K_XL	0.9877	7.80	6.09
S	UD-Q6_K_S	0.9922	8.33	6.58
T	Q6_K	0.9965	8.01	6.66
U	UD-Q6_K_XL	0.9967	7.27	7.40
V	Q8_0	0.9966	7.42	8.52

Almost the same story as the i7, which is refreshing.

The frontier is clean, the recommendations are clean, and the overall trade-off pattern looks very familiar. Again, CPU-5 remains a great default choice, CPU-6 is the quality-first choice, and CPU-4 is the more speed-oriented alternative.

Ultra 7 265KF

Ultra 7 265KF: Tokens per second vs quality (bubble size = model footprint) Hover over the bubbles, or click Show Legend below, for model details.

Show Legend

#	Model	Acc	TPS	BPW
ByteShape
CPU-1	Q3_K_S-2.69bpw	0.9299	14.00	2.69
CPU-2	Q3_K_S-2.89bpw	0.9641	13.63	2.89
CPU-3	Q3_K_S-3.40bpw	0.9849	12.90	3.40
CPU-4	Q4_K_S-3.51bpw	0.9858	12.68	3.51
CPU-5	IQ4_XS-4.06bpw	0.9969	12.27	4.06
CPU-6	IQ4_XS-4.12bpw	0.9981	11.79	4.12
Unsloth
A	UD-IQ2_XXS	0.9191	12.71	2.46
B	UD-IQ2_M	0.9468	12.73	2.63
C	UD-Q2_K_XL	0.9501	12.12	2.80
D	UD-IQ3_XXS	0.9671	12.00	3.02
E	UD-IQ3_S	0.9654	11.42	3.13
F	Q3_K_S	0.9863	10.60	3.52
G	Q3_K_M	0.9817	10.29	3.77
H	UD-Q3_K_XL	0.9843	11.00	3.83
I	UD-IQ4_XS	0.9814	10.52	4.03
J	UD-IQ4_NL	0.9870	10.40	4.11
K	UD-Q4_K_L	0.9956	11.41	4.66
L	Q4_K_S	0.9861	10.58	4.77
M	MXFP4_MOE	0.9823	10.37	4.98
N	Q4_K_M	0.9884	10.52	5.08
O	UD-Q4_K_XL	0.9875	10.37	5.13
P	Q5_K_S	0.9862	10.39	5.73
Q	Q5_K_M	0.9875	10.26	6.06
R	UD-Q5_K_XL	0.9877	10.55	6.09
S	UD-Q6_K_S	0.9922	11.16	6.58
T	Q6_K	0.9965	10.87	6.66
U	UD-Q6_K_XL	0.9967	9.18	7.40
V	Q8_0	0.9966	9.74	8.52

The Ultra 7 is just good news all around.

Here again, the same ByteShape models sit on the frontier, but the absolute speeds are better. Sounds familiar? CPU-5 is a great default, CPU-6 is near-baseline pick, and CPU-4 or CPU-3 are there if you want to push harder on speed.

Raspberry Pi 5 (16 GB)

Raspberry Pi 5: tokens per second vs quality with model size as bubble size — Raspberry Pi 5: Tokens per second vs quality (bubble size = model footprint) Tap *Show Legend* below for model details.

Raspberry Pi 5: Tokens per second vs quality (bubble size = model footprint) Hover over the bubbles, or click Show Legend below, for model details.

Show Legend

#	Model	Acc	TPS	BPW
ByteShape
CPU-1	Q3_K_S-2.69bpw	0.9299	3.29	2.69
CPU-2	Q3_K_S-2.89bpw	0.9641	3.04	2.89
CPU-3	Q3_K_S-3.40bpw	0.9849	2.96	3.40
CPU-4	Q4_K_S-3.51bpw	0.9858	2.84	3.51
Unsloth
A	UD-IQ2_XXS	0.9191	2.80	2.46
B	UD-IQ2_M	0.9468	2.74	2.63
C	UD-Q2_K_XL	0.9501	2.73	2.80
D	UD-IQ3_XXS	0.9671	2.58	3.02
E	UD-IQ3_S	0.9654	2.53	3.13
F	Q3_K_S	0.9863	2.37	3.52

It's alive, barely, but still kicking. For Pi deployments, maybe take a look at Qwen3-Coder which is nearly 3x faster.

CPU-4 and CPU-3 are worth a closer look, landing around the 3 TPS range while keeping quality in a range that is still genuinely usable.

Benchmarking Methodology

Generating our models takes little time. What takes disproportionately longer is evaluating all reported models across the following seven benchmarks:

BFCL_V3 for tool calling
GSM8K_V for vision + math
LiveCodeBench V6 and HumanEval for coding
GSM8K for math
IFEVAL for instruction following
MMLU for general knowledge

The reported score is the mean across these benchmarks, with each benchmark normalized to the original model's score.

We evaluated GSM8K_V in both instruct and thinking modes and treated them as separate entries in the average. In practice, we observe that the relative performance between modes remains consistent.

All evaluations were run with llama.cpp b8204.