Blackwell Picks Favorites:
Qwen 3.5 35B A3B

Published by ByteShape Team • 10 April 2026 • Last updated 12 April 2026:

  1. Added benchmarking for RTX 5060Ti
  2. Added two more GPU models and updated GPU plots

Here's our next ByteShape release: Qwen 3.5 35B A3B. With the 9B model, behaviour across GPUs was fairly consistent while CPUs had differing preferences. For this 35B model, it is almost the reverse: CPUs are surprisingly consistent, while GPUs are much pickier about which quantized models run best on each card.

That is why we are presenting models a bit differently this time. In the GPU charts, we highlight the models that are the best match for each specific card and grey out the ones that are not the right choice for it. There is no single GPU pick that works best everywhere, but once you know your hardware, the right options become fairly clear.

We also have a step-by-step tutorial on how to run our models locally with OpenCode.

⚠️ And, as always, be careful with any tool you choose to grant access to.

TL;DR

On CPUs, the picture is clean. ByteShape models trace a very consistent speed versus quality frontier across the i7, Ryzen 9 5900X, Ultra 7, and even the Raspberry Pi 5, so the recommendations barely change from one system to another.

On GPUs, device-specific optimization matters a lot more.

  • RTX 4090 / RTX 5090 / RTX Pro 6000 Blackwell: GPU-7 is the clear choice, offering an excellent all-around balance of token generation speed, prompt processing speed, and near-baseline quality.
  • RTX 4080 / RTX 5060Ti: GPU-5 is an excellent choice that balances quality and speed. If you need a smaller model to support longer context lengths, GPU-4 is also a great option.
  • CPUs: CPU-5 is our top recommendation for near-baseline quality. If you want faster prompt processing and token generation while still maintaining high output quality, CPU-4 is a strong alternative.

GPUs

The graphs for this release provide more nuanced guidance on model selection for two main reasons:

  1. Unlike the usual pattern, where the same ByteShape models perform well across GPUs, Blackwell shows a much stronger preference for specific datatypes.
  2. We typically recommend models that balance quality with both prompt processing and token generation speed. In this release, however, we add a bit more flavour with some models intentionally prioritizing output quality and token generation speed over prompt processing. Since prompt processing is usually about 20× faster than token generation, trading some prompt processing for better generation speed and higher quality may be a practical choice (these models are shown with a striped infill in the plots).

You will see three types of markers:

  1. Solid markers: Strong all-around choices with our preferred balance of prompt processing speed, token generation speed, and quality. Recommended.
  2. Greyed-out markers: Models included for comparison that perform better on other GPUs.
  3. Striped infill markers: Recommended when token generation speed matters more than prompt processing speed.
ℹ️

Note to the interested reader:

  • Prompt processing — the phase where the model ingests and processes a large batch of input tokens at once, such as when you submit a long prompt or context.
  • Token generation — the phase where the model produces output one token at a time.

Examples:

Prompt: "Write a 5000 word funny story" → Token generation vastly outweighs prompt processing.
Prompt: "Read 'Les Misérables' and summarize the ending in two sentences." → Prompt processing will most likely dominate.
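To make the two examples above concrete, here is a minimal sketch of how request time splits between the two phases. The speeds and token counts are hypothetical placeholders (only the roughly 20× ratio comes from the text), and `phase_times` is an illustrative helper, not part of any tool.

```python
def phase_times(prompt_tokens, output_tokens, pp_tps, tg_tps):
    """Return (prompt_seconds, generation_seconds) for one request."""
    return prompt_tokens / pp_tps, output_tokens / tg_tps


# Hypothetical speeds: generation at 150 tokens/s, prompt processing 20x faster.
TG_TPS = 150.0
PP_TPS = 20 * TG_TPS

# Story-style request: short prompt, long output (token counts are made up).
story = phase_times(prompt_tokens=20, output_tokens=7_000, pp_tps=PP_TPS, tg_tps=TG_TPS)

# Summary-style request: very long prompt, short output (token counts are made up).
summary = phase_times(prompt_tokens=100_000, output_tokens=60, pp_tps=PP_TPS, tg_tps=TG_TPS)

for label, (pp_s, tg_s) in [("story", story), ("summary", summary)]:
    print(f"{label}: prompt {pp_s:.2f} s, generation {tg_s:.2f} s")
```

With these placeholder numbers, generation dominates the story request and prompt processing dominates the summary request, which is the pattern the examples describe.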

RTX 4090 (24 GB)

RTX 4090: Tokens per second vs quality (bubble size = model footprint)

| # | Model | Acc | TPS | BPW |
|---|---|---|---|---|
| GPU-1 | IQ2_S-2.17bpw | 0.8826 | 183.02 | 2.17 |
| GPU-2 | Q3_K_S-2.73bpw | 0.9387 | 180.49 | 2.73 |
| GPU-3 | Q3_K_S-2.89bpw | 0.9641 | 176.35 | 2.89 |
| GPU-4 | IQ3_S-3.01bpw | 0.9770 | 171.83 | 3.01 |
| GPU-5 | IQ3_S-3.26bpw | 0.9840 | 162.86 | 3.26 |
| GPU-6 | Q3_K_S-3.40bpw | 0.9849 | 168.98 | 3.40 |
| GPU-7 | IQ4_XS-4.06bpw | 0.9969 | 164.77 | 4.06 |
| GPU-8 | IQ4_XS-4.12bpw | 0.9981 | 156.44 | 4.12 |
| A | UD-IQ2_XXS | 0.9191 | 167.54 | 2.46 |
| B | UD-IQ2_M | 0.9468 | 167.34 | 2.63 |
| C | UD-Q2_K_XL | 0.9501 | 166.66 | 2.80 |
| D | UD-IQ3_XXS | 0.9671 | 157.59 | 3.02 |
| E | UD-IQ3_S | 0.9654 | 157.65 | 3.13 |
| F | Q3_K_S | 0.9863 | 148.82 | 3.52 |
| G | Q3_K_M | 0.9817 | 150.90 | 3.77 |
| H | UD-Q3_K_XL | 0.9843 | 150.95 | 3.83 |
| I | UD-IQ4_XS | 0.9814 | 150.24 | 4.03 |
| J | UD-IQ4_NL | 0.9870 | 151.58 | 4.11 |
| K | UD-Q4_K_L | 0.9956 | 159.96 | 4.66 |
| L | Q4_K_S | 0.9861 | 149.69 | 4.77 |
| M | MXFP4_MOE | 0.9823 | 147.13 | 4.98 |
| N | Q4_K_M | 0.9884 | 149.31 | 5.08 |
| O | UD-Q4_K_XL | 0.9875 | 146.56 | 5.13 |

ℹ️

Reminder:

  • Greyed-out models are shown for completeness. We do not recommend them for this GPU but include them for comparison with the GPUs where they perform best.
  • Striped-infill models are recommended when token generation speed matters more to you than prompt processing.

The 4090 is actually one of the clearest charts here.

GPU-8 is the conservative pick, with quality practically the same as baseline. GPU-7, though, holds essentially baseline quality while adding a meaningful speed boost.

Below those, GPU-6 boosts token generation at the expense of prompt processing while maintaining competitive quality. Beyond that, you have several clear choices that trade off quality for further increases in speed.

So on the 4090:

  • GPU-7 is our recommendation with near-baseline quality and excellent all-around speed,
  • GPU-8 if you want the safest near-baseline choice,
  • GPU-4 if you need a further boost in speed.

RTX 4080 (16 GB)

RTX 4080: Tokens per second vs quality (bubble size = model footprint)

| # | Model | Acc | TPS | BPW |
|---|---|---|---|---|
| GPU-1 | IQ2_S-2.17bpw | 0.8826 | 160.49 | 2.17 |
| GPU-2 | Q3_K_S-2.73bpw | 0.9387 | 157.17 | 2.73 |
| GPU-3 | Q3_K_S-2.89bpw | 0.9641 | 154.14 | 2.89 |
| GPU-4 | IQ3_S-3.01bpw | 0.9770 | 149.03 | 3.01 |
| GPU-5 | IQ3_S-3.26bpw | 0.9840 | 140.35 | 3.26 |
| GPU-6 | Q3_K_S-3.40bpw | 0.9849 | 145.95 | 3.40 |
| A | UD-IQ2_XXS | 0.9191 | 145.23 | 2.46 |
| B | UD-IQ2_M | 0.9468 | 144.14 | 2.63 |
| C | UD-Q2_K_XL | 0.9501 | 143.66 | 2.80 |
| D | UD-IQ3_XXS | 0.9671 | 134.83 | 3.02 |
| E | UD-IQ3_S | 0.9654 | 134.84 | 3.13 |
| F | Q3_K_S | 0.9863 | 125.95 | 3.52 |

ℹ️

Reminder:

  • Striped-infill models are recommended when token generation speed matters more to you than prompt processing.

With less VRAM, there are fewer viable options and a more nuanced choice is in order.

GPU-5 is the clear recommendation: it delivers 140+ TPS and 98%+ of baseline quality, with a maximum context length of 16K.

For longer contexts, GPU-4 is faster still and retains 97.7% of baseline quality with a 32K context.

So for the 4080, the recommendation is straightforward, as both models below offer near-baseline quality:

  • GPU-5 for the highest output quality.
  • GPU-4 for the best overall balance of quality, size, and speed.
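As a rough illustration of why a lower-BPW model leaves more room for context on a 16 GB card, the sketch below estimates the weight footprint from the BPW column. It assumes the "35B" in the model name means 35 billion weights in total and ignores runtime buffers and the GB/GiB distinction, so treat the results as ballpark figures rather than measurements.

```python
TOTAL_PARAMS = 35e9  # assumption: "35B" taken at face value
VRAM_GB = 16.0       # nominal card capacity; real usable memory is somewhat lower


def weight_footprint_gb(bpw, total_params=TOTAL_PARAMS):
    """Approximate size of the quantized weights in GB (10^9 bytes)."""
    return total_params * bpw / 8 / 1e9


# BPW values copied from the legend for the three largest ByteShape GPU models
# that appear in the RTX 4080 chart.
for name, bpw in [("GPU-4", 3.01), ("GPU-5", 3.26), ("GPU-6", 3.40)]:
    weights = weight_footprint_gb(bpw)
    headroom = VRAM_GB - weights
    print(f"{name}: ~{weights:.1f} GB of weights, ~{headroom:.1f} GB left for KV cache and buffers")
```

Under these assumptions, GPU-4 leaves about 1 GB more headroom than GPU-5, which is consistent with it supporting a longer context.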

RTX 5090 (32 GB)

RTX 5090: Tokens per second vs quality (bubble size = model footprint)

| # | Model | Acc | TPS | BPW |
|---|---|---|---|---|
| GPU-1 | IQ2_S-2.17bpw | 0.8826 | 202.97 | 2.17 |
| GPU-2 | Q3_K_S-2.73bpw | 0.9387 | 194.92 | 2.73 |
| GPU-3 | Q3_K_S-2.89bpw | 0.9641 | 197.28 | 2.89 |
| GPU-4 | IQ3_S-3.01bpw | 0.9770 | 193.94 | 3.01 |
| GPU-5 | IQ3_S-3.26bpw | 0.9840 | 189.83 | 3.26 |
| GPU-6 | Q3_K_S-3.40bpw | 0.9849 | 187.82 | 3.40 |
| GPU-7 | IQ4_XS-4.06bpw | 0.9969 | 196.91 | 4.06 |
| GPU-8 | IQ4_XS-4.12bpw | 0.9981 | 190.53 | 4.12 |
| A | UD-IQ2_XXS | 0.9191 | 193.02 | 2.46 |
| B | UD-IQ2_M | 0.9468 | 192.95 | 2.63 |
| C | UD-Q2_K_XL | 0.9501 | 192.78 | 2.80 |
| D | UD-IQ3_XXS | 0.9671 | 183.84 | 3.02 |
| E | UD-IQ3_S | 0.9654 | 181.40 | 3.13 |
| F | Q3_K_S | 0.9863 | 178.10 | 3.52 |
| G | Q3_K_M | 0.9817 | 181.96 | 3.77 |
| H | UD-Q3_K_XL | 0.9843 | 181.82 | 3.83 |
| I | UD-IQ4_XS | 0.9814 | 182.02 | 4.03 |
| J | UD-IQ4_NL | 0.9870 | 180.71 | 4.11 |
| K | UD-Q4_K_L | 0.9956 | 190.59 | 4.66 |
| L | Q4_K_S | 0.9861 | 180.82 | 4.77 |
| M | MXFP4_MOE | 0.9823 | 180.43 | 4.98 |
| N | Q4_K_M | 0.9884 | 182.30 | 5.08 |
| O | UD-Q4_K_XL | 0.9875 | 179.57 | 5.13 |
| P | Q5_K_S | 0.9862 | 178.65 | 5.73 |
| Q | Q5_K_M | 0.9875 | 177.91 | 6.06 |
| R | UD-Q5_K_XL | 0.9877 | 177.69 | 6.09 |
| S | UD-Q6_K_S | 0.9922 | 181.63 | 6.58 |
| T | Q6_K | 0.9965 | 179.47 | 6.66 |
| U | UD-Q6_K_XL | 0.9967 | 173.75 | 7.40 |

ℹ️

Reminder:

  • Greyed-out models are shown for completeness. We do not recommend them for this GPU but include them for comparison with the GPUs where they perform best.
  • Striped-infill models are recommended when token generation speed matters more to you than prompt processing.

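This is where Blackwell's preferences really start to stand out. On an RTX 5090 with 32 GB of VRAM, GPU-7 is the clear winner and delivers the best overall experience: it combines top-tier quality (0.9969 of baseline) with the fastest token generation of any near-baseline model (196.91 TPS). Only the much lower-quality GPU-1 and GPU-3 generate tokens faster.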
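One way to read the RTX 5090 legend is "fastest model above a quality floor". The sketch below applies that rule to a subset of the legend rows. The 0.99 floor is a hypothetical threshold chosen for illustration, not a ByteShape criterion, and the rule looks at token generation only, since the legend does not list prompt processing speed.

```python
# (label, normalized accuracy, tokens/s) copied from the RTX 5090 legend (subset).
RTX_5090 = [
    ("GPU-1", 0.8826, 202.97),
    ("GPU-3", 0.9641, 197.28),
    ("GPU-4", 0.9770, 193.94),
    ("GPU-7", 0.9969, 196.91),
    ("GPU-8", 0.9981, 190.53),
    ("K", 0.9956, 190.59),
    ("T", 0.9965, 179.47),
    ("U", 0.9967, 173.75),
]


def fastest_above(rows, min_acc):
    """Return the row with the highest TPS among rows meeting the quality floor."""
    eligible = [row for row in rows if row[1] >= min_acc]
    return max(eligible, key=lambda row: row[2]) if eligible else None


pick = fastest_above(RTX_5090, min_acc=0.99)  # 0.99 is an illustrative floor
print(pick)  # ('GPU-7', 0.9969, 196.91)
```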

RTX 5060Ti (16 GB)

RTX 5060Ti: Tokens per second vs quality (bubble size = model footprint)

| # | Model | Acc | TPS | BPW |
|---|---|---|---|---|
| GPU-1 | IQ2_S-2.17bpw | 0.8826 | 103.25 | 2.17 |
| GPU-2 | Q3_K_S-2.73bpw | 0.9387 | 96.60 | 2.73 |
| GPU-3 | Q3_K_S-2.89bpw | 0.9641 | 100.42 | 2.89 |
| GPU-4 | IQ3_S-3.01bpw | 0.9770 | 98.18 | 3.01 |
| GPU-5 | IQ3_S-3.26bpw | 0.9840 | 92.40 | 3.26 |
| GPU-6 | Q3_K_S-3.40bpw | 0.9849 | 89.64 | 3.40 |
| A | UD-IQ2_XXS | 0.9191 | 95.79 | 2.46 |
| B | UD-IQ2_M | 0.9468 | 95.61 | 2.63 |
| C | UD-Q2_K_XL | 0.9501 | 95.41 | 2.80 |
| D | UD-IQ3_XXS | 0.9671 | 88.72 | 3.02 |
| E | UD-IQ3_S | 0.9654 | 87.29 | 3.13 |
| F | Q3_K_S | 0.9863 | 80.94 | 3.52 |

ℹ️

Reminder:

  • Greyed-out models are shown for completeness. We do not recommend them for this GPU but include them for comparison with the GPUs where they perform best.
  • Striped-infill models are recommended when token generation speed matters more to you than prompt processing.

With 16 GB there are fewer choices. GPU-5 and GPU-4 offer well-balanced speed and quality trade-offs: GPU-5 emphasizes quality, whereas GPU-4 delivers a ~5% boost in speed with less than a 1% reduction in quality. GPU-6 fits, but it is slower than GPU-5 in both token generation and prompt processing, and its quality is not meaningfully higher, so we do not recommend it here.

Our recommendation:

  • GPU-5 for a balanced performance and quality trade-off.

RTX Pro 6000 Blackwell Workstation (96 GB)

RTX Pro 6000 Blackwell workstation: Tokens per second vs quality (bubble size = model footprint)

| # | Model | Acc | TPS | BPW |
|---|---|---|---|---|
| GPU-1 | IQ2_S-2.17bpw | 0.8826 | 202.77 | 2.17 |
| GPU-2 | Q3_K_S-2.73bpw | 0.9387 | 193.76 | 2.73 |
| GPU-3 | Q3_K_S-2.89bpw | 0.9641 | 196.72 | 2.89 |
| GPU-4 | IQ3_S-3.01bpw | 0.9770 | 195.41 | 3.01 |
| GPU-5 | IQ3_S-3.26bpw | 0.9840 | 188.44 | 3.26 |
| GPU-6 | Q3_K_S-3.40bpw | 0.9849 | 187.31 | 3.40 |
| GPU-7 | IQ4_XS-4.06bpw | 0.9969 | 194.42 | 4.06 |
| GPU-8 | IQ4_XS-4.12bpw | 0.9981 | 186.67 | 4.12 |
| A | UD-IQ2_XXS | 0.9191 | 191.71 | 2.46 |
| B | UD-IQ2_M | 0.9468 | 192.25 | 2.63 |
| C | UD-Q2_K_XL | 0.9501 | 191.27 | 2.80 |
| D | UD-IQ3_XXS | 0.9671 | 183.36 | 3.02 |
| E | UD-IQ3_S | 0.9654 | 181.24 | 3.13 |
| F | Q3_K_S | 0.9863 | 177.25 | 3.52 |
| G | Q3_K_M | 0.9817 | 181.00 | 3.77 |
| H | UD-Q3_K_XL | 0.9843 | 179.91 | 3.83 |
| I | UD-IQ4_XS | 0.9814 | 180.14 | 4.03 |
| J | UD-IQ4_NL | 0.9870 | 180.16 | 4.11 |
| K | UD-Q4_K_L | 0.9956 | 188.66 | 4.66 |
| L | Q4_K_S | 0.9861 | 178.55 | 4.77 |
| M | MXFP4_MOE | 0.9823 | 177.43 | 4.98 |
| N | Q4_K_M | 0.9884 | 178.87 | 5.08 |
| O | UD-Q4_K_XL | 0.9875 | 176.18 | 5.13 |
| P | Q5_K_S | 0.9862 | 176.10 | 5.73 |
| Q | Q5_K_M | 0.9875 | 175.33 | 6.06 |
| R | UD-Q5_K_XL | 0.9877 | 174.72 | 6.09 |
| S | UD-Q6_K_S | 0.9922 | 179.47 | 6.58 |
| T | Q6_K | 0.9965 | 177.41 | 6.66 |
| U | UD-Q6_K_XL | 0.9967 | 171.02 | 7.40 |
| V | Q8_0 | 0.9966 | 170.09 | 8.52 |

ℹ️

Reminder:

  • Greyed-out models are shown for completeness. We do not recommend them for this GPU but include them for comparison with the GPUs where they perform best.
  • Striped-infill models are recommended when token generation speed matters more to you than prompt processing.

Very similar story to the 5090.

So, for the RTX Pro 6000 Blackwell (Workstation):

  • GPU-7 offers excellent speed and quality.

CPUs

Picking the right quant for CPU proves more straightforward.

Intel Core i7 12700KF

Intel Core i7 12700KF: Tokens per second vs quality (bubble size = model footprint)

| # | Model | Acc | TPS | BPW |
|---|---|---|---|---|
| CPU-1 | Q3_K_S-2.69bpw | 0.9299 | 11.83 | 2.69 |
| CPU-2 | Q3_K_S-2.89bpw | 0.9641 | 10.67 | 2.89 |
| CPU-3 | Q3_K_S-3.40bpw | 0.9849 | 10.18 | 3.40 |
| CPU-4 | Q4_K_S-3.51bpw | 0.9858 | 9.63 | 3.51 |
| CPU-5 | IQ4_XS-4.06bpw | 0.9969 | 9.02 | 4.06 |
| CPU-6 | IQ4_XS-4.12bpw | 0.9981 | 8.23 | 4.12 |
| A | UD-IQ2_XXS | 0.9191 | 10.08 | 2.46 |
| B | UD-IQ2_M | 0.9468 | 9.92 | 2.63 |
| C | UD-Q2_K_XL | 0.9501 | 9.88 | 2.80 |
| D | UD-IQ3_XXS | 0.9671 | 9.04 | 3.02 |
| E | UD-IQ3_S | 0.9654 | 8.91 | 3.13 |
| F | Q3_K_S | 0.9863 | 8.04 | 3.52 |
| G | Q3_K_M | 0.9817 | 8.05 | 3.77 |
| H | UD-Q3_K_XL | 0.9843 | 8.05 | 3.83 |
| I | UD-IQ4_XS | 0.9814 | 7.95 | 4.03 |
| J | UD-IQ4_NL | 0.9870 | 7.91 | 4.11 |
| K | UD-Q4_K_L | 0.9956 | 8.39 | 4.66 |
| L | Q4_K_S | 0.9861 | 7.84 | 4.77 |
| M | MXFP4_MOE | 0.9823 | 7.52 | 4.98 |
| N | Q4_K_M | 0.9884 | 7.76 | 5.08 |
| O | UD-Q4_K_XL | 0.9875 | 7.49 | 5.13 |
| P | Q5_K_S | 0.9862 | 7.39 | 5.73 |
| Q | Q5_K_M | 0.9875 | 7.26 | 6.06 |
| R | UD-Q5_K_XL | 0.9877 | 7.22 | 6.09 |
| S | UD-Q6_K_S | 0.9922 | 7.92 | 6.58 |
| T | Q6_K | 0.9965 | 7.53 | 6.66 |
| U | UD-Q6_K_XL | 0.9967 | 6.57 | 7.40 |
| V | Q8_0 | 0.9966 | 6.74 | 8.52 |

CPU-6 is the near-baseline option. CPU-5 is probably the default pick for most people: still very close to baseline quality, but with a bit more speed. CPU-4 is the more aggressive/balanced option, and after that we get into more noticeable quality vs. speed trade-offs.

So, for the i7:

  • CPU-5 is the default recommendation,
  • CPU-6 if you want the highest quality,
  • CPU-4 if you want more speed without getting too adventurous.
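The "frontier" language used for the CPU charts can be checked directly from the legend. The sketch below computes a speed versus quality Pareto frontier over a subset of the i7 rows: a model stays on the frontier if no other model is at least as good on both accuracy and TPS and strictly better on one. `pareto_frontier` is an illustrative helper, not ByteShape tooling.

```python
# (label, normalized accuracy, tokens/s) copied from the i7 12700KF legend (subset).
I7_ROWS = [
    ("CPU-1", 0.9299, 11.83),
    ("CPU-2", 0.9641, 10.67),
    ("CPU-3", 0.9849, 10.18),
    ("CPU-4", 0.9858, 9.63),
    ("CPU-5", 0.9969, 9.02),
    ("CPU-6", 0.9981, 8.23),
    ("A", 0.9191, 10.08),
    ("D", 0.9671, 9.04),
    ("K", 0.9956, 8.39),
    ("T", 0.9965, 7.53),
    ("V", 0.9966, 6.74),
]


def pareto_frontier(rows):
    """Keep the labels of rows that no other row beats on both accuracy and speed."""
    frontier = []
    for name, acc, tps in rows:
        dominated = any(
            (a >= acc and t >= tps) and (a > acc or t > tps)
            for _, a, t in rows
        )
        if not dominated:
            frontier.append(name)
    return frontier


print(pareto_frontier(I7_ROWS))
# ['CPU-1', 'CPU-2', 'CPU-3', 'CPU-4', 'CPU-5', 'CPU-6']
```

On this subset, every other quant is dominated by at least one ByteShape CPU model, which matches the clean frontier described above.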

Ryzen 9 5900X

Ryzen 9 5900X: Tokens per second vs quality (bubble size = model footprint)

| # | Model | Acc | TPS | BPW |
|---|---|---|---|---|
| CPU-1 | Q3_K_S-2.69bpw | 0.9299 | 10.87 | 2.69 |
| CPU-2 | Q3_K_S-2.89bpw | 0.9641 | 10.19 | 2.89 |
| CPU-3 | Q3_K_S-3.40bpw | 0.9849 | 9.94 | 3.40 |
| CPU-4 | Q4_K_S-3.51bpw | 0.9858 | 9.51 | 3.51 |
| CPU-5 | IQ4_XS-4.06bpw | 0.9969 | 9.07 | 4.06 |
| CPU-6 | IQ4_XS-4.12bpw | 0.9981 | 8.57 | 4.12 |
| A | UD-IQ2_XXS | 0.9191 | 9.74 | 2.46 |
| B | UD-IQ2_M | 0.9468 | 9.66 | 2.63 |
| C | UD-Q2_K_XL | 0.9501 | 9.60 | 2.80 |
| D | UD-IQ3_XXS | 0.9671 | 9.06 | 3.02 |
| E | UD-IQ3_S | 0.9654 | 8.99 | 3.13 |
| F | Q3_K_S | 0.9863 | 8.30 | 3.52 |
| G | Q3_K_M | 0.9817 | 8.36 | 3.77 |
| H | UD-Q3_K_XL | 0.9843 | 8.42 | 3.83 |
| I | UD-IQ4_XS | 0.9814 | 8.27 | 4.03 |
| J | UD-IQ4_NL | 0.9870 | 8.25 | 4.11 |
| K | UD-Q4_K_L | 0.9956 | 8.68 | 4.66 |
| L | Q4_K_S | 0.9861 | 8.36 | 4.77 |
| M | MXFP4_MOE | 0.9823 | 8.00 | 4.98 |
| N | Q4_K_M | 0.9884 | 8.18 | 5.08 |
| O | UD-Q4_K_XL | 0.9875 | 7.97 | 5.13 |
| P | Q5_K_S | 0.9862 | 7.83 | 5.73 |
| Q | Q5_K_M | 0.9875 | 7.79 | 6.06 |
| R | UD-Q5_K_XL | 0.9877 | 7.80 | 6.09 |
| S | UD-Q6_K_S | 0.9922 | 8.33 | 6.58 |
| T | Q6_K | 0.9965 | 8.01 | 6.66 |
| U | UD-Q6_K_XL | 0.9967 | 7.27 | 7.40 |
| V | Q8_0 | 0.9966 | 7.42 | 8.52 |

Almost the same story as the i7, which is refreshing.

The frontier is clean, the recommendations are clean, and the overall trade-off pattern looks very familiar. Again, CPU-5 remains a great default choice, CPU-6 is the quality-first choice, and CPU-4 is the more speed-oriented alternative.

Ultra 7 265KF

Ultra 7 265KF: Tokens per second vs quality (bubble size = model footprint)

| # | Model | Acc | TPS | BPW |
|---|---|---|---|---|
| CPU-1 | Q3_K_S-2.69bpw | 0.9299 | 14.00 | 2.69 |
| CPU-2 | Q3_K_S-2.89bpw | 0.9641 | 13.63 | 2.89 |
| CPU-3 | Q3_K_S-3.40bpw | 0.9849 | 12.90 | 3.40 |
| CPU-4 | Q4_K_S-3.51bpw | 0.9858 | 12.68 | 3.51 |
| CPU-5 | IQ4_XS-4.06bpw | 0.9969 | 12.27 | 4.06 |
| CPU-6 | IQ4_XS-4.12bpw | 0.9981 | 11.79 | 4.12 |
| A | UD-IQ2_XXS | 0.9191 | 12.71 | 2.46 |
| B | UD-IQ2_M | 0.9468 | 12.73 | 2.63 |
| C | UD-Q2_K_XL | 0.9501 | 12.12 | 2.80 |
| D | UD-IQ3_XXS | 0.9671 | 12.00 | 3.02 |
| E | UD-IQ3_S | 0.9654 | 11.42 | 3.13 |
| F | Q3_K_S | 0.9863 | 10.60 | 3.52 |
| G | Q3_K_M | 0.9817 | 10.29 | 3.77 |
| H | UD-Q3_K_XL | 0.9843 | 11.00 | 3.83 |
| I | UD-IQ4_XS | 0.9814 | 10.52 | 4.03 |
| J | UD-IQ4_NL | 0.9870 | 10.40 | 4.11 |
| K | UD-Q4_K_L | 0.9956 | 11.41 | 4.66 |
| L | Q4_K_S | 0.9861 | 10.58 | 4.77 |
| M | MXFP4_MOE | 0.9823 | 10.37 | 4.98 |
| N | Q4_K_M | 0.9884 | 10.52 | 5.08 |
| O | UD-Q4_K_XL | 0.9875 | 10.37 | 5.13 |
| P | Q5_K_S | 0.9862 | 10.39 | 5.73 |
| Q | Q5_K_M | 0.9875 | 10.26 | 6.06 |
| R | UD-Q5_K_XL | 0.9877 | 10.55 | 6.09 |
| S | UD-Q6_K_S | 0.9922 | 11.16 | 6.58 |
| T | Q6_K | 0.9965 | 10.87 | 6.66 |
| U | UD-Q6_K_XL | 0.9967 | 9.18 | 7.40 |
| V | Q8_0 | 0.9966 | 9.74 | 8.52 |

The Ultra 7 is just good news all around.

Here again, the same ByteShape models sit on the frontier, but the absolute speeds are better. Sound familiar? CPU-5 is a great default, CPU-6 is the near-baseline pick, and CPU-4 or CPU-3 are there if you want to push harder on speed.

Raspberry Pi 5 (16 GB)

Raspberry Pi 5: Tokens per second vs quality (bubble size = model footprint)

| # | Model | Acc | TPS | BPW |
|---|---|---|---|---|
| CPU-1 | Q3_K_S-2.69bpw | 0.9299 | 3.29 | 2.69 |
| CPU-2 | Q3_K_S-2.89bpw | 0.9641 | 3.04 | 2.89 |
| CPU-3 | Q3_K_S-3.40bpw | 0.9849 | 2.96 | 3.40 |
| CPU-4 | Q4_K_S-3.51bpw | 0.9858 | 2.84 | 3.51 |
| A | UD-IQ2_XXS | 0.9191 | 2.80 | 2.46 |
| B | UD-IQ2_M | 0.9468 | 2.74 | 2.63 |
| C | UD-Q2_K_XL | 0.9501 | 2.73 | 2.80 |
| D | UD-IQ3_XXS | 0.9671 | 2.58 | 3.02 |
| E | UD-IQ3_S | 0.9654 | 2.53 | 3.13 |
| F | Q3_K_S | 0.9863 | 2.37 | 3.52 |

It's alive. Barely, but still kicking. For Pi deployments, maybe take a look at Qwen3-Coder, which is nearly 3x faster.

CPU-4 and CPU-3 are worth a closer look, landing around the 3 TPS range while keeping quality in a range that is still genuinely usable.

Benchmarking Methodology

Generating our models takes little time. What takes disproportionately longer is evaluating all reported models across the following seven benchmarks:

  • BFCL_V3 for tool calling
  • GSM8K_V for vision + math
  • LiveCodeBench V6 and HumanEval for coding
  • GSM8K for math
  • IFEVAL for instruction following
  • MMLU for general knowledge

The reported score is the mean across these benchmarks, with each benchmark normalized to the original model's score.

We evaluated GSM8K_V in both instruct and thinking modes and treated them as separate entries in the average. In practice, we observe that the relative performance between modes remains consistent.
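As a minimal sketch of the scoring described above, the snippet below normalizes each benchmark entry to the original model's score and takes the mean, with GSM8K_V counted once per mode. All raw scores are made-up placeholders for illustration, and `normalized_score` is a hypothetical helper name, not the actual evaluation code.

```python
from statistics import mean

# Placeholder raw scores for the original (unquantized) model; not real results.
BASELINE = {
    "BFCL_V3": 0.70,
    "GSM8K_V (instruct)": 0.60,
    "GSM8K_V (thinking)": 0.80,
    "LiveCodeBench V6": 0.50,
    "HumanEval": 0.90,
    "GSM8K": 0.95,
    "IFEVAL": 0.85,
    "MMLU": 0.80,
}

# Placeholder raw scores for one quantized model; not real results.
QUANT = {
    "BFCL_V3": 0.69,
    "GSM8K_V (instruct)": 0.59,
    "GSM8K_V (thinking)": 0.80,
    "LiveCodeBench V6": 0.49,
    "HumanEval": 0.89,
    "GSM8K": 0.94,
    "IFEVAL": 0.85,
    "MMLU": 0.79,
}


def normalized_score(quant_scores, baseline_scores):
    """Mean of per-benchmark scores, each divided by the baseline model's score."""
    return mean(quant_scores[name] / baseline_scores[name] for name in baseline_scores)


print(f"normalized score: {normalized_score(QUANT, BASELINE):.4f}")
```

A score of 1.0 under this scheme means the quantized model matches the original on average, which is how the Acc column in the legends can be read.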

All evaluations were run with llama.cpp b8204.