Blackwell Picks Favorites:
Qwen 3.5 35B A3B

Published by ByteShape Team • 10 April 2026

Here's our next ByteShape release: Qwen 3.5 35B A3B. This time it is a Mixture-of-Experts model, not a dense one like the 9B, and the hardware story flips almost entirely.

With the 9B, GPUs were fairly agreeable while CPUs had strong, diverging preferences. For this 35B model it is the reverse: CPUs are surprisingly consistent, while GPUs are much pickier about which quantized models run best on each card.

That is why we are presenting models a bit differently this time. In the GPU charts, we highlight the models that are the best match for each specific card and gray out the ones that are not really the right choice. There is no single GPU pick that works best everywhere, but once you know your hardware, the right options become fairly clear.

We also have a step-by-step tutorial on how to run our models locally with OpenCode.

⚠️

And… as always, be careful with any tool you choose to grant access to.

TL;DR

On CPUs, the picture is clean. ByteShape models trace a very consistent speed versus quality frontier across the i7, Ryzen 9 5900X, Ultra 7, and even the Raspberry Pi 5, so the recommendations barely change from one system to another.
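The frontier idea is easy to make concrete. Here is a small illustrative sketch (function name and sample numbers are ours, not part of the release): a quant sits on the speed-quality frontier when no other quant is both faster and at least as accurate.

```python
def pareto_frontier(points):
    """Keep (tps, acc) pairs that are not dominated by another point.

    A point is dominated if some other point is faster with at least
    the same accuracy, or at least as fast with strictly higher accuracy.
    """
    frontier = []
    for tps, acc in points:
        dominated = any(
            (t > tps and a >= acc) or (t >= tps and a > acc)
            for t, a in points
        )
        if not dominated:
            frontier.append((tps, acc))
    return sorted(frontier)

# Three illustrative quants: the second dominates the first,
# so only the last two survive.
print(pareto_frontier([(167.5, 0.9191), (180.5, 0.9387), (156.4, 0.9981)]))
```

Everything the charts gray out is, in effect, a point that fails this dominance check for the card in question.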

On GPUs, device-specific optimization matters a lot more. The 40-series cards clearly like one set of models, while the Blackwell cards prefer others.

For the impatient:

  • RTX 4080: GPU-4 is the obvious balanced pick.
  • RTX 4090: GPU-5 would be our recommendation for "best" speed vs. accuracy, while GPU-6 practically matches baseline accuracy.
  • RTX 5090 / RTX Pro 6000 Blackwell: GPU-5 remains our recommendation.
  • CPUs: CPU-5 is our recommendation for all CPUs, with CPU-6 if you want to stay as close to baseline as possible.

GPUs

Unlike the usual pattern, where the same ByteShape models tend to perform best across all GPUs, Blackwell shows a much stronger preference for specific datatypes. The charts make this clear, so we highlight the models that best match what Blackwell favors, while still releasing models that remain strong choices on older GPUs as well.

RTX 4090 (24 GB)

RTX 4090: Tokens per second vs quality (bubble size = model footprint)

#      Model      Acc     TPS     BPW
GPU-1  2.17bpw    0.8826  183.02  2.17
GPU-2  2.73bpw    0.9387  180.49  2.73
GPU-3  2.89bpw    0.9641  176.35  2.89
GPU-4  3.40bpw    0.9849  168.98  3.40
GPU-5  4.06bpw    0.9969  164.77  4.06
GPU-6  4.12bpw    0.9981  156.44  4.12
1      IQ2_XXS    0.9191  167.54  2.46
2      IQ2_M      0.9468  167.34  2.63
3      Q2_K_XL    0.9501  166.66  2.80
4      IQ3_XXS    0.9671  157.59  3.02
5      IQ3_S      0.9654  157.65  3.13
6      Q3_K_S     0.9863  148.82  3.52
7      Q3_K_M     0.9817  150.90  3.77
8      Q3_K_XL    0.9843  150.95  3.83
9      IQ4_XS     0.9814  150.24  4.03
10     IQ4_NL     0.9870  151.58  4.11
11     Q4_K_L     0.9956  159.96  4.66
12     Q4_K_S     0.9861  149.69  4.77
13     MXFP4_MOE  0.9823  147.13  4.98
14     Q4_K_M     0.9884  149.31  5.08
15     Q4_K_XL    0.9875  146.56  5.13

The 4090 is actually one of the clearest charts here.

GPU-6 is the conservative, practically same-as-baseline accuracy pick. GPU-5, though, holds essentially baseline quality while adding a meaningful speed boost.

Below those, GPU-4 remains a solid balance pick, and GPU-3 is the faster, more aggressive pick if you are willing to give up more quality and free space for more context.

So on the 4090:

  • GPU-6 if you want the safest near-baseline choice,
  • GPU-5 if you want the more exciting high-end pick,
  • GPU-4 if you want a more classic speed/quality balance.

RTX 4080 (16 GB)

RTX 4080: Tokens per second vs quality (bubble size = model footprint)

#      Model      Acc     TPS     BPW
GPU-1  2.17bpw    0.8826  160.49  2.17
GPU-2  2.73bpw    0.9387  157.17  2.73
GPU-3  2.89bpw    0.9641  154.14  2.89
GPU-4  3.40bpw    0.9849  145.95  3.40
1      IQ2_XXS    0.9191  145.23  2.46
2      IQ2_M      0.9468  144.14  2.63
3      Q2_K_XL    0.9501  143.66  2.80
4      IQ3_XXS    0.9671  134.83  3.02
5      IQ3_S      0.9654  134.84  3.13
6      Q3_K_S     0.9863  125.95  3.52

Similar to the 4090, but with less VRAM and therefore fewer options.

GPU-4 is the clear recommendation: it sits around the mid-140s TPS while still landing at roughly 98.5% of baseline quality. That is a very clean trade-off, and the footprint still leaves room for up to 16K of context.

If you want more speed or bigger context, GPU-3 is the faster option, pushing into the mid-150s TPS range while staying around 96.4% quality. After that, GPU-2 and GPU-1 get increasingly aggressive, and by that point the quality trade-off is much harder to justify unless your priority is simply extracting maximum throughput/minimizing model size to allow for bigger context windows.

So for the 4080, the recommendation is straightforward:

  • GPU-4 for the best overall balance,
  • GPU-3 if you want the faster, slightly more aggressive option.
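The size arithmetic behind these VRAM limits is simple to sketch. A rough back-of-the-envelope (the 35B parameter count comes from the model name; real GGUF files also carry embeddings and metadata, so treat this as an approximation, not an exact file size):

```python
def footprint_gib(params_billion, bpw):
    # Approximate weight size: parameters x bits-per-weight / 8 bytes each.
    return params_billion * 1e9 * bpw / 8 / 2**30

# GPU-4 at 3.40 bpw on a 35B model: roughly 14 GiB of weights,
# leaving the rest of a 16 GB card for KV cache and activations.
print(round(footprint_gib(35, 3.40), 1))
```

The same arithmetic explains why the 4080 chart stops at GPU-4 while the 4090 and the Blackwell cards can go further up the BPW scale.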

RTX 5090 (32 GB)

RTX 5090: Tokens per second vs quality (bubble size = model footprint)

#      Model      Acc     TPS     BPW
GPU-1  2.17bpw    0.8826  202.97  2.17
GPU-2  2.73bpw    0.9387  194.92  2.73
GPU-3  2.89bpw    0.9641  197.28  2.89
GPU-4  3.40bpw    0.9849  187.82  3.40
GPU-5  4.06bpw    0.9969  196.91  4.06
GPU-6  4.12bpw    0.9981  190.53  4.12
1      IQ2_XXS    0.9191  193.02  2.46
2      IQ2_M      0.9468  192.95  2.63
3      Q2_K_XL    0.9501  192.78  2.80
4      IQ3_XXS    0.9671  183.84  3.02
5      IQ3_S      0.9654  181.40  3.13
6      Q3_K_S     0.9863  178.10  3.52
7      Q3_K_M     0.9817  181.96  3.77
8      Q3_K_XL    0.9843  181.82  3.83
9      IQ4_XS     0.9814  182.02  4.03
10     IQ4_NL     0.9870  180.71  4.11
11     Q4_K_L     0.9956  190.59  4.66
12     Q4_K_S     0.9861  180.82  4.77
13     MXFP4_MOE  0.9823  180.43  4.98
14     Q4_K_M     0.9884  182.30  5.08
15     Q4_K_XL    0.9875  179.57  5.13
16     Q5_K_S     0.9862  178.65  5.73
17     Q5_K_M     0.9875  177.91  6.06
18     Q5_K_XL    0.9877  177.69  6.09
19     Q6_K_S     0.9922  181.63  6.58
20     Q6_K       0.9965  179.47  6.66
21     Q6_K_XL    0.9967  173.75  7.40
ℹ️

Reminder: Grayed-out models are shown for completeness. We do not recommend them for this GPU, but we include them for comparison with the older GPUs where those models perform best.

This is where Blackwell's preferences really start to stand out. With llama.cpp kernels on Blackwell, the I-quant datatypes (IQ*) perform better, while the K-quant datatypes (Q*_K) are less competitive. As a result, GPU-4 and GPU-2 fall off the ideal quality-performance curve, because most of their layers are dominated by K-quant datatypes.

On Blackwell, GPU-5 looks excellent. It gets very close to baseline quality while pushing close to the 200 TPS range.

GPU-6 is still there as the high-quality anchor, and it is also fast. But if you are looking at the 5090 chart and asking "what is the new thing here?", the answer is clearly GPU-5.

So for the 5090:

  • GPU-5 is the standout,
  • GPU-6 is the conservative premium-quality choice.

RTX Pro 6000 Blackwell (96 GB)

RTX Pro 6000 Blackwell: Tokens per second vs quality (bubble size = model footprint)

#      Model      Acc     TPS     BPW
GPU-1  2.17bpw    0.8826  202.77  2.17
GPU-2  2.73bpw    0.9387  193.76  2.73
GPU-3  2.89bpw    0.9641  196.72  2.89
GPU-4  3.40bpw    0.9849  187.31  3.40
GPU-5  4.06bpw    0.9969  194.42  4.06
GPU-6  4.12bpw    0.9981  186.67  4.12
1      IQ2_XXS    0.9191  191.71  2.46
2      IQ2_M      0.9468  192.25  2.63
3      Q2_K_XL    0.9501  191.27  2.80
4      IQ3_XXS    0.9671  183.36  3.02
5      IQ3_S      0.9654  181.24  3.13
6      Q3_K_S     0.9863  177.25  3.52
7      Q3_K_M     0.9817  181.00  3.77
8      Q3_K_XL    0.9843  179.91  3.83
9      IQ4_XS     0.9814  180.14  4.03
10     IQ4_NL     0.9870  180.16  4.11
11     Q4_K_L     0.9956  188.66  4.66
12     Q4_K_S     0.9861  178.55  4.77
13     MXFP4_MOE  0.9823  177.43  4.98
14     Q4_K_M     0.9884  178.87  5.08
15     Q4_K_XL    0.9875  176.18  5.13
16     Q5_K_S     0.9862  176.10  5.73
17     Q5_K_M     0.9875  175.33  6.06
18     Q5_K_XL    0.9877  174.72  6.09
19     Q6_K_S     0.9922  179.47  6.58
20     Q6_K       0.9965  177.41  6.66
21     Q6_K_XL    0.9967  171.02  7.40
22     Q8_0       0.9966  170.09  8.52
ℹ️

Reminder: Grayed-out models are shown for completeness. We do not recommend them for this GPU, but we include them for comparison with the older GPUs where those models perform best.

Very similar story to the 5090, which is exactly what you would expect.

Again, GPU-5 looks unusually strong here. It stays essentially on top of the quality chart while delivering a large speed advantage, and once again it looks especially well aligned with Blackwell.

GPU-6 is the safe pick if your top priority is hugging the baseline as closely as possible. GPU-4 is still fine. But the interesting result is the same as on the 5090: GPU-5 is the one that makes you stop and look twice.

So, déjà vu, for the RTX Pro 6000 Blackwell (workstation):

  • GPU-5 offers excellent speed and accuracy,
  • GPU-6 is the conservative premium-quality choice.

CPUs

Picking the right quant for a CPU is more straightforward.

Intel Core i7 12700KF

Intel Core i7 12700KF: Tokens per second vs quality (bubble size = model footprint)

#      Model      Acc     TPS    BPW
CPU-1  2.69bpw    0.9299  11.83  2.69
CPU-2  2.89bpw    0.9641  10.67  2.89
CPU-3  3.40bpw    0.9849  10.18  3.40
CPU-4  3.51bpw    0.9858   9.63  3.51
CPU-5  4.06bpw    0.9969   9.02  4.06
CPU-6  4.12bpw    0.9981   8.23  4.12
1      IQ2_XXS    0.9191  10.08  2.46
2      IQ2_M      0.9468   9.92  2.63
3      Q2_K_XL    0.9501   9.88  2.80
4      IQ3_XXS    0.9671   9.04  3.02
5      IQ3_S      0.9654   8.91  3.13
6      Q3_K_S     0.9863   8.04  3.52
7      Q3_K_M     0.9817   8.05  3.77
8      Q3_K_XL    0.9843   8.05  3.83
9      IQ4_XS     0.9814   7.95  4.03
10     IQ4_NL     0.9870   7.91  4.11
11     Q4_K_L     0.9956   8.39  4.66
12     Q4_K_S     0.9861   7.84  4.77
13     MXFP4_MOE  0.9823   7.52  4.98
14     Q4_K_M     0.9884   7.76  5.08
15     Q4_K_XL    0.9875   7.49  5.13
16     Q5_K_S     0.9862   7.39  5.73
17     Q5_K_M     0.9875   7.26  6.06
18     Q5_K_XL    0.9877   7.22  6.09
19     Q6_K_S     0.9922   7.92  6.58
20     Q6_K       0.9965   7.53  6.66
21     Q6_K_XL    0.9967   6.57  7.40
22     Q8_0       0.9966   6.74  8.52

CPU-6 is the near-baseline option. CPU-5 is probably the default pick for most people: still very close to baseline quality, but with a bit more speed. CPU-4 is the more aggressive/balanced option, and after that we get into more noticeable quality vs. speed trade-offs.

So, for the i7:

  • CPU-5 is the default recommendation,
  • CPU-6 if you want the highest quality,
  • CPU-4 if you want more speed without getting too adventurous.

Ryzen 9 5900X

Ryzen 9 5900X: Tokens per second vs quality (bubble size = model footprint)

#      Model      Acc     TPS    BPW
CPU-1  2.69bpw    0.9299  10.87  2.69
CPU-2  2.89bpw    0.9641  10.19  2.89
CPU-3  3.40bpw    0.9849   9.94  3.40
CPU-4  3.51bpw    0.9858   9.51  3.51
CPU-5  4.06bpw    0.9969   9.07  4.06
CPU-6  4.12bpw    0.9981   8.57  4.12
1      IQ2_XXS    0.9191   9.74  2.46
2      IQ2_M      0.9468   9.66  2.63
3      Q2_K_XL    0.9501   9.60  2.80
4      IQ3_XXS    0.9671   9.06  3.02
5      IQ3_S      0.9654   8.99  3.13
6      Q3_K_S     0.9863   8.30  3.52
7      Q3_K_M     0.9817   8.36  3.77
8      Q3_K_XL    0.9843   8.42  3.83
9      IQ4_XS     0.9814   8.27  4.03
10     IQ4_NL     0.9870   8.25  4.11
11     Q4_K_L     0.9956   8.68  4.66
12     Q4_K_S     0.9861   8.36  4.77
13     MXFP4_MOE  0.9823   8.00  4.98
14     Q4_K_M     0.9884   8.18  5.08
15     Q4_K_XL    0.9875   7.97  5.13
16     Q5_K_S     0.9862   7.83  5.73
17     Q5_K_M     0.9875   7.79  6.06
18     Q5_K_XL    0.9877   7.80  6.09
19     Q6_K_S     0.9922   8.33  6.58
20     Q6_K       0.9965   8.01  6.66
21     Q6_K_XL    0.9967   7.27  7.40
22     Q8_0       0.9966   7.42  8.52

Almost the same story as the i7, which is refreshing.

The frontier is clean, the recommendations are clean, and the overall trade-off pattern looks very familiar. Again, CPU-5 remains a great default choice, CPU-6 is the quality-first choice, and CPU-4 is the more speed-oriented alternative.

Ultra 7 265KF

Ultra 7 265KF: Tokens per second vs quality (bubble size = model footprint)

#      Model      Acc     TPS    BPW
CPU-1  2.69bpw    0.9299  14.00  2.69
CPU-2  2.89bpw    0.9641  13.63  2.89
CPU-3  3.40bpw    0.9849  12.90  3.40
CPU-4  3.51bpw    0.9858  12.68  3.51
CPU-5  4.06bpw    0.9969  12.27  4.06
CPU-6  4.12bpw    0.9981  11.79  4.12
1      IQ2_XXS    0.9191  12.71  2.46
2      IQ2_M      0.9468  12.73  2.63
3      Q2_K_XL    0.9501  12.12  2.80
4      IQ3_XXS    0.9671  12.00  3.02
5      IQ3_S      0.9654  11.42  3.13
6      Q3_K_S     0.9863  10.60  3.52
7      Q3_K_M     0.9817  10.29  3.77
8      Q3_K_XL    0.9843  11.00  3.83
9      IQ4_XS     0.9814  10.52  4.03
10     IQ4_NL     0.9870  10.40  4.11
11     Q4_K_L     0.9956  11.41  4.66
12     Q4_K_S     0.9861  10.58  4.77
13     MXFP4_MOE  0.9823  10.37  4.98
14     Q4_K_M     0.9884  10.52  5.08
15     Q4_K_XL    0.9875  10.37  5.13
16     Q5_K_S     0.9862  10.39  5.73
17     Q5_K_M     0.9875  10.26  6.06
18     Q5_K_XL    0.9877  10.55  6.09
19     Q6_K_S     0.9922  11.16  6.58
20     Q6_K       0.9965  10.87  6.66
21     Q6_K_XL    0.9967   9.18  7.40
22     Q8_0       0.9966   9.74  8.52

The Ultra 7 is just good news all around.

Here again, the same ByteShape models sit on the frontier, but the absolute speeds are better. Sound familiar? CPU-5 is a great default, CPU-6 is the near-baseline pick, and CPU-4 or CPU-3 are there if you want to push harder on speed.

Raspberry Pi 5 (16 GB)

Raspberry Pi 5: Tokens per second vs quality (bubble size = model footprint)

#      Model      Acc     TPS   BPW
CPU-1  2.69bpw    0.9299  3.29  2.69
CPU-2  2.89bpw    0.9641  3.04  2.89
CPU-3  3.40bpw    0.9849  2.96  3.40
CPU-4  3.51bpw    0.9858  2.84  3.51
1      IQ2_XXS    0.9191  2.80  2.46
2      IQ2_M      0.9468  2.74  2.63
3      Q2_K_XL    0.9501  2.73  2.80
4      IQ3_XXS    0.9671  2.58  3.02
5      IQ3_S      0.9654  2.53  3.13
6      Q3_K_S     0.9863  2.37  3.52

It's alive, barely, but still kicking. For Pi deployments, maybe take a look at Qwen3-Coder, which is nearly 3x faster.

CPU-4 and CPU-3 are worth a closer look, landing around the 3 TPS range while keeping quality in a range that is still genuinely usable.

Benchmarking Methodology

Generating our models takes little time. What takes disproportionately long is evaluating every reported model across the following seven benchmarks:

  • BFCL_V3 for tool calling
  • GSM8K_V for vision + math
  • LiveCodeBench V6 and HumanEval for coding
  • GSM8K for math
  • IFEVAL for instruction following
  • MMLU for general knowledge

The reported score is the mean across these benchmarks, with each benchmark normalized to the original model's score.
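The scoring rule above can be written out in a few lines of Python (the benchmark names and numbers here are purely illustrative, not real results):

```python
def quality_score(quant_scores, baseline_scores):
    # Normalize each benchmark to the original (unquantized) model's
    # score, then average: 1.0 means "matches the baseline exactly".
    ratios = [quant_scores[b] / baseline_scores[b] for b in baseline_scores]
    return sum(ratios) / len(ratios)

# Illustrative numbers only: 90% of baseline on one benchmark,
# 100% on the other, averaging to 0.95.
print(round(quality_score({"gsm8k": 0.45, "mmlu": 0.60},
                          {"gsm8k": 0.50, "mmlu": 0.60}), 4))
```

This is why the Acc column tops out near 1.0: it is a ratio against the original model, not an absolute benchmark score.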

We evaluated GSM8K_V in both instruct and thinking modes and treated them as separate entries in the average. In practice, we observe that the relative performance between modes remains consistent.

All evaluations were run with llama.cpp b8204.