Happy GPUs, Moody CPUs:
Qwen 3.5 9B

Published by ByteShape Team • 31 March 2026

Here's our next ByteShape release: Qwen 3.5 9B. As usual, ShapeLearn finds better quality/speed trade-offs across a range of hardware, and we'll walk through what looks the best on each one.

We're also putting out a fun "little" tutorial on how to run our models with OpenCode very soon.

TL;DR

We're kicking off Qwen 3.5 with the 9B model, and GPUs were unusually agreeable about it. The same few ByteShape models kept coming out on top:
GPU-7 is the absolute baseline-quality pick, GPU-6 offers the best overall balance, and GPU-4 is the fun, faster one if you are okay with giving up a bit more quality.

CPUs, meanwhile, had much stronger opinions. There is still a lot to like there, but the best model depends much more on the exact device you are running on.

GPUs

Across GPUs, the story is very consistent. The same few ByteShape models keep showing up as the best trade-offs, so once you know what kind of quality/speed balance you want, the recommendation does not change much from one GPU to another.

RTX 5090 (32 GB)

Let's start by highlighting the RTX 5090 results. Like in all our other releases, ShapeLearn finds quantized models that achieve a clearly better quality/speed trade-off.

RTX 5090: Tokens per second vs quality (bubble size = model footprint)

ShapeLearn consistently delivers the best quality/speed trade-off across the full spectrum. The reductions in relative error are most visible in the above-190-TPS regime. If you are looking for a near-baseline-quality model (within 1%!), GPU-6 is the pick, at just above 190 TPS. If you can afford the speed hit, GPU-7 retains 99.63% of the baseline while still reaching almost 180 TPS. Finally, GPU-4 exceeds 210 tokens/second while keeping the error under 4%.
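To make the "best trade-off" idea concrete, here is a minimal Python sketch of Pareto filtering over (speed, quality) points. The model names and numbers are illustrative, loosely echoing the RTX 5090 discussion above, not exact measurements from our runs.

```python
def pareto_front(models):
    """Keep only models not dominated in both speed and quality.

    models: list of (name, tps, quality) tuples.
    A model is dominated if some other model is at least as fast AND
    at least as accurate, and strictly better in one of the two.
    """
    front = []
    for name, tps, q in models:
        dominated = any(
            (t2 >= tps and q2 >= q) and (t2 > tps or q2 > q)
            for n2, t2, q2 in models
            if n2 != name
        )
        if not dominated:
            front.append((name, tps, q))
    # Sort by speed so the list reads like the charts, left to right
    return sorted(front, key=lambda m: m[1])


# Illustrative numbers only, loosely based on the figures above
models = [
    ("baseline", 150, 100.0),
    ("GPU-7", 180, 99.63),
    ("GPU-6", 192, 99.0),
    ("GPU-4", 212, 96.2),
    ("GPU-2", 225, 89.5),
]
print(pareto_front(models))
```

Every point the charts highlight sits on this front: no other model is simultaneously faster and more accurate.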

RTX 4080 (16 GB)

We see a very similar story on a less fancy GPU, like the 16 GB RTX 4080. The speed gains you get for sacrificing quality, however, are even larger here!

RTX 4080: Tokens per second vs quality (bubble size = model footprint)

We would recommend the same models in this case.

To match baseline quality as closely as possible, go with GPU-7: 99.63% quality at just shy of 100 TPS.

If exactly matching baseline quality is not a must and a 1% loss sounds acceptable, go with GPU-6 at more than 105 TPS.

If you are okay with a slight quality loss, pick GPU-4: a 16% TPS increase over GPU-6 while still keeping the error under 4%!

If you are even more adventurous, GPU-2 reaches 140 TPS (over 30% faster than GPU-6) at "almost" 90% quality!

RTX 3090 (24 GB)

Old, but used to be fancy? Still the same story. These charts are not very creative!

RTX 3090: Tokens per second vs quality (bubble size = model footprint)

Pick the same models as above. They are going to run a bit slower, but that is par for the course with older hardware.

RTX 5060Ti (16 GB)

New, but on the lower end? Same story? Impossible!

RTX 5060Ti: Tokens per second vs quality (bubble size = model footprint)

We are looking at lower TPS here, but the overall pattern stays the same. Giving up some quality still buys a noticeable speed boost. In practice, the 5060 Ti behaves a lot like the 4080, so the discussion above is relevant here too.

CPUs

This is where it gets messy. We tested the quantized models we generated on quite a few CPUs and… well, the CPUs had "diverging opinions" about which models actually run fast. Since every CPU had its favourites and the ones it really disliked, we release all of them and highlight the ones that work well on each CPU (the brighter ones). This also shows that model optimization should be done with the exact target device in mind: a model that runs well on one CPU can perform horribly on another. It's not as simple as CPU vs. GPU.
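As a concrete illustration of per-device selection, here is a tiny sketch that picks the fastest model meeting a quality floor on each CPU. The numbers are the ones quoted in the sections below; `best_for` is a hypothetical helper for illustration, not part of any release tooling.

```python
# Per-device results as (model, TPS, quality %), taken from the
# i7 and Ryzen sections of this post.
results = {
    "i7-12700KF": [("CPU-5", 6.45, 96.4), ("CPU-2", 7.41, 87.8)],
    "Ryzen 9 5900X": [
        ("CPU-8", 6.60, 99.15),
        ("CPU-5", 7.16, 96.4),
        ("CPU-2", 7.72, 87.8),
    ],
}


def best_for(device, min_quality):
    """Fastest model on `device` whose quality meets the floor."""
    ok = [m for m in results[device] if m[2] >= min_quality]
    return max(ok, key=lambda m: m[1]) if ok else None


print(best_for("Ryzen 9 5900X", 95.0))  # -> ("CPU-5", 7.16, 96.4)
```

The point is that this selection has to be run per device: the same quality floor can map to different winners on different CPUs.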

Intel Core i7 12700KF

The CPU side starts hurting right away. On the i7, ByteShape models extend the usable quality range toward "real-time", but even with the boost, the wait will test your patience.

Intel Core i7 12700KF: Tokens per second vs quality (bubble size = model footprint)

If you prefer the wait over possible traces of gibberish, go with CPU-5 at 6.45 TPS and 96.4% quality. If you would rather buy a bit more speed with some quality loss, CPU-2 reaches 7.41 TPS at 87.8% quality.

Ryzen 9 5900X

Ryzen likes all our models! A nice, clean, optimal performance-vs-quality line.

Ryzen 9 5900X: Tokens per second vs quality (bubble size = model footprint)

Model-wise, the trade-offs end up looking pretty similar to the i7: CPU-8 gets 99.15% quality at 6.60 TPS, CPU-5 gives 96.4% quality at 7.16 TPS, and CPU-2 reaches 7.72 TPS at 87.8% quality.

Ultra 7 265KF

If you are lucky enough to run a fancier CPU like Ultra 7 (but not lucky enough to run a proper GPU), you can comfortably run in "real-time".

Ultra 7 265KF: Tokens per second vs quality (bubble size = model footprint)

In this case, CPU-9 is the clear choice. It achieves 99.41% quality at 8.77 TPS. No need to look further.

RPi 5 (16GB)

More like RIP 5. It's painfully slow. Go with the 4B or the 35B MoE once we release them!

Raspberry Pi 5: Tokens per second vs quality (bubble size = model footprint)

Benchmarking Methodology

We evaluated the models across seven benchmarks:

  • BFCL_V3 for tool calling
  • GSM8K_V for vision + math
  • LiveCodeBench V6 and HumanEval for coding
  • GSM8K for math
  • IFEVAL for instruction following
  • MMLU for general knowledge

The reported score is the mean across these benchmarks, with each benchmark normalized to the original model's score.

We evaluated GSM8K_V in both instruct and thinking modes and treated them as separate entries in the average. In practice, we observe that the relative performance between modes remains consistent.
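The scoring described above can be sketched roughly as follows. The benchmark names mirror the list above, but the scores are made-up placeholders, and the exact weighting is our plain reading of the text: a simple mean over normalized entries, with each GSM8K_V mode counted as its own entry.

```python
def normalized_score(model_scores, baseline_scores):
    """Mean of per-benchmark scores, each normalized to the baseline model.

    Benchmarks evaluated in multiple modes (e.g. GSM8K_V instruct vs
    thinking) appear as separate keys and therefore separate entries.
    Returns a percentage of baseline quality.
    """
    ratios = [model_scores[b] / baseline_scores[b] for b in baseline_scores]
    return 100.0 * sum(ratios) / len(ratios)


# Placeholder scores, NOT our actual benchmark results
baseline = {
    "GSM8K": 0.90,
    "MMLU": 0.75,
    "GSM8K_V (instruct)": 0.80,
    "GSM8K_V (thinking)": 0.85,
}
quantized = {
    "GSM8K": 0.88,
    "MMLU": 0.74,
    "GSM8K_V (instruct)": 0.79,
    "GSM8K_V (thinking)": 0.84,
}
print(round(normalized_score(quantized, baseline), 2))
```

Normalizing per benchmark before averaging keeps easy benchmarks (high absolute scores) from dominating hard ones.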

All evaluations were run with llama.cpp b8204.

More Qwen 3.5 models are on the way

Hopefully, very soon!

Wrapping up

Thanks for reading and for following our work. This is the first Qwen 3.5 drop, and we'll keep adding the rest here as they are ready.