Every Hardware Deserves a Coder:
Devstral Small 2 24B & Qwen3 Coder 30B

Published by ByteShape Team • 18 February 2026

We'll keep this one short. You already know what we do. You've definitely read our previous posts on how ShapeLearn works… right. Right? So no warm-up: one small methodology tweak, then straight to the facts.

ShapeLearn learns the best datatypes for a specific use case and the dataset that represents it. In our first two blogs we released general instruction-following models, so we trained on a broad mix of "a bit of everything." For this coder release, the mix is intentionally coder-shaped: code understanding and completion, refactoring and bug fixing, tool/function calling, tricky formatting, some "visual coding" tasks (e.g., recreating HTML from a screenshot or interpreting charts), plus a solid chunk of math/logic and a small amount of general instruction-following. The focus isn't on martini recipes and outfit suggestions but on coding, math, and tool calling (so yes, it will fetch the weather if you ask for outfit advice).
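
ByteShape hasn't published ShapeLearn's internals, so treat the following purely as a mental model, not the real pipeline: datatype learning amounts to searching over per-tensor quantization choices, scored end to end on the use-case dataset. A toy greedy sketch in Python, where the candidate set, the quality stand-in, and the floor are all made up:

```python
# Toy sketch only -- not ShapeLearn's actual method. The candidate bpw values,
# the quality stand-in, and the greedy strategy are illustrative assumptions.
CANDIDATES = {"iq2_s": 2.50, "iq3_s": 3.44, "iq4_xs": 4.25, "q6_k": 6.56}  # bpw

def toy_quality(choices):
    # Stand-in for "evaluate the quantized model on the coder-shaped dataset".
    return sum(CANDIDATES[d] for d in choices.values()) / (6.56 * len(choices))

def choose_datatypes(tensors, quality_floor=0.60):
    """Per tensor, greedily keep the cheapest datatype whose measured quality
    on the representative dataset stays above the floor."""
    choices = {t: "q6_k" for t in tensors}  # start from a safe baseline
    for t in tensors:
        for dtype in sorted(CANDIDATES, key=CANDIDATES.get):  # cheapest first
            trial = {**choices, t: dtype}
            if toy_quality(trial) >= quality_floor:
                choices = trial
                break
    return choices

print(choose_datatypes(["attn_q", "attn_k", "ffn_down", "ffn_up"]))
# -> cheap datatypes where the eval tolerates them, more bits where it doesn't
```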

This release shows how ShapeLearn brings strong coding models to a wide range of devices. We optimized Devstral Small 2 for high-end consumer NVIDIA GPUs (RTX 40 and 50 series), and Qwen3-Coder for the full spectrum of our available hardware: from a Raspberry Pi all the way to an RTX 5090.

Why not Devstral on CPU or Pi? Devstral is a dense 24B-parameter model, and even with ShapeLearn it lands under 10 TPS on those platforms. If you're benchmarking in the "near-c" (near-coffee) regime, it's flying, but for day-to-day coding workflows, we aim a bit faster.

So, which model should you use?

It's complicated. (Isn't it always?)

  • Devstral is more capable. For example, it supports vision (Qwen3 doesn't), and it scores higher on several benchmarks.
  • Devstral is also much more demanding because it activates all 24B parameters per token. In practice, that usually means lower TPS.
  • Your coding experience is also shaped by the context window you can run. More context helps the model track history and reason over larger code regions, but it also increases compute cost and memory pressure, especially from KV-cache traffic. Devstral needs roughly 5 GB of FP16 KV cache per 32K of context, close to 2× Qwen3's roughly 3 GB; the sketch below shows where those numbers come from.
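
Those numbers are easy to sanity-check. A minimal sketch; the architecture shapes (layers, KV heads, head dim) are our reading of the public model configs, so verify them against the model cards before relying on this:

```python
def kv_cache_gib(n_layers, n_kv_heads, head_dim, ctx_tokens, bytes_per_elt=2):
    """FP16 KV cache: 2 tensors (K and V) x layers x KV heads x head dim x context."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_tokens * bytes_per_elt / 2**30

# Shapes are our reading of the public configs -- double-check the model cards.
print(kv_cache_gib(40, 8, 128, 32_768))  # Devstral Small 2 24B -> 5.0 GiB
print(kv_cache_gib(48, 4, 128, 32_768))  # Qwen3-Coder-30B-A3B  -> 3.0 GiB
```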

Our recommendation:

  • If you have a 40- or 50-series NVIDIA GPU, and the context you want fits, and you're happy with Devstral's TPS, Devstral it is. That's three "ifs."
  • Otherwise, go with Qwen3.

One More Note: Custom Qwen Template

Our Qwen GGUF files are released with a custom chat template that performed better in our testing than the existing alternatives. In particular, it supports parallel tool call requests when needed.

For a fair comparison, all evaluations were run using the exact same template for both the Unsloth models and ours.

This template was tested with llama.cpp. If you run into issues, let us know.
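
If you want to exercise the parallel tool calling yourself, one way is through llama-server's OpenAI-compatible endpoint. A minimal sketch; the port, model name, and tool schema are placeholders, not part of the release:

```python
# Start the server first, e.g.: llama-server --jinja -m <model>.gguf
import json, urllib.request

payload = {
    "model": "qwen3-coder",  # placeholder name
    "messages": [{"role": "user", "content": "Weather in Paris and Tokyo?"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical tool
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
}
req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",  # default llama-server port
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    msg = json.load(resp)["choices"][0]["message"]

# With parallel tool calls, a single turn can return several entries here,
# e.g. one get_weather call per city.
print(json.dumps(msg.get("tool_calls", []), indent=2))
```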

With that out of the way, here are the models and the results.

Devstral-Small-2-24B-Instruct-2512

We showcase Devstral's performance on the RTX 5090, RTX 4090, and RTX 4080.

Two key takeaways:

  • For the same quality level, ShapeLearn's datatype learning delivers a smaller, faster model than other approaches. The smaller footprint enables larger context windows, which can improve code-generation quality in practice.
  • Devstral exhibits a clear quality cliff around 2.30 bpw, showing that it is far more sensitive to quantization than Qwen. In this regime, ad-hoc datatype selection gets punished fast, whereas ShapeLearn uncovers the right recipe, delivering roughly 50% higher quality at the same speed. (The sketch below puts bpw in file-size terms.)
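
To translate those bpw figures into file size, a rough rule of thumb (weights only, ignoring KV cache and runtime overhead):

```python
def quant_weights_gib(n_params_billion, bpw):
    """Approximate quantized-weight footprint from bits per weight."""
    return n_params_billion * 1e9 * bpw / 8 / 2**30

print(quant_weights_gib(24, 4.04))  # ~11.3 GiB: fits a 16 GB card with room for context
print(quant_weights_gib(24, 2.30))  # ~6.4 GiB: the cliff regime discussed above
```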

Below are the accuracy vs. TPS measurements for RTX 4080, RTX 4090, and RTX 5090.

RTX 4080 (16GB)

IQ4_XS-4.04bpw [IQ-8] is all you need, and IQ3_S-3.47bpw [IQ-7] works too if larger context matters. Really squeezed for VRAM? Several other choices exist.

Figure: Devstral Small 2 24B on RTX 4080 (16GB): TPS vs quality (bubble size = model footprint)

RTX 4090 (24GB)

IQ4_XS-4.04bpw [IQ-8] makes 64K+ context possible on an RTX 4090 while keeping near-perfect accuracy.

Figure: Devstral Small 2 24B on RTX 4090 (24GB): TPS vs quality (bubble size = model footprint)

RTX 5090 (32GB)

Now we are talking… 128K context and practically baseline accuracy with IQ4_XS-4.04bpw [IQ-8]!

Figure: Devstral Small 2 24B on RTX 5090 (32GB): TPS vs quality (bubble size = model footprint)

Qwen3-Coder-30B-A3B-Instruct

For Qwen3-Coder, similar to our previous release, we provide two main groups of models:

  • CPU-Optimized
  • GPU-Optimized

Let's start with the superstar from our previous release.

Raspberry Pi 5 (16GB)

Is it a bird? Is it a plane? It's a 30B coding model doing 9 TPS on a Raspberry Pi! And it's good too: accuracy is around 90% of BF16 quality.

That said, let's be real: if you're the brave soul who uses this, you deserve a medal for patience and perseverance. Let us know!

Figure: Qwen3-Coder 30B A3B on Raspberry Pi 5 (16GB): TPS vs quality (bubble size = model footprint)

Intel Core i7 CPU

Now we are talking: 16+ TPS! But be careful: there's a cliff at 20 TPS. Actually… not really. ShapeLearn to the rescue…

Figure: Qwen3-Coder 30B A3B on Intel Core i7: TPS vs quality (bubble size = model footprint)

RTX 4080 (16GB)

Now Qwen is really motivated!

Figure: Qwen3-Coder 30B A3B on RTX 4080 (16GB): TPS vs quality (bubble size = model footprint)

RTX 5090 (32GB)

We are in the big league now: ultra-fast Qwen or Devstral? Hard choice.

Figure: Qwen3-Coder 30B A3B on RTX 5090 (32GB): TPS vs quality (bubble size = model footprint)

Benchmarking for Coders

Devstral supports both tool calling and vision, so we evaluated it on:

  • BFCL_V3 for tool calling
  • GSM8K_V for vision
  • LiveCodeBench V6 and HumanEval for coding
  • GSM8K and Math500 for math
  • MMLU for general knowledge

The reported score is the mean across these benchmarks, with each benchmark normalized to the original model's score.
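
In other words (the benchmark numbers here are made up, purely to show the arithmetic):

```python
# Normalize each benchmark to the original-precision model, then average.
quantized = {"BFCL_V3": 62.0, "LiveCodeBench": 38.5, "GSM8K": 88.1}
original  = {"BFCL_V3": 63.0, "LiveCodeBench": 40.0, "GSM8K": 89.0}

score = sum(quantized[b] / original[b] for b in quantized) / len(quantized)
print(f"{score:.3f}")  # 0.979 -- a score of 1.0 matches the original on average
```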

Qwen was evaluated using the same setup, with two exceptions:

  • No GSM8K_V (no vision support)
  • No MMLU (this isn't a general-knowledge evaluation)

All evaluations were run with llama.cpp b7744. We used 4K as the minimum context window required for a model to be considered "fit" on a given device.
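
That "fit" rule boils down to a simple check: quantized weights plus an FP16 KV cache for at least 4K of context must fit in device memory. A sketch, using Qwen3-Coder's KV-cache shape as we read its config, and ignoring activation buffers and other runtime overhead:

```python
def fits(model_file_gib, mem_gib, n_layers=48, n_kv_heads=4, head_dim=128,
         ctx=4096, bytes_per_elt=2):
    """True if weights + FP16 KV cache at `ctx` tokens fit in `mem_gib`."""
    kv_gib = 2 * n_layers * n_kv_heads * head_dim * ctx * bytes_per_elt / 2**30
    return model_file_gib + kv_gib <= mem_gib

print(fits(model_file_gib=12.0, mem_gib=16.0))  # e.g. a ~4 bpw quant on 16 GB -> True
```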

The ByteShape team