Every Hardware Deserves a Coder:
Devstral Small 2 24B & Qwen3 Coder 30B
We'll keep this one short. You already know what we do. You've definitely read our previous posts on how ShapeLearn works… right? Right? So no warm-up: one small methodology tweak, then straight to the facts.
ShapeLearn learns the best datatypes for a specific use case and the dataset that represents it. In our first two blogs we released general instruction-following models, so we trained on a broad mix of "a bit of everything." For this coder release, the mix is intentionally coder-shaped: code understanding and completion, refactoring and bug fixing, tool/function calling, tricky formatting, some "visual coding" tasks (e.g., recreating HTML from a screenshot or interpreting charts), plus a solid chunk of math/logic and a small amount of general instruction-following. The focus isn't on martini recipes and outfit suggestions but on coding, math, and tool calling (so yes, it will fetch the weather if you ask for outfit advice).
This release shows how ShapeLearn brings strong coding models to a wide range of devices. We optimized Devstral Small 2 for high-end consumer NVIDIA GPUs (RTX 40 and 50 series), and Qwen3-Coder for the full spectrum of our available hardware: from a Raspberry Pi all the way to an RTX 5090.
Why not Devstral on CPU or Pi? Devstral is a dense 24B-parameter model, and even with ShapeLearn it lands under 10 TPS on those platforms. If you're benchmarking in the "near-c" (near-coffee) regime, it's flying, but for day-to-day coding workflows, we aim a bit faster.
So, which model should you use?
It's complicated. (Isn't it always?)
- Devstral is more capable. For example, it supports vision (Qwen3 doesn't), and it scores higher on several benchmarks.
- Devstral is also much more demanding because it activates all 24B parameters per token. In practice, that usually means lower TPS.
- Your coding experience is also shaped by the context window you can run. More context helps the model track history and reason over larger code regions, but it also increases compute cost and memory pressure, especially from KV cache traffic. Devstral needs ~5 GB of KV cache per 32K of context (FP16), close to 2× Qwen3's ~3 GB; see the sketch below.
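Where do those numbers come from? Here's a back-of-the-envelope sketch, assuming the published model configs (40 layers and 8 KV heads for Devstral, 48 layers and 4 KV heads for Qwen3-Coder, head_dim 128 for both):

```python
# Rough KV-cache size: 2 tensors (K and V) per layer, each
# n_kv_heads * head_dim wide, one entry per token, 2 bytes in FP16.
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 ctx_len: int, bytes_per_elem: int = 2) -> float:
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem / 2**30

print(kv_cache_gib(40, 8, 128, 32 * 1024))  # Devstral Small 2: ~5.0 GiB
print(kv_cache_gib(48, 4, 128, 32 * 1024))  # Qwen3-Coder-30B-A3B: ~3.0 GiB
```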
Our recommendation:
- If you have a 40- or 50-series NVIDIA GPU, and the context you want fits, and you're happy with Devstral's TPS, Devstral it is. That's three "ifs."
- Otherwise, go with Qwen3.
One More Note: Custom Qwen Template
Our Qwen GGUF files ship with a custom chat template that performed better in our testing than the existing alternatives. In particular, it supports parallel tool call requests when needed.
For a fair comparison, all evaluations were run using the exact same template for both Unsloth's models and ours.
This template was tested with llama.cpp. If you run into issues, let us know.
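If you want to be explicit about which template llama.cpp uses, you can point it at a template file directly. A minimal sketch (the GGUF and template file names below are placeholders, not actual release artifacts):

```python
import subprocess

# Launch llama-server with an explicit Jinja chat template.
subprocess.run([
    "llama-server",
    "-m", "Qwen3-Coder-30B-A3B-Instruct.gguf",    # your GGUF file
    "--jinja",                                     # enable Jinja chat templating
    "--chat-template-file", "qwen3_coder.jinja",   # the custom template
    "-c", "32768",                                 # context window
])
```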
With that out of the way, here are the models and the results.
Devstral-Small-2-24B-Instruct-2512
We showcase Devstral's performance on the RTX 4080, RTX 4090, and RTX 5090.
Two key takeaways:
- For the same quality level, ShapeLearn's datatype learning delivers a smaller and faster model than other approaches. The smaller footprint enables larger context windows, which can improve code generation quality in practice.
- Devstral exhibits a clear quality cliff around 2.30 bpw, showing that it is far more sensitive to quantization than Qwen. In this regime, ad-hoc datatype selection gets punished fast, whereas ShapeLearn uncovers the right recipe, delivering roughly 50% higher quality at the same speed. (For intuition on what a given bpw means in VRAM, see the sketch below.)
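To translate bits-per-weight into footprint, a rough rule of thumb is params × bpw / 8, ignoring embedding tables, file metadata, and runtime buffers:

```python
# Approximate weight footprint of a quantized model (weights only).
def weights_gib(n_params_billion: float, bpw: float) -> float:
    return n_params_billion * 1e9 * bpw / 8 / 2**30

print(weights_gib(24, 4.04))  # IQ4_XS Devstral: ~11.3 GiB
print(weights_gib(24, 2.30))  # near the quality cliff: ~6.4 GiB
```

That gap is why IQ4_XS still leaves room for real context on a 16 GB card, while the sub-2.5 bpw variants mostly matter when VRAM is truly scarce.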
Below are the accuracy vs. TPS measurements for RTX 4080, RTX 4090, and RTX 5090.
RTX 4080 (16GB)
IQ4_XS-4.04bpw [IQ-8] is all you need, and IQ3_S-3.47bpw [IQ-7] works too if larger context matters. Really squeezed for VRAM? Several other choices exist.
RTX 4090 (24GB)
IQ4_XS-4.04bpw [IQ-8] makes 64K+ context possible on an RTX 4090 while keeping near-perfect accuracy.
RTX 5090 (32GB)
Now we are talking… 128K context and practically baseline accuracy with IQ4_XS-4.04bpw [IQ-8]!
Qwen3-Coder-30B-A3B-Instruct
For Qwen3-Coder, as in our previous release, we provide two main groups of models:
- CPU-Optimized
- GPU-Optimized
Let's start with the superstar from our previous release.
Raspberry Pi 5 (16GB)
Is it a bird? Is it a plane? It's a 30B coding model doing 9 TPS on a Raspberry Pi! And it's good too: accuracy is around 90% of BF16 quality.
That said, let's be real: if you're brave enough to actually code on this, you deserve a medal for patience and perseverance. Let us know!
Intel Core i7 CPU
Now we are talking: 16+ TPS! But be careful: there is a cliff at 20 TPS. Actually… not really. ShapeLearn to the rescue…
RTX 4080 (16GB)
Now Qwen is really motivated!
RTX 5090 (32GB)
We are in the big league now: ultra-fast Qwen or Devstral? Hard choice.
Benchmarking for Coders
Devstral supports both tool calling and vision, so we evaluated it on:
- BFCL_V3 for tool calling
- GSM8K_V for vision
- LiveCodeBench V6 and HumanEval for coding
- GSM8K and Math500 for math
- MMLU for general knowledge
The reported score is the mean across these benchmarks, with each benchmark normalized to the original model's score.
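In other words (a minimal sketch; the per-benchmark numbers below are made up for illustration):

```python
# Aggregate quality: each benchmark score divided by the original
# (unquantized) model's score on that benchmark, then averaged.
def aggregate(quantized: dict[str, float], baseline: dict[str, float]) -> float:
    return sum(quantized[b] / baseline[b] for b in baseline) / len(baseline)

baseline  = {"HumanEval": 0.88, "GSM8K": 0.92, "Math500": 0.75}  # hypothetical
quantized = {"HumanEval": 0.86, "GSM8K": 0.90, "Math500": 0.72}  # hypothetical
print(f"{aggregate(quantized, baseline):.3f}")  # fraction of baseline quality
```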
Qwen was evaluated using the same setup, with two exceptions:
- No GSM8K_V (no vision support)
- No MMLU (this isn't a general-knowledge evaluation)
All evaluations were run with llama.cpp b7744. We used 4K as the minimum context window required for a model to be considered "fit" on a given device.
The ByteShape team