From BF16 to Bits That Matter:
How ShapeLearn Optimizes Llama and Qwen

Published by ByteShape Team • December 2025

We're ByteShape, a team spun out of a University of Toronto research group, building tools to make AI much faster and more efficient. Our core technology, ShapeLearn, automatically learns the best datatype for every part of a model. It adapts precision at any granularity for both weights and activations, keeping quality high even at very low bitlengths. ShapeLearn is not model-specific; it works with any architecture, from LLMs to diffusion models and beyond.

To show ShapeLearn in practice, our first public release includes several GGUF-quantized models based on Qwen3 4B Instruct 2507 and Llama 3.1 8B Instruct. We're releasing a range of variants from 5 bits down to 2.7 bits to showcase accuracy-size-performance tradeoffs. The advantage becomes especially clear at lower bitlengths, where our method preserves quality far better than established approaches.

Right now we're focused on the llama.cpp backend to deliver high-performance quantized models that a wide audience can easily use. Each release will include multiple quality-cost tradeoffs, plus detailed evaluations so you know exactly what to expect from each variant. For now, we benchmark against Unsloth, the leading llama.cpp-based quantization method, across several established tasks.

TL;DR

Compared to leading quantized models at almost identical TPS under llama.cpp, our Qwen3 4B Instruct 2507 variants greatly reduce error rates:

  • 3.34 bpw: 2.46× lower error rate on a Raspberry Pi 5
  • 3.55 bpw: 3.5× lower error rate on an RTX 5090

Read on for head-to-head comparisons with leading quantization methods and for more models that explore different quality-size-speed sweet spots.

Qwen3 4B Instruct 2507

Let's start with Qwen3 4B Instruct 2507. We use ShapeLearn to pick the best per-tensor GGUF weight formats to minimize model size while keeping output quality high. We focus on the 2.5–5 bits/weight range, where quality starts to drop and the most interesting tradeoffs appear.

For reference, we compare against Unsloth-produced GGUF models, which offer a strong set of high-quality quantizations across many models.

Size vs. Accuracy: Bits/Weight vs. Score

The figure below shows bits/weight (BPW) vs quality (normalized average across 4 tasks) for ShapeLearn and Unsloth, along with their trendlines. Our models sit above (higher quality) and to the left (smaller footprint) of the Unsloth ones.

Bits per weight (BPW) vs quality (normalized average across 4 tasks) for ShapeLearn and Unsloth models

For example, our model IQ-3.07bpw [#4 on the graph]:

  • is 18% smaller (by 0.67 bits) than unsloth/Qwen3-4B-Instruct-2507-Q3_K_S [#6 on the graph] while delivering higher quality, with a 1.15x lower error rate (7.94% vs 9.10%),
  • delivers a roughly 4.02x lower error rate (7.94% vs 31.97%) than unsloth/Qwen3-4B-Instruct-2507-UD-IQ2_M [#1 on the graph], with only a 1% larger footprint (0.04 bits); see the quick arithmetic sketch below.
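
To make the comparisons above concrete, here is how the "Nx lower error rate" figures work out. This is a minimal sketch under our own assumption that the error rate is simply 100% minus the normalized quality score (the quality metric is described in the Evaluation Methodology section):

```python
# Minimal sketch of how the "Nx lower error rate" figures are derived.
# Assumption: error rate = 100% - normalized quality score (vs the BF16 baseline).

def error_ratio(our_error_pct: float, ref_error_pct: float) -> float:
    """Return how many times smaller our error rate is than the reference's."""
    return ref_error_pct / our_error_pct

# Numbers quoted above for Qwen3 4B Instruct 2507:
print(error_ratio(7.94, 9.10))   # ~1.15x vs Q3_K_S
print(error_ratio(7.94, 31.97))  # ~4.02x vs UD-IQ2_M
```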

All of these points on the curve come from changing a single ShapeLearn hyperparameter that controls quantization aggressiveness, which lets us smoothly move along the quality–cost trade-off.

Real Performance: TPS (tokens/second)

So far, we've looked at model size vs quality. In practice, speed is paramount to usability. We therefore measured throughput (TPS) for all variants on NVIDIA GPUs, Intel CPUs, and Raspberry Pi.

GPU: RTX 5090

The graph below shows average TPS vs quality for the RTX 5090, with bubble size proportional to model footprint. We again focus on models in the 2.7–5 bits/weight range, which correspond to roughly 340–420 TPS on the RTX 5090.

RTX 5090: Tokens per second vs quality (bubble size = model footprint)

Our models cluster to the right (higher throughput) and above (higher quality) compared to the reference models.

For example:

  • Our IQ-4.04bpw (382 TPS) [#7 on the graph] has both higher throughput and higher quality than all 13 Unsloth reference models in the 330–380 TPS range. The first Unsloth model that surpasses it in quality is unsloth/Qwen3-4B-Instruct-2507-Q5_K_S [#16 on the graph] (322 TPS), which is 16% slower.
  • Our models in the top-right region (IQ-3.31bpw [#5 on the graph] and IQ-3.07bpw [#4 on the graph]) are alone at the top, combining >90% accuracy with the highest throughput among all evaluated models.

On the GPU setup, the low-bit KQ kernels are noticeably less efficient than the IQ kernels. As a result, we built the GPU variants with a preference for IQ quantization.

CPU: Intel i7 12th Gen

The graph below shows average TPS vs quality for the i7-12700KF CPU, with bubble size proportional to model footprint. We again focus on models in the 2.7–5 bits/weight range, which correspond to roughly 16–30 TPS on the Intel i7 CPU. Contrary to the behaviour observed on our GPU setup, the KQ kernels outperform the IQ kernels on the Intel CPU. Accordingly, we suggest trying the KQ models first if you plan to run on CPUs.

Intel i7-12700KF: Tokens per second vs quality (bubble size = model footprint)

Just like with the RTX 5090, our models cluster to the right (higher throughput) and above (higher quality) compared to the reference models.

For example, our model KQ-3.34bpw [#4 on the graph]:

  • is more than 10% faster than unsloth/Qwen3-4B-Instruct-2507-Q3_K_S [#6 on the graph] and delivers noticeably better quality (1.15x lower error rate),
  • delivers a 2.46x lower error rate than unsloth/Qwen3-4B-Instruct-2507-Q2_K_L [#2 on the graph] at slightly lower throughput (2% fewer TPS).

Raspberry Pi 5

Finally, the graph below shows average TPS vs quality for the Raspberry Pi 5, with bubble size proportional to model footprint. We again focus on models in the 2.7–5 bits/weight range, which correspond to roughly 4–7 TPS. Similar to the behaviour observed on our Intel i7 12th-Gen CPU, the KQ kernels outperform the IQ kernels.

Raspberry Pi 5: Tokens per second vs quality (bubble size = model footprint)

Just as on the Intel CPU, our models cluster to the right (higher throughput) and above (higher quality) compared to the reference models on the Raspberry Pi.

The exact same model KQ-3.34bpw [#4 on the graph]:

  • is almost 13% faster than unsloth/Qwen3-4B-Instruct-2507-Q3_K_S [#6 on the graph] and delivers noticeably better quality (1.15x lower error rate),
  • delivers a 2.48x lower error rate than unsloth/Qwen3-4B-Instruct-2507-Q2_K [#2 on the graph], this time with ever so slightly higher throughput than the Unsloth reference (by only 0.5%, though)!

Llama 3.1 8B Instruct

For completeness, here's the size vs. quality tradeoff for various Llama 3.1 8B Instruct GGUF-quantized models (heads up! these results do not include LiveCodeBench (LCB), since its score is very low even for the BF16 baseline). The trend is similar to what we observed for Qwen, but the gap is less pronounced. ShapeLearn generally delivers models that offer higher quality at the same footprint, with the benefits growing in the lower-bit regime.

Llama 3.1 8B Instruct: Size vs quality tradeoff

Just as with Qwen3 4B Instruct, the better footprint-vs-quality tradeoff translates directly into better performance-vs-quality results.

On the RTX 5090 (figure below), ByteShape Llama 3.1 8B Instruct still pulls ahead of the Unsloth models, though the gap isn't as large as with Qwen3 4B. Even so, the results show that ShapeLearn exposes the performance–quality trade-off consistently and without surprises.

Llama 3.1 8B on RTX 5090: Tokens per second vs quality

On the CPU side, we're back to seeing clear performance wins. The figures below show this on both an Intel i7 12th Gen and a Raspberry Pi.

Llama 3.1 8B on Intel i7-12700KF: Tokens per second vs quality

Just like with Qwen3 4B, our models cluster to the right (higher throughput) and above (higher quality) compared to the reference models in the 12–16 TPS range (where our focus is).

For instance, compared to the best extreme-quantization reference model, unsloth/Llama-3.1-8B-Instruct-Q3_K_S [#6 on the graph]:

  • Our KQ-3.41bpw [#5 on the graph] is slightly faster (5% more TPS) and delivers noticeably better quality (1.68x lower error rate).
  • Our KQ-3.24bpw [#3 on the graph] delivers a more than 10% speedup with slightly better quality (1.07x lower error rate).

Llama 3.1 8B on Raspberry Pi 5: Tokens per second vs quality

The same trend holds on the Raspberry Pi. Taking the same comparison as above, against unsloth/Llama-3.1-8B-Instruct-Q3_K_S [#6 on the graph]:

  • Our KQ-3.41bpw [#5 on the graph] is slightly faster (1% more TPS) and delivers noticeably better quality (1.68x lower error rate).
  • Our KQ-3.24bpw [#3 on the graph] delivers a more than 7% speedup with slightly better quality (1.07x lower error rate).

How ShapeLearn Learns Bitlengths

ShapeLearn is our gradient-based tool for learning the best numeric format (bitlength and type) for weights and activations. Instead of hand-picking datatypes, it uses gradient descent to choose them automatically, at any granularity you need (per-tensor, per-channel, per-group, etc.). For this reason, it can learn datatypes for any model and any quantization scheme.

It supports a wide range of formats: integer, floating-point, block floating-point, micro-scaling, and similar quantization schemes. With a single "aggressiveness" hyperparameter, you can move along the cost–quality trade-off curve and control how you trade memory, compute, and accuracy. Any quantifiable objective (e.g., memory footprint, operation count) can be optimized directly.
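
To make the mechanism concrete, here is a deliberately simplified sketch of gradient-based bitlength learning. It is our own illustration under simplifying assumptions (symmetric per-tensor integer quantization, straight-through estimators, a PyTorch-style module), not ShapeLearn's actual implementation:

```python
# Illustrative sketch only; NOT ShapeLearn's implementation.
# Idea: treat the bitlength as a continuous learnable parameter, fake-quantize the
# (frozen) weights with straight-through estimators, and penalize the total footprint
# with a single "aggressiveness" weight in the loss.
import torch
import torch.nn as nn

class LearnedBitlengthWeight(nn.Module):
    def __init__(self, weight: torch.Tensor, init_bits: float = 8.0):
        super().__init__()
        self.weight = nn.Parameter(weight, requires_grad=False)  # frozen (PTQ setting)
        self.bits = nn.Parameter(torch.tensor(init_bits))        # learnable bitlength

    def forward(self) -> torch.Tensor:
        # Straight-through estimators: round in the forward pass, pass gradients through.
        b = self.bits + (self.bits.round() - self.bits).detach()
        qmax = 2.0 ** (b - 1.0) - 1.0
        scale = self.weight.abs().max() / qmax
        w = self.weight / scale
        w_q = w + (torch.clamp(w.round(), -qmax, qmax) - w).detach()
        return w_q * scale                      # dequantized weights used by the model

    def footprint_bits(self) -> torch.Tensor:
        return self.bits * self.weight.numel()  # cost term: total bits for this tensor

# Per-step objective; "aggressiveness" is the single knob that moves along the curve:
#   loss = task_loss(outputs, targets)
#        + aggressiveness * sum(m.footprint_bits() for m in quantized_modules)
```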

Because it relies on gradient descent, ShapeLearn needs a bit of representative data, but it converges quickly. By choosing a relevant dataset, you can tune datatype selection to a specific use case and get a quantized model that's optimized for that workload.

ShapeLearn can be used as post-training quantization (PTQ) or quantization-aware training (QAT) to select inference formats, or even during training to pick efficient training formats.

For the models we are releasing, we used ShapeLearn in a PTQ setup with:

  • Weights-only quantization (to match GGUF),
  • Integer group quantization (again matching GGUF; a simplified sketch follows this list),
  • Learned bitlengths with frozen weights.
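
To make the GGUF-style setup concrete, here is a minimal sketch of symmetric group (block) integer quantization with one scale per group of 32 weights. It is an illustration in the spirit of GGUF's simpler formats; the real GGUF types (Q4_K, IQ3_XXS, and friends) use more elaborate packing, nested scales, and codebooks:

```python
# Simplified group quantization in the spirit of GGUF's simpler formats.
import numpy as np

def quantize_groups(w: np.ndarray, bits: int = 4, group_size: int = 32):
    """Symmetric integer quantization with one scale per group of weights."""
    assert w.size % group_size == 0
    qmax = 2 ** (bits - 1) - 1
    groups = w.reshape(-1, group_size)
    scales = np.abs(groups).max(axis=1, keepdims=True) / qmax
    scales[scales == 0] = 1.0                                    # guard all-zero groups
    q = np.clip(np.round(groups / scales), -qmax, qmax).astype(np.int8)
    return q, scales

def dequantize_groups(q: np.ndarray, scales: np.ndarray, shape) -> np.ndarray:
    return (q.astype(np.float32) * scales).reshape(shape)

# Example: a 4-bit, group-of-32 round trip on a random tensor.
w = np.random.randn(1024, 1024).astype(np.float32)
q, s = quantize_groups(w, bits=4, group_size=32)
w_hat = dequantize_groups(q, s, w.shape)
print("mean abs error:", np.abs(w - w_hat).mean())
```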

We fine-tuned bitlengths for one epoch on a 15k-sample, open-source, commercially friendly general-knowledge and instruction-following dataset. On a single RTX 5090, this produces a quantized Qwen3 4B model in under 30 minutes and a Llama 3.1 8B model in about an hour. To generate task-specific variants, all one would need to do is swap this dataset; the workflow stays the same.

ShapeLearn is our first commercial product, built on a decade of research on training and inference acceleration at the University of Toronto, and it extends ideas from our prior published work.

From Stormtrooper Aim to Jedi Precision

Selecting datatypes is hard. There are many possible formats, thousands of tensors, and evaluating each choice is expensive. The design space explodes quickly. Heuristics that try to minimize or cap per-tensor error help, but they don't deliver truly precise, cost–quality–optimal datatype assignments.

The graph below illustrates the difference. The tensors are ordered by size first, and then alphabetically.

Per-tensor datatype assignments: Unsloth baseline vs three ByteShape models

We compare four models:

  • One Unsloth baseline
  • Three ByteShape models with average bitlengths just above, about the same as, and just below the baseline's.

Unsloth's Choices:

Unsloth mostly uses a uniform datatype setup in which large stretches of the network share the same precision, with a few strategic exceptions. It's simple and stable, yet it leaves a lot of opportunity untapped.

ShapeLearn's Strategy:

ShapeLearn looks at every tensor individually and asks: How low can I drop the bitlength without hurting quality?

  • Big tensors (like embeddings) often get pushed much lower than in Unsloth, saving a lot of space.
  • Small or high-impact tensors get more bits when needed to avoid accuracy loss.

Why this matters:

Instead of one-size-fits-all quantization, ShapeLearn builds a fine-grained, cost-aware datatype map, shrinking where it's cheap and preserving bits where they count.
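
For a sense of what such a map looks like, here is a purely hypothetical example using llama.cpp's GGUF tensor names and quant types. These are not the actual assignments shown in the figure above, just the shape of the data structure:

```python
# Hypothetical per-tensor datatype map; illustrative only, not the real assignments.
datatype_map = {
    "token_embd.weight":     "IQ2_S",    # huge tensor: pushed to very few bits
    "blk.0.attn_v.weight":   "Q5_K",     # high-impact tensor: kept more expressive
    "blk.0.attn_q.weight":   "IQ3_XXS",
    "blk.0.ffn_down.weight": "Q4_K",
    "blk.0.ffn_up.weight":   "IQ3_XXS",
    "output.weight":         "Q4_K",
    # ...one entry per tensor in the model
}
```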

A Closer Look at ShapeLearn's Choices:

Even though the models are roughly the same size (3.31 bits/weight (bpw) for Unsloth; 3.07, 3.31, and 3.55 bpw for the ByteShape models), our variants are much better:

  • All three ByteShape models achieve 2.2–3.5× lower error rates.
  • The first two also run slightly faster on an RTX 5090; the last one is only marginally slower.

The stacked datatype view in the graph also highlights another property of ShapeLearn: consistency.

  • Tensors that are assigned a certain bitlength at a more aggressive setting get at least that many bits (often more) at less aggressive settings.
  • If a tensor is important, ShapeLearn learns to treat it as such and assigns it a more expressive datatype. As aggressiveness increases, it gradually makes harder tradeoffs and only then reduces that tensor's bitlength.

Finally, the graph below shows that ShapeLearn learns these patterns quickly and confidently. The datatypes converge within a couple of dozen iterations. In particular, the largest tensors are assigned quickly and permanently, while the smaller tensors need a few more samples to zero in on the right datatypes.

ShapeLearn convergence: Watching bitlengths stabilize across training iterations

In the end, ShapeLearn delivers a quantized model that excels at balancing quality and cost. Instead of guessing and hoping, ShapeLearn delivers a clear, optimized datatype assignment for every tensor. In other words, these are the datatypes you're looking for.

Evaluation Methodology

You could just trust us here and skip to the Wrapping-Up section, you know…
But if you're still here, you must be a fellow overthinker. A true kindred soul in the art of not trusting anything on the first try. Fine. Let's dig in.

Our evaluation framework employs a multi-task assessment protocol designed to measure model performance across diverse capabilities. The evaluation infrastructure is built on lighteval, leveraging llama.cpp's server backend for efficient GGUF model inference across multiple platforms. Each model is evaluated using a custom automation pipeline that orchestrates server initialization, task execution, and result collection in a fully reproducible manner.

The evaluation architecture follows a client-server paradigm where llama.cpp's llama-server serves models via an OpenAI compatible API endpoint, and lighteval acts as the evaluation client through the liteLLM backend. The system is configured to handle 4 concurrent requests (concurrent_requests: 4, matching the -np 4 parameter in llama-server) with a context window of 32,768 tokens distributed across parallel processes. Server parameters include a batch size of 32,768 and a micro-batch size of 2,048 tokens, with all GPU layers offloaded (-ngl 999) for maximum performance. Temperature is set to 0 with a fixed seed of 42 to ensure deterministic and reproducible results across all evaluations.
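
For reference, here is a sketch of how the server side of such a setup can be launched. Flag names and the lighteval invocation vary across llama.cpp and lighteval versions, and the model filename below is a placeholder, so treat this as illustrative rather than our exact pipeline:

```python
# Illustrative launch of the server side of the evaluation stack (not our exact pipeline).
import subprocess

MODEL_PATH = "Qwen3-4B-Instruct-2507-IQ-3.07bpw.gguf"  # placeholder filename

server = subprocess.Popen([
    "llama-server",
    "-m", MODEL_PATH,
    "-c", "32768",    # context window (tokens)
    "-np", "4",       # 4 parallel slots -> 4 concurrent requests
    "-b", "32768",    # batch size
    "-ub", "2048",    # micro-batch size
    "-ngl", "999",    # offload all layers to the GPU
    "--temp", "0",    # deterministic sampling
    "--seed", "42",
    "--port", "8080",
])

# lighteval then points its liteLLM backend at the OpenAI-compatible endpoint
# (http://localhost:8080/v1) and runs the task suite as the evaluation client.
# server.terminate() once the evaluation run is finished.
```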

For averaging metrics, we first normalize all metrics by the BF16 baseline score and then compute the mean. We also apply a cap of 1.0 per task so that a model cannot exceed a score of 1 on any individual task. This ensures that all tasks contribute equally and that the final aggregated score reflects balanced effectiveness rather than being dominated by outlier values.
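
In code, the aggregation is just this. The sketch below uses made-up per-task numbers purely for illustration:

```python
# Normalize each task score by the BF16 baseline, cap at 1.0, then average.
def normalized_average(scores: dict, baseline: dict) -> float:
    capped = [min(scores[task] / baseline[task], 1.0) for task in baseline]
    return sum(capped) / len(capped)

# Hypothetical scores, for illustration only:
bf16  = {"gsm8k": 0.90, "mmlu": 0.70, "ifeval": 0.85, "lcb": 0.40}
quant = {"gsm8k": 0.88, "mmlu": 0.69, "ifeval": 0.86, "lcb": 0.35}
print(normalized_average(quant, bf16))  # normalized quality score of the quantized model
```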

Task Configuration and Customizations

We evaluate models across four primary benchmarks, each targeting distinct capabilities: GSM8K (8-shot), MMLU (5-shot), IFEval (0-shot), and LiveCodeBench Code Generation (release v4). Each task is evaluated as described in the details provided below.

GSM8K

GSM8K tests mathematical reasoning using 8 few-shot examples drawn from the training set via random sampling. The task uses a structured prompt that instructs models to solve math problems step-by-step and format their final answer as "ANSWER: $ANSWER". The evaluation employs expression-based gold metric scoring with a maximum generation size of 256 tokens and "Question:" as the stop sequence. The prompt function formats examples to include the reasoning chain followed by the numerical answer, enabling the model to learn the expected output structure from the few-shot demonstrations. We report the extractive_match metric, which extracts and compares the final numerical answer to measure accuracy.
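
To illustrate the kind of extraction this relies on, here is a simplified sketch of pulling out and comparing the final numerical answer. It is not lighteval's extractive_match implementation, just the general shape of it:

```python
# Simplified illustration of final-answer extraction for GSM8K-style outputs
# (lighteval's extractive_match metric is more general than this).
import re

def extract_final_answer(generation: str):
    match = re.search(r"ANSWER:\s*\$?\s*(-?[\d,]+(?:\.\d+)?)", generation)
    return match.group(1).replace(",", "") if match else None

def is_correct(generation: str, gold: str) -> bool:
    return extract_final_answer(generation) == gold.replace(",", "")

print(is_correct("She sells 16 - 3 - 4 = 9 eggs... ANSWER: 18", "18"))  # True
```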

MMLU

MMLU evaluates broad knowledge across 57 subjects using 5-shot examples selected from the development split. Each subject is treated as a separate task (e.g., mmlu:anatomy, mmlu:college_chemistry) using the HELM-style prompt format. The tasks are configured with a generation size of only 5 tokens since the expected output is a single letter (A, B, C, or D). Our setup includes a critical customization for MMLU: we inject a system prompt instructing models to "respond with ONLY the letter of your answer (A, B, C, or D), nothing else" to improve accuracy by preventing verbose explanations. We report the aggregated em (exact match) metric, averaging scores across all 57 MMLU subjects to measure overall knowledge performance.
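
As an illustration, each MMLU request is built roughly as below. The message structure follows the OpenAI-compatible chat API; the exact plumbing inside lighteval differs, so this is only a sketch:

```python
# Roughly how the MMLU system-prompt customization enters each request (illustrative).
MMLU_SYSTEM_PROMPT = (
    "Respond with ONLY the letter of your answer (A, B, C, or D), nothing else."
)

def build_mmlu_request(few_shot_examples: str, question: str) -> dict:
    return {
        "messages": [
            {"role": "system", "content": MMLU_SYSTEM_PROMPT},
            {"role": "user", "content": few_shot_examples + "\n\n" + question},
        ],
        "max_tokens": 5,      # the expected output is a single letter
        "temperature": 0,
        "seed": 42,
    }
```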

IFEval

IFEval (Instruction Following Eval) measures a model's ability to follow precise formatting and structural instructions without few-shot examples (0-shot). Unlike traditional tasks with gold answers, IFEval validates whether model outputs conform to specific rules defined in instruction constraints (e.g., "include exactly 3 paragraphs", "use all capital letters"). The evaluation computes four metrics: prompt-level and instruction-level accuracy under both strict and loose matching criteria. Our implementation includes preprocessing that handles multiple response variations (removing first/last lines, stripping asterisks) to account for formatting quirks. We report the prompt_level_strict_acc metric, which represents the most stringent measurement of instruction-following capability at the prompt level.
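
The loose-matching side works roughly as sketched below: generate a few cleaned-up variants of the response and accept the instruction if any variant passes its check. This is a simplified illustration of the preprocessing described above, not the exact IFEval code:

```python
# Simplified illustration of the loose-matching preprocessing for IFEval.
def response_variants(response: str):
    lines = response.split("\n")
    variants = [
        response,
        "\n".join(lines[1:]).strip(),    # drop the first line
        "\n".join(lines[:-1]).strip(),   # drop the last line
        "\n".join(lines[1:-1]).strip(),  # drop both
    ]
    return variants + [v.replace("*", "") for v in variants]  # also strip asterisks

def follows_instruction_loose(response: str, check) -> bool:
    """`check` is a callable implementing one instruction constraint."""
    return any(check(v) for v in response_variants(response) if v)

# Example: "use all capital letters" as a toy constraint.
print(follows_instruction_loose("**HELLO WORLD**", lambda s: s.isupper()))  # True
```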

LiveCodeBench

LiveCodeBench Code Generation (release v4) assesses programming ability through real-world coding problems. The task is configured with an extensive generation size of 32,768 tokens to accommodate complete code solutions with no stop sequences (uses EOS token). We report the codegen_pass@1:16 metric, which generates 16 solutions per problem and computes the pass@1 rate by executing code against test cases within a sandboxed environment. The metric uses multiprocessing with 16 workers for parallel execution and enforces a 10-second timeout per test case. This setup enables thorough assessment of code correctness through actual execution rather than pattern matching.
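
For reference, the pass@1-from-16-samples number can be computed with the standard unbiased pass@k estimator (we assume the usual formulation here; with k=1 it reduces to the fraction of passing samples):

```python
# Standard unbiased pass@k estimator (Chen et al., 2021); with k=1 and n=16 samples
# it reduces to the fraction of generations that pass all test cases.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """n = generations per problem, c = generations passing all tests, k = budget."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=16, c=4, k=1))  # 0.25
```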

Wrapping-Up

This first release is our demo that learned datatypes beat hand-crafted recipes. With ShapeLearn, we can navigate the datatype space per tensor, achieve better quality–size tradeoffs than existing GGUF baselines, and move smoothly along the cost–quality curve by turning a single aggressiveness knob.

On Qwen3 4B Instruct 2507 and Llama 3.1 8B Instruct, that translates into:

  • Smaller models at the same quality
  • Higher quality at the same size
  • Real throughput gains on RTX 5090, Intel i7, and even Raspberry Pi

This is just the starting point. In the future, we plan to:

  • Expand to more architectures and sizes
  • Show how this extends to mixture-of-experts models
  • Demonstrate how to maintain "thinking" capabilities under extreme quantization
  • Quantize models for different tasks and domains
  • Illustrate the approach on other architectures, such as diffusion models
  • Explore activations and KV cache quantization more aggressively
  • Keep publishing transparent benchmarks against strong open baselines

If you care about running good models on limited hardware, we'd love for you to:

  • Try the released models
  • Sanity-check them on your own workloads
  • Tell us what you think and what you'd like to see next — while we can't promise specific releases, we'll do our best to factor community ideas into our roadmap

You can find us on Reddit and Hugging Face. We hope you get something useful out of this work.

The ByteShape team