If It Fits, It Sits: Qwen 3.6 35B

Published by ByteShape Team • 19 May 2026

We are releasing our next pair of ByteShape models: Qwen 3.6 35B NTP (next-token prediction) and Qwen 3.6 35B MTP (multi-token prediction).

The short version: clear improvements across all hardware we tested. For most devices, the recommendation is refreshingly simple:

Pick the largest ByteShape model (Model 5) if it fits in your memory. More aggressively quantized models can squeeze out somewhat higher throughput, but Model 5 stands out for its combination of quality and TPS. If it does not fit, Models 3 and 2 are good alternatives.

That is not always how quantized model selection works. Sometimes the fastest model gives up too much quality. Sometimes the highest-quality model is too slow. Sometimes the “obvious” choice changes from one GPU to another.

For this release, the pattern is cleaner. Model size mostly decides what fits. Among the models that fit, Models 5, 3, and 2 generally give the best quality-speed trade-off.

TL;DR

  • We are releasing NTP and MTP ShapeLearn GGUF quantizations for Qwen 3.6 35B.
  • For most GPU users, the recommendation is simple: choose the largest ByteShape model that fits your memory and context needs.
  • On 24+GB GPUs, GPU-5 is the main recommendation for both quality and throughput.
  • On 16GB GPUs, GPU-3 is the main NTP recommendation, while MTP-GPU-2 is the practical MTP recommendation because of the extra MTP memory footprint.
  • MTP delivers a meaningful token-generation boost on GPUs, generally around 20–40% in our tests, while preserving benchmark quality.
  • MTP is currently not a great fit for CPU inference because prompt processing is already the bottleneck, and MTP exacerbates that pressure.

A few benchmarking methodology updates

We made two small but meaningful benchmarking methodology changes for this release:

1. Broader comparison set

We expanded the set of quantized models we compare against. Instead of focusing heavily on one publisher, we now include more publishers and select fewer models from each.

The goal is to compare against the models that each publisher appears to treat as their strongest or most recommended variants. We infer this from their published benchmark results, release notes, and model-card recommendations.

This is not perfect, but it keeps the benchmarking scope manageable. Since full quality evaluation takes time, we focus on a smaller set of strong candidates rather than a large set of models few users are likely to choose.

2. MMLU excluded for this release

We excluded MMLU from the headline benchmark score for this release.

The reason is not that Qwen 3.6 35B lacks knowledge. The issue is answer-format compliance. MMLU is strict: the model has to answer in the expected format, and Qwen 3.6 35B sometimes ignores the requested answer format and rules, even in full precision. When it fails, the failure mode is often not “the model does not know the answer” but “the model did not follow the benchmark’s answer format.” For example, instead of responding with just “A”, the model often starts reasoning about the answer. In other words: too much reasoning text, not enough benchmark-compatible answering. We will revisit this issue in a separate evaluation post.

Since this is already a baseline-model issue, we do not think MMLU is a clean way to compare quantized variants for this release.
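To make the failure mode concrete, here is a minimal sketch of a strict multiple-choice scorer. The regular expression, the function names, and the sample responses are illustrative assumptions, not the actual MMLU harness or real Qwen output.

```python
import re
from typing import Optional

# Hypothetical strict format: a bare option letter, optionally wrapped in
# parentheses or followed by a period. This is NOT the real MMLU harness.
STRICT_ANSWER = re.compile(r"\s*\(?([A-D])\)?\.?\s*")

def strict_answer(response: str) -> Optional[str]:
    """Return the option letter if the whole response is format-compliant."""
    match = STRICT_ANSWER.fullmatch(response)
    return match.group(1) if match else None

def score(response: str, correct: str) -> bool:
    """A non-compliant response counts as wrong even if it names the right option."""
    return strict_answer(response) == correct

# Made-up responses for illustration only.
compliant = "A"
verbose = "Let me think. Option B is a distractor, so the answer is A."

print(score(compliant, "A"))  # format-compliant and correct
print(score(verbose, "A"))    # names the right option, but fails the format check
```

Under a scorer like this, the second response is marked wrong regardless of what the model knows, which is why the score stops being a clean signal for comparing quantized variants.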

Now, let’s look at the models we are releasing.

Next-Token Prediction (NTP)

First, let’s look at the standard single-token prediction models. For most devices, the message is simple: pick ByteShape Model 5 if it fits. If not, look at Models 3 and 2.

You should get a strong quality-speed trade-off, and the recommendation is fairly consistent across the hardware we tested.

GPUs

Compared with some of our previous releases, GPU model selection is unusually straightforward here. The ordering is mostly consistent across devices. Memory decides which models are in the race, and once a model fits, the larger ByteShape variant is usually the best choice.
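The "largest model that fits" rule can be sketched as a rough back-of-the-envelope check. The BPW values come from the legends in this post; treating "35B" as exactly 35e9 weights and reserving 1.5 GiB for KV cache and runtime buffers are assumptions for illustration, and real footprints depend on context length and backend.

```python
# Assumption: "35B" is taken as exactly 35e9 weights; the real count may differ.
PARAMS = 35e9
GIB = 2**30

# Bits per weight (BPW) for the NTP GPU models, from the legends in this post.
GPU_BPW = {"GPU-1": 2.17, "GPU-2": 3.00, "GPU-3": 3.48, "GPU-4": 3.93, "GPU-5": 4.15}

def weights_gib(bpw: float, params: float = PARAMS) -> float:
    """Approximate size of the quantized weights in GiB."""
    return params * bpw / 8 / GIB

def largest_fitting(models: dict, memory_gib: float, reserve_gib: float):
    """Largest model whose weights plus a reserve fit in memory, else None.

    reserve_gib is a hypothetical allowance for KV cache, activations, and
    runtime buffers; it grows with context length.
    """
    fitting = [name for name, bpw in models.items()
               if weights_gib(bpw) + reserve_gib <= memory_gib]
    return max(fitting, key=models.get) if fitting else None

for memory in (16, 24):
    print(memory, largest_fitting(GPU_BPW, memory, reserve_gib=1.5))
```

With these assumed numbers the sketch picks GPU-3 for a 16 GiB budget and GPU-5 for 24 GiB. Treat it as a first filter only; a longer context needs a larger reserve and can push the choice down a size.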

24+GB GPUs: RTX 4090, 5090, & Pro 6000

GPU-5 is our main recommendation. It reaches about 99% of the baseline quality while delivering strong token-generation and prompt-processing throughput. Across the 24+GB GPUs we tested, it gives the best overall quality-speed trade-off.

If you need more room for context, or if you want a bit more throughput, GPU-4 and GPU-3 can be reasonable fallbacks. But for most 24+GB GPU users, GPU-5 is the one to start with.
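One way to read "best quality-speed trade-off" is to ask which models are not beaten on both axes by some other model. The sketch below applies that check to a subset of the RTX 4090 NTP legend values from this post; the function name and the choice of subset are illustrative assumptions.

```python
# (Acc, TPS) pairs for a subset of the RTX 4090 NTP legend in this post.
RTX_4090 = {
    "GPU-1": (0.8870, 237.09),
    "GPU-2": (0.9600, 224.26),
    "GPU-3": (0.9659, 218.55),
    "GPU-4": (0.9871, 217.45),
    "GPU-5": (0.9927, 214.54),
    "UD-IQ4_XS": (0.9946, 186.62),
    "UD-Q4_K_XL": (0.9935, 179.25),
    "IQ4_XS (4.34 bpw)": (0.9887, 207.69),
    "APEX-I-Mini": (0.9505, 215.88),
    "APEX-I-Quality": (0.9950, 189.71),
}

def pareto_front(points: dict) -> list:
    """Names not dominated by any other point (higher is better on both axes)."""
    def dominated(name: str) -> bool:
        acc, tps = points[name]
        return any(
            o_acc >= acc and o_tps >= tps and (o_acc > acc or o_tps > tps)
            for other, (o_acc, o_tps) in points.items()
            if other != name
        )
    return [name for name in points if not dominated(name)]

print(pareto_front(RTX_4090))
```

On this subset, the five ByteShape models and APEX-I-Quality remain on the front. GPU-5 gives up a small amount of Acc relative to APEX-I-Quality for a noticeably higher TPS.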

RTX 4090: tokens per second vs quality (NTP)

| # | Model | Acc | TPS | BPW |
|---|---|---|---|---|
| GPU-1 | IQ2_S-2.17bpw | 0.8870 | 237.09 | 2.17 |
| GPU-2 | IQ3_S-3.00bpw | 0.9600 | 224.26 | 3.00 |
| GPU-3 | IQ3_S-3.48bpw | 0.9659 | 218.55 | 3.48 |
| GPU-4 | IQ4_XS-3.93bpw | 0.9871 | 217.45 | 3.93 |
| GPU-5 | IQ4_XS-4.15bpw | 0.9927 | 214.54 | 4.15 |
| A | UD-IQ3_S | 0.9695 | 190.90 | 3.15 |
| B | UD-Q3_K_XL | 0.9826 | 187.91 | 3.89 |
| C | UD-IQ4_XS | 0.9946 | 186.62 | 4.09 |
| D | UD-IQ4_NL | 0.9907 | 187.75 | 4.16 |
| E | UD-Q4_K_XL | 0.9935 | 179.25 | 5.16 |
| A | IQ4_XS | 0.9887 | 207.69 | 4.34 |
| B | Q4_K_M | 0.9887 | 198.84 | 4.93 |
| C | Q4_K_L | 0.9848 | 193.85 | 5.02 |
| A | APEX-I-Mini | 0.9505 | 215.88 | 3.30 |
| B | APEX-I-Compact | 0.9707 | 205.32 | 3.99 |
| C | APEX-I-Quality | 0.9950 | 189.71 | 5.26 |
| A | IQ3_S | 0.9671 | 189.26 | 3.13 |
| B | IQ4_XS | 0.9866 | 181.80 | 4.06 |
| C | Q4_K_M | 0.9879 | 178.31 | 5.11 |

RTX 5090: tokens per second vs quality (NTP)

| # | Model | Acc | TPS | BPW |
|---|---|---|---|---|
| GPU-1 | IQ2_S-2.17bpw | 0.8870 | 278.88 | 2.17 |
| GPU-2 | IQ3_S-3.00bpw | 0.9600 | 269.09 | 3.00 |
| GPU-3 | IQ3_S-3.48bpw | 0.9659 | 271.14 | 3.48 |
| GPU-4 | IQ4_XS-3.93bpw | 0.9871 | 270.86 | 3.93 |
| GPU-5 | IQ4_XS-4.15bpw | 0.9927 | 270.54 | 4.15 |
| A | UD-IQ3_S | 0.9695 | 237.38 | 3.15 |
| B | UD-Q3_K_XL | 0.9826 | 246.16 | 3.89 |
| C | UD-IQ4_XS | 0.9946 | 245.59 | 4.09 |
| D | UD-IQ4_NL | 0.9907 | 247.98 | 4.16 |
| E | UD-Q4_K_XL | 0.9935 | 240.98 | 5.16 |
| F | UD-Q5_K_XL | 0.9916 | 238.09 | 6.14 |
| A | IQ4_XS | 0.9887 | 263.06 | 4.34 |
| B | Q4_K_M | 0.9887 | 261.39 | 4.93 |
| C | Q4_K_L | 0.9848 | 256.42 | 5.02 |
| D | Q5_K_L | 0.9906 | 245.76 | 5.84 |
| A | APEX-I-Mini | 0.9505 | 256.15 | 3.30 |
| B | APEX-I-Compact | 0.9707 | 257.11 | 3.99 |
| C | APEX-I-Quality | 0.9950 | 247.42 | 5.26 |
| D | APEX-I-Balanced | 0.9855 | 248.94 | 5.91 |
| A | IQ3_S | 0.9671 | 235.47 | 3.13 |
| B | IQ4_XS | 0.9866 | 239.64 | 4.06 |
| C | Q4_K_M | 0.9879 | 239.44 | 5.11 |
| D | Q5_K_M | 0.9880 | 236.29 | 6.06 |

RTX Pro 6000: tokens per second vs quality (NTP)

| # | Model | Acc | TPS | BPW |
|---|---|---|---|---|
| GPU-1 | IQ2_S-2.17bpw | 0.8870 | 281.40 | 2.17 |
| GPU-2 | IQ3_S-3.00bpw | 0.9600 | 272.53 | 3.00 |
| GPU-3 | IQ3_S-3.48bpw | 0.9659 | 273.63 | 3.48 |
| GPU-4 | IQ4_XS-3.93bpw | 0.9871 | 270.59 | 3.93 |
| GPU-5 | IQ4_XS-4.15bpw | 0.9927 | 269.81 | 4.15 |
| A | UD-IQ3_S | 0.9695 | 237.06 | 3.15 |
| B | UD-Q3_K_XL | 0.9826 | 245.37 | 3.89 |
| C | UD-IQ4_XS | 0.9946 | 245.11 | 4.09 |
| D | UD-IQ4_NL | 0.9907 | 246.98 | 4.16 |
| E | UD-Q4_K_XL | 0.9935 | 238.22 | 5.16 |
| F | UD-Q5_K_XL | 0.9916 | 234.69 | 6.14 |
| A | IQ4_XS | 0.9887 | 261.54 | 4.34 |
| B | Q4_K_M | 0.9887 | 258.70 | 4.93 |
| C | Q4_K_L | 0.9848 | 253.73 | 5.02 |
| D | Q5_K_L | 0.9906 | 242.63 | 5.84 |
| A | APEX-I-Mini | 0.9505 | 257.92 | 3.30 |
| B | APEX-I-Compact | 0.9707 | 257.11 | 3.99 |
| C | APEX-I-Quality | 0.9950 | 243.47 | 5.26 |
| D | APEX-I-Balanced | 0.9855 | 244.74 | 5.91 |
| A | IQ3_S | 0.9671 | 235.03 | 3.13 |
| B | IQ4_XS | 0.9866 | 239.38 | 4.06 |
| C | Q4_K_M | 0.9879 | 237.35 | 5.11 |
| D | Q5_K_M | 0.9880 | 233.39 | 6.06 |

16GB GPUs: RTX 4080 & 5060 Ti

On 16GB GPUs, GPU-5 does not fit, so the recommendation has to change.

For NTP, GPU-3 is the most obvious choice. It gives the best practical balance of quality and throughput among the models that fit.

Here, we see the more familiar trade-off curve: you can trade some quality for more token-generation throughput by choosing a smaller model. Prompt processing is less predictable and does not follow the same clean relationship, but the ByteShape models remain strong across the tested devices.

RTX 4080: tokens per second vs quality (NTP)

| # | Model | Acc | TPS | BPW |
|---|---|---|---|---|
| GPU-1 | IQ2_S-2.17bpw | 0.8870 | 204.74 | 2.17 |
| GPU-2 | IQ3_S-3.00bpw | 0.9600 | 190.52 | 3.00 |
| GPU-3 | IQ3_S-3.48bpw | 0.9659 | 183.29 | 3.48 |
| A | UD-IQ3_S | 0.9695 | 157.14 | 3.15 |
| A | APEX-I-Mini | 0.9505 | 181.20 | 3.30 |
| A | IQ3_S | 0.9671 | 155.86 | 3.13 |

RTX 5060 Ti: tokens per second vs quality (NTP)

| # | Model | Acc | TPS | BPW |
|---|---|---|---|---|
| GPU-1 | IQ2_S-2.17bpw | 0.8870 | 132.14 | 2.17 |
| GPU-2 | IQ3_S-3.00bpw | 0.9600 | 120.73 | 3.00 |
| GPU-3 | IQ3_S-3.48bpw | 0.9659 | 115.55 | 3.48 |
| A | UD-IQ3_S | 0.9695 | 100.74 | 3.15 |
| A | APEX-I-Mini | 0.9505 | 113.81 | 3.30 |
| A | IQ3_S | 0.9671 | 99.84 | 3.13 |

CPUs

CPU inference shows a different pattern.

For token generation, we see a fairly clean trade-off curve: smaller models are faster, but quality drops. Balancing the two is essential.

High-memory CPU setups: i7, Ultra 7 & Ryzen 9

When memory is not the main constraint, CPU-5 is the strongest default choice.

You can choose a smaller model for more token-generation throughput, but the trade-off is not free. You give up quality, and in our measurements prompt processing may also get worse. Unless your workload specifically benefits from the smaller variants, CPU-5 is likely the better starting point.

Intel i7: tokens per second vs quality (NTP)

| # | Model | Acc | TPS | BPW |
|---|---|---|---|---|
| CPU-1 | Q3_K_S-2.69bpw | 0.9186 | 17.08 | 2.69 |
| CPU-2 | Q3_K_S-2.71bpw | 0.9339 | 16.33 | 2.71 |
| CPU-3 | Q3_K_S-3.39bpw | 0.9651 | 15.08 | 3.39 |
| CPU-4 | Q4_K_S-3.80bpw | 0.9781 | 13.29 | 3.80 |
| CPU-5 | Q4_K_S-4.22bpw | 0.9915 | 13.72 | 4.22 |
| A | UD-IQ3_S | 0.9695 | 11.52 | 3.15 |
| B | UD-Q3_K_XL | 0.9826 | 9.95 | 3.89 |
| C | UD-IQ4_XS | 0.9946 | 9.79 | 4.09 |
| D | UD-IQ4_NL | 0.9907 | 9.76 | 4.16 |
| E | UD-Q4_K_XL | 0.9935 | 9.08 | 5.16 |
| F | UD-Q5_K_XL | 0.9916 | 8.75 | 6.14 |
| A | IQ4_XS | 0.9887 | 12.10 | 4.34 |
| B | Q4_K_M | 0.9887 | 10.86 | 4.93 |
| C | Q4_K_L | 0.9848 | 10.33 | 5.02 |
| D | Q5_K_L | 0.9906 | 9.45 | 5.84 |
| A | APEX-I-Mini | 0.9505 | 13.66 | 3.30 |
| B | APEX-I-Compact | 0.9707 | 12.23 | 3.99 |
| C | APEX-I-Quality | 0.9950 | 10.43 | 5.26 |
| D | APEX-I-Balanced | 0.9855 | 10.10 | 5.91 |
| A | IQ3_S | 0.9671 | 11.55 | 3.13 |
| B | IQ4_XS | 0.9866 | 9.39 | 4.06 |
| C | Q4_K_M | 0.9879 | 9.13 | 5.11 |
| D | Q5_K_M | 0.9880 | 8.79 | 6.06 |

AMD Ryzen 9: tokens per second vs quality (NTP)

| # | Model | Acc | TPS | BPW |
|---|---|---|---|---|
| CPU-1 | Q3_K_S-2.69bpw | 0.9186 | 22.08 | 2.69 |
| CPU-2 | Q3_K_S-2.71bpw | 0.9339 | 21.24 | 2.71 |
| CPU-3 | Q3_K_S-3.39bpw | 0.9651 | 19.60 | 3.39 |
| CPU-4 | Q4_K_S-3.80bpw | 0.9781 | 17.83 | 3.80 |
| CPU-5 | Q4_K_S-4.22bpw | 0.9915 | 18.11 | 4.22 |
| A | UD-IQ3_S | 0.9695 | 15.37 | 3.15 |
| B | UD-Q3_K_XL | 0.9826 | 13.48 | 3.89 |
| C | UD-IQ4_XS | 0.9946 | 13.22 | 4.09 |
| D | UD-IQ4_NL | 0.9907 | 13.28 | 4.16 |
| E | UD-Q4_K_XL | 0.9935 | 12.62 | 5.16 |
| F | UD-Q5_K_XL | 0.9916 | 12.19 | 6.14 |
| A | IQ4_XS | 0.9887 | 16.28 | 4.34 |
| B | Q4_K_M | 0.9887 | 14.91 | 4.93 |
| C | Q4_K_L | 0.9848 | 14.19 | 5.02 |
| D | Q5_K_L | 0.9906 | 13.03 | 5.84 |
| A | APEX-I-Mini | 0.9505 | 18.25 | 3.30 |
| B | APEX-I-Compact | 0.9707 | 16.67 | 3.99 |
| C | APEX-I-Quality | 0.9950 | 14.32 | 5.26 |
| D | APEX-I-Balanced | 0.9855 | 13.90 | 5.91 |
| A | IQ3_S | 0.9671 | 15.30 | 3.13 |
| B | IQ4_XS | 0.9866 | 12.79 | 4.06 |
| C | Q4_K_M | 0.9879 | 12.60 | 5.11 |
| D | Q5_K_M | 0.9880 | 12.19 | 6.06 |

Intel Ultra 7: tokens per second vs quality (NTP)

| # | Model | Acc | TPS | BPW |
|---|---|---|---|---|
| CPU-1 | Q3_K_S-2.69bpw | 0.9186 | 19.85 | 2.69 |
| CPU-2 | Q3_K_S-2.71bpw | 0.9339 | 18.09 | 2.71 |
| CPU-3 | Q3_K_S-3.39bpw | 0.9651 | 16.75 | 3.39 |
| CPU-4 | Q4_K_S-3.80bpw | 0.9781 | 15.72 | 3.80 |
| CPU-5 | Q4_K_S-4.22bpw | 0.9915 | 15.47 | 4.22 |
| A | UD-IQ3_S | 0.9695 | 14.85 | 3.15 |
| B | UD-Q3_K_XL | 0.9826 | 12.04 | 3.89 |
| C | UD-IQ4_XS | 0.9946 | 11.83 | 4.09 |
| D | UD-IQ4_NL | 0.9907 | 10.15 | 4.16 |
| E | UD-Q4_K_XL | 0.9935 | 11.25 | 5.16 |
| F | UD-Q5_K_XL | 0.9916 | 11.89 | 6.14 |
| A | IQ4_XS | 0.9887 | 15.04 | 4.34 |
| B | Q4_K_M | 0.9887 | 12.72 | 4.93 |
| C | Q4_K_L | 0.9848 | 11.90 | 5.02 |
| D | Q5_K_L | 0.9906 | 12.20 | 5.84 |
| A | APEX-I-Mini | 0.9505 | 14.94 | 3.30 |
| B | APEX-I-Compact | 0.9707 | 14.41 | 3.99 |
| C | APEX-I-Quality | 0.9950 | 14.60 | 5.26 |
| D | APEX-I-Balanced | 0.9855 | 13.97 | 5.91 |
| A | IQ3_S | 0.9671 | 13.89 | 3.13 |
| B | IQ4_XS | 0.9866 | 10.82 | 4.06 |
| C | Q4_K_M | 0.9879 | 11.75 | 5.11 |
| D | Q5_K_M | 0.9880 | 12.16 | 6.06 |

Memory-limited CPU setup: Raspberry Pi 5

On a 16GB Raspberry Pi 5, the choice gets more interesting.

CPU-3 gives the best quality among the practical options, with “near” real-time generation. Moving down to CPU-2 improves both prompt-processing and token-generation throughput by roughly 10%, but it comes at a meaningful quality cost. In our measurements, the error rate roughly doubles.

So the recommendation depends on what you care about:

  • choose CPU-3 if you want the best quality that fits,
  • choose CPU-2 if you are willing to trade quality for a modest speed increase.
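
The "error rate roughly doubles" figure can be checked from the Acc values in the Raspberry Pi 5 legend. The sketch assumes that error rate means the shortfall from a perfect normalized score (1 minus Acc); that definition is an interpretation, not something the legend states.

```python
# Acc values for the two Raspberry Pi 5 candidates, from the legend in this post.
ACC = {"CPU-2": 0.9339, "CPU-3": 0.9651}

def error_rate(acc: float) -> float:
    """Assumed definition: shortfall from a perfect normalized score."""
    return 1.0 - acc

ratio = error_rate(ACC["CPU-2"]) / error_rate(ACC["CPU-3"])

print(f"CPU-3 error: {error_rate(ACC['CPU-3']):.4f}")
print(f"CPU-2 error: {error_rate(ACC['CPU-2']):.4f}")
print(f"ratio: {ratio:.2f}x")
```

Under that assumption the ratio comes out at about 1.9x, in line with the "roughly doubles" description.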

Raspberry Pi 5: tokens per second vs quality (NTP)

| # | Model | Acc | TPS | BPW |
|---|---|---|---|---|
| CPU-1 | Q3_K_S-2.69bpw | 0.9186 | 5.52 | 2.69 |
| CPU-2 | Q3_K_S-2.71bpw | 0.9339 | 5.33 | 2.71 |
| CPU-3 | Q3_K_S-3.39bpw | 0.9651 | 5.05 | 3.39 |
| A | UD-IQ3_S | 0.9695 | 3.19 | 3.15 |
| A | APEX-I-Mini | 0.9505 | 4.44 | 3.30 |
| A | IQ3_S | 0.9671 | 3.61 | 3.13 |

Multi-Token Prediction (MTP)

We are also releasing MTP models alongside the standard NTP models.

MTP boosts token-generation throughput by predicting multiple future tokens at once. Each prediction is still a guess, so the size of the boost depends on two things: the overhead of making and validating the predictions, and how often they prove correct. MTP helps when the guesses are right often enough to outweigh that overhead. In our tests, the boost is real on GPUs, and the quality impact is small enough that the MTP models are worth considering.
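
A toy model makes the overhead-versus-accuracy balance easier to reason about. It assumes each drafted token is accepted independently with a fixed probability and that each MTP step costs a fixed multiple of a plain decoding step. Both are simplifications, and the function names and example numbers are hypothetical rather than measured.

```python
def expected_tokens(p_accept: float, drafts: int) -> float:
    """Expected tokens emitted per validation step.

    Assumes each drafted token is accepted independently with probability
    p_accept and that acceptance stops at the first rejection. The
    validation pass always yields one token on its own, hence the i = 0 term.
    """
    return sum(p_accept ** i for i in range(drafts + 1))

def speedup(p_accept: float, drafts: int, overhead: float) -> float:
    """Throughput relative to NTP when one MTP step costs (1 + overhead) NTP steps."""
    return expected_tokens(p_accept, drafts) / (1.0 + overhead)

# Illustrative inputs only: one drafted token, 35% extra cost per step.
for p in (0.3, 0.6, 0.8):
    print(f"accept={p:.1f} -> speedup {speedup(p, drafts=1, overhead=0.35):.2f}x")
```

In this model, a low acceptance rate leaves the result below 1.0x (a slowdown), while a high one clears the overhead comfortably. That is the same qualitative behavior described above, without claiming anything about the actual acceptance rates of these models.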

The trade-off is memory and prompt processing. MTP increases the runtime footprint, so a model that fits comfortably in NTP mode may not fit in MTP mode with the same context length. Prompt processing also becomes less attractive, especially on CPUs.

So the short version is:

MTP is useful for GPU generation speed. Be careful with memory. Be very careful with prompt processing.

MTP on GPUs

On GPUs, MTP makes sense. Across our tested devices, it gives roughly 20–40% higher token-generation throughput.
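
As a worked example, the boost can be recomputed from the RTX 4090 legend values in this post. The pairing of each MTP model with the NTP model of the same number is an assumption about how the comparison is meant to be read; the two variants differ slightly in BPW.

```python
# Token-generation TPS on the RTX 4090, from the MTP vs NTP legend in this post.
NTP = {"GPU-1": 237.09, "GPU-2": 224.26, "GPU-3": 218.55, "GPU-4": 217.45, "GPU-5": 214.54}
MTP = {"GPU-1": 305.47, "GPU-2": 295.00, "GPU-3": 289.08, "GPU-4": 289.20, "GPU-5": 285.53}

def boost_percent(mtp_tps: float, ntp_tps: float) -> float:
    """Relative token-generation gain of MTP over NTP, in percent."""
    return (mtp_tps / ntp_tps - 1.0) * 100.0

boosts = {name: boost_percent(MTP[name], NTP[name]) for name in NTP}

for name, pct in boosts.items():
    print(f"MTP-{name} vs {name}: +{pct:.1f}%")
```

On this device every pair lands between roughly 29% and 33%, inside the 20–40% range quoted above.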

For most GPU users, the recommendation remains similar to NTP: choose the largest ByteShape model that fits your memory and context needs.

24+GB GPUs: RTX 4090, 5090, & Pro 6000

On 24+GB GPUs, the recommendation is mostly consistent with NTP: use MTP-GPU-5.

It gives the strongest overall quality-speed trade-off among the MTP models that fit.

If you want more throughput and can accept some trade-off, MTP-GPU-4 is also worth considering on the RTX 4090 and RTX Pro 6000. In those runs, MTP-GPU-4 gets a larger MTP boost than MTP-GPU-5.

The RTX 5090 result is less clean: MTP-GPU-4 underperforms in both quality and throughput compared with what we see on the other 24+GB GPUs. Because of that, we do not recommend it for the RTX 5090.

We also do not recommend MTP-GPU-3 on the 24+GB GPUs we tested. It has lower quality and lower throughput than the stronger options, so there is not much reason to choose it unless you have a very specific memory constraint.

RTX 4090: tokens per second vs quality (MTP vs NTP)

| # | Model | Acc | TPS | BPW |
|---|---|---|---|---|
| MTP-GPU-1 | IQ2_S-2.25bpw | 0.8870 | 305.47 | 2.25 |
| MTP-GPU-2 | IQ3_S-3.06bpw | 0.9600 | 295.00 | 3.06 |
| MTP-GPU-3 | IQ4_XS-3.53bpw | 0.9659 | 289.08 | 3.53 |
| MTP-GPU-4 | IQ4_XS-3.97bpw | 0.9871 | 289.20 | 3.97 |
| MTP-GPU-5 | IQ4_XS-4.19bpw | 0.9927 | 285.53 | 4.19 |
| GPU-1 | IQ2_S-2.17bpw | 0.8870 | 237.09 | 2.17 |
| GPU-2 | IQ3_S-3.00bpw | 0.9600 | 224.26 | 3.00 |
| GPU-3 | IQ3_S-3.48bpw | 0.9659 | 218.55 | 3.48 |
| GPU-4 | IQ4_XS-3.93bpw | 0.9871 | 217.45 | 3.93 |
| GPU-5 | IQ4_XS-4.15bpw | 0.9927 | 214.54 | 4.15 |

RTX 5090: tokens per second vs quality (MTP vs NTP)

| # | Model | Acc | TPS | BPW |
|---|---|---|---|---|
| MTP-GPU-1 | IQ2_S-2.25bpw | 0.8870 | 333.18 | 2.25 |
| MTP-GPU-2 | IQ3_S-3.06bpw | 0.9600 | 331.28 | 3.06 |
| MTP-GPU-3 | IQ4_XS-3.53bpw | 0.9659 | 326.95 | 3.53 |
| MTP-GPU-4 | IQ4_XS-3.97bpw | 0.9871 | 333.36 | 3.97 |
| MTP-GPU-5 | IQ4_XS-4.19bpw | 0.9927 | 333.55 | 4.19 |
| GPU-1 | IQ2_S-2.17bpw | 0.8870 | 278.88 | 2.17 |
| GPU-2 | IQ3_S-3.00bpw | 0.9600 | 269.09 | 3.00 |
| GPU-3 | IQ3_S-3.48bpw | 0.9659 | 271.14 | 3.48 |
| GPU-4 | IQ4_XS-3.93bpw | 0.9871 | 270.86 | 3.93 |
| GPU-5 | IQ4_XS-4.15bpw | 0.9927 | 270.54 | 4.15 |

RTX Pro 6000: tokens per second vs quality (MTP vs NTP)

| # | Model | Acc | TPS | BPW |
|---|---|---|---|---|
| MTP-GPU-1 | IQ2_S-2.25bpw | 0.8870 | 349.94 | 2.25 |
| MTP-GPU-2 | IQ3_S-3.06bpw | 0.9600 | 342.60 | 3.06 |
| MTP-GPU-3 | IQ4_XS-3.53bpw | 0.9659 | 331.13 | 3.53 |
| MTP-GPU-4 | IQ4_XS-3.97bpw | 0.9871 | 349.34 | 3.97 |
| MTP-GPU-5 | IQ4_XS-4.19bpw | 0.9927 | 343.11 | 4.19 |
| GPU-1 | IQ2_S-2.17bpw | 0.8870 | 281.40 | 2.17 |
| GPU-2 | IQ3_S-3.00bpw | 0.9600 | 272.53 | 3.00 |
| GPU-3 | IQ3_S-3.48bpw | 0.9659 | 273.63 | 3.48 |
| GPU-4 | IQ4_XS-3.93bpw | 0.9871 | 270.59 | 3.93 |
| GPU-5 | IQ4_XS-4.15bpw | 0.9927 | 269.81 | 4.15 |

16GB GPUs: RTX 4080 & 5060 Ti

This is where the extra MTP memory footprint matters. With a reasonable context length, MTP-GPU-3 does not fit in 16GB, so our practical recommendation is MTP-GPU-2. It still reaches about 96% of baseline quality, while improving token-generation throughput by roughly 31–40% over NTP GPU-2 on the RTX 4080 and RTX 5060 Ti. That makes it a good option for users who care about generation speed and are working within a 16GB memory budget.

RTX 4080: tokens per second vs quality (MTP vs NTP)

| # | Model | Acc | TPS | BPW |
|---|---|---|---|---|
| MTP-GPU-1 | IQ2_S-2.25bpw | 0.8870 | 262.39 | 2.25 |
| MTP-GPU-2 | IQ3_S-3.06bpw | 0.9600 | 249.33 | 3.06 |
| GPU-1 | IQ2_S-2.17bpw | 0.8870 | 204.74 | 2.17 |
| GPU-2 | IQ3_S-3.00bpw | 0.9600 | 190.52 | 3.00 |
| GPU-3 | IQ3_S-3.48bpw | 0.9659 | 183.29 | 3.48 |

RTX 5060 Ti: tokens per second vs quality (MTP vs NTP)

| # | Model | Acc | TPS | BPW |
|---|---|---|---|---|
| MTP-GPU-1 | IQ2_S-2.25bpw | 0.8870 | 169.76 | 2.25 |
| MTP-GPU-2 | IQ3_S-3.06bpw | 0.9600 | 169.50 | 3.06 |
| GPU-1 | IQ2_S-2.17bpw | 0.8870 | 132.14 | 2.17 |
| GPU-2 | IQ3_S-3.00bpw | 0.9600 | 120.73 | 3.00 |
| GPU-3 | IQ3_S-3.48bpw | 0.9659 | 115.55 | 3.48 |

MTP on CPUs

For now, we do not recommend MTP for CPU inference. CPU inference is already heavily constrained by prompt processing. MTP improves token generation, but it adds extra compute load and makes the prompt-processing side less attractive. The trade-off is not worth it.

If that changes in future backends or hardware, we will revisit CPU MTP releases. For now, our CPU recommendation remains NTP.

Conclusion

This release is unusually easy to summarize:

  1. For NTP, choose the largest ByteShape model that fits your hardware and context needs.
  2. For MTP on GPUs, do the same, but remember that the extra memory footprint can change what fits. On 24+GB GPUs, MTP-GPU-5 is the default recommendation. On 16GB GPUs, MTP-GPU-2 is the practical choice.
  3. For CPUs, stick with NTP for now.

We are not covering KLD or perplexity in this release post. In our experience with instruction-tuned models, they are useful for catching clearly broken quantizations, but once models are roughly in the right range, we have not found them to reliably predict downstream benchmark performance. We will discuss this, along with how we approach evaluation, in a dedicated blog post.

As always, the best model is not the smallest model, the fastest model, or the model that wins one diagnostic metric. The best model is the one that fits your hardware and gives you the best quality-speed trade-off for the workload you actually run. That is what this release is designed for.