Open-source models
Frontier-class, but you own the weights.
Llama 4 Maverick · DeepSeek V3.2 · Qwen3 235B · Mistral Large 3 — self-host, fine-tune, run anywhere. No data leaves your infra.
Open-source flagships
Model pricing
| Provider | Model | Context | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|---|---|
| Meta | Llama 4 Maverick ★ latest | 1M | $0.15 | $0.60 |
| Meta | Llama 4 Scout | 10M | $0.08 | $0.30 |
| DeepSeek | DeepSeek V3.2 ★ latest | 131K | $0.25 | $0.38 |
| DeepSeek | DeepSeek R1-0528 | 164K | $0.50 | $2.15 |
| | V4-Flash | 1M | $0.14 | $0.28 |
| | V4-Pro | 1M | $0.44 | $0.87 |
| Qwen | Qwen3 235B ★ latest | 256K | $0.46 | $1.82 |
| Qwen | Qwen3 32B | 41K | $0.08 | $0.28 |
| Qwen | Qwen3 30B-A3B | 41K | $0.08 | $0.28 |
| Mistral | Mistral Large 3 ★ latest | 256K | $0.50 | $1.50 |
| Mistral | Mistral Medium 3 | 131K | $0.40 | $2.00 |
| Mistral | Mistral Small 3.2 | 131K | $0.075 | $0.20 |
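The per-million rates above translate into a per-request cost with simple arithmetic: tokens times rate, divided by one million. The sketch below is illustrative only. It copies the four flagship rows from the table, and the `request_cost` helper and the 200,000-in / 5,000-out workload are assumptions for the example, not part of any provider's API.

```python
# Prices in USD per million tokens, copied from the pricing table above.
# Each entry is (input_price, output_price).
PRICES = {
    "Llama 4 Maverick": (0.15, 0.60),
    "DeepSeek V3.2": (0.25, 0.38),
    "Qwen3 235B": (0.46, 1.82),
    "Mistral Large 3": (0.50, 1.50),
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one request at the listed per-million rates.

    Hypothetical helper for illustration; it ignores caching discounts,
    batching, and any self-hosting costs.
    """
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


if __name__ == "__main__":
    # Assumed workload: a 200,000-token prompt with a 5,000-token answer.
    for name in PRICES:
        cost = request_cost(name, input_tokens=200_000, output_tokens=5_000)
        print(f"{name}: ${cost:.4f}")
```

Because output rates differ more than input rates across these models, the ranking can change for generation-heavy workloads, so it is worth rerunning the estimate with your own token mix.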
Benchmarks
Performance comparison
| Benchmark | Llama 4 Maverick | DeepSeek V3.2 | Qwen3 235B | Mistral Large 3 |
|---|---|---|---|---|
| MMLU (broad knowledge) | 85.5% | 88.5% | **93.1%** | ~85.5% |
| MMLU-Pro (expert reasoning) | 80.5% | **85.0%** | 83.0% | 73.1% |
| GPQA Diamond (graduate-level STEM) | 69.8% | **82.4%** | 77.5% | ~43.9% |
| SWE-bench Verified (agentic coding) | ~34% | **73.1%** | — | — |
| LiveCodeBench (competitive coding) | 43.4% | **73.3%**† | 51.8% | — |
| Context window (max tokens) | 1M | 131K | 256K | 256K |
Bold values indicate the highest score per benchmark. †LiveCodeBench score is for DeepSeek R1-0528 (reasoning model); DeepSeek V3.2 has no published LiveCodeBench score. Sources: official model cards (Meta, DeepSeek, Qwen3, Mistral), DeepSeek-V3.2 technical report (arXiv:2512.02556), and CodeSOTA Open LLM Leaderboard (codesota.com, May 2026). Mistral GPQA is a third-party estimate. — = no published score.
Pick the right model
Long-context document and code analysis
Use Llama 4 Scout — 10M token context, the longest of any open-weight model. Ingest a full codebase, year-long conversation logs, or a multi-volume document set without chunking. Llama 4 Community License; self-host royalty-free.
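To see which models can take a given input without chunking, one option is to filter the pricing table by context window. The sketch below is a rough illustration under stated assumptions: the context figures are the rounded values from the table ("1M" read as 1,000,000 tokens, and so on), the `models_that_fit` helper is hypothetical, and you would need to supply your own token count from the tokenizer of the model you plan to use.

```python
# (model, approximate context window in tokens, input price in USD per 1M tokens)
# Values are the rounded figures from the pricing table, not exact limits.
MODELS = [
    ("Llama 4 Maverick", 1_000_000, 0.15),
    ("Llama 4 Scout", 10_000_000, 0.08),
    ("DeepSeek V3.2", 131_000, 0.25),
    ("Qwen3 235B", 256_000, 0.46),
    ("Mistral Large 3", 256_000, 0.50),
]


def models_that_fit(prompt_tokens: int, reserve_for_output: int = 0) -> list[str]:
    """Return names of models whose window holds the prompt plus reserved output.

    Hypothetical helper: results are sorted by input price, cheapest first.
    """
    needed = prompt_tokens + reserve_for_output
    fits = [(name, price) for name, ctx, price in MODELS if ctx >= needed]
    return [name for name, _ in sorted(fits, key=lambda item: item[1])]


if __name__ == "__main__":
    # Assumed example: a 2,000,000-token codebase dump.
    print(models_that_fit(2_000_000))
```

With these table values, only Llama 4 Scout remains once the input exceeds 1M tokens, which is the case the recommendation above targets.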
Deep reasoning, math, and agentic coding
Use DeepSeek R1-0528 for reasoning-heavy work (73.3% LiveCodeBench, the top score in this lineup) and DeepSeek V3.2 for agentic coding (73.1% SWE-bench Verified). MIT licensed; deploy on your own GPU cluster for near-frontier reasoning at a fraction of closed-model cost.
Multilingual and Asian-language tasks
Use Qwen3 235B — highest MMLU in this lineup (93.1%), trained on 36+ languages with strong coverage of Chinese, Japanese, Korean, and Arabic. Apache 2.0 licensed; fine-tune for language-specific domains without restrictions.
EU-regulated and privacy-sensitive workloads
Use Mistral Large 3, built by a French AI lab and deployable on AWS Paris or Azure EU so that data stays on European infrastructure, which simplifies GDPR compliance. Apache 2.0 licensed; multimodal (text + image) at predictable cost.