Open-source models

Frontier-class, but you own the weights.

Llama 4 Maverick · DeepSeek V3.2 · Qwen3 235B · Mistral Large 3 — self-host, fine-tune, run anywhere. No data leaves your infra.

Open-source flagships

Model pricing

Provider	Model	Context (tokens)	Input (per M tokens)	Output (per M tokens)
Meta	Llama 4 Maverick ★ latest	1M	$0.15	$0.60
Meta	Llama 4 Scout	10M	$0.08	$0.30
DeepSeek	DeepSeek V3.2 ★ latest	131K	$0.25	$0.38
DeepSeek	DeepSeek R1-0528	164K	$0.50	$2.15
DeepSeek	V4-Flash	1M	$0.14	$0.28
DeepSeek	V4-Pro	1M	$0.44	$0.87
Alibaba	Qwen3 235B ★ latest	256K	$0.46	$1.82
Alibaba	Qwen3 32B	41K	$0.08	$0.28
Alibaba	Qwen3 30B-A3B	41K	$0.08	$0.28
Mistral	Mistral Large 3 ★ latest	256K	$0.50	$1.50
Mistral	Mistral Medium 3	131K	$0.40	$2.00
Mistral	Mistral Small 3.2	131K	$0.075	$0.20

USD per million tokens. All models are open-weight (MIT, Apache 2.0, or the Llama 4 Community License for the Llama models). DeepSeek V4 prices reflect the official API promotion running through May 2026. Sources: OpenRouter, DeepSeek, Qwen, Mistral (May 2026).
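To make the per-million-token prices concrete, here is a minimal sketch that estimates the bill for a workload on each flagship model. The prices are copied from the table above; the `Price` class, the `estimate_cost` helper, and the 50M/10M token workload are illustrative assumptions, not part of any provider's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    input_per_m: float   # USD per million input tokens
    output_per_m: float  # USD per million output tokens


# Flagship rows copied from the pricing table above.
PRICES = {
    "Llama 4 Maverick": Price(0.15, 0.60),
    "DeepSeek V3.2": Price(0.25, 0.38),
    "Qwen3 235B": Price(0.46, 1.82),
    "Mistral Large 3": Price(0.50, 1.50),
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost of one workload on one model."""
    price = PRICES[model]
    return (input_tokens * price.input_per_m
            + output_tokens * price.output_per_m) / 1_000_000


if __name__ == "__main__":
    # Hypothetical monthly workload: 50M input tokens, 10M output tokens.
    for name in PRICES:
        print(f"{name}: ${estimate_cost(name, 50_000_000, 10_000_000):.2f}")
```

For that hypothetical workload the estimate is roughly $13.50 on Llama 4 Maverick, $16.30 on DeepSeek V3.2, $40.00 on Mistral Large 3, and $41.20 on Qwen3 235B. These are hosted API prices; self-hosting costs depend on your own hardware and utilization and are not modeled here.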

Benchmarks

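Expert knowledge (MMLU-Pro): Llama 4 Maverick 80.5 · DeepSeek V3.2 85.0 · Qwen3 235B 83.0 · Mistral Large 3 73.1

Graduate-level STEM (GPQA Diamond): Llama 4 Maverick 69.8 · DeepSeek V3.2 82.4 · Qwen3 235B 77.5 · Mistral Large 3 43.9

Broad knowledge (MMLU): Llama 4 Maverick 85.5 · DeepSeek V3.2 88.5 · Qwen3 235B 93.1 · Mistral Large 3 85.5

Competitive coding (LiveCodeBench): Llama 4 Maverick 43.4 · DeepSeek 73.3† · Qwen3 235B 51.8 · Mistral Large 3 no published score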

Performance comparison

Benchmark	Llama 4 Maverick	DeepSeek V3.2	Qwen3 235B	Mistral Large 3
Broad knowledge MMLU	85.5%	88.5%	93.1%	~85.5%
Expert reasoning MMLU-Pro	80.5%	85.0%	83.0%	73.1%
Graduate-level STEM GPQA Diamond	69.8%	82.4%	77.5%	~43.9%
Agentic coding SWE-bench Verified	~34%	73.1%	—	—
Competitive coding LiveCodeBench	43.4%	73.3%†	51.8%	—
Context window max tokens	1M	131K	256K	256K

Bold values indicate the highest score per benchmark. †LiveCodeBench score is for DeepSeek R1-0528 (reasoning model); DeepSeek V3.2 has no published LiveCodeBench score. Sources: official model cards (Meta, DeepSeek, Qwen3, Mistral), DeepSeek-V3.2 technical report (arXiv:2512.02556), and CodeSOTA Open LLM Leaderboard (codesota.com, May 2026). Mistral GPQA is a third-party estimate. — = no published score.
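The per-benchmark leader can also be recovered directly from the numbers. The sketch below transcribes the comparison table and picks the highest published score in each row; the `SCORES` dictionary and `leader` helper are illustrative names, missing scores are stored as `None`, and approximate values (marked `~` in the table) are entered as plain numbers.

```python
from typing import Dict, Optional

# Scores (percent) transcribed from the comparison table above.
# None means no published score. Values shown with "~" in the table
# are approximate. The LiveCodeBench figure in the DeepSeek column is
# for DeepSeek R1-0528, as the footnote explains.
SCORES: Dict[str, Dict[str, Optional[float]]] = {
    "MMLU": {"Llama 4 Maverick": 85.5, "DeepSeek V3.2": 88.5,
             "Qwen3 235B": 93.1, "Mistral Large 3": 85.5},
    "MMLU-Pro": {"Llama 4 Maverick": 80.5, "DeepSeek V3.2": 85.0,
                 "Qwen3 235B": 83.0, "Mistral Large 3": 73.1},
    "GPQA Diamond": {"Llama 4 Maverick": 69.8, "DeepSeek V3.2": 82.4,
                     "Qwen3 235B": 77.5, "Mistral Large 3": 43.9},
    "SWE-bench Verified": {"Llama 4 Maverick": 34.0, "DeepSeek V3.2": 73.1,
                           "Qwen3 235B": None, "Mistral Large 3": None},
    "LiveCodeBench": {"Llama 4 Maverick": 43.4, "DeepSeek V3.2": 73.3,
                      "Qwen3 235B": 51.8, "Mistral Large 3": None},
}


def leader(benchmark: str) -> str:
    """Return the column with the highest published score for a benchmark."""
    published = {m: s for m, s in SCORES[benchmark].items() if s is not None}
    return max(published, key=lambda m: published[m])


if __name__ == "__main__":
    for name in SCORES:
        print(f"{name}: {leader(name)}")
```

Run as a script, this lists Qwen3 235B as the MMLU leader and the DeepSeek column as the leader on the other four benchmarks.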

Pick the right model

Long-context document and code analysis

Use Llama 4 Scout: its 10M-token context is the longest listed here. Ingest a full codebase, year-long conversation logs, or a multi-volume document set without chunking. It ships under the Llama 4 Community License and can be self-hosted royalty-free under that license's terms.
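A quick way to check whether a workload needs Scout is to compare its token count against each model's context window. The sketch below assumes the table's "K" and "M" figures are round decimal token counts and reserves a hypothetical 8,000 tokens for the model's reply; `CONTEXT_WINDOWS`, `models_that_fit`, and `reserve_for_output` are illustrative names.

```python
from typing import List

# Context windows from the pricing table, read as approximate token
# counts (assumption: "131K" means about 131,000 tokens, and so on).
CONTEXT_WINDOWS = {
    "Llama 4 Scout": 10_000_000,
    "Llama 4 Maverick": 1_000_000,
    "Qwen3 235B": 256_000,
    "Mistral Large 3": 256_000,
    "DeepSeek R1-0528": 164_000,
    "DeepSeek V3.2": 131_000,
}


def models_that_fit(prompt_tokens: int,
                    reserve_for_output: int = 8_000) -> List[str]:
    """List models whose window holds the prompt plus the reserved output,
    ordered from the smallest sufficient window to the largest."""
    needed = prompt_tokens + reserve_for_output
    fitting = [m for m, window in CONTEXT_WINDOWS.items() if window >= needed]
    return sorted(fitting, key=lambda m: CONTEXT_WINDOWS[m])


if __name__ == "__main__":
    print(models_that_fit(200_000))    # a large report or mid-sized repo
    print(models_that_fit(2_000_000))  # a multi-volume document set
```

Under these assumptions, a 2M-token prompt fits only Llama 4 Scout, while a 200K-token prompt also fits Qwen3 235B, Mistral Large 3, and Llama 4 Maverick.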

Deep reasoning, math, and agentic coding

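Use DeepSeek R1-0528 for deep reasoning and competitive coding (73.3% LiveCodeBench), or DeepSeek V3.2 for agentic coding (73.1% SWE-bench Verified); these are the top coding scores in the comparison above. Both are MIT licensed; deploy on your own GPU cluster for near-frontier reasoning at a fraction of closed-model cost.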

Multilingual and Asian-language tasks

Use Qwen3 235B: it has the highest MMLU score in this lineup (93.1%) and, per the Qwen3 release, supports 119 languages and dialects, with strong coverage of Chinese, Japanese, Korean, and Arabic. It is Apache 2.0 licensed, so you can fine-tune it for language-specific domains under permissive terms.

EU-regulated and privacy-sensitive workloads

Use Mistral Large 3: it is built by a French AI lab and can be deployed in EU regions such as AWS Paris or Azure EU, so data stays on European infrastructure, which helps meet GDPR data-residency requirements. It is Apache 2.0 licensed and multimodal (text + image) at predictable cost.