Local models
Run on your own hardware.
Zero cost per token. Full data sovereignty. APPI-compliant by default. Llama 4 Scout · Qwen 3.6-27B · DeepSeek-R1 · Gemma 4 — the 2026 local lineup.
Key models
Hardware guide
| Size class | GPU VRAM | Apple Silicon | Example models |
|---|---|---|---|
| Nano · 3–4B | 4 GB | 8 GB | Phi-4 Mini, Qwen3-4B |
| Small · 7–8B | 6 GB | 16 GB | Llama 3.1-8B, Qwen3-8B |
| Medium · 14B | 10–12 GB | 32 GB | Phi-4-14B, DeepSeek-R1-14B |
| Large · 27–31B | 14–20 GB | 64 GB | Qwen 3.6-27B, Gemma 4-31B |
| XL · 70B | 35–40 GB | 192 GB | Llama 3.3-70B, Qwen3-72B |
| Server · 100B+ | 2× H100 | — | Llama 4 Scout bf16, Mistral Small 4 |
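The VRAM column can be roughly reproduced from parameter count and quantization width. The sketch below is a back-of-envelope estimate, not a measurement: it assumes a dense model whose weights dominate memory, 4-bit quantization, and a flat 20% allowance for KV cache and runtime buffers. The function name and the overhead factor are illustrative assumptions, not taken from any tool.

```python
def estimate_vram_gb(params_billion: float,
                     bits_per_weight: float = 4.0,
                     overhead: float = 0.20) -> float:
    """Rough VRAM estimate in GB: weight bytes plus a flat overhead fraction."""
    # 1B parameters at 8 bits per weight is about 1 GB of weights.
    weights_gb = params_billion * bits_per_weight / 8
    return round(weights_gb * (1 + overhead), 1)


if __name__ == "__main__":
    for label, size in [("8B", 8), ("14B", 14), ("27B", 27), ("70B", 70)]:
        print(f"{label}: ~{estimate_vram_gb(size)} GB at 4-bit")
```

For a 27B model this gives about 16 GB, inside the table's 14–20 GB range. Long contexts and higher-precision quantization push the real figure up, so treat the result as a starting point only.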
Running locally
Install Ollama
Ollama is the fastest path on macOS and Linux. One command installs the runtime, a local API server, and the MLX backend for Apple Silicon.
    brew install ollama                              # macOS
    curl -fsSL https://ollama.com/install.sh | sh    # Linux

Pull a model and start serving
Ollama downloads the quantized GGUF automatically. The server starts on port 11434 with an OpenAI-compatible endpoint.
    ollama pull qwen3:27b    # ~16 GB download
    ollama serve             # http://localhost:11434

Or use LM Studio for a GUI
LM Studio 0.4+ ships a model browser, GGUF downloader, and local server toggle. Enable the server and it listens on port 1234 — same OpenAI-compatible API.
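Because both servers speak the same chat-completions protocol, one request builder can target either of them. Here is a minimal sketch using only the standard library. It assumes the default ports mentioned above and the usual `/v1/chat/completions` path. `build_chat_request` and `send` are hypothetical helper names, and the model name you pass must be one the server has actually loaded.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/v1"
LM_STUDIO_URL = "http://localhost:1234/v1"


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 60.0) -> str:
    """Send the request and return the text of the first choice."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        payload = json.load(resp)
    return payload["choices"][0]["message"]["content"]
```

With a server running, `send(build_chat_request(LM_STUDIO_URL, "<loaded-model>", "Hello"))` would return the reply text. Switching to Ollama only changes the first argument.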
    # After enabling Local Server in LM Studio settings:

LifeOS integration
Each tool covered here exposes an OpenAI-compatible REST endpoint, so no adapter layer is needed. Agents built on the OpenAI SDK only need a new base_url; agents built on LangChain swap the chat model class. Either way, a LifeOS agent that currently calls Claude can call a local model instead.
    from openai import OpenAI

    client = OpenAI(
        base_url="http://localhost:11434/v1",
        api_key="ollama",  # required, unused
    )
    response = client.chat.completions.create(
        model="qwen3:27b",
        messages=[{"role": "user", "content": "..."}],
    )

    from langchain_ollama import ChatOllama

    llm = ChatOllama(model="qwen3:27b")
    # Drop-in replacement for ChatAnthropic
    # or ChatOpenAI in any LangChain chain

| Tool | Endpoint | Framework packages |
|---|---|---|
| Ollama | localhost:11434/v1 | langchain-ollama · llama-index-llms-ollama |
| LM Studio | localhost:1234/v1 | openai SDK (base_url override) |
| vLLM / llama.cpp | configurable | openai SDK (base_url override) |
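One way to keep the endpoint choice out of agent code is to resolve it in a single place. The sketch below is an assumption about how that could look, not LifeOS's actual configuration. The environment variables `LOCAL_LLM_BASE_URL` and `LOCAL_LLM_API_KEY` and the helpers `resolve_endpoint` and `make_client` are hypothetical names. The defaults mirror the table, and the "configurable" tools have to come from the environment.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Endpoint:
    base_url: str
    api_key: str  # placeholder value; the SDK requires one even if the server ignores it


# Defaults from the table above. vLLM and llama.cpp have no fixed port,
# so they are intentionally absent and must be set via the environment.
DEFAULTS = {
    "ollama": Endpoint("http://localhost:11434/v1", "ollama"),
    "lmstudio": Endpoint("http://localhost:1234/v1", "lm-studio"),
}


def resolve_endpoint(tool: str, env: Mapping[str, str] = os.environ) -> Endpoint:
    """Return the override from the environment if set, else the tool's default."""
    override = env.get("LOCAL_LLM_BASE_URL")
    if override:
        return Endpoint(override, env.get("LOCAL_LLM_API_KEY", "unused"))
    try:
        return DEFAULTS[tool.lower()]
    except KeyError:
        raise ValueError(
            f"no default endpoint for {tool!r}; set LOCAL_LLM_BASE_URL"
        ) from None


def make_client(tool: str):
    """Create an OpenAI SDK client pointed at the resolved local endpoint."""
    from openai import OpenAI  # lazy import so the resolver works without the SDK

    ep = resolve_endpoint(tool)
    return OpenAI(base_url=ep.base_url, api_key=ep.api_key)
```

An agent would then call `make_client("ollama")` and use the result exactly like a hosted client. Pointing it at vLLM is a matter of exporting `LOCAL_LLM_BASE_URL`.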
Pick the right model
Long-context ingestion — entire codebase or year of documents
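Use Llama 4 Scout. Its 10M-token context window is unique among local models: load a full codebase, a year of email, or a legal document library in one call. Scout is a mixture-of-experts model with 109B total parameters (17B active), so the int4 weights alone take roughly 55 GB. Plan on a single 80 GB H100 rather than a 24 GB RTX 4090, and budget additional memory for the KV cache as the context grows.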
Japanese-language tasks · multilingual coding agents
Use Qwen 3.6-27B, which has the highest Japanese benchmark scores among open-weight models as of mid-2025. The Apache 2.0 license permits commercial use. It fits in 16 GB of VRAM at Q4, so a laptop GPU with 16 GB can run it.
Reasoning · math · structured analysis
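Use DeepSeek-R1-Distill-14B (published as DeepSeek-R1-Distill-Qwen-14B), a 14B Qwen model fine-tuned on reasoning traces generated by the full R1. MIT licensed. Reported to match o1-mini on MATH and AIME while needing about 10 GB of VRAM, which makes it one of the strongest reasoning models per GB that you can run locally.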
Fast routing · classification · lightweight tasks
Use Phi-4 Mini 3.8B (Microsoft, MIT) — runs at 4 GB VRAM with ~200 tokens/s on consumer hardware. In LifeOS, deploy as the task router: classify inbound requests and forward to the right specialist model. Cost: zero per call.
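The router pattern described above can be sketched as a small function that asks the lightweight model for a label and maps it to a specialist. This is an illustrative outline, not LifeOS code. The labels, the model tags in `ROUTES`, the prompt wording, and the `classify` callable are all assumptions; `classify` stands in for whatever call sends a prompt to the router model and returns its text reply.

```python
from typing import Callable

# Hypothetical routing table: label -> model tag. The tags are placeholders;
# use the names your local server actually lists.
ROUTES = {
    "long_context": "llama4:scout",
    "japanese": "qwen3:27b",
    "reasoning": "deepseek-r1:14b",
}
FALLBACK_MODEL = "qwen3:27b"

ROUTER_PROMPT = (
    "Classify the request into exactly one label: "
    + ", ".join(ROUTES)
    + ". Reply with the label only.\n\nRequest: {request}"
)


def route(request: str, classify: Callable[[str], str]) -> str:
    """Return the model tag that should handle `request`."""
    reply = classify(ROUTER_PROMPT.format(request=request))
    label = reply.strip().lower()
    # Small models sometimes wrap the label in extra words,
    # so accept any reply that contains a known label.
    for known, model in ROUTES.items():
        if known in label:
            return model
    return FALLBACK_MODEL
```

Passing the classifier in as a callable keeps the routing logic testable without a running server. In production it would wrap a chat-completion call to the Phi-4 Mini endpoint.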