Local models

Run on your own hardware.

Zero cost per token. Full data sovereignty. APPI-compliant by default. Llama 4 Scout · Qwen 3.6-27B · DeepSeek-R1 · Gemma 4 — the 2026 local lineup.

Llama 4 · Qwen 3.6 · DeepSeek-R1 · Gemma 4 · Ollama · LM Studio

Key models

Hardware guide

Size class | GPU VRAM | Apple Silicon | Example models
Nano · 3–4B | 4 GB | 8 GB | Phi-4 Mini, Qwen3-4B
Small · 7–8B | 6 GB | 16 GB | Llama 3.1-8B, Qwen3-8B
Medium · 14B | 10–12 GB | 32 GB | Phi-4-14B, DeepSeek-R1-14B
Large · 27–31B | 14–20 GB | 64 GB | Qwen 3.6-27B, Gemma 4-31B
XL · 70B | 35–40 GB | 192 GB | Llama 3.3-70B, Qwen3-72B
Server · 100B+ | 2× H100 | | Llama 4 Scout bf16, Mistral Small 4
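
The VRAM column roughly tracks the size of the quantized weights. As a back-of-the-envelope sketch, assuming memory is dominated by the weights, a lower bound is parameters × bits per weight ÷ 8. The function name and the 4-bit default below are illustrative, and the estimate ignores KV cache and runtime overhead, which is why the table's figures sit somewhat above these numbers.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float = 4.0) -> float:
    """Lower bound on memory needed for model weights, in GB (1 GB = 1e9 bytes).

    Ignores KV cache, activations, and runtime overhead, so real usage is higher.
    """
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


if __name__ == "__main__":
    # Compare 4-bit quantized weights against bf16 (16 bits per weight).
    for name, size in [("8B", 8), ("14B", 14), ("27B", 27), ("70B", 70)]:
        print(f"{name}: ~{weight_memory_gb(size):.1f} GB at 4-bit, "
              f"~{weight_memory_gb(size, 16):.1f} GB at bf16")
```

Treat the result as a floor when sizing hardware: leave headroom for context length, which grows the KV cache.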

Running locally

01

Install Ollama

Ollama is the fastest path on macOS and Linux. One command installs the runtime, a local API server, and the MLX backend for Apple Silicon.

Terminal window
brew install ollama # macOS
curl -fsSL https://ollama.com/install.sh | sh # Linux
02

Pull a model and start serving

Ollama downloads the quantized GGUF automatically. The server starts on port 11434 with an OpenAI-compatible endpoint.

Terminal window
ollama serve # http://localhost:11434; leave running (skip if Ollama already runs as a service)
ollama pull qwen3:27b # in a second terminal; ~16 GB download
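
To confirm the server is up and the model was pulled, one option is to query Ollama's native `/api/tags` endpoint, which lists locally available models. This is a minimal sketch using only the standard library; the helper names are hypothetical, and the default base URL assumes the port mentioned above.

```python
import json
import urllib.request


def parse_model_names(payload: dict) -> list[str]:
    """Extract model tags from the JSON body returned by Ollama's /api/tags."""
    return [m["name"] for m in payload.get("models", [])]


def list_local_models(base_url: str = "http://localhost:11434") -> list[str]:
    """Ask a running Ollama server which models have been pulled.

    Raises OSError (urllib.error.URLError) if the server is not reachable.
    """
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=5) as resp:
        return parse_model_names(json.load(resp))
```

After a successful pull, `list_local_models()` should include the tag you pulled; a connection error usually means `ollama serve` is not running.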
03

Or use LM Studio for a GUI

LM Studio 0.4+ ships a model browser, GGUF downloader, and local server toggle. Enable the server and it listens on port 1234 — same OpenAI-compatible API.

# After enabling Local Server in LM Studio settings:
http://localhost:1234/v1

LifeOS integration

Ollama, LM Studio, vLLM, and llama.cpp each expose an OpenAI-compatible REST endpoint. No adapter layer is needed: swap the base_url, and any LifeOS agent that reaches its model through an OpenAI-compatible client can call a local model instead.

Direct — OpenAI SDK
from openai import OpenAI

# Point the standard OpenAI client at the local Ollama server.
client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required, unused
)
response = client.chat.completions.create(
    model="qwen3:27b",
    messages=[{"role": "user", "content": "..."}],
)
LangChain agents
from langchain_ollama import ChatOllama
llm = ChatOllama(model="qwen3:27b")
# Drop-in replacement for ChatAnthropic
# or ChatOpenAI in any LangChain chain
Tool | Endpoint | Framework packages
Ollama | localhost:11434/v1 | langchain-ollama · llama-index-llms-ollama
LM Studio | localhost:1234/v1 | openai SDK (base_url override)
vLLM / llama.cpp | configurable | openai SDK (base_url override)
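
Because the endpoints share one request shape, switching tools comes down to changing the base URL. The sketch below illustrates that with the standard library only, so it runs without the `openai` package. The `BASE_URLS` dictionary and both helper names are hypothetical; only the two fixed endpoints from the table are included, and vLLM or llama.cpp entries would need whatever host and port you configured.

```python
import json
import urllib.request

# Base URLs taken from the table above; add your own for vLLM or llama.cpp.
BASE_URLS = {
    "ollama": "http://localhost:11434/v1",
    "lmstudio": "http://localhost:1234/v1",
}


def build_chat_request(tool: str, model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=f"{BASE_URLS[tool]}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Placeholder key; local servers typically ignore it.
            "Authorization": "Bearer local",
        },
        method="POST",
    )


def send(request: urllib.request.Request) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=120) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]
```

For example, `send(build_chat_request("ollama", "qwen3:27b", "Hello"))` targets Ollama, and changing the first argument to `"lmstudio"` targets LM Studio with no other code changes.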

Pick the right model

01

Long-context ingestion — entire codebase or year of documents

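Use Llama 4 Scout: its 10M-token context window is unique among local models. Load a full codebase, a year of email, or a legal document library in one call. Scout is a mixture-of-experts model with 109B total parameters (17B active per token), and all of them must be resident in memory, so int4 weights alone take roughly 55 GB. That fits a single 80 GB H100 but not a 24 GB RTX 4090, and long contexts need additional memory for the KV cache.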

02

Japanese-language tasks · multilingual coding agents

Use Qwen 3.6-27B: highest Japanese benchmark scores among open weights as of mid-2025. The Apache 2.0 license permits commercial use. Fits in 16 GB VRAM at Q4, so a single laptop GPU handles it.

03

Reasoning · math · structured analysis

Use DeepSeek-R1-Distill-Qwen-14B: a Qwen2.5-14B model fine-tuned on reasoning traces generated by the full R1. MIT licensed. Matches o1-mini on MATH and AIME at 10 GB VRAM, which makes it a strong reasoning-per-GB choice among local models.

04

Fast routing · classification · lightweight tasks

Use Phi-4 Mini 3.8B (Microsoft, MIT) — runs at 4 GB VRAM with ~200 tokens/s on consumer hardware. In LifeOS, deploy as the task router: classify inbound requests and forward to the right specialist model. Cost: zero per call.
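
The router pattern described above can be sketched as a classifier that returns a label, plus a lookup from label to specialist model. Everything here is an assumption for illustration: the label names, the model tags in `SPECIALISTS`, and the keyword classifier, which only stands in for a real call to the small routing model.

```python
from typing import Callable

# Hypothetical label -> model mapping; the tags are placeholders to adapt.
SPECIALISTS = {
    "long_context": "llama4:scout",
    "japanese": "qwen3:27b",
    "reasoning": "deepseek-r1:14b",
}
DEFAULT_MODEL = "phi4-mini"  # the router handles anything it cannot classify


def keyword_classifier(text: str) -> str:
    """Stand-in for the router model.

    A real deployment would prompt the small model to return one of the
    labels above instead of matching keywords.
    """
    lowered = text.lower()
    if any(word in lowered for word in ("prove", "solve", "calculate")):
        return "reasoning"
    if any("\u3040" <= ch <= "\u30ff" for ch in text):  # hiragana or katakana
        return "japanese"
    if len(text) > 20_000:  # arbitrary threshold for "large input"
        return "long_context"
    return "other"


def route(text: str, classify: Callable[[str], str] = keyword_classifier) -> str:
    """Return the model tag that should handle this request."""
    return SPECIALISTS.get(classify(text), DEFAULT_MODEL)
```

Passing the classifier as a parameter keeps the routing table testable on its own: swap `keyword_classifier` for a function that calls the local model once that is wired up.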

Local vs. frontier