Local models

Run on your own hardware.

Zero cost per token. Full data sovereignty. APPI-compliant by default. Llama 4 Scout · Qwen 3.6-27B · DeepSeek-R1 · Gemma 4 — the 2026 local lineup.

Llama 4

Qwen 3.6

DeepSeek-R1

Gemma 4

Ollama

LM Studio

Key models

Hardware guide

Size class	GPU VRAM	Apple Silicon	Example models
Nano · 3–4B	4 GB	8 GB	Phi-4 Mini, Qwen3-4B
Small · 7–8B	6 GB	16 GB	Llama 3.1-8B, Qwen3-8B
Medium · 14B	10–12 GB	32 GB	Phi-4-14B, DeepSeek-R1-14B
Large · 27–31B	14–20 GB	64 GB	Qwen 3.6-27B, Gemma 4-31B
XL · 70B	35–40 GB	192 GB	Llama 3.3-70B, Qwen3-72B
Server · 100B+	2× H100	—	Llama 4 Scout bf16, Mistral Small 4
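The VRAM column can be approximated with simple arithmetic: parameter count times bytes per weight, plus headroom for the KV cache and runtime buffers. The sketch below is a rough estimate only. The 20% overhead factor is an assumption, not a figure from the table, and the table's own numbers may assume a different quantization or context length.

```python
def estimate_vram_gb(params_billion: float, bits_per_weight: float,
                     overhead: float = 0.2) -> float:
    """Rough memory estimate in GB: weight bytes plus a fractional overhead.

    overhead=0.2 is an assumed allowance for KV cache and runtime buffers.
    """
    # 1e9 parameters * (bits / 8) bytes each, expressed in GB (1e9 bytes)
    weight_gb = params_billion * bits_per_weight / 8
    return round(weight_gb * (1 + overhead), 1)


if __name__ == "__main__":
    for label, size_b in [("8B", 8), ("14B", 14), ("27B", 27), ("70B", 70)]:
        print(f"{label}: ~{estimate_vram_gb(size_b, 4)} GB at 4-bit")
```

Raising `bits_per_weight` to 8 or 16 shows how quickly the same model outgrows a consumer GPU.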

Running locally

Install Ollama

Ollama is the fastest path on macOS and Linux. One command installs the runtime, a local API server, and the MLX backend for Apple Silicon.

brew install ollama        # macOS
curl -fsSL https://ollama.com/install.sh | sh  # Linux

Pull a model and start serving

Ollama downloads the quantized GGUF automatically. The server starts on port 11434 with an OpenAI-compatible endpoint.

ollama serve &             # http://localhost:11434 (skip if Ollama already runs as a service)
ollama pull qwen3:27b      # ~16 GB download; needs the server running
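To confirm the server is up and the model has been pulled, you can query Ollama's native `/api/tags` endpoint, which lists locally available models. This is a minimal standard-library sketch. The helper names are illustrative, and the URL assumes the default port mentioned above.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default port; adjust if you changed it


def parse_model_names(payload: dict) -> list[str]:
    """Extract model names from an /api/tags response body."""
    return [entry["name"] for entry in payload.get("models", [])]


def list_local_models(base_url: str = OLLAMA_URL, timeout: float = 5.0) -> list[str]:
    """Ask the running Ollama server which models are available locally."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return parse_model_names(json.load(resp))
```

Calling `list_local_models()` raises `urllib.error.URLError` when nothing is listening on the port, which is a quick signal that `ollama serve` is not running.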

Or use LM Studio for a GUI

LM Studio 0.4+ ships a model browser, GGUF downloader, and local server toggle. Enable the server and it listens on port 1234 — same OpenAI-compatible API.

# After enabling Local Server in LM Studio settings:
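A minimal sketch of a chat request against the LM Studio server, using only the standard library. It assumes the default port 1234 from the text. The model identifier is a placeholder for whatever name LM Studio shows for the model you loaded, and the helper names are illustrative.

```python
import json
import urllib.request

LM_STUDIO_URL = "http://localhost:1234/v1"  # default LM Studio server address


def build_chat_request(prompt: str, model: str,
                       base_url: str = LM_STUDIO_URL) -> urllib.request.Request:
    """Build an OpenAI-style chat completion request without sending it."""
    body = json.dumps({
        "model": model,  # placeholder: use the identifier LM Studio displays
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(req: urllib.request.Request, timeout: float = 60.0) -> str:
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]
```

Because the request shape is the OpenAI one, the same `build_chat_request` works against Ollama by passing `base_url="http://localhost:11434/v1"`.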

LifeOS integration

Each local inference tool listed below exposes an OpenAI-compatible REST endpoint, so no adapter layer is needed: swap the base_url, and a LifeOS agent that currently calls Claude through an OpenAI-style client can call a local model instead.

Direct — OpenAI SDK

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",       # required, unused
)
response = client.chat.completions.create(
    model="qwen3:27b",
    messages=[{"role": "user", "content": "..."}],
)

LangChain agents

from langchain_ollama import ChatOllama

llm = ChatOllama(model="qwen3:27b")

# Drop-in replacement for ChatAnthropic
# or ChatOpenAI in any LangChain chain

Tool	Endpoint	Framework packages
Ollama	localhost:11434/v1	langchain-ollama · llama-index-llms-ollama
LM Studio	localhost:1234/v1	openai SDK (base_url override)
vLLM / llama.cpp	configurable	openai SDK (base_url override)
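One way to keep the base_url swap in a single place is a small lookup keyed by tool name. The sketch below uses the two fixed endpoints from the table. The `LOCAL_LLM_BASE_URL` environment variable and the function name are hypothetical, introduced here to cover the "configurable" vLLM and llama.cpp case.

```python
import os

# Fixed defaults taken from the table above
LOCAL_ENDPOINTS = {
    "ollama": "http://localhost:11434/v1",
    "lmstudio": "http://localhost:1234/v1",
}


def resolve_base_url(tool: str, env=None) -> str:
    """Return the OpenAI-compatible base URL for a local tool.

    LOCAL_LLM_BASE_URL (a hypothetical variable) overrides the defaults,
    which is how a vLLM or llama.cpp server on a custom port would be set.
    """
    env = os.environ if env is None else env
    override = env.get("LOCAL_LLM_BASE_URL")
    if override:
        return override
    try:
        return LOCAL_ENDPOINTS[tool.lower()]
    except KeyError:
        raise ValueError(
            f"No default endpoint for {tool!r}; set LOCAL_LLM_BASE_URL"
        ) from None
```

The returned string can be passed straight to the `base_url` argument of the OpenAI client shown earlier.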

Pick the right model

Long-context ingestion — entire codebase or year of documents

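Use Llama 4 Scout: its 10M-token context window is unique among local models. Load a full codebase, a year of email, or a legal document library in one call. Scout is a 109B-parameter mixture-of-experts model (17B active), so int4 weights alone take roughly 55 GB. Plan on a single H100 (80 GB) rather than a 24 GB RTX 4090, and budget extra memory for the KV cache as the context grows.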
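Before sending a large corpus in one call, it helps to check roughly whether it fits. The sketch below uses a characters-per-token heuristic. The ratio of 4 is an assumption that varies by tokenizer and language (Japanese text usually yields more tokens per character), and the reserve for the model's reply is also an assumed value.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Heuristic token count; chars_per_token=4 is a rough assumption."""
    return math.ceil(len(text) / chars_per_token)


def fits_in_context(texts: list[str], context_tokens: int = 10_000_000,
                    reserve_tokens: int = 8_000) -> tuple[bool, int]:
    """Return (fits, estimated_total) for a batch of documents.

    reserve_tokens is an assumed allowance for the prompt and the reply.
    """
    total = sum(estimate_tokens(t) for t in texts)
    return total + reserve_tokens <= context_tokens, total
```

For an exact count, tokenize with the model's own tokenizer instead of relying on the heuristic.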

Japanese-language tasks · multilingual coding agents

Use Qwen 3.6-27B: highest Japanese benchmark scores among open-weight models as of mid-2025. The Apache 2.0 license permits commercial use. At Q4 it fits in 16 GB of VRAM, so a single laptop GPU can run it.

Reasoning · math · structured analysis

Use DeepSeek-R1-Distill-14B: a 14B model fine-tuned on reasoning traces generated by the full R1. MIT licensed. Matches o1-mini on MATH and AIME at 10 GB VRAM. Best reasoning-per-GB of any local model.

Fast routing · classification · lightweight tasks

Use Phi-4 Mini 3.8B (Microsoft, MIT) — runs at 4 GB VRAM with ~200 tokens/s on consumer hardware. In LifeOS, deploy as the task router: classify inbound requests and forward to the right specialist model. Cost: zero per call.
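The router pattern described above can be sketched as two pure functions: one builds the classification prompt for the small model, the other maps its one-word reply to a specialist. The label set and the model tags in `ROUTES` are illustrative assumptions, not LifeOS configuration. Substitute the tags your runtime actually serves.

```python
# Hypothetical label -> model tag mapping; adjust to the models you have pulled
ROUTES = {
    "reasoning": "deepseek-r1:14b",
    "japanese": "qwen3:27b",
    "long_context": "llama4:scout",
    "general": "phi4-mini",
}


def build_router_prompt(request: str) -> list[dict]:
    """Messages asking the router model to answer with exactly one label."""
    labels = ", ".join(ROUTES)
    return [
        {"role": "system",
         "content": f"Classify the request. Answer with exactly one label from: {labels}."},
        {"role": "user", "content": request},
    ]


def pick_model(router_reply: str, default: str = "general") -> str:
    """Map the router's reply to a model tag, falling back to the default."""
    label = router_reply.strip().lower()
    return ROUTES.get(label, ROUTES[default])
```

In use, the messages from `build_router_prompt` go to the Phi-4 Mini endpoint, and the original request is then sent to whichever model `pick_model` returns. The fallback keeps an unexpected reply from the router from breaking the pipeline.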