llama.cpp

AI that lives on your computer. Open-source, private & always local.

Run frontier AI entirely on your machine. No API keys, no telemetry, no limits. Own your models and conversation data.

curl -LsSf https://llama.app/install.sh | sh

Prefer Brew or Winget? Package managers · Rather build from source? Follow instructions

Pair it with a local coding agent.

Run llama serve, install the pi-llama plugin and launch Pi. It will automatically discover your local model. No config, no API keys. Files stay on your machine, requests never leave it.

# 1. Serve a model
llama serve

# 2. Install the pi-llama plugin
pi install git:github.com/huggingface/pi-llama

# 3. Run Pi, everything is set
pi

Optimized for any hardware.

From your laptop to a cluster, llama.cpp runs on whatever you have. Same binary, same models, same hand-tuned kernels for every GPU and CPU.

Apple Silicon

M Ultra

RTX 5090

CPU

Jetson

H100

MI300

RTX 4090

A100

M Pro

M Max

DGX Spark

Radeon RX

B200

Intel Arc

RTX 3090

Run your first model

Qwen 3.6

Alibaba's next-gen natively multimodal reasoning models. Dense and MoE variants that rival models many times their size on coding and vision tasks.

Gemma 4

Google's most capable open models, built from Gemini 3 technology. Supports multimodal reasoning, agentic workflows, and 140+ languages.

GPT-OSS

OpenAI's first open-weight models since GPT-2. Built for reasoning, agentic tasks, and developer use with function calling and tool use capabilities.

Gemma 3

Google's multimodal models built from Gemini technology. Supports 140+ languages, vision, and text tasks with up to 128K context for edge to cloud deployment.

Browse all models