Ollama
An open-source tool to run open-weight LLMs locally with a single command, including a local OpenAI-compatible API.
Ollama is the simplest way to run open-weight LLMs on your own machine. Install it, run ollama run llama3, and you have a model answering prompts locally — no API key, no account, and nothing leaving your computer. It handles downloading and quantizing models, manages a local model library, and exposes a local API (including OpenAI-compatible endpoints) so you can build against a model running on localhost.
It is aimed at developers who want a model for local development, prototyping, privacy-sensitive work, or offline use. Ollama is about single-machine convenience — it's how you try an open model or wire one into an app on your laptop, not how you serve thousands of concurrent users.
Highlights
- One-command run —
ollama run <model>pulls and runs a model with no setup; a curated library covers popular open models. - Local API — a REST API plus OpenAI-compatible endpoints, so app code can target a local model by changing the base URL.
- Customizable — a
Modelfilelets you set system prompts, parameters, and templates, or import your own GGUF weights. - Cross-platform — native apps for macOS, Windows, and Linux; runs on CPU or GPU depending on your hardware.
- Private and offline — models run entirely on your machine, so no data leaves it and it works without a connection.
In an AI-assisted workflow
Run a model and call its local OpenAI-compatible endpoint from your app:
ollama run llama3.1 # pull + chat in the terminal
# or serve and call it like OpenAI:
# base_url="http://localhost:11434/v1" (any OpenAI client)TIP
Model size and quantization decide whether a model fits your RAM/VRAM and how fast it runs — start with a smaller or more-quantized variant and size up. For a GUI alternative to the CLI, see LM Studio.
Good to know
Ollama is free and open source under MIT for local use on your own machine, and runs on macOS, Windows, and Linux; an optional paid Ollama Cloud (Pro/Max) runs larger hosted models but isn't required. It's built for local, single-user use; when you need to serve a model to many concurrent users in production, move to a dedicated serving engine like vLLM and weigh the trade-offs in Self-Host vs API.
Related
- LM StudioA desktop app for discovering, downloading, and running open-weight LLMs locally with a GUI and a local OpenAI-compatible server.
- Self-Host vs API: When Does Running Your Own LLM Actually Pay Off?The real economics of self-hosting an LLM vs. calling a hosted API — GPU utilization, privacy, latency, and the hidden ops costs that decide the crossover.
- LLM Inference EngineerUse this agent to serve and optimize self-hosted LLM inference — sizing GPUs, configuring a serving engine like vLLM (continuous batching, PagedAttention, tensor parallelism), applying quantization, and tuning throughput and tail latency against a cost and p95 budget. Examples — "serve Llama-3-70B at p95 under 2s on our GPUs", "our self-hosted model is slow and the GPUs sit half-idle — raise throughput", "quantize this model to fit one GPU without wrecking quality".
- vLLMA high-throughput, memory-efficient inference and serving engine for LLMs, with PagedAttention, continuous batching, and an OpenAI-compatible API server.
- Scaffold a vLLM Serving ConfigScaffold a vLLM serving config for a model on a target GPU — pick precision/quantization and parallelism to fit, set batching and context length, and expose an OpenAI-compatible server.