Ollama
| Field | Details |
|---|---|
| Category | Local AI runtime and model manager |
| Primary use | Download, run, customize, and serve open models on developer machines and small servers |
| Interfaces | CLI, local REST API, OpenAI-compatible API, Python library, JavaScript library |
| Default API | http://localhost:11434/api |
| Model format | Primarily GGUF-based packaged models distributed through the Ollama model library |
| Best fit | Fast local experimentation, privacy-sensitive prototypes, offline assistants, and lightweight app integration |
| Last reviewed | 2026-04-29 |
English
Overview
Ollama is a local AI runtime that makes open-weight model usage feel like a package manager plus a local inference server. A user can pull a model, run it from the CLI, and expose it through a local API without manually wiring tokenizers, quantized model files, inference backends, and server processes.
It is strongest when the priority is simplicity: getting models such as Llama, Gemma, Qwen, DeepSeek, Mistral, or embedding models running quickly on macOS, Windows, and Linux, from laptops and workstations to small servers.
Why it matters
- It lowers the setup cost for local LLM development.
- It gives teams a local endpoint for privacy-sensitive testing before using cloud inference.
- It standardizes model lifecycle tasks such as pull, list, show, copy, delete, and run.
- It supports streaming generation and chat APIs that are easy to wire into tools.
- It can act as a local backend for editors, agents, notebooks, and internal apps (see the OpenAI-compatible client sketch after this list).
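As a concrete illustration of the last point, many clients that already speak the OpenAI protocol can be pointed at Ollama's compatibility endpoint instead of a cloud service. The sketch below is a minimal example, assuming the openai Python package, a running Ollama daemon, and a previously pulled gemma3 model; the model name and prompt are illustrative.

```python
# Minimal sketch: use an OpenAI-style client against the local Ollama server.
# Assumes `pip install openai` and that `ollama pull gemma3` has already been run.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # Ollama's OpenAI-compatible endpoint
    api_key="ollama",  # required by the client library, not validated locally
)

reply = client.chat.completions.create(
    model="gemma3",
    messages=[{"role": "user", "content": "Summarize what a local model runtime does."}],
)
print(reply.choices[0].message.content)
```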
Architecture/Concepts
| Concept | Meaning |
|---|---|
| Ollama daemon | Local service that manages loaded models and serves API requests. |
| CLI | Commands such as ollama run, ollama pull, ollama list, and ollama rm. |
| Model tag | Model names use a model:tag pattern; omitted tags default to latest. |
| Modelfile | Configuration for creating customized model variants with parameters, templates, adapters, or system prompts. |
| Local API | REST API for generation, chat, embeddings, model management, and running-model inspection. |
| OpenAI-compatible API | Compatibility layer for clients that expect OpenAI-style chat/completion endpoints. |
| Quantization | Smaller model representations reduce memory use and make local inference practical. |
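To make the Modelfile concept concrete, a minimal sketch might look like the following; the base model, parameters, and system prompt are illustrative rather than recommended values.

```
# Modelfile — minimal sketch of a customized variant.
# Base model to build on; it must already be pulled or pullable.
FROM gemma3

# Sampling and context settings baked into the variant.
PARAMETER temperature 0.2
PARAMETER num_ctx 8192

# Fixed system prompt for this variant.
SYSTEM You are a concise assistant that answers questions about internal documentation.
```

Build the variant with `ollama create docs-assistant -f Modelfile`, run it with `ollama run docs-assistant`, and inspect the layered configuration with `ollama show docs-assistant` (the variant name here is just an example).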
Practical usage
Use Ollama when:
- You want the fastest path from "install" to a working local model.
- You are building a prototype that should run on a laptop or single workstation.
- You need simple local APIs for chat, generation, or embeddings (see the embeddings sketch after this list).
- You want users to manage models with familiar CLI commands.
- You value operational simplicity over maximum serving throughput.
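For the embeddings case, a minimal Python sketch against the local embeddings endpoint could look like this; it assumes the requests package and that an embedding model such as nomic-embed-text has already been pulled.

```python
# Minimal sketch: request an embedding from the local Ollama API.
# Assumes `pip install requests` and a pulled embedding model; the name is illustrative.
import requests

resp = requests.post(
    "http://localhost:11434/api/embeddings",
    json={"model": "nomic-embed-text", "prompt": "local model serving"},
    timeout=60,
)
resp.raise_for_status()
vector = resp.json()["embedding"]
print(len(vector))  # embedding dimensionality, e.g. 768 for many small embedding models
```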
Basic workflow:
```bash
ollama pull gemma3
ollama run gemma3

curl http://localhost:11434/api/generate -d '{
  "model": "gemma3",
  "prompt": "Explain local model serving in one paragraph."
}'
```
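The same server can be called from a script. Below is a minimal, non-streaming sketch against the chat endpoint; it assumes the requests package and a previously pulled gemma3 model.

```python
# Minimal sketch: call the local Ollama chat endpoint without streaming.
# Assumes `pip install requests` and that `ollama pull gemma3` has already run.
import requests

resp = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "gemma3",
        "messages": [
            {"role": "user", "content": "Explain local model serving in one paragraph."}
        ],
        "stream": False,  # return one JSON object instead of streamed chunks
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["message"]["content"])
```

Without `"stream": false`, both /api/generate and /api/chat stream one JSON object per line, which is the behaviour the curl command above shows.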
Operational cautions:
- Check model memory requirements before distributing a local workflow.
- Pin model tags for reproducible demos and tests.
- Do not expose the local API on a network without access controls.
- Benchmark with the exact context length and quantization level you plan to use (see the sketch after this list).
- Use vLLM or SGLang instead when multi-GPU throughput and production batching are the main goals.
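As a rough sketch of that benchmarking advice, the script below times one non-streamed generation at a chosen context length and derives decode throughput from the statistics returned by /api/generate; the model tag and num_ctx value are illustrative.

```python
# Rough sketch: measure decode throughput for one non-streamed generation.
# Assumes `pip install requests` and a pulled model; the tag and num_ctx are illustrative.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "gemma3:4b",
        "prompt": "Write a 200-word summary of local model serving.",
        "stream": False,
        "options": {"num_ctx": 8192},  # benchmark at the context length you plan to ship
    },
    timeout=600,
)
resp.raise_for_status()
stats = resp.json()
# Durations are reported in nanoseconds; eval_count is the number of generated tokens.
tokens_per_s = stats["eval_count"] / (stats["eval_duration"] / 1e9)
print(f"total {stats['total_duration'] / 1e9:.1f}s, decode {tokens_per_s:.1f} tok/s")
```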
Learning checklist
- Install Ollama and run one chat model locally.
- Pull, list, inspect, and delete models with the CLI.
- Call /api/generate and /api/chat from a script.
- Understand model tags and why pinning matters.
- Create a simple customized model with a Modelfile.
- Compare latency and memory use across two quantization levels or model sizes.
- Decide when Ollama is enough and when a production serving engine is needed.
繁體中文
概覽
Ollama 是本機 AI runtime,將開源模型的使用體驗簡化成「模型套件管理器」加「本機推論伺服器」。使用者可以拉取模型、用 CLI 執行,並透過本機 API 供應服務,不需要手動組合 tokenizer、量化模型檔、推論 backend 與 server process。
它最適合追求簡單快速的情境,例如在 macOS、Windows、Linux、工作站或小型伺服器上快速執行 Llama、Gemma、Qwen、DeepSeek、Mistral 或 embedding 模型。
為什麼重要
- 降低本機 LLM 開發的設定成本。
- 讓團隊能先用本機 endpoint 測試隱私敏感工作流。
- 標準化模型拉取、列表、檢視、複製、刪除與執行。
- 支援容易整合到工具中的串流生成與聊天 API。
- 可作為編輯器、Agent、Notebook 與內部應用的本機 backend。
架構/概念
| 概念 | 說明 |
|---|---|
| Ollama daemon | 管理已載入模型並提供 API 的本機服務。 |
| CLI | ollama run、ollama pull、ollama list、ollama rm 等指令。 |
| Model tag | 模型名稱採用 model:tag 格式;省略 tag 時預設為 latest。 |
| Modelfile | 用於建立客製模型變體,可設定參數、template、adapter 或 system prompt。 |
| 本機 API | 提供生成、聊天、embedding、模型管理與執行中模型查詢的 REST API。 |
| OpenAI-compatible API | 讓 OpenAI 風格 client 可連接 Ollama 的相容介面。 |
| Quantization | 透過較小的模型表示降低記憶體需求,使本機推論更可行。 |
實務使用
適合使用 Ollama 的情境:
- 想用最短路徑從安裝進到可用的本機模型。
- 要建立可在 laptop 或單機工作站執行的原型。
- 需要簡單的本機聊天、生成或 embedding API。
- 希望使用者能用熟悉的 CLI 管理模型。
- 操作簡單比極限吞吐量更重要。
基本流程:
```bash
ollama pull gemma3
ollama run gemma3

curl http://localhost:11434/api/generate -d '{
  "model": "gemma3",
  "prompt": "Explain local model serving in one paragraph."
}'
```
營運注意事項:
- 發布本機工作流前,先確認模型記憶體需求。
- 為可重現的展示與測試固定 model tag。
- 不要在沒有存取控制的情況下把本機 API 暴露到網路。
- 用實際要使用的 context length 與量化等級 benchmark。
- 若主要目標是多 GPU 吞吐量與生產級 batching,應改評估 vLLM 或 SGLang。
學習檢核表
- 安裝 Ollama 並在本機執行一個聊天模型。
- 使用 CLI 拉取、列出、檢視與刪除模型。
- 從 script 呼叫 /api/generate 與 /api/chat。
- 理解 model tag 以及固定版本的重要性。
- 用 Modelfile 建立簡單客製模型。
- 比較兩種量化等級或模型大小的延遲與記憶體使用。
- 判斷何時 Ollama 足夠,何時需要生產級 serving engine。