Ollama | AI Insight Note

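Ollama is an open-source tool for running large language models (LLMs) in a local environment. A single command downloads and runs any of hundreds of models, including Llama 3, Mistral, Gemma, Phi, DeepSeek, and Qwen, and the server exposes an OpenAI-compatible REST API automatically. Internally it uses llama.cpp (created by Georgi Gerganov) as its inference engine; the server layer is written in Go (67.5% of the codebase) and the inference core in C/C++ (27.6%).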

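Key features

Item	Description
Installation	One-click installers for macOS, Linux, and Windows
Supported models	100+ models, including Llama 3, Mistral, Gemma, Phi, DeepSeek-R1, Qwen3, and CodeLlama
API server	OpenAI-compatible REST API served automatically (port 11434)
GPU support	NVIDIA CUDA, AMD ROCm, Apple Metal (M series), Vulkan (experimental)
Model formats	GGUF and Safetensors
Inference engine	llama.cpp (C/C++): GGUF inference optimized for CPU and GPU
Implementation languages	Go (server and API) + C/C++ (llama.cpp core)
Privacy	No data leaves the machine when models run locally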

Installation

Platform	Method
macOS / Windows	Download the installer from https://ollama.com/download
Linux	`curl -fsSL https://ollama.com/install.sh | sh`
Docker	CPU: `docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama`

bash

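# Docker with an NVIDIA GPU (requires nvidia-container-toolkit to be installed first)
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

# Run a model inside the container (the container name "ollama" is set by --name)
docker exec -it ollama ollama run llama3.2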

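CLI commands

Running and managing models

bash

# Run a model (downloaded automatically if missing)
ollama run llama3.2

# Multimodal: run with an image
ollama run gemma3 ./image.png "What is in this image?"

# Start the API server (stays in the foreground until stopped)
ollama serve

# Download a model
ollama pull mistral

# List installed models
ollama list        # or: ollama ls

# Show models that are currently loaded
ollama ps

# Stop a model
ollama stop llama3.2

# Delete a model
ollama rm llama3.2

# Show model information
ollama show llama3.2
ollama show --modelfile llama3.2   # print the Modelfile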

Custom models and integrations

bash

# Create a custom model from a Modelfile
ollama create my-model -f ./Modelfile

# Quantize while creating the model (FP16 → q4_K_M)
ollama create my-model -f ./Modelfile -q q4_K_M

# Launch an integrated tool (Claude Code, Codex, VS Code, etc.)
ollama launch claude
ollama launch codex --model qwen3

Modelfile

A Modelfile is a configuration file that defines a custom model. Its syntax is similar to that of a Dockerfile.

Full list of instructions

dockerfile

# Base model (required; use one of the following forms)
FROM llama3.2
FROM ./model.gguf           # path to a GGUF file
FROM /path/to/safetensors/  # Safetensors directory

# Runtime parameters
PARAMETER temperature 0.7   # creativity (default: 0.8)
PARAMETER num_ctx 8192       # context window size (default: 2048 in older releases, 4096 in newer ones)
PARAMETER top_k 40           # sample only from the k most likely tokens (default: 40)
PARAMETER top_p 0.9          # cumulative token probability cutoff (default: 0.9)
PARAMETER repeat_penalty 1.1 # repetition penalty (default: 1.1)
PARAMETER repeat_last_n 64   # how many recent tokens the penalty looks at (default: 64)
PARAMETER seed 42            # random seed (default: 0, i.e. random)
PARAMETER num_predict -1     # maximum tokens to generate (-1 = unlimited)
PARAMETER stop "User:"       # stop sequence that ends generation
PARAMETER min_p 0.05         # minimum probability threshold

# System prompt
SYSTEM """
You are a coding assistant that specializes in Korean.
Always write code comments in Korean.
"""

# Prompt template (Go template syntax)
TEMPLATE """{{ if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}<|im_start|>user
{{ .Prompt }}<|im_end|>
<|im_start|>assistant
"""

# Apply a LoRA adapter
ADAPTER ./lora-adapter.gguf

# Example conversation (few-shot)
MESSAGE user Is Seoul in Korea?
MESSAGE assistant Yes, Seoul is the capital of South Korea.

# Minimum required Ollama version
REQUIRES 0.14.0

# License
LICENSE """MIT License ..."""

Practical example

dockerfile

# Modelfile
FROM llama3.2
PARAMETER temperature 0.3
PARAMETER top_p 0.9
PARAMETER num_ctx 8192

SYSTEM """
You are a coding assistant that specializes in Korean.
"""

bash

ollama create my-coder -f Modelfile
ollama run my-coder
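
When several variants of a model are needed, the Modelfile can be generated from a script instead of being written by hand. The following is a minimal sketch: `render_modelfile` is a hypothetical helper (not part of Ollama) that emits only the `FROM`, `PARAMETER`, and `SYSTEM` instructions used in the example above.

```python
def render_modelfile(base: str, parameters: dict, system: str = "") -> str:
    """Render Modelfile text from a base model, parameters, and a system prompt."""
    lines = [f"FROM {base}"]
    for name, value in parameters.items():
        lines.append(f"PARAMETER {name} {value}")
    if system:
        # Triple quotes let the system prompt span several lines.
        lines.append(f'SYSTEM """\n{system}\n"""')
    return "\n".join(lines) + "\n"


modelfile = render_modelfile(
    "llama3.2",
    {"temperature": 0.3, "top_p": 0.9, "num_ctx": 8192},
    system="You are a coding assistant that specializes in Korean.",
)
print(modelfile)
```

Writing the returned text to a file named `Modelfile` and running `ollama create my-coder -f Modelfile` should then behave like the hand-written version.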

REST API

Default base URL: http://localhost:11434

Chat completion: `POST /api/chat`

bash

curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [
    {"role": "user", "content": "Implement the Fibonacci sequence in Python"}
  ],
  "stream": false
}'

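Main request fields

Field	Type	Description
`model`	string	Model name (required)
`messages`	array	Array of messages, each with `role` and `content` (required)
`stream`	bool	Whether to stream the response (default: true)
`format`	string/object	Response format (`"json"` or a JSON schema)
`options`	object	Model parameters such as temperature and top_k
`tools`	array	List of function tools
`think`	bool/string	Return the thinking process (`true`, or `"high"` / `"medium"` / `"low"`)
`keep_alive`	string	How long to keep the model in memory (e.g. `"5m"`)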
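Main response fields

Field	Description
`message.content`	Response text from the model
`message.thinking`	Thinking process (when `think` is enabled)
`message.tool_calls`	List of tool calls
`done`	Whether generation has finished
`total_duration`	Total processing time (nanoseconds)
`prompt_eval_count`	Number of input tokens
`eval_count`	Number of output tokens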
Text generation: `POST /api/generate`

bash

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Why is the sky blue?",
  "stream": false
}'
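
The same call can be made from Python with only the standard library. This is a sketch rather than an official client: `build_generate_request` and `generate` are hypothetical names, and `generate` assumes an Ollama server is listening on the default base URL.

```python
import json
import urllib.request

BASE_URL = "http://localhost:11434"  # default base URL


def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a non-streaming POST /api/generate request."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False})
    return urllib.request.Request(
        f"{BASE_URL}/api/generate",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str) -> str:
    """Send the request and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_generate_request(model, prompt)) as resp:
        return json.loads(resp.read())["response"]


req = build_generate_request("llama3.2", "Why is the sky blue?")
print(req.get_method(), req.full_url)
```

Separating request construction from sending keeps the payload easy to inspect and test without a server.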

Embedding generation: `POST /api/embed`

bash

curl -X POST http://localhost:11434/api/embed -d '{
  "model": "embeddinggemma",
  "input": ["text 1", "text 2"]
}'

Model management

bash

GET  /api/tags          # list installed models
GET  /api/ps            # list running models
POST /api/pull          # download a model
POST /api/push          # upload a model
POST /api/copy          # copy a model
DELETE /api/delete      # delete a model
POST /api/create        # create a model from a Modelfile
POST /api/show          # show model details
GET  /api/version       # get the Ollama version
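
As an illustration of consuming one of these endpoints, the sketch below turns a `GET /api/tags` reply into a list of model names and sizes. `list_models` is a hypothetical helper, and the sample payload (names and byte counts) is invented; a real reply carries more fields per model.

```python
def list_models(tags: dict) -> list:
    """Return (name, size in GiB) pairs from a GET /api/tags reply."""
    return [
        (model["name"], round(model["size"] / 2**30, 2))
        for model in tags.get("models", [])
    ]


# Invented sample payload for illustration.
sample_tags = {
    "models": [
        {"name": "llama3.2:latest", "size": 2147483648},
        {"name": "mistral:latest", "size": 4294967296},
    ]
}
print(list_models(sample_tags))
```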

OpenAI-compatible API

The existing OpenAI SDK can connect to Ollama as is; only the base URL and a placeholder API key need to be set.

python

from openai import OpenAI

client = OpenAI(
    base_url='http://localhost:11434/v1',
    api_key='ollama',  # any string works
)

response = client.chat.completions.create(
    model='llama3.2',
    messages=[{'role': 'user', 'content': 'Implement the Fibonacci sequence in Python'}],
    temperature=0.7,
)
print(response.choices[0].message.content)

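Supported endpoints

Endpoint	Description
`/v1/chat/completions`	Chat completion (streaming, vision, and tool calling supported)
`/v1/completions`	Text completion
`/v1/embeddings`	Embedding generation
`/v1/responses`	Tool calling and reasoning summaries (v0.13.3+)
`/v1/images/generations`	Image generation (experimental)
`/v1/models`	List models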

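Anthropic API compatibility

In addition to OpenAI compatibility, Ollama also emulates the Anthropic Claude API. This makes it possible to run tools built on the Anthropic SDK, such as Claude Code and the Claude CLI, against open-source models instead.

bash

# Run Claude Code against an Ollama model (no separate installation required)
ollama launch claude

# Specify a different model
ollama launch claude --model qwen3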

Capabilities

Structured outputs

Enforcing a JSON schema yields responses with a consistent structure.

python

from ollama import chat
from pydantic import BaseModel

class Country(BaseModel):
    name: str
    capital: str
    population: int

response = chat(
    model='llama3.2',
    messages=[{'role': 'user', 'content': 'Tell me about South Korea'}],
    format=Country.model_json_schema(),
)
country = Country.model_validate_json(response.message.content)
print(country.capital)  # e.g. Seoul
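
For clients that call the REST API directly, the schema can also be written by hand and passed in the `format` field. The following sketch uses only the standard library; `country_schema`, `build_chat_body`, and `parse_country` are hypothetical names, the sample reply is invented, and the check covers required keys only rather than full JSON Schema validation.

```python
import json

# Hand-written JSON schema with the same three fields as the Country model.
country_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "capital": {"type": "string"},
        "population": {"type": "integer"},
    },
    "required": ["name", "capital", "population"],
}


def build_chat_body(prompt: str) -> dict:
    """Build a POST /api/chat body that requests schema-constrained output."""
    return {
        "model": "llama3.2",
        "messages": [{"role": "user", "content": prompt}],
        "format": country_schema,
        "stream": False,
    }


def parse_country(content: str) -> dict:
    """Parse message.content and confirm that every required key is present."""
    data = json.loads(content)
    missing = [key for key in country_schema["required"] if key not in data]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return data


# Invented reply text for illustration.
sample_content = '{"name": "South Korea", "capital": "Seoul", "population": 51000000}'
print(parse_country(sample_content)["capital"])  # Seoul
```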

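Tool calling (function calling)

python

import ollama

def get_weather(city: str) -> str:
    """Return the current weather for a city (stub with a fixed value)."""
    return f"The current temperature in {city} is 22°C."

response = ollama.chat(
    model='qwen3',
    messages=[{'role': 'user', 'content': "What's the weather in Seoul?"}],
    tools=[get_weather],  # the Python SDK converts the function into a schema automatically
)

# Handle the tool calls requested by the model
for tool_call in response.message.tool_calls or []:
    if tool_call.function.name == 'get_weather':
        result = get_weather(**tool_call.function.arguments)
        # include the result in the next message

Supported tool-calling modes: single / parallel / multi-turn (agent loop) / streaming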
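
With parallel calls or an agent loop, each requested tool has to be looked up by name and its result sent back as a message. The sketch below works on plain dictionaries so that it runs without a server; `TOOLS` and `run_tool_calls` are hypothetical, the sample call is invented, and the `tool_name` key on the result message is an assumption that should be checked against the API version in use.

```python
def get_weather(city: str) -> str:
    """Stub tool that returns a fixed temperature."""
    return f"The current temperature in {city} is 22°C."


# Registry mapping tool names to callables.
TOOLS = {"get_weather": get_weather}


def run_tool_calls(tool_calls: list) -> list:
    """Run each requested tool and return role="tool" messages to append."""
    messages = []
    for call in tool_calls:
        name = call["function"]["name"]
        arguments = call["function"]["arguments"]
        func = TOOLS.get(name)
        result = func(**arguments) if func else f"unknown tool: {name}"
        # "tool_name" is an assumed field name for linking result to tool.
        messages.append({"role": "tool", "tool_name": name, "content": str(result)})
    return messages


# Invented tool call, shaped like an entry of message.tool_calls.
sample_calls = [{"function": {"name": "get_weather", "arguments": {"city": "Seoul"}}}]
tool_messages = run_tool_calls(sample_calls)
print(tool_messages[0]["content"])
```

In a multi-turn loop, these messages would be appended to the conversation and sent in a follow-up chat request so the model can produce its final answer.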

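Embeddings

Embeddings convert text into vectors for use in semantic search and RAG pipelines.

python

import ollama

result = ollama.embed(
    model='embeddinggemma',   # or qwen3-embedding, all-minilm
    input=['text 1', 'text 2']  # batch processing is supported
)
vectors = result.embeddings   # one vector per input; typically 384 to 1024 dimensions, depending on the model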
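
Semantic search over such vectors usually comes down to ranking documents by cosine similarity to a query vector. The sketch below uses toy 3-dimensional vectors in place of real embeddings; `cosine_similarity` and `rank` are hypothetical helper names.

```python
import math


def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank(query_vec: list, doc_vecs: list) -> list:
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Toy vectors standing in for embeddings.
query = [1.0, 0.0, 0.0]
docs = [[0.0, 1.0, 0.0], [2.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
ranking = rank(query, docs)
print([index for index, _ in ranking])  # [1, 2, 0]
```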

Vision

Passing an image together with text lets the model analyze and describe the image.

bash

# CLI
ollama run gemma3 ./photo.jpg "What is in this photo?"

# API
curl http://localhost:11434/api/chat -d '{
  "model": "gemma3",
  "messages": [{
    "role": "user",
    "content": "Describe this image",
    "images": ["<base64-encoded-image>"]
  }]
}'
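
The `images` field expects base64 text, so a client has to encode the raw file bytes before sending them. A minimal sketch, assuming a hypothetical helper `build_vision_body` and using placeholder bytes instead of a real image file:

```python
import base64
import json


def build_vision_body(model: str, prompt: str, image_bytes: bytes) -> dict:
    """Build a POST /api/chat body with one base64-encoded image attached."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt, "images": [encoded]}],
        "stream": False,
    }


# Placeholder bytes; in practice read them with pathlib.Path("photo.jpg").read_bytes().
fake_image = b"\x89PNG\r\n\x1a\n"
body = build_vision_body("gemma3", "Describe this image", fake_image)
print(json.dumps(body)[:60])
```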

Image generation

Local image generation models can be run on macOS (experimental feature).

bash

# Run an image generation model
ollama run <image-generation-model>   # any supported image generation model

# Generate an image through the OpenAI-compatible API
curl http://localhost:11434/v1/images/generations -d '{
  "model": "...",
  "prompt": "A futuristic city at night"
}'

Thinking

The internal thinking process of reasoning models (DeepSeek-R1, QwQ, etc.) can be returned alongside the answer.

python

import ollama

response = ollama.chat(
    model='deepseek-r1',
    messages=[{'role': 'user', 'content': 'Which is larger, 9.11 or 9.9?'}],
    think=True,   # or "high" / "medium" / "low"
)
print("Thinking:", response.message.thinking)
print("Answer:", response.message.content)

Importing models

External model files can be imported into Ollama.

dockerfile

# Import a GGUF file
FROM /path/to/model.gguf

# Import a Safetensors directory
FROM /path/to/safetensors/

# Combine a base model with a LoRA adapter
FROM llama3.2
ADAPTER /path/to/lora.gguf

Quantization: convert an FP16/FP32 model to a smaller format

bash

ollama create my-model -f Modelfile -q q4_K_M
# Supported: q8_0, q4_K_S, q4_K_M, q5_K_M, etc.

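GPU support

Platform	Requirements
NVIDIA CUDA	Compute capability 5.0+, driver 531+ (GTX 750 through RTX 50xx)
AMD ROCm	Linux: ROCm v7 / Windows: ROCm v6.1 (selected GPUs)
Apple Metal	M-series chips (enabled automatically)
Apple MLX	High-performance inference for Apple Silicon only (preview since v0.6+, enabled automatically)
Vulkan	Set the `OLLAMA_VULKAN=1` environment variable (experimental)

On Apple Silicon (M series), the MLX engine has been the default instead of Metal since March 2026, which improved inference speed.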

bash

# Select which GPUs to use
CUDA_VISIBLE_DEVICES=0,1 ollama serve        # NVIDIA
ROCR_VISIBLE_DEVICES=0 ollama serve          # AMD

# Check how a loaded model is split between CPU and GPU memory
ollama ps
# PROCESSOR column: "100% GPU" / "100% CPU" / "48%/52% CPU/GPU"
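
For monitoring scripts it can help to turn the PROCESSOR value into numbers. The sketch below handles only the three formats quoted above and raises an error for anything else; `parse_processor` is a hypothetical helper, not part of Ollama.

```python
import re


def parse_processor(value: str) -> dict:
    """Parse an `ollama ps` PROCESSOR value into CPU and GPU percentages."""
    value = value.strip()
    # Mixed placement, e.g. "48%/52% CPU/GPU"
    split = re.fullmatch(r"(\d+)%/(\d+)%\s+CPU/GPU", value)
    if split:
        return {"cpu": int(split.group(1)), "gpu": int(split.group(2))}
    # Single device, e.g. "100% GPU"
    single = re.fullmatch(r"(\d+)%\s+(CPU|GPU)", value)
    if single:
        percent = int(single.group(1))
        device = single.group(2).lower()
        other = "gpu" if device == "cpu" else "cpu"
        return {device: percent, other: 100 - percent}
    raise ValueError(f"unrecognized PROCESSOR value: {value!r}")


for text in ("100% GPU", "100% CPU", "48%/52% CPU/GPU"):
    print(text, "->", parse_processor(text))
```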

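Environment variables

Variable	Default	Description
`OLLAMA_HOST`	`127.0.0.1:11434`	Server bind address (use `0.0.0.0` to expose it externally)
`OLLAMA_MODELS`	`~/.ollama/models`	Model storage path
`OLLAMA_CONTEXT_LENGTH`	`4096`	Default context window size
`OLLAMA_MAX_LOADED_MODELS`	auto (3 per GPU, or 3 for CPU inference)	Maximum number of models loaded at once
`OLLAMA_NUM_PARALLEL`	`1`	Number of requests processed in parallel
`OLLAMA_MAX_QUEUE`	`512`	Maximum size of the request queue
`OLLAMA_FLASH_ATTENTION`	`0`	Enable Flash Attention
`OLLAMA_KV_CACHE_TYPE`	`f16`	KV cache quantization type (`q8_0`, `q4_0`)
`OLLAMA_ORIGINS`	`127.0.0.1,0.0.0.0`	Allowed CORS origins
`HTTPS_PROXY`	(none)	Proxy setting
`OLLAMA_NO_CLOUD`	`0`	Disable cloud features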
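
`OLLAMA_CONTEXT_LENGTH` and `OLLAMA_KV_CACHE_TYPE` together largely determine how much memory the KV cache takes. The back-of-the-envelope sketch below uses the usual transformer estimate (keys plus values, per layer, per KV head, per position). The model dimensions are hypothetical, the per-element sizes for `q8_0` and `q4_0` assume GGML's 32-element block layout (34 and 18 bytes per block), and real usage will differ because of runtime overhead.

```python
# Approximate storage cost per cached element, in bytes.
BYTES_PER_ELEMENT = {"f16": 2.0, "q8_0": 34 / 32, "q4_0": 18 / 32}


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_length: int, cache_type: str = "f16") -> int:
    """Estimate KV cache size for one sequence at full context."""
    # Factor 2 accounts for storing both keys and values.
    elements = 2 * n_layers * n_kv_heads * head_dim * context_length
    return int(elements * BYTES_PER_ELEMENT[cache_type])


# Hypothetical model dimensions chosen for illustration.
dims = {"n_layers": 28, "n_kv_heads": 8, "head_dim": 128}
for cache_type in ("f16", "q8_0", "q4_0"):
    size = kv_cache_bytes(context_length=4096, cache_type=cache_type, **dims)
    print(cache_type, round(size / 2**20), "MiB")
```

Under these assumptions the estimate falls from 448 MiB with `f16` to 238 MiB with `q8_0` and 126 MiB with `q4_0`, and it scales linearly with the context length.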

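Memory management

bash

# By default a model stays in memory for 5 minutes and is then unloaded
# Control this with the keep_alive parameter:
#   -1    = keep loaded indefinitely
#   0     = unload immediately
#   "10m" = keep loaded for 10 minutes

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "keep_alive": -1
}'

# Unload immediately
ollama stop llama3.2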
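
The three kinds of `keep_alive` value can be summarized in a few lines of code. This sketch only illustrates the semantics listed above and is not Ollama's own parser: `keep_alive_seconds` is a hypothetical helper, it treats bare numbers as seconds, and it understands only single-unit strings ending in `s`, `m`, or `h`.

```python
from typing import Optional, Union

UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600}


def keep_alive_seconds(value: Union[int, float, str]) -> Optional[float]:
    """Return how long the model stays loaded; None means indefinitely."""
    if isinstance(value, str):
        if value[-1] in UNIT_SECONDS:
            seconds = float(value[:-1]) * UNIT_SECONDS[value[-1]]
        else:
            seconds = float(value)  # bare number given as text
    else:
        seconds = float(value)
    # Negative values mean "never unload"; zero means "unload now".
    return None if seconds < 0 else seconds


for example in (-1, 0, "10m", "5m"):
    print(repr(example), "->", keep_alive_seconds(example))
```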

Model storage locations

Platform	Path
macOS	`~/.ollama/models`
Linux	`/usr/share/ollama/.ollama/models`
Windows	`C:\Users\%username%\.ollama\models`

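Ollama vs vLLM comparison

Item	Ollama	vLLM
Target	Local environments for individuals and developers	Production serving
Setup complexity	Very easy (one click)	Moderate
Throughput	Best for single requests	Best for large batches
GPU requirements	Low (CPU fallback supported)	High
OpenAI compatibility	Supported	Supported
Model management	Built in (ollama pull/list)	Managed separately
Quantization	GGUF supported natively	AWQ, GPTQ, etc.
Streaming	Supported	Supported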
References

1. Ollama. (2024). Ollama: Get up and running with large language models. GitHub.
2. Ollama. (2024). Ollama API Documentation.
3. Gerganov, G. (2023). llama.cpp: LLM inference in C/C++. GitHub.
4. Ollama. (2024). Ollama is now available as an official Docker image. Ollama Blog.
5. Ollama. (2026). Ollama is now powered by MLX on Apple Silicon. Ollama Blog.
