Text Generation Inference

TGI (Text Generation Inference) is a production-grade LLM serving toolkit developed by Hugging Face. The server core is written in Rust for performance, and Flash Attention, Paged Attention, and continuous batching are supported out of the box.

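Key Features

Item	Description
Core languages	Rust (server) + Python (model loading)
Tensor parallelism	Distributed inference across multiple GPUs
Flash Attention	Memory-efficient attention implementation
Continuous batching	Dynamic batching of incoming requests
Quantization	bitsandbytes, AWQ, GPTQ
Streaming	Token streaming over SSE (Server-Sent Events)
Safetensors	Fast model loading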
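
Continuous batching is easier to picture with a toy model. The sketch below is not TGI's scheduler. It is a simplified simulation with hypothetical function names, and it assumes that every request occupies one batch slot and needs a fixed number of decode steps. It compares static batching, where a batch runs until its longest request finishes, with continuous batching, where a finished request frees its slot for the next queued request right away.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Toy model: each batch runs until its longest request finishes."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        total += max(lengths[i:i + batch_size])
    return total


def continuous_batching_steps(lengths, batch_size):
    """Toy model: a finished request frees its slot for the next queued one."""
    queue = deque(lengths)
    running = []  # remaining decode steps for each active request
    steps = 0
    while queue or running:
        # Fill any free slots before running the next decode step.
        while queue and len(running) < batch_size:
            running.append(queue.popleft())
        steps += 1
        # Every active request produces one token; drop the ones that finished.
        running = [r - 1 for r in running if r - 1 > 0]
    return steps


if __name__ == "__main__":
    lengths = [2, 8, 2, 8, 2, 8]  # decode steps needed per request (made-up values)
    print(static_batching_steps(lengths, batch_size=2))      # 24
    print(continuous_batching_steps(lengths, batch_size=2))  # 18
```

With these made-up request lengths, the short requests no longer wait for the long ones in the same batch, so the same work finishes in fewer decode steps.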

Running with Docker

bash

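model=meta-llama/Llama-3.1-8B-Instruct
volume=$PWD/data  # cache downloaded weights across container restarts

# Llama 3.1 is a gated model, so export a Hugging Face access token as HF_TOKEN first.
# --num-shard 2 splits the model across 2 GPUs; --shm-size 1g provides the shared memory NCCL needs for that.
docker run --gpus all --shm-size 1g \
  -p 8080:80 \
  -v "$volume:/data" \
  -e HF_TOKEN="$HF_TOKEN" \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id "$model" \
  --num-shard 2 \
  --max-input-length 4096 \
  --max-total-tokens 8192 \
  --quantize bitsandbytes-nf4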
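
The two length flags in the command interact: a prompt may use at most `--max-input-length` tokens, and the prompt plus `max_new_tokens` must fit within `--max-total-tokens`. The helper below is a minimal sketch of that arithmetic. The function name is hypothetical, and the defaults simply mirror the values used in the command above.

```python
def max_new_tokens_allowed(prompt_tokens, max_input_length=4096, max_total_tokens=8192):
    """Return the largest max_new_tokens that fits the server limits.

    Hypothetical helper: it mirrors the two constraints described above,
    it is not part of TGI or huggingface_hub.
    """
    if prompt_tokens > max_input_length:
        raise ValueError(
            f"prompt has {prompt_tokens} tokens, limit is {max_input_length}"
        )
    # Whatever the prompt does not use is left for generated tokens.
    return max_total_tokens - prompt_tokens


if __name__ == "__main__":
    print(max_new_tokens_allowed(1000))  # 7192
    print(max_new_tokens_allowed(4096))  # 4096
```

A prompt that uses the full 4096-token input budget therefore still leaves room for 4096 generated tokens.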

Python Client

python

from huggingface_hub import InferenceClient

client = InferenceClient("http://localhost:8080")

# Basic generation
response = client.text_generation(
    "What is the capital of South Korea?",
    max_new_tokens=200,
    temperature=0.7,
    do_sample=True,
)
print(response)

# Streaming: with stream=True the call yields generated text piece by piece
for token in client.text_generation(
    "Explain asynchronous programming in Python",
    max_new_tokens=500,
    stream=True,
):
    print(token, end="", flush=True)
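
The streaming client hides the SSE framing. For readers who want to consume the stream without `huggingface_hub`, the sketch below shows one way to pull token text out of `data:` lines such as those sent by the `/generate_stream` endpoint. The function name is hypothetical, and the sample events are hand-written illustrations of the event shape rather than captured server output, so field details may differ between TGI versions.

```python
import json


def iter_token_texts(sse_lines):
    """Yield the text of each non-special token from SSE 'data:' lines."""
    for raw in sse_lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and other SSE fields
        event = json.loads(line[len("data:"):])
        token = event.get("token", {})
        if not token.get("special", False):
            yield token.get("text", "")


if __name__ == "__main__":
    # Hand-written sample events (assumed shape, not real server output).
    sample = [
        'data:{"index":1,"token":{"id":1,"text":"Seo","logprob":-0.1,"special":false},"generated_text":null,"details":null}',
        "",
        'data:{"index":2,"token":{"id":2,"text":"ul","logprob":-0.2,"special":false},"generated_text":"Seoul","details":null}',
    ]
    print("".join(iter_token_texts(sample)))  # Seoul
```

In practice the lines would come from an HTTP response read incrementally; the parsing logic stays the same.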

TGI vs. vLLM Comparison

Item	TGI	vLLM
Developer	Hugging Face	UC Berkeley (Sky Computing Lab)
Core language	Rust	Python
HF Hub integration	Native	Supported
Throughput	High	Very high
Community	HF ecosystem	Fast-growing
