Text Classification | AI Insight Note

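Text classification is the NLP task of assigning a piece of text to one of a set of predefined categories. Typical applications include sentiment analysis, spam detection, topic classification, and intent detection.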

Comparison of Approaches

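Approach	Tools	Strengths	Weaknesses
Traditional ML	TF-IDF + SVM/NB	Fast, interpretable	Limited accuracy
RNN/LSTM	PyTorch	Captures word order	Slow to train
Fine-tuning	BERT, RoBERTa	High accuracy	Resource-intensive
Zero-shot	GPT, BART	No training required	Inference cost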
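
To make the first row of the table concrete, here is a minimal sketch of a TF-IDF + linear SVM classifier built with scikit-learn. The spam/ham texts and labels are invented for illustration, and the variable names are assumptions rather than anything from the original note.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy data, invented for illustration only
train_texts = [
    "win a free prize now",
    "free cash offer click now",
    "claim your free prize today",
    "meeting moved to monday morning",
    "please review the attached report",
    "lunch with the project team tomorrow",
]
train_labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

# TF-IDF turns each text into a sparse weighted bag-of-words vector;
# LinearSVC then learns a linear decision boundary over those vectors.
clf = make_pipeline(TfidfVectorizer(), LinearSVC(random_state=0))
clf.fit(train_texts, train_labels)

predictions = clf.predict(["free prize offer", "review the project report"])
print(list(predictions))
```

A pipeline like this trains in well under a second on small data, and the learned per-word weights can be inspected directly, which is where the "fast, interpretable" entry in the table comes from.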

BERT Fine-tuning (HuggingFace)

python

from transformers import (
    AutoTokenizer, AutoModelForSequenceClassification,
    TrainingArguments, Trainer,
)
from datasets import Dataset
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Load the model and tokenizer
model_name = "klue/bert-base"  # Korean BERT
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=3  # negative (0) / neutral (1) / positive (2)
)

# Prepare the data
def tokenize(examples):
    return tokenizer(
        examples["text"],
        truncation=True,
        padding="max_length",
        max_length=128,
    )

# Toy Korean examples: "Really good!" (2), "It's so-so" (1), "Not great" (0)
train_dataset = Dataset.from_dict({
    "text": ["정말 좋아요!", "그저 그래요", "별로예요"],
    "label": [2, 1, 0],
}).map(tokenize, batched=True)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1": f1_score(labels, preds, average="macro"),
    }

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="./results",
        num_train_epochs=3,
        per_device_train_batch_size=16,
        eval_strategy="epoch",  # older transformers releases name this evaluation_strategy
        save_strategy="epoch",
        load_best_model_at_end=True,
    ),
    train_dataset=train_dataset,
    eval_dataset=train_dataset,  # demo only: use a held-out validation split in practice
    compute_metrics=compute_metrics,
)
trainer.train()
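
The classification head outputs raw logits, so a prediction step still has to map them to labels. The following is a minimal sketch of that step using NumPy only. It assumes the label ids from the toy data above (0 = negative, 1 = neutral, 2 = positive); the `ID2LABEL` mapping, the helper name `logits_to_labels`, and the example logits are hypothetical and not model output.

```python
import numpy as np

# Assumed mapping, following the toy labels used in the training example
ID2LABEL = {0: "negative", 1: "neutral", 2: "positive"}

def logits_to_labels(logits):
    """Convert a batch of logits into (label, probability) pairs."""
    logits = np.asarray(logits, dtype=float)
    # Subtract the row-wise max before exponentiating for numerical stability
    shifted = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(shifted) / np.exp(shifted).sum(axis=-1, keepdims=True)
    ids = probs.argmax(axis=-1)
    return [(ID2LABEL[int(i)], float(p[i])) for i, p in zip(ids, probs)]

# Made-up logits for two inputs, shape (batch, num_labels)
example_logits = [[-1.2, 0.3, 2.5], [1.8, 0.1, -0.9]]
results = logits_to_labels(example_logits)
for label, prob in results:
    print(f"{label}: {prob:.3f}")
```

With a trained `Trainer`, the logits would typically come from `trainer.predict(dataset).predictions`.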

Zero-shot Classification

python

from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

text = "The new iPhone features a remarkable camera system"
candidate_labels = ["technology", "sports", "politics", "entertainment"]

result = classifier(text, candidate_labels)
print(f"Label: {result['labels'][0]} (score: {result['scores'][0]:.3f})")
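
For intuition about where the scores come from: an NLI-based zero-shot pipeline pairs the input text with one hypothesis per candidate label and, in single-label mode, normalizes the entailment logits across labels with a softmax. The sketch below reproduces only that scoring step. The template string is assumed to match the pipeline's default, the helper `score_labels` is hypothetical, and the logits are invented rather than produced by `facebook/bart-large-mnli`.

```python
import numpy as np

def score_labels(entailment_logits, candidate_labels):
    """Softmax over per-label entailment logits, sorted best first."""
    logits = np.asarray(entailment_logits, dtype=float)
    exp = np.exp(logits - logits.max())  # stabilized softmax
    scores = exp / exp.sum()
    order = np.argsort(-scores)
    return [(candidate_labels[i], float(scores[i])) for i in order]

candidate_labels = ["technology", "sports", "politics", "entertainment"]

# One hypothesis per label; the NLI model scores each (text, hypothesis) pair
template = "This example is {}."  # assumed default template
hypotheses = [template.format(label) for label in candidate_labels]

# Hypothetical entailment logits, one per hypothesis (not real model output)
fake_logits = [3.1, -2.0, -1.5, 0.2]
ranked = score_labels(fake_logits, candidate_labels)
for label, score in ranked:
    print(f"{label}: {score:.3f}")
```

Because the scores are normalized across the candidate set, adding or removing a label changes every other label's score. This matters when comparing results across different label lists.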
