Model Compression - Artificial Intelligence > Deep Learning | AI Insight Note

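Model compression reduces the size and compute cost of a deep learning model while keeping the loss in accuracy as small as possible. It is essential for mobile and edge deployment and for cutting inference cost.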
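
As a rough illustration of what "size" means here, the weight footprint of a model is approximately the parameter count times the bytes per parameter. The sketch below assumes an 8-billion-parameter model (a figure chosen only for illustration) and ignores activations, the KV cache, and quantization metadata such as scales. The helper name `weight_footprint_gb` is hypothetical.

```python
def weight_footprint_gb(num_params: int, bits_per_param: float) -> float:
    """Approximate weight storage in GB (10^9 bytes), ignoring metadata overhead."""
    return num_params * bits_per_param / 8 / 1e9

NUM_PARAMS = 8_000_000_000  # assumed parameter count, for illustration only

for name, bits in [("fp32", 32), ("fp16/bf16", 16), ("int8", 8), ("4-bit", 4)]:
    print(f"{name:>10}: {weight_footprint_gb(NUM_PARAMS, bits):5.1f} GB")
```

Real 4-bit checkpoints tend to land somewhat above this estimate because scales and any layers kept in higher precision add overhead.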

Comparison of Compression Techniques

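Technique	Principle	Compression ratio	Accuracy loss
Quantization	Floating point → low-precision integers	2-8x	Low
Pruning	Remove low-importance weights	2-10x	Medium
Knowledge distillation (KD)	Train a small model to mimic a large one	5-50x	Low
Low-rank decomposition (e.g. truncated SVD)	Factor weight matrices into smaller matrices	10-100x	Low
Weight sharing	Remove redundancy by clustering weights	2-5x	Medium

These ranges are rough and depend heavily on the model and the task. LoRA is often listed next to low-rank decomposition, but it is a fine-tuning method: it reduces the number of trainable parameters, not the size of the deployed base model.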
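
The table gives low-rank decomposition no worked example, so here is a minimal sketch using truncated SVD. It assumes a synthetic weight matrix that is close to rank 16 by construction; trained weights are usually less cleanly low-rank, so the reconstruction error in practice would be larger. All sizes and variable names are illustrative.

```python
import torch

torch.manual_seed(0)

# Hypothetical weight matrix: approximately rank 16 plus a little noise.
out_features, in_features, true_rank = 256, 512, 16
W = torch.randn(out_features, true_rank) @ torch.randn(true_rank, in_features)
W = W + 0.01 * torch.randn(out_features, in_features)

# Truncated SVD: keep the top-r singular directions so that W ≈ A @ B.
r = 16
U, S, Vh = torch.linalg.svd(W, full_matrices=False)
A = U[:, :r] * S[:r]   # shape (out_features, r)
B = Vh[:r, :]          # shape (r, in_features)

params_before = W.numel()
params_after = A.numel() + B.numel()
rel_error = (torch.linalg.norm(W - A @ B) / torch.linalg.norm(W)).item()

print(f"parameters: {params_before} -> {params_after} "
      f"({params_before / params_after:.1f}x fewer)")
print(f"relative reconstruction error: {rel_error:.4f}")
```

A linear layer with weight `W` can then be replaced by two smaller linear layers applying `B` followed by `A`, which is where the parameter and compute savings come from.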

Post-Training Quantization (PTQ)

python

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit quantization (bitsandbytes)
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",  # NormalFloat4
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    quantization_config=quantization_config,
    device_map="auto",
)
# Roughly 16 GB in 16-bit precision → about 5 GB after 4-bit quantization
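
To make the float-to-integer mapping concrete, the following is a minimal sketch of symmetric per-tensor int8 quantization. It is a simplified illustration and not what bitsandbytes does internally for NF4, which uses block-wise, non-uniform quantization levels. The function names `quantize_int8` and `dequantize_int8` are hypothetical.

```python
import torch

def quantize_int8(w: torch.Tensor):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    scale = w.abs().max().clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize_int8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover a float approximation of the original weights."""
    return q.to(torch.float32) * scale

torch.manual_seed(0)
w = torch.randn(4, 8)  # stand-in for a weight matrix
q, scale = quantize_int8(w)
w_hat = dequantize_int8(q, scale)

max_err = (w - w_hat).abs().max().item()
print(f"storage: {w.element_size()} bytes/weight -> {q.element_size()} byte/weight")
print(f"max abs error: {max_err:.5f} (rounding bound is about scale / 2 = {scale.item() / 2:.5f})")
```

Because a single scale covers the whole tensor, one outlier weight stretches the grid for every other weight. That is one reason practical schemes quantize per channel or per block instead.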

Knowledge Distillation

python

import torch
import torch.nn as nn
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """
    T: temperature (higher values flatten the soft labels)
    alpha: weight of the KD loss relative to the CE loss
    """
    # Soft-label loss (knowledge from the teacher model)
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction='batchmean',
    ) * (T ** 2)

    # Hard-label loss (ground-truth labels)
    hard_loss = F.cross_entropy(student_logits, labels)

    return alpha * soft_loss + (1 - alpha) * hard_loss
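
The effect of the temperature `T` is easier to see on a small example. The sketch below uses made-up teacher logits for a 4-class problem and prints the softmax at `T=1` and `T=4`. It is written in plain Python so the arithmetic is visible, and the helper name `softmax_with_temperature` is hypothetical.

```python
import math

def softmax_with_temperature(logits, T):
    """Softmax of logits / T, computed stably by subtracting the max."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

teacher_logits = [6.0, 2.0, 1.0, -1.0]  # made-up logits, for illustration only

for T in (1.0, 4.0):
    probs = softmax_with_temperature(teacher_logits, T)
    print(f"T={T}: " + ", ".join(f"{p:.3f}" for p in probs))
```

At the higher temperature the top class loses probability mass to the others, so the student also sees how the teacher ranks the wrong classes. The gradients of the softened loss shrink roughly as 1/T², which is why the soft loss is multiplied by `T ** 2` before being mixed with the hard loss.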

Structured Pruning

python

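import torch.nn as nn
from torch.nn.utils import prune

# Toy model for illustration (an assumption: the original snippet does not define `model`)
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 16, kernel_size=3)
        self.fc1 = nn.Linear(16, 10)

model = Net()

# L1 unstructured pruning: zero out the 30% of weights with the smallest magnitude
prune.l1_unstructured(model.fc1, name='weight', amount=0.3)

# Channel-level structured pruning: zero out the 20% of output channels (dim=0) with the smallest L2 norm
prune.ln_structured(model.conv1, name='weight', amount=0.2, n=2, dim=0)

# Make the pruning permanent (removes the mask and the weight_orig reparametrization)
prune.remove(model.fc1, 'weight')

# Check sparsity
print(f"Sparsity: {(model.fc1.weight == 0).float().mean().item():.1%}")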
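
`torch.nn.utils.prune` only zeroes weights through a mask, so the tensor keeps its original shape and the layer does not become smaller or faster by itself. The sketch below shows one way to turn structured sparsity into a physically smaller layer by copying the surviving output channels into a new `Conv2d`. The layer sizes are hypothetical, and in a real network every layer that consumes this output would need its input channels sliced in the same way.

```python
import torch
import torch.nn as nn
from torch.nn.utils import prune

torch.manual_seed(0)
conv = nn.Conv2d(3, 10, kernel_size=3)  # hypothetical layer with 10 output channels

# Zero out the 20% of output channels with the smallest L2 norm, then bake in the mask.
prune.ln_structured(conv, name="weight", amount=0.2, n=2, dim=0)
prune.remove(conv, "weight")

# The weight shape is unchanged; pruned channels are simply all zeros.
keep = conv.weight.abs().sum(dim=(1, 2, 3)) != 0
print(f"weight shape after pruning: {tuple(conv.weight.shape)}, "
      f"channels kept: {int(keep.sum())}")

# Build a physically smaller layer from the surviving channels.
slim = nn.Conv2d(3, int(keep.sum()), kernel_size=3)
with torch.no_grad():
    slim.weight.copy_(conv.weight[keep])
    slim.bias.copy_(conv.bias[keep])

x = torch.randn(1, 3, 8, 8)
print(f"output channels: {conv(x).shape[1]} -> {slim(x).shape[1]}")
```

Unstructured sparsity cannot be compacted this way. It only pays off with sparse storage formats or kernels that exploit it, which is the main practical argument for structured pruning.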

