Data Catalog

A data catalog is a system that centrally manages the metadata for all of an organization's data assets. Often described as "Google for data," it lets users see who uses which data, where, and how.

Core features

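Feature	Description
Data search	Find datasets by keyword or tag
Metadata management	Schema, descriptions, and owner information
Data lineage	Trace data flow (source → transformation → consumption)
Data profiling	Automatically compute statistics and distributions
Access control	Sensitive-data tagging and governance
Collaboration	Comments, ratings, and documentation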
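To make the search and lineage rows above concrete, here is a minimal in-memory sketch. The `Dataset` and `Catalog` classes, their fields, and the sample datasets are all hypothetical and exist only for illustration; a real catalog would persist this metadata and index it for search.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    name: str
    description: str
    owner: str
    tags: set = field(default_factory=set)
    upstream: list = field(default_factory=list)  # names of source datasets


class Catalog:
    """Toy catalog: keeps metadata in a dict keyed by dataset name."""

    def __init__(self):
        self._datasets = {}

    def register(self, dataset):
        self._datasets[dataset.name] = dataset

    def search(self, keyword=None, tag=None):
        """Return sorted names of datasets matching the keyword and/or tag."""
        hits = []
        for ds in self._datasets.values():
            text = f"{ds.name} {ds.description}".lower()
            if keyword is not None and keyword.lower() not in text:
                continue
            if tag is not None and tag not in ds.tags:
                continue
            hits.append(ds.name)
        return sorted(hits)

    def lineage(self, name):
        """Return all upstream dataset names, nearest first (breadth-first)."""
        seen = []
        queue = list(self._datasets[name].upstream)
        while queue:
            current = queue.pop(0)
            if current in seen:
                continue
            seen.append(current)
            if current in self._datasets:
                queue.extend(self._datasets[current].upstream)
        return seen


catalog = Catalog()
catalog.register(Dataset("raw_orders", "Raw order events", "ingest-team", {"raw"}))
catalog.register(Dataset("orders", "Cleaned order transactions", "data-team",
                         {"pii", "sales"}, ["raw_orders"]))
catalog.register(Dataset("daily_revenue", "Revenue aggregated per day", "analytics",
                         {"sales"}, ["orders"]))

print(catalog.search(keyword="order"))   # ['orders', 'raw_orders']
print(catalog.search(tag="sales"))       # ['daily_revenue', 'orders']
print(catalog.lineage("daily_revenue"))  # ['orders', 'raw_orders']
```

Lineage is stored here as a list of upstream names per dataset, so tracing a flow back to its sources is a plain graph walk. Tags such as `pii` are also the hook that access-control rules would typically key on.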

Apache Atlas example

python

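# Assumes the official Apache Atlas Python client (pip install apache-atlas)
from apache_atlas.client.base_client import AtlasClient

# Connect with basic auth (placeholder host and credentials)
client = AtlasClient("http://atlas-server:21000", ("admin", "admin"))

# Columns are separate entities. Negative GUIDs are temporary IDs
# that Atlas replaces with real ones when the request is processed.
def column(guid, name, col_type):
    return {
        "typeName": "hive_column",
        "guid": guid,
        "attributes": {
            "name": name,
            "type": col_type,
            "qualifiedName": f"orders.{name}@cluster",
            "table": {"guid": "-1"},
        },
    }

# Create an entity (register a hive_table); the hive_db must already exist
entity = {
    "typeName": "hive_table",
    "guid": "-1",
    "attributes": {
        "name": "orders",
        "db": {"typeName": "hive_db", "uniqueAttributes": {"qualifiedName": "default@cluster"}},
        "qualifiedName": "orders@cluster",
        "description": "Order transaction table",
        "owner": "data-team",
        "columns": [{"guid": "-2"}, {"guid": "-3"}],
    },
}
payload = {
    "entity": entity,
    "referredEntities": {
        "-2": column("-2", "order_id", "bigint"),
        "-3": column("-3", "amount", "double"),
    },
}
result = client.entity.create_entity(payload)

# Search with the Atlas DSL
results = client.discovery.dsl_search("hive_table where name='orders'")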
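The DSL search at the end of the example can also be issued as a plain HTTP GET against the Atlas REST API. The sketch below only builds the request URL, so it runs without a server. The `/api/atlas/v2/search/dsl` path and the `query`, `limit`, and `offset` parameters reflect my understanding of the Atlas v2 REST API; the helper name `build_dsl_search_url` and its default `limit` are hypothetical.

```python
from urllib.parse import urlencode


def build_dsl_search_url(base_url, query, limit=25, offset=0):
    """Build the GET URL for an Atlas DSL search (assumed v2 endpoint path)."""
    params = urlencode({"query": query, "limit": limit, "offset": offset})
    return f"{base_url.rstrip('/')}/api/atlas/v2/search/dsl?{params}"


url = build_dsl_search_url(
    "http://atlas-server:21000",
    "hive_table where name='orders'",
)
print(url)
```

The resulting URL could then be requested with any HTTP client using the same basic-auth credentials as the Python client.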

Comparison of major solutions

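Solution	Type	Key characteristic
Apache Atlas	Open source	Integration with the Hadoop ecosystem
DataHub (LinkedIn)	Open source	Real-time metadata
Amundsen (Lyft)	Open source	Search-centric
Collibra	Commercial	Governance-focused
Google Data Catalog	Cloud	GCP integration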
