Data Quality Management

Data quality management is the set of processes and techniques that keep the data in ML and analytics systems accurate and trustworthy. Bad data renders even an excellent model useless.

Data Quality Dimensions

Dimension	Definition	Example check
Completeness	Required values are present (low NULL ratio)	null_count / total < 5%
Accuracy	Values match the real world	age > 0 AND age < 150
Consistency	Values agree across tables	orders.user_id ∈ users.id
Timeliness	Data is fresh	max(updated_at) > now() - 1h
Uniqueness	No duplicates	count(distinct id) = count(*)
Validity	Values follow the expected format or range	email LIKE '%@%.%'
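
The example checks in the table can be sketched as one function that returns a pass/fail flag per dimension. This is a minimal illustration in plain Python over lists of dicts, so it runs without dependencies. The function name `check_quality`, the sample rows, and the loose email pattern are assumptions made for the example, not part of any library.

```python
import re
from datetime import datetime, timedelta

# Loose shape check only (something@something.something); an assumption for this sketch
EMAIL_SHAPE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def check_quality(orders, users, now):
    """Return one pass/fail flag per data quality dimension.

    Assumes `orders` and `users` are non-empty lists of dicts.
    """
    emails = [o["email"] for o in orders]
    order_ids = [o["order_id"] for o in orders]
    user_ids = {u["id"] for u in users}
    return {
        # Completeness: fewer than 5% of emails are missing
        "completeness": sum(e is None for e in emails) / len(emails) < 0.05,
        # Accuracy: ages fall inside a plausible range
        "accuracy": all(0 < u["age"] < 150 for u in users),
        # Consistency: every order points at an existing user
        "consistency": all(o["user_id"] in user_ids for o in orders),
        # Timeliness: the newest record is less than one hour old
        "timeliness": max(o["updated_at"] for o in orders) > now - timedelta(hours=1),
        # Uniqueness: no duplicate order ids
        "uniqueness": len(set(order_ids)) == len(order_ids),
        # Validity: every non-missing email has the expected shape
        "validity": all(EMAIL_SHAPE.match(e) is not None for e in emails if e is not None),
    }

now = datetime(2024, 1, 1, 12, 0)
users = [{"id": 1, "age": 34}, {"id": 2, "age": 27}]
orders = [
    {"order_id": 10, "user_id": 1, "email": "a@example.com",
     "updated_at": now - timedelta(minutes=5)},
    # This row references a missing user and carries a malformed email
    {"order_id": 11, "user_id": 3, "email": "not-an-email",
     "updated_at": now - timedelta(minutes=30)},
]

report = check_quality(orders, users, now)
for dimension, passed in report.items():
    print(f"{dimension}: {'PASS' if passed else 'FAIL'}")
```

With the sample rows, only the consistency and validity checks fail. The same six conditions translate directly to pandas expressions or SQL predicates.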

Using Great Expectations

python

import great_expectations as gx
import pandas as pd

context = gx.get_context()
df = pd.read_csv("orders.csv")

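# Wrap the DataFrame in a Validator.
# Note: context.sources is assumed to be the fluent API of Great Expectations
# 0.16 to 0.18; later major releases reorganized this entry point.
validator = context.sources.pandas_default.read_dataframe(df)

# Define expectations
validator.expect_column_to_exist("order_id")
validator.expect_column_values_to_not_be_null("order_id")
validator.expect_column_values_to_be_unique("order_id")
validator.expect_column_values_to_be_between("amount", min_value=0, max_value=1_000_000)
# The dot before the top-level domain must be escaped, otherwise it matches any character
validator.expect_column_values_to_match_regex("email", r"^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$")

# Run validation and report how many expectations passed and failed
results = validator.validate()
print(f"Passed: {results.statistics['successful_expectations']}")
print(f"Failed: {results.statistics['unsuccessful_expectations']}")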
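
Printing the counts does not stop a pipeline on its own. One possible way to turn the statistics into a hard gate is sketched below. The helper `enforce_quality_gate` and its `min_success_ratio` parameter are hypothetical, and the `stats` dict is a hand-written stand-in for `results.statistics` that uses the same two keys as the code above.

```python
def enforce_quality_gate(statistics: dict, min_success_ratio: float = 1.0) -> float:
    """Raise if too few expectations passed; otherwise return the pass ratio."""
    passed = statistics["successful_expectations"]
    failed = statistics["unsuccessful_expectations"]
    total = passed + failed
    if total == 0:
        # An empty suite proves nothing, so treat it as an error
        raise ValueError("no expectations were evaluated")
    ratio = passed / total
    if ratio < min_success_ratio:
        raise RuntimeError(
            f"data quality gate failed: {failed} of {total} expectations failed"
        )
    return ratio

# Stand-in for results.statistics (hypothetical values)
stats = {"successful_expectations": 4, "unsuccessful_expectations": 1}

try:
    enforce_quality_gate(stats)  # default: every expectation must pass
except RuntimeError as err:
    gate_error = str(err)
    print(gate_error)

# A looser threshold lets the same result through
ratio = enforce_quality_gate(stats, min_success_ratio=0.8)
print(f"pass ratio: {ratio:.0%}")
```

Raising an exception lets an orchestrator mark the task as failed, so downstream jobs do not consume data that missed the threshold.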

Integrating with dbt Tests

yaml

# dbt schema.yml (the dbt_utils and dbt_expectations packages must be listed in packages.yml)
models:
  - name: orders
    tests:
      - dbt_expectations.expect_table_row_count_to_be_between:
          min_value: 1000
    columns:
      - name: order_id
        tests: [unique, not_null]
      - name: amount
        tests:
          - dbt_utils.accepted_range:
              min_value: 0
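
A dbt test passes when its compiled query returns zero rows, so each entry in the YAML above is a query that selects the offending records. The sketch below imitates that idea with `sqlite3` from the Python standard library. The SQL is an approximation written for this example, not the exact text dbt generates, and the table contents and test names are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    # Deliberately dirty sample: a duplicate id, a NULL id, a negative amount
    [(1, 10.0), (2, 25.5), (2, 7.0), (None, -3.0)],
)

# Each query selects the rows that violate the rule (approximate SQL, for illustration)
TESTS = {
    "unique_order_id": """
        SELECT order_id FROM orders
        WHERE order_id IS NOT NULL
        GROUP BY order_id
        HAVING COUNT(*) > 1
    """,
    "not_null_order_id": "SELECT * FROM orders WHERE order_id IS NULL",
    "amount_min_0": "SELECT * FROM orders WHERE NOT (amount >= 0)",
}

# A test passes only if its query returns no rows
failures = {name: len(conn.execute(sql).fetchall()) for name, sql in TESTS.items()}

for name, n_rows in failures.items():
    status = "PASS" if n_rows == 0 else f"FAIL ({n_rows} rows)"
    print(f"{name}: {status}")
```

All three checks fail on this sample with one offending row each, which is the kind of signal `dbt test` reports per test.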
