data2al: Data Engineering Notes and Code Patterns

Concept note

Data Cleanup Before Loading

A lightweight dataframe cleanup pattern for standardizing files before loading them into a warehouse.

2025-03-18
Alan
Programming
Beginner
Python Pandas ETL

This is a practical preprocessing step for raw CSV data before you pass it into an ETL or ELT workflow.

Example

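import pandas as pd

# Load the raw export.
df = pd.read_csv("input/customers.csv")
# Normalize headers: trim whitespace, lowercase, replace spaces with underscores.
df.columns = [column.strip().lower().replace(" ", "_") for column in df.columns]
# Keep the first row seen for each customer_id and drop the rest.
df = df.drop_duplicates(subset=["customer_id"])
# Parse timestamps into timezone-aware UTC datetimes.
df["updated_at"] = pd.to_datetime(df["updated_at"], utc=True)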

What this does

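  • normalizes column names: trims whitespace, lowercases, and replaces spaces with underscores
  • drops rows that repeat a customer_id, keeping the first occurrence
  • parses updated_at into timezone-aware UTC datetimes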
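
To see the three steps act on concrete rows, here is a minimal sketch that runs the same cleanup against a small in-memory CSV instead of a file. The sample rows, the RAW_CSV name, and the clean_customers helper are assumptions made up for illustration; they are not part of the original example.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for input/customers.csv.
RAW_CSV = """ Customer ID ,Name,Updated At
1,Ada,2025-03-01T10:00:00Z
1,Ada L.,2025-03-05T09:30:00Z
2,Grace,2025-03-02T08:15:00Z
"""


def clean_customers(raw: pd.DataFrame) -> pd.DataFrame:
    """Apply the three cleanup steps and return a new dataframe."""
    df = raw.copy()
    # " Customer ID " -> "customer_id", "Updated At" -> "updated_at"
    df.columns = [column.strip().lower().replace(" ", "_") for column in df.columns]
    # Keeps the first row for each customer_id (file order decides which one).
    df = df.drop_duplicates(subset=["customer_id"])
    df["updated_at"] = pd.to_datetime(df["updated_at"], utc=True)
    return df


cleaned = clean_customers(pd.read_csv(io.StringIO(RAW_CSV)))
print(list(cleaned.columns))            # ['customer_id', 'name', 'updated_at']
print(cleaned["customer_id"].tolist())  # [1, 2]
```

Note that the surviving row for customer 1 is the older "Ada" record, because drop_duplicates keeps the first occurrence by default. If the latest record should win, one option is to sort by updated_at before dropping duplicates and pass keep="last".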

Notes

  • expand the cleanup rules when source systems have inconsistent types
  • keep column normalization rules stable across pipelines
  • log dropped duplicates when the volume matters operationally
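
The notes above could be sketched roughly as follows: one shared normalization function, explicit type coercion for messy sources, and a log line when duplicates are dropped. The function names, the "cleanup" logger name, and the sample rows are all hypothetical, and which columns need coercion depends on your source system.

```python
import logging

import pandas as pd

logger = logging.getLogger("cleanup")  # hypothetical logger name


def normalize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """One shared rule so every pipeline produces the same column names."""
    return df.rename(columns=lambda name: str(name).strip().lower().replace(" ", "_"))


def coerce_types(df: pd.DataFrame) -> pd.DataFrame:
    """Coerce columns that arrive with mixed types; unparseable values become NaN/NaT."""
    out = df.copy()
    out["customer_id"] = pd.to_numeric(out["customer_id"], errors="coerce")
    out["updated_at"] = pd.to_datetime(out["updated_at"], utc=True, errors="coerce")
    return out


def drop_duplicate_keys(df: pd.DataFrame, key: str = "customer_id") -> pd.DataFrame:
    """Drop repeated keys (keeping the first row) and log how many rows were removed."""
    deduped = df.drop_duplicates(subset=[key])
    dropped = len(df) - len(deduped)
    if dropped:
        logger.warning("dropped %d duplicate rows on %s", dropped, key)
    return deduped


# Hypothetical messy input: string ids, one bad id, one bad timestamp.
raw = pd.DataFrame(
    {
        " Customer ID": ["1", "1", "2", "n/a"],
        "Updated At": ["2025-03-01", "2025-03-05", "not a date", "2025-03-02"],
    }
)
cleaned = drop_duplicate_keys(coerce_types(normalize_columns(raw)))
```

With errors="coerce", bad values turn into missing values instead of raising, so it is worth counting the resulting NaN/NaT entries before loading rather than letting them pass silently.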
