data2al Data Engineering Notes and Code Patterns

Search articles

Concept note

Data Cleanup Before Loading

A lightweight dataframe cleanup pattern for standardizing files before loading them into a warehouse.

2025-03-18

Alan

Programming

Beginner

Python Pandas ETL

Example
What this does
Notes

This is a practical preprocessing step for raw CSV data before you pass it into an ETL or ELT workflow.

Example

import pandas as pd

df = pd.read_csv("input/customers.csv")
df.columns = [column.strip().lower().replace(" ", "_") for column in df.columns]
df = df.drop_duplicates(subset=["customer_id"])
df["updated_at"] = pd.to_datetime(df["updated_at"], utc=True)

What this does

normalizes column names
removes duplicate business keys
converts timestamps into a predictable format

Notes

expand the cleanup rules when source systems have inconsistent types
keep column normalization rules stable across pipelines
log dropped duplicates when the volume matters operationally

Similar Posts

Move Large Datasets With Resumable and Verified Transfers

Previous Article Setting up a Snowflake Connector in Python

Next Article Deduplicate Rows With Row Number