- Example utility module
- Use the utility from a notebook
- Why this helps
- Good candidates for shared utilities
- Notes
A Databricks workspace becomes hard to maintain quickly when every notebook carries its own copy of validation logic, path handling, API calls, and cleanup functions. A better pattern is to move that common logic into small Python modules that notebooks and jobs import from one place.
Example utility module
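# project_utils/ingestion.py (module path implied by the notebook import)
from pyspark.sql import DataFrame
from pyspark.sql import functions as F

REQUIRED_COLUMNS = {"customer_id", "event_ts", "amount"}


def normalize_columns(df: DataFrame) -> DataFrame:
    # Trim, lowercase, and replace spaces with underscores in every column name.
    renamed = df
    for column in df.columns:
        clean_name = column.strip().lower().replace(" ", "_")
        renamed = renamed.withColumnRenamed(column, clean_name)
    return renamed


def validate_required_columns(df: DataFrame) -> None:
    # Fail fast and name every required column that is absent.
    missing = REQUIRED_COLUMNS.difference(df.columns)
    if missing:
        raise ValueError(f"Missing required columns: {sorted(missing)}")


def drop_blank_customer_ids(df: DataFrame) -> DataFrame:
    # Drop rows whose customer_id is null, empty, or whitespace-only.
    customer_id = F.col("customer_id")
    return df.filter(customer_id.isNotNull() & (F.trim(customer_id) != ""))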
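To make the renaming rule in normalize_columns concrete, here is a minimal pure-Python sketch of what it does to a few header strings. The helper names clean_column_name and find_collisions and the sample headers are hypothetical, introduced only for illustration. It also shows one caveat worth checking for: two different raw headers can collapse to the same cleaned name.

```python
from collections import Counter


def clean_column_name(name: str) -> str:
    """Apply the same rule normalize_columns uses, to a single name."""
    return name.strip().lower().replace(" ", "_")


def find_collisions(names: list[str]) -> list[str]:
    """Return cleaned names that more than one raw header maps to."""
    counts = Counter(clean_column_name(n) for n in names)
    return sorted(name for name, count in counts.items() if count > 1)


# Hypothetical headers as they might arrive from a CSV export.
raw_headers = [" Customer ID", "Event TS ", "Amount", "amount "]

print([clean_column_name(h) for h in raw_headers])
# ['customer_id', 'event_ts', 'amount', 'amount']

print(find_collisions(raw_headers))
# ['amount']
```

A check like find_collisions could run before renaming so that duplicate names surface as an explicit error instead of an ambiguous column later on.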
Use the utility from a notebook
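from project_utils.ingestion import (
    drop_blank_customer_ids,
    normalize_columns,
    validate_required_columns,
)

# `spark` is the session Databricks provides in notebooks.
# Illustrative path; Unity Catalog volume paths take the form
# /Volumes/<catalog>/<schema>/<volume>/<file>.
df = spark.read.option("header", True).csv("/Volumes/raw/customers.csv")
df = normalize_columns(df)        # clean names first so validation sees them
validate_required_columns(df)     # raises ValueError if a column is missing
df = drop_blank_customer_ids(df)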
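Depending on where the notebook lives, the import above may only resolve once the directory containing project_utils is on sys.path. The following is a minimal sketch under that assumption. The directory /Workspace/Shared/my_project and the helper ensure_on_sys_path are hypothetical. In Databricks Git folders the repository root is typically importable already, in which case this step may be unnecessary.

```python
import sys
from pathlib import Path


def ensure_on_sys_path(directory: str) -> bool:
    """Append directory to sys.path once; return True only if it was added."""
    resolved = str(Path(directory))
    if resolved in sys.path:
        return False
    sys.path.append(resolved)
    return True


# Hypothetical project root that contains the project_utils package.
ensure_on_sys_path("/Workspace/Shared/my_project")
```

Appending rather than prepending keeps installed libraries ahead of workspace code on the import path, which is a design choice rather than a requirement.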
Why this helps
- the notebook stays focused on business logic
- repeated ingestion rules live in one place
- a fix made once reaches every job that imports the module
- teams stop copy-pasting helper cells across notebooks
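A further consequence of keeping a rule in one place is that it can be tested once. The sketch below assumes that validate_required_columns only reads df.columns, so a small stand-in object is enough and no cluster is needed. FakeFrame is a hypothetical test double, and the function is repeated here without type hints only to keep the example self-contained.

```python
from dataclasses import dataclass, field

REQUIRED_COLUMNS = {"customer_id", "event_ts", "amount"}


def validate_required_columns(df) -> None:
    """Same check as the utility module; it only reads df.columns."""
    missing = REQUIRED_COLUMNS.difference(df.columns)
    if missing:
        raise ValueError(f"Missing required columns: {sorted(missing)}")


@dataclass
class FakeFrame:
    """Hypothetical stand-in that exposes only a .columns attribute."""
    columns: list = field(default_factory=list)


def test_accepts_complete_schema():
    # Extra columns are allowed; only missing ones are an error.
    frame = FakeFrame(["customer_id", "event_ts", "amount", "extra"])
    validate_required_columns(frame)


def test_reports_missing_columns_sorted():
    try:
        validate_required_columns(FakeFrame(["customer_id"]))
    except ValueError as exc:
        assert str(exc) == "Missing required columns: ['amount', 'event_ts']"
    else:
        raise AssertionError("expected ValueError")
```

Functions named test_* like these are picked up by pytest. Checks that depend on real Spark behavior, such as filtering, would still need a SparkSession.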
Good candidates for shared utilities
- schema validation
- retry-safe API wrappers
- path conventions for volumes and checkpoints
- lightweight quality checks before writes
- configuration loading for jobs and notebook tasks
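Two of the candidates above can be sketched with the standard library alone. This is an illustration under stated assumptions, not a prescribed design: the names call_with_retries and checkpoint_path, the default retry settings, and the checkpoints/<job_name> layout are all hypothetical. The retry helper is only safe around calls that are idempotent, meaning they can be repeated without side effects.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    func: Callable[[], T],
    attempts: int = 3,
    base_delay_s: float = 1.0,
    retry_on: tuple = (ConnectionError, TimeoutError),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func(), retrying the listed exceptions with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            # Wait base_delay_s, then 2x, 4x, ... before the next try.
            sleep(base_delay_s * 2 ** (attempt - 1))
    raise AssertionError("unreachable")


def checkpoint_path(catalog: str, schema: str, volume: str, job_name: str) -> str:
    """Build a checkpoint location under a Unity Catalog volume.

    The checkpoints/<job_name> suffix is an assumed team convention.
    """
    return f"/Volumes/{catalog}/{schema}/{volume}/checkpoints/{job_name}"
```

Passing sleep as a parameter lets tests replace the real delay with a recorder, so the backoff schedule can be verified without waiting.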
Notes
- keep shared modules small and readable
- avoid hiding core transformation logic behind too much abstraction
- version utility code with the same discipline as production jobs
- prefer explicit imports over one giant helper file