10 Data Engineering Tips & Tricks for 2025
Edition #15 - Tips for SQL, Python, Delta Lake, Spark
Data Engineering Tip #1
Exclude columns in SELECT query
If you want to exclude a handful of columns from a Spark SQL SELECT over a table with hundreds of columns, use the EXCEPT keyword instead of typing out every column you want to keep.
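A minimal sketch, assuming Databricks SQL or a Spark runtime that supports the star-except syntax; the table and column names are illustrative:
# Return every column of my_table except the two listed ones
df = spark.sql("SELECT * EXCEPT (audit_hash, raw_payload) FROM my_table")
df.show()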
Data Engineering Tip #2
Compare dataframes
Easily compare two Spark DataFrames for schema and data equality using the chispa library.
pip install chispa
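A minimal sketch of how that looks in a test, assuming an existing SparkSession named spark and illustrative DataFrames:
from chispa import assert_df_equality

df_expected = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "letter"])
df_actual = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "letter"])

# Raises an AssertionError with a readable diff if schemas or rows differ
assert_df_equality(df_actual, df_expected)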
Data Engineering Tip #3
Size of Delta Table
Find the size of a Databricks Delta table, along with other useful table properties, using the following command.
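The original snippet isn't reproduced here; on Databricks the usual way to get this is DESCRIBE DETAIL, which returns sizeInBytes, numFiles and other properties (table name illustrative):
detail = spark.sql("DESCRIBE DETAIL my_schema.my_delta_table")
detail.select("sizeInBytes", "numFiles", "location", "partitionColumns").show(truncate=False)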
Data Engineering Tip #4
3 ways to read a parquet using Python
Pandas
import pandas as pd
df = pd.read_parquet(path='file_name.parquet', engine='pyarrow')
print(df.head(100))
DuckDB
import duckdb as dd
file = 'file_name.parquet'
df = dd.sql(f"SELECT * FROM '{file}' LIMIT 100")
print(df)
Polars
import polars as pl
df = pl.read_parquet(source='file_name.parquet')
print(df.head(100))
Which one is faster?
Data Engineering Tip #5
Generate DDL for Delta Table
Quickly generate the DDL for an existing Delta table without having to write it yourself.
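The original snippet isn't shown here; the usual command for this in Spark SQL / Databricks is SHOW CREATE TABLE (table name illustrative):
ddl = spark.sql("SHOW CREATE TABLE my_schema.my_delta_table")
print(ddl.first()[0])  # the generated CREATE TABLE statement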
Data Engineering Tip #6
Testing datetime operations
So many of our ETL processes run datetime operations. Sometimes we need the current time or current date to generate metadata, calculate time differences, and so on.
So how do you test functions that depend on the current time?
Use the freezegun library
pip install freezegun
from freezegun import freeze_time
import datetime
@freeze_time("2023-11-05")
def test():
    assert datetime.datetime.now() == datetime.datetime(2023, 11, 5)
Once the decorator is applied, all calls to datetime functions inside the decorated function return the frozen datetime.
Data Engineering Tip #7
Spark Execution Plan
Understand the internals of your Spark code. Explore and fine-tune the execution plan of your Spark applications using the explain function.
df.explain() or df.explain(extended=False) - display the physical plan.
df.explain(extended=True) - display both the logical and physical plans.
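On Spark 3.0+, explain also takes a mode argument; a small sketch with an illustrative DataFrame:
df = spark.range(10).filter("id > 5")
df.explain(mode="formatted")  # other modes: "simple", "extended", "codegen", "cost"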
Data Engineering Tip #8
Clone your Delta Table
Make a faithful copy of your Delta table including the partitioning structure and state of the delta log. Quite handy for testing and backup use cases.
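The original command isn't reproduced here; on Databricks (and recent Delta Lake releases for shallow clones) this is done with the CLONE command (table names illustrative):
# DEEP CLONE copies data files and metadata; SHALLOW CLONE copies only the metadata
spark.sql("CREATE TABLE IF NOT EXISTS my_schema.sales_backup DEEP CLONE my_schema.sales")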
Data Engineering Tip #9
Any Column Filter
Have you ever explored a dataset with hundreds of columns and had a single value you wanted to filter on across all of them? In other words, to find whether the value exists anywhere in the table?
It is possible with a new filter, available only on DBR 15.1 or a preview-mode SQL Warehouse. I like to call it the Any Column Filter.
The query will automatically look for the value across all columns and return the records that match the filter. Neat trick!
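The exact DBR 15.1 syntax isn't shown here; as a version-independent fallback, you can build the same any-column filter yourself in PySpark by OR-ing a condition across every column (DataFrame and value are illustrative):
from functools import reduce
from pyspark.sql import functions as F

df = spark.createDataFrame([(1, "alpha", "x"), (2, "beta", "needle")], ["id", "col_a", "col_b"])
value = "needle"

# OR together a contains-check on every column, casting each to string first
condition = reduce(lambda a, b: a | b,
                   [F.col(c).cast("string").contains(value) for c in df.columns])
df.filter(condition).show()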
Data Engineering Tip #10
Most efficient way to read a CSV in Spark
When reading a CSV, Spark triggers a job. A job is triggered every time Spark physically has to touch the data. In this case, Spark knows nothing about our data and so cannot plan the read; it has to peek at the first line of the file just to figure out how many columns it contains.
Solution: Predefine the schema
When you pre-define the schema, Spark triggers zero jobs: it no longer needs to peek into the file because the schema has already been supplied. This is lazy evaluation at work, a crucial optimization technique in Spark.
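A minimal sketch of a pre-defined schema, assuming an existing SparkSession named spark and an illustrative file path and column set:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DateType

schema = StructType([
    StructField("order_id", IntegerType(), True),
    StructField("customer", StringType(), True),
    StructField("order_date", DateType(), True),
])

# No job is triggered here: Spark never has to peek into the file for column names
df = (spark.read
      .schema(schema)
      .option("header", "true")
      .csv("/data/orders.csv"))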








