Whats on the list today?
What is DuckDB? Get started with DuckDB CLI in 2 minutes.
Data Engineering Tips
SQL Variables Syntax in Databricks
Any column filtering in Databricks
Interesting Reads:
Okta’s multi-engine Data stack using Duck DB.
The fastest way to process 1 billion records in Python.
🦆DuckDB - The Mother of in-process OLAP
I have been exploring DuckDB and it has been making all the right kind of noises out there in the Data Engineering world. Check out the interesting read section below on How Okta used DuckDB to process trillions of records.
What is DuckDB and What does it mean for a Data Engineer?
DuckDB is an in-process SQL analytical database.
In otherwords it is the SQLite of the OLAP world.
Features
DuckDB is blazing fast using a columnar-vectorized query execution engine.
It is so simple to install no external dependencies and runs completely on the host process.
It is portable and can run on any OS, and provides API support for all major languages.
Best part - It is free and open-source.
Get started with DuckDB CLI in under 2 minutes.
Lets install the CLI.
#Windows
winget install DuckDb.cli
#Macos
brew install duckdb
Init duckdb
duckdb
Running duckdb without any arguments will instantiate a temporary in-memory database and a prompt with D
Now simply run any SQL in the prompt. I read a parquet file stored locally as below.
That was it, did not break a sweat!
What you notice at first glance
It is indeed easy to install.
It is indeed blazingly fast.
It is lightweigt and well formatted.
Lets explore the new found love for ducks in the upcoming Newsletters. Give it a try and let me know how you like it and what use cases would this suit.
🐳Data Engineering Tip
Using Variables in Databricks SQL
Starting from DBR 14.1 and above you can use temporary variables scoped to the session. Here is an example running an SQL notebook on either DBR 14.1 and above or SQL Warehouse.
Any Column Filter
Have you ever explored a dataset with hundreds of columns and you have a value that you want to filter across all columns. In other words find if the value exists anywhere in the table?
It is possible with a new filter only available on DBR 15.1 or a preview mode SQL Warehouse. I like to call this Any column filter*
The query will automatically look for the value across all columns and return the records that match the filter. Neat trick!
📰Interesting Reads