Polars: Forget the Cluster, Keep the Speed – A Fast DataFrame Library
Polars series, Databricks Power BI integration, and more in #edition21
What’s on the list today?
Polars series - Part 1: An Intro for Data Engineers
Databricks News
Power BI refresh in Databricks workflows
Tabs for notebooks
🐻Polars series - Part 1: An Intro for Data Engineers
Starting with this edition (#21), we begin a five-part series on Polars, the popular DataFrame library that rivals pandas in functionality. This first installment introduces the basics and lays the groundwork for the more advanced topics to come. If you want to follow along, the repository is available here: Github-Polarexpress
Why Polars?
Polars is an open-source DataFrame library and one of the fastest data processing tools that runs on a single machine. Reported speedups over other libraries reach up to 50x, depending on the workload.
Polars was created by Ritchie Vink, who saw the limitations of traditional Python DataFrame libraries such as pandas. Instead of tweaking existing tools, he built Polars from scratch in ⚡Rust⚡, which gave him full control over every performance-critical data structure.
As a funny anecdote, the very first implementation of Polars, which replicated a simple join 🤯, was very slow and fared terribly against pandas. That set a challenge for its creator, and today Polars is one of the fastest data processing engines around ⚡️.
Why is Polars fast?⚡
Polars is built from the ground up in Rust ⚡️, a language prized for its speed, safety, and efficiency 💪. It also ships with a built-in query optimizer that works on lazy operations: a lazy query is only described at first, then optimized as a whole before it is executed, which maximizes performance.
Vectorization and SIMD (Single Instruction, Multiple Data) also play a crucial role in Polars' performance, because a single CPU instruction can process several values at once. Together, these techniques give Polars a reported 10-50x performance gain over competing tools on some workloads 🚀
Getting Started with Polars
Polars comes with Python bindings, so getting started is as easy as making a cup of tea.
pip install polars
Or, if you want to jump on the Rust hype train, use the uv package manager (which is itself written in Rust).
uv add polars
Creating a static DataFrame
import polars as pl
df = pl.DataFrame(
    {
        "name": ["Adam", "Joshua", "Moses", "Jonah"],
        "age": [55, 500, 1000, 100],
    }
)
print(df)
Reading a Parquet file
df_flights = pl.read_parquet("data/flights.parquet")
print(df_flights.limit(10))
Tip: Unlike pandas, which is largely single-threaded, Polars runs many operations in parallel across the available CPU cores.
What’s Next?
In the next edition, we'll dive deep into lazy execution, query optimization, and the query planner. Stay tuned. The code is available on GitHub.
📰Databricks News
📈Power BI refresh in Workflows
Databricks has strengthened its direct Power BI integration with a new feature 💻.
It has announced the public preview of a new task type in Databricks Workflows that automatically publishes and updates Power BI semantic models.
To add a Power BI task to a job:
Navigate to the Tasks tab in the Jobs UI for the job to which you want to add the task.
Click + Add task.
Enter a Task name.
In the Type drop-down menu, select Power BI.
Configure the task properties (see this table for the properties and their use).
Key benefits include:
Automation: orchestration is simpler, and manual refresh steps are no longer needed 🔄.
Reduced refresh costs: models are updated only when the data changes, which avoids unnecessary expenses 💸.
📄 Notebook Tabs
If, like me, you've been frustrated by having to navigate back to the workspace to switch between notebooks in Databricks, losing your train of thought along the way, this new feature will likely be a welcome solution.
Databricks has updated its interface to make switching between notebooks, files, and queries a lot easier. Now, all these essentials are just a tab away 📄. No more back-and-forth navigation, no more lost sessions... just a fresh start each time 💻
To enable this experimental feature, go to Settings > Developer, scroll down to Experimental features, and toggle on Tabs for notebooks and files.
Have fun!