📣Key Announcements
💰 Databricks Free Edition
Buckle up, folks, because Databricks just became the ultimate playground for data enthusiasts!
No credit card or business email needed.
Get instant access to a fully functional Databricks workspace. This means you can try out Databricks, test its features, and explore the world of the Lakehouse, whether you're a student, breaking into the data engineering field, or an experienced Databricks professional (with some limitations).
What's really impressive is that the free edition comes with a wide range of features, giving you almost all the capabilities of a production environment, including:
Setting up Git for version control
Running workflows
Using structured streaming
Creating interactive dashboards
Sign up now and experience it for yourself! Here's a sneak peek of querying one of the sample datasets available.
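For instance, a quick aggregate over the bundled NYC taxi data might look like the sketch below (the `samples.nyctaxi.trips` dataset and its column names are assumptions based on the standard sample catalog shipped with Databricks workspaces):

```sql
-- Count trips per pickup ZIP from the built-in sample catalog
-- (dataset and column names assumed from the standard samples catalog)
SELECT pickup_zip, COUNT(*) AS trip_count
FROM samples.nyctaxi.trips
GROUP BY pickup_zip
ORDER BY trip_count DESC
LIMIT 10;
```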
Full Apache Iceberg Support
We're excited to see Iceberg becoming a first-class citizen in the Databricks/Unity Catalog ecosystem, bringing the full benefits of Unity to Iceberg users, including predictive optimization and governance.
What’s new?
Iceberg now supports managed tables with Unity Catalog governance, making data management easier than ever.
Unity Catalog is now a fully open and compatible catalog for Iceberg, streamlining data integration.
❤️ The Iceberg REST Catalog API enables seamless reading and writing from any external engine that supports Iceberg, including Snowflake, Trino, DuckDB, and EMR❤️.
Delta Sharing technology is also now available for Iceberg tables, making it easier to share data across organizations.
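To make the REST Catalog point concrete, here is a sketch of how an external Spark engine might attach Unity Catalog as an Iceberg REST catalog. The endpoint path, catalog alias, and auth property names are assumptions; check your workspace's documentation for the exact values:

```
# Hypothetical Spark configs for reading Unity Catalog tables via the
# Iceberg REST Catalog API (endpooint path and property names are assumptions)
spark.sql.catalog.uc=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.uc.type=rest
spark.sql.catalog.uc.uri=https://<workspace-url>/api/2.1/unity-catalog/iceberg
spark.sql.catalog.uc.token=<personal-access-token>
```

With that in place, the external engine would address tables as `uc.<schema>.<table>`.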
Spark Declarative Pipelines
DLT (Delta Live Tables), which helps you build end-to-end production pipelines with declarative SQL syntax, has now been open-sourced in Apache Spark as Spark Declarative Pipelines. This aims to eliminate the tedious tasks of dependency management, error handling, and retries, allowing you to focus on what matters most: collecting, transforming, and analyzing your data.
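A minimal declarative pipeline is just a set of table definitions; the engine infers the dependency graph and ordering. A hedged sketch in the DLT-style SQL dialect (the source path, `read_files` usage, and table names are hypothetical):

```sql
-- Bronze: ingest raw JSON as a streaming table
-- (path and schema are illustrative assumptions)
CREATE STREAMING TABLE raw_orders AS
SELECT * FROM STREAM read_files('/data/orders', format => 'json');

-- Silver: a materialized view derived from the streaming table;
-- the pipeline engine tracks the dependency and refresh order
CREATE MATERIALIZED VIEW daily_orders AS
SELECT order_date, COUNT(*) AS order_count
FROM raw_orders
GROUP BY order_date;
```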
LakeBase
Remember that time we spilled the beans about Databricks's acquisition of Neon? Yeah, it's been a while... but the secret's out!
OLTP is here at Databricks.
Lakebase is a fully managed Postgres OLTP engine that can be provisioned as a new compute type, a database instance, inside Databricks.
This sets the stage for:
Running both OLTP and OLAP workloads in a single, unified platform
Unified governance through Unity Catalog
Seamless data syncing between Unity Catalog and Postgres, both ways
Direct database access via SQL editor, Notebooks, and external tools like DBeaver
Branching of schema and database using copy-on-write clones for effortless testing and production isolation
Some important details to keep in mind:
2TB maximum instance size
1000 concurrent connections per instance
Up to 10 instances per workspace
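Because Lakebase speaks standard Postgres, plain SQL and existing tooling work as-is. A minimal OLTP-style sketch (table and column names are hypothetical):

```sql
-- Ordinary Postgres DDL/DML against a Lakebase instance
-- (schema is illustrative, not from the announcement)
CREATE TABLE app_users (
  id         SERIAL PRIMARY KEY,
  email      TEXT UNIQUE NOT NULL,
  created_at TIMESTAMPTZ DEFAULT now()
);

INSERT INTO app_users (email) VALUES ('ada@example.com');
SELECT id, email FROM app_users WHERE email = 'ada@example.com';
```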
📈Unity Catalog Metrics
Getting back to basics, Unity Catalog's latest innovation, Metrics, aims to bridge the gap between data engineers and business users, much like the classic combination of SQL Server Analysis Services and DAX in Power BI. With Metrics now generally available (GA) within Databricks' Unity Catalog, we're one step closer to realizing a more unified and semantic data model.
Say goodbye to the frustrations of inconsistent metric definitions across different tools and teams, and hello to:
A single source of truth for metrics, making it easy to define once and use everywhere
Enhanced governance using Unity Catalog, ensuring that metrics adhere to your organization's standards and regulations
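To give a feel for "define once, use everywhere," here is a best-effort sketch of a metric view. Treat the `WITH METRICS LANGUAGE YAML` syntax and every name below as assumptions about the feature's shape rather than verified syntax:

```sql
-- Sketch of a Unity Catalog metric view: dimensions and measures declared
-- once, then reusable from any tool (syntax and names are assumptions)
CREATE VIEW main.analytics.sales_metrics
WITH METRICS
LANGUAGE YAML
AS $$
version: 0.1
source: main.sales.orders
dimensions:
  - name: order_date
    expr: order_date
measures:
  - name: total_revenue
    expr: SUM(amount)
$$;
```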
🐳Delta Lake 4.0
🧪 New VARIANT data type: Seamlessly manage semi-structured data like JSON in a strongly typed format — great for evolving schemas, and it works for both Delta and Iceberg.
Simple example:
CREATE TABLE device_data (device_info VARIANT);

INSERT INTO device_data
SELECT PARSE_JSON('{ "device": { "id": 19, "color": "red" } }') AS device_info;
Query the VARIANT type using : (colon) to access top-level fields and . (dot) notation for nested fields.
SELECT device_info:device.id FROM device_data;
Enhanced DROP FEATURE: Instantly drop table features without truncating history.
File Statistics in Delta Log: Enhances queries through efficient data skipping.
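The enhanced DROP FEATURE mentioned above boils down to a one-line ALTER; previously, dropping certain features required truncating table history first. The feature name here is chosen purely for illustration:

```sql
-- Drop a table feature while keeping the full table history intact
-- (deletionVectors is just an example feature name)
ALTER TABLE device_data DROP FEATURE deletionVectors;
```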
🔥 Apache Spark 4.0
Spark is moving at a relentless pace; 4.0 is a major release and packs a punch. Here is a short list of interesting new features coming soon to a cluster near you.
Real Time Mode for Streaming (RTM) - a game-changing capability that enables true millisecond-latency processing. Unlike traditional structured streaming, which relies on microbatches, RTM runs long-lasting tasks continuously, polling for data and processing it in real time.
Native Plot API - Quickly visualize data with built-in `.plot()` support powered by Plotly — no extra setup needed.
SQL Pipe Syntax - Write SQL in a brand-new way as a sequence of independent clauses arranged in any order, much like DataFrames.
-- Pipe syntax
FROM customer
|> LEFT OUTER JOIN orders ON c_custkey = o_custkey
AND o_comment NOT LIKE '%unusual%packages%'
|> AGGREGATE COUNT(o_orderkey) c_count
GROUP BY c_custkey
|> AGGREGATE COUNT(*) AS custdist
GROUP BY c_count
|> ORDER BY custdist DESC, c_count DESC;
🔄 Structured Streaming upgrades: The new `transformWithState` API enables rich stateful stream processing with TTLs, timers, and custom logic.