The best of Data + AI Summit 2024 for Data Engineers
Most anticipated features for Data Engineers to watch out for.
Databricks is a Lakehouse platform. Scratch that, it is now The Data Intelligence Platform.
📣Key Announcements
💰 Databricks + Tabular
A few days before the event, Databricks announced the successful acquisition of Tabular, Inc.
Tabular was founded by the original creators of Apache Iceberg, an open table format similar to Delta. This means Delta and Iceberg are no longer in a race but now sit at the same table.
🖥️ Databricks + NVIDIA
Databricks announced the integration of NVIDIA’s accelerated computing with the Photon engine. NVIDIA’s CEO remarked that “accelerated computing and generative AI are the two most important technological trends today.”
🔓 Unity Catalog OSS
In one of the best demos you will ever see in a keynote presentation, Matei Zaharia open-sourced Unity Catalog live on stage.
It is billed as a catalog for unified data governance across clouds, formats, and platforms.
What the 0.1 open-source version of Unity Catalog brings:
Management of tables and volumes (unstructured data) in a single catalog.
Access from the Iceberg engine ecosystem via the Iceberg REST Catalog API (see the sketch after this list).
Unity REST APIs enabling the open-source community to build powerful integrations.
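To make the Iceberg angle concrete, here is a minimal sketch of reading a Unity Catalog table from the Iceberg ecosystem with PyIceberg’s REST catalog support. It assumes a locally running Unity Catalog OSS server; the endpoint URI, token, and table identifier are placeholders, not the documented values.

# Minimal sketch: read a Unity Catalog table through the Iceberg REST
# Catalog API using PyIceberg. The endpoint URI, token, and table
# identifier are assumptions; check the Unity Catalog OSS docs for the
# exact paths exposed by your server.
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "unity",
    **{
        "type": "rest",
        "uri": "http://localhost:8080/api/2.1/unity-catalog/iceberg",  # assumed endpoint
        "token": "<personal-access-token>",
    },
)

table = catalog.load_table("demo_schema.demo_table")  # hypothetical schema.table identifier
print(table.schema())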
🔌Serverless Everything
Serverless compute is coming to Workflows and notebooks, currently in public preview. As a Data Engineer, this lets you run jobs and notebooks without having to worry about infrastructure. With serverless, Databricks manages the compute, including optimization and scaling, so you don’t have to.
No setting up clusters, no configuring instance types, no choosing a DBR runtime, and no more paying for idle time.
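As a sketch of what that looks like in practice, the job below is defined with the Databricks Python SDK and simply omits any cluster configuration; assuming serverless jobs are enabled in the workspace, the task runs on serverless compute. The job name and notebook path are hypothetical.

# Minimal sketch: create a job with no cluster configuration at all.
# In a workspace with serverless jobs enabled, the task runs on
# serverless compute managed by Databricks. Names and paths are hypothetical.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

w = WorkspaceClient()  # picks up auth from the environment or a config profile

job = w.jobs.create(
    name="nightly-etl-serverless",
    tasks=[
        jobs.Task(
            task_key="run_etl_notebook",
            notebook_task=jobs.NotebookTask(notebook_path="/Workspace/Users/me@example.com/etl"),
            # note: no new_cluster / job_cluster_key here
        )
    ],
)
print(f"Created job {job.job_id}")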
Will Data Engineers benefit from losing control over infrastructure? Only time will tell.
🚰LakeFlow
Ingestion was never the strong suit of Databricks; to ingest data from different source systems you often had to rely on tools like Azure Data Factory, Fivetran, and other specialized ETL systems.
Enter LakeFlow: one tool for Ingest → Transform → Orchestrate.
LakeFlow is a combination of three features:
LakeFlow Connect - Ingestion connectors to various databases and enterprise applications using a simple point-and-click setup. 🟢Private preview
LakeFlow Pipelines - Declarative data flows, similar to Azure Data Factory Mapping Data Flows, to create data transformations using plain SQL (see the sketch after this list). 🟠Coming soon
LakeFlow Jobs - Orchestration of LakeFlow Pipelines, built on the foundation of Workflows. 🟠Coming soon
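LakeFlow Pipelines is not publicly available yet, so there is nothing to copy-paste today. The closest declarative model you can try right now is Delta Live Tables, which the sketch below uses as an analogue; the table names and landing path are hypothetical, and the eventual LakeFlow syntax may well differ.

# Minimal sketch of a declarative pipeline using Delta Live Tables (DLT),
# shown as the nearest available analogue to LakeFlow Pipelines.
# Table names and the landing path are hypothetical; `spark` is provided
# by the DLT runtime.
import dlt
from pyspark.sql.functions import col

@dlt.table(comment="Raw device events ingested with Auto Loader")
def raw_events():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/Volumes/demo/landing/events")  # hypothetical landing path
    )

@dlt.table(comment="Events with a valid device id")
def clean_events():
    return dlt.read_stream("raw_events").where(col("device_id").isNotNull())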
📈Unity Catalog Metrics
Remember SQL Server Analysis Services and DAX in Power BI? Metrics is setting out to be the semantic data model within Databricks, coming later this year.
🐳Delta Lake 4.0
Delta Lake UniForm is GA - Universal Format is the vision of lakehouse format interoperability. Delta, Iceberg, and Hudi have been at the center of the lakehouse format wars. UniForm essentially lets you write data as Delta and read it as Iceberg or Hudi.
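A minimal sketch of enabling UniForm on a new Delta table is shown below; the table properties follow the Delta Lake/Databricks documentation for Iceberg reads, while the catalog, schema, and table names are made up.

# Minimal sketch: create a Delta table with UniForm enabled so Iceberg
# clients can read it. Property names follow the Delta/Databricks docs;
# the table name is hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # in a Databricks notebook, `spark` already exists

spark.sql("""
    CREATE TABLE demo.bronze.events (id BIGINT, payload STRING)
    TBLPROPERTIES (
        'delta.enableIcebergCompatV2' = 'true',
        'delta.universalFormat.enabledFormats' = 'iceberg'
    )
""")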
Open Variant datatype - 🟠 Public preview - In the era of generative AI there is a lot more semi-structured data, and the variant datatype lets you store and query it efficiently instead of just dumping it as strings.
Simple example:
CREATE TABLE device_data (device_info VARIANT);

INSERT INTO device_data
SELECT parse_json('{ "device": { "id": 19, "color": "red" } }') AS device_info;

Query the variant datatype using : to access top-level fields and . (dot) notation for nested key fields:

SELECT device_info:device.id FROM device_data;

Type widening - 🟠 Public preview - A new feature that allows changing a column’s datatype to a wider type without rewriting data files. For example, change an integer column to long, decimal, or double without rewriting any of the data.
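A minimal sketch of what that looks like, assuming the delta.enableTypeWidening table property described in the Delta Lake docs and a hypothetical table:

# Minimal sketch: widen an INT column to BIGINT in place, without rewriting
# data files. Property name and ALTER syntax follow the Delta Lake docs;
# the table and column names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # `spark` already exists in a Databricks notebook

spark.sql("ALTER TABLE demo.bronze.device_data SET TBLPROPERTIES ('delta.enableTypeWidening' = 'true')")
spark.sql("ALTER TABLE demo.bronze.device_data ALTER COLUMN device_id TYPE BIGINT")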
🔥 Apache Spark 4.0
Spark is moving at a relentless pace; 4.0 is a major release and packs a punch. Here is a short list of interesting new features coming soon to a cluster near you.
ANSI SQL default mode
Spark Connect GA (see the sketch below)
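Spark Connect decouples the client from the cluster over gRPC, so a thin PySpark client can drive a remote Spark server. A minimal sketch, where the host and port are placeholders (15002 is the usual Spark Connect default):

# Minimal sketch: connect a thin PySpark client to a remote Spark cluster
# via Spark Connect. The host is a placeholder.
from pyspark.sql import SparkSession

spark = SparkSession.builder.remote("sc://spark-server.example.com:15002").getOrCreate()

df = spark.range(10)
print(df.filter(df.id % 2 == 0).count())  # the work runs on the remote cluster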


