
Boost your performance with Delta Lake 2.0

What is Delta Lake?

Delta Lake was open-sourced by Databricks in April 2019. It is a storage layer on top of a data lake that provides ACID transaction capabilities, ensuring data quality, reliability and consistency.

During processing, every transaction is recorded in a log file (the Delta log); this lets you restore your data to a previous state if a problem occurs.
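As a minimal sketch (the table name and version number are hypothetical), this is how the log can be used to inspect and roll back a Delta table in Spark SQL:

    -- List the versions recorded in the Delta log
    DESCRIBE HISTORY my_table;

    -- Time travel: read the table as it was at an earlier version
    SELECT * FROM my_table VERSION AS OF 3;

    -- Roll the table back to that version if something went wrong
    RESTORE TABLE my_table TO VERSION AS OF 3;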

What are the new features of Delta Lake 2.0?

At the Data + AI Summit 2022, Databricks revealed several new features.

- Change Data Feed: lets you track the changes made to rows between versions of a Delta table (a short sketch follows this list).

- Dropping columns: you can remove a column with the "DROP COLUMN" command.

- Multi-cluster writes (S3): writing to a table from multiple clusters simultaneously no longer corrupts the table; this capability already existed on Azure.

- Z-order: allows you to optimize the organization of files in a Delta table; you will see this feature in detail later in this article.
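Here is a minimal sketch of the first two features, assuming a hypothetical my_table; note that table_changes is the Databricks SQL syntax for reading the change feed, and that dropping a column requires the column mapping mode set below:

    -- Enable Change Data Feed and name-based column mapping on an existing table
    ALTER TABLE my_table SET TBLPROPERTIES (
      'delta.enableChangeDataFeed' = 'true',
      'delta.columnMapping.mode' = 'name',
      'delta.minReaderVersion' = '2',
      'delta.minWriterVersion' = '5'
    );

    -- Read the row-level changes recorded since version 10 (Databricks SQL)
    SELECT * FROM table_changes('my_table', 10);

    -- Drop a column without rewriting the data files
    ALTER TABLE my_table DROP COLUMN obsolete_column;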

Databricks also announced that it is open-sourcing all of Delta Lake.

Other improvements have been made to existing features, notably data skipping using column statistics: data skipping speeds up read operations. When files are written, the Delta log stores the minimum and maximum value of each column in each file. Reads can then ignore files whose value ranges cannot match the query. This works very well when each file contains a narrow range of values. And this is where Z-order comes in!
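As an illustration of data skipping, consider a hypothetical events table whose files carry the statistics shown in the comments:

    -- The Delta log holds per-file min/max statistics, e.g.:
    --   file A: event_date min 2022-01-01, max 2022-01-31
    --   file B: event_date min 2022-02-01, max 2022-02-28
    -- This query only needs to open file B; file A is skipped entirely.
    SELECT count(*)
    FROM events
    WHERE event_date = '2022-02-15';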

Z-order

Databricks has introduced "Optimize Z-order". It colocates related column values in the same set of files: the files are sorted and grouped according to the specified column. When a query runs, only the files containing the queried values are scanned.

Below, the serviceTbl table is not ordered. A query such as "SELECT * FROM serviceTbl WHERE service = 'IoT Hub'" (the column name is illustrative) may have to scan every file. Implementing Z-order reorganizes the table so that this kind of query only scans the relevant files.

When the table is ordered, the search for data is faster.

the serviceTbl table

The behavior of Z-order is very similar to that of a SQL Server clustered index: multiple columns can be specified, and you should choose the Z-order columns according to your query patterns. Z-order then reduces the number of files that have to be read, which improves read performance. Let's see this in practice.

We will use a delta sales table with 2 billion rows.

Let's execute the first query.

query 1

The result is obtained in 33 seconds. Depending on the query pattern, you can modify the file structure with Z-order.
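A minimal sketch of the command, assuming the query filters on hypothetical customer_id and store_id columns of the sales table:

    -- Rewrite the data files so that rows are colocated by the predicate columns
    OPTIMIZE sales
    ZORDER BY (customer_id, store_id);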

The Delta table is rebuilt according to the ZORDER BY clause. You get 121 files instead of 342, which means fewer files to read, and the range of values in each file is narrower. These structural changes have a considerable impact and improve the performance of our query.
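You can verify the new layout yourself; DESCRIBE DETAIL reports, among other things, the number of data files in the table:

    -- numFiles in the output drops once the table has been Z-ordered
    DESCRIBE DETAIL sales;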

The result of the query is displayed after 15 seconds. You save 18 seconds on the same query!

Why is it faster?

As we have seen, Delta Lake records the minimum and maximum values of each column of each file in the Delta log, which speeds up read queries. Z-ordering on the predicate ("where" clause) columns groups your data and makes data skipping more efficient. You should apply Z-order to a column with high cardinality.

What you need to remember when implementing Z-order:

- On high cardinality columns
- Use Z-order on predicate columns
- Multiple columns can be specified
- If the query patterns change, update the Z-order columns.
- The VACUUM command is needed to remove the data files left over from the previous file structure (see the sketch below).
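A minimal sketch, using the default retention period of 7 days (the table name is the one assumed earlier):

    -- Physically delete data files that are no longer referenced by the Delta log
    -- (by default, only files older than the 7-day retention threshold are removed)
    VACUUM sales;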

Conclusion

Delta Lake is continually improving by combining the best of traditional data warehouses with the best of data lakes. With Z-order, you have seen how quickly you can improve query performance. Change Data Feed (CDF) lets Delta Lake track changes at the row level, which makes incremental processing more efficient.

Let us know if you would like more information about CDF!    

"Databricks vs Snowflake" | Journal du Net

"Databricks vs Snowflake" | Journal du Net

Thomas Dallemagne answers all your questions about Databricks vs Snowflake in...
AWS Public Sector Summit 2024: A dive into innovation and sustainability

AWS Public Sector Summit 2024: A deep dive into...

Explore the latest innovations and commitments to sustainability at ...
Blog - Real-time analysis with Microsoft Fabric and Azure Event Hubs

Blog - Real-time analysis with...

Find out how to connect your Microsoft Fabric KQL database (Kusto Qu...
ACCELERATE WITH US
Are you passionate about data?

Contact us