

BigLake: why and how to use it?

Organizations are seeing their data grow at an accelerated rate. As teams build solutions with the right tools for each job, data ends up spread across different locations, formats and even cloud platforms. This increasingly distributed data leads to silos, and data silos bring their own risks and problems. Yet this data has great analytical value that can address new and increasingly demanding customer use cases.

To evolve in a growing digital world, companies need to break down data silos and enable new analytic use cases, regardless of where the data is stored or its format.

To bridge this gap between data and value, Google has introduced a new feature: BigLake. Building on years of BigQuery innovation, BigLake is a storage engine that unifies data lakes and data warehouses while providing fine-grained access control, performance acceleration on multi-cloud storage and support for open file formats.

Architecture - What is BigLake?

[Diagram: BigLake architecture]

By creating BigLake tables in BigQuery or using the BigLake connector on open-source engines such as Apache Spark, you can extend your access to data in Amazon S3, Azure Data Lake Storage and, of course, Google Cloud Storage.

Data is accessible via the supported open data formats: Avro, CSV, JSON, ORC and Parquet, and, since Google Cloud Next 2022, also Apache Iceberg, Delta Lake and Apache Hudi.

BigLake extends BigQuery's fine-grained access control to the table, row and column level. With BigQuery Omni, this security policy is enforced consistently across other cloud platforms. It enables interoperability between data warehouses and data lakes while managing a single copy of the data.
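As an illustration of the row-level piece, BigLake tables accept the same row access policy DDL as native BigQuery tables. The sketch below assumes a hypothetical project, dataset, table and group:

-- Hypothetical example: members of the analysts group only see rows where country = 'FR'
CREATE ROW ACCESS POLICY fr_rows_only
ON `your-project.your_dataset.your_biglake_table`
GRANT TO ('group:analysts@example.com')
FILTER USING (country = 'FR');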

Because BigLake tables are a type of BigQuery table, they can also be discovered centrally in the data catalog and managed at scale with Dataplex, unifying governance and management at scale.

Demo - How to use BigLake?

In this introductory demo, we will explore the results of an international triathlon competition. The file contains the following information about the participants:

  • ID
  • Last Name
  • Email
  • Country
  • Duration in minutes


The file is stored in Google Cloud Storage. Let's see who has the fastest time while maintaining the privacy of the participants.

1. First, create a connection

The first thing we need to do is create an external connection to our data. The BigLake connection to Google Cloud Storage can be created from the BigQuery service page.

At the top left, click on "+ ADD DATA". A new pane will appear on the right; click on "Connections to external data sources" and fill in the following fields:

  1. Connection type: choose "BigLake and remote functions".
  2. Connection ID: "gcs_triathlon_demo".
  3. Data location: "europe-west1 (Belgium)".
  4. Friendly name and description: optional, but useful.
  5. Click on "CREATE CONNECTION".
[Screenshot: the external data source connection form]

When the connection is created, we can see it in the left pane, under "External connections":

[Screenshot: the new connection listed under "External connections"]

Let's copy the "Service account ID", as we will need to grant this service account read permission on the object storage.

2. Give the service account read access

Following the principle of least privilege, do not grant global permissions to service accounts that do not need them. In our case, we'll grant object viewing permission at the bucket level.

Navigate to the Google Cloud Storage page and grant read permissions on the bucket containing our data file.

Our file is stored in a bucket called "biglake_demo_triathlon_results". Open the bucket and navigate to the "PERMISSIONS" tab. Here, we'll grant "Storage Object Viewer" to the connection's service account "bqcx-dummy-34l3@gcp-sa-bigquery-condel.iam.gserviceaccount.com".

[Screenshot: the PERMISSIONS tab of the biglake_demo_triathlon_results bucket]

3. Create the BigLake table

The next step is to create our BigLake table in a dataset called "triathlon_demo", located in the "europe-west1" region, the same region as our data.
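If the dataset does not exist yet, it can also be created with DDL; a minimal sketch (the demo assumes the dataset is already there):

-- Hypothetical step: create the dataset in the same region as the data
CREATE SCHEMA IF NOT EXISTS `YOUR-PROJECT.triathlon_demo`
OPTIONS (location = 'europe-west1');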

Instead of creating the BigLake table via the user interface, we will do things a little differently and create it via SQL, using our BigLake connection:

CREATE EXTERNAL TABLE `YOUR-PROJECT.triathlon_demo.gcs_triathlon_results` (
    id STRING,
    name STRING,
    email STRING,
    country STRING,
    duration INT64
)
-- Our BigLake connection to GCS
WITH CONNECTION `YOUR-PROJECT.europe-west1.gcs_triathlon_demo`
OPTIONS (
    format = 'CSV',
    uris = ['gs://biglake_demo_triathlon_results/*.csv']
);

BigQuery supports reading multiple source files into a single external table. To do this, use the wildcard ("*") in the URI.

Running the above query creates the BigLake table:

[Screenshot: the newly created BigLake table]
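As a quick sanity check (a hypothetical step, not part of the original demo), a simple count over the new table confirms that the CSV files matched by the wildcard are picked up:

SELECT COUNT(*) AS participants
FROM `YOUR-PROJECT.triathlon_demo.gcs_triathlon_results`;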

4. Protecting participants' personal information

With Dataplex, we can seamlessly add unified governance and management on top of our BigLake table. In this demo, we will see how to restrict access at the column level using a taxonomy of policy tags.

To create a policy tag, a taxonomy must first be defined, which is done on the Dataplex taxonomies page.

Click on "CREATE A NEW TAXONOMY" and we will fill in the form:

[Screenshot: the new taxonomy form]

Once created, let's apply it to the columns that contain sensitive information. Go back to BigQuery and select our BigLake table. Click "EDIT SCHEMA", select the "name" and "email" columns, and click "ADD POLICY TAG". Select the policy tag and save the changes.

[Screenshot: policy tags applied to the name and email columns]

Now the columns are protected. Let's test it!

5. Result: Who has the best time?

With everything configured and the policy tags applied, let's find out who was the fastest.
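The natural first attempt is to select every column of the table, including the two tagged ones. A sketch along these lines (the exact query is not reproduced here):

-- Selects all columns, including the policy-tagged name and email
SELECT *
FROM `YOUR-PROJECT.triathlon_demo.gcs_triathlon_results`
ORDER BY duration ASC;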


When we execute the query, we are greeted with a message telling us that we do not have permission to display the names or emails of the participants. So our policy works!

By excluding these fields from the query, we can finally see the result:

[Screenshot: query result]
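The query behind this result is along these lines, a sketch based on the schema defined earlier that simply leaves out the tagged name and email columns:

-- Only untagged columns, so no policy-tag permission is needed
SELECT id, country, duration
FROM `YOUR-PROJECT.triathlon_demo.gcs_triathlon_results`
ORDER BY duration ASC
LIMIT 1;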

The anonymous participant with ID "51b93cd6", from Ivory Coast, has the fastest time: 115 minutes.

6. Conclusion

BigLake closes the gap between data and value by eliminating data silos across cloud platforms, while supporting fine-grained access control and unified governance of distributed data. It's a tool that integrates well into a multi-cloud organization.

Key learnings from BigLake

- Manage a single copy of data with a uniform set of features across data lakes and data warehouses.

- Fine-grained access control across multi-cloud platforms.

- Integration with open-source engines and support for open data formats.

- The benefits of BigQuery.

Want to know more?

Let's talk about it!

"Databricks vs Snowflake" | Journal du Net

"Databricks vs Snowflake" | Journal du Net

Thomas Dallemagne answers all your questions about Databricks vs Snowflake in...
AWS Public Sector Summit 2024: A dive into innovation and sustainability

AWS Public Sector Summit 2024: A deep dive into...

Explore the latest innovations and commitments to sustainability at ...
Blog - Real-time analysis with Microsoft Fabric and Azure Event Hubs

Blog - Real-time analysis with...

Find out how to connect your Microsoft Fabric KQL database (Kusto Qu...
ACCELERATE WITH US
Are you passionate about data?

Contact us