12/5/2023
Apache Iceberg Spark

Apache Iceberg is a popular open source table format for customers looking to build data lakes. It provides many features found in enterprise data warehouses, such as transactional DML, time travel, schema evolution, and advanced metadata that unlocks performance optimization. Iceberg's open specification allows customers to run multiple query engines on a single copy of data stored in an object store. Backed by a growing community of contributors, Apache Iceberg is becoming the de facto open standard for data lakes, bringing interoperability across clouds for hybrid analytical workloads and systems to exchange data.

Earlier this year, we announced BigLake, a storage engine that enables customers to store data in open file formats (such as Parquet) on Google Cloud Storage and run GCP and open source query engines on it in a secure, governed, and performant manner. BigLake unifies data warehouses and lakes by enabling BigQuery and open source frameworks like Spark to access data with fine-grained access control. Today, we are excited to announce that this support now extends to the Apache Iceberg format, enabling customers to take advantage of Iceberg's capabilities to build an open format data lake while benefiting from native GCP integration using BigLake.

For GCP customers who store their data on BigQuery Storage and Google Cloud Storage, BigLake now further unifies data lake and warehouse workloads. This set of capabilities enables customers to store a single copy of data on object stores using Iceberg and run BigQuery as well as Dataproc workloads on it in a secure, governed, and performant manner, eliminating the need to duplicate data or write custom infrastructure.

"Besides BigQuery, a large segment of our data is stored on GCS. Our Datalake leveraged Iceberg to tap into this data in an efficient and scalable way on top of incredibly large datasets. BigLake integration makes this even easier by making this data available to our large BigQuery user base and leverage its powerful UI. Our users now have the ability to realize most BigQuery benefits on GCS data as if this was stored natively." - Bo Chen, Sr. Manager of Data and Insights at Snap Inc.

Build a secure and governed Iceberg data lake with BigLake's fine-grained security model

BigLake enables a multi-compute architecture: Iceberg tables created in supported open source analytics engines can be read using BigQuery:

SELECT COL1, COL2 FROM bigquery_table LIMIT 10

Apache Spark already has rich support for Iceberg, allowing customers to use Iceberg's core capabilities, such as DML, transactions, and schema evolution, to carry out large-scale transformation and data processing. Customers can run Spark using Dataproc (managed clusters or serverless), or use the built-in support for Apache Spark in BigQuery (stored procedures) to process Iceberg tables hosted on Google Cloud Storage. Regardless of your choice of Spark, BigLake automatically makes those Iceberg tables available for end users to query.

Administrators can now use Iceberg tables, similar to BigLake tables, and don't need to provide end users access to the underlying GCS bucket. Administrators can further secure Iceberg tables using fine-grained access policies, such as row- and column-level access control or data masking, extending the existing BigLake governance framework to Iceberg tables. End user access is delegated through BigLake, simplifying access management and governance. BigQuery utilizes Iceberg's metadata for query execution, providing a performant query experience to end users.
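The Iceberg capabilities that Spark exposes, transactional DML, schema evolution, and time travel, can be sketched in Spark SQL. This is an illustrative sketch, not code from the announcement: the catalog, database, and table names (demo.db.events, demo.db.updates) are hypothetical, and it assumes a Spark session with an Iceberg catalog named "demo" already configured.

```sql
-- Create an Iceberg table, partitioned by day (hypothetical names).
CREATE TABLE demo.db.events (
  id BIGINT,
  payload STRING,
  event_ts TIMESTAMP)
USING iceberg
PARTITIONED BY (days(event_ts));

-- Transactional DML: upsert staged rows in a single atomic commit.
MERGE INTO demo.db.events t
USING demo.db.updates u
ON t.id = u.id
WHEN MATCHED THEN UPDATE SET t.payload = u.payload
WHEN NOT MATCHED THEN INSERT *;

-- Schema evolution: add a column without rewriting existing data files.
ALTER TABLE demo.db.events ADD COLUMN source STRING;

-- Time travel: read the table as of an earlier point in time.
SELECT * FROM demo.db.events TIMESTAMP AS OF '2023-12-01 00:00:00';
```

Because Iceberg commits are atomic snapshots, the MERGE either fully applies or is not visible at all, which is what makes a single copy of data safe to share across engines.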
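As one concrete illustration of the row-level policies mentioned above, BigQuery lets administrators attach a row access policy to a table with DDL. The policy, table, group, and column names below are hypothetical, and this assumes the Iceberg table is already registered in BigQuery via BigLake.

```sql
-- Hypothetical names throughout: only members of the granted group
-- see rows matching the filter; other rows are invisible to them.
CREATE ROW ACCESS POLICY us_orders_only
ON mydataset.iceberg_orders
GRANT TO ('group:us-analysts@example.com')
FILTER USING (region = 'US');
```

Because end-user access is delegated through BigLake, the same policy applies regardless of which supported engine wrote the underlying Iceberg data.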
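The "built-in support for Apache Spark in BigQuery (stored procedures)" path can be sketched as follows. This is a hedged sketch, not the article's own code: the procedure, dataset, connection, and table names are hypothetical, and it assumes a Spark-enabled BigQuery connection has already been created.

```sql
-- Define a Spark stored procedure in BigQuery (hypothetical names).
CREATE OR REPLACE PROCEDURE mydataset.process_events()
WITH CONNECTION `us.my-spark-connection`
OPTIONS (engine = 'SPARK')
LANGUAGE PYTHON AS R"""
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iceberg-demo").getOrCreate()
# Read and aggregate an Iceberg table (hypothetical name).
df = spark.read.format("iceberg").load("demo.db.events")
df.groupBy("source").count().show()
""";

-- Invoke it like any other BigQuery routine:
CALL mydataset.process_events();
```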