Using Apache Iceberg for Developing Modern Data Tables

October 3, 2023

Apache Iceberg, a data management powerhouse, empowers businesses with its remarkable flexibility, seamless tool compatibility, and endorsement by the major tech players in the industry.

The cloud has made it possible to collect vast quantities of data and store it at a reasonable cost. It has also driven the emergence of data lakes, data mesh and other modern data architectures. However, handling large amounts of data on generic cloud storage brings its own challenges and limitations.

For instance, query engines often cannot interact smoothly with typical blob storage systems. Applying a table format to the data addresses this issue.

Factor	Apache Iceberg	Apache Hudi	Delta Lake
ACID transaction	Present	Present	Present
Partition evolution	Present	Absent	Absent
Schema evolution	Present	Partial	Limited
Time-travel	Present	Present	Present
Project governance	Apache project with a diverse PMC (top-level project)	Apache project with a diverse PMC (top-level project)	Linux Foundation project with an all-Databricks TSC
Tool read compatibility	Apache Hive, Dremio Sonar, Apache Flink, Apache Spark, Presto, Trino, Athena, Snowflake, Databricks Spark, Apache Impala, Apache Drill	Apache Hive, Dremio Sonar, Apache Flink, Apache Spark, Presto, Trino, Athena, Databricks Spark, Apache Impala, Redshift, BigQuery	Apache Hive, Dremio Sonar, Apache Flink, Databricks SQL Analytics, Presto, Trino, Athena, Snowflake, Redshift, Apache Beam
Tool write compatibility	Apache Hive, Dremio Sonar, Apache Flink, Apache Spark, Presto, Trino, Athena, Snowflake, Databricks Spark, Debezium	Apache Flink, Apache Spark, Databricks Spark, Debezium, Kafka Connect	OSS Delta Lake, Apache Spark, Trino, Apache Flink, Databricks Spark, Kafka Connect, Debezium, Distributed Delta Lake

Table 1: A comparison of Apache Iceberg, Apache Hudi and Delta Lake (Table credit: dremio.com)

The Iceberg table format was originally developed at Netflix and was open sourced as an Apache Incubator project in 2018. It graduated from the incubator in 2020 and became the top-level project Apache Iceberg. It was designed to overcome several of the challenges faced by Apache Hive tables.

The processing engines used by a company may change from time to time. For example, many businesses have moved from Hadoop to Spark or Trino. Because Iceberg is not tied to a single engine, it gives them greater flexibility and choice in processing engines.

Additionally, many data lakes inherit undesirable qualities from table formats built on older technologies. Apache Iceberg is built to address such shortcomings in Apache Hive. Iceberg also supports multiple file formats, including Apache Parquet, Apache Avro and Apache ORC. Hence it provides more flexibility and long-term adaptability, including for file formats that may emerge in the future.

Reasons to choose Apache Iceberg

Apache Iceberg is an open table format developed by the open source community. Modern data platforms rely on many metadata tables, such as the data files, all manifests, all entries, history, partitions and snapshots tables. Iceberg provides an effective way to create and query such tables.
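
As an illustration, engines such as Spark expose these metadata tables as a suffix on the table name, so they can be read with ordinary SQL. The following is a minimal Python sketch that only builds the query strings. The table name `demo.db.sales` and the helper `metadata_query` are hypothetical, and running the queries assumes a Spark session that has already been configured with an Iceberg catalog.

```python
# Metadata tables mentioned above, using the names Iceberg exposes in Spark SQL.
METADATA_TABLES = (
    "files",
    "all_manifests",
    "all_entries",
    "history",
    "partitions",
    "snapshots",
)


def metadata_query(table: str, metadata_table: str, limit: int = 10) -> str:
    """Build a SELECT statement against one Iceberg metadata table.

    `table` is a fully qualified name such as 'demo.db.sales' (hypothetical).
    """
    if metadata_table not in METADATA_TABLES:
        raise ValueError(f"unknown metadata table: {metadata_table}")
    return f"SELECT * FROM {table}.{metadata_table} LIMIT {limit}"


# Print one query per metadata table for the hypothetical sales table.
for name in METADATA_TABLES:
    print(metadata_query("demo.db.sales", name))

# With a configured Spark session (an assumption, not shown here), a query
# could be run as: spark.sql(metadata_query("demo.db.sales", "snapshots")).show()
```

The helper rejects unknown names rather than passing them through, which keeps a typo from turning into a confusing engine-side error.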

Table 1 compares several features of three major data lake table formats: Apache Iceberg, Apache Hudi and Delta Lake.

One of the important features of Iceberg is partitioning, a technique used for optimising data lakes. For example, for sales data where queries often rely on when the sales occurred, you may partition data by the day, month, or year of the sale.
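
To make the sales example concrete, here is a minimal pure-Python sketch of how day, month and year partition values can be derived from a sale date. It follows the convention in the Iceberg specification of counting whole years, months or days from 1970-01-01. The sample rows, the `sold_on` field and the `partition_sales` helper are assumptions made for illustration, and this is not Iceberg's own implementation.

```python
from datetime import date

EPOCH = date(1970, 1, 1)


def year_transform(d: date) -> int:
    """Whole years between 1970 and the given date."""
    return d.year - EPOCH.year


def month_transform(d: date) -> int:
    """Whole months between 1970-01 and the given date."""
    return (d.year - EPOCH.year) * 12 + (d.month - 1)


def day_transform(d: date) -> int:
    """Whole days between 1970-01-01 and the given date."""
    return (d - EPOCH).days


def partition_sales(rows, transform):
    """Group rows by the partition value derived from their sale date."""
    groups = {}
    for row in rows:
        groups.setdefault(transform(row["sold_on"]), []).append(row)
    return groups


# Hypothetical sales records.
sales = [
    {"order_id": 1, "sold_on": date(2023, 10, 3), "amount": 120.0},
    {"order_id": 2, "sold_on": date(2023, 10, 17), "amount": 75.5},
    {"order_id": 3, "sold_on": date(2023, 11, 2), "amount": 42.0},
]

by_month = partition_sales(sales, month_transform)
for month_value, rows in sorted(by_month.items()):
    print(month_value, [r["order_id"] for r in rows])
```

A query that filters on a single month then only needs to read the rows in the matching group, which is the benefit partitioning is meant to deliver.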

Data lakes aim to democratise data. To do so, however, they must reduce the complexity of data structures and physical data storage. The Hive table format, released by Facebook in 2009, did not solve this problem. Iceberg, in contrast, provides a table format specification together with a set of APIs and libraries for interacting with tables.

Unlike in Apache Hive, removing and updating data is fast and easy in Iceberg. It supports SQL commands, so it can be used with engines such as Hive, Spark and Impala. Such table formats are becoming the format of choice for large data sets (of sizes ranging from hundreds of terabytes to hundreds of petabytes) across the data industry.

Since Netflix donated Iceberg to the Apache Software Foundation, it has been adopted by several major tech companies, including Adobe, Airbnb, Apple, Google, LinkedIn, Snapshot and Snowflake.

Apache Iceberg and AI

As more organisations navigate the uncharted sea called ‘Enterprise AI’, data management company Cloudera is adopting Apache Iceberg.

With the open data lakes created by Apache Iceberg, customers will get strong analytics and AI capabilities for their enterprise data, irrespective of whether the data is cloud based or on-premise. It is becoming a key technology for multi-vendor data ecosystems to get the most from AI. Cloudera recently claimed that its Iceberg-based solutions are managing 25 million terabytes of data.

With growing support for Apache Iceberg alongside generative AI and large language models (LLMs), organisations can use their data more effectively. This enables users to tap into more of their data and leverage it in different ways.