Apache Iceberg is a cornerstone of any open data lakehouse, providing the transactional foundation upon which highly scalable and flexible analytics can flourish. Along with Trino, it can be used to build a robust, scalable, and high-performance data lakehouse.
Over the past ten years, the rise of Big Data has transformed how organisations store and process their data. Traditional data warehouses deliver performance and reliability but lack flexibility and cost-effectiveness. Data lakes, by contrast, offer scale and affordability but struggle with governance, schema enforcement, and query performance.
Enter the data lakehouse: a new data architecture that combines the scale-out storage of data lakes with the transactional and governance features of data warehouses. By running SQL workloads natively on object storage, with capabilities such as ACID transactions, schema evolution, and time-travel queries, the lakehouse offers a single platform for BI, data science, and real-time analytics.
What is a data lakehouse?
A data lakehouse combines the best of data lakes and data warehouses, bridging the gap between scalable, flexible storage and transactional, dependable analytics. It provides an integrated platform where organisations can manage the entire data lifecycle, from ingestion and processing to analytics and machine learning.
Traditional data lakes are designed for raw, bulk storage but lack the capabilities required for enterprise-grade analytics, such as schema enforcement, data versioning, and ACID guarantees. Data warehouses provide these features but are expensive, inflexible, and often tied to proprietary vendors.
The lakehouse resolves this trade-off by bringing warehouse-like capabilities to open file formats (such as Parquet and ORC), while keeping the scalability and economics of storage systems such as S3, HDFS, or GCS.
Data lake vs data warehouse vs data lakehouse
| Feature | Data warehouse | Data lake | Data lakehouse |
|---|---|---|---|
| Storage format | Proprietary | Open (Parquet, ORC) | Open (Parquet, ORC) |
| Cost | High | Low | Moderate |
| Data types supported | Structured | Semi-/unstructured | All |
| Performance | High | Low | High |
| Governance | Strong | Weak | Strong |
| Schema enforcement | Rigid | Loose | Flexible |
| ACID compliance | Yes | No | Yes |
Apache Iceberg: Modern table format for the lakehouse
With the changing paradigm of data platforms, storing data files in a data lake is not enough anymore. Organisations need transactional consistency, schema evolution, time travel, and performance at scale with open file formats and cloud-native principles intact. This is where Apache Iceberg comes into play.
Originally developed by Netflix and now an Apache top-level project, Iceberg rewrites the rules for table management in big data lakes. Unlike Hive-style tables, which are based on directory hierarchies and static metadata, Iceberg provides a highly optimised metadata layer that supports correct operation on huge datasets.
Iceberg is built to address problems long faced by data engineers and analysts: unsafe schema evolution, inconsistent query results across engines, and performance bottlenecks caused by poor partitioning. It solves these issues with a modular design that separates storage from metadata, enabling true multi-engine interoperability across Spark, Trino, Flink, and others.
Versioned snapshots in the table format enable time travel, schema evolution, and even partition evolution, so tables can change over time without rewriting the entire dataset. These features turn your data lake into a trusted, enterprise-grade analytical system without vendor lock-in.
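To make these capabilities concrete, here is a minimal PySpark sketch of schema evolution, partition evolution, and time travel. It assumes a Spark session configured with the Iceberg runtime and SQL extensions, with an Iceberg catalog registered under the hypothetical name demo; the table name and snapshot ID are placeholders.

```python
from pyspark.sql import SparkSession

# Sketch only: assumes the Iceberg runtime jar and SQL extensions are on the
# session, and an Iceberg catalog is registered as "demo"
# (e.g. spark.sql.catalog.demo=org.apache.iceberg.spark.SparkCatalog).
spark = SparkSession.builder.appName("iceberg-evolution-demo").getOrCreate()

# Schema evolution: add a column; existing data files are not rewritten.
spark.sql("ALTER TABLE demo.sales.orders ADD COLUMN discount DOUBLE")

# Partition evolution: new writes are partitioned by day(order_ts), while
# files written under the old spec keep their layout and remain queryable.
spark.sql("ALTER TABLE demo.sales.orders ADD PARTITION FIELD days(order_ts)")

# Time travel: read the table as of an earlier snapshot
# (placeholder snapshot id, taken from the table's snapshot history).
previous = (
    spark.read
    .option("snapshot-id", 1234567890123456789)
    .table("demo.sales.orders")
)
previous.show()
```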
Trino: Federated SQL engine for the modern lakehouse
In the era of distributed, multi-source data, organisations need a fast, scalable SQL engine that can unify access without centralising storage. Trino (formerly Presto SQL) is designed to meet this need, and it plays a central role in modern lakehouse architectures.
Here’s what makes Trino exceptional.
- Federated query engine
  - Trino supports federated queries across diverse sources: object stores, warehouses, RDBMS, NoSQL systems, and streams.
  - It eliminates the need to move data; instead, it pushes queries down to wherever the data lives.
  - Ideal for data mesh environments where different teams own different data domains.
- Massively parallel and cloud native
  - Trino is a distributed MPP (massively parallel processing) engine.
  - It separates compute from storage, enabling elastic scalability.
  - It executes queries using ANSI SQL, making it familiar to data analysts and engineers.
- Broad connector ecosystem
  - It supports a wide range of backends like Apache Iceberg, Hive, Delta Lake, Kafka, Elasticsearch, MongoDB, PostgreSQL, MySQL, and many others.
  - Perfect for hybrid and multi-cloud architectures.
- Optimised for Iceberg
  - Offers native support for Apache Iceberg with full access to:
    - Time travel and snapshots
    - Hidden partitioning and schema evolution
    - Metadata-based planning for fast query execution
  - Trino + Iceberg equals a truly open lakehouse stack (see the query sketch after this list).
- Enterprise-ready and extensible
  - Integrates easily with BI tools (e.g., Tableau, Superset, Power BI).
  - Plugs into data catalogues (e.g., Glue, Hive, Nessie) for metadata management.
  - Supports fine-grained access control and query observability.
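The sketch below illustrates two of these points with the trino Python client: a time-travel query against an Iceberg table and a federated join with a PostgreSQL catalog. Host, user, catalog, schema, and table names are hypothetical, and both catalogs are assumed to be configured on the Trino cluster.

```python
import trino

# Sketch only: all connection details and table names are placeholders.
conn = trino.dbapi.connect(
    host="trino.example.com",
    port=8080,
    user="analyst",
    catalog="iceberg",
    schema="analytics",
)
cur = conn.cursor()

# Time travel: query the Iceberg table as it looked at an earlier point in time.
cur.execute("""
    SELECT count(*)
    FROM orders
    FOR TIMESTAMP AS OF TIMESTAMP '2024-01-01 00:00:00 UTC'
""")
print(cur.fetchall())

# Federated join: combine the Iceberg table with a PostgreSQL table
# (assumes a catalog named "postgresql"), without copying data anywhere.
cur.execute("""
    SELECT c.region, sum(o.amount) AS revenue
    FROM iceberg.analytics.orders AS o
    JOIN postgresql.public.customers AS c ON o.customer_id = c.customer_id
    GROUP BY c.region
""")
print(cur.fetchall())
```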
Building a lakehouse with Apache Iceberg and Trino
Designing a modern data platform no longer means choosing between scale and structure. The lakehouse architecture, powered by Apache Iceberg and Trino, allows organisations to build an open, high-performance data system that supports everything from BI to AI without vendor lock-in.
At the core of this architecture is a simple but powerful idea: decouple storage, metadata, and compute. Each layer is built using open technologies that work together seamlessly.
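As an illustrative sketch of how those layers fit together, the snippet below creates an Iceberg schema and table through Trino: data files land in object storage, table metadata is tracked by the Iceberg catalog behind a Trino catalog named iceberg, and Trino itself stays stateless compute. The catalog, bucket, and table names are assumptions, and the catalog is presumed to be already configured against a metastore (Hive, Glue, Nessie, or REST).

```python
import trino

# Sketch only: assumes a Trino catalog "iceberg" backed by an Iceberg catalog
# and an S3 warehouse bucket; all names below are placeholders.
conn = trino.dbapi.connect(host="trino.example.com", port=8080, user="admin")
cur = conn.cursor()

# Storage layer: the schema location points at object storage.
cur.execute("""
    CREATE SCHEMA IF NOT EXISTS iceberg.analytics
    WITH (location = 's3://example-lakehouse/analytics/')
""")
cur.fetchall()  # drive the statement to completion

# Metadata layer: the table spec (format, hidden partitioning) is recorded in
# the Iceberg catalog; compute stays in Trino and scales independently.
cur.execute("""
    CREATE TABLE IF NOT EXISTS iceberg.analytics.orders (
        order_id    BIGINT,
        customer_id BIGINT,
        order_ts    TIMESTAMP(6) WITH TIME ZONE,
        amount      DOUBLE
    )
    WITH (format = 'PARQUET', partitioning = ARRAY['day(order_ts)'])
""")
cur.fetchall()
```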
Best practices and recommendations
To ensure your data lakehouse remains performant, scalable, and manageable, it’s important to follow architectural and operational best practices. Start by optimising data layout: store files in open formats like Parquet, targeting sizes between 100 and 500 MB to balance scan performance against metadata overhead. Use Apache Iceberg’s hidden partitioning, and schedule regular compaction jobs with Spark or Flink to mitigate the small-file problem. Maintaining clean and consistent metadata is equally vital: enforce naming conventions, periodically prune unused snapshots, and consider versioned catalogues like Project Nessie for better control over schema evolution and rollback.
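As a sketch of those maintenance jobs, the PySpark snippet below compacts small files and expires old snapshots using Iceberg’s built-in Spark procedures; the catalog name demo, the table name, and the thresholds are placeholder choices.

```python
from pyspark.sql import SparkSession

# Sketch only: assumes the Iceberg runtime and SQL extensions on the Spark
# session, with an Iceberg catalog registered as "demo".
spark = SparkSession.builder.appName("iceberg-maintenance").getOrCreate()

# Compact small files towards ~128 MB targets so planning and metadata stay fast.
spark.sql("""
    CALL demo.system.rewrite_data_files(
        table => 'sales.orders',
        options => map('target-file-size-bytes', '134217728')
    )
""")

# Expire old snapshots to keep metadata lean, while retaining enough history
# for time travel and rollback.
spark.sql("""
    CALL demo.system.expire_snapshots(
        table => 'sales.orders',
        older_than => TIMESTAMP '2024-01-01 00:00:00',
        retain_last => 20
    )
""")
```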
On the performance and governance front, enable predicate pushdown and metadata pruning in Trino to accelerate SQL queries across Iceberg tables. For security, integrate access control solutions such as Apache Ranger or AWS Lake Formation to implement table- and column-level permissions. Data lineage and auditability can be enhanced using tools like Marquez or OpenLineage. Finally, monitor system health through Trino’s query metrics and Iceberg’s snapshot lifecycle, setting alerts for ingestion failures or metadata bloat. These practices ensure your lakehouse remains production-ready, cost-efficient, and aligned with enterprise-grade analytics and compliance needs.
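One lightweight way to watch snapshot growth and file sizes is to query Iceberg’s metadata tables through Trino, as in the hedged sketch below; connection details and the table name are hypothetical.

```python
import trino

# Sketch only: assumes a Trino catalog "iceberg" and schema "analytics".
conn = trino.dbapi.connect(
    host="trino.example.com", port=8080, user="ops",
    catalog="iceberg", schema="analytics",
)
cur = conn.cursor()

# Snapshot history: a fast-growing list can signal metadata bloat and the need
# to expire old snapshots.
cur.execute('SELECT committed_at, snapshot_id, operation FROM "orders$snapshots"')
for committed_at, snapshot_id, operation in cur.fetchall():
    print(committed_at, snapshot_id, operation)

# File-level stats: many tiny files suggest a compaction job is overdue.
cur.execute("""
    SELECT count(*) AS file_count, avg(file_size_in_bytes) AS avg_file_bytes
    FROM "orders$files"
""")
print(cur.fetchall())
```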
The integration of Apache Iceberg and Trino offers a robust and adaptable way to build a modern data lakehouse architecture. It combines the scalability and cost-effectiveness of data lakes with the transactional integrity and query performance of data warehouses, while embracing open standards. By decoupling storage, metadata, and compute, and federating SQL access across heterogeneous sources, organisations can serve real-time analytics, BI, and machine learning from a single platform. As data continues to grow in volume and complexity, the Iceberg-Trino stack provides a future-proof foundation for delivering governed, high-performance insights at scale.