Data quality native support for Iceberg format (preview)

Microsoft Purview supports data quality assessment on Apache Iceberg data assets. You can curate, govern, and scan Iceberg data on these storage sources:

Azure Data Lake Storage Gen2
Microsoft Fabric Lakehouse
Amazon Web Services (AWS) S3
Google Cloud Platform (GCP) Cloud Storage (GCS)

This article explains the Iceberg file structure, shows how to set up data quality scans, and lists tips for each catalog and storage type.

Iceberg file structure

An Iceberg table is more than a collection of data files. It also includes metadata files that track the state of the table and support operations like reads, writes, and schema evolution. An Iceberg table has three layers: a catalog, a metadata layer, and a data layer. The data files are typically stored in Apache Parquet or Apache Optimized Row Columnar (ORC), which are columnar formats, or in Apache Avro, which is row-based. These files contain the actual data that users interact with during queries.

Catalog layer overview

The Iceberg catalog sits at the top of the hierarchy. It stores the current metadata pointer for each table. The catalog enables tracking the most recent state of a table by referencing the current metadata file.

Metadata layer overview

The metadata layer is central to Iceberg's functionality. It includes these key elements:

Metadata file: Contains information about the table's schema, partitioning, and snapshots. A snapshot (for example, s0) is a record of the table's state at a given point in time. If multiple snapshots exist (such as s0 and s1), the metadata file tracks all of them.
Manifest list: Points to one or more manifest files. A manifest list acts as a container of references to these manifests. It helps Iceberg manage which data files should be read or written during different operations. Each snapshot might have its own manifest list.

Data layer overview

In the data layer, the manifest files act as an intermediary between the metadata and the actual data files. Each manifest file points to a collection of data files, providing a map of the physical files stored in the data lake.

Manifest files: Store the metadata for a group of data files, including row counts, partition information, and file paths. They allow Iceberg to prune and access specific files quickly.
Data files: Contain the actual data, stored in formats like Parquet, ORC, or Avro. Iceberg organizes data files based on partitions, which minimizes unnecessary data scans during query execution.
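
The pruning role of manifest files can be illustrated with a minimal Python sketch. The `DataFile` and `Manifest` classes and the `prune` function are hypothetical simplifications made up for this example. They aren't Iceberg's actual classes or on-disk format, and the file paths are placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class DataFile:
    path: str
    partition: dict  # partition column -> value, for example {"region": "EU"}
    row_count: int


@dataclass
class Manifest:
    files: list = field(default_factory=list)


def prune(manifest, **wanted):
    """Return only the data files whose partition values match the filter."""
    return [
        f for f in manifest.files
        if all(f.partition.get(k) == v for k, v in wanted.items())
    ]


# Placeholder entries standing in for what a manifest file records.
manifest = Manifest(files=[
    DataFile("data/region=EU/f1.parquet", {"region": "EU"}, 100),
    DataFile("data/region=US/f2.parquet", {"region": "US"}, 250),
    DataFile("data/region=EU/f3.parquet", {"region": "EU"}, 50),
])

# Only the EU files are selected; the US file is never opened.
eu_files = prune(manifest, region="EU")
eu_rows = sum(f.row_count for f in eu_files)
```

Because the manifest already carries partition values and row counts, the filter and the row total are computed from metadata alone, without reading any data file.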

How Iceberg components work together

When you query or update data, Iceberg looks up the table's metadata file via the catalog. The metadata file references the current snapshot (or multiple snapshots). Each snapshot points to a manifest list, which references manifest files. The manifest files list the data files. This layered lookup lets Iceberg manage large datasets while keeping reads and writes consistent. Readers and writers always see a coherent table state. The snapshot-based design also enables time travel (querying earlier table states) and schema evolution.

The same layered approach boosts performance for batch and streaming operations. Only the needed data files are read, and updates use snapshots without changing the full dataset.
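
The lookup chain described in this section can be sketched as a small in-memory model. This is an illustration only: the dictionaries, file names, and the `resolve_data_files` function are hypothetical, and real Iceberg tables store these layers as files rather than Python objects.

```python
# Hypothetical in-memory stand-ins for each layer of an Iceberg table.
catalog = {"sales": "metadata/v2.metadata.json"}  # table -> current metadata file

metadata_files = {
    "metadata/v2.metadata.json": {
        "current-snapshot-id": 1,
        # snapshot ID -> manifest list for that snapshot
        "snapshots": {
            0: "metadata/snap-0.avro",
            1: "metadata/snap-1.avro",
        },
    },
}

manifest_lists = {
    "metadata/snap-0.avro": ["metadata/m0.avro"],
    "metadata/snap-1.avro": ["metadata/m0.avro", "metadata/m1.avro"],
}

manifests = {
    "metadata/m0.avro": ["data/f1.parquet"],
    "metadata/m1.avro": ["data/f2.parquet"],
}


def resolve_data_files(table, snapshot_id=None):
    """Walk catalog -> metadata file -> snapshot -> manifest list -> manifests."""
    metadata = metadata_files[catalog[table]]
    if snapshot_id is None:
        snapshot_id = metadata["current-snapshot-id"]
    manifest_list = manifest_lists[metadata["snapshots"][snapshot_id]]
    files = []
    for manifest_path in manifest_list:
        files.extend(manifests[manifest_path])
    return files


current_files = resolve_data_files("sales")                  # current table state
earlier_files = resolve_data_files("sales", snapshot_id=0)   # time travel to s0
```

In this model, snapshot 1 reuses the manifest from snapshot 0 and adds one more, which mirrors how an update creates a new snapshot without rewriting the full dataset.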

Iceberg data in OneLake

Important

Iceberg data in AWS S3 and GCS must also be autosynced as Delta before you can curate it, govern it, or measure and monitor its data quality.

You can seamlessly consume Iceberg-formatted data across Microsoft Fabric with no data movement or duplication. Use OneLake shortcuts to directly point to a data layer.

Iceberg data is stored in OneLake and written using Snowflake or another Iceberg writer. OneLake virtualizes the table as a Delta Lake table, which ensures broad compatibility across Fabric engines. For example, you can create a volume in Snowflake and point it directly to the Fabric Lakehouse. Once the Iceberg table is created in OneLake, autosync reflects data updates in real time. Get further details in Configure an Iceberg external volume for Azure in the Snowflake documentation.

Data quality for Iceberg data

If you write Iceberg data natively as Parquet, ORC, or Avro on Data Lake Storage Gen2 or Fabric Lakehouse, configure a scan that points to the location of the directory that hosts the Iceberg data and metadata directories. To configure Iceberg data quality support, complete these steps:

1. Configure and run a scan in Microsoft Purview Data Map.
2. Configure the directory hosting data and metadata as a data asset and associate it to a data product. Associating the directory as a data asset and linking it to a data product forms the Iceberg dataset. Learn to associate data assets to a data product.
3. In Unified Catalog, under Health management, select Data quality view to find your Iceberg files (data asset) and to set up the data source connection.
   1. To set up a Data Lake Storage Gen2 connection, follow the steps in Set up data source connection for data quality.
   2. To set up a Fabric OneLake connection, follow the steps in Set up data quality for Fabric Lakehouse data.
4. On the Schema page of the selected Iceberg file (data asset), select Import schema to import the schema from the data source.
5. On the Iceberg file's Overview page, in the Data asset dropdown menu, select Iceberg.
6. Apply data quality rules, and run data quality scans for column- and table-level data quality scoring.

Profiling and data quality scanning

Important

Before you run data profiling or data quality scans, you need to retrieve and set the schema from the Data Quality Schema page. Consumers don't see the schema in the data asset view because Data Map doesn't yet support the Iceberg open table format. Data Quality stewards can import the schema from the Data Quality schema page.

After you finish the connection setup and data asset file format selection, you can profile your data, create and apply rules, and run a data quality scan of your data in Iceberg open format files. For step-by-step guidance, see the following articles:

Important

Support for the Iceberg open format in catalog discovery, curation, data profiling, and data quality scanning features is in preview.

Current limitations for Iceberg data quality

The current preview release of Microsoft Purview supports data created in Iceberg format with Apache Hadoop catalog (a file-based catalog implementation) only. Snowflake catalog scenarios are supported only through Delta virtualization using OneLake shortcuts.

Lakehouse Path and Data Lake Storage Gen2 Path

Iceberg metadata stores the complete path for the data and metadata files. Use the complete path for both Data Lake Storage Gen2 and Fabric Lakehouse. For Fabric Lakehouse, also make sure that write operations (writes and upserts) use ID-based paths. The following example shows the required ABFSS path format. Replace <filesystem-ID> and <lakehouse-ID> with your actual GUIDs:
```
abfss://<filesystem-ID>@onelake.dfs.fabric.microsoft.com/<lakehouse-ID>/Files/CustomerData
```
In this format, both the filesystem and the lakehouse are identified by ID. Microsoft Purview needs absolute paths, not relative paths, to run data quality on Iceberg. To validate, check that the snapshot paths in the metadata are complete fully qualified name (FQN) paths.
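
The path checks described above can be sketched in Python. This is a minimal, unofficial sketch: the function names and sample GUIDs are hypothetical, and it assumes the table metadata has already been parsed from JSON and exposes the `location`, `snapshots`, and `manifest-list` fields.

```python
import re

# Pattern for a GUID such as 11111111-2222-3333-4444-555555555555.
GUID = r"[0-9a-fA-F]{8}-(?:[0-9a-fA-F]{4}-){3}[0-9a-fA-F]{12}"
ONELAKE_ID_PATH = re.compile(
    rf"^abfss://{GUID}@onelake\.dfs\.fabric\.microsoft\.com/{GUID}/.+"
)


def is_id_based_onelake_path(path):
    """True if the path is absolute and uses GUIDs for filesystem and lakehouse."""
    return ONELAKE_ID_PATH.match(path) is not None


def relative_paths_in_metadata(metadata):
    """Return the location and manifest-list paths that aren't absolute ABFSS URIs."""
    paths = [metadata.get("location", "")]
    paths += [s.get("manifest-list", "") for s in metadata.get("snapshots", [])]
    return [p for p in paths if not p.startswith("abfss://")]


# Placeholder GUIDs for illustration only.
id_path = (
    "abfss://11111111-2222-3333-4444-555555555555@onelake.dfs.fabric.microsoft.com/"
    "aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee/Files/CustomerData"
)
name_path = (
    "abfss://MyWorkspace@onelake.dfs.fabric.microsoft.com/"
    "MyLakehouse.Lakehouse/Files/CustomerData"
)

# Assumed shape of parsed table metadata; the second snapshot has a relative path.
sample_metadata = {
    "location": id_path,
    "snapshots": [
        {"snapshot-id": 1, "manifest-list": id_path + "/metadata/snap-1.avro"},
        {"snapshot-id": 2, "manifest-list": "metadata/snap-2.avro"},
    ],
}

id_ok = is_id_based_onelake_path(id_path)
name_ok = is_id_based_onelake_path(name_path)
problems = relative_paths_in_metadata(sample_metadata)
```

Here the name-based path fails the ID check, and `problems` lists the one relative snapshot path that would need to be rewritten as a complete path.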

Schema detection

Data Map can't detect the Iceberg schema, so you can't review the schema when you curate Iceberg directories on Fabric Lakehouse or Data Lake Storage Gen2. However, the Import schema option on the Data Quality Schema page can retrieve the schema for the curated asset.

Recommendations for Iceberg data quality configuration

Choose the approach that matches your catalog and storage:

Source/Catalog	Storage	Approach	Supported formats
Snowflake Catalog	Data Lake Storage Gen2, AWS S3, or GCP GCS (VOLUME)	Use a Fabric OneLake Table shortcut. Run data quality as a Delta table.	Parquet only
Hadoop catalog	Data Lake Storage Gen2	Scan the directory directly. Use the Iceberg engine for data quality.	Parquet, ORC, Avro
Snowflake	Fabric Lakehouse (VOLUME pointed to Lakehouse)	Use OneLake Table to create a Delta-compatible version.	Parquet only
Hadoop catalog	Fabric Lakehouse	Scan the Lakehouse directory directly. Use the Iceberg engine for data quality.	Parquet, ORC, Avro
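
The table above can also be expressed as a simple lookup, which can be handy when validating a configuration in a script. This is a sketch only: the key labels (such as `adls_gen2`) and the `recommend` function are hypothetical names chosen for this example, not identifiers used by Microsoft Purview.

```python
# Approaches transcribed from the recommendations table.
DELTA_VIA_SHORTCUT = {
    "approach": "Use a Fabric OneLake Table shortcut. Run data quality as a Delta table.",
    "engine": "delta",
    "formats": ("Parquet",),
}
DELTA_VIA_ONELAKE_TABLE = {
    "approach": "Use OneLake Table to create a Delta-compatible version.",
    "engine": "delta",
    "formats": ("Parquet",),
}
ICEBERG_DIRECT = {
    "approach": "Scan the directory directly. Use the Iceberg engine for data quality.",
    "engine": "iceberg",
    "formats": ("Parquet", "ORC", "Avro"),
}

# (catalog, storage) -> recommendation; the key labels are hypothetical.
RECOMMENDATIONS = {
    ("snowflake", "adls_gen2"): DELTA_VIA_SHORTCUT,
    ("snowflake", "aws_s3"): DELTA_VIA_SHORTCUT,
    ("snowflake", "gcs"): DELTA_VIA_SHORTCUT,
    ("snowflake", "fabric_lakehouse"): DELTA_VIA_ONELAKE_TABLE,
    ("hadoop", "adls_gen2"): ICEBERG_DIRECT,
    ("hadoop", "fabric_lakehouse"): ICEBERG_DIRECT,
}


def recommend(catalog, storage):
    """Look up the documented approach for a catalog and storage combination."""
    key = (catalog.lower(), storage.lower())
    if key not in RECOMMENDATIONS:
        raise ValueError(f"No documented approach for {catalog!r} on {storage!r}")
    return RECOMMENDATIONS[key]


hadoop_adls = recommend("Hadoop", "adls_gen2")
snowflake_s3 = recommend("Snowflake", "aws_s3")
```

Combinations that the table doesn't list raise a `ValueError` instead of returning a guess.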

These articles cover data quality setup and scanning:

Last updated on 2026-06-12
