[Iceberg] Iceberg Guide Book Summary | CHAPTER 5. Iceberg Catalogs

noohhee 2025. 3. 9. 21:10


CHAPTER 5 Iceberg Catalogs

 

Requirements of an Iceberg Catalog

Iceberg provides a catalog interface that requires the implementation of a set of functions, primarily ones to list existing tables, create tables, drop tables, check whether a table exists, and rename tables.
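
For orientation, these catalog interface functions are what an engine invokes when you run the corresponding SQL. A minimal sketch in Spark SQL, assuming a configured catalog named my_catalog1 with placeholder identifiers db1/table1 (both illustrative, not from the book):

-- list existing tables in a namespace
SHOW TABLES IN my_catalog1.db1;

-- create a table (registers it in the catalog)
CREATE TABLE my_catalog1.db1.table1 (id BIGINT, data STRING) USING iceberg;

-- rename a table
ALTER TABLE my_catalog1.db1.table1 RENAME TO my_catalog1.db1.table1_renamed;

-- drop a table from the catalog
DROP TABLE my_catalog1.db1.table1_renamed;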

 

Commonly used catalog implementations include the Hive Metastore, AWS Glue, and a filesystem catalog (the Hadoop catalog).

 

With a filesystem as the catalog, there is a file called version-hint.text in the table's metadata folder that contains the version number of the current metadata file.

With the Hive Metastore as the catalog, the table entry in the Hive Metastore has a table property called location that stores the location of the current metadata file.

 

The primary requirement for an Iceberg catalog to be used in production is that it must support atomic operations for updating the current metadata pointer.

 

 

Catalog Comparison

The Hadoop Catalog

The Hadoop catalog works directly against a filesystem or object store, such as the Hadoop Distributed File System (HDFS), Amazon Simple Storage Service (Amazon S3), Azure Data Lake Storage (ADLS), or Google Cloud Storage (GCS).

 

Generally, however, engines will write a file called version-hint.text in the table's metadata folder containing a single number that the engine/tool then uses to retrieve the indicated metadata file by name—for example, v{n}.metadata.json.
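
To make this concrete, here is roughly what the metadata folder of a Hadoop catalog table could look like (the warehouse path and version numbers are purely illustrative):

$ ls /path/to/warehouse/db1/table1/metadata/
v1.metadata.json
v2.metadata.json
v3.metadata.json
version-hint.text
...

$ cat /path/to/warehouse/db1/table1/metadata/version-hint.text
3

An engine reading this table resolves the hint 3 to v3.metadata.json and starts from there.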

 

Pros and cons of the Hadoop catalog

 

The main pro of the Hadoop catalog is that it doesn’t require any external systems to run.

All it requires is a filesystem.

However, there is a big downside to the Hadoop catalog: it is not recommended for production usage. 

There are a few reasons for this.

 

One reason is that it requires the filesystem to provide a file/object rename operation that is atomic to prevent data loss when concurrent writes occur.

 

For example, ADLS and HDFS provide an atomic rename operation so that you won’t have data loss if concurrent writes are made to the same table, but S3 does not. 

For S3, you can leverage a DynamoDB table to achieve the atomicity needed to prevent data loss during concurrent writes.
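
As a rough sketch, this is done by pointing the catalog at Iceberg's DynamoDB-backed lock manager with extra --conf entries on a spark-sql invocation such as the one shown in the next section; the catalog name and DynamoDB table name below are placeholders, so verify the exact property names against the Iceberg version you use:

--conf spark.sql.catalog.my_catalog1.lock-impl=org.apache.iceberg.aws.dynamodb.DynamoDbLockManager \
--conf spark.sql.catalog.my_catalog1.lock.table=my_iceberg_lock_table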

 

A second reason is that when a system is configured to use the Hadoop catalog, it can only use one warehouse directory, since it depends on the warehouse location for listing tables.

For example, you can only use a single bucket if you’re using cloud object storage such as S3 or ADLS.

 

A third reason is that when you’re doing things that require listing the namespaces (aka databases) and/or tables, you may hit performance issues, especially when you have a large number of namespaces and tables.

This is because the listing of namespaces and tables is performed by doing a list operation on the filesystem.

 

A final reason is that there is no way to drop a table from the catalog (i.e., remove the catalog's reference to it) while keeping the underlying data.

Dropping the table without removing the data allows you to undo the operation if needed (similar to time travel).
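
For contrast, in catalogs that do support this, Spark SQL distinguishes between dropping only the catalog entry and purging the files (the table name is illustrative):

-- removes the table from the catalog but leaves its data and metadata files in storage
DROP TABLE my_catalog1.db1.table1;

-- removes the table from the catalog and deletes its files
DROP TABLE my_catalog1.db1.table1 PURGE;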

 

 

Configuring Spark to use the Hadoop catalog

spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:0.14.0 \
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
--conf spark.sql.catalog.my_catalog1=org.apache.iceberg.spark.SparkCatalog \
--conf spark.sql.catalog.my_catalog1.type=hadoop \
--conf spark.sql.catalog.my_catalog1.warehouse=<protocol>://<path>
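
Once the shell is up, the catalog is addressed by the name given in the configuration. A quick smoke test could look like the following (namespace and table names are made up):

CREATE NAMESPACE IF NOT EXISTS my_catalog1.db1;

CREATE TABLE my_catalog1.db1.table1 (id BIGINT, data STRING) USING iceberg;

INSERT INTO my_catalog1.db1.table1 VALUES (1, 'a');

SELECT * FROM my_catalog1.db1.table1;

After the INSERT, the table's metadata folder under the configured warehouse path will contain the version-hint.text file described earlier.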

 

 

The Hive Catalog

The Hive catalog is a popular implementation used for an Iceberg catalog.

It maps a table’s path to its current metadata file by using the location table property in the table’s entry in the Hive Metastore.

 

Pros and cons of the Hive catalog

The primary pro of the Hive catalog is that it is compatible with a wide variety of engines and tools.

 

There are two cons to the Hive catalog. 

The first is that it requires running an additional service yourself: the Hive Metastore used for the Hive catalog needs to be set up and managed by the user.

 

The second con is that it doesn’t provide support for multitable transactions.

Multitable transactions are a key capability in databases, providing support for consistency and atomicity for one or
more operations that involve more than one table.

Keep in mind, though, that if you need multitable transactions, the Hive catalog doesn’t support that.

 

Configuring Spark to use the Hive catalog

spark-sql --packages org.apache.iceberg:iceberg-spark-runtime-3.3_2.12:0.14.0 \
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
--conf spark.sql.catalog.my_catalog1=org.apache.iceberg.spark.SparkCatalog \
--conf spark.sql.catalog.my_catalog1.type=hive \
--conf spark.sql.catalog.my_catalog1.uri=thrift://<metastore-host>:<port>

 

 

The AWS Glue Catalog

It maps a table's path to its current metadata file using a table property called metadata_location in the table's entry in Glue.

The value for this property is the absolute path to the metadata file in the filesystem.
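
If you want to see this mapping yourself, the table entry can be inspected with the AWS CLI; a sketch, with placeholder database/table names (the JMESPath query assumes the property sits in the table's Parameters map):

aws glue get-table --database-name db1 --name table1 \
  --query 'Table.Parameters.metadata_location'

This prints the absolute path of the current metadata file, i.e., an s3:// path ending in .metadata.json.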

 

Pros and cons of the AWS Glue catalog

One big pro is that AWS Glue is a managed service, so it reduces operational overhead compared to managing your own metastore, as you would in Hive.

Another advantage is that because it’s a native AWS service, it has tight integration with other AWS services.

 

Like the Hive catalog, it does not support multitable transactions.

 

The AWS Glue catalog is a good choice if you’re heavily invested in AWS services, don’t need a multicloud solution, and/or need a managed solution for a catalog.

 

 

Configuring Spark to use the AWS Glue catalog

spark-sql --packages "org.apache.iceberg:iceberg-spark-runtime-x.x_x.xx:x.x.x,software.amazon.awssdk:bundle:x.xx.xxx,software.amazon.awssdk:url-connection-client:x.xx.xxx" \
--conf spark.sql.catalog.my_catalog1=org.apache.iceberg.spark.SparkCatalog \
--conf spark.sql.catalog.my_catalog1.warehouse=s3://<path> \
--conf spark.sql.catalog.my_catalog1.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog \
--conf spark.sql.catalog.my_catalog1.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
--conf spark.hadoop.fs.s3a.access.key=$AWS_ACCESS_KEY \
--conf spark.hadoop.fs.s3a.secret.key=$AWS_SECRET_ACCESS_KEY

 

 

Catalog Migration

One nice thing about the vast majority of an Iceberg table residing in data lake storage is that it makes migrating from one catalog instance to another or one catalog type to another a very lightweight operation—you’re just changing where the mapping of the table path to the current metadata file is.

 

One common reason to migrate is that you're changing the location of your environment. An example of this is if you're on premises and using the Hive Metastore from your old Hadoop deployment, and you're migrating to AWS and want to use AWS Glue because it's a hosted offering.

 

 

Using an Engine

The standard way to migrate catalogs is to use an engine such as Apache Spark.

For example, if you wanted to have a Hadoop catalog as your source catalog and an AWS Glue catalog as your target catalog, you could configure Spark SQL like this:

spark-sql --packages "org.apache.iceberg:iceberg-spark-runtime-x.x_x.xx:x.x.x,software.amazon.awssdk:bundle:x.xx.xxx,software.amazon.awssdk:url-connection-client:x.xx.xxx" \
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
--conf spark.sql.catalog.source_catalog1=org.apache.iceberg.spark.SparkCatalog \
--conf spark.sql.catalog.source_catalog1.type=hadoop \
--conf spark.sql.catalog.source_catalog1.warehouse=<protocol>://<path> \
--conf spark.sql.catalog.target_catalog1=org.apache.iceberg.spark.SparkCatalog \
--conf spark.sql.catalog.target_catalog1.warehouse=s3://<path> \
--conf spark.sql.catalog.target_catalog1.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog \
--conf spark.sql.catalog.target_catalog1.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
--conf spark.hadoop.fs.s3a.access.key=$AWS_ACCESS_KEY \
--conf spark.hadoop.fs.s3a.secret.key=$AWS_SECRET_ACCESS_KEY

 

Note that executing this command just launches a Spark SQL shell with the two catalogs configured (source_catalog1 and target_catalog1) so that subsequent commands can use them; it does not perform any immediate migration of tables.

There are two main procedures to consider when migrating tables between catalogs.

We’ll go through these procedures next.

 

-register_table()

The register_table() Spark SQL procedure creates a lightweight copy of the source table using the source table’s datafiles.

Any changes made to the target catalog's table are written into the source catalog table's directories. As long as those changes don't physically delete any of the datafiles (e.g., via expire_snapshots()), they won't be seen by the source catalog's table. That said, making changes to the target catalog's table in this situation is not recommended.

 

 

This method can also be useful if you want to migrate catalogs, but you want to keep the table’s file location on data lake storage the same before and after migration.

 

Table 5-1 details the arguments needed to run the register_table() procedure in Spark SQL.

 

Table 5-2 details the output fields returned when the procedure is executed.

 

Following is an example usage of the register_table procedure, based on the Spark SQL shell configuration in the previous section:

CALL target_catalog.system.register_table(
  'target_catalog.db1.table1',
  '/path/to/source_catalog_warehouse/db1/table1/metadata/xxx.json'
)

 

-snapshot()

Like register_table(), the snapshot() Spark SQL procedure creates a lightweight copy of the source table using the source table's datafiles.

 

However, unlike register_table(), any changes made to the target catalog’s table will be done in the target table’s table location, meaning any changes made to the target table won’t interfere with the source table.

That said, any changes made to the source table won’t be visible to the users of the target table, and vice versa.

 

This method can be useful for testing migration where changes are required to be made to the target table for validation purposes, but you don’t want anything using the source table to see these changes.

 

Another consequence of the target catalog's table not owning the datafiles is that you cannot run expire_snapshots() on the target table, since that would entail physically deleting datafiles owned by the source catalog's table.

 

Table 5-3 details the arguments needed to run the snapshot() procedure in Spark SQL.

Table 5-4 details the output fields returned when the procedure is executed.

 

Following is an example usage of the snapshot() procedure, based on the Spark SQL shell configuration in the preceding section:

CALL target_catalog.system.snapshot(
  'source_catalog.db1.table1',
  'target_catalog.db1.table1'
)

 

 

 
