Table.add_files fails for Parquet files with DecimalType columns stored as FIXED_LEN_BYTE_ARRAY when precision allows INT32/INT64 #2057

Open
@CaptainEureka

Description

Apache Iceberg version

0.9.1 (latest release)

Please describe the bug 🐞

When attempting to add Parquet files to an Iceberg table using Table.add_files, the operation fails if a column defined as DecimalType in the Iceberg schema is physically stored as FIXED_LEN_BYTE_ARRAY in the Parquet file, even if the decimal's precision would typically map to INT32 or INT64 according to Iceberg's preferred Parquet mapping.

I see in the Iceberg spec that this mapping applies on write, and that part is handled correctly. However, the current behaviour seems to over-restrict the physical Parquet type for decimals when adding existing files. I believe this greatly limits which Parquet files can be added to an Iceberg table this way.

Steps to Reproduce:

  1. Define an Iceberg table schema with a DecimalType column, for example, Decimal(10, 2).
    • Iceberg's preferred Parquet physical type for Decimal(10, 2) would be INT64.
  2. Create a Parquet file where the corresponding column for this Decimal(10, 2) is physically stored as FIXED_LEN_BYTE_ARRAY. The data itself is valid for Decimal(10, 2).
  3. Attempt to add this Parquet file to the Iceberg table using Table.add_files.
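
For reference, here is a minimal sketch of the write-side mapping from the Iceberg spec that step 1 refers to (precision up to 9 maps to INT32, up to 18 to INT64, anything larger to a fixed-length byte array of minimal width). The function names are hypothetical helpers for illustration, not PyIceberg API.

```python
def preferred_parquet_physical_type(precision: int) -> str:
    """Physical type the Iceberg spec prescribes when *writing* decimal(P, S)."""
    if precision <= 0:
        raise ValueError("precision must be positive")
    if precision <= 9:
        return "INT32"
    if precision <= 18:
        return "INT64"
    return "FIXED_LEN_BYTE_ARRAY"


def min_fixed_len_bytes(precision: int) -> int:
    """Smallest byte width whose signed two's-complement range holds P digits."""
    max_unscaled = 10**precision - 1
    width = 1
    while 2 ** (8 * width - 1) - 1 < max_unscaled:
        width += 1
    return width


print(preferred_parquet_physical_type(10))  # INT64
print(min_fixed_len_bytes(10))              # 5
```

So a Decimal(10, 2) column is written as INT64 by preference, but a 5-byte FIXED_LEN_BYTE_ARRAY can hold the same values, which is the kind of file described in step 2.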

Behavior:

The Table.add_files operation fails, with the following error:

ValueError: Unexpected physical type FIXED_LEN_BYTE_ARRAY for DecimalType(10, 2) expected INT32

indicating a mismatch between the expected physical type (e.g., INT64) and the actual physical type (FIXED_LEN_BYTE_ARRAY) found in the Parquet file for the decimal column.

Expected Behavior:

The Table.add_files operation should succeed and correctly read the decimal values from the FIXED_LEN_BYTE_ARRAY physical storage. The Iceberg reader/writer should be lenient with the physical storage format of decimals OR otherwise Table.add_files should document these limitations.
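
To illustrate why a lenient reader seems feasible: Parquet stores a FIXED_LEN_BYTE_ARRAY decimal as the big-endian two's-complement unscaled integer, so the value can be recovered from the raw bytes and the column's scale alone. The sketch below uses only the standard library; `decode_fixed_decimal` is a hypothetical helper, not how PyIceberg actually reads these columns.

```python
from decimal import Decimal


def decode_fixed_decimal(raw: bytes, scale: int) -> Decimal:
    """Decode a Parquet FIXED_LEN_BYTE_ARRAY decimal value (hypothetical helper)."""
    # The bytes hold the unscaled value as a big-endian signed integer.
    unscaled = int.from_bytes(raw, byteorder="big", signed=True)
    sign = 1 if unscaled < 0 else 0
    digits = tuple(int(d) for d in str(abs(unscaled)))
    # Building from a tuple is exact and does not depend on the decimal context.
    return Decimal((sign, digits, -scale))


# 1234 unscaled with scale 2, stored in 5 bytes as for Decimal(10, 2)
raw = (1234).to_bytes(5, byteorder="big", signed=True)
print(decode_fixed_decimal(raw, scale=2))  # 12.34
```

The same unscaled integer is what an INT32 or INT64 column would hold directly, so the two encodings carry identical information.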

Environment:

  • Python version: 3.12.9
  • Parquet library and version: pyarrow 20.0.0

P.S. If this is just user error and I shouldn't be trying to do things this way I'd be happy to hear alternatives.

Willingness to contribute

  • I can contribute a fix for this bug independently
  • I would be willing to contribute a fix for this bug with guidance from the Iceberg community
  • I cannot contribute a fix for this bug at this time
