walk method in GCSFSWrapper returns empty string as one of filenames #558

@alekswithakayy

To reproduce:

import gcsfs
from petastorm.gcsfs_helpers.gcsfs_wrapper import GCSFSWrapper
path = "gs://your/bucket/path"
fs = GCSFSWrapper(gcsfs.GCSFileSystem())
_, directories, files = next(fs.walk(path))
print(files)
# prints ['', 'file1', 'file2']

This becomes a problem in petastorm.utils.add_to_dataset_metadata where we have the following line:

arrow_metadata = compat_get_metadata(dataset.pieces[0], dataset.fs.open)

The empty string ends up as pieces[0], and pyarrow ultimately raises the following error because it is not a valid filename:

Traceback (most recent call last):
  File "build_petastorm_dataset.py", line 103, in <module>
    run(args)
  File "build_petastorm_dataset.py", line 79, in run
    .parquet(args.output_url)
  File "/opt/conda/default/lib/python3.6/contextlib.py", line 88, in __exit__
    next(self.gen)
  File "/opt/conda/default/lib/python3.6/site-packages/petastorm/etl/dataset_metadata.py", line 113, in materialize_dataset
    _generate_unischema_metadata(dataset, schema)
  File "/opt/conda/default/lib/python3.6/site-packages/petastorm/etl/dataset_metadata.py", line 206, in _generate_unischema_metadata
    utils.add_to_dataset_metadata(dataset, UNISCHEMA_KEY, serialized_schema)
  File "/opt/conda/default/lib/python3.6/site-packages/petastorm/utils.py", line 115, in add_to_dataset_metadata
    arrow_metadata = compat_get_metadata(dataset.pieces[0], dataset.fs.open)
  File "/opt/conda/default/lib/python3.6/site-packages/petastorm/compat.py", line 31, in compat_get_metadata
    arrow_metadata = piece.get_metadata()
  File "/opt/conda/default/lib/python3.6/site-packages/pyarrow/parquet.py", line 676, in get_metadata
    f = self.open()
  File "/opt/conda/default/lib/python3.6/site-packages/pyarrow/parquet.py", line 683, in open
    reader = self.open_file_func(self.path)
  File "/opt/conda/default/lib/python3.6/site-packages/pyarrow/parquet.py", line 1054, in _open_dataset_file
    buffer_size=dataset.buffer_size
  File "/opt/conda/default/lib/python3.6/site-packages/pyarrow/parquet.py", line 210, in __init__
    read_dictionary=read_dictionary, metadata=metadata)
  File "pyarrow/_parquet.pyx", line 1023, in pyarrow._parquet.ParquetReader.open
  File "pyarrow/error.pxi", line 85, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Parquet file size is 0 bytes
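
One possible workaround, sketched here rather than proposed as the actual fix: filter empty names out of the walk output before anything downstream treats them as files. `drop_empty_names` is a hypothetical helper, not part of petastorm or gcsfs, and `fake_walk` is a stand-in for `fs.walk(path)` that mirrors the output reported above, so the snippet runs without a bucket. It assumes the empty string is the only bad entry.

```python
from typing import Iterable, Iterator, List, Tuple

# Same shape as the tuples yielded by os.walk: (dirpath, dirnames, filenames).
WalkEntry = Tuple[str, List[str], List[str]]


def drop_empty_names(walk_results: Iterable[WalkEntry]) -> Iterator[WalkEntry]:
    """Yield os.walk-style tuples with empty-string names filtered out.

    Hypothetical helper for illustration only.
    """
    for dirpath, dirnames, filenames in walk_results:
        yield (
            dirpath,
            [d for d in dirnames if d],   # keep only non-empty directory names
            [f for f in filenames if f],  # keep only non-empty file names
        )


# Stand-in for fs.walk(path), using the values from the report above.
fake_walk = [("your/bucket/path", [], ["", "file1", "file2"])]

_, directories, files = next(drop_empty_names(fake_walk))
print(files)
# prints ['file1', 'file2']
```

With a real filesystem the call would presumably look like `next(drop_empty_names(fs.walk(path)))`. This only hides the symptom at the call site; it does not explain why the wrapper produces the empty name in the first place.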

@megaserg @selitvin
