To recreate:
import gcsfs
from petastorm.gcsfs_helpers.gcsfs_wrapper import GCSFSWrapper
path = "gs://your/bucket/path"
fs = GCSFSWrapper(gcsfs.GCSFileSystem())
_, directories, files = next(fs.walk(path))
print(files)
# returns ['', 'file1', 'file2']
This becomes a problem in petastorm.utils.add_to_dataset_metadata where we have the following line:
arrow_metadata = compat_get_metadata(dataset.pieces[0], dataset.fs.open)
The empty string ends up as pieces[0], and pyarrow ultimately throws the following error since this is not a valid filename:
Traceback (most recent call last):
  File "build_petastorm_dataset.py", line 103, in <module>
    run(args)
  File "build_petastorm_dataset.py", line 79, in run
    .parquet(args.output_url)
  File "/opt/conda/default/lib/python3.6/contextlib.py", line 88, in __exit__
    next(self.gen)
  File "/opt/conda/default/lib/python3.6/site-packages/petastorm/etl/dataset_metadata.py", line 113, in materialize_dataset
    _generate_unischema_metadata(dataset, schema)
  File "/opt/conda/default/lib/python3.6/site-packages/petastorm/etl/dataset_metadata.py", line 206, in _generate_unischema_metadata
    utils.add_to_dataset_metadata(dataset, UNISCHEMA_KEY, serialized_schema)
  File "/opt/conda/default/lib/python3.6/site-packages/petastorm/utils.py", line 115, in add_to_dataset_metadata
    arrow_metadata = compat_get_metadata(dataset.pieces[0], dataset.fs.open)
  File "/opt/conda/default/lib/python3.6/site-packages/petastorm/compat.py", line 31, in compat_get_metadata
    arrow_metadata = piece.get_metadata()
  File "/opt/conda/default/lib/python3.6/site-packages/pyarrow/parquet.py", line 676, in get_metadata
    f = self.open()
  File "/opt/conda/default/lib/python3.6/site-packages/pyarrow/parquet.py", line 683, in open
    reader = self.open_file_func(self.path)
  File "/opt/conda/default/lib/python3.6/site-packages/pyarrow/parquet.py", line 1054, in _open_dataset_file
    buffer_size=dataset.buffer_size
  File "/opt/conda/default/lib/python3.6/site-packages/pyarrow/parquet.py", line 210, in __init__
    read_dictionary=read_dictionary, metadata=metadata)
  File "pyarrow/_parquet.pyx", line 1023, in pyarrow._parquet.ParquetReader.open
  File "pyarrow/error.pxi", line 85, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Parquet file size is 0 bytes
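One possible workaround, sketched here under assumptions, is to drop empty names from each walk entry before anything downstream treats them as files. The helper name drop_empty_names is hypothetical and is not part of petastorm or gcsfs; the sketch uses a stand-in list instead of a real fs.walk(path) call so it runs without GCS access, and it does not address why the empty string appears in the first place.

```python
from typing import Iterable, Iterator, List, Tuple

# Shape of one os.walk-style entry: (root, directories, files).
WalkEntry = Tuple[str, List[str], List[str]]


def drop_empty_names(walk: Iterable[WalkEntry]) -> Iterator[WalkEntry]:
    """Yield walk entries with empty-string directory and file names removed.

    Hypothetical helper for illustration only; not a petastorm or gcsfs API.
    """
    for root, dirs, files in walk:
        yield root, [d for d in dirs if d], [f for f in files if f]


# Stand-in for fs.walk(path), mimicking the output reported above.
fake_walk = [("your/bucket/path", [], ["", "file1", "file2"])]

_, directories, files = next(drop_empty_names(fake_walk))
print(files)
# prints ['file1', 'file2']
```

With a real filesystem, the same wrapper could presumably be applied as drop_empty_names(fs.walk(path)), assuming fs.walk yields (root, directories, files) tuples as in the reproduction above.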