feat: optimize pmtiles generation even more #1408
Closes: Too many features in routes.pmtiles for datasets without shapes #1391
Closes: Optimize pmtiles generation even more #1383
Refactored the PMTiles creation code so that each GTFS file is processed by a single class (the one exception being routes.txt, which is processed by both RoutesProcessor and RoutesProcessorForColors). This makes it easier to know when to download each file and when to delete it after use.
Also fixed the problem from #1391 where, for datasets without shapes, each trip resulted in its own feature in routes.pmtiles. This was corrected by merging multiple trips with the same stops into a single feature.
For #1391, tested with mdb-1026-202510060055, which has 431,204 trips in trips.txt.
Before the fix, routes-output.geojson would have 431,204 features.
After the fix, there are 10,292 features.
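A minimal sketch of the grouping idea (the function and variable names below are illustrative, not the actual code in this PR): trips are keyed by their ordered stop sequence, so every trip that visits the same stops in the same order contributes to a single feature.

```python
from collections import defaultdict

# Hypothetical sketch: group trips that share the exact same ordered stop
# sequence so each group becomes one feature instead of one feature per trip.
def group_trips_by_stop_sequence(trip_stop_rows):
    """trip_stop_rows: iterable of (trip_id, stop_sequence, stop_id) tuples,
    e.g. parsed from stop_times.txt."""
    stops_by_trip = defaultdict(list)
    for trip_id, stop_sequence, stop_id in trip_stop_rows:
        stops_by_trip[trip_id].append((int(stop_sequence), stop_id))

    trips_by_key = defaultdict(list)
    for trip_id, stops in stops_by_trip.items():
        # The ordered tuple of stop_ids is the grouping key.
        key = tuple(stop_id for _, stop_id in sorted(stops))
        trips_by_key[key].append(trip_id)

    # Each key now maps to every trip visiting the same stops in the same
    # order; emit one feature per key rather than one per trip.
    return trips_by_key
```

With a grouping like this, the feature count is bounded by the number of distinct stop sequences rather than the number of trips.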
From our AI friend
This pull request refactors and improves the GTFS CSV processing pipeline by introducing a new base processor class, streamlining CSV parsing, and enhancing work directory management. The changes emphasize modularity, performance, and safer file handling, while removing unused or redundant code. Key improvements include the addition of a fast CSV parser, a context-managed work directory utility, and a simplified, robust interface for accessing CSV data.
Core pipeline and processing improvements:
- Added a BaseProcessor class in base_processor.py to standardize CSV processing, including encoding detection, safe file existence checks, and metric tracking. Subclasses override process_file() for custom logic.
- Added AgenciesProcessor as an example subclass of BaseProcessor, demonstrating how to extract agency information from a GTFS file using the new pipeline.
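A rough sketch of how such a base/subclass split could look, assuming a DictReader-based pipeline; apart from BaseProcessor, AgenciesProcessor, and process_file(), the names and details below are illustrative rather than the PR's actual implementation:

```python
import csv
import os


class BaseProcessor:
    """Sketch of a per-file processor: owns encoding detection, safe
    existence checks, and metric tracking; subclasses implement
    process_file() for the file-specific logic."""

    def __init__(self, path):
        self.path = path
        self.rows_processed = 0  # simple metric tracking

    def detect_encoding(self):
        # Assumed heuristic: utf-8-sig strips a BOM when present and
        # degrades gracefully to plain UTF-8 otherwise.
        return "utf-8-sig"

    def run(self):
        if not os.path.exists(self.path):
            return None  # safe handling of optional GTFS files
        with open(self.path, encoding=self.detect_encoding(), newline="") as f:
            return self.process_file(csv.DictReader(f))

    def process_file(self, reader):
        raise NotImplementedError


class AgenciesProcessor(BaseProcessor):
    """Example subclass: extract agency_id -> agency_name from agency.txt."""

    def process_file(self, reader):
        agencies = {}
        for row in reader:
            self.rows_processed += 1
            agencies[row.get("agency_id", "")] = row.get("agency_name", "")
        return agencies
```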
CSV parsing and access enhancements:
- Added FastCsvParser for efficient line-by-line parsing, using a heuristic to optimize for unquoted lines and tracking quoted-line usage.
- Refocused CsvCache on path resolution and safe value extraction by index, removing in-memory caching and several relationship-building methods. Added static helpers for column index lookup and safe type conversion.
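A minimal sketch of the unquoted-line heuristic (the class name FastCsvParser comes from the summary above; the method name and counter are assumptions): lines containing no double quote are split directly, and only quoted lines fall back to the csv module.

```python
import csv


class FastCsvParser:
    """Fast path for the common case of unquoted CSV lines; quoted lines
    fall back to the csv module, and their frequency is tracked."""

    def __init__(self):
        self.quoted_lines = 0

    def parse_line(self, line):
        line = line.rstrip("\r\n")
        if '"' not in line:
            # Fast path: no quoting, so a plain split is already correct.
            return line.split(",")
        # Slow path: let the csv module handle quotes and embedded commas.
        self.quoted_lines += 1
        return next(csv.reader([line]))
```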
Work directory and resource management:
- Added EphemeralOrDebugWorkdir, a context manager for temporary or debug-mode working directories, with automatic cleanup of old directories and configurable behavior via environment variables.
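A hedged sketch of a context-managed work directory; the environment variable name and exact cleanup behavior below are assumptions for illustration, not necessarily what EphemeralOrDebugWorkdir does:

```python
import os
import shutil
import tempfile
from contextlib import contextmanager


@contextmanager
def ephemeral_or_debug_workdir(prefix="pmtiles-"):
    """Yield a work directory: a temp dir removed on exit by default, or a
    persistent directory when a debug environment variable is set."""
    debug_dir = os.environ.get("PMTILES_DEBUG_WORKDIR")  # assumed variable name
    if debug_dir:
        os.makedirs(debug_dir, exist_ok=True)
        yield debug_dir  # kept on disk for inspection after the run
    else:
        path = tempfile.mkdtemp(prefix=prefix)
        try:
            yield path
        finally:
            shutil.rmtree(path, ignore_errors=True)  # automatic cleanup
```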
Minor fixes and API consistency:
- Renamed stop_txt_is_lat_log_required to stop_txt_is_lat_lon_required in gtfs.py for clarity; added a helper is_lat_lon_required.

These changes collectively modernize the pipeline, improve performance and maintainability, and lay the groundwork for further modular processors and robust CSV handling.

Summary:
Please make sure these boxes are checked before submitting your pull request - thanks!
Run ./scripts/api-tests.sh to make sure you didn't break anything