@jcpitre
Collaborator

@jcpitre jcpitre commented Oct 20, 2025

Closes: Too many features in routes.pmtiles for datasets without shapes #1391
Closes: Optimize pmtiles generation even more #1383

Refactored the PMTiles creation code so that each GTFS file is processed by a single class. The one exception is routes.txt, which is processed by both RoutesProcessor and RoutesProcessorForColors. This makes it easier to know when to download a file and when to delete it after use.
Also fixed #1391, where each trip resulted in its own feature in routes.pmtiles for datasets without shapes. Trips that share the same stops are now put together in a single feature.

For #1391, tested with mdb-1026-202510060055, which has 431204 trips in trips.txt.
Before the fix, routes-output.geojson had 431204 features.
After the fix, it has 10292 features.
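
The grouping behind that reduction can be sketched roughly as follows. This is only an illustration, not the code in this PR: the function name `group_trips_by_stop_sequence` and the shape of its input (a dict of trip ID to ordered stop IDs) are assumptions made for the example.

```python
from collections import defaultdict


def group_trips_by_stop_sequence(trip_stops):
    """Group trip IDs that visit exactly the same stops in the same order.

    trip_stops: dict of trip_id -> list of stop_ids in stop_sequence order
    (hypothetical input shape, not the PR's actual data structure).
    Returns a dict of stop-sequence tuple -> list of trip_ids.
    """
    groups = defaultdict(list)
    for trip_id, stop_ids in trip_stops.items():
        # A tuple is hashable, so the ordered stop list can serve as the key.
        groups[tuple(stop_ids)].append(trip_id)
    return dict(groups)


trip_stops = {
    "t1": ["A", "B", "C"],
    "t2": ["A", "B", "C"],  # same stops as t1, so it shares t1's feature
    "t3": ["C", "B", "A"],  # reverse direction, so it gets its own feature
}
features = group_trips_by_stop_sequence(trip_stops)
```

Each key in `features` would then correspond to one feature in the output instead of one feature per trip.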

From our AI friend

This pull request refactors and improves the GTFS CSV processing pipeline by introducing a new base processor class, streamlining CSV parsing, and enhancing work directory management. The changes emphasize modularity, performance, and safer file handling, while removing unused or redundant code. Key improvements include the addition of a fast CSV parser, a context-managed work directory utility, and a simplified, robust interface for accessing CSV data.

Core pipeline and processing improvements:

  • Added a new BaseProcessor class in base_processor.py to standardize CSV processing, including encoding detection, safe file existence checks, and metric tracking. Subclasses override process_file() for custom logic.
  • Introduced AgenciesProcessor as an example subclass of BaseProcessor, demonstrating how to extract agency information from a GTFS file using the new pipeline.
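
As a rough picture of the subclassing pattern described above, here is a minimal sketch. It is not the real BaseProcessor or AgenciesProcessor: the class names `BaseProcessorSketch` and `AgenciesSketch`, the constructor arguments, the `process()` wrapper, and the `rows_processed` metric are all assumptions, and encoding detection is replaced by a fixed encoding.

```python
import csv
import os


class BaseProcessorSketch:
    """Hypothetical stand-in for BaseProcessor; all names here are assumptions."""

    def __init__(self, filepath, encoding="utf-8-sig"):
        self.filepath = filepath
        self.encoding = encoding  # the real class detects this; fixed here
        self.rows_processed = 0   # simple metric kept by the base class

    def process(self):
        # Safe existence check: a missing file is reported, not raised.
        if not os.path.exists(self.filepath):
            return False
        self.process_file()
        return True

    def process_file(self):
        # Subclasses override this with file-specific logic.
        raise NotImplementedError


class AgenciesSketch(BaseProcessorSketch):
    """Hypothetical subclass that collects agency_id -> agency_name."""

    def __init__(self, filepath):
        super().__init__(filepath)
        self.agencies = {}

    def process_file(self):
        with open(self.filepath, newline="", encoding=self.encoding) as f:
            for row in csv.DictReader(f):
                self.agencies[row.get("agency_id", "")] = row.get("agency_name", "")
                self.rows_processed += 1
```

With one class per file, the caller can download the file, call `process()`, and delete the file right after, which is the lifecycle the description above is aiming for.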

CSV parsing and access enhancements:

  • Added FastCsvParser for efficient line-by-line parsing, using a heuristic to optimize for unquoted lines and tracking quoted line usage.
  • Refactored CsvCache to focus on path resolution and safe value extraction by index, removing in-memory caching and several relationship-building methods. Added static helpers for column index lookup and safe type conversion.
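
The unquoted-line heuristic and the index-based helpers might look something like the sketch below. This is a guess at the general technique, not the PR's FastCsvParser or CsvCache: the names `FastLineParserSketch`, `column_index`, and `safe_float` are hypothetical, and the sketch assumes a comma delimiter and no quoted fields spanning several physical lines.

```python
import csv


class FastLineParserSketch:
    """Hypothetical parser: plain split for unquoted lines, csv module otherwise."""

    def __init__(self):
        self.quoted_lines = 0  # how many lines needed the slower path

    def parse_line(self, line):
        line = line.rstrip("\r\n")
        if '"' not in line:
            # Fast path: without quotes, every comma is a field separator.
            return line.split(",")
        # Slow path: let the csv module handle quoting and escaped quotes.
        self.quoted_lines += 1
        return next(csv.reader([line]))


def column_index(headers, name):
    """Return the position of a column in the header row, or None if absent."""
    try:
        return headers.index(name)
    except ValueError:
        return None


def safe_float(row, index, default=None):
    """Read row[index] as a float, returning default when missing or invalid."""
    if index is None or index >= len(row):
        return default
    try:
        return float(row[index].strip())
    except ValueError:
        return default
```

For example, `column_index(headers, "stop_lat")` can be resolved once per file, and `safe_float(row, idx)` then reads each row by position without building a dict per line.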

Work directory and resource management:

  • Added EphemeralOrDebugWorkdir, a context manager for temporary or debug-mode working directories, with automatic cleanup of old directories and configurable behavior via environment variables.
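
A minimal sketch of the temporary-or-debug idea follows. It is not the PR's EphemeralOrDebugWorkdir: the function name `ephemeral_or_debug_workdir` and the environment variable `SKETCH_DEBUG_WORKDIR` are invented for this example, and the cleanup of old debug directories mentioned above is left out.

```python
import contextlib
import os
import shutil
import tempfile


@contextlib.contextmanager
def ephemeral_or_debug_workdir(prefix="pmtiles_"):
    """Yield a working directory; keep it only when a debug root is configured.

    SKETCH_DEBUG_WORKDIR is a hypothetical variable name used for illustration.
    """
    debug_root = os.environ.get("SKETCH_DEBUG_WORKDIR")
    if debug_root:
        # Debug mode: create the directory under the given root and keep it.
        os.makedirs(debug_root, exist_ok=True)
        yield tempfile.mkdtemp(prefix=prefix, dir=debug_root)
        return
    # Normal mode: create a throwaway directory and always remove it.
    path = tempfile.mkdtemp(prefix=prefix)
    try:
        yield path
    finally:
        shutil.rmtree(path, ignore_errors=True)
```

Used as `with ephemeral_or_debug_workdir() as workdir: ...`, the directory is removed on exit even if processing raises, unless the debug variable is set.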

Minor fixes and API consistency:

  • Renamed stop_txt_is_lat_log_required to stop_txt_is_lat_lon_required in gtfs.py for clarity; added a helper is_lat_lon_required.

These changes collectively modernize the pipeline, improve performance and maintainability, and lay the groundwork for further modular processors and robust CSV handling.

Please make sure these boxes are checked before submitting your pull request - thanks!

  • Run the unit tests with ./scripts/api-tests.sh to make sure you didn't break anything
  • Add or update any needed documentation to the repo
  • Format the title like "feat: [new feature short description]". Title must follow the Conventional Commit Specification (https://www.conventionalcommits.org/en/v1.0.0/).
  • Linked all relevant issues
  • Include screenshot(s) showing how this pull request works and fixes the issue(s)

@jcpitre jcpitre linked an issue Oct 20, 2025 that may be closed by this pull request
@jcpitre jcpitre changed the title from "1383 optimize pmtiles generation even more" to "feat: optimize pmtiles generation even more" Oct 21, 2025
