fix: create multiple dataframes from same CSVReader #44

Open
wants to merge 3 commits into
base: main

Conversation

AuPath
Collaborator

@AuPath AuPath commented Mar 27, 2025

The behaviour is now to read the whole CSV file (or String) into a List<NamedCsvRecord> rather than consuming it as a stream, as was previously (accidentally) the case. This works and fixes #38, but it also means that the whole file is always kept in memory.

This does not necessarily make sense: I might have a huge CSV file where I am interested in a single column, which I can obtain with the getDataframe(String... columns) method. Although the resulting dataframe is small, the whole file still sits in memory, even if no other dataframe with more than that column is ever created.
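To make the trade-off concrete, here is a minimal stdlib-only sketch. It does not use fastcsv or this project's classes: `StreamVsListSketch`, its methods, and the sample data are hypothetical, and the comma splitting is deliberately naive. It shows that a Java stream can be traversed only once, that a materialized list can be reused at the cost of holding every row, and that a streaming projection retains only the requested column.

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class StreamVsListSketch {

    // Hypothetical sample data; a real file would be far larger.
    static final String CSV = "id,name\n1,Ada\n2,Linus\n";

    // Lazily yields the data rows (header skipped), split naively on commas.
    // Naive splitting ignores quoting; it is only good enough for this sketch.
    static Stream<String[]> rows(String csv) {
        return new BufferedReader(new StringReader(csv))
                .lines()
                .skip(1)
                .map(line -> line.split(","));
    }

    // Streaming projection: only the requested column is retained in memory.
    static List<String> column(String csv, int index) {
        return rows(csv).map(row -> row[index]).collect(Collectors.toList());
    }

    // A Java stream is single-use: a second terminal operation throws.
    static boolean secondPassFails(String csv) {
        Stream<String[]> stream = rows(csv);
        stream.count();
        try {
            stream.count();
            return false;
        } catch (IllegalStateException alreadyConsumed) {
            return true;
        }
    }

    // Materializing: every row stays in memory, but can be traversed repeatedly.
    static List<String[]> materialize(String csv) {
        return rows(csv).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(column(CSV, 1));
        System.out.println("second pass fails: " + secondPassFails(CSV));
        System.out.println(materialize(CSV).size() + " rows kept in memory");
    }
}
```

The single-use behaviour is what makes a second dataframe extraction fail on a stream-backed reader, while the list avoids that by paying for the full file up front.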

We should discuss possible optimizations for this scenario.

- fixed by moving to version 3.6.0 of the fastcsv library (closes #38)
- fixed handling of file encoded with BOM (closes #42)
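For context on the BOM item: a UTF-8 byte order mark (bytes EF BB BF) decodes to the character U+FEFF, and Java's UTF-8 decoder leaves it in the text, so it ends up glued to the first header name. The thread does not show how the PR handles this, so the following is only a minimal stdlib sketch of the symptom and one possible remedy; `BomSketch` and `stripBom` are hypothetical names.

```java
public class BomSketch {

    private static final char BOM = '\uFEFF';

    // Removes a leading byte order mark, if present, from already-decoded text.
    static String stripBom(String text) {
        return (!text.isEmpty() && text.charAt(0) == BOM) ? text.substring(1) : text;
    }

    // Returns the first header name of a CSV text (naive comma split).
    static String firstHeader(String csv) {
        return csv.split("\n")[0].split(",")[0];
    }

    public static void main(String[] args) {
        String csv = BOM + "id,name\n1,Ada\n";
        // Without stripping, the first header is "\uFEFFid", not "id".
        System.out.println(firstHeader(csv).equals("id"));
        System.out.println(firstHeader(stripBom(csv)).equals("id"));
    }
}
```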
@AuPath AuPath requested a review from marioscrock March 27, 2025 08:26
@AuPath AuPath self-assigned this Mar 27, 2025
@marioscrock
Member

Thanks for fixing this! Reading the file as a stream was not "accidental" but deliberate, to reduce the memory footprint. We should find the best trade-off that solves the multiple-reads issue while keeping the ability to process the CSV iteratively (e.g., a large CSV where I need to access only one column). Maybe we can have a CSVStreamReader and a CSVReader, or something similar?
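One way such a split could look, as a rough stdlib-only sketch rather than the project's actual design: every name below (`ReaderVariantsSketch`, `ColumnSource`, `EagerSource`, `StreamingSource`) is hypothetical, and parsing is a naive comma split. The streaming variant takes a `Supplier<Reader>` so it can re-open the input on each request instead of caching rows, which assumes the input can be opened more than once.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Supplier;

public class ReaderVariantsSketch {

    // Hypothetical sample data.
    static final String CSV = "id,name\n1,Ada\n2,Linus\n";

    // Shared contract: extract one column by header name.
    interface ColumnSource {
        List<String> column(String name);
    }

    // Index of a header name; fails loudly on unknown columns.
    static int indexOf(String headerLine, String name) {
        int i = Arrays.asList(headerLine.split(",")).indexOf(name);
        if (i < 0) throw new IllegalArgumentException("Unknown column: " + name);
        return i;
    }

    // Streaming variant: re-opens the input per request, retains only the column.
    static final class StreamingSource implements ColumnSource {
        private final Supplier<Reader> input;

        StreamingSource(Supplier<Reader> input) { this.input = input; }

        @Override
        public List<String> column(String name) {
            try (BufferedReader reader = new BufferedReader(input.get())) {
                int i = indexOf(reader.readLine(), name);
                List<String> values = new ArrayList<>();
                for (String line; (line = reader.readLine()) != null; ) {
                    values.add(line.split(",")[i]);
                }
                return values;
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
    }

    // Eager variant: parses once, keeps all rows, serves any number of requests.
    static final class EagerSource implements ColumnSource {
        private String header;
        private final List<String[]> rows = new ArrayList<>();

        EagerSource(Supplier<Reader> input) {
            try (BufferedReader reader = new BufferedReader(input.get())) {
                header = reader.readLine();
                for (String line; (line = reader.readLine()) != null; ) {
                    rows.add(line.split(","));
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }

        @Override
        public List<String> column(String name) {
            int i = indexOf(header, name);
            List<String> values = new ArrayList<>();
            for (String[] row : rows) values.add(row[i]);
            return values;
        }
    }

    public static void main(String[] args) {
        Supplier<Reader> input = () -> new StringReader(CSV);
        ColumnSource streaming = new StreamingSource(input);
        ColumnSource eager = new EagerSource(input);
        // Both variants can be queried repeatedly; they differ in what they keep in memory.
        System.out.println(streaming.column("name"));
        System.out.println(streaming.column("id"));
        System.out.println(eager.column("name"));
    }
}
```

The streaming variant trades repeated I/O for a small footprint; the eager variant trades memory for a single parse. Which one fits depends on file size and how many dataframes are extracted.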

@marioscrock
Member

@AuPath I pushed an attempt to introduce CSVStreamReader and to refactor the duplicated operations into CSVReaderAbstract. Please review the latest commit.

@marioscrock marioscrock assigned marioscrock and unassigned AuPath Apr 9, 2025
@marioscrock
Member

Also fix the setOnlyDistinct behaviour in CSVReader and the other readers (cf. #45).

Successfully merging this pull request may close these issues.

- UTF-8 encoded CSV file
- CsvReader is consumed after extracting DataFrame