Add memmap to gambit #5
base: main
I will finalize this review once the validation dataset is received from the analysts. The comments below are either notes to self for further review or conceptual points, with one potential requested change and one note for the future.
self.logger.debug(f"Reading distance matrix index from {index_filename}")
with open(index_filename, 'r') as f:
dist_matrix_index_labels = [line.strip() for line in f]
pairwise_distances_index = pandas.Index(dist_matrix_index_labels)
note to self: will need to ensure index is being appropriately assigned
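One way the index assignment could be checked is sketched below. It assumes the label file holds one label per line and that the matrix was written with `np.save` as a square `.npy` file; the helper name `load_labelled_distances` is hypothetical and not part of gambitdb. The sketch uses `np.load(..., mmap_mode="r")`, which reads the `.npy` header, so the shape and dtype come from the file instead of being assumed by the caller.

```python
import numpy as np
import pandas


def load_labelled_distances(matrix_path, index_path):
    """Memory-map a square .npy matrix and validate it against its label file.

    Hypothetical helper for illustration only; not part of gambitdb.
    """
    with open(index_path, "r") as f:
        labels = [line.strip() for line in f if line.strip()]
    index = pandas.Index(labels)

    # mmap_mode="r" keeps the data on disk and parses the .npy header,
    # so the array arrives with its saved shape and dtype.
    matrix = np.load(matrix_path, mmap_mode="r")

    if matrix.ndim != 2 or matrix.shape[0] != matrix.shape[1]:
        raise ValueError(f"Expected a square matrix, got shape {matrix.shape}")
    if len(index) != matrix.shape[0]:
        raise ValueError(
            f"{len(index)} labels for a matrix with {matrix.shape[0]} rows"
        )
    if not index.is_unique:
        raise ValueError("Index labels are not unique")
    return matrix, index
```

A check like this only confirms that the label count and uniqueness are consistent with the matrix; it cannot prove the labels are in the same order as the rows, which still needs the validation dataset.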
# Memory-map the pairwise distances file instead of loading it
self.logger.debug(f"Memory-mapping distance matrix from {self.pairwise_distances_filename}")
pairwise_distances_matrix = np.memmap(self.pairwise_distances_filename, dtype='float32', mode='r')
Note for the future: if RAM or disk usage is still an issue, we could scale the matrix to integers at the precision we actually need and store it as `int16` or smaller. For example, if we only need two decimal places for values between 0 and 1, we could multiply the matrix by 100 and store it as `int8`, which would reduce memory by 75% relative to `float32`.
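A minimal sketch of that idea follows, assuming all distances lie in [0, 1] and two decimal places are enough. The function names and the `SCALE` constant are hypothetical. It uses `uint8` instead of `int8`: both hold 0 to 100, and both take one byte per value against four for `float32`, but `uint8` leaves headroom up to 255.

```python
import numpy as np

SCALE = 100  # assumed precision: two decimal places


def quantize_distances(distances, scale=SCALE):
    """Map floats in [0, 1] to uint8 values in [0, scale]."""
    if not 0 < scale <= 255:
        raise ValueError("scale must fit in uint8")
    arr = np.asarray(distances, dtype=np.float32)
    if np.isnan(arr).any() or arr.min() < 0 or arr.max() > 1:
        raise ValueError("distances must be finite and within [0, 1]")
    # Round to the nearest integer step before the narrowing cast.
    return np.rint(arr * scale).astype(np.uint8)


def dequantize_distances(quantized, scale=SCALE):
    """Recover approximate float32 distances from the quantized values."""
    return quantized.astype(np.float32) / scale
```

The round trip loses at most half a step (0.005 with a scale of 100), so any thresholds compared against the distances would need to tolerate that. For a matrix too large for RAM, the conversion would have to be applied in row blocks instead of on the whole array at once.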
species['ngenomes'] = species['ngenomes'].astype(int)
# Fill any potential NaN values with 0 before casting to integer.
# This makes the script more robust against unexpected missing values.
species['ngenomes'] = species['ngenomes'].fillna(0).astype(int)
Are the NaNs (a) intentionally introduced or permitted upstream, or (b) an artifact of potential errors?
- a) no change requested
- b) it would be ideal to address those upstream and raise an error here if detected
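If case (b) applies, the check could look something like the sketch below, which raises instead of silently filling with 0. The helper name `ngenomes_as_int` is hypothetical; only the `ngenomes` column name comes from the diff.

```python
import pandas


def ngenomes_as_int(species):
    """Cast the 'ngenomes' column to int, failing loudly on missing values.

    Hypothetical helper for illustration only; not part of gambitdb.
    """
    missing = species['ngenomes'].isna()
    if missing.any():
        # Report a few offending row labels to make upstream debugging easier.
        examples = species.index[missing].tolist()[:10]
        raise ValueError(
            f"{int(missing.sum())} species rows have no 'ngenomes' value, "
            f"for example: {examples}"
        )
    return species['ngenomes'].astype(int)
```

Usage would be `species['ngenomes'] = ngenomes_as_int(species)` in place of the `fillna(0)` line.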
min_inter[i, j] = min_inter[j, i] = mi
# Add Species data

# If the assembly accessions list is not empty, calculate diameter and min_inter
note to self: review further following receipt of validation dataset
Dockerfile failing to build. Tests are corrupted.
gambitdb-curate argparse script description needs to be updated with new required inputs.
Add numpy memmap as an option to process data coming from npy file instead of a csv for the pairwise distance matrix.