Skip to content

Conversation

Michal-Babins
Copy link
Collaborator

Add numpy memmap as an option to process data coming from npy file instead of a csv for the pairwise distance matrix.

@Michal-Babins Michal-Babins requested a review from xonq August 14, 2025 13:33
Copy link
Collaborator

@xonq xonq left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will finalize review when validation dataset received from analysts. Comments are notes to self for further review or conceptual, with 1 potential requested change/note for the future.

self.logger.debug(f"Reading distance matrix index from {index_filename}")
with open(index_filename, 'r') as f:
dist_matrix_index_labels = [line.strip() for line in f]
pairwise_distances_index = pandas.Index(dist_matrix_index_labels)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note to self: will need to ensure index is being appropriately assigned


# Memory-map the pairwise distances file instead of loading it
self.logger.debug(f"Memory-mapping distance matrix from {self.pairwise_distances_filename}")
pairwise_distances_matrix = np.memmap(self.pairwise_distances_filename, dtype='float32', mode='r')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note for future: if RAM/disk is still an issue, we could convert the matrix to whatever integer corresponds to the significant figures we need and change the matrix to "int16" or whatever. for example, if we only need two significant figures between 0-1, we could multiply the matrix by 100 and use int8, which would reduce memory by 75%.

species['ngenomes'] = species['ngenomes'].astype(int)
# Fill any potential NaN values with 0 before casting to integer.
# This makes the script more robust against unexpected missing values.
species['ngenomes'] = species['ngenomes'].fillna(0).astype(int)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are the NaN's a) intentionally introduced/permitted upstream, or b) are they an artifact of potential errors?

  • a) no change requested
  • b) it would be ideal to address those upstream and raise an error here if detected

min_inter[i, j] = min_inter[j, i] = mi
#Add Species data

# If the assembly accessions list is not empty, calculate diameter and min_inter
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note to self: review further following receipt of validation dataset

@xonq
Copy link
Collaborator

xonq commented Aug 28, 2025

Dockerfile failing to build. Tests are corrupted

@xonq
Copy link
Collaborator

xonq commented Aug 28, 2025

gambitdb-curate argparse script description needs to be updated with new required inputs

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants