Add memmap to gambit #5

Michal-Babins · 2025-08-14T13:32:36Z

Add NumPy memmap as an option for reading the pairwise distance matrix from an npy file instead of a CSV.

Memmap

xonq

I will finalize the review once the validation dataset is received from the analysts. The comments below are either conceptual or notes to self for further review, with one potential requested change and one note for the future.

xonq · 2025-08-21T17:05:45Z

gambitdb/CompressClusters.py

+        self.logger.debug(f"Reading distance matrix index from {index_filename}")
+        with open(index_filename, 'r') as f:
+            dist_matrix_index_labels = [line.strip() for line in f]
+        pairwise_distances_index = pandas.Index(dist_matrix_index_labels)


note to self: will need to ensure index is being appropriately assigned
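One way to check the index assignment is to verify that the number of labels matches the matrix dimensions before reshaping. The sketch below is hypothetical and not taken from the PR: the function name `load_labeled_memmap` and its parameters are made up for illustration. It assumes the matrix file is a raw float32 dump of a square matrix, as the `np.memmap` call in the diff implies. A file written with `np.save` carries a header and would need `np.load(..., mmap_mode='r')` instead.

```python
import numpy as np
import pandas


def load_labeled_memmap(matrix_path, index_path, dtype="float32"):
    """Memory-map a raw square matrix and pair it with its row/column labels.

    Hypothetical helper for illustration; not part of gambitdb.
    """
    with open(index_path, "r") as f:
        # Skip blank lines so a trailing newline does not create an empty label.
        labels = [line.strip() for line in f if line.strip()]
    index = pandas.Index(labels)
    if not index.is_unique:
        raise ValueError("Duplicate labels in distance matrix index")

    # Without a shape argument, np.memmap returns a 1-D array over the file.
    flat = np.memmap(matrix_path, dtype=dtype, mode="r")
    n = len(index)
    if flat.size != n * n:
        raise ValueError(
            f"Matrix has {flat.size} values, expected {n * n} for {n} labels"
        )
    # Reshaping the memmap gives a view; no data is read into RAM here.
    return flat.reshape(n, n), index
```

A size mismatch here would catch a truncated matrix file or an index file that belongs to a different matrix, but it cannot detect labels that are present in the wrong order.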

xonq · 2025-08-21T17:08:15Z

gambitdb/CompressClusters.py

+
+        # Memory-map the pairwise distances file instead of loading it
+        self.logger.debug(f"Memory-mapping distance matrix from {self.pairwise_distances_filename}")
+        pairwise_distances_matrix = np.memmap(self.pairwise_distances_filename, dtype='float32', mode='r')


note for future: if RAM/disk is still an issue, we could scale the matrix to integers at the precision we need and store it in a smaller integer type such as "int16". For example, if we only need two decimal places for values between 0 and 1, we could multiply the matrix by 100 and use int8, which would reduce memory by 75% relative to float32.
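A minimal sketch of that idea, assuming distances lie in the range 0 to 1 and that a resolution of 0.01 is acceptable. The function names and the `scale` parameter are hypothetical, and the sketch uses uint8 rather than int8 since the values are non-negative.

```python
import numpy as np


def quantize_distances(matrix, scale=100):
    """Store distances in [0, 1] as uint8 with a resolution of 1/scale."""
    if scale > 255:
        raise ValueError("scale must fit in uint8")
    # np.rint rounds to nearest; a plain astype would truncate toward zero.
    return np.rint(np.asarray(matrix) * scale).astype(np.uint8)


def dequantize_distances(quantized, scale=100):
    """Recover approximate float32 distances from the uint8 encoding."""
    return quantized.astype(np.float32) / scale


rng = np.random.default_rng(0)
dists = rng.random((4, 4), dtype=np.float32)
q = quantize_distances(dists)
print(q.nbytes / dists.nbytes)  # 0.25
```

The round trip loses at most about 0.005 per value at this scale. For a memory-mapped matrix that does not fit in RAM, the conversion would need to be applied in row chunks rather than to the whole array at once.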

xonq · 2025-08-21T17:17:27Z

gambitdb/Diameters.py

-        species['ngenomes'] = species['ngenomes'].astype(int)
+        # Fill any potential NaN values with 0 before casting to integer.
+        # This makes the script more robust against unexpected missing values.
+        species['ngenomes'] = species['ngenomes'].fillna(0).astype(int)


Are the NaNs (a) intentionally introduced/permitted upstream, or (b) an artifact of potential errors?

If (a), no change requested.

If (b), it would be ideal to address those upstream and raise an error here if detected.
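For option (b), a fail-fast version of the cast could look roughly like this. The sketch is hypothetical: the function name `cast_ngenomes_strict` is not in the PR, and it assumes `species` is a pandas DataFrame with an `ngenomes` column, as in the diff.

```python
import pandas


def cast_ngenomes_strict(species):
    """Cast 'ngenomes' to int, raising instead of silently filling NaN.

    Hypothetical helper for illustration; not part of gambitdb.
    """
    missing = species["ngenomes"].isna()
    if missing.any():
        bad = species.index[missing].tolist()
        # Report only the first few offending rows to keep the message short.
        raise ValueError(
            f"Missing ngenomes for {len(bad)} species: {bad[:10]}"
        )
    species["ngenomes"] = species["ngenomes"].astype(int)
    return species
```

Raising here keeps a missing count from being recorded as a species with zero genomes, which is what `fillna(0)` would do.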

xonq · 2025-08-21T17:23:06Z

gambitdb/Diameters.py

-                min_inter[i, j] = min_inter[j, i] = mi
-            #Add Species data
+
+            # If the assembly accessions list is not empty, calculate diameter and min_inter


note to self: review further following receipt of validation dataset

xonq · 2025-08-28T20:37:17Z

The Dockerfile is failing to build. Tests are corrupted.

xonq · 2025-08-28T20:59:51Z

gambitdb-curate argparse script description needs to be updated with new required inputs

thanhleviet and others added 8 commits June 8, 2025 19:32

add memmap for small mem computing resource

db173cc

fix incorrect index extension

6d31767

add gambitdb-compare-gtdb-metadata

17b79c8

add insert patch

47af4a1

Merge pull request #3 from thanhleviet/memmap

3e1fb99

Memmap

add h5py and networkx to install_requires in setup.py

1cd5d85

add support for HDF5 input format in distance matrix conversion

29f858c

Added test data

3f4e932

Michal-Babins requested a review from xonq August 14, 2025 13:33

xonq reviewed Aug 21, 2025


Michal-Babins added 4 commits September 8, 2025 18:59

Add processing for memmap idx

dfb0449

Bump to 0.1.0

9367470

Add same npy idx reader for labels as for diameters

a66a492

Keep species level classification if only 1 cluster

b9a261b
