
Conversation

mlorenz49 (Contributor)

Introduces a new MzMLWriter class to export mass spectrometry data from MSData_Base objects to standard mzML and indexed mzML file formats. Key features include:

  • Support for both regular and indexed mzML output
  • Parallel processing of spectrum data for improved performance
  • Correct implementation of controlled vocabulary parameters with required value attributes
  • Binary encoding of m/z and intensity arrays with optional zlib compression
  • Support for centroid and profile spectra
  • Complete metadata representation including instrument configuration, file description, and processing information
  • Proper calculation of byte offsets for efficient random access in indexed files
  • Automatic generation of SHA-1 file checksums
  • XML formatting according to the PSI standard specification

This writer enables interoperability with common MS analysis tools like FragPipe, MSConvert, and other applications that consume mzML format.

Added unit tests to validate the writer's functionality and output file structure compliance.
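To make the "binary encoding with optional zlib compression" bullet concrete, here is a hedged sketch (not the PR's code) of the usual mzML binary payload pipeline: pack floats little-endian at the requested precision, optionally zlib-compress, then base64-encode. Function names are illustrative only.

```python
import base64
import struct
import zlib


def encode_binary(values, precision=64, compression=None):
    """Pack floats little-endian, optionally zlib-compress, then base64-encode."""
    fmt = "<%d%s" % (len(values), "d" if precision == 64 else "f")
    raw = struct.pack(fmt, *values)
    if compression == "zlib":
        raw = zlib.compress(raw)
    return base64.b64encode(raw).decode("ascii")


def decode_binary(text, count, precision=64, compression=None):
    """Inverse of encode_binary; count is the number of floats encoded."""
    raw = base64.b64decode(text)
    if compression == "zlib":
        raw = zlib.decompress(raw)
    fmt = "<%d%s" % (count, "d" if precision == 64 else "f")
    return list(struct.unpack(fmt, raw))


mz = [100.0, 200.5, 300.25]
encoded = encode_binary(mz, compression="zlib")
assert decode_binary(encoded, len(mz), compression="zlib") == mz
```

The round trip is exact because the pack/unpack format mirrors the encoding parameters; readers recover the precision and compression from the cvParams on each binaryDataArray.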

hdf.ms_data = {"spectrum_df": self.spectrum_df, "peak_df": self.peak_df}

The method stores the MzMLWriter instance in an instance attribute 'self.writer', but it doesn't need to do this. Creating a local variable and letting it be garbage collected after writing is complete would be a better approach. The writer object isn't used elsewhere after the save_mzml method completes.

    def save_mzml(
        self, mzml_file_path: str, binary_precision=64, compression=None, indexed=False, process_count=1, batch_size=5000
    ):
        "
        Save data into mzML file format

        Parameters
        ----------
        mzml_file_path : str
            Absolute or relative path of the mzML file to create.
        binary_precision : int, optional
            Binary encoding precision (32 or 64 bit), by default 64
        compression : str, optional
            Compression method (None, 'zlib'), by default None
        indexed : bool, optional
            Whether to create an indexed mzML file, by default False
        process_count : int, optional
            Number of processes to use for writing, by default 1
        batch_size : int, optional
            Batch size for writing spectra, by default 5000
        "

        # Create MzMLWriter instance with appropriate parameters
        writer = MzMLWriter(
            self,
            mzml_file_path,
            binary_precision=binary_precision,
            compression=compression,
            indexed=indexed,
            process_count=process_count,
            batch_size=batch_size,
        )
        # Write the mzML file
        writer.write()

cv_param = ET.SubElement(scan, ns_prefix + "cvParam")
cv_param.set("cvRef", "MS")
cv_param.set("accession", "MS:1000016")
cv_param.set("name", "scan start time")

Improved the comment to explain why an empty value attribute is needed. The attribute is required by the mzML schema even though no value is needed for this parameter.

    # Add centroid/profile indication
    cv_param = ET.SubElement(spectrum, ns_prefix + "cvParam")
    cv_param.set("cvRef", "MS")
    cv_param.set("accession", "MS:1000127" if ms_data.centroided else "MS:1000128")
    cv_param.set("name", "centroid spectrum" if ms_data.centroided else "profile spectrum")
    cv_param.set("value", "")  # Empty value attribute required by schema


# Add collision energy if available
if 'nce' in row and row['nce'] > 0:
    cv_param = ET.SubElement(activation, ns_prefix + "cvParam")

Improved the comment to explain why an empty value attribute is needed. The attribute is required by the mzML schema even though no value is needed for this parameter.

            cv_param.set("value", "")  # Empty value attribute required by schema
            cv_param.set("unitCvRef", "MS")
            cv_param.set("unitAccession", "MS:1000131")
            cv_param.set("unitName", "number of detector counts")

The mmap call can fail on some systems if the file is empty or very small. Adding a try block would make the code more robust when dealing with edge cases.

def _process_offset_batch_mmap(args):
    """Process a batch of spectrum IDs using memory mapping"""
    file_path, start_idx, end_idx = args
    results = {}
    
    # Open the file using memory mapping (shared memory view)
    with open(file_path, 'rb') as f:
        import mmap
        try:
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:


def _process_offset_batch_mmap(args):
    """Process a batch of spectrum IDs using memory mapping"""
    file_path, start_idx, end_idx = args

Adding a fallback when memory mapping fails due to file size issues. This makes the code more robust on various systems and with edge cases like empty files.

        except (ValueError, OSError) as e:
            print(f"Warning: Could not memory map file: {e}")
            # Fallback to regular file reading
            return {}
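The fallback above returns an empty dict, which silently drops all offsets. A minimal sketch of what a real plain-read fallback could look like (function name is hypothetical; the `<spectrum ` tag and `id="..."` attribute format are assumptions based on the mzML snippets in this PR):

```python
def offsets_by_scan(file_path):
    """Scan the file for <spectrum ...> opening tags and record byte offsets,
    keyed by each spectrum's id attribute. No memory mapping involved."""
    offsets = {}
    with open(file_path, "rb") as f:
        content = f.read()
    pos = 0
    while True:
        pos = content.find(b"<spectrum ", pos)
        if pos == -1:
            break
        # Extract the id="..." attribute from the opening tag, if present
        end = content.find(b">", pos)
        tag = content[pos:end]
        id_start = tag.find(b'id="')
        if id_start != -1:
            id_end = tag.find(b'"', id_start + 4)
            offsets[tag[id_start + 4:id_end].decode()] = pos
        pos = end
    return offsets
```

Reading the whole file into memory is fine for typical index passes and avoids the mmap edge cases on empty or very small files that this comment thread is about.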


def _calculate_offsets(self, temp_file_path):
    """Calculate byte offsets for each spectrum using memory-mapped file"""
    # Create batch ranges for parallelization

Adding error handling when reading the temporary file. This prevents potential crashes if the file can't be read completely for any reason.

            with open(self.output_path + ".temp", 'rb') as f:
                try:
                    content = f.read()
                except Exception as e:
                    print(f"Error reading temporary file: {e}")
                    content = b""
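Relatedly, the SHA-1 checksum mentioned in the feature list can be computed in chunks so a large output file never has to be read into memory at once. A minimal sketch (function name is ours; note that the indexed mzML schema defines the exact byte range the hash must cover, so the real writer must hash up to the checksum element rather than the whole file):

```python
import hashlib


def file_sha1(path, chunk_size=1 << 20):
    """Incremental SHA-1 over a file, read in 1 MiB chunks."""
    sha1 = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            sha1.update(chunk)
    return sha1.hexdigest()
```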

first_spectrum_number = 0
last_spectrum_number = spectrum_count - 1

# Create batch boundaries

The original code has potential issues with attribute checking and accessing spectrum_df['analyzer'] when it might not exist. This version adds more robust validation to prevent AttributeError and KeyError exceptions.

        analyzer_type = "orbitrap"
        if (
            hasattr(self.ms_data, "spectrum_df")
            and hasattr(self.ms_data, "auxiliary_items")
            and "analyzer" in self.ms_data.auxiliary_items
            and "analyzer" in self.ms_data.spectrum_df.columns
            and not self.ms_data.spectrum_df["analyzer"].empty
        ):
            analyzer_type = self.ms_data.spectrum_df["analyzer"].iloc[0]
        self._add_cv_param(analyzer, "MS", "MS:1000484", analyzer_type, "")


# Add binary data arrays
binary_list = ET.SubElement(spectrum, self.ns_prefix + "binaryDataArrayList")
binary_list.set("count", "2") # m/z and intensity

The str() conversion is needed in case a numeric value is passed. This change ensures that numeric values will be properly converted to strings before being added as XML attributes.

    def _add_cv_param(self, parent, cv_ref, accession, name, value="", unit_cv_ref=None, 
                    unit_accession=None, unit_name=None):
        """Helper method to add a CV parameter"""
        cv_param = ET.SubElement(parent, self.ns_prefix + "cvParam")
        cv_param.set("cvRef", cv_ref)
        cv_param.set("accession", accession)
        cv_param.set("name", name)
        
        # Always include a value attribute, even if it's empty
        cv_param.set("value", str(value) if value is not None else "")
            
        if unit_cv_ref:
            cv_param.set("unitCvRef", unit_cv_ref)
            cv_param.set("unitAccession", unit_accession)
            cv_param.set("unitName", unit_name)

import os

The Path import is unused in this test file and should be removed.

import os
import tempfile
import unittest
import xml.etree.ElementTree as ET
import numpy as np
import pytest
from pathlib import Path

root = tree.getroot()
except Exception as e:
pytest.fail(f"Failed to parse the generated indexed mzML file: {e}")

Improved the comment to make it clear that the expected number of spectra comes from the MockMSData setup.

        # Check spectrum count
        spectrum_list = root.findall(".//mzml:spectrumList", ns)
        assert len(spectrum_list) == 1
        assert spectrum_list[0].get("count") == "3"  # We created 3 spectra in MockMSData



if __name__ == "__main__":
    unittest.main()  # No newline at end of file

The file is missing a newline at the end of the file, which is a minor style issue.


from tqdm import tqdm
import numpy as np

def _create_spectrum_xml(i, row, ms_data, binary_precision, compression, ns_prefix):
Collaborator

please add type annotation and docstring.
Also can you replace i with a more meaningful name like spectrum_index

ms_level = row.get('ms_level', 1)
cv_param = ET.SubElement(spectrum, ns_prefix + "cvParam")
cv_param.set("cvRef", "MS")
cv_param.set("accession", "MS:1000511")
Collaborator

All of the controlled vocabulary is defined here:
https://github.com/HUPO-PSI/psi-ms-CV/blob/master/psi-ms.obo

Could we reuse this in some way or at least put it in a constants file:

CV_REF = "cvRef"
ACCESSION_CENTROIDED = "MS:1000127"
ACCESSION_PROFILE = "MS:1000128"
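One way to act on this suggestion is a small constants module keyed by the PSI-MS CV terms the writer actually uses. This is a hypothetical sketch (module layout and names are ours; the accessions are from the psi-ms.obo vocabulary linked above):

```python
from typing import NamedTuple


class CVTerm(NamedTuple):
    """A controlled-vocabulary term: accession plus its canonical name."""
    accession: str
    name: str


# Terms used in the spectrum-level cvParams shown in this PR
MS_LEVEL = CVTerm("MS:1000511", "ms level")
CENTROID_SPECTRUM = CVTerm("MS:1000127", "centroid spectrum")
PROFILE_SPECTRUM = CVTerm("MS:1000128", "profile spectrum")
SCAN_START_TIME = CVTerm("MS:1000016", "scan start time")


def spectrum_mode_term(centroided: bool) -> CVTerm:
    """Replace the inline ternary on accession strings with a named lookup."""
    return CENTROID_SPECTRUM if centroided else PROFILE_SPECTRUM
```

Call sites then read `spectrum_mode_term(ms_data.centroided).accession` instead of repeating raw accession strings.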

cv_param.set("unitAccession", "UO:0000010")
cv_param.set("unitName", "second")

# Add precursor information for MS2+ spectra
Collaborator

can we move this into a function?

from tqdm import tqdm
import numpy as np

def _create_spectrum_xml(i, row, ms_data, binary_precision, compression, ns_prefix):
Collaborator

This is quite a large function. Could we split this into a high-level recipe?

def _create_spectrum_xml()
   build_spectrum_node()
   if level > 1:
      build_precursor_node()
   build_other_stuff()

Contributor

@mschwoer mschwoer left a comment

just some very quick comments



if __name__ == "__main__":
    unittest.main()  # No newline at end of file
Contributor

please use pytest framework
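For reference, the `unittest`-style test above could be rewritten in pytest idiom roughly like this (a sketch, not the PR's test; the writer stand-in and file names are hypothetical, and `tmp_path` is pytest's built-in temporary-directory fixture):

```python
import xml.etree.ElementTree as ET

NS = {"mzml": "http://psi.hupo.org/ms/mzml"}


def write_minimal_mzml(path):
    """Stand-in for the writer under test: emit a tiny parseable document."""
    path.write_text(
        '<mzML xmlns="http://psi.hupo.org/ms/mzml">'
        '<run><spectrumList count="3"/></run></mzML>'
    )
    return path


def test_spectrum_count(tmp_path):
    # Plain asserts replace unittest's assertEqual; no TestCase class needed
    root = ET.parse(write_minimal_mzml(tmp_path / "out.mzML")).getroot()
    spectrum_list = root.findall(".//mzml:spectrumList", NS)
    assert len(spectrum_list) == 1
    assert spectrum_list[0].get("count") == "3"
```

With this style the `if __name__ == "__main__": unittest.main()` block disappears entirely; pytest discovers `test_*` functions on its own.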

spectrum.set("id", f"scan={i}")
spectrum.set("defaultArrayLength", "0")

def _add_spectrum(self, spectrum_list, i, row):
Contributor

this method is quite long -> could be split

self.ns_uri = "http://psi.hupo.org/ms/mzml"
self.ns_prefix = "{" + self.ns_uri + "}"

def write(self):
Contributor

this method is quite long -> could be split

@GeorgWa
Collaborator

GeorgWa commented Apr 23, 2025

Hi @mlorenz49, I added some initial review but realised that this code still contains some redundancy and two versions of the writer. Could you refactor it so it's only a single version? I will then perform a full review.

Some considerations for this:

  • Let's aim for a very simple first version without multiprocessing.
  • I would also strongly suggest against using mmap as this has been source of bugs in other modules.
  • I would suggest a minimal solution. While fields like instrument configuration and file description are possible, they are not needed for DIANN or Spectronaut, so I would skip them for the first solution.

@mlorenz49
Contributor Author

Created a new PR with a simplified version of the MzMLWriter class in #100
