
Conversation

mlorenz49 (Contributor)

Introduces a new MzMLWriter class to export mass spectrometry data from MSData_Base objects to standard mzML and indexed mzML file formats. Key features include:

  • Support for both regular and indexed mzML output
  • Parallel processing of spectrum data for improved performance
  • Correct implementation of controlled vocabulary parameters with required value attributes
  • Binary encoding of m/z and intensity arrays with optional zlib compression
  • Support for centroid and profile spectra
  • Complete metadata representation including instrument configuration, file description, and processing information
  • Proper calculation of byte offsets for efficient random access in indexed files
  • Automatic generation of SHA-1 file checksums
  • XML formatting according to the PSI standard specification

This writer enables interoperability with common MS analysis tools like FragPipe, MSConvert, and other applications that consume mzML format.

Added unit tests to validate the writer's functionality and output file structure compliance.
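To make the "binary encoding with optional zlib compression" bullet concrete, here is a hedged sketch (not the PR's code) of the usual mzML binary payload pipeline: pack floats little-endian at the requested precision, optionally zlib-compress, then base64-encode. Function names are illustrative only.

```python
import base64
import struct
import zlib


def encode_binary(values, precision=64, compression=None):
    """Pack floats little-endian, optionally zlib-compress, then base64-encode."""
    fmt = "<%d%s" % (len(values), "d" if precision == 64 else "f")
    raw = struct.pack(fmt, *values)
    if compression == "zlib":
        raw = zlib.compress(raw)
    return base64.b64encode(raw).decode("ascii")


def decode_binary(text, count, precision=64, compression=None):
    """Inverse of encode_binary; count is the number of floats encoded."""
    raw = base64.b64decode(text)
    if compression == "zlib":
        raw = zlib.decompress(raw)
    fmt = "<%d%s" % (count, "d" if precision == 64 else "f")
    return list(struct.unpack(fmt, raw))


mz = [100.0, 200.5, 300.25]
encoded = encode_binary(mz, compression="zlib")
assert decode_binary(encoded, len(mz), compression="zlib") == mz
```

The round trip is exact because the pack/unpack format mirrors the encoding parameters; readers recover the precision and compression from the cvParams on each binaryDataArray.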

hdf.ms_data = {"spectrum_df": self.spectrum_df, "peak_df": self.peak_df}

The method stores the MzMLWriter instance in an instance attribute 'self.writer', but it doesn't need to do this. Creating a local variable and letting it be garbage collected after writing is complete would be a better approach. The writer object isn't used elsewhere after the save_mzml method completes.

    def save_mzml(
        self, mzml_file_path: str, binary_precision=64, compression=None, indexed=False, process_count=1, batch_size=5000
    ):
        "
        Save data into mzML file format

        Parameters
        ----------
        mzml_file_path : str
            Absolute or relative path of the mzML file to create.
        binary_precision : int, optional
            Binary encoding precision (32 or 64 bit), by default 64
        compression : str, optional
            Compression method (None, 'zlib'), by default None
        indexed : bool, optional
            Whether to create an indexed mzML file, by default False
        process_count : int, optional
            Number of processes to use for writing, by default 1
        batch_size : int, optional
            Batch size for writing spectra, by default 5000
        "

        # Create MzMLWriter instance with appropriate parameters
        writer = MzMLWriter(
            self,
            mzml_file_path,
            binary_precision=binary_precision,
            compression=compression,
            indexed=indexed,
            process_count=process_count,
            batch_size=batch_size,
        )
        # Write the mzML file
        writer.write()

cv_param = ET.SubElement(scan, ns_prefix + "cvParam")
cv_param.set("cvRef", "MS")
cv_param.set("accession", "MS:1000016")
cv_param.set("name", "scan start time")

Improved the comment to explain why an empty value attribute is needed. The attribute is required by the mzML schema even though no value is needed for this parameter.

    # Add centroid/profile indication
    cv_param = ET.SubElement(spectrum, ns_prefix + "cvParam")
    cv_param.set("cvRef", "MS")
    cv_param.set("accession", "MS:1000127" if ms_data.centroided else "MS:1000128")
    cv_param.set("name", "centroid spectrum" if ms_data.centroided else "profile spectrum")
    cv_param.set("value", "")  # Empty value attribute required by schema


# Add collision energy if available
if 'nce' in row and row['nce'] > 0:
    cv_param = ET.SubElement(activation, ns_prefix + "cvParam")

Improved the comment to explain why an empty value attribute is needed. The attribute is required by the mzML schema even though no value is needed for this parameter.

            cv_param.set("value", "")  # Empty value attribute required by schema
            cv_param.set("unitCvRef", "MS")
            cv_param.set("unitAccession", "MS:1000131")
            cv_param.set("unitName", "number of detector counts")

The mmap call can fail on some systems if the file is empty or very small. Adding a try block would make the code more robust when dealing with edge cases.

def _process_offset_batch_mmap(args):
    """Process a batch of spectrum IDs using memory mapping"""
    file_path, start_idx, end_idx = args
    results = {}
    
    # Open the file using memory mapping (shared memory view)
    with open(file_path, 'rb') as f:
        import mmap
        try:
            with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:


def _process_offset_batch_mmap(args):
    """Process a batch of spectrum IDs using memory mapping"""
    file_path, start_idx, end_idx = args

Adding a fallback when memory mapping fails due to file size issues. This makes the code more robust on various systems and with edge cases like empty files.

        except (ValueError, OSError) as e:
            print(f"Warning: Could not memory map file: {e}")
            # Fallback to regular file reading
            return {}
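The fallback above returns an empty dict, which silently drops all offsets. A minimal sketch of what a real plain-read fallback could look like (function name is hypothetical; the `<spectrum ` tag and `id="..."` attribute format are assumptions based on the mzML snippets in this PR):

```python
def offsets_by_scan(file_path):
    """Scan the file for <spectrum ...> opening tags and record byte offsets,
    keyed by each spectrum's id attribute. No memory mapping involved."""
    offsets = {}
    with open(file_path, "rb") as f:
        content = f.read()
    pos = 0
    while True:
        pos = content.find(b"<spectrum ", pos)
        if pos == -1:
            break
        # Extract the id="..." attribute from the opening tag, if present
        end = content.find(b">", pos)
        tag = content[pos:end]
        id_start = tag.find(b'id="')
        if id_start != -1:
            id_end = tag.find(b'"', id_start + 4)
            offsets[tag[id_start + 4:id_end].decode()] = pos
        pos = end
    return offsets
```

Reading the whole file into memory is fine for typical index passes and avoids the mmap edge cases on empty or very small files that this comment thread is about.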


def _calculate_offsets(self, temp_file_path):
    """Calculate byte offsets for each spectrum using memory-mapped file"""
    # Create batch ranges for parallelization

Adding error handling when reading the temporary file. This prevents potential crashes if the file can't be read completely for any reason.

            with open(self.output_path + ".temp", 'rb') as f:
                try:
                    content = f.read()
                except Exception as e:
                    print(f"Error reading temporary file: {e}")
                    content = b""
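Relatedly, the SHA-1 checksum mentioned in the feature list can be computed in chunks so a large output file never has to be read into memory at once. A minimal sketch (function name is ours; note that the indexed mzML schema defines the exact byte range the hash must cover, so the real writer must hash up to the checksum element rather than the whole file):

```python
import hashlib


def file_sha1(path, chunk_size=1 << 20):
    """Incremental SHA-1 over a file, read in 1 MiB chunks."""
    sha1 = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            sha1.update(chunk)
    return sha1.hexdigest()
```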

first_spectrum_number = 0
last_spectrum_number = spectrum_count - 1

# Create batch boundaries

The original code has potential issues with attribute checking and accessing spectrum_df['analyzer'] when it might not exist. This version adds more robust validation to prevent AttributeError and KeyError exceptions.

        analyzer_type = "orbitrap"
        if (
            hasattr(self.ms_data, "spectrum_df")
            and hasattr(self.ms_data, "auxiliary_items")
            and "analyzer" in self.ms_data.auxiliary_items
            and "analyzer" in self.ms_data.spectrum_df.columns
            and not self.ms_data.spectrum_df["analyzer"].empty
        ):
            analyzer_type = self.ms_data.spectrum_df["analyzer"].iloc[0]
        self._add_cv_param(analyzer, "MS", "MS:1000484", analyzer_type, "")


# Add binary data arrays
binary_list = ET.SubElement(spectrum, self.ns_prefix + "binaryDataArrayList")
binary_list.set("count", "2") # m/z and intensity

The str() conversion is needed in case a numeric value is passed. This change ensures that numeric values will be properly converted to strings before being added as XML attributes.

    def _add_cv_param(self, parent, cv_ref, accession, name, value="", unit_cv_ref=None, 
                    unit_accession=None, unit_name=None):
        """Helper method to add a CV parameter"""
        cv_param = ET.SubElement(parent, self.ns_prefix + "cvParam")
        cv_param.set("cvRef", cv_ref)
        cv_param.set("accession", accession)
        cv_param.set("name", name)
        
        # Always include a value attribute, even if it's empty
        cv_param.set("value", str(value) if value is not None else "")
            
        if unit_cv_ref:
            cv_param.set("unitCvRef", unit_cv_ref)
            cv_param.set("unitAccession", unit_accession)
            cv_param.set("unitName", unit_name)

import os

The Path import is unused in this test file and should be removed.

import os
import tempfile
import unittest
import xml.etree.ElementTree as ET
import numpy as np
import pytest
from pathlib import Path

root = tree.getroot()
except Exception as e:
pytest.fail(f"Failed to parse the generated indexed mzML file: {e}")

Improved the comment to make it clear that the expected number of spectra comes from the MockMSData setup.

        # Check spectrum count
        spectrum_list = root.findall(".//mzml:spectrumList", ns)
        assert len(spectrum_list) == 1
        assert spectrum_list[0].get("count") == "3"  # We created 3 spectra in MockMSData



if __name__ == "__main__":
    unittest.main()  # No newline at end of file

The file is missing a newline at the end of the file, which is a minor style issue.


from tqdm import tqdm
import numpy as np

def _create_spectrum_xml(i, row, ms_data, binary_precision, compression, ns_prefix):
Collaborator

please add type annotation and docstring.
Also can you replace i with a more meaningful name like spectrum_index

ms_level = row.get('ms_level', 1)
cv_param = ET.SubElement(spectrum, ns_prefix + "cvParam")
cv_param.set("cvRef", "MS")
cv_param.set("accession", "MS:1000511")
Collaborator

All of the controlled vocabulary is defined here:
https://github.com/HUPO-PSI/psi-ms-CV/blob/master/psi-ms.obo

Could we reuse this in some way or at least put it in a constants file:

CV_REF = "cvRef"
ACCESSION_CENTROIDED = "MS:1000127"
ACCESSION_PROFILE = "MS:1000128"
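One way to act on this suggestion is a small constants module keyed by the PSI-MS CV terms the writer actually uses. This is a hypothetical sketch (module layout and names are ours; the accessions are from the psi-ms.obo vocabulary linked above):

```python
from typing import NamedTuple


class CVTerm(NamedTuple):
    """A controlled-vocabulary term: accession plus its canonical name."""
    accession: str
    name: str


# Terms used in the spectrum-level cvParams shown in this PR
MS_LEVEL = CVTerm("MS:1000511", "ms level")
CENTROID_SPECTRUM = CVTerm("MS:1000127", "centroid spectrum")
PROFILE_SPECTRUM = CVTerm("MS:1000128", "profile spectrum")
SCAN_START_TIME = CVTerm("MS:1000016", "scan start time")


def spectrum_mode_term(centroided: bool) -> CVTerm:
    """Replace the inline ternary on accession strings with a named lookup."""
    return CENTROID_SPECTRUM if centroided else PROFILE_SPECTRUM
```

Call sites then read `spectrum_mode_term(ms_data.centroided).accession` instead of repeating raw accession strings.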

cv_param.set("unitAccession", "UO:0000010")
cv_param.set("unitName", "second")

# Add precursor information for MS2+ spectra
Collaborator

can we move this into a function?

from tqdm import tqdm
import numpy as np

def _create_spectrum_xml(i, row, ms_data, binary_precision, compression, ns_prefix):
Collaborator

This is quite a large function. Could we split this into a high-level recipe?

def _create_spectrum_xml()
   build_spectrum_node()
   if level > 1:
      build_precursor_node()
   build_other_stuff()

Contributor

@mschwoer mschwoer left a comment

just some very quick comments



if __name__ == "__main__":
    unittest.main()  # No newline at end of file
Contributor

please use pytest framework
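For reference, the `unittest`-style test above could be rewritten in pytest idiom roughly like this (a sketch, not the PR's test; the writer stand-in and file names are hypothetical, and `tmp_path` is pytest's built-in temporary-directory fixture):

```python
import xml.etree.ElementTree as ET

NS = {"mzml": "http://psi.hupo.org/ms/mzml"}


def write_minimal_mzml(path):
    """Stand-in for the writer under test: emit a tiny parseable document."""
    path.write_text(
        '<mzML xmlns="http://psi.hupo.org/ms/mzml">'
        '<run><spectrumList count="3"/></run></mzML>'
    )
    return path


def test_spectrum_count(tmp_path):
    # Plain asserts replace unittest's assertEqual; no TestCase class needed
    root = ET.parse(write_minimal_mzml(tmp_path / "out.mzML")).getroot()
    spectrum_list = root.findall(".//mzml:spectrumList", NS)
    assert len(spectrum_list) == 1
    assert spectrum_list[0].get("count") == "3"
```

With this style the `if __name__ == "__main__": unittest.main()` block disappears entirely; pytest discovers `test_*` functions on its own.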

spectrum.set("id", f"scan={i}")
spectrum.set("defaultArrayLength", "0")

def _add_spectrum(self, spectrum_list, i, row):
Contributor

this method is quite long -> could be split

self.ns_uri = "http://psi.hupo.org/ms/mzml"
self.ns_prefix = "{" + self.ns_uri + "}"

def write(self):
Contributor

this method is quite long -> could be split

@GeorgWa
Collaborator

GeorgWa commented Apr 23, 2025

Hi @mlorenz49, I added some initial review but realised that this code still contains some redundancy and two versions of the writer. Could you refactor it so it's only a single version? I will then perform a full review.

Some considerations for this:

  • Let's aim for a very simple first version without multiprocessing.
  • I would also strongly suggest against using mmap as this has been source of bugs in other modules.
  • I would suggest a minimal solution. While fields like instrument configuration and file description are possible, they are not needed for DIANN or Spectronaut, so I would skip them for the first solution.

@mlorenz49
Contributor Author

Created a new PR with a simplified version of the MzMLWriter class in #100
