From f09e7fa888eeaea4419e0af4d5d3127107e3d69f Mon Sep 17 00:00:00 2001 From: Chris Bielow Date: Wed, 6 Nov 2024 18:17:38 +0100 Subject: [PATCH 1/8] - lots of fixes affecting rendering - fixes for correctness (referencing the correct lines etc) - more elaborate explanation of terms in texts (protein accession etc) --- docs/README.md | 7 +++ docs/source/user_guide/background.rst | 4 +- docs/source/user_guide/chemistry.rst | 2 +- docs/source/user_guide/digestion.rst | 20 +++++++- docs/source/user_guide/glossary.rst | 4 +- .../source/user_guide/identification_data.rst | 49 +++++++++---------- docs/source/user_guide/ms_data.rst | 2 +- .../user_guide/oligonucleotides_rna.rst | 29 ++++++++++- .../user_guide/other_ms_data_formats.rst | 2 + docs/source/user_guide/peptides_proteins.rst | 2 +- docs/source/user_guide/spectrum_alignment.rst | 4 +- 11 files changed, 87 insertions(+), 38 deletions(-) diff --git a/docs/README.md b/docs/README.md index 9e694da41..651d844a6 100644 --- a/docs/README.md +++ b/docs/README.md @@ -8,9 +8,16 @@ Preparation Install Sphinx (which is a Python package) and some of its modules/plugins. We recommend doing this in a [python venv](https://docs.python.org/3/library/venv.html). + # create it: python -m venv /path/to/myenv + # activate it, e.g. + + # Linux: source /bin/activate + + # Windows: + c:\path\to\myenv\Scripts\activate.bat Once the environment is active, you can install all required python packages using diff --git a/docs/source/user_guide/background.rst b/docs/source/user_guide/background.rst index bc382bbf7..74ce519de 100644 --- a/docs/source/user_guide/background.rst +++ b/docs/source/user_guide/background.rst @@ -221,10 +221,8 @@ Refer to the image below for a diagrammatic representation of the :term:`quadrup .. image:: img/introduction/quadrupole-analyzer.png -.. raw:: html +.. admonition:: Video -
-

Video

For more information on :term:`quadrupole` analyzers, view this video.
diff --git a/docs/source/user_guide/chemistry.rst b/docs/source/user_guide/chemistry.rst index e839e8fd0..46cbe2b6b 100644 --- a/docs/source/user_guide/chemistry.rst +++ b/docs/source/user_guide/chemistry.rst @@ -315,7 +315,7 @@ which produces Ethanol has 6 hydrogen atoms -Note how in line 5 we were able to make a new molecule by adding existing +Note how in line 3 we were able to make a new molecule by adding existing molecules (for example by adding two :py:class:`~.EmpiricalFormula` objects). In this case, we illustrated how to make ethanol by adding a :chem:`CH2` methyl group to an existing methanol molecule. Note that OpenMS describes sum formulas with the diff --git a/docs/source/user_guide/digestion.rst b/docs/source/user_guide/digestion.rst index 87d68091e..142de7c6c 100644 --- a/docs/source/user_guide/digestion.rst +++ b/docs/source/user_guide/digestion.rst @@ -1,6 +1,21 @@ Digestion ========= +In ``top-down proteomics``, whole proteins are measured in the mass spec. +It is a hard problem to know the exact protein sequence, since proteins +can be very large, i.e. have a very long sequence of constituting amino acids. + +However, in an orthogonal approach called ``bottom-up proteomics``, +it is usually more beneficial to first cut proteins into smaller +chunks at defined positions by using enzymatic digestion. The resulting peptides +are lighter in mass, have less charge, and their sequence can be readily determined +using MS/MS in many cases. In a subsequent step, one needs to infer which proteins +were present in the sample, given a set of peptide sequences - a process called protein inference. +The usual enzyme of choice for bottom-up proteomics is ``Trypsin`` (sometimes in combination with Lys-C). + +We will now learn how to do digestion of protein sequences in-silico, so you can predict which +peptides you can expect to observe in the data and even generate theoretical spectra for them. 
+ Proteolytic Digestion with Trypsin ********************************** @@ -24,7 +39,7 @@ OpenMS has classes for proteolytic digestion which can be used as follows: print(result[4].toString()) len(result) # 82 peptides -Very short peptides or even single amino acid digestion products are often discarded as they usually contain little information (e.g., can't be used to identify proteins). +Very short peptides or even single amino acid digestion products are often discarded as they usually contain little information (e.g., are shared by many proteins, making them useless to identify specific proteins, or will not be detected in a real mass spectrum, since their peptide mass is below the usual minimal recorded mass). We now only generate digestion products with a length of :math:`7` to :math:`40`. .. code-block:: python @@ -35,7 +50,7 @@ We now only generate digestion products with a length of :math:`7` to :math:`40` for s in result: print(s.toString()) -Enzymatic digestion is often not perfect and sometimes enzymes miss cutting a peptide. +Enzymatic digestion is often not perfect and sometimes enzymes miss a cutting position (aka cleavage site), resulting in some larger peptides. These are a sequence of two or even more consecutive peptides within the protein sequence. We now allow up to two missed cleavages. .. code-block:: python @@ -51,6 +66,7 @@ We now allow up to two missed cleavages. Proteolytic Digestion with Lys-C ******************************** +In the previous example we used Trypsin as our enzyme of choice. We can of course also use different enzymes, these are defined in the ``Enzymes.xml`` file and can be accessed using the :py:class:`~.EnzymesDB` object diff --git a/docs/source/user_guide/glossary.rst b/docs/source/user_guide/glossary.rst index 5526d81a6..7d752aee9 100644 --- a/docs/source/user_guide/glossary.rst +++ b/docs/source/user_guide/glossary.rst @@ -38,7 +38,7 @@ A glossary of common terms used throughout OpenMS documentation. 
TOF time-of-flight Time-of-flight (TOF) is the time taken by an object, particle or wave (be it acoustic, electromagnetic, e.t.c) to travel a distance through a medium. - TOF analyzers can obtain good, but not ultra-high resolution, such as :term:`orbitrap`s. + TOF analyzers can obtain good, but not ultra-high resolution, such as an :term:`orbitrap`. quadrupole A mass filter allowing one mass channel at a time to reach the detector as the mass range is scanned. A low resolution MS analyzer. @@ -164,7 +164,7 @@ A glossary of common terms used throughout OpenMS documentation. feature maps feature map - A feature map is a collection of :term:`feature`s identified from a single experiment. + A feature map is a collection of :term:`feature`\ s identified from a single experiment. One feature map usually contains many features. OpenMS represents a feature map using the class `FeatureMap `_. consensus features diff --git a/docs/source/user_guide/identification_data.rst b/docs/source/user_guide/identification_data.rst index c8ae92413..e05a78872 100644 --- a/docs/source/user_guide/identification_data.rst +++ b/docs/source/user_guide/identification_data.rst @@ -3,46 +3,47 @@ Identification Data In OpenMS, identifications of peptides, proteins and small molecules are stored in dedicated data structures. These data structures are typically stored to disc -as idXML or mzIdentML file. The highest-level structure is +as idXML or mzIdentML files. The highest-level structure is :py:class:`~.ProteinIdentification`. It stores all identified proteins of an identification -run as :py:class:`~.ProteinHit` objects plus additional metadata (search parameters, etc.). Each -:py:class:`~.ProteinHit` contains the actual protein accession, an associated score, and -(optionally) the protein sequence. +run (usually all IDs from a single HPLC-MS run) as :py:class:`~.ProteinHit` objects plus additional metadata (search parameters, etc.). 
Each +:py:class:`~.ProteinHit` represents a potential protein which may be present in the sample. The `ProteinHit` contains the actual protein identifier (also known as `accession`), an associated score, and +the protein sequence. The latter may be omitted to reduce memory consumption. A :py:class:`~.PeptideIdentification` object stores the data corresponding to a single identified spectrum or feature. It has members for the retention time, m/z, and a vector of :py:class:`~.PeptideHit` objects. Each :py:class:`~.PeptideHit` -stores the information of a specific :term:`peptide-spectrum match` or :term:`PSM` (e.g., the score -and the peptide sequence). Each :py:class:`~.PeptideHit` also contains a vector of +stores the information of a specific :term:`peptide-spectrum match` (:term:`PSM`), e.g., the score +and the peptide sequence. Each :py:class:`~.PeptideHit` also contains a vector of :py:class:`~.PeptideEvidence` objects which store the reference to one or more (in the case the peptide maps to multiple proteins) proteins and the position therein. .. NOTE:: - Protein Ids are linked to peptide Ids by a common identifier (e.g., a unique string of time and date of the search). + Proteins and their corresponding peptides are linked by a common identifier (e.g., a unique string of time and date of the search). The Identifier can be set using the :py:meth:`~.ProteinIdentification.setIdentifier` method in :py:class:`~.ProteinIdentification` and :py:class:`~.PeptideIdentification`. - Similarly :py:meth:`~.ProteinIdentification.getIdentifier` can be used to check the link between them. - With the link one can retrieve search meta data (which is stored at the protein level) for individual peptide Ids. + Similarly :py:meth:`~.ProteinIdentification.getIdentifier` can be used to check the identifier. + Using this link one can retrieve search meta data (which is stored at the protein level) for individual peptides. .. 
code-block:: python - :linenos: - import pyopenms as oms + :linenos: + + import pyopenms as oms - protein_id = oms.ProteinIdentification() - peptide_id = oms.PeptideIdentification() + protein_id = oms.ProteinIdentification() + peptide_id = oms.PeptideIdentification() - # Sets the Identifier - protein_id.setIdentifier("IdentificationRun1") - peptide_id.setIdentifier("IdentificationRun1") + # Sets the Identifier + protein_id.setIdentifier("IdentificationRun1") + peptide_id.setIdentifier("IdentificationRun1") - # Prints the Identifier - print("Protein Identifier -", protein_id.getIdentifier()) - print("Peptide Identifier -", peptide_id.getIdentifier()) + # Prints the Identifier + print("Protein Identifier -", protein_id.getIdentifier()) + print("Peptide Identifier -", peptide_id.getIdentifier()) .. code-block:: output - - Protein Identifier - IdentificationRun1 - Peptide Identifier - IdentificationRun1 + + Protein Identifier - IdentificationRun1 + Peptide Identifier - IdentificationRun1 Protein Identification *********************** @@ -183,9 +184,7 @@ Storage on Disk Finally, we can store the peptide and protein identification data in a :py:class:`~.IdXMLFile` (a OpenMS internal file format which we have previously -discussed `here -`_) -which we would do as follows: +discussed :ref:`anchor-other-id-data`) which we would do as follows: .. 
code-block:: python :linenos: diff --git a/docs/source/user_guide/ms_data.rst b/docs/source/user_guide/ms_data.rst index 2daa887e3..0e772f1ea 100644 --- a/docs/source/user_guide/ms_data.rst +++ b/docs/source/user_guide/ms_data.rst @@ -54,7 +54,7 @@ First we create a mass spectrum and insert peaks with descending mass-to-charge First peak: 500.0 1.0 -Note how lines 11-12 (as well as line 19) use the direct access to the +Note how lines 12-13 (as well as line 16) use the direct access to the :py:class:`~.Peak1D` objects (explicit iteration through the :py:class:`~.MSSpectrum` object, which is convenient but slow since a new :py:class:`~.Peak1D` object needs to be created each time). diff --git a/docs/source/user_guide/oligonucleotides_rna.rst b/docs/source/user_guide/oligonucleotides_rna.rst index 478dd8ee6..db47f4785 100644 --- a/docs/source/user_guide/oligonucleotides_rna.rst +++ b/docs/source/user_guide/oligonucleotides_rna.rst @@ -100,6 +100,18 @@ Similarly to before for amino acid sequences, we can also generate internal frag print("RNA Oligo w4++ ion", suffix, "has mz", mz) print("RNA Oligo w4++ ion", suffix, "has molecular formula", w4_formula) +Which will output + +.. code-block:: output + + 10 + 3206.4885302061 + =================================== + RNA Oligo w4++ ion AUGG has mz 672.5989092135458 + RNA Oligo w4++ ion AUGG has molecular formula C39H51N17O29P4 + + + Modified Oligonucleotides ************************* @@ -140,6 +152,21 @@ sequence as follows: for i in range(oligo_mod.size()): print(oligo_mod[i].isModified()) +Which will output + +.. 
code-block:: output + + + RNA Oligo A[m1A][Gm]A has molecular formula C42H53N20O23P3 and length 4 + =================================== + RNA Oligo A[m1A][Gm]A has unmodified sequence AAGA + '1-methyladenosine' + '"' + 'A' + False + True + True + False DNA, RNA and Protein ******************** @@ -147,7 +174,7 @@ DNA, RNA and Protein We can also work with DNA and RNA sequences in combination with the BioPython library (you can install BioPython with ``pip install biopython``): -.. code-block:: pseudocode +.. code-block:: python :linenos: from Bio.Seq import Seq diff --git a/docs/source/user_guide/other_ms_data_formats.rst b/docs/source/user_guide/other_ms_data_formats.rst index f608016ec..f1268afb6 100644 --- a/docs/source/user_guide/other_ms_data_formats.rst +++ b/docs/source/user_guide/other_ms_data_formats.rst @@ -1,6 +1,8 @@ Other MS Data Formats ============================= +.. _anchor-other-id-data: + Identification Data (idXML, mzIdentML, pepXML, protXML) ------------------------------------------------------- diff --git a/docs/source/user_guide/peptides_proteins.rst b/docs/source/user_guide/peptides_proteins.rst index c3c5dec36..89d5c262b 100644 --- a/docs/source/user_guide/peptides_proteins.rst +++ b/docs/source/user_guide/peptides_proteins.rst @@ -260,7 +260,7 @@ the mass of the residue in square brackets. For example peptide "DFPIAMGER" with an oxidized methionine. There are multiple ways to specify modifications, and ``AASequence.fromString("DFPIAM(UniMod:35)GER")``, ``AASequence.fromString("DFPIAM[+16]GER")`` and -``AASequence.fromString("DFPIAM[147]GER")`` are all equivalent). +``AASequence.fromString("DFPIAM[147]GER")`` are all equivalent. .. code-block:: python diff --git a/docs/source/user_guide/spectrum_alignment.rst b/docs/source/user_guide/spectrum_alignment.rst index 1969074c4..149e3b835 100644 --- a/docs/source/user_guide/spectrum_alignment.rst +++ b/docs/source/user_guide/spectrum_alignment.rst @@ -61,7 +61,7 @@ which produces .. 
image:: img/spec_alignment_1.png -Now we want to find matching peaks between observed and theoretical mass spectrum. +Now we want to find matching peaks (in m/z) between the observed and the theoretical spectrum (note: we ignore the peak intensity during the alignment). .. code-block:: python :linenos: @@ -69,7 +69,7 @@ Now we want to find matching peaks between observed and theoretical mass spectru alignment = [] spa = oms.SpectrumAlignment() p = spa.getParameters() - # use 0.5 Da tolerance (Note: for high-resolution data we could also use ppm by setting the is_relative_tolerance value to true) + # use 0.5 Da tolerance for m/z (Note: for high-resolution data we could also use ppm by setting the is_relative_tolerance value to true) p.setValue("tolerance", 0.5) p.setValue("is_relative_tolerance", "false") spa.setParameters(p) From f6bf6a45e67185c37d272438d6bc581c6aaff068 Mon Sep 17 00:00:00 2001 From: Chris Bielow Date: Tue, 12 Nov 2024 15:01:07 +0100 Subject: [PATCH 2/8] sort glossary for easier maintenance --- docs/source/user_guide/glossary.rst | 222 ++++++++++++++-------------- 1 file changed, 112 insertions(+), 110 deletions(-) diff --git a/docs/source/user_guide/glossary.rst b/docs/source/user_guide/glossary.rst index 7d752aee9..e8829b662 100644 --- a/docs/source/user_guide/glossary.rst +++ b/docs/source/user_guide/glossary.rst @@ -6,11 +6,56 @@ A glossary of common terms used throughout OpenMS documentation. .. glossary:: :sorted: - peptide-spectrum match - PSM - A method used in proteomics to identify proteins from a complex mixture. Involves comparing the - mass spectra of peptide fragments generated from a protein sample with a database of predicted - spectra, in order to identify the protein that produced the observed peptides. + C18 + octadecyl + Octadecyl (C18) is an alkyl radical C(18)H(37) derived from an octadecane by removal of one hydrogen atom. 
+ + CID + collision-induced dissociation + Collision-induced dissociation is an MS technique to induce fragmentation of selected ions in the gas phase, which are subjected to a subsequent measurement (see :term:`MS2`). + + consensus features + consensus feature + Features from replicate experiments with similar retention times and m/z values are linked and considered a consensus feature. + A consensus feature contains information on the common retention time and m/z values as well as intensities for each sample. OpenMS represents a consensus feature using the class `ConsensusFeature `_. + + consensus maps + consensus map + A consensus map is a collection of :term:`consensus features` identified from mass spectra across replicate experiments, usually by combining multiple :term:`feature maps`. + One consensus map usually contains many consensus features. OpenMS represents a consensus map using the class `ConsensusMap `_. + + de novo peptide sequencing + A peptide’s amino acid sequence is inferred directly from the precursor peptide mass and tandem + mass spectrum (:term:`MS2` or :term:`MS3`) fragment ions, without comparison to a reference proteome. + + electrospray ionization + ESI + Electrospray ionization (ESI) is a technique used in MS to produce ions. + + FASTA + A text-based format for representing nucleotide or amino acid sequences. + + feature + features + A feature, in the OpenMS terminology, subsumes all m/z signals originating from a single compound at a certain charge state. This includes the isotope pattern and usually spans multiple spectra in retention time (the elution profile). + + feature maps + feature map + A feature map is a collection of :term:`feature`\ s identified from a single experiment. + One feature map usually contains many features. OpenMS represents a feature map using the class `FeatureMap `_. 
+ + high performance liquid chromatography + HPLC + In high performance liquid chromatography (HPLC), analytes are dissolved in a pressurized solvent (mobile phase) + and pumped through a solid adsorbent material (stationary phase) packed into a + capillary column. Physicochemical properties of the analyte determine how strongly it + interacts with the stationary phase. + + iTRAQ + Isobaric tags for relative and absolute quantitation (iTRAQ) is a MS based multiplexing technique designed to identify and quantify proteins from different samples in one single measurement. + + KNIME + An advanced workflow editor which OpenMS provides a plugin for. LC-MS LCMS @@ -24,37 +69,24 @@ A glossary of common terms used throughout OpenMS documentation. LC An analytical technique used to separate molecules of interest. - FASTA - A text-based format for representing nucleotide or amino acid sequences. - - C18 - octadecyl - Octadecyl (C18) is an alkyl radical C(18)H(37) derived from an octadecane by removal of one hydrogen atom. - - ESI - electrospray ionization - Electrospray ionization (ESI) is a technique used in MS to produce ions. - - TOF - time-of-flight - Time-of-flight (TOF) is the time taken by an object, particle or wave (be it acoustic, electromagnetic, e.t.c) to travel a distance through a medium. - TOF analyzers can obtain good, but not ultra-high resolution, such as an :term:`orbitrap`. - - quadrupole - A mass filter allowing one mass channel at a time to reach the detector as the mass range is scanned. A low resolution MS analyzer. - - orbitrap - In MS, an ion trap mass analyzer consisting of an outer barrel-like electrode and a coaxial inner - spindle-like electrode that traps ions in an orbital motion around the spindle. - An ultra-high resolution MS analyzer, capable of resolving fine-isotope structure. + LuciphorAdapter + Adapter for the LuciPHOr2: a site localisation tool of generic post-translational modifications from tandem mass + spectrometry data. 
More information is available in the `OpenMS API reference documentation `__. Mass Spectrometry MS An analytical technique to measure the mass over charge (m/z) ratio of ions along with their abundance. This gives rise to a mass spectrum (with m/z on the x-axis and abundance on the y-axis). - + mass spectrum A visual or numerical representation of a measurement from an MS instrument. A spectrum contains (usually many) pairs of mass-over-charge(m/z)+intensity values. + MascotAdapter + Used to identify peptides in :term:`MS2` spectra. Read more about MascotAdapter in the `OpenMS API reference documentation `__. + + MSGFPlusAdapter + Adapter for the MS-GF+ protein identification (database search) engine. More information is available in the + `OpenMS API reference documentation `__. + MS1 Mass spectra of a sample where only precursor ions (i.e. no fragment ions) can be observed. Usually MS1 spectra are recorded to select targets for MS2 fragmentation. @@ -65,114 +97,84 @@ A glossary of common terms used throughout OpenMS documentation. Tandem MS is a technique where two or more mass analyzers are coupled together using an additional, usually destructive, reaction step to generate fragment ions which increases their abilities to analyse chemical samples. MS3 - Multi-stage MS - - CID - collision-induced dissociation - Collision-induced dissociation is an MS technique to induce fragmentation of selected ions in the gas phase, which are subjected to a subsequent measurement (see :term:`MS2`). - - TOPP - 'TOPP - The OpenMS PiPeline' is a pipeline for the analysis of HPLC-MS data. It consists of several small applications that can be chained to create analysis pipelines tailored for a specific problem. See :term:`TOPP tools`. + Multi-stage MS. - MSGFPlusAdapter - Adapter for the MS-GF+ protein identification (database search) engine. More information is available in the - `OpenMS API reference documentation `__. 
- - LuciphorAdapter - Adapter for the LuciPHOr2: a site localisation tool of generic post-translational modifications from tandem mass - spectrometry data. More information is available in the `OpenMS API reference documentation `__. - - TOPP tools - OpenMS provides a number of applications (executable files) that are chainable in a pipeline/script and each process MS data. - These tools are subdivided into different categories, such as 'File Handling' or 'Peptide Identification'. - All :term:`TOPP` tools are described in the `OpenMS API reference documentation `__. - - UTILS - [deprecated since OpenMS 3.1] Besides :term:`TOPP tools`, OpenMS offers a range of other tools. They are not included in :term:`TOPP` as they - are not part of typical analysis pipelines. Since OpenMS 3.1 all UTILS are :term:`TOPP tools` under the 'Utilities' category. - - TOPPView - TOPPView is a viewer for MS and HPLC-MS data and shipped with every OpenMS release. - - nightly snapshot - Untested installers and containers which are created regularly between official releases and reflect the current development state. - - MascotAdapter - Used to identify peptides in :term:`MS2` spectra. Read more about MascotAdapter in the `OpenMS API reference documentation `__. - - high performance liquid chromatography - HPLC - In high performance liquid chromatography (HPLC), analytes are dissolved in a pressurized solvent (mobile phase) - and pumped through a solid adsorbent material (stationary phase) packed into a - capillary column. Physicochemical properties of the analyte determine how strongly it - interacts with the stationary phase. + mzData + mzdata + mzData was the first attempt by the Proteomics Standards Initiative (PSI) from the Human Proteome Organization (HUPO) + to create a standardized format for MS data. This format is now deprecated, and replaced by mzML. 
mzML mzml The mzML format is an open, XML-based format for mass spectrometer output files, developed by the Proteomics Standard Initiative (PSI) with the full participation of vendors and researchers in order to create a single open format that would be supported by all software. - mzData - mzdata - mzData was the first attempt by the Proteomics Standards Initiative (PSI) from the Human Proteome Organization (HUPO) - to create a standardized format for MS data. This format is now deprecated, and replaced by mzML. - mzXML mzxml mzXML is an open data format for storage and exchange of mass spectroscopy data, developed at the SPC/Institute for Systems Biology. This format is now deprecated, and replaced by mzML. - ProteoWizard - ProteoWizard is a set of open-source, cross-platform tools and libraries for proteomics data analyses. - It provides a framework for unified MS data file access and performs standard chemistry and LCMS dataset computations. + nightly snapshot + Untested installers and containers which are created regularly between official releases and reflect the current development state. + + octadecyl + See :term:`C18`. + + OpenMS API + A C++ interface that allows developers to use OpenMS core library classes and methods. + + orbitrap + In MS, an ion trap mass analyzer consisting of an outer barrel-like electrode and a coaxial inner + spindle-like electrode that traps ions in an orbital motion around the spindle. + An ultra-high resolution MS analyzer, capable of resolving fine-isotope structure. + + peptide-spectrum match + PSM + A method used in proteomics to identify proteins from a complex mixture. Involves comparing the + mass spectra of peptide fragments generated from a protein sample with a database of predicted + spectra, in order to identify the protein that produced the observed peptides. PepNovo PepNovo is a de :term:`de novo peptide sequencing` algorithm for :term:`MS2` spectra. 
- de novo peptide sequencing - A peptide’s amino acid sequence is inferred directly from the precursor peptide mass and tandem - mass spectrum (:term:`MS2` or :term:`MS3`) fragment ions, without comparison to a reference proteome. - - TOPPAS - An assistant for GUI-driven :term:`TOPP` workflow design, build into OpenMS. See `TOPPAS tutorial ` for details. + ProteoWizard + ProteoWizard is a set of open-source, cross-platform tools and libraries for proteomics data analyses. + It provides a framework for unified MS data file access and performs standard chemistry and LCMS dataset computations. - KNIME - An advanced workflow editor which OpenMS provides a plugin for. + quadrupole + A mass filter allowing one mass channel at a time to reach the detector as the mass range is scanned. A low resolution MS analyzer. SILAC stable isotope labeling with amino acids in cell culture Stands for Stable isotope labeling using amino acids in cell culture. - iTRAQ - Isobaric tags for relative and absolute quantitation (iTRAQ) is a MS based multiplexing technique designed to identify and quantify proteins from different samples in one single measurement. - - TMT - Tandem Mass Tag (TMT) is a MS based multiplexing technique designed to identify and quantify proteins from different samples in one single measurement. - SRM Selected reaction monitoring (SRM) is a MS technique for targeted small molecule analysis. SWATH - Sequential acquisition of all theoretical fragment ion spectra (SWATH) uses parially overlapping MS2 scans with wide isolation windows to capture all fragment ions in a data independent analysis (DIA). + Sequential acquisition of all theoretical fragment ion spectra (SWATH) uses partially overlapping MS2 scans with wide isolation windows to capture all fragment ions in a data independent analysis (DIA). - OpenMS API - A C++ interface that allows developers to use OpenMS core library classes and methods. + tandem mass spectrometry + See :term:`MS2`. 
- feature - features - A feature, in the OpenMS terminology, subsumes all m/z signals originating from a single compound at a certain charge state. This includes the isotope pattern and usually spans multiple spectra in retention time (the elution profile). - - feature maps - feature map - A feature map is a collection of :term:`feature`\ s identified from a single experiment. - One feature map usually contains many features. OpenMS represents a feature map using the class `FeatureMap `_. + time-of-flight + TOF + Time-of-flight (TOF) is the time taken by an object, particle or wave (be it acoustic, electromagnetic, etc.) to travel a distance through a medium. + TOF analyzers can obtain good, but not ultra-high resolution, such as an :term:`orbitrap`. - consensus features - consensus feature - Features from replicate experiments with similar retention times and m/z values are linked and considered a consensus feature. - A consensus feature contains information on the common retention time and m/z values as well as intensities for each sample. OpenMS represents a consensus feature using the class `ConsensusFeature `_. + TMT + Tandem Mass Tag (TMT) is a MS based multiplexing technique designed to identify and quantify proteins from different samples in one single measurement. - consensus maps - consensus map - A consensus map is a collection of :term:`consensus features` identified from mass spectra across replicate experiments, usually by combining multiple :term:`feature maps`. - One consensus map usually contains many consensus features. OpenMS represents a consensus map using the class `ConsensusMap `_. + TOPP + 'TOPP - The OpenMS PiPeline' is a pipeline for the analysis of HPLC-MS data. It consists of several small applications that can be chained to create analysis pipelines tailored for a specific problem. See :term:`TOPP tools`. + + TOPPAS + An assistant for GUI-driven :term:`TOPP` workflow design, build into OpenMS. See `TOPPAS tutorial ` for details. 
+ + TOPP tools + OpenMS provides a number of applications (executable files) that are chainable in a pipeline/script and each process MS data. + These tools are subdivided into different categories, such as 'File Handling' or 'Peptide Identification'. + All :term:`TOPP` tools are described in the `OpenMS API reference documentation `__. + + TOPPView + TOPPView is a viewer for \ No newline at end of file From cab3001a39aa1c7ee731d56b1cef60e46b1ac480 Mon Sep 17 00:00:00 2001 From: Chris Bielow Date: Tue, 3 Dec 2024 09:45:41 +0100 Subject: [PATCH 3/8] update dev build instructions for pyOpenMS --- docs/source/community/build_from_source.rst | 74 ++------------------- 1 file changed, 6 insertions(+), 68 deletions(-) diff --git a/docs/source/community/build_from_source.rst b/docs/source/community/build_from_source.rst index 0da77310c..4b02264ed 100644 --- a/docs/source/community/build_from_source.rst +++ b/docs/source/community/build_from_source.rst @@ -3,76 +3,14 @@ Build from Source To install pyOpenMS from :index:`source`, you will first have to compile OpenMS successfully on your platform of choice (note that for MS Windows you will need -to match your compiler and Python version). Please follow the `official -documentation -`_ -in order to compile OpenMS for your platform. Next you will need to install the -following software packages +to match your compiler and Python version). Please follow the +`official OpenMS documentation `_ +in order to compile OpenMS for your platform. -On Microsoft Windows: you need the 64 bit C++ compiler from Visual Studio 2015 -to compile the newest pyOpenMS for Python 3.5, 3.6 or 3.7. This is important, -else you get a clib that is different than the one used for building the Python -executable, and pyOpenMS will crash on import. The OpenMS wiki has `detailed information -`_ -on building pyOpenMS on Windows. - -You can install all necessary Python packages on which pyOpenMS -depends through - -.. 
code-block:: bash - - pip install -U setuptools - pip install -U pip - pip install -U autowrap - pip install -U pytest - pip install -U numpy - pip install -U wheel - -Depending on your systems setup, it may make sense to do this inside a virtual environment - -.. code-block:: bash - - virtualenv pyopenms_venv - source pyopenms_venv/bin/activate - -Next, we will configure the CMake-based OpenMS build system -to enable the pyOpenMS target with the configuration option ``-DPYOPENMS=ON``. -If your are using virtualenv or a specific Python version, -add ``-DPYTHON_EXECUTABLE:FILEPATH=/path/to/python`` to ensure -that the correct Python executable is used. Compiling pyOpenMS can use a lot of -memory and take some time, however you can reduce the memory consumption by -breaking up the compilation into multiple units and compiling in parallel, for -example ``-DPY_NUM_THREADS=2 -DPY_NUM_MODULES=4`` will build 4 modules with 2 -threads. You can now configure pyOpenMS (inside your build folder) with: - -.. code-block:: bash - - cmake -DPYOPENMS=ON - - -Remember, that you can pass the other options as described above to the first -command by adding ``-DOPTION=VALUE`` statements if you need them. - -Now build pyOpenMS (now there should be pyOpenMS specific build targets). -If you are still inside your build folder, you can use "." as the build -folder parameter. - -.. code-block:: bash - - cmake --build $YOURBUILDFOLDER --target pyopenms --config Release - - -Afterwards, test that all went well by running the tests: - -.. code-block:: bash - - ctest -R pyopenms - -Which should execute all the tests and return with all tests passing. +See https://github.com/OpenMS/OpenMS/tree/develop/src/pyOpenMS for installation instructions. Further Questions ***************** -In case the above instructions did not work, please refer to the `Wiki Page -`_, contact the development -team on github or send an email to the OpenMS mailing list. 
+In case the above instructions did not work, please contact the development +team on GitHub (https://github.com/OpenMS/OpenMS/issues) or send an email to the OpenMS mailing list. From 25c621af7529e0a994ccc55652a42fbc6808a703 Mon Sep 17 00:00:00 2001 From: Chris Bielow Date: Tue, 3 Dec 2024 09:46:58 +0100 Subject: [PATCH 4/8] glossary fixes: improved descriptions --- docs/source/user_guide/glossary.rst | 66 ++++++++++++++++++----------- 1 file changed, 42 insertions(+), 24 deletions(-) diff --git a/docs/source/user_guide/glossary.rst b/docs/source/user_guide/glossary.rst index e8829b662..2760dc1c1 100644 --- a/docs/source/user_guide/glossary.rst +++ b/docs/source/user_guide/glossary.rst @@ -7,25 +7,29 @@ A glossary of common terms used throughout OpenMS documentation. :sorted: C18 - octadecyl Octadecyl (C18) is an alkyl radical C(18)H(37) derived from an octadecane by removal of one hydrogen atom. CID collision-induced dissociation - Collision-induced dissociation is an MS technique to induce fragmentation of selected ions in the gas phase, which are subjected to a subsequent measurement (see :term:`MS2`). + Collision-induced dissociation is an MS technique to induce fragmentation of selected ions in the gas phase, + which are subjected to a subsequent measurement (see :term:`MS2`). consensus features consensus feature Features from replicate experiments with similar retention times and m/z values are linked and considered a consensus feature. - A consensus feature contains information on the common retention time and m/z values as well as intensities for each sample. OpenMS represents a consensus feature using the class `ConsensusFeature `_. + A consensus feature contains information on the common retention time and m/z values as well as intensities for each sample. + OpenMS represents a consensus feature using the class `ConsensusFeature + `_. 
consensus maps consensus map - A consensus map is a collection of :term:`consensus features` identified from mass spectra across replicate experiments, usually by combining multiple :term:`feature maps`. - One consensus map usually contains many consensus features. OpenMS represents a consensus map using the class `ConsensusMap `_. + A consensus map is a collection of :term:`consensus features` identified from mass spectra across replicate experiments, + usually by combining multiple :term:`feature maps`. + One consensus map usually contains many consensus features. OpenMS represents a consensus map using + the class `ConsensusMap `_. de novo peptide sequencing - A peptide’s amino acid sequence is inferred directly from the precursor peptide mass and tandem + A peptide's amino acid sequence is inferred directly from the precursor peptide mass and tandem mass spectrum (:term:`MS2` or :term:`MS3`) fragment ions, without comparison to a reference proteome. electrospray ionization @@ -33,16 +37,18 @@ A glossary of common terms used throughout OpenMS documentation. Electrospray ionization (ESI) is a technique used in MS to produce ions. FASTA - A text-based format for representing nucleotide or amino acid sequences. + A text-based file format for representing nucleotide or amino acid sequences. feature features - A feature, in the OpenMS terminology, subsumes all m/z signals originating from a single compound at a certain charge state. This includes the isotope pattern and usually spans multiple spectra in retention time (the elution profile). + A feature, in the OpenMS terminology, subsumes all m/z signals originating from a single compound at a + certain charge state. This includes the isotope pattern and usually spans multiple spectra in retention time (the elution profile). feature maps feature map A feature map is a collection of :term:`feature`\ s identified from a single experiment. - One feature map usually contains many features. 
OpenMS represents a feature map using the class `FeatureMap `_. + One feature map usually contains many features. OpenMS represents a feature map using the class + `FeatureMap `_. high performance liquid chromatography HPLC @@ -52,7 +58,8 @@ A glossary of common terms used throughout OpenMS documentation. interacts with the stationary phase. iTRAQ - Isobaric tags for relative and absolute quantitation (iTRAQ) is a MS based multiplexing technique designed to identify and quantify proteins from different samples in one single measurement. + Isobaric tags for relative and absolute quantitation (iTRAQ) is an MS-based multiplexing technique designed to + identify and quantify proteins from different samples in one single measurement. KNIME An advanced workflow editor which OpenMS provides a plugin for. @@ -71,17 +78,21 @@ A glossary of common terms used throughout OpenMS documentation. LuciphorAdapter Adapter for the LuciPHOr2: a site localisation tool of generic post-translational modifications from tandem mass - spectrometry data. More information is available in the `OpenMS API reference documentation `__. + spectrometry data. More information is available in the `OpenMS API reference documentation + `__. Mass Spectrometry MS - An analytical technique to measure the mass over charge (m/z) ratio of ions along with their abundance. This gives rise to a mass spectrum (with m/z on the x-axis and abundance on the y-axis). + An analytical technique to measure the mass over charge (m/z) ratio of ions along with their abundance. + This gives rise to a mass spectrum (with m/z on the x-axis and abundance on the y-axis). mass spectrum - A visual or numerical representation of a measurement from an MS instrument. A spectrum contains (usually many) pairs of mass-over-charge(m/z)+intensity values. + A visual or numerical representation of a measurement from an MS instrument. + A spectrum contains (usually many) pairs of mass-over-charge(m/z)+intensity values.
MascotAdapter - Used to identify peptides in :term:`MS2` spectra. Read more about MascotAdapter in the `OpenMS API reference documentation `__. + Used to identify peptides in :term:`MS2` spectra. Read more about this adapter in the `OpenMS API reference documentation + `__. MSGFPlusAdapter Adapter for the MS-GF+ protein identification (database search) engine. More information is available in the @@ -93,8 +104,8 @@ A glossary of common terms used throughout OpenMS documentation. MS2 MS/MS - tandem mass spectrometry - Tandem MS is a technique where two or more mass analyzers are coupled together using an additional, usually destructive, reaction step to generate fragment ions which increases their abilities to analyse chemical samples. + Tandem MS is a technique where two or more mass analyzers are coupled together using an additional, usually destructive, + reaction step to generate fragment ions which increases their abilities to analyse chemical samples. MS3 Multi-stage MS. @@ -142,7 +153,8 @@ A glossary of common terms used throughout OpenMS documentation. It provides a framework for unified MS data file access and performs standard chemistry and LCMS dataset computations. quadrupole - A mass filter allowing one mass channel at a time to reach the detector as the mass range is scanned. A low resolution MS analyzer. + A low resolution MS analyzer. + A mass filter allowing one mass channel at a time to reach the detector as the mass range is scanned. SILAC stable isotope labeling with amino acids in cell culture @@ -152,29 +164,35 @@ A glossary of common terms used throughout OpenMS documentation. Selected reaction monitoring (SRM) is a MS technique for targeted small molecule analysis. SWATH - Sequential acquisition of all theoretical fragment ion spectra (SWATH) uses partially overlapping MS2 scans with wide isolation windows to capture all fragment ions in a data independent analysis (DIA). 
+ Sequential acquisition of all theoretical fragment ion spectra (SWATH) uses partially overlapping MS2 + scans with wide isolation windows to capture all fragment ions in a data independent analysis (DIA). tandem mass spectrometry See :term:`MS2`. time-of-flight TOF - Time-of-flight (TOF) is the time taken by an object, particle or wave (be it acoustic, electromagnetic, etc.) to travel a distance through a medium. + Time-of-flight (TOF) is the time taken by an object, particle or wave (be it acoustic, electromagnetic, etc.) + to travel a distance through a medium. TOF analyzers can obtain good, but not ultra-high resolution, such as an :term:`orbitrap`. TMT - Tandem Mass Tag (TMT) is a MS based multiplexing technique designed to identify and quantify proteins from different samples in one single measurement. + Tandem Mass Tag (TMT) is an MS-based multiplexing technique designed to identify and + quantify proteins from different samples in one single measurement. TOPP - 'TOPP - The OpenMS PiPeline' is a pipeline for the analysis of HPLC-MS data. It consists of several small applications that can be chained to create analysis pipelines tailored for a specific problem. See :term:`TOPP tools`. + 'TOPP - The OpenMS PiPeline' is a pipeline for the analysis of HPLC-MS data. It consists of several small + applications that can be chained to create analysis pipelines tailored for a specific problem. See :term:`TOPP tools`. TOPPAS - An assistant for GUI-driven :term:`TOPP` workflow design, build into OpenMS. See `TOPPAS tutorial ` for details. + An assistant for GUI-driven :term:`TOPP` workflow design, built into OpenMS. + See `TOPPAS tutorial ` for details. TOPP tools OpenMS provides a number of applications (executable files) that are chainable in a pipeline/script and each process MS data. These tools are subdivided into different categories, such as 'File Handling' or 'Peptide Identification'.
+ All :term:`TOPP` tools are described in the `OpenMS API reference documentation + `__. TOPPView - TOPPView is a viewer for \ No newline at end of file + TOPPView is a viewer for MS and HPLC-MS data and is shipped with every OpenMS release. \ No newline at end of file From ad10bf7a7cb29c4cabbbabad297d8c075d0bc8d3 Mon Sep 17 00:00:00 2001 From: Chris Bielow Date: Tue, 3 Dec 2024 09:48:59 +0100 Subject: [PATCH 5/8] improve QuantData tutorial --- docs/source/user_guide/quantitative_data.rst | 34 ++++++++++---------- 1 file changed, 17 insertions(+), 17 deletions(-) diff --git a/docs/source/user_guide/quantitative_data.rst b/docs/source/user_guide/quantitative_data.rst index 05c6ec273..a8fcee2f6 100644 --- a/docs/source/user_guide/quantitative_data.rst +++ b/docs/source/user_guide/quantitative_data.rst @@ -6,7 +6,7 @@ features In OpenMS, information about quantitative data is stored in a so-called :py:class:`~.Feature`. Each -:py:class:`~.Feature` represents a region in RT and m/z space use for quantitative +:py:class:`~.Feature` represents a region in RT and m/z for quantitative analysis. .. code-block:: python @@ -27,7 +27,7 @@ analysis. masstrace.push_back(p) Usually, the quantitative features would be produced by a so-called -:py:class:`~.FeatureFinder` algorithm, which we will discuss in the next chapter. The +:py:class:`~.FeatureFinder` algorithm, which we will discuss in the `next chapter `_. The features can be stored in a :py:class:`~.FeatureMap` and written to disk. .. code-block:: python @@ -78,13 +78,13 @@ represented by a :py:class:`~.ConsensusFeature` ..
code-block:: python :linenos: - feature = oms.ConsensusFeature() - feature.setMZ(500.9) - feature.setCharge(2) - feature.setRT(1500.1) - feature.setIntensity(80500) + cf = oms.ConsensusFeature() + cf.setMZ(500.9) + cf.setCharge(2) + cf.setRT(1500.1) + cf.setIntensity(80500) - # Generate ConsensusFeature and features from two maps (with id 1 and 2) + # Generate ConsensusFeature from features of two maps (with id 1 and 2) ### Feature 1 f_m1 = oms.ConsensusFeature() f_m1.setRT(500) @@ -97,8 +97,8 @@ represented by a :py:class:`~.ConsensusFeature` f_m2.setMZ(299.99) f_m2.setIntensity(600) f_m2.ensureUniqueId() - feature.insert(1, f_m1) - feature.insert(2, f_m2) + cf.insert(1, f_m1) + cf.insert(2, f_m2) We have thus added two features from two individual maps (which have the unique identifier ``1`` and ``2``) to the :py:class:`~.ConsensusFeature`. @@ -110,12 +110,12 @@ the two maps and output the two linked features: # The two features in map 1 and map 2 represent the same analyte at # slightly different RT and m/z - for fh in feature.getFeatureList(): + for fh in cf.getFeatureList(): print(fh.getMapIndex(), fh.getIntensity(), fh.getRT()) - print(feature.getMZ()) - feature.computeMonoisotopicConsensus() - print(feature.getMZ()) + print(cf.getMZ()) + cf.computeMonoisotopicConsensus() + print(cf.getMZ()) # Generate ConsensusMap and add two maps (with id 1 and 2) cmap = oms.ConsensusMap() @@ -124,14 +124,14 @@ the two maps and output the two linked features: fds[2].filename = "file2" cmap.setColumnHeaders(fds) - feature.ensureUniqueId() - cmap.push_back(feature) + cf.ensureUniqueId() + cmap.push_back(cf) oms.ConsensusXMLFile().store("test.consensusXML", cmap) Inspection of the generated ``test.consensusXML`` reveals that it contains references to two :term:`LC-MS/MS` runs (``file1`` and ``file2``) with their respective unique identifier. Note how the two features we added before have matching -unique identifiers. +unique identifiers. 
Visualization of the resulting output file reveals a single :py:class:`~.ConsensusFeature` of size 2 that links to the two individual features at From 7d470165fc7998246a58310edb4257b6b2ba42ff Mon Sep 17 00:00:00 2001 From: Chris Bielow Date: Wed, 4 Dec 2024 16:11:03 +0100 Subject: [PATCH 6/8] Update docs/source/user_guide/oligonucleotides_rna.rst --- docs/source/user_guide/oligonucleotides_rna.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/user_guide/oligonucleotides_rna.rst b/docs/source/user_guide/oligonucleotides_rna.rst index db47f4785..73b03f2a0 100644 --- a/docs/source/user_guide/oligonucleotides_rna.rst +++ b/docs/source/user_guide/oligonucleotides_rna.rst @@ -174,7 +174,7 @@ DNA, RNA and Protein We can also work with DNA and RNA sequences in combination with the BioPython library (you can install BioPython with ``pip install biopython``): -.. code-block:: python +.. code-block:: pseudocode :linenos: from Bio.Seq import Seq From 915959251daf4310a6b57573a0bfcf3bbadc48e0 Mon Sep 17 00:00:00 2001 From: Chris Bielow Date: Thu, 5 Dec 2024 13:58:21 +0100 Subject: [PATCH 7/8] fix broken script --- docs/source/user_guide/ms_data.rst | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/user_guide/ms_data.rst b/docs/source/user_guide/ms_data.rst index 0e772f1ea..01cc9feb6 100644 --- a/docs/source/user_guide/ms_data.rst +++ b/docs/source/user_guide/ms_data.rst @@ -503,7 +503,7 @@ This can be useful for a brief visual inspection of your sample in quality contr bilip = oms.BilinearInterpolation() tmp = bilip.getData() - tmp.resize(int(rows), int(cols), float()) + tmp.resize(int(rows), int(cols)) bilip.setData(tmp) bilip.setMapping_0(0.0, exp.getMinRT(), rows - 1, exp.getMaxRT()) bilip.setMapping_1(0.0, exp.getMinMZ(), cols - 1, exp.getMaxMZ()) From bff9e19b100dfff4e73368133924b708e19b36e6 Mon Sep 17 00:00:00 2001 From: Chris Bielow Date: Thu, 5 Dec 2024 14:47:20 +0100 Subject: [PATCH 8/8] more fixes --- 
docs/source/user_guide/ms_data.rst | 8 +++----- 1 file changed, 3 insertions(+), 5 deletions(-) diff --git a/docs/source/user_guide/ms_data.rst b/docs/source/user_guide/ms_data.rst index 01cc9feb6..6cae38d26 100644 --- a/docs/source/user_guide/ms_data.rst +++ b/docs/source/user_guide/ms_data.rst @@ -592,11 +592,9 @@ Here, we can assess the purity of the precursor to filter spectra with a score b print("\nPurity scores") print("total:", purity_score.total_intensity) # 9098343.890625 print("target:", purity_score.target_intensity) # 7057944.0 - print( - "signal proportion:", purity_score.signal_proportion - ) # 0.7757394186070014 + print("signal proportion:", purity_score.signal_proportion) # 0.7757394186070014 print("target peak count:", purity_score.target_peak_count) # 1 - print("residual peak count:", purity_score.residual_peak_count) # 4 + print("interfering peak count:", purity_score.interfering_peak_count) # 4 .. code-block:: output @@ -614,7 +612,7 @@ Here, we can assess the purity of the precursor to filter spectra with a score b target: 7057944.0 signal proportion: 0.7757394186070014 target peak count: 1 - residual peak count: 4 + interfering peak count: 4 We could assess that we have four other non-isotopic peaks apart from our precursor and its isotope peaks within our precursor isolation window. The signal of the isotopic peaks correspond to roughly 78% of all intensities in the precursor isolation window.
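The "roughly 78%" figure quoted above can be reproduced by hand from the printed intensities. The following is a minimal sketch in plain Python (no pyOpenMS required) that assumes the signal proportion is simply the target intensity divided by the total intensity in the isolation window. The helper ``signal_proportion`` and the cutoff ``MIN_PURITY`` are hypothetical names introduced only for illustration; they are not part of the pyOpenMS API.

```python
# Intensities copied from the purity-score output shown above.
total_intensity = 9098343.890625
target_intensity = 7057944.0

# Hypothetical cutoff, chosen only for this illustration.
MIN_PURITY = 0.7


def signal_proportion(target, total):
    """Return the fraction of the isolation-window intensity owned by the target."""
    if total <= 0:
        raise ValueError("total intensity must be positive")
    return target / total


proportion = signal_proportion(target_intensity, total_intensity)
passes_filter = proportion >= MIN_PURITY

print(f"signal proportion: {proportion:.4f}")
print("keep spectrum:", passes_filter)
```

With the values above, the ratio rounds to 0.7757, which agrees with the ``signal proportion`` value printed by the tutorial script. A suitable cutoff depends on the experiment; 0.7 is not a recommended setting.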