Import CSV, MS-DIAL, MZmine #85

Open
misch91 opened this issue Jan 31, 2022 · 7 comments
Labels
documentation Issues in the docstrings or documentation enhancement New feature or request

Comments

@misch91
Contributor

misch91 commented Jan 31, 2022

Hey there,
Thanks for this wonderful tool.

I read in the code that there is an option for direct .csv import of MS datasets: def _loadCSVImport(self, path, noFeatureParams=1, variableType='Discrete'):
Unfortunately, the documentation does not provide an example of what this CSV should look like.
Sharing an example feature table with the minimum requirements would be very helpful, as I am trying to import MS-DIAL and MZmine (LC-MS) peak tables and compare their peak picking against QI datasets.
Also, is direct import of MS-DIAL or MZmine datasets planned?

I assume the metadata can still be imported as usual via dataset.addSampleInfo()?

@Gscorreia89 Gscorreia89 added the documentation Issues in the docstrings or documentation label Feb 1, 2022
@Gscorreia89
Member

Hi,

Thank you for your interest in the nPYc-Toolbox. For the CSV import, the first column must be named 'Sample File Name' and contain the sample IDs. Every other column needs to have the feature name as its column name, with the intensities for each sample in each row. The noFeatureParams=1 defines how many samples to skip.

I think it is worth giving this a try with MS-DIAL and MZmine outputs. This method is not very well documented, as it is meant for internal use and testing. If you could provide us with some examples of MS-DIAL and MZmine outputs, I could look into adding explicit import methods.
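
For readers looking for a concrete starting point, here is a minimal sketch of a table in the layout described above, written with the Python standard library only. The sample IDs, intensities and the second feature name are made up for illustration, and the sketch does not call any nPYc-Toolbox function; whether such a file loads cleanly through _loadCSVImport (and how noFeatureParams interacts with it) has not been verified here.

```python
import csv
import io

# Hypothetical sample IDs, feature names and intensities (illustration only).
rows = [
    {"Sample File Name": "sample_01", "5.05_536.9785m/z": 1520.0, "7.31_212.1180m/z": 830.5},
    {"Sample File Name": "sample_02", "5.05_536.9785m/z": 1498.2, "7.31_212.1180m/z": 0.0},
]
fieldnames = list(rows[0].keys())


def feature_table_to_csv(rows, fieldnames):
    """Serialise row dicts to CSV text: first column sample IDs, then one column per feature."""
    if fieldnames[0] != "Sample File Name":
        raise ValueError("first column must be 'Sample File Name'")
    feature_names = fieldnames[1:]
    if len(set(feature_names)) != len(feature_names):
        raise ValueError("feature names must be unique")
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = feature_table_to_csv(rows, fieldnames)
print(csv_text)
```

The uniqueness check is there because duplicate column names are easy to produce when feature names are built from rounded RT and m/z values.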

@misch91
Contributor Author

misch91 commented Feb 2, 2022

About the CSV import: do the feature names need to follow the given name structure, like "5.05_536.9785m/z"?

I found another thing somewhat strange inside the XCMS-Import:
self.featureMetadata['Retention Time'] = self.featureMetadata['Retention Time'].astype(float) / 60.0
Why is the retention time value divided by 60? This is not applied to QI datasets or CSV imports. It results in strange-looking ion maps, and the peak width plot cannot be shown in the feature summary report. So far I have tested it only with XCMSonline-processed data (which worked, by the way, after renaming the columns "rtmed" to "rt" and "mzmed" to "mz" before importing as "xcms"). However, the output data looks quite odd; see the attachment.

Thank you for your explanations.
I will test and provide MS-DIAL and MZmine peak tables as soon as possible.

Attachment: xcms_online_test_combinedData.csv

@Gscorreia89
Member

Hi,

The feature names can be anything, as long as they are unique.

The current XCMS import functionality was developed with standard XCMS in mind, not XCMSOnline. In the standard XCMS outputs the rt is in seconds, hence the division by 60. If you have an example of XCMSOnline outputs that you are happy to share alongside the MS-DIAL and MZmine examples, that would be great, so we can look into adding an explicit import option for XCMSOnline as well. Also, are you comfortable programming in Python and using git to work collaboratively on this if I open a new feature branch? That would make it easier for us to test together that whatever we implement runs correctly for all your use cases.
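
As a rough illustration of the two points discussed here (the column renaming used as a workaround and the seconds-to-minutes conversion), the following sketch operates on plain dictionaries. The helper names are hypothetical and this is not the toolbox's import code. Converting "rtmin" and "rtmax" together with "rt" is a choice made in this sketch for consistency, not a description of the current import; if an export already reports minutes, the conversion step should simply be skipped.

```python
SECONDS_PER_MINUTE = 60.0

# Column names mentioned in this thread: XCMSOnline-style name -> name expected by the "xcms" import.
RENAME_MAP = {"rtmed": "rt", "mzmed": "mz"}


def rename_xcms_online_columns(row, rename_map=RENAME_MAP):
    """Return a copy of one feature row with XCMSOnline-style column names renamed."""
    return {rename_map.get(key, key): value for key, value in row.items()}


def rt_seconds_to_minutes(row, rt_columns=("rt", "rtmin", "rtmax")):
    """Return a copy of the row with every present RT column converted from seconds to minutes."""
    converted = dict(row)
    for column in rt_columns:
        if column in converted:
            converted[column] = float(converted[column]) / SECONDS_PER_MINUTE
    return converted


# Hypothetical feature row with retention times in seconds.
feature = {"mzmed": "536.9785", "rtmed": "303.0", "rtmin": "300.0", "rtmax": "306.0"}
feature_in_minutes = rt_seconds_to_minutes(rename_xcms_online_columns(feature))
print(feature_in_minutes)
```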

@misch91
Contributor Author

misch91 commented Feb 3, 2022

Hi again,
thanks for your explanation. I had already assumed it was a seconds-to-minutes conversion, but wondered why it is not applied to rtmin and rtmax as well.

I'd be happy to help you implement the import of XCMSonline/MZmine/MS-DIAL data, since all of them are widely used in the metabolomics community. That should be simple coding, and I have access to some example datasets from my institute.

Another thing I can offer is three additional feature filters for LC-MS datasets: a signal-to-noise filter, a peak width filter and a detection rate filter. All are recommended, with specific threshold values, by current best-practice papers (Broadhurst et al., Dunn et al., and similar). Until now I have used self-made code that reads nPYc-exported CSV files to apply these filters, but with a little work they could be added to the Attributes section and SOP.
I'll get in touch with you via your institutional mail address.
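
To make the last of the three proposed filters concrete, here is a minimal sketch of a detection rate filter on a plain feature table. It is not the self-made code mentioned above. The function names, the definition of "detected" (intensity strictly above a minimum) and the 0.5 threshold in the example are assumptions for illustration and are not taken from the cited papers.

```python
def detection_rate(intensities, min_intensity=0.0):
    """Fraction of samples in which the feature is detected (intensity above min_intensity)."""
    if not intensities:
        raise ValueError("at least one sample is required")
    detected = sum(1 for value in intensities if value is not None and value > min_intensity)
    return detected / len(intensities)


def filter_by_detection_rate(feature_table, threshold, min_intensity=0.0):
    """Keep the features whose detection rate is at least `threshold` (between 0 and 1)."""
    return {
        name: values
        for name, values in feature_table.items()
        if detection_rate(values, min_intensity) >= threshold
    }


# Hypothetical table: feature name -> intensities across four samples.
feature_table = {
    "feature_A": [1200.0, 1100.0, 0.0, 1300.0],  # detected in 3 of 4 samples
    "feature_B": [0.0, 0.0, 450.0, None],        # detected in 1 of 4 samples
}
kept = filter_by_detection_rate(feature_table, threshold=0.5)
print(sorted(kept))
```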

@Gscorreia89 Gscorreia89 added the enhancement New feature or request label Feb 4, 2022
@Gscorreia89
Member

Hi,

I have created a new feature branch to tackle this: https://github.com/phenomecentre/nPYc-Toolbox/tree/feature/msImportUpdates

@Gscorreia89
Member

Hi,

@misch91 Thank you very much for the commit. I accidentally approved the pull request without noticing it was already a pull request into develop :S.

I am reproducing your message in the PR text below for reference:

Example datasets can be found here:
https://drive.google.com/drive/folders/1ITl4pvgvDmTZOv0fytplMd4nXMVtAqqX?usp=sharing

XCMSonline
-By default, XCMSonline provides two result tables (.xlsx) as output: one annotated, the other not. The annotated one contains extra columns at the end of the table, named “isotopes”, “adduct” and “pcgroup”. To let the import function deal with both, I adjust endIndex and the adduct metadata with an if-clause.
-Sadly, the generated xlsx files of untargeted datasets can be (very) large. In my experience they are usually between 30 and 60 MB per file, and can be even larger if the peak picking settings are changed to a more sensitive level. This causes problems with the pandas.read_excel() function, as it cannot read xlsx files chunkwise and runs into memory problems. With my datasets it failed completely, returning an empty dataframe. I tried all 4 different engines without success. Browsing through Stack Overflow revealed that this is indeed a general issue and that the easiest way out is converting the original file to CSV format. As a consequence, nPYc users must do this xlsx -> csv conversion manually beforehand (I put advice on this in the function info, but it definitely needs to be addressed in the documentation and tutorial). I know it is a small drawback, but I hope we can expect future users to be able to do this.
-Another XCMSonline issue occurs with the metadata file. Apparently, XCMSonline changes all Sample File Names to lower case during processing. It took me a while to realize this after my self-made metadata CSV file did not match when using the addSampleInfo() function (which seems to be case-sensitive). Two options appear possible here, in my opinion: a) the documentation/tutorial should point this out, or b) generalize it for all data types so that addSampleInfo() automatically lower-cases the names before matching. What’s your opinion on this?
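
Option b) above can be sketched as a case-insensitive lookup between the sample names in the feature table and the rows of a metadata file. This is a standalone illustration with hypothetical names; it does not reflect how addSampleInfo() is implemented.

```python
def match_metadata_case_insensitive(sample_names, metadata_rows, key="Sample File Name"):
    """Match sample names to metadata rows while ignoring letter case.

    Returns (matched, unmatched): a dict of sample name -> metadata row,
    and a list of sample names without a metadata row.
    """
    lookup = {}
    for row in metadata_rows:
        folded = row[key].casefold()
        if folded in lookup:
            # Two metadata rows that differ only by case would be ambiguous.
            raise ValueError(f"metadata names collide when case is ignored: {row[key]!r}")
        lookup[folded] = row

    matched, unmatched = {}, []
    for name in sample_names:
        row = lookup.get(name.casefold())
        if row is None:
            unmatched.append(name)
        else:
            matched[name] = row
    return matched, unmatched


# Hypothetical names: the feature table carries lower-cased names, the metadata file does not.
sample_names = ["qc_01", "study_a_07", "blank_02"]
metadata_rows = [
    {"Sample File Name": "QC_01", "Run Order": 1},
    {"Sample File Name": "Study_A_07", "Run Order": 2},
]
matched, unmatched = match_metadata_case_insensitive(sample_names, metadata_rows)
print(matched)
print(unmatched)
```

Raising on case-only collisions keeps the lookup from silently picking one of two distinct samples.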

MS-DIAL
-First of all, the software provides many different options to choose from when exporting data (similar to Progenesis QI), although the “Raw peak area” option is the one we should aim for (raw peak height might also be interesting for some users). The output txt file then comes with a lot of information (see example files), ranging from statistical parameters and identifications to MS/MS peak info. The good thing for us is that these columns are consistent even if no MS/MS data was acquired. I applied Occam’s razor and picked only the RT, m/z and area (or height) values.
-Interestingly, file type (nPYc equivalent: AssayRole/Sample Type), Injection Order (Run Order) and Batch ID (Correction Batch) are also included in the export if the user supplied them beforehand. I tried to implement an extraction of these metadata as well, but I honestly have no idea how to check whether the code block works (please give it a try). For some users this may already be enough, so I already set “self.sampleMetadata['Metadata Available'] = True”; others, however, may need to provide more metadata for, let’s say, dilution etc. I expect that the remaining metadata can be added later with the usual addSampleInfo() function. Would already existing metadata be overwritten in that case?

MZmine
-The MZmine output looks a bit confusing at first sight (see example file) but can easily be filtered down to what we need. Peak width, RTmin/max and MZmin/max are also provided, but separately for each sample. I therefore averaged them across samples (mean). Because gap filling had been applied, there was a value in all cases; I do not know what happens without gap filling. I assume MZmine outputs “null” in that case, and I checked this manually. Also, values are imported directly as float.
-By the way: how does XCMS calculate RTmin and RTmax? Is it the true minimum/maximum RT ever measured in any sample, or a mean/median of all RT min/max values?
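
The per-sample averaging described for MZmine can be sketched as follows. The set of strings treated as missing and the example values are assumptions for illustration; in particular, the “null” placeholder is only the assumption stated above, not a verified property of MZmine exports.

```python
from statistics import fmean

# Strings treated as "no value" in this sketch (assumed, not verified against MZmine).
MISSING_MARKERS = {"", "null", "nan"}


def mean_ignoring_missing(values):
    """Mean of the numeric entries; None if every entry is missing."""
    numeric = [
        float(value)
        for value in values
        if value is not None and str(value).strip().lower() not in MISSING_MARKERS
    ]
    return fmean(numeric) if numeric else None


# Hypothetical per-sample peak widths (minutes) for one feature across three samples.
peak_widths = ["0.12", "0.10", "null"]
mean_width = mean_ignoring_missing(peak_widths)
print(mean_width)
```

Returning None when no sample has a value leaves the decision about such features (drop, flag, or impute) to the caller.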

nPYc reimport
-I tried to avoid hardcoding here as much as possible, as the exported CSV style strongly depends on the software that was used for preprocessing the dataset in the first place. The featureMetadata block thus contains many if-statements. As for the sampleMetadata block, I am not 100% sure whether it is useful for the other modules. Please have a look.

@Gscorreia89
Member

Thank you for the input files - I will push them to our unittest data git repo (https://github.com/phenomecentre/npc-standard-project) and prepare other necessary files to have a working example for testing.

I will start with the MZmine and MS-DIAL imports, since their outputs are easier to get working as they come out of the software. For XCMSOnline I will see what can be done to minimise the need for the user to modify the output files.

2 participants