forked from sneumann/xcms
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathnew_functionality.Rmd
992 lines (775 loc) · 45.2 KB
/
new_functionality.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
---
title: "New and modified functionality in xcms"
author: "Johannes Rainer"
graphics: yes
package: xcms
output:
BiocStyle::html_document2:
toc_float: true
vignette: >
%\VignetteIndexEntry{New and modified functionality in xcms}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
%\VignetteDepends{xcms,RColorBrewer}
bibliography: references.bib
csl: biomed-central.csl
references:
- id: dummy
title: no title
author:
- family: noname
given: noname
---
# New functionality in `xcms`
This document describes new functionality and changes to existing functionality
in the `xcms` package introduced during the update to version *3*.
```{r message = FALSE, warning = FALSE }
library(xcms)
library(RColorBrewer)
## Use socket based parallel processing on Windows systems
if (.Platform$OS.type == "unix") {
register(bpstart(MulticoreParam(2)))
} else {
register(bpstart(SnowParam(2)))
}
```
## Modernized user interface
The modernization of the user interface comprises new classes for data
representation and new data analysis methods. In addition, the core logic for
the data processing has been extracted from the old methods and put into a set
of R functions, the so called core API functions (or `do_` functions). These
functions take standard R data structures as input and return standard R data
types as result and can hence be easily included in other R packages.
The new user interface aims at simplifying and streamlining the `xcms` workflow
while guaranteeing data integrity and performance also for large scale
metabolomics experiments. Importantly, a simplified access to the original raw
data should be provided throughout the whole metabolomics data analysis workflow.
The new interface re-uses objects from the `MSnbase` Bioconductor package, such as
the `OnDiskMSnExp` object. This object is specifically designed for large scale MS
experiments as it initially reads just the scan header information from the mzML
while the mz-intensity value pairs from all or from selected spectra of a file
are read on demand hence minimizing the memory demand. Also, in contrast to
the old `xcmsRaw` object, the `OnDiskMSnExp` contains information from all files of
an experiment. In addition, all data normalization and adjustment methods
implemented in the `MSnbase` package can be directly applied to the MS data
without the need to re-implement such methods in `xcms`. Results from `xcms`
preprocessings, such as chromatographic peak detection or correspondence are
stored into the new `XCMSnExp` object. This object extends the `OnDiskMSnExp` object
and inherits thus all of its methods including raw data access.
Class and method/function names follow also a new naming convention trying tp
avoid the partially confusing nomenclature of the original `xcms` methods (such as
the `group` method to perform the correspondence of peaks across samples). To
distinguish them from mass peaks, the peaks identified by the peak detection in
an LS/GC-MS experiment are referred to as *chromatographic peaks*. The respective
method to identify such peaks is hence called `findChromPeaks` and the identified
peaks can be accessed using the `XCMSnExp` `chromPeaks` method. The results from an
correspondence analysis which aims to match and group chromatographic peaks
within and between samples are called *features*. A feature corresponds to
individual ions with a unique mass-to-charge ratio (mz) and a unique retention
time (rt). The definition of such mz-rt features (i.e. the result from the
`groupChromPeaks` method) can be accessed *via* the `featureDefinitions` method of
the `XCMSnExp` class. Finally, alignment (retention time correction) can be
performed using the `adjustRtime` method.
The settings for any of the new analysis methods are bundled in *parameter*
classes, one class for each method. This encapsulation of the parameters to a
function into a parameter class (such as `CentWaveParam`) avoids busy function
calls (with many single parameters) and enables saving, reloading and reusing
the settings. In addition, the parameter classes are added, along with other
information to the process history of an `XCMSnExp` object thus providing a
detailed documentation of each processing step of an analysis, with the
possibility to recall all settings of the performed analyses at any stage. In
addition, validation of the parameters can be performed within the parameter
object and hence is no longer required in the analysis function.
The example below illustrates the new user interface. First we load the raw data
files from the `faahKO` package using the `readMSData` from the `MSnbase` package.
```{r message = FALSE, warning = FALSE }
## Reading the raw data using the MSnbase package
library(xcms)
## Load 6 of the CDF files from the faahKO
cdf_files <- dir(system.file("cdf", package = "faahKO"), recursive = TRUE,
full.names = TRUE)[c(1:3, 7:9)]
## Define the sample grouping.
s_groups <- rep("KO", length(cdf_files))
s_groups[grep(cdf_files, pattern = "WT")] <- "WT"
## Define a data.frame that will be used as phenodata
pheno <- data.frame(sample_name = sub(basename(cdf_files), pattern = ".CDF",
replacement = "", fixed = TRUE),
sample_group = s_groups, stringsAsFactors = FALSE)
## Read the data.
raw_data <- readMSData(cdf_files, pdata = new("NAnnotatedDataFrame", pheno),
mode = "onDisk")
```
The `OnDiskMSnExp` organizes the MS data by spectrum and provides th methods
`intensity`, `mz` and `rtime` to access the raw data from the files (the measured
intensity values, the corresponding m/z and retention time values). In addition,
the `spectra` method could be used to return all data encapsulated in `Spectrum`
classes. Below we extract the retention time values from the object.
```{r message = FALSE }
head(rtime(raw_data))
class(rtime(raw_data))
```
All data is returned as one-dimensional vectors (a numeric vector for `rtime` and
a `list` of numeric vectors for `mz` and `intensity`, each containing the values from
one spectrum), even if the experiment consists of multiple files/samples. The
`fromFile` returns a numeric vector that provides the mapping of the values to the
originating file. Below we use the `fromFile` indices to organize the `mz` values by
file.
```{r message = FALSE }
mzs <- mz(raw_data)
class(mzs)
length(mzs)
## Split the list by file
mzs_by_file <- split(mzs, f = fromFile(raw_data))
length(mzs_by_file)
```
We next plot the total ion chromatogram (TIC) for all files within the
experiment. Note that we are iteratively sub-setting the full data per file
using the `filterFile` method, which, for `OnDiskMSnExp` objects, is an efficient
way to subset the data while ensuring that all data, including metadata, stays
consistent.
```{r faahKO-tic, message = FALSE, fig.align = 'center', fig.width = 8, fig.height = 4 }
library(RColorBrewer)
sample_colors <- brewer.pal(3, "Set1")[1:2]
names(sample_colors) <- c("KO", "WT")
## Subset the full raw data by file and plot the data.
tmp <- filterFile(raw_data, file = 1)
plot(x = rtime(tmp), y = tic(tmp), xlab = "retention time", ylab = "TIC",
col = paste0(sample_colors[tmp$sample_group], 80), type = "l")
for (i in 2:length(fileNames(raw_data))) {
tmp <- filterFile(raw_data, file = i)
points(rtime(tmp), tic(tmp), type = "l",
col = paste0(sample_colors[tmp$sample_group], 80))
}
legend("topleft", col = sample_colors, legend = names(sample_colors), lty = 1)
```
Alternatively we can use the `chromatogram` method that extracts
chromatograms from the object. In the example below we extract the *base peak
chromatogram* (BPC) by setting `aggregationFun` to `"max"` and not specifying an `rt`
or `mz` range hence extracting the full data. In contrast to the `tic` and `bpi`
methods, this function reads the data from the raw files. It takes thus more
time to create the plot, but it is based on the actual raw data that is used for
the later analysis - the `tic` and `bpi` methods access only the information that is
stored in the raw data files by the MS detector during the data acquisition.
The `chromatogram` method returns a `Chromatograms` object which is an extension to
the base `matrix` class allowing to arrange multiple `Chromatogram` objects in a
two-dimensional grid. Columns in the `Chromatograms` object represent samples,
rows m/z x rt ranges.
```{r faahKO-bpi, message = FALSE, fig.align = "center", fig.width = 8, fig.height = 4 }
## Get the base peak chromatograms. This reads data from the files.
bpis <- chromatogram(raw_data, aggregationFun = "max")
## Plot all chromatograms.
plot(bpis, col = paste0(sample_colors[raw_data$sample_group], 80))
```
While the `plot` method if very convenient (and fast), it would also not be too
difficult to create the plot manually:
```{r faahKO-bbpi-manual, message = FALSE, fig.align = "center", fig.width = 8, fig.height = 4 }
plot(3, 3, pch = NA, xlim = range(unlist(lapply(bpis[1, ], rtime))),
ylim = range(unlist(lapply(bpis[1, ], intensity))), main = "BPC",
xlab = "rtime", ylab = "intensity")
for (i in 1:ncol(bpis)) {
points(rtime(bpis[1, i]), intensity(bpis[1, i]), type = "l",
col = paste0(sample_colors[raw_data$sample_group[i]], 80))
}
```
Note that we could restrict the analysis to a certain retention time range by
first sub-setting `raw_data` with the `filterRt` method.
In addition we can plot the distribution of the total ion counts per file. In
contrast to sub-setting the object we split the numeric vector returned by the
`tic` by file using the `fromFile` method that provides the mapping of the
experiment's spectra to the originating files.
```{r faahKO-tic-boxplot, message = FALSE, fig.align = "center", fig.width = 8, fig.height = 4 }
## Get the total ion current by file
tc <- split(tic(raw_data), f = fromFile(raw_data))
boxplot(tc, col = paste0(sample_colors[raw_data$sample_group], 80),
ylab = "intensity", main = "Total ion current")
```
The `tic` (and for mzML files) the `bpi` methods are very fast, even for large data
sets, as these information are stored in the header of the raw files avoiding
the need to read the raw data from each file. Also, we could subset the whole
object using the filter functions `filterFile`, `filterRt` or `filterMz` to
e.g. remove problematic samples or restrict the retention time range in which we
want to perform the chromatographic peak detection.
Next we perform the chromatographic peak detection using the *centWave* algorithm
[@Tautenhahn:2008fx]. In the example below we use most of the standard
parameters, but the settings should be adjusted to each experiment individually
based on e.g. the expected width of the chromatographic peaks etc.
```{r faahKO-centWave, message = FALSE, warning = FALSE }
## Defining the settings for the centWave peak detection.
cwp <- CentWaveParam(snthresh = 20, noise = 1000)
xod <- findChromPeaks(raw_data, param = cwp)
```
The identified peaks can be accessed with the `chromPeaks` parameter which returns
a `matrix`, each line representing an identified peak. Column `"sample"` specifies
in which *sample* (i.e. file) of the experiment the peak was detected. Below we
plot the signal distribution of the identified peaks per sample.
```{r faahKO-peak-intensity-boxplot, message = FALSE, fig.align = "center", fig.width = 8, fig.height = 4 }
ints <- split(chromPeaks(xod)[, "into"], f = chromPeaks(xod)[, "sample"])
ints <- lapply(ints, log2)
boxplot(ints, varwidth = TRUE, col = sample_colors[xod$sample_group],
ylab = expression(log[2]~intensity), main = "Peak intensities")
```
To get a global overview of the peak detection results we can use the
`plotChromPeakImage` function that plots the number of identified peaks for each
sample along the retention time axis as an image plot. This would allow for
example to spot samples in which much fewer peaks were identified. Below we
create the image counting the number of detected peaks within bins of 20 seconds
along the retention time axis.
```{r faahKO-peak-image, message = FALSE, fig.align = "center", fig.width = 8, fig.height = 4, fig.cap = "Chromatographic peak image." }
plotChromPeakImage(xod, binSize = 20)
```
The `plotChromPeaks` function can be used to get a global overview of the
identified chromatographic peaks of one file. It highlights the identified peaks
in the full mz/rt plane.
```{r faahKO-peak-plot, message = FALSE, fig.align = "center", fig.width = 8, fig.height = 4, fig.cap = "Chromatographic peaks for one file." }
plotChromPeaks(xod, file = 3)
```
After peak detection it might be advisable to evaluate whether the peak
detection identified e.g. compounds known to be present in the
sample. Facilitating access to the raw data has thus been one of the major aims
for the updated user interface.
Next we extract the chromatogram for the rt-mz region corresponding to one
detected chromatographic peak increasing the region in rt dimension by +/- 60
seconds. In addition we extract also the full chromatogram for the specified mz
range (i.e. the full rt range) and identify all chromatographic peaks in that
region by passing the same `mz` and `rt` parameters to the `chromPeaks` method.
If two-column matrices are passed to the `chromatogram` method with parameters `rt`
and `mz`, the function returns a `Chromatograms` object with each column containing
the data from one sample/file and each row the `Chromatogram` objects for the
respective ranges.
```{r faahKO-chromPeaks-extractChroms, warning = FALSE }
rtr <- chromPeaks(xod)[68, c("rtmin", "rtmax")]
## Increase the range:
rtr[1] <- rtr[1] - 60
rtr[2] <- rtr[2] + 60
mzr <- chromPeaks(xod)[68, c("mzmin", "mzmax")]
## Add an rt range that would extract the full chromatogram
rtr <- rbind(c(-Inf, Inf), rtr)
mzr <- rbind(mzr, mzr)
chrs <- chromatogram(xod, rt = rtr, mz = mzr)
## In addition we get all peaks detected in the same region
pks <- chromPeaks(xod, rt = rtr, mz = mzr)
pks
```
Next we plot the extracted chromatogram for the data and highlight in addition
the identified peaks using the `highlightChromPeaks` function.
```{r faahKO-extracted-chrom-with-peaks, message = FALSE, fig.cap = "Extracted ion chromatogram for one of the identified peaks. Left: full retention time range, right: rt range of the peak. Each line represents the signal measured in one sample. The rectangles indicate the margins of the identified chromatographic peak in the respective sample.", fig.align = "center", fig.width = 12, fig.height = 6 }
## Plot the full rt range:
par(mfrow = c(2, 1))
plot(chrs[1, , drop = FALSE], col = paste0(sample_colors[xod$sample_group], 80))
## And now for the peak range.
plot(chrs[2, , drop = FALSE], col = paste0(sample_colors[xod$sample_group], 80))
## Highlight also the identified chromatographic peaks.
highlightChromPeaks(xod, rt = rtr[2, ], mzr[2, ],
border = paste0(sample_colors[xod$sample_group], 40))
```
Note that `Chromatogram` objects extracted by the `chromatogram` method contain an
`NA` value if in a certain scan (i.e. for a specific retention time) no signal was
measured in the respective mz range. This is reflected by the lines not being
drawn as continuous lines in the plot above.
Next we align the samples using the *obiwarp* method [@Prince:2006jj]. This
method does not require, in contrast to other alignment/retention time
correction methods, any identified peaks and could thus also be applied to an
`OnDiskMSnExp` object. Note that all retention time adjustment methods do also
adjust the retention times reported for the individual peaks in `chromPeaks`.
```{r faahKO-obiwarp, message = FALSE }
## Doing the obiwarp alignment using the default settings.
xod <- adjustRtime(xod, param = ObiwarpParam())
```
Note that any pre-processing results can be removed at any time using a *drop*
method, such as `dropChromPeaks`, `dropFeatureDefinitions` or
`dropAdjustedRtime`.
To evaluate the impact of the alignment we can plot again the BPC of each
sample. In addition we plot the differences of the adjusted to the raw retention
times per sample using the `plotAdjustedRtime` function.
```{r faahKO-bpi-obiwarp, message = FALSE, fig.align = "center", fig.width = 8, fig.height = 8 }
## Get the base peak chromatograms. This reads data from the files.
bpis <- chromatogram(xod, aggregationFun = "max")
par(mfrow = c(2, 1), mar = c(4.5, 4.2, 1, 0.5))
plot(bpis, col = paste0(sample_colors[xod$sample_group], 80))
## Plot also the difference of adjusted to raw retention time.
plotAdjustedRtime(xod, col = paste0(sample_colors[xod$sample_group], 80))
```
Too large differences between adjusted and raw retention times could indicate
poorly performing samples or alignment.
The distribution of retention time differences could also be used for quality
assessment.
```{r faahKO-adjusted-rtime-boxplot, message = FALSE, fig.align = "center", fig.width = 8, fig.height = 4 }
## Calculate the difference between the adjusted and the raw retention times.
diffRt <- rtime(xod) - rtime(xod, adjusted = FALSE)
## By default, rtime and most other accessor methods return a numeric vector. To
## get the values grouped by sample we have to split this vector by file/sample
diffRt <- split(diffRt, fromFile(xod))
boxplot(diffRt, col = sample_colors[xod$sample_group],
main = "Obiwarp alignment results", ylab = "adjusted - raw rt")
```
The 3rd sample was used as *center* sample against which all other samples were
aligned to, hence its adjusted retention times are identical to the raw
retention times.
Below we plot the extracted ion chromatogram for the selected peak from the
example above before and after retention time correction to evaluate the impact
of the alignment.
```{r faahKO-extracted-chrom-with-peaks-aligned, echo = FALSE, message = FALSE, fig.cap = "Extracted ion chromatogram for one of the identified peaks before and after alignment.", fig.align = "center", fig.width = 8, fig.height = 8 }
rtr <- chromPeaks(xod)[68, c("rtmin", "rtmax")]
## Increase the range:
rtr[1] <- rtr[1] - 60
rtr[2] <- rtr[2] + 60
mzr <- chromPeaks(xod)[68, c("mzmin", "mzmax")]
chrs <- chromatogram(xod, rt = rtr, mz = mzr)
chrs_raw <- chromatogram(raw_data, rt = rtr, mz = mzr)
par(mfrow = c(2, 1))
plot(chrs_raw, col = paste0(sample_colors[xod$sample_group], 80))
plot(chrs, col = paste0(sample_colors[xod$sample_group], 80))
highlightChromPeaks(xod, rt = rtr, mzr,
border = paste0(sample_colors[xod$sample_group], 40))
```
After alignment, the peaks are nicely overlapping.
Next we group identified chromatographic peaks across samples. We use the *peak
density* method [@Smith:2006ic] specifying that a chromatographic peak has
to be present in at least 2/3 of the samples within each group to be combined to
a mz-rt *feature*.
```{r faahKO-groupPeakDensity, message = FALSE }
## Define the PeakDensityParam
pdp <- PeakDensityParam(sampleGroups = xod$sample_group,
maxFeatures = 300, minFraction = 0.66)
xod <- groupChromPeaks(xod, param = pdp)
```
The definitions of the features can be accessed with the `featureDefinitions`,
which lists the mz-rt space specific to a feature. Column `"peakidx"` lists the
indices (in the `chromPeaks` matrix) of the individual chromatographic peaks
belonging to the feature.
```{r faahKO-featureDefinitions, message = FALSE }
head(featureDefinitions(xod))
```
The `plotChromPeakDensity` method allows to inspect the result of the peak
grouping on e.g. a known compound/peak.
```{r faahKO-plot-peak-density, message = FALSE, fig.align = "center", fig.width = 8, fig.height = 8 }
## Extract the full chromatogram for the mz of 279
chrs <- chromatogram(xod, mz = 279)
## Plot the chromatogram objects
par(mfrow = c(2, 1), mar = c(2.5, 4, 1, 1))
plot(chrs, col = paste0(sample_colors[xod$sample_group], 80))
## Hightlight the chromatographic peaks identified in this mz slice
highlightChromPeaks(xod, mz = 279,
border = paste0(sample_colors[xod$sample_group], 60))
## Plot the peak density distribution for the given mz slice providing also the
## parameter class used for the peak grouping.
plotChromPeakDensity(xod, mz = 279, pch = 16, param = pdp,
col = paste0(sample_colors[xod$sample_group], 80))
```
The upper panel of the plot above shows the extracted chromatogram for an mz
value of 279 with the identified chromatographic peaks indicated with
rectangles. The lower panel shows the location of identified chromatographic
peaks per sample along the retention time axis (points) and the chromatographic
peak density as black solid line. The grey rectangle indicates which peaks have
been grouped and assigned to the corresponding feature. This type of
visualization can be very helpful to tune the parameters of the peak grouping
for example to evaluate whether closely located peaks known to come from
different compounds were successfully separated.
To extract *values* for the features, the `featureValues` method can be used. This
method returns a matrix with rows being the features and column the samples. The
`value` parameter allows to specify the value that should be returned. Below we
extract the `"into"` signal, i.e. the per-peak integrated intensity for each
feature.
```{r faahKO-featureValues, message = FALSE }
## Extract the "into" peak integrated signal.
head(featureValues(xod, value = "into"))
```
After correspondence there will always be features that do not include peaks
from every sample (being it that the peak finding algorithm failed to identify a
peak or that no signal was measured in the respective mz-rt area). For such
features an `NA` is returned by the `featureValues` method. Here, `xcms` allows to
infer values for such missing peaks using the `fillChromPeaks` method. This method
integrates in files where a peak was not found the signal from the mz-rt area
where it is expected and adds it to the `chromPeaks` matrix. Such *filled-in* peaks
have a value of `1` in the `"is_filled"` column of the `chromPeaks` matrix.
```{r faahKO-fillPeaks, message = FALSE }
## Fill in peaks with default settings. Settings can be adjusted by passing
## a FillChromPeaksParam object to the method.
xod <- fillChromPeaks(xod)
head(featureValues(xod, value = "into"))
```
Not for all missing peaks a value could be integrated (because at the respective
location no measurements are available). The peak area from which signal is to
be extracted can also be increased modifying the settings by passing a
`FillChromPeaksParam` object.
Next we inspect the `processHistory` of the analysis. As described earlier, this
records all (major) processing steps along with the corresponding parameter
classes.
```{r faahKO-processHistory, message = FALSE }
## List the full process history
processHistory(xod)
```
It is also possible to extract specific processing steps by specifying its
type. Available types can be listed with the `processHistoryTypes` function. Below
we extract the parameter class for the alignment/retention time adjustment step.
```{r faahKO-processHistory-select, message = FALSE }
ph <- processHistory(xod, type = "Retention time correction")
## Access the parameter
processParam(ph[[1]])
```
As described earlier, we can remove specific analysis results at any
stage. Below we remove the results from the alignment. Since the correspondence
was performed after that processing step its results will be removed too leaving
us only with the results from the peak detection step.
```{r faahKO-drop-alignment, message = FALSE }
## Remove the alignment results
xod <- dropAdjustedRtime(xod)
processHistory(xod)
```
We can now use a different method to perform the alignment. The *peak groups*
alignment method bases the alignment of the samples on chromatographic peaks
present in most samples (so called *well behaved* peaks). This means we have to
perform first an initial correspondence analysis to group peaks within and
across samples.
```{r faahKO-initial-correspondence, message = FALSE }
## Define the parameter for the correspondence
pdparam <- PeakDensityParam(sampleGroups = pData(xod)$sample_group,
minFraction = 0.7, maxFeatures = 100)
xod <- groupChromPeaks(xod, param = pdparam)
```
Before performing the alignment we can also inspect which peak groups might be
selected for alignment based on the provided `PeakGroupsParam` object.
```{r faahKO-peak-groups-matrix, message = FALSE }
## Create the parameter class for the alignment
pgparam <- PeakGroupsParam(minFraction = 0.9, span = 0.4)
## Extract the matrix with (raw) retention times for the peak groups that would
## be used for alignment.
adjustRtimePeakGroups(xod, param = pgparam)
```
If we are not happy with these peak groups (e.g. because we don't have a peak
group for a rather large time span along the retention time axis) we can try
different settings. In addition, we could also *manually* select certain peak
groups, e.g. for internal controls, and add this matrix with the
`peakGroupsMatrix` method to the `PeakGroupsParam` class. Below we just use `pgparam`
we defined and perform the alignment. This will use the peak groups matrix from
above.
```{r faahKO-peak-groups-alignment, message = FALSE }
## Perform the alignment using the peak groups method.
xod <- adjustRtime(xod, param = pgparam)
```
We can now also plot the difference between adjusted and raw retention times. If
alignment was performed using the *peak groups* method, also these peak groups are
highlighted in the plot.
```{r faahKO-peak-groups-alignment-plot, message = FALSE, fig.align = "center", fig.width = 8, fig.height = 4 }
plotAdjustedRtime(xod, col = sample_colors[pData(xod)$sample_group])
```
## New naming convention
Peaks identified in LC/GC-MS metabolomics are referred to as *chromatographic
peaks* where possible to avoid any misconceptions with *mass peaks* identified in
mz dimension.
Methods for data analysis from the original `xcms` code have been renamed to avoid
potential confusions:
- **Chromatographic peak detection**: `findChromPeaks` instead of `findPeaks`: for new
functions and methods the term *peak* is avoided as much as possible, as it is
usually used to describe a mass peak in mz dimension. To clearly distinguish
between these peaks and peaks in retention time space, the latter are referred
to as *chromatographic peak*, or `chromPeak`.
- **Correspondence**: `groupChromPeaks` instead of `group` to clearly indicate what is
being grouped. Group might be a sample group or a peak group, the latter being
referred to also by (mz-rt) *feature*.
- **Alignment**: `adjustRtime` instead of `retcor` for retention time correction. The
word *cor* in *retcor* might be easily misinterpreted as *correlation* instead of
correction.
## New data classes
### `OnDiskMSnExp`
This object is defined and documented in the `MSnbase` package. In brief, it is a
container for the full raw data from an MS-based experiment. To keep the memory
footprint low the mz and intensity values are only loaded from the raw data
files when required. The `OnDiskMSnExp` object replaces the `xcmsRaw` object.
### `XCMSnExp`
The `XCMSnExp` class extends the `OnDiskMSnExp` object from the `MSnbase` package and
represents a container for the xcms-based preprocessing results while (since it
inherits all functionality from its parent class) keeping a direct relation to
the (raw) data on which the processing was performed. An additional slot
`.processHistory` in the object allows to keep track of all performed processing
steps. Each analysis method, such as `findChromPeaks` adds an `XProcessHistory`
object which includes also the parameter class passed to the analysis
method. Hence not only the time and type of the analysis, but its exact settings
are reported within the `XCMSnExp` object. The `XCMSnExp` is thus equivalent to the
`xcmsSet` from the original `xcms` implementation, but keeps in addition a link to
the raw data on which the preprocessing was performed.
### `Chromatogram`
The `Chromatogram` class (available in the `MSnbase` package since version 2.3.8)
allows a data representation that is orthogonal to the `Spectrum` class (also
defined in `MSnbase`). The `Chromatogram` class stores retention time and intensity
duplets and is designed to accommodate most use cases, from total ion
chromatogram, base peak chromatogram to extracted ion chromatogram and SRM/MRM
ion traces.
`Chromatogram` objects can be extracted from `XCMSnExp` (and `MSnExp` and
`OnDiskMSnExp`) objects using the `chromatogram` method.
Note that this class is still considered developmental and might thus undergo
some changes in the future.
## Binning and missing value imputation functions
The binning/profile matrix generation functions have been completely
rewritten. The new `binYonX` function replaces the binning of intensity values
into bins defined by their m/z values implemented in the `profBin`, `profBinLin` and
`profBinLinBase` methods. The `binYonX` function provides also additional functionality:
- Breaks for the bins can be defined based on either the number of desired bins
(`nBins`) or the size of a bin (`binSize`). In addition it is possible to provide
a vector with pre-defined breaks. This allows to bin data from multiple files
or scans on the same bin-definition.
- The function returns a list with element `y` containing the binned values and
element `x` the bin mid-points.
- Values in input vector `y` can be aggregated within each bin with different
methods: `max`, `min`, `sum` and `mean`.
- The index of the largest (or smallest for `method` being "min") within each bin
can be returned by setting argument `returnIndex` to `TRUE`.
- Binning can be performed on single or multiple sub-sets of the input vectors
using the `fromIdx` and `toIdx` arguments. This replaces the *M* methods (such as
`profBinM`). These sub-sets can be overlapping.
The missing value imputation logic inherently build into the `profBinLin` and
`profBinLinBase` methods has been implemented in the `imputeLinInterpol` function.
The example below illustrates the binning and imputation with the `binYtoX` and
`imputeLinInterpol` functions. After binning of the test vectors below some of the
bins have missing values, for which we impute a value using
`imputeLinInterpol`. By default, `binYonX` selects the largest value within each
bin, but other aggregation methods are also available (i.e. min, max, mean,
sum).
```{r message = FALSE }
## Defining the variables:
set.seed(123)
X <- sort(abs(rnorm(30, mean = 20, sd = 25))) ## 10
Y <- abs(rnorm(30, mean = 50, sd = 30))
## Bin the values in Y into 20 bins defined on X
res <- binYonX(X, Y, nBins = 22)
res
```
As a result we get a `list` with the bin mid-points (`$x`) and the binned `y` values
(`$y`).
Next we use two different imputation approaches, a simple linear interpolation
and the linear imputation approach that was defined in the `profBinLinBase`
method. The latter performs linear interpolation only considering a certain
neighborhood of missing values otherwise replacing the `NA` with a base value.
```{r binning-imputation-example, message = FALSE, fig.width = 10, fig.height = 7, fig.cap = 'Binning and missing value imputation results. Black points represent the input values, red the results from the binning and blue and green the results from the imputation (with method lin and linbase, respectively).' }
## Plot the actual data values.
plot(X, Y, pch = 16, ylim = c(0, max(Y)))
## Visualizing the bins
abline(v = breaks_on_nBins(min(X), max(X), nBins = 22), col = "grey")
## Define colors:
point_colors <- paste0(brewer.pal(4, "Set1"), 80)
## Plot the binned values.
points(x = res$x, y = res$y, col = point_colors[1], pch = 15)
## Perform the linear imputation.
res_lin <- imputeLinInterpol(res$y)
points(x = res$x, y = res_lin, col = point_colors[2], type = "b")
## Perform the linear imputation "linbase"
res_linbase <- imputeLinInterpol(res$y, method = "linbase")
points(x = res$x, y = res_linbase, col = point_colors[3], type = "b", lty = 2)
```
The difference between the linear interpolation method `lin` and `linbase` is that
the latter only performs the linear interpolation in a pre-defined neighborhood
of the bin with the missing value (`1` by default). The other missing values are
set to a base value corresponding to half of the smallest bin value. Both
methods thus yield same results, except for bins 15-17 (see Figure above).
## Core functionality exposed *via* simple functions
The core logic from the chromatographic peak detection methods
`findPeaks.centWave`, `findPeaks.massifquant`, `findPeaks.matchedFilter` and
`findPeaks.MSW` and from all alignment (`group.*`) and correspondence (`retcor.*`)
methods has been extracted and put into functions with the common prefix
`do_findChromPeaks`, `do_adjustRtime` and `do_groupChromPeaks`, respectively, with the
aim, as detailed in issue [#30](https://github.com/sneumann/xcms/issues/30), to separate the core logic from the analysis
methods invoked by the users to enable also the use these methods using base R
parameters (i.e. without specific classes containing the data such as the
`xcmsRaw` class). This simplifies also the re-use of these functions in other
packages and simplifies the future implementation of the peak detection
algorithms for e.g. the `MSnExp` or `OnDiskMSnExp` objects from the `MSnbase`
Bioconductor package. The implemented functions are:
- **peak detection methods**:
- `do_findChromPeaks_centWave`: peak density and wavelet based peak detection
for high resolution LC/MS data in centroid mode [@Tautenhahn:2008fx].
- `do_findChromPeaks_matchedFilter`: identification of peak in the
chromatographic domain based on matched filtration [@Smith:2006ic].
- `do_findChromPeaks_massifquant`: identification of peaks using Kalman
filters.
- `do_findChromPeaks_MSW`: single spectrum, non-chromatographic peak detection.
- **alignment methods**:
- `do_adjustRtime_peakGroups`: perform sample alignment (retention time
correction) using alignment of *well behaved* chromatographic peaks that are
present in most samples (and are expected to have the same retention time).
- **correspondence methods**:
- `do_groupChromPeaks_density`: perform chromatographic peak grouping (within
and across samples) based on the density distribution of peaks along the
retention time axis.
- `do_groupChromPeaks_nearest`: groups peaks across samples similar to the
method implemented in mzMine.
- `do_groupChromPeaks_mzClust`: performs high resolution correspondence on
single spectra samples.
One possible drawback from the introduction of this new layer is, that more
objects get copied by R which *could* eventually result in a larger memory demand
or performance decrease (while no such was decrease was observed up to now).
## Usability improvements in the *old* user interface
- `[` subsetting method for `xcmsRaw` objects that enables to subset an `xcmsRaw`
object to specific scans/spectra.
- `profMat` method to extract the *profile* matrix from the `xcmsRaw` object. This
method should be used instead of directly accessing the `@env$profile` slot, as
it will create the profile matrix on the fly if it was not pre-calculated (or
if profile matrix generation settings have been changed).
# Changes due to bug fixes and modified functionality
## Differences in linear interpolation of missing values (`profBinLin`).
From `xcms` version 1.51.1 on the new binning functions are used, thus, the bug
described here are fixed.
Two bugs are present in the `profBinLin` method (reported as issues [#46](https://github.com/sneumann/xcms/issues/46) and [#49](https://github.com/sneumann/xcms/issues/49) on
github) which are fixed in the new `binYonX` and `imputeLinInterpol` functions:
- The first bin value calculated by `profBinLin` can be wrong (i.e. not being the
max value within that bin, but the first).
- If the last bin contains also missing values, the method fails to determine
a correct value for that bin.
The `profBinLin` method is used in `findPeaks.matchedFilter` if the profile
method is set to "binlin".
The example below illustrates both differences.
```{r }
## Define a vector with empty values at the end.
X <- 1:11
set.seed(123)
Y <- sort(rnorm(11, mean = 20, sd = 10))
Y[9:11] <- NA
nas <- is.na(Y)
## Do interpolation with profBinLin:
resX <- xcms:::profBinLin(X[!nas], Y[!nas], 5, xstart = min(X),
xend = max(X))
resX
res <- binYonX(X, Y, nBins = 5L, shiftByHalfBinSize = TRUE)
resM <- imputeLinInterpol(res$y, method = "lin",
noInterpolAtEnds = TRUE)
resM
```
Plotting the results helps to better compare the differences. The black points
in the figure below represent the actual values of `Y` and the grey vertical lines
the breaks defining the bins. The blue lines and points represent the result
from the `profBinLin` method. The bin values for the first and 4th bin are clearly
wrong. The green colored points and lines represent the results from the `binYonX`
and `imputeLinInterpol` functions (showing the correct binning and interpolation).
```{r profBinLin-problems, message = FALSE, fig.align = 'center', fig.width=10, fig.height = 7, fig.cap = "Illustration of the two bugs in profBinLin. The input values are represented by black points, grey vertical lines indicate the bins. The results from binning and interpolation with profBinLin are shown in blue and those from binYonX in combination with imputeLinInterpol in green." }
plot(x = X, y = Y, pch = 16, ylim = c(0, max(Y, na.rm = TRUE)),
xlim = c(0, 12))
## Plot the breaks
abline(v = breaks_on_nBins(min(X), max(X), 5L, TRUE), col = "grey")
## Result from profBinLin:
points(x = res$x, y = resX, col = "blue", type = "b")
## Results from imputeLinInterpol
points(x = res$x, y = resM, col = "green", type = "b",
pch = 4, lty = 2)
```
Note that by default `imputeLinInterpol` would also interpolate missing values at
the beginning and the end of the provided numeric vector. This can be disabled
(to be compliant with `profBinLin`) by setting parameter `noInterpolAtEnds` to
`TRUE` (like in the example above).
## Differences due to updates in `do_findChromPeaks_matchedFilter`, respectively `findPeaks.matchedFilter`.
The original `findPeaks.matchedFilter` (up to version 1.49.7) had several
shortcomings and bugs that have been fixed in the new
`do_findChromPeaks_matchedFilter` method:
- The internal iterative processing of smaller chunks of the full data (also
referred to as *iterative buffering*) could result, for some bin (step) sizes to
unstable binning results (discussed in issue [#47](https://github.com/sneumann/xcms/issues/47) on github): calculation of
the breaks, or to be precise, the actually used bin size was performed in each
iteration and could lead to slightly different sizes between iterations (due
to rounding errors caused by floating point number representations in C).
- The iterative buffering raises also a conceptual issue when linear
interpolation is performed to impute missing values: the linear imputation
will only consider values within the actually processed buffer and can thus
lead to wrong or inaccurate imputations.
- The `profBinLin` implementation contains two bugs, one that can result in
failing to identify the maximal value in the first and last bin (see issue
[#46](https://github.com/sneumann/xcms/issues/46)) and one that fails to assign a value to a bin (issue [#49](https://github.com/sneumann/xcms/issues/49)). Both are fixed
in the `do_findChromPeaks_matchedFilter` implementation.
A detailed description of tests comparing all implementations is available in
issue [#52](https://github.com/sneumann/xcms/issues/52) on github. Note also that in course of these changes also the `getEIC`
method has been updated to use the new binning and missing value imputation
function.
While it is strongly discouraged, it is still possible to use to *old* code (from
1.49.7) by calling `useOriginalCode(TRUE)`.
## Differences in `findPeaks.massifquant`
- Argument `scanrange` was ignored in the *original* old code (issue [#61](https://github.com/sneumann/xcms/issues/61)).
- The method returned a `matrix` if `withWave` was `0` and a `xcmsPeaks` object
otherwise. The updated version returns **always** an `xcmsPeaks` object (issue #60).
## Differences in *obiwarp* retention time correction
Retention time correction using the obiwarp method uses the *profile* matrix
(i.e. intensities binned in discrete bins along the mz axis). Profile matrix
generation uses now the `binYonX` method which fixed some problems in the original
binning and linear interpolation methods. Thus results might be slightly
different.
Also, the `retcor.obiwarp` method reports (un-rounded) adjusted retention times,
but adjusts the retention time of eventually already identified peaks using
rounded adjusted retention times. The new `adjustRtime` method(s) does adjust
identified peaks using the reported adjusted retention times (not rounded). This
guarantees that e.g. removing retention time adjustment/alignment results from
an object restores the object to its initial state (i.e. the adjusted retention
times of the identified peaks are reverted to the retention times before
alignment).
See issue [#122](https://github.com/sneumann/xcms/issues/122) for more details.
## `retcor.peaksgroups`: change in the way how *well behaved* peak groups are ordered
The `retcor.peakgroups` defines first the chromatographic peak groups that are
used for the alignment of all spectra. Once these are identified, the retention
time of the peak with the highest intensity in a sample for a given peak group
is returned and the peak groups are ordered increasingly by retention time
(which is required for the later fitting of either a polynomial or a linear
model to the data). The selection of the retention time of the peak with the
highest intensity within a feature (peak group) and samples, denoted as
*representative* peak for a given feature in a sample, ensures that only the
retention time of a single peak per sample and feature is selected (note that
multiple chromatographic peaks within the same sample can be assigned to a
feature). In the original code the ordering of the peak groups was however
performed using the median retention time of the complete peak group (which
includes also potential additional peaks per sample). This has been changed and
the features are ordered now by the median retention time across samples of the
representative chromatographic peaks.
## `scanrange` parameter in all `findPeaks` methods
The `scanrange` in the `findPeaks` methods is supposed to enable the peak detection
only within a user-defined range of scans. This was however not performed in
each method. Due to a bug in `findPeaks.matchedFilter`'s original code the
argument was ignored, except if the upper scan number of the user defined range
was larger than the total number of available scans (see issue [#63](https://github.com/sneumann/xcms/issues/63)). In
`findPeaks.massifquant` the argument was completely ignored (see issue [#61](https://github.com/sneumann/xcms/issues/61)) and,
while the argument was considered in `findPeaks.centWave` and feature detection
was performed within the specified scan range, but the original `@scantime` slot
was used throughout the code instead of just the scan times for the specified
scan indices (see issue [#64](https://github.com/sneumann/xcms/issues/64)).
These problems have been fixed in version 1.51.1 by first sub-setting the
`xcmsRaw` object (using the `[` method) before actually performing the feature
detection.
## `fillPeaks` (`fillChromPeaks`) differences
In the original `fillPeaks.MSW`, the mz range from which the signal is to be
integrated was defined using
```{r eval = FALSE }
mzarea <- seq(which.min(abs(mzs - peakArea[i, "mzmin"])),
which.min(abs(mzs - peakArea[i, "mzmax"])))
```
Depending on the data this could lead to the inclusion of signal in the
integration that are just outside of the mz range. In the new `fillChromPeaks`
method signal is integrated only for mz values >= mzmin and <= mzmax thus
ensuring that only signal is used that is truly within the peak area defined by
columns `"mzmin"`, `"mzmax"`, `"rtmin"` and `"rtmax"`.
Also, the `fillPeaks.chrom` method did return `"into"` and `"maxo"` values of `0` if no
signal was found in the peak area. The new method does not integrate any signal
in such cases and does not fill in that peak.
See also issue [#130](https://github.com/sneumann/xcms/issues/130) for more
information.
# Under the hood changes
These changes and updates will not have any large impact on the day-to-day use of
`xcms` and are listed here for completeness.
- From `xcms` version 1.51.1 on the default methods from the `mzR` package are used
for data import. Besides ensuring easier maintenance, this enables also data
import from *gzipped* mzML files.
# Deprecated functions and files
Here we list all of the functions and related files that are deprecated.
- `xcmsParallelSetup`, `xcmsPapply`, `xcmsClusterApply`: use `BiocParallel` package
instead to setup and perform parallel processing, either *via* the `BPPARAM`
parameter to function and methods, or by calling `register` to globally set
parallel processing.
- `profBin`, `profBinM`, `profBinLin`, `profBinLinM`, `profBinLinBase`, `profBinLinBaseM`:
replaced by the `binYonX` and `imputeLinInterpol` functions. Also, to create or
extract the profile matrix from an `xcmsRaw` object, the `profMat` method.
## Deprecated
### xcms 1.49:
- `xcmsParallelSetup` (Deprecated.R)
- `xcmsPapply` (Deprecated.R)
- `xcmsClusterApply` (Deprecated.R)
### xcms 1.51:
- `profBin` (c.R)
- `profBinM` (c.R)
- `profBinLin` (c.R)
- `profBinLinM` (c.R)
- `profBinLinBase` (c.R)
- `profBinLinBaseM` (c.R)
## Defunct
# References