CCS 6.0.0
armintoepfer committed Dec 16, 2020
1 parent 3da9e7c commit 178f15a
Showing 12 changed files with 138 additions and 44 deletions.
9 changes: 8 additions & 1 deletion docs/changelog.md
@@ -6,7 +6,14 @@ nav_order: 99

# Version changelog

**5.0.0**
**6.0.0**
* Increase number of HiFi reads
* Increase percentage of barcode yield
* Run time, CPU time, and peak RSS improvements
* Change main draft algorithm from pbdagcon to sparc
* Replace minimap2 with pancake and edlib/KSW2

**5.0.0**
* SMRT Link v10.0 release
* Add `--hifi-kinetics` to average kinetic information for polished reads
* Add `--all-kinetics` to add kinetic information for all ZMWs, except for unpolished draft consensus
4 changes: 2 additions & 2 deletions docs/faq/bioconda-binary.md
@@ -16,10 +16,10 @@ A modern (post-2008) CPU with support for
SMRT Link also has this requirement.

**`FATAL: kernel too old`** Your OS or rather your kernel version is not supported.
Since CCS v4.2 we also ship a second binary via bioconda `ccs-alt`, which does
Since _ccs_ v4.2 we also ship a second binary via bioconda `ccs-alt`, which does
not bundle a newer `glibc`. Please use this alternative binary.

For CCS v5.0, we offer two binaries in bioconda:
For _ccs_, we offer two binaries in bioconda:

* `ccs`, statically links `glibc` v2.32 and `mimalloc` v1.3.0.
* `ccs-alt`, was built by dynamically linking `glibc` v2.12, but statically links `mimalloc` v1.3.0.
26 changes: 26 additions & 0 deletions docs/faq/chemistry.md
@@ -3,6 +3,32 @@ layout: default
parent: FAQ
title: Chemistry
---
## Supported chemistries
The latest _ccs_ v6 supports the following combinations of binding and
sequencing kit part numbers:

| BindingKit | SequencingKit | Chemistry | System |
| :---------: | :-----------: | :--------------: | :-------: |
| 101-500-400 | 101-427-500 | S/P3-C3/5.0 | Sequel |
| 101-500-400 | 101-427-800 | S/P3-C3/5.0 | Sequel |
| 101-500-400 | 101-646-800 | S/P3-C3/5.0 | Sequel |
| 101-490-800 | 101-490-900 | S/P3-C1/5.0-8M | Sequel II |
| 101-490-800 | 101-491-000 | S/P3-C1/5.0-8M | Sequel II |
| 101-490-800 | 101-644-500 | S/P3-C1/5.0-8M | Sequel II |
| 101-490-800 | 101-717-100 | S/P3-C1/5.0-8M | Sequel II |
| 101-717-300 | 101-644-500 | S/P3-C1/5.0-8M | Sequel II |
| 101-717-300 | 101-717-100 | S/P3-C1/5.0-8M | Sequel II |
| 101-717-400 | 101-644-500 | S/P3-C1/5.0-8M | Sequel II |
| 101-717-400 | 101-717-100 | S/P3-C1/5.0-8M | Sequel II |
| 101-789-500 | 101-789-300 | S/P4-C2/5.0-8M | Sequel II |
| 101-820-500 | 101-789-300 | S/P4.1-C2/5.0-8M | Sequel II |
| 101-789-500 | 101-826-100 | S/P4-C2/5.0-8M | Sequel II |
| 101-789-500 | 101-820-300 | S/P4-C2/5.0-8M | Sequel II |
| 101-820-500 | 101-826-100 | S/P4.1-C2/5.0-8M | Sequel II |
| 101-820-500 | 101-820-300 | S/P4.1-C2/5.0-8M | Sequel II |
| 101-894-200 | 101-826-100 | S/P5-C2/5.0-8M | Sequel II |
| 101-894-200 | 101-789-300 | S/P5-C2/5.0-8M | Sequel II |
| 101-894-200 | 101-820-300 | S/P5-C2/5.0-8M | Sequel II |

## Help! I am getting "Unsupported ..."!
If you encounter the error `Unsupported chemistries found: (...)` or
4 changes: 2 additions & 2 deletions docs/faq/low-complexity.md
@@ -4,7 +4,7 @@ parent: FAQ
title: Low complexity
---

## Does CCS dislike low-complexity regions?
## Does _ccs_ dislike low-complexity regions?
Low-complexity comes in many shapes and forms.
A particular challenge for _ccs_ are highly enriched tandem repeats, like
hundreds of copies of `AGGGGT`.
@@ -13,7 +13,7 @@ a consensus sequence.
Since _ccs_ v5.0, every ZMW is tested if it contains a tandem repeat
of length `--min-tandem-repeat-length 1000`.
For this, we use [symmetric DUST](https://doi.org/10.1089/cmb.2006.13.1028)
and in particular this [sdust](https://github.com/lh3/sdust) implementation,
and in particular the [sdust](https://github.com/lh3/sdust) implementation,
but slightly modified.
If a ZMW is flagged as a tandem repeat, internally `--disable-heuristics`
is activated for only this ZMW, and various filters that are known to exclude
2 changes: 1 addition & 1 deletion docs/faq/mode-all.md
@@ -11,7 +11,7 @@ Similar to the CLR instrument mode, in which subreads are accompanied by
a scraps file, _ccs_ offers a new mode to never lose a single read due to
filtering, without massive run time increase by polishing low-pass productive ZMWs.

Starting with SMRT Link v10.0 and Sequel IIe, _ccs_ v5.0 is able to generate
Starting with SMRT Link v10.0 and Sequel IIe, _ccs_ v5.0 or newer is able to generate
one representative sequence per productive ZMW, irrespective of quality and passes.
This ensures no yield loss due to filtering and enables users to have maximum
control over their data. Never fear again that SMRT Link or the Sequel IIe
93 changes: 65 additions & 28 deletions docs/faq/performance.md
@@ -4,26 +4,71 @@ parent: FAQ
title: Performance
---

## How fast is CCS?
We tested CCS runtime using 500 ZMWs per length bin with exactly 7 passes.
## How fast is _ccs_?
### Latest version
The latest _ccs_ v6 can process 200 GBases HiFi yield in 24 hours for a 25 KBases
library on 2x64 cores at 2.4 GHz.
To put this into perspective for actual sequencing collections:

| Sample | Insert size | HiFi Yield | Run Time |
| :------: | :---------: | :---------: | :------: |
| HG002 | 15 KBases | 41.1 GBases | 5h 52m |
| HG002 | 18 KBases | 34.0 GBases | 4h 36m |
| Redwood  | 25 KBases   | 32.4 GBases | 3h 46m   |

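As a rough cross-check, the table rows can be converted into throughput and compared with the headline figure of 200 GBases in 24 hours. The sketch below is plain arithmetic on the numbers quoted above; the helper name `gbases_per_hour` is made up for illustration and is not part of _ccs_.

```python
def gbases_per_hour(yield_gbases, hours, minutes=0):
    """Throughput in GBases of HiFi yield per wall-clock hour."""
    return yield_gbases / (hours + minutes / 60)


# Headline figure from the text: 200 GBases in 24 hours.
headline = gbases_per_hour(200, 24)

# Rows of the table above: (HiFi yield in GBases, hours, minutes).
runs = {
    "HG002 15 KBases": gbases_per_hour(41.1, 5, 52),
    "HG002 18 KBases": gbases_per_hour(34.0, 4, 36),
    "Redwood 25 KBases": gbases_per_hour(32.4, 3, 46),
}

print(f"headline: {headline:.1f} GBases/hour")
for name, rate in runs.items():
    print(f"{name}: {rate:.1f} GBases/hour")
```

Under these numbers, the 25 KBases library is the row closest to the headline throughput, which matches the library size quoted for that figure.
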
### Relative performance v3.0 to v6.0
Current _ccs_ v6 achieves a >150x speed-up for 20 KBases inserts compared to
v3.0 from the SMRT Link 6.0 release in 2018.

### Algorithmic complexity
To understand how this performance gain was possible, here is an overview of how
the algorithmic complexity changed and how _ccs_ scales with insert size and number of passes:

| CCS version | O(insert size) | O(#passes) |
| :---------: | :------------: | :-----------: |
| ≤3.0.0 | quadratic | linear |
| 3.4.1 | **linear** | linear |
| ≥4.0.0 | linear | **sublinear** |

To visualize this table, we benchmarked runtime using 500 ZMWs per length bin with
exactly 7 passes.

<img width="1000px" src="../img/runtime.png"/>

### How does that translate into time to result per SMRT Cell?
We will measure time to result for Sequel System and Sequel System II CCS sequencing collections
on a PacBio recommended HPC, according to the
[Sequel II System Compute Requirements](https://www.pacb.com/wp-content/uploads/SMRT_Link_Installation_v701.pdf)
with 192 physical or 384 hyper-threaded cores.
After v4.0.0, the slope of the curve does not change, as the complexity class
hasn't changed; only improvements independent of input type were made.

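To make the complexity classes in the table above concrete, the following sketch compares how run time would grow with insert size under quadratic and linear scaling. It is a simplified model that ignores constant factors and everything except insert size; the function name `scaling_factor` is hypothetical.

```python
def scaling_factor(old_size, new_size, exponent):
    """Relative cost increase when the insert size grows from old_size to new_size.

    exponent=2 models quadratic scaling (the table's row for <=3.0.0),
    exponent=1 models linear scaling (3.4.1 and newer).
    """
    return (new_size / old_size) ** exponent


# Doubling the insert size from 10 kb to 20 kb:
quadratic = scaling_factor(10_000, 20_000, exponent=2)
linear = scaling_factor(10_000, 20_000, exponent=1)

print(f"quadratic: {quadratic:.0f}x the cost")
print(f"linear:    {linear:.0f}x the cost")
```

In this model, doubling the insert size quadruples the cost under quadratic scaling but only doubles it under linear scaling, so the gap between the two classes widens as libraries get longer.
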
### Performance comparisons
Performance comparisons on different libraries; the `faster` column is with
respect to the run time of the previous version. All runs were performed on the
same hardware with 256 threads. A major part of the speed increase in v5.0 is
due to toolchain improvements for generating a more optimized binary.
#### **HG002 15kb SQII, 41 GBases HiFi yield**

1) Sequel System: 15 kb insert size, 24-hours movie, 37 GB raw yield, 2.3 GB HiFi UMY
2) Sequel II System: 15 kb insert size, 30-hours movie, 340 GB raw yield, 24 GB HiFi UMY
| CCS Version | HiFi Reads | Run Time | CPU Time | Peak RSS | Faster |
| :---------: | :--------: | :------: | :------: | :------: | :----: |
| 4.0.0 | 2,765,431 | 13h 14m | 89d 13h | 71 GB | |
| 4.2.0 | 2,806,886 | 10h 47m | 61d 9h | 72 GB | 18% |
| 5.0.0 | 2,807,317 | 6h 44m | 62d 22h | 27 GB | 37% |
| 6.0.0 | 2,831,192 | 5h 52m | 44d 17h | 20 GB | 13% |

| CCS version | Sequel System | Sequel II System |
| :-: | :-: | :-: |
| ≤3.0.0 | 1 day | >1 week |
| 3.4.1 | 3 hours | >1 day |
| 4.0.0 | 40 minutes | 6 hours |
| ≥4.2.0 | **30 minutes** | **4 hours** |
#### **HG002 18kb SQII, 32 GBases HiFi yield**
Omitting v4.0.0, due to lack of chemistry support.

| CCS Version | HiFi Reads | Run Time | CPU Time | Peak RSS | Faster |
| :---------: | :--------: | :------: | :------: | :------: | :----: |
| 4.2.0       | 1,823,016  | 8h 35m   | 47d 13h  | 80 GB    |        |
| 5.0.0       | 1,824,206  | 5h 29m   | 50d 16h  | 46 GB    | 36%    |
| 6.0.0       | 1,855,604  | 4h 36m   | 30d 13h  | 18 GB    | 15%    |

#### **Redwood 25kb SQII, 32 GBases HiFi yield**

| CCS Version | HiFi Reads | Run Time | CPU Time | Peak RSS | Faster |
| :---------: | :--------: | :------: | :------: | :------: | :----: |
| 4.0.0 | 1,269,680 | 7h 58m | 60d 19h | 72 GB | |
| 4.2.0 | 1,310,775 | 6h 37m | 43d 18h | 74 GB | 17% |
| 5.0.0 | 1,311,693 | 4h 36m | 41d 13h | 41 GB | 30% |
| 6.0.0 | 1,335,888 | 3h 56m | 25d 11h | 17 GB | 14% |

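The `Faster` column can be approximated from the `Run Time` column. The sketch below assumes that `Faster` means the relative reduction in run time compared to the previous version; this interpretation and the helper names are assumptions, not taken from the _ccs_ sources. It uses the HG002 15kb run times from the first table.

```python
import re


def to_minutes(run_time):
    """Convert a run time such as '13h 14m' into minutes."""
    match = re.fullmatch(r"(\d+)h (\d+)m", run_time)
    if match is None:
        raise ValueError(f"unexpected run time format: {run_time!r}")
    hours, minutes = map(int, match.groups())
    return hours * 60 + minutes


def percent_faster(previous, current):
    """Relative run-time reduction of `current` versus `previous`, in percent."""
    previous_minutes = to_minutes(previous)
    return 100 * (previous_minutes - to_minutes(current)) / previous_minutes


# HG002 15kb run times per version, copied from the table above.
run_times = [
    ("4.0.0", "13h 14m"),
    ("4.2.0", "10h 47m"),
    ("5.0.0", "6h 44m"),
    ("6.0.0", "5h 52m"),
]

# Compare each version with its predecessor.
for (_, previous), (version, current) in zip(run_times, run_times[1:]):
    print(f"{version}: {percent_faster(previous, current):.1f}% faster")
```

The results land within one percentage point of the published 18%, 37%, and 13%; the remaining differences presumably come from rounding of the published timings.
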
### How is CCS speed affected by raw base yield?
Raw base yield is the sum of all polymerase read lengths.
@@ -39,14 +84,6 @@ ZMWs per SMRT Cell.
Starting with version 3.3.0 _ccs_ scales linearly in (2) the polymerase read length
and with version 4.0.0 _ccs_ scales sublinearly.

### What did change in each version?

| CCS version | O(insert size) | O(#passes) |
| :-: | :-: | :-: |
| ≤3.0.0 | quadratic | linear |
| 3.4.1 | **linear** | linear |
| ≥4.0.0 | linear | **sublinear** |

### How can version 4.0.0 be sublinear in the number of passes?
With the introduction of improved heuristics, individual draft bases can skip
polishing if they are of sufficient quality.
@@ -57,13 +94,13 @@ No, we optimized _ccs_ such that there is a good balance between speed and
output quality.

## Does speed impact quality and yield?
Yes it does. With ~35x speed improvements from version 3.1.0 to 4.0.0 and
consequently reducing CPU time from >60,000 to <2,000 core hours,
heuristics and changes in algorithms lead to slightly lower yield and
Yes it does. With >150x speed improvements from version 3.0 to 6.0,
heuristics and changes in algorithms lead to slightly different yield and
accuracy if run head-to-head on the same data set. Internal tests show
that _ccs_ 4.0.0 introduces no regressions in CCS-only Structural Variant
that _ccs_ 6.0 introduces no regressions in _ccs_-only Structural Variant
calling and has minimal impact on SNV and indel calling in DeepVariant.
In contrast, lower DNA quality has a bigger impact on quality and yield.
In contrast, DNA quality and sample preparation have a bigger impact
on quality and yield.

## Can I tune performance without sacrificing output quality?
The bioconda _ccs_ ≥v5.0 binaries statically link [mimalloc](https://github.com/microsoft/mimalloc).
22 changes: 22 additions & 0 deletions docs/faq/reports-aux-files.md
@@ -41,6 +41,28 @@ The following comments refer to the filters that are explained in the FAQ above.
If run in `--by-strand` mode, rows may contain half ZMWs, as we account
each strand as half a ZMW.

### Coverage drops
Example of a coverage drop in a single ZMW; subreads are colored by strand orientation:

<p align="center">
<img width="500px" src="../img/coveragedrop.png" />
</p>

During sequencing of the molecule, one strand exhibits 744 more bases than its
reverse complemented strand. What happened?
Either there is a gain or loss of information.
An explanation for loss of information could be that a secondary structure,
the 744 bp forming a hairpin, could affect the replication during PCR and lead
to loss of bases.
Gain of information could also happen during PCR, when the polymerase gets stuck
and incorporates the current base too often.
In this example, there is a homopolymer of 744 `A` bases.
While it might be obvious to the human eye what happened,
it is not the responsibility of _ccs_ to interpret and recover molecular damage.
Even if there were a low-complexity filter for those regions, setting the
appropriate threshold would be arbitrary;
would a 10 bp homopolymer insertion be valid, but an 11 bp one get discarded?

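To illustrate why any such threshold would be arbitrary, here is a small sketch that measures the length gap between the two strands of one ZMW. It is not how _ccs_ detects coverage drops; the function name and the subread lengths are made up, chosen only so that the gap matches the 744 bases from the example above.

```python
from statistics import median


def strand_length_gap(forward_lengths, reverse_lengths):
    """Absolute difference between the median subread lengths of both strands."""
    return abs(median(forward_lengths) - median(reverse_lengths))


# Hypothetical subread lengths for a single ZMW, split by strand.
forward = [15744, 15740, 15748]
reverse = [15000, 14996, 15004]

gap = strand_length_gap(forward, reverse)
print(f"strand length gap: {gap} bases")

# Any cutoff separates nearly identical cases: a gap of 10 passes, 11 does not.
hypothetical_cutoff = 10
for candidate_gap in (10, 11):
    verdict = "keep" if candidate_gap <= hypothetical_cutoff else "discard"
    print(f"gap of {candidate_gap} bases -> {verdict}")
```
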
## How do I read the zmw_metrics.json file?
By default, each _ccs_ run generates a `<outputPrefix>.zmw_metrics.json.gz` file.
Change the file name with `--metrics-json`.
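
Because the file is gzip-compressed JSON, it can be loaded with the Python standard library alone. The sketch below makes no assumption about the fields inside the file; the function name and the example path are placeholders.

```python
import gzip
import json


def load_zmw_metrics(path):
    """Read a gzip-compressed JSON metrics file into Python objects."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)


# Example usage with a placeholder path:
# metrics = load_zmw_metrics("myrun.zmw_metrics.json.gz")
# print(type(metrics))
```
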
20 changes: 11 additions & 9 deletions docs/how-does-ccs-work.md
@@ -35,22 +35,24 @@ To avoid improper mappings, short subreads are excluded.
The polish stage iteratively improves upon a candidate template sequence.
Because polishing is very compute intensive, it is desirable to start with a
template that is as close as possible to the true sequence of the molecule to
reduce the number of iterations until convergence.
So, the _ccs_ software does not pick a full-length subread as the initial
template to be polished, but instead generates an approximate draft consensus
sequence using graph algorithms like [partial-order alignment](https://academic.oup.com/bioinformatics/article/18/3/452/236691) (POA)
[consensus](https://academic.oup.com/bioinformatics/article/19/8/999/235258),
employing an accelerated implementation called [SPOA](https://github.com/rvaser/spoa),
or our own alignment graph consensus caller, called pbdagcon.
reduce the number of iterations until convergence. The _ccs_ software does
not pick a full-length subread as the initial template to be polished, but
instead generates an approximate draft consensus sequence using our improved
implementation of the [Sparc graph consensus algorithm](https://doi.org/10.7717/peerj.2016).
This algorithm depends on a subread-to-backbone alignment that is generated
by our own mapper [pancake](https://github.com/PacificBiosciences/pancake)
using [edlib](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5408825/) as the core
aligner.
Typically, subreads have accuracy of around 90% and the draft consensus has a
higher accuracy, but depending on the algorithm employed is still below 99%.
higher accuracy, but is still below 99%.

<p align="center"><img width="1000px" src="img/draft.png"/></p>

Stop if draft length is shorter than `--min-length` or longer than `--max-length`.

## 3. Alignment
Align subreads to the draft consensus for downstream windowing and filtering.
Align subreads to the draft consensus using pancake with
[KSW2](https://github.com/lh3/ksw2) for downstream windowing and filtering.

## 4. Windowing
Divide the subread-to-draft alignment into overlapping windows with a target
Binary file added docs/img/coveragedrop.png
Binary file added docs/img/run-design-kinetics.png
Binary file modified docs/img/run-design-oiccs.png
2 changes: 1 addition & 1 deletion docs/index.md
@@ -31,7 +31,7 @@ Please refer to our [official pbbioconda page](https://github.com/PacificBioscie
for information on Installation, Support, License, Copyright, and Disclaimer.

## Latest Version
Version **5.0.0**: [Full changelog here](/changelog)
Version **6.0.0**: [Full changelog here](/changelog)

## What's new!
_ccs_ is now running on the Sequel IIe instrument, transferring HiFi reads