Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Covid19 #140

Merged
merged 10 commits into from
Mar 4, 2021
19 changes: 12 additions & 7 deletions bio_hansel/main.py
Original file line number Diff line number Diff line change
Expand Up @@ -12,15 +12,15 @@
import attr
import pandas as pd

from . import program_desc, __version__, program_name
from .const import SUBTYPE_SUMMARY_COLS, REGEX_FASTQ, REGEX_FASTA, JSON_EXT_TMPL
from .subtype import Subtype
from .subtype_stats import subtype_counts
from .subtyper import \
from bio_hansel import program_desc, __version__, program_name
from bio_hansel.const import SUBTYPE_SUMMARY_COLS, REGEX_FASTQ, REGEX_FASTA, JSON_EXT_TMPL
from bio_hansel.subtype import Subtype
from bio_hansel.subtype_stats import subtype_counts
from bio_hansel.subtyper import \
subtype_contigs_samples, \
subtype_reads_samples
from .metadata import read_metadata_table, merge_results_with_metadata
from .utils import (genome_name_from_fasta_path, get_scheme_fasta, does_file_exist, collect_fastq_from_dir,
from bio_hansel.metadata import read_metadata_table, merge_results_with_metadata
from bio_hansel.utils import (genome_name_from_fasta_path, get_scheme_fasta, does_file_exist, collect_fastq_from_dir,
group_fastqs, collect_fasta_from_dir, init_subtyping_params, df_field_fillna)

SCRIPT_NAME = 'hansel'
Expand Down Expand Up @@ -81,6 +81,9 @@ def init_parser():
parser.add_argument('--min-kmer-freq',
type=int,
help='Min k-mer freq/coverage')
parser.add_argument('--min-kmer-frac',
type=float,
help='Proportion of k-mer required for detection (0.0 - 1)')
parser.add_argument('--max-kmer-freq',
type=int,
help='Max k-mer freq/coverage')
Expand Down Expand Up @@ -258,6 +261,8 @@ def main():
if output_kmer_results:
if len(dfs) > 0:
dfall: pd.DataFrame = pd.concat([df.sort_values('is_pos_kmer', ascending=False) for df in dfs], sort=False)
#Error message is redundant accross each of the k-mers
dfall = dfall.drop(columns=['qc_message'])
dfall = df_field_fillna(dfall)
dfall.to_csv(output_kmer_results, **kwargs_for_pd_to_table)
logging.info('Kmer results written to "{}".'.format(output_kmer_results))
Expand Down
21 changes: 21 additions & 0 deletions bio_hansel/subtyper.py
Original file line number Diff line number Diff line change
Expand Up @@ -228,6 +228,24 @@ def parallel_query_reads(reads: List[Tuple[List[str], str]],
outputs = [x.get() for x in res]
return outputs

def filter_by_kmer_fraction(df,min_kmer_frac=0.05):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @jrober84 I'm having trouble understanding what this function is supposed to do.

Is kmer fraction supposed to be like the alternate allele fraction (AF) from variant calling?

What is a noisy kmer? Any kmer that is observed at a low frequency relative to the sum of frequencies of all kmers observed at that position?

kmer_freq / sum(kmer_freqs_at_position) < min_kmer_frac

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @peterk87, I have updated the description in the code to be a bit more clear. But essentially, the function determines the total number of positive and negative kmers covering a position and then determines the percentage of the total coverage each k-mer for the position contributes. In the case where only a single k-mer is present it will always be one and will not ever be filtered. But when both positive and negative are present it will see if the percentage contribution of each k-mer to the total is above the set threshold. If it isn't that specific k-mer gets filtered from the data frame and so the QA/QC module will never see it. I would say this process has some similarity to the AF for variant calling but also will allow the user to configure an acceptable "contamination" level for their sample which for their application is not an issue.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @jrober84

I'm not sure this code is doing what your description says it's doing. I don't see any accessing of kmer frequency values. If I'm interpreting your description above and in the docstring correctly, you need to be calculating the sum of kmer frequencies at each position. With .value_counts() on refposition you're simply getting a count of how many times you observe each refposition value (similar to Python's Counter). It also doesn't make sense that the refposition count is being divided by the refposition value.

Also, when possible, I think it's a better idea to include all results for the detailed results report for debugging and troubleshooting rather than filtering those results out. I would instead just modify this line:

st, df = process_subtyping_results(st, df[df.is_kmer_freq_okay], scheme_subtype_counts)

Where instead of just getting the subset of results where is_kmer_okay, kmers that pass the kmer fraction threshold could also be filtered for:

df[(df.is_kmer_freq_okay | (df.kmer_fraction >= subtyping_params.min_kmer_fraction))]

This way no results are removed from the detailed report, and only the "good" kmers are used to determine the subtype result.

So I would propose adding columns like kmer_fraction and maybe total_refposition_kmer_frequency and does_pass_kmer_fraction_threshold. This would be useful information for potential contamination detection.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good suggestion, I have updated the code to no longer be filtering the df for failed k-mers. This way all detected k-mers will appear in the detailed report but the QA/QC will be only on the pass list. Pass kmers must pass both the k-mer freq filter and k-mer frac filter. I have added the additional fields that you suggested for troubleshooting purposes.

"""Filter out noisy kmers from high coverage datasets

Args:
df: BioHansel k-mer frequence pandas df
min_kmer_frac: float 0 - 1 on the minimum fraction a kmer needs to be to be considered valid

Returns:
- pd.DataFrame with k-mers which satisfy the min-fraction
"""
position_counts = df['refposition'].value_counts().rename_axis('position').reset_index(name='counts')
valid_indexes = []
for index,row in df.iterrows():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd recommend using .itertuples() for performance reasons:

for row in df.itertuples():
    refposition = row.refposition # cannot do string based access (row['refposition']) with tuples

You could also look into using apply instead of using a for-loop:

def get_kmer_fraction(row):
    total_freq = position_frequencies.get(row.refposition, 0)
    return row.freq / total_freq if total_freq > 0 else 0.0

df['kmer_fraction'] = df.apply(get_kmer_fraction, axis=1)
df['total_refposition_kmer_frequency'] = df.apply(lambda row: position_frequencies.get(row.refposition, 0), axis=1) 

frac = row['refposition'] / position_counts.loc[position_counts['position'] == row['refposition'], 'counts'].iloc[0]
if frac > min_kmer_frac:
valid_indexes.append(index)
return df[df.index.isin(valid_indexes)]


def subtype_reads(reads: Union[str, List[str]],
genome_name: str,
Expand Down Expand Up @@ -285,6 +303,9 @@ def subtype_reads(reads: Union[str, List[str]],
df['subtype'] = subtypes
df['is_pos_kmer'] = ~df.kmername.str.contains('negative')
df['is_kmer_freq_okay'] = (df.freq >= subtyping_params.min_kmer_freq) & (df.freq <= subtyping_params.max_kmer_freq)
#apply a scaled approach for filtering of k-mers required for high coverage amplicon data
df = filter_by_kmer_fraction(df,subtyping_params.min_kmer_frac)

st.avg_kmer_coverage = df['freq'].mean()
st, df = process_subtyping_results(st, df[df.is_kmer_freq_okay], scheme_subtype_counts)
st.qc_status, st.qc_message = perform_quality_check(st, df, subtyping_params)
Expand Down
5 changes: 3 additions & 2 deletions bio_hansel/subtyping_params.py
Original file line number Diff line number Diff line change
Expand Up @@ -8,9 +8,10 @@ class SubtypingParams(object):
min_ambiguous_kmers = attr.ib(default=3, validator=attr.validators.instance_of(int))
max_perc_intermediate_kmers = attr.ib(default=0.05, validator=attr.validators.instance_of(float))
min_kmer_freq = attr.ib(default=8, validator=attr.validators.instance_of((float, int)))
max_kmer_freq = attr.ib(default=10000, validator=attr.validators.instance_of((float, int)))
min_kmer_frac = attr.ib(default=0.05, validator=attr.validators.instance_of(float))
max_kmer_freq = attr.ib(default=1000000, validator=attr.validators.instance_of((float, int)))
min_coverage_warning = attr.ib(default=20, validator=attr.validators.instance_of((float, int)))
max_degenerate_kmers = attr.ib(default=100000, validator=attr.validators.instance_of(int))
max_degenerate_kmers = attr.ib(default=10000000, validator=attr.validators.instance_of(int))

@max_perc_missing_kmers.validator
def _validate_max_perc_missing_kmers(self, attribute, value):
Expand Down
2 changes: 2 additions & 0 deletions bio_hansel/utils.py
Original file line number Diff line number Diff line change
Expand Up @@ -200,6 +200,8 @@ def init_subtyping_params(args: Optional[Any] = None,
subtyping_params.min_coverage_warning = args.low_cov_warning
if args.min_kmer_freq:
subtyping_params.min_kmer_freq = args.min_kmer_freq
if args.min_kmer_frac:
subtyping_params.min_kmer_frac = args.min_kmer_frac
if args.max_kmer_freq:
subtyping_params.max_kmer_freq = args.max_kmer_freq
if args.max_degenerate_kmers:
Expand Down