diff --git a/docs/api/data.rst b/docs/api/data.rst index 7b2d0904..3c9e9d14 100644 --- a/docs/api/data.rst +++ b/docs/api/data.rst @@ -49,7 +49,7 @@ This module supports reading and writing files that follow `the VCF ` contain example usage of the :class:`Genotypes` class. -See the documentation for the :class:`GenotypesRefAlt` class for an example of extending the :class:`Genotypes` class so that it loads REF and ALT alleles as well. +See the documentation for the :class:`GenotypesVCF` class for an example of extending the :class:`Genotypes` class so that it loads REF and ALT alleles as well. Classes ------- @@ -152,15 +152,15 @@ You can index into a loaded :class:`Genotypes` instance using the ``subset()`` f By default, the ``subset()`` method returns a new :class:`Genotypes` instance. The samples and variants in the new instance will be in the order specified. -GenotypesRefAlt -+++++++++++++++ -The :class:`Genotypes` class can be easily *extended* (sub-classed) to load extra fields into the ``variants`` structured array. The :class:`GenotypesRefAlt` class is an example of this where I extended the :class:`Genotypes` class to add REF and ALT fields from the VCF to the columns of the structured array. So the ``variants`` array will have named columns: "id", "chrom", "pos", "ref", and "alt". +GenotypesVCF +++++++++++++ +The :class:`Genotypes` class can be easily *extended* (sub-classed) to load extra fields into the ``variants`` structured array. The :class:`GenotypesVCF` class is an example of this where I extended the :class:`Genotypes` class to add REF and ALT fields from the VCF as a new column of the structured array. So the ``variants`` array will have named columns: "id", "chrom", "pos", "alleles". The new "alleles" column contains lists of alleles designed such that the first element in the list is the REF allele, the second is ALT1, the third is ALT2, etc. -All of the other methods in the :class:`Genotypes` class are inherited, but the :class:`GenotypesRefAlt` class implements an additional method ``write()`` for dumping the contents of the class to the provided file. +All of the other methods in the :class:`Genotypes` class are inherited, but the :class:`GenotypesVCF` class implements an additional method ``write()`` for dumping the contents of the class to the provided file. .. code-block:: python - genotypes = data.GenotypesRefAlt.load('tests/data/simple.vcf') + genotypes = data.GenotypesVCF.load('tests/data/simple.vcf') # make the first sample homozygous for the alt allele of the fourth variant genotypes.data[0, 3] = (1, 1) genotypes.write() @@ -182,13 +182,13 @@ The :class:`GenotypesPLINK` class offers experimental support for reading and wr pip install haptools[files] -The :class:`GenotypesPLINK` class inherits from the :class:`GenotypesRefAlt` class, so it has all the same methods and properties. Loading genotypes is the exact same, for example. +The :class:`GenotypesPLINK` class inherits from the :class:`GenotypesVCF` class, so it has all the same methods and properties. Loading genotypes is the exact same, for example. .. code-block:: python genotypes = data.GenotypesPLINK.load('tests/data/simple.pgen') genotypes.data # a numpy array of shape n x p x 2 - genotypes.variants # a numpy structured array of shape p x 6 + genotypes.variants # a numpy structured array of shape p x 5 genotypes.samples # a tuple of strings of length n In addition to the ``read()`` and ``load()`` methods, the :class:`GenotypesPLINK` class also has methods for reading (or writing) PVAR or PSAM files separately, without having to read (or write) the PGEN file as well. @@ -198,7 +198,7 @@ In addition to the ``read()`` and ``load()`` methods, the :class:`GenotypesPLINK genotypes = data.GenotypesPLINK('tests/data/simple.pgen') genotypes.read_variants() - genotypes.variants # a numpy structured array of shape p x 6 + genotypes.variants # a numpy structured array of shape p x 5 genotypes.read_samples() genotypes.samples # a tuple of strings of length n @@ -305,14 +305,14 @@ To write to a **.hap** file, you must first initialize a :class:`Haplotypes` obj Obtaining haplotype "genotypes" ******************************* -Using the ``transform()`` function, you can obtain a full instance of the :class:`GenotypesRefAlt` class where haplotypes from a :class:`Haplotypes` object are encoded as the variants in the genotype matrix. +Using the ``transform()`` function, you can obtain a full instance of the :class:`GenotypesVCF` class where haplotypes from a :class:`Haplotypes` object are encoded as the variants in the genotype matrix. .. code-block:: python haplotypes = data.Haplotypes.load('tests/data/example.hap.gz') - genotypes = data.GenotypesRefAlt.load('tests/data/example.vcf.gz') + genotypes = data.GenotypesVCF.load('tests/data/example.vcf.gz') hap_gts = haplotypes.transform(genotypes) - hap_gts # a GenotypesRefAlt instance where haplotypes are variants + hap_gts # a GenotypesVCF instance where haplotypes are variants Haplotype +++++++++ diff --git a/docs/api/examples.rst b/docs/api/examples.rst index c6ae9dcf..764c739f 100644 --- a/docs/api/examples.rst +++ b/docs/api/examples.rst @@ -24,7 +24,7 @@ As an example, let's say we would like to convert the following ``.blocks.det`` # load the genotypes file # you can use either a VCF or PGEN file - gt = data.GenotypesRefAlt.load("input.vcf.gz") + gt = data.GenotypesVCF.load("input.vcf.gz") gt = data.GenotypesPLINK.load("input.pgen") # load the haplotypes @@ -46,9 +46,9 @@ As an example, let's say we would like to convert the following ``.blocks.det`` # create variant lines for each haplotype # Note that the .blocks.det file doesn't specify an allele, so - # we simply choose the REF allele for this example + # we simply choose the first allele (ie the REF allele) for this example hp.data[hap_id].variants = tuple( - data.Variant(start=v["pos"], end=v["pos"]+len(v["ref"]), id=v["id"], allele=v["ref"]) + data.Variant(start=v["pos"], end=v["pos"]+len(v["alleles"][0]), id=v["id"], allele=v["alleles"][0]) for v in snp_gts.variants ) @@ -84,7 +84,7 @@ You can easily use the :ref:`data API ` and the :ref:`simphenotype API # load the genotypes file # you can use either a VCF or PGEN file - gt = data.GenotypesRefAlt("tests/data/apoe.vcf.gz") + gt = data.GenotypesVCF("tests/data/apoe.vcf.gz") gt.read(variants=variants) # the advantage of using a PGEN file is that you can use read_variants() to load # the variants quickly w/o having to load the genotypes, too @@ -96,14 +96,14 @@ You can easily use the :ref:`data API ` and the :ref:`simphenotype API hp.data = {} for variant in gt.variants: - ID, chrom, pos, alt = variant[["id", "chrom", "pos", "alt"]] - end = pos + len(alt) + ID, chrom, pos, alleles = variant[["id", "chrom", "pos", "alleles"]] + end = pos + len(alleles[1]) # create a haplotype line in the .hap file # you should fill out "beta" with your own value hp.data[ID] = Haplotype(chrom=chrom, start=pos, end=end, id=ID, beta=0.5) # create variant lines for each haplotype - hp.data[ID].variants = (data.Variant(start=pos, end=end, id=ID, allele=alt),) + hp.data[ID].variants = (data.Variant(start=pos, end=end, id=ID, allele=alleles[1]),) hp.write() diff --git a/haptools/data/__init__.py b/haptools/data/__init__.py index dbd3fca5..51f0e019 100644 --- a/haptools/data/__init__.py +++ b/haptools/data/__init__.py @@ -3,4 +3,4 @@ from .covariates import Covariates from .breakpoints import Breakpoints, HapBlock from .haplotypes import Extra, Variant, Haplotype, Haplotypes -from .genotypes import Genotypes, GenotypesRefAlt, GenotypesPLINK +from .genotypes import Genotypes, GenotypesVCF, GenotypesPLINK diff --git a/haptools/data/genotypes.py b/haptools/data/genotypes.py index 57b01747..6bd11329 100644 --- a/haptools/data/genotypes.py +++ b/haptools/data/genotypes.py @@ -491,7 +491,10 @@ def check_phase(self): self.log.warning("Phase information has already been removed from the data") return # check: are there any variants that are heterozygous and unphased? - unphased = (self.data[:, :, 0] ^ self.data[:, :, 1]) & (~self.data[:, :, 2]) + data = self.data + if data.dtype != np.bool_: + data = self.data.astype(np.bool_) + unphased = (data[:, :, 0] ^ data[:, :, 1]) & (~data[:, :, 2]) if np.any(unphased): samp_idx, variant_idx = np.nonzero(unphased) raise ValueError( @@ -591,11 +594,11 @@ def check_sorted(self): ) -class GenotypesRefAlt(Genotypes): +class GenotypesVCF(Genotypes): """ A class for processing genotypes from a file - Unlike the base Genotypes class, this class also includes REF and ALT alleles in - the variants array + Unlike the base Genotypes class, this class also includes REF and ALT alleles as + a list of alleles in the variants array Attributes ---------- @@ -610,24 +613,15 @@ class GenotypesRefAlt(Genotypes): 1. ID 2. CHROM 3. POS - 4. REF - 5. ALT + 4. [REF, ALT1, ALT2, ...] log: Logger See documentation for :py:attr:`~.Genotypes.log` """ def __init__(self, fname: Path | str, log: Logger = None): super().__init__(fname, log) - self.variants = np.array( - [], - dtype=[ - ("id", "U50"), - ("chrom", "U10"), - ("pos", np.uint32), - ("ref", "U100"), - ("alt", "U100"), - ], - ) + dtype = {k: v[0] for k, v in self.variants.dtype.fields.items()} + self.variants = np.array([], dtype=list(dtype.items()) + [("alleles", object)]) def _variant_arr(self, record: Variant): """ @@ -638,15 +632,14 @@ def _variant_arr(self, record: Variant): record.ID, record.CHROM, record.POS, - record.REF, - record.ALT[0], + (record.REF, *record.ALT), ), dtype=self.variants.dtype, ) def write(self): """ - Write the variants in this class to a VCF at :py:attr:`~.GenotypesRefAlt.fname` + Write the variants in this class to a VCF at :py:attr:`~.GenotypesVCF.fname` """ vcf = VariantFile(str(self.fname), mode="w") # make sure the header is properly structured @@ -671,13 +664,15 @@ def write(self): for sample in self.samples: vcf.header.add_sample(sample) self.log.info("Writing VCF records") + phased = self._prephased or (self.data.shape[2] < 3) + missing_val = np.iinfo(np.uint8).max for var_idx, var in enumerate(self.variants): rec = { "contig": var["chrom"], "start": var["pos"], - "stop": var["pos"] + len(var["ref"]) - 1, + "stop": var["pos"] + len(var["alleles"][0]) - 1, "qual": None, - "alleles": tuple(var[["ref", "alt"]]), + "alleles": var["alleles"], "id": var["id"], "filter": None, } @@ -686,29 +681,34 @@ def write(self): # parse the record into a pysam.VariantRecord record = vcf.new_record(**rec) for samp_idx, sample in enumerate(self.samples): - # TODO: make this work when there are missing values - record.samples[sample]["GT"] = tuple(self.data[samp_idx, var_idx, :2]) - # TODO: add proper phase info - record.samples[sample].phased = True + record.samples[sample]["GT"] = tuple( + None if val == missing_val else val + for val in self.data[samp_idx, var_idx, :2] + ) + # add proper phasing info + if phased: + record.samples[sample].phased = True + else: + record.samples[sample].phased = self.data[samp_idx, var_idx, 2] # write the record to a file vcf.write(record) vcf.close() -class GenotypesPLINK(GenotypesRefAlt): +class GenotypesPLINK(GenotypesVCF): """ A class for processing genotypes from a PLINK ``.pgen`` file Attributes ---------- data : np.array - See documentation for :py:attr:`~.GenotypesRefAlt.data` + See documentation for :py:attr:`~.GenotypesVCF.data` samples : tuple - See documentation for :py:attr:`~.GenotypesRefAlt.data` + See documentation for :py:attr:`~.GenotypesVCF.data` variants : np.array - See documentation for :py:attr:`~.GenotypesRefAlt.data` + See documentation for :py:attr:`~.GenotypesVCF.data` log: Logger - See documentation for :py:attr:`~.GenotypesRefAlt.data` + See documentation for :py:attr:`~.GenotypesVCF.data` chunk_size: int, optional The max number of variants to fetch from and write to the PGEN file at any given time @@ -716,7 +716,7 @@ class GenotypesPLINK(GenotypesRefAlt): If this value is provided, variants from the PGEN file will be loaded in chunks so as to use less memory _prephased: bool - See documentation for :py:attr:`~.GenotypesRefAlt.data` + See documentation for :py:attr:`~.GenotypesVCF.data` Examples -------- @@ -746,7 +746,7 @@ def read_samples(self, samples: list[str] = None): Parameters ---------- samples : list[str], optional - See documentation for :py:attr:`~.GenotypesRefAlt.read` + See documentation for :py:attr:`~.GenotypesVCF.read` Returns ------- @@ -839,14 +839,12 @@ def _variant_arr( npt.NDArray A row from the :py:attr:`~.GenotypesPLINK.variants` array """ - # TODO: remove the AAF column; right now, we just set it to 0.5 arbitrarily return np.array( ( record[cid["ID"]], record[cid["CHROM"]], record[cid["POS"]], - record[cid["REF"]], - record[cid["ALT"]], + (record[cid["REF"]], record[cid["ALT"]]), ), dtype=self.variants.dtype, ) @@ -865,9 +863,9 @@ def _iterate_variants( Parameters ---------- region : str, optional - See documentation for :py:attr:`~.GenotypesRefAlt.read` + See documentation for :py:attr:`~.GenotypesVCF.read` variants : set[str], optional - See documentation for :py:attr:`~.GenotypesRefAlt.read` + See documentation for :py:attr:`~.GenotypesVCF.read` Yields ------ @@ -940,11 +938,11 @@ def read_variants( Parameters ---------- region : str, optional - See documentation for :py:attr:`~.GenotypesRefAlt.read` + See documentation for :py:attr:`~.GenotypesVCF.read` variants : set[str], optional - See documentation for :py:attr:`~.GenotypesRefAlt.read` + See documentation for :py:attr:`~.GenotypesVCF.read` max_variants : int, optional - See documentation for :py:attr:`~.GenotypesRefAlt.read` + See documentation for :py:attr:`~.GenotypesVCF.read` Returns ------- @@ -997,13 +995,13 @@ def read( Parameters ---------- region : str, optional - See documentation for :py:attr:`~.GenotypesRefAlt.read` + See documentation for :py:attr:`~.GenotypesVCF.read` samples : list[str], optional - See documentation for :py:attr:`~.GenotypesRefAlt.read` + See documentation for :py:attr:`~.GenotypesVCF.read` variants : set[str], optional - See documentation for :py:attr:`~.GenotypesRefAlt.read` + See documentation for :py:attr:`~.GenotypesVCF.read` max_variants : int, optional - See documentation for :py:attr:`~.GenotypesRefAlt.read` + See documentation for :py:attr:`~.GenotypesVCF.read` """ super(Genotypes, self).read() import pgenlib @@ -1199,9 +1197,9 @@ def write_variants(self): rec = { "contig": var["chrom"], "start": var["pos"], - "stop": var["pos"] + len(var["ref"]) - 1, + "stop": var["pos"] + len(var["alleles"][0]) - 1, "qual": None, - "alleles": tuple(var[["ref", "alt"]]), + "alleles": var["alleles"], "id": var["id"], "filter": None, } diff --git a/haptools/data/haplotypes.py b/haptools/data/haplotypes.py index 2ce48fb9..30e66dcc 100644 --- a/haptools/data/haplotypes.py +++ b/haptools/data/haplotypes.py @@ -11,7 +11,7 @@ from pysam import TabixFile from .data import Data -from .genotypes import GenotypesRefAlt +from .genotypes import GenotypesVCF @dataclass @@ -453,7 +453,7 @@ def extras_order(cls) -> tuple[str]: """ return tuple(extra.name for extra in cls._extras) - def transform(self, genotypes: GenotypesRefAlt) -> npt.NDArray[bool]: + def transform(self, genotypes: GenotypesVCF) -> npt.NDArray[bool]: """ Transform a genotypes matrix via the current haplotype @@ -462,7 +462,7 @@ def transform(self, genotypes: GenotypesRefAlt) -> npt.NDArray[bool]: Parameters ---------- - genotypes : GenotypesRefAlt + genotypes : GenotypesVCF The genotypes which to transform using the current haplotype If the genotypes have not been loaded into the Genotypes object yet, this @@ -490,7 +490,7 @@ def transform(self, genotypes: GenotypesRefAlt) -> npt.NDArray[bool]: allele_arr = np.array( [ [ - [int(var.allele != gts.variants[i]["ref"])] + [int(var.allele != gts.variants[i]["alleles"][0])] for i, var in enumerate(self.variants) ] ] @@ -1053,9 +1053,9 @@ def write(self): def transform( self, - gts: GenotypesRefAlt, - hap_gts: GenotypesRefAlt = None, - ) -> GenotypesRefAlt: + gts: GenotypesVCF, + hap_gts: GenotypesVCF = None, + ) -> GenotypesVCF: """ Transform a genotypes matrix via the current haplotype @@ -1064,23 +1064,23 @@ def transform( Parameters ---------- - gts : GenotypesRefAlt + gts : GenotypesVCF The genotypes which to transform using the current haplotype - hap_gts: GenotypesRefAlt - An empty GenotypesRefAlt object into which the haplotype genotypes should + hap_gts: GenotypesVCF + An empty GenotypesVCF object into which the haplotype genotypes should be stored Returns ------- - GenotypesRefAlt + GenotypesVCF A Genotypes object composed of haplotypes instead of regular variants. """ - # Initialize GenotypesRefAlt return value + # Initialize GenotypesVCF return value if hap_gts is None: - hap_gts = GenotypesRefAlt(fname=None, log=self.log) + hap_gts = GenotypesVCF(fname=None, log=self.log) hap_gts.samples = gts.samples hap_gts.variants = np.array( - [(hap.id, hap.chrom, hap.start, "A", "T") for hap in self.data.values()], + [(hap.id, hap.chrom, hap.start, ("A", "T")) for hap in self.data.values()], dtype=hap_gts.variants.dtype, ) # build a fast data structure for querying the alleles in each haplotype: @@ -1104,7 +1104,7 @@ def transform( # with shape (1, gts.data.shape[1], 1) for broadcasting later allele_arr = np.array( [ - int(allele != gts.variants[i]["ref"]) + int(allele != gts.variants[i]["alleles"][0]) for i, (vID, allele) in enumerate(alleles) ], dtype=gts.data.dtype, diff --git a/haptools/ld.py b/haptools/ld.py index 1c4f1896..6e3b33c0 100644 --- a/haptools/ld.py +++ b/haptools/ld.py @@ -152,7 +152,7 @@ def calc_ld( gt = data.GenotypesPLINK(fname=genotypes, log=log, chunk_size=chunk_size) else: log.info("Loading genotypes from VCF/BCF file") - gt = data.GenotypesRefAlt(fname=genotypes, log=log) + gt = data.GenotypesVCF(fname=genotypes, log=log) # gt._prephased = True gt.read(region=region, samples=samples, variants=variants) gt.check_missing(discard_also=discard_missing) @@ -180,7 +180,7 @@ def calc_ld( if not from_gts: log.info("Transforming genotypes via haplotypes") - hp_gt = data.GenotypesRefAlt(fname=None, log=log) + hp_gt = data.GenotypesVCF(fname=None, log=log) hp.transform(gt, hp_gt) log.info("Obtaining target genotypes") diff --git a/haptools/sim_genotype.py b/haptools/sim_genotype.py index 6405aac2..51f83a93 100644 --- a/haptools/sim_genotype.py +++ b/haptools/sim_genotype.py @@ -9,7 +9,7 @@ from pysam import VariantFile from collections import defaultdict from .admix_storage import GeneticMarker, HaplotypeSegment -from .data import GenotypesRefAlt, GenotypesPLINK +from .data import GenotypesVCF, GenotypesPLINK from .transform import GenotypesAncestry @@ -72,7 +72,7 @@ def output_vcf( Outputs messages to the appropriate channel. """ - log.info(f"Outputting VCF file {out}.vcf.gz") + log.info(f"Outputting file {out}") # details to know # vcf file: how to handle samples and which sample is which haplotype block randomly choose out of current population types @@ -104,13 +104,17 @@ def output_vcf( if variant_file.endswith(".pgen"): vcf = GenotypesPLINK(variant_file, log=log) else: - vcf = GenotypesRefAlt(variant_file, log=log) + vcf = GenotypesVCF(variant_file, log=log) if not region: vcf.read() else: vcf.read(region=f"{region['chr']}:{region['start']}-{region['end']}") + if variant_file.endswith(".pgen"): + # the pgenlib library collapses multiallelic variants into a single allele + vcf.check_biallelic() + log.debug(f"Read in variants from {variant_file}") sample_dict = {} @@ -212,7 +216,7 @@ def output_vcf( gts.valid_labels = output_labels if not pop_field and not sample_field: - gts = GenotypesRefAlt(out, log=log) + gts = GenotypesVCF(out, log=log) else: gts = GenotypesPLINK(out, log=log) @@ -915,8 +919,8 @@ def validate_params(model, mapdir, chroms, popsize, invcf, sample_info, no_repla else: total_per_pop[info_pop] += 1 - if sample not in vcf_samples: - raise Exception(f"Sample {sample} in sampleinfo file is not present in the vcf file.") + if sample not in vcf_samples and info_pop in pops: + raise Exception(f"Sample {sample} from population {info_pop} in sampleinfo file is not present in the vcf file.") # Ensure that all populations from the model file are listed in the sample info file # and that there is sufficient enough samples when no_replacement is specified diff --git a/haptools/sim_phenotype.py b/haptools/sim_phenotype.py index bcb1277d..34f336df 100644 --- a/haptools/sim_phenotype.py +++ b/haptools/sim_phenotype.py @@ -14,7 +14,6 @@ Phenotypes, Haplotypes, GenotypesPLINK, - GenotypesRefAlt, ) diff --git a/haptools/transform.py b/haptools/transform.py index 482d9ffd..b25b95ad 100644 --- a/haptools/transform.py +++ b/haptools/transform.py @@ -28,7 +28,7 @@ class HaplotypeAncestry(data.Haplotype): default=(data.Extra("ancestry", "s", "Local ancestry"),), ) - def transform(self, genotypes: data.GenotypesRefAlt) -> npt.NDArray[bool]: + def transform(self, genotypes: data.GenotypesVCF) -> npt.NDArray[bool]: """ Transform a genotypes matrix via the current haplotype and its ancestral population @@ -51,7 +51,7 @@ def transform(self, genotypes: data.GenotypesRefAlt) -> npt.NDArray[bool]: allele_arr = np.array( [ [ - [int(var.allele != gts.variants[i]["ref"])] + [int(var.allele != gts.variants[i]["alleles"][0])] for i, var in enumerate(self.variants) ] ] @@ -91,14 +91,14 @@ def __init__( def transform( self, gts: data.GenotypesAncestry, - hap_gts: data.GenotypesRefAlt = None, - ) -> data.GenotypesRefAlt: - # Initialize GenotypesRefAlt return value + hap_gts: data.GenotypesVCF = None, + ) -> data.GenotypesVCF: + # Initialize GenotypesVCF return value if hap_gts is None: - hap_gts = data.GenotypesRefAlt(fname=None, log=self.log) + hap_gts = data.GenotypesVCF(fname=None, log=self.log) hap_gts.samples = gts.samples hap_gts.variants = np.array( - [(hap.id, hap.chrom, hap.start, "A", "T") for hap in self.data.values()], + [(hap.id, hap.chrom, hap.start, ("A", "T")) for hap in self.data.values()], dtype=hap_gts.variants.dtype, ) # build a fast data structure for querying the alleles in each haplotype: @@ -125,7 +125,7 @@ def transform( # with shape (1, gts.data.shape[1], 1) for broadcasting later allele_arr = np.array( [ - int(allele != gts.variants[i]["ref"]) + int(allele != gts.variants[i]["alleles"][0]) for i, (vID, allele) in enumerate(alleles) ], dtype=gts.data.dtype, @@ -147,9 +147,9 @@ def transform( return hap_gts -class GenotypesAncestry(data.GenotypesRefAlt): +class GenotypesAncestry(data.GenotypesVCF): """ - Extends the GenotypesRefAlt class for ancestry data + Extends the GenotypesVCF class for ancestry data The ancestry information is stored within the FORMAT field of the VCF @@ -162,7 +162,7 @@ class GenotypesAncestry(data.GenotypesRefAlt): samples : tuple[str] See documentation for :py:attr:`~.Genotypes.samples` variants : np.array - See documentation for :py:attr:`~.GenotypesRefAlt.variants` + See documentation for :py:attr:`~.GenotypesVCF.variants` valid_labels: np.array Reference VCF sample and respective variant grabbed for each sample. @@ -479,13 +479,14 @@ def write(self, chroms=None): for sample in self.samples: vcf.header.add_sample(sample) self.log.info("Writing VCF records") + phased = self._prephased or (self.data.shape[2] < 3) for var_idx, var in enumerate(self.variants): rec = { "contig": var["chrom"], "start": var["pos"], - "stop": var["pos"] + len(var["ref"]) - 1, + "stop": var["pos"] + len(var["alleles"][0]) - 1, "qual": None, - "alleles": tuple(var[["ref", "alt"]]), + "alleles": var["alleles"], "id": var["id"], "filter": None, } @@ -507,8 +508,11 @@ def write(self, chroms=None): record.samples[sample]["SAMPLE"] = tuple( self.valid_labels[samp_idx, var_idx, :] ) - # TODO: add proper phase info - record.samples[sample].phased = True + # add proper phasing info + if phased: + record.samples[sample].phased = True + else: + record.samples[sample].phased = self.data[samp_idx, var_idx, 2] # write the record to a file vcf.write(record) vcf.close() @@ -597,7 +601,7 @@ def transform_haps( if ancestry and not bps_file.exists(): gt = GenotypesAncestry(fname=genotypes, log=log) else: - gt = data.GenotypesRefAlt(fname=genotypes, log=log) + gt = data.GenotypesVCF(fname=genotypes, log=log) # gt._prephased = True gt.read(region=region, samples=samples, variants=variants) gt.check_missing(discard_also=discard_missing) @@ -631,7 +635,7 @@ def transform_haps( bps = data.Breakpoints(fname=bps_file, log=log) bps.read(samples=set(gt.samples)) bps.encode() - # convert the GenotypesRefAlt object to a GenotypesAncestry object + # convert the GenotypesVCF object to a GenotypesAncestry object # TODO: figure out a better solution for this # this is just a temp hack to get output from simgenotype to load a bit faster gta = GenotypesAncestry(fname=None, log=log) @@ -647,7 +651,7 @@ def transform_haps( hp_gt = data.GenotypesPLINK(fname=output, log=log, chunk_size=chunk_size) else: out_file_type = "VCF/BCF" - hp_gt = data.GenotypesRefAlt(fname=output, log=log) + hp_gt = data.GenotypesVCF(fname=output, log=log) log.info("Transforming genotypes via haplotypes") hp.transform(gt, hp_gt) diff --git a/tests/bench_genotypes.py b/tests/bench_genotypes.py index 1258c620..734e1d79 100755 --- a/tests/bench_genotypes.py +++ b/tests/bench_genotypes.py @@ -10,7 +10,7 @@ import numpy as np import matplotlib.pyplot as plt -from haptools.data import GenotypesRefAlt, GenotypesPLINK +from haptools.data import GenotypesVCF, GenotypesPLINK # COMMAND FOR GENERATING UKB PLOT: @@ -32,7 +32,7 @@ def create_variant_files(gts, intervals, num_samps): sub.fname = variant_dir / f"{val}.pgen" sub.write() # write VCF files - vcf = GenotypesRefAlt(sub.fname.with_suffix(".vcf")) + vcf = GenotypesVCF(sub.fname.with_suffix(".vcf")) vcf.data = sub.data vcf.samples = sub.samples vcf.variants = sub.variants @@ -51,7 +51,7 @@ def create_sample_files(gts, intervals, num_vars): sub.fname = sample_dir / f"{val}.pgen" sub.write() # write VCF files - vcf = GenotypesRefAlt(sub.fname.with_suffix(".vcf")) + vcf = GenotypesVCF(sub.fname.with_suffix(".vcf")) vcf.data = sub.data vcf.samples = sub.samples vcf.variants = sub.variants @@ -60,7 +60,7 @@ def create_sample_files(gts, intervals, num_vars): def time_vcf(vcf, max_variants): - GenotypesRefAlt(vcf).read(max_variants=max_variants) + GenotypesVCF(vcf).read(max_variants=max_variants) def time_plink(pgen, max_variants): diff --git a/tests/bench_transform.py b/tests/bench_transform.py index 16b1d98b..2aaa7753 100755 --- a/tests/bench_transform.py +++ b/tests/bench_transform.py @@ -11,7 +11,7 @@ import numpy as np import matplotlib.pyplot as plt -from haptools.data import GenotypesRefAlt, Haplotypes, Haplotype, Variant +from haptools.data import GenotypesVCF, Haplotypes, Haplotype, Variant # ---------------USAGE---------------- @@ -22,7 +22,7 @@ def create_genotypes(log, samples, variants, with_phase=False): - gts = GenotypesRefAlt(fname=None, log=log) + gts = GenotypesVCF(fname=None, log=log) shape = (samples, variants, 2 + with_phase) # create a GT matrix with shape: samples x SNPs x (strands+phase) gts.data = np.random.choice([0, 1], size=np.prod(shape)) diff --git a/tests/data/dat_files/correct_model.dat b/tests/data/dat_files/correct_model.dat index 55668724..572d1dda 100644 --- a/tests/data/dat_files/correct_model.dat +++ b/tests/data/dat_files/correct_model.dat @@ -1,2 +1,2 @@ -40 Admix Pop1 Pop2 Pop3 +40 Admix CEU YRI Pop3 10 0 0 0.95 .05 diff --git a/tests/data/simple-multiallelic.vcf b/tests/data/simple-multiallelic.vcf new file mode 100644 index 00000000..25821b09 --- /dev/null +++ b/tests/data/simple-multiallelic.vcf @@ -0,0 +1,9 @@ +##fileformat=VCFv4.2 +##filedate=20180225 +##FORMAT= +##contig= +#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT HG00096 HG00097 HG00099 HG00100 HG00101 +1 10114 1:10114:T:C T C . . . GT 0|0 0|0 0|0 0|0 0|0 +1 10116 1:10116:A:G A G,T . . . GT 0|1 0|1 2|1 1|1 0|2 +1 10117 1:10117:C:A C A . . . GT 0|0 0|0 0|0 0|0 0|0 +1 10122 1:10122:A:G A G,C . . . GT 0|0 0|0 0|0 0|0 0|0 diff --git a/tests/test_data.py b/tests/test_data.py index 03059d73..7f882b16 100644 --- a/tests/test_data.py +++ b/tests/test_data.py @@ -17,7 +17,7 @@ Covariates, Haplotypes, Breakpoints, - GenotypesRefAlt, + GenotypesVCF, GenotypesPLINK, ) @@ -148,14 +148,15 @@ def test_load_genotypes_example(self): assert gts.samples[:25] == samples def test_load_genotypes_iterate(self, caplog): - expected = self._get_expected_genotypes().transpose((1, 0, 2)) - samples = ("HG00096", "HG00097", "HG00099", "HG00100", "HG00101") + expected = self._get_fake_genotypes() # can we load the data from the VCF? gts = Genotypes(DATADIR.joinpath("simple.vcf")) for idx, line in enumerate(gts): - np.testing.assert_allclose(line.data, expected[idx]) - assert gts.samples == samples + np.testing.assert_allclose(line.data[:, :2], expected.data[:, idx]) + for col in ("chrom", "pos", "id"): + assert line.variants[col] == expected.variants[col][idx] + assert gts.samples == expected.samples def test_load_genotypes_discard_multiallelic(self): gts = self._get_fake_genotypes() @@ -281,7 +282,7 @@ def test_check_sorted(self, caplog): class TestGenotypesPLINK: def _get_fake_genotypes_plink(self): pgenlib = pytest.importorskip("pgenlib") - gts_ref_alt = TestGenotypesRefAlt()._get_fake_genotypes_refalt() + gts_ref_alt = TestGenotypesVCF()._get_fake_genotypes_refalt() gts = GenotypesPLINK(gts_ref_alt.fname) gts.data = gts_ref_alt.data gts.samples = gts_ref_alt.samples @@ -299,7 +300,7 @@ def test_load_genotypes(self): np.testing.assert_allclose(gts.data, expected.data) assert gts.samples == expected.samples for i, x in enumerate(expected.variants): - for col in ("chrom", "pos", "id", "ref", "alt"): + for col in ("chrom", "pos", "id", "alleles"): assert gts.variants[col][i] == expected.variants[col][i] def test_load_genotypes_chunked(self): @@ -313,7 +314,7 @@ def test_load_genotypes_chunked(self): np.testing.assert_allclose(gts.data, expected.data) assert gts.samples == expected.samples for i, x in enumerate(expected.variants): - for col in ("chrom", "pos", "id", "ref", "alt"): + for col in ("chrom", "pos", "id", "alleles"): assert gts.variants[col][i] == expected.variants[col][i] def test_load_genotypes_prephased(self): @@ -327,7 +328,7 @@ def test_load_genotypes_prephased(self): np.testing.assert_allclose(gts.data, expected.data) assert gts.samples == expected.samples for i, x in enumerate(expected.variants): - for col in ("chrom", "pos", "id", "ref", "alt"): + for col in ("chrom", "pos", "id", "alleles"): assert gts.variants[col][i] == expected.variants[col][i] def test_load_genotypes_iterate(self): @@ -338,8 +339,11 @@ def test_load_genotypes_iterate(self): # check that everything matches what we expected for idx, line in enumerate(gts): np.testing.assert_allclose(line.data[:, :2], expected.data[:, idx]) - for col in ("chrom", "pos", "id", "ref", "alt"): + for col in ("chrom", "pos", "id"): assert line.variants[col] == expected.variants[col][idx] + assert ( + line.variants["alleles"].tolist() == expected.variants["alleles"][idx] + ) assert gts.samples == expected.samples def test_load_genotypes_subset(self): @@ -390,7 +394,7 @@ def test_write_genotypes_chunked(self): np.testing.assert_allclose(gts.data, new_gts.data) assert gts.samples == new_gts.samples for i in range(len(new_gts.variants)): - for col in ("chrom", "pos", "id", "ref", "alt"): + for col in ("chrom", "pos", "id", "alleles"): assert gts.variants[col][i] == new_gts.variants[col][i] # clean up afterwards: delete the files we created @@ -413,7 +417,7 @@ def test_write_genotypes(self): np.testing.assert_allclose(gts.data, new_gts.data) assert gts.samples == new_gts.samples for i in range(len(new_gts.variants)): - for col in ("chrom", "pos", "id", "ref", "alt"): + for col in ("chrom", "pos", "id", "alleles"): assert gts.variants[col][i] == new_gts.variants[col][i] # clean up afterwards: delete the files we created @@ -437,7 +441,7 @@ def test_write_genotypes_prephased(self): np.testing.assert_allclose(gts.data, new_gts.data) assert gts.samples == new_gts.samples for i in range(len(new_gts.variants)): - for col in ("chrom", "pos", "id", "ref", "alt"): + for col in ("chrom", "pos", "id", "alleles"): assert gts.variants[col][i] == new_gts.variants[col][i] # clean up afterwards: delete the files we created @@ -462,7 +466,7 @@ def test_write_genotypes_unphased(self): np.testing.assert_allclose(gts.data, new_gts.data) assert gts.samples == new_gts.samples for i in range(len(new_gts.variants)): - for col in ("chrom", "pos", "id", "ref", "alt"): + for col in ("chrom", "pos", "id", "alleles"): assert gts.variants[col][i] == new_gts.variants[col][i] # clean up afterwards: delete the files we created @@ -897,7 +901,7 @@ def test_hap_transform(self): ) hap = list(self._get_dummy_haps().data.values())[0] - gens = TestGenotypesRefAlt()._get_fake_genotypes_refalt() + gens = TestGenotypesVCF()._get_fake_genotypes_refalt() hap_gt = hap.transform(gens) np.testing.assert_allclose(hap_gt, expected) @@ -914,10 +918,10 @@ def test_haps_transform(self, return_also=False): ) haps = self._get_dummy_haps() - gens = TestGenotypesRefAlt()._get_fake_genotypes_refalt() + gens = TestGenotypesVCF()._get_fake_genotypes_refalt() gens.data[[2, 4], 0, 1] = 1 gens.data[[1, 4], 2, 0] = 1 - hap_gt = GenotypesRefAlt(fname=None) + hap_gt = GenotypesVCF(fname=None) haps.transform(gens, hap_gt) np.testing.assert_allclose(hap_gt.data, expected) @@ -998,11 +1002,11 @@ def test_sort(self): test_hap1.fname.unlink() -class TestGenotypesRefAlt: +class TestGenotypesVCF: def _get_fake_genotypes_refalt(self, with_phase=False): base_gts = TestGenotypes()._get_fake_genotypes() # copy all of the fields - gts = GenotypesRefAlt(fname=None) + gts = GenotypesVCF(fname=None) gts.data = base_gts.data if with_phase: data_shape = (gts.data.shape[0], gts.data.shape[1], 1) @@ -1011,42 +1015,45 @@ def _get_fake_genotypes_refalt(self, with_phase=False): (gts.data, np.ones(data_shape, dtype=gts.data.dtype)), axis=2 ) gts.samples = base_gts.samples - # add additional ref and alt alleles - ref_alt = np.array( + base_dtype = {k: v[0] for k, v in base_gts.variants.dtype.fields.items()} + ref_alt = [ + ("T", "C"), + ("A", "G"), + ("C", "A"), + ("A", "G"), + ] + gts.variants = np.array( + [tuple(rec) + (ref_alt[idx],) for idx, rec in enumerate(base_gts.variants)], + dtype=(list(base_dtype.items()) + [("alleles", object)]), + ) + return gts + + def _get_fake_genotypes_multiallelic(self, with_phase=False): + gts = self._get_fake_genotypes_refalt(with_phase=with_phase) + # replace the necessary properties + gts.variants["alleles"] = np.array( [ ("T", "C"), - ("A", "G"), + ("A", "G", "T"), ("C", "A"), - ("A", "G"), - ], - dtype=[ - ("ref", "U100"), - ("alt", "U100"), + ("A", "G", "C"), ], + dtype=gts.variants["alleles"].dtype, ) - # see https://stackoverflow.com/a/5356137 - gts.variants = rfn.merge_arrays((base_gts.variants, ref_alt), flatten=True) + gts.data[[2, 4], 1, [0, 1]] = 2 return gts def test_read_ref_alt(self): # simple.vcf - gts_ref_alt_read = GenotypesRefAlt(DATADIR.joinpath("simple.vcf")) - gts_ref_alt_read.read() - expected = np.array( - [ - ("T", "C"), - ("A", "G"), - ("C", "A"), - ("A", "G"), - ], - dtype=gts_ref_alt_read.variants[["ref", "alt"]].dtype, - ) - for i, x in enumerate(expected): - assert gts_ref_alt_read.variants[["ref", "alt"]][i] == x + expected = self._get_fake_genotypes_refalt() + gts = GenotypesVCF(DATADIR.joinpath("simple.vcf")) + gts.read() + for i, x in enumerate(expected.variants["alleles"]): + assert gts.variants["alleles"][i] == x # example.vcf.gz - gts_ref_alt = GenotypesRefAlt(DATADIR.joinpath("example.vcf.gz")) - gts_ref_alt.read() + gts = GenotypesVCF(DATADIR.joinpath("example.vcf.gz")) + gts.read() expected = np.array( [ ("C", "A"), @@ -1065,37 +1072,121 @@ def test_read_ref_alt(self): ("T", "C"), ("C", "T"), ], - dtype=gts_ref_alt.variants[["ref", "alt"]].dtype, + dtype=gts.variants["alleles"].dtype, ) for i, x in enumerate(expected): - assert gts_ref_alt.variants[["ref", "alt"]][i] == x + assert gts.variants["alleles"][i] == tuple(x.tolist()) + + def test_read_multiallelic(self): + # simple-multiallelic.vcf + expected = self._get_fake_genotypes_multiallelic(with_phase=True) + + gts = GenotypesVCF(DATADIR.joinpath("simple-multiallelic.vcf")) + gts.read() + for i, x in enumerate(expected.variants["alleles"]): + assert gts.variants["alleles"][i] == x + np.testing.assert_allclose(gts.data, expected.data) - def test_write_ref_alt(self): + def test_write_ref_alt(self, multiallelic=False): # strategy is to read in the file, write it, and then read again - # read genotypes - gts_ref_alt_write = GenotypesRefAlt(DATADIR.joinpath("simple.vcf")) - gts_ref_alt_write.read() - gts_ref_alt_write.check_phase() + if multiallelic: + expected = self._get_fake_genotypes_multiallelic() + # read genotypes + gts = GenotypesVCF(DATADIR.joinpath("simple-multiallelic.vcf")) + else: + expected = self._get_fake_genotypes_refalt() + # read genotypes + gts = GenotypesVCF(DATADIR.joinpath("simple.vcf")) + gts.read() + gts.check_phase() # write file to new file fname = DATADIR.joinpath("test.vcf") - gts_ref_alt_write.fname = fname - gts_ref_alt_write.write() + gts.fname = fname + gts.write() # read again - gts_ref_alt_write.read() - gts_ref_alt_write.check_phase() + gts.read() + gts.check_phase() # compare samples, data, variants, and ref/alt for equality - expected = self._get_fake_genotypes_refalt() - assert gts_ref_alt_write.samples == expected.samples - np.testing.assert_allclose(gts_ref_alt_write.data, expected.data) + assert gts.samples == expected.samples + np.testing.assert_allclose(gts.data, expected.data) for i, x in enumerate(expected.variants): # dtype.names gives us names of columns in the array for col in expected.variants.dtype.names: # each row is a variant # index into col, i gets specific variant, x iterates thru - assert gts_ref_alt_write.variants[col][i] == x[col] + assert gts.variants[col][i] == x[col] - os.remove(str(fname)) + fname.unlink() + + def test_write_multiallelic(self): + self.test_write_ref_alt(multiallelic=True) + + def test_write_phase(self, prephased=True): + gts = self._get_fake_genotypes_refalt() + + fname = DATADIR.joinpath("test_write_phase.vcf") + gts.fname = fname + if prephased: + gts._prephased = True + else: + # add phasing information back + gts.data = np.dstack( + (gts.data, np.ones(gts.data.shape[:2], dtype=np.uint8)) + ) + gts.data[:2, 1, 2] = 0 + gts.write() + + new_gts = GenotypesVCF(fname) + if prephased: + new_gts._prephased = True + new_gts.read() + + # check that everything matches what we expected + np.testing.assert_allclose(gts.data, new_gts.data) + assert gts.samples == new_gts.samples + for i in range(len(new_gts.variants)): + for col in ("chrom", "pos", "id", "alleles"): + assert gts.variants[col][i] == new_gts.variants[col][i] + + fname.unlink() + + def test_write_unphased(self): + self.test_write_phase(prephased=False) + + def test_write_missing(self): + gts = self._get_fake_genotypes_refalt() + gts.fname = DATADIR.joinpath("test_write_missing.vcf") + vals = len(np.unique(gts.data)) + + gts.write() + + new_gts = GenotypesVCF(gts.fname) + new_gts.read() + + new_gts.check_missing() + + # force two of the samples to have a missing GT + gts.data = gts.data.astype(np.int8) + gts.data[1, 1, 1] = -1 + gts.data = gts.data.astype(np.uint8) + + gts.write() + + new_gts = GenotypesVCF(gts.fname) + new_gts.read() + + assert len(np.unique(gts.data)) == (vals + 1) + + with pytest.raises(ValueError) as info: + gts.check_missing() + assert ( + str(info.value) + == "Genotype with ID 1:10116:A:G at POS 1:10116 is missing for sample" + " HG00097" + ) + + gts.fname.unlink() class TestBreakpoints: @@ -1280,3 +1371,36 @@ def test_breakpoints_to_pop_array_chrom_no_match(self): with pytest.raises(ValueError) as info: pop_arr = expected.population_array(variants[[0, 1, 3]]) assert str(info.value).startswith("Chromosome ") + + +class TestDocExamples: + def test_gts2hap(self): + # which variants do we want to write to the haplotype file? + variants = {"rs429358", "rs7412"} + + # load the genotypes file + # you can use either a VCF or PGEN file + gt = GenotypesVCF("tests/data/apoe.vcf.gz") + gt.read(variants=variants) + + # initialize an empty haplotype file + hp = Haplotypes("output.hap", haplotype=Haplotype) + hp.data = {} + + for variant in gt.variants: + ID, chrom, pos, alleles = variant[["id", "chrom", "pos", "alleles"]] + end = pos + len(alleles[1]) + + # create a haplotype line in the .hap file + # you should fill out "beta" with your own value + hp.data[ID] = HaptoolsHaplotype( + chrom=chrom, start=pos, end=end, id=ID, beta=0.5 + ) + + # create variant lines for each haplotype + hp.data[ID].variants = ( + Variant(start=pos, end=end, id=ID, allele=alleles[1]), + ) + + hp.write() + hp.fname.unlink() diff --git a/tests/test_outputvcf.py b/tests/test_outputvcf.py index 74ae0064..07e9739c 100644 --- a/tests/test_outputvcf.py +++ b/tests/test_outputvcf.py @@ -90,9 +90,9 @@ def _get_expected_output(): gts.data = gts.data.transpose((1, 0, 2)) gts.variants = np.array( [ - ("1:10114:T:C", "1", 10114, "T", "C"), - ("1:59423090:A:G", "1", 59423090, "A", "G"), - ("2:10122:A:G", "2", 10122, "A", "G"), + ("1:10114:T:C", "1", 10114, ("T", "C")), + ("1:59423090:A:G", "1", 59423090, ("A", "G")), + ("2:10122:A:G", "2", 10122, ("A", "G")), ], dtype=gts.variants.dtype, ) @@ -759,7 +759,10 @@ def test_model_files(): validate_params( model, mapdir, chroms, popsize, vcf_file, faulty_sampleinfo_file, False ) - msg = "Sample HG00022 in sampleinfo file is not present in the vcf file." + msg = ( + "Sample HG00022 from population CEU in sampleinfo file is not present in the" + " vcf file." + ) assert (str(e.value)) == msg faulty_model = DATADIR.joinpath("dat_files/faulty_model_sample_info.dat") diff --git a/tests/test_transform.py b/tests/test_transform.py index 930c4cc3..7aeb6a25 100644 --- a/tests/test_transform.py +++ b/tests/test_transform.py @@ -12,8 +12,8 @@ GenotypesAncestry, ) from haptools.__main__ import main -from .test_data import TestGenotypesRefAlt -from haptools.data import Variant, GenotypesRefAlt, GenotypesPLINK +from .test_data import TestGenotypesVCF +from haptools.data import Variant, GenotypesVCF, GenotypesPLINK DATADIR = Path(__file__).parent.joinpath("data") @@ -37,7 +37,7 @@ def _get_expected_ancestry(self): return expected def _get_fake_genotypes(self): - base_gts = TestGenotypesRefAlt()._get_fake_genotypes_refalt(with_phase=True) + base_gts = TestGenotypesVCF()._get_fake_genotypes_refalt(with_phase=True) # copy all of the fields gts = GenotypesAncestry(fname=None) gts.data = base_gts.data @@ -131,7 +131,7 @@ def test_write_genotypes(self): assert False @pytest.mark.xfail(reason="not implemented yet") - def test_write_genotypes_unphased(self): + def test_write_genotypes_phase(self, prephased=True): assert False @@ -199,7 +199,7 @@ def test_haps_transform(self): gens = TestGenotypesAncestry()._get_fake_genotypes() gens.check_phase() - hap_gt = GenotypesRefAlt(fname=None) + hap_gt = GenotypesVCF(fname=None) haps.transform(gens, hap_gt) np.testing.assert_allclose(hap_gt.data, expected) @@ -208,7 +208,7 @@ def test_haps_transform(self): gens.data[[2, 4], 0, 1] = 1 gens.data[[1, 4], 2, 0] = 1 - hap_gt = GenotypesRefAlt(fname=None) + hap_gt = GenotypesVCF(fname=None) haps.transform(gens, hap_gt) np.testing.assert_allclose(hap_gt.data, expected) @@ -242,7 +242,7 @@ def test_basic_subset(capfd): """ # first, remove the last two variants in the genotypes file - gts = GenotypesRefAlt.load("tests/data/simple.vcf") + gts = GenotypesVCF.load("tests/data/simple.vcf") gts.fname = Path("simple_minus_two.vcf") gts.subset(variants=tuple(gts.variants["id"][:-2]), inplace=True) gts.write()