Skip to content

Commit

Permalink
fix: regression in multiallelic support for simgenotype (#195)
Browse files Browse the repository at this point in the history
Co-authored-by: mlamkin7 <[email protected]>
  • Loading branch information
aryarm and mlamkin7 authored Mar 5, 2023
1 parent 85bd494 commit b57f91f
Show file tree
Hide file tree
Showing 16 changed files with 327 additions and 186 deletions.
24 changes: 12 additions & 12 deletions docs/api/data.rst
Original file line number Diff line number Diff line change
Expand Up @@ -49,7 +49,7 @@ This module supports reading and writing files that follow `the VCF <https://gat
Documentation
-------------
The :ref:`genotypes.py API docs <api-haptools-data-genotypes>` contain example usage of the :class:`Genotypes` class.
See the documentation for the :class:`GenotypesRefAlt` class for an example of extending the :class:`Genotypes` class so that it loads REF and ALT alleles as well.
See the documentation for the :class:`GenotypesVCF` class for an example of extending the :class:`Genotypes` class so that it loads REF and ALT alleles as well.

Classes
-------
Expand Down Expand Up @@ -152,15 +152,15 @@ You can index into a loaded :class:`Genotypes` instance using the ``subset()`` f
By default, the ``subset()`` method returns a new :class:`Genotypes` instance. The samples and variants in the new instance will be in the order specified.

GenotypesRefAlt
+++++++++++++++
The :class:`Genotypes` class can be easily *extended* (sub-classed) to load extra fields into the ``variants`` structured array. The :class:`GenotypesRefAlt` class is an example of this where I extended the :class:`Genotypes` class to add REF and ALT fields from the VCF to the columns of the structured array. So the ``variants`` array will have named columns: "id", "chrom", "pos", "ref", and "alt".
GenotypesVCF
++++++++++++
The :class:`Genotypes` class can be easily *extended* (sub-classed) to load extra fields into the ``variants`` structured array. The :class:`GenotypesVCF` class is an example of this where I extended the :class:`Genotypes` class to add REF and ALT fields from the VCF as a new column of the structured array. So the ``variants`` array will have named columns: "id", "chrom", "pos", "alleles". The new "alleles" column contains lists of alleles designed such that the first element in the list is the REF allele, the second is ALT1, the third is ALT2, etc.

All of the other methods in the :class:`Genotypes` class are inherited, but the :class:`GenotypesRefAlt` class implements an additional method ``write()`` for dumping the contents of the class to the provided file.
All of the other methods in the :class:`Genotypes` class are inherited, but the :class:`GenotypesVCF` class implements an additional method ``write()`` for dumping the contents of the class to the provided file.

.. code-block:: python
genotypes = data.GenotypesRefAlt.load('tests/data/simple.vcf')
genotypes = data.GenotypesVCF.load('tests/data/simple.vcf')
# make the first sample homozygous for the alt allele of the fourth variant
genotypes.data[0, 3] = (1, 1)
genotypes.write()
Expand All @@ -182,13 +182,13 @@ The :class:`GenotypesPLINK` class offers experimental support for reading and wr
pip install haptools[files]
The :class:`GenotypesPLINK` class inherits from the :class:`GenotypesRefAlt` class, so it has all the same methods and properties. Loading genotypes is the exact same, for example.
The :class:`GenotypesPLINK` class inherits from the :class:`GenotypesVCF` class, so it has all the same methods and properties. Loading genotypes is the exact same, for example.

.. code-block:: python
genotypes = data.GenotypesPLINK.load('tests/data/simple.pgen')
genotypes.data # a numpy array of shape n x p x 2
genotypes.variants # a numpy structured array of shape p x 6
genotypes.variants # a numpy structured array of shape p x 5
genotypes.samples # a tuple of strings of length n
In addition to the ``read()`` and ``load()`` methods, the :class:`GenotypesPLINK` class also has methods for reading (or writing) PVAR or PSAM files separately, without having to read (or write) the PGEN file as well.
Expand All @@ -198,7 +198,7 @@ In addition to the ``read()`` and ``load()`` methods, the :class:`GenotypesPLINK
genotypes = data.GenotypesPLINK('tests/data/simple.pgen')
genotypes.read_variants()
genotypes.variants # a numpy structured array of shape p x 6
genotypes.variants # a numpy structured array of shape p x 5
genotypes.read_samples()
genotypes.samples # a tuple of strings of length n
Expand Down Expand Up @@ -305,14 +305,14 @@ To write to a **.hap** file, you must first initialize a :class:`Haplotypes` obj
Obtaining haplotype "genotypes"
*******************************
Using the ``transform()`` function, you can obtain a full instance of the :class:`GenotypesRefAlt` class where haplotypes from a :class:`Haplotypes` object are encoded as the variants in the genotype matrix.
Using the ``transform()`` function, you can obtain a full instance of the :class:`GenotypesVCF` class where haplotypes from a :class:`Haplotypes` object are encoded as the variants in the genotype matrix.

.. code-block:: python
haplotypes = data.Haplotypes.load('tests/data/example.hap.gz')
genotypes = data.GenotypesRefAlt.load('tests/data/example.vcf.gz')
genotypes = data.GenotypesVCF.load('tests/data/example.vcf.gz')
hap_gts = haplotypes.transform(genotypes)
hap_gts # a GenotypesRefAlt instance where haplotypes are variants
hap_gts # a GenotypesVCF instance where haplotypes are variants
Haplotype
+++++++++
Expand Down
14 changes: 7 additions & 7 deletions docs/api/examples.rst
Original file line number Diff line number Diff line change
Expand Up @@ -24,7 +24,7 @@ As an example, let's say we would like to convert the following ``.blocks.det``
# load the genotypes file
# you can use either a VCF or PGEN file
gt = data.GenotypesRefAlt.load("input.vcf.gz")
gt = data.GenotypesVCF.load("input.vcf.gz")
gt = data.GenotypesPLINK.load("input.pgen")
# load the haplotypes
Expand All @@ -46,9 +46,9 @@ As an example, let's say we would like to convert the following ``.blocks.det``
# create variant lines for each haplotype
# Note that the .blocks.det file doesn't specify an allele, so
# we simply choose the REF allele for this example
# we simply choose the first allele (ie the REF allele) for this example
hp.data[hap_id].variants = tuple(
data.Variant(start=v["pos"], end=v["pos"]+len(v["ref"]), id=v["id"], allele=v["ref"])
data.Variant(start=v["pos"], end=v["pos"]+len(v["alleles"][0]), id=v["id"], allele=v["alleles"][0])
for v in snp_gts.variants
)
Expand Down Expand Up @@ -84,7 +84,7 @@ You can easily use the :ref:`data API <api-data>` and the :ref:`simphenotype API
# load the genotypes file
# you can use either a VCF or PGEN file
gt = data.GenotypesRefAlt("tests/data/apoe.vcf.gz")
gt = data.GenotypesVCF("tests/data/apoe.vcf.gz")
gt.read(variants=variants)
# the advantage of using a PGEN file is that you can use read_variants() to load
# the variants quickly w/o having to load the genotypes, too
Expand All @@ -96,14 +96,14 @@ You can easily use the :ref:`data API <api-data>` and the :ref:`simphenotype API
hp.data = {}
for variant in gt.variants:
ID, chrom, pos, alt = variant[["id", "chrom", "pos", "alt"]]
end = pos + len(alt)
ID, chrom, pos, alleles = variant[["id", "chrom", "pos", "alleles"]]
end = pos + len(alleles[1])
# create a haplotype line in the .hap file
# you should fill out "beta" with your own value
hp.data[ID] = Haplotype(chrom=chrom, start=pos, end=end, id=ID, beta=0.5)
# create variant lines for each haplotype
hp.data[ID].variants = (data.Variant(start=pos, end=end, id=ID, allele=alt),)
hp.data[ID].variants = (data.Variant(start=pos, end=end, id=ID, allele=alleles[1]),)
hp.write()
2 changes: 1 addition & 1 deletion haptools/data/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -3,4 +3,4 @@
from .covariates import Covariates
from .breakpoints import Breakpoints, HapBlock
from .haplotypes import Extra, Variant, Haplotype, Haplotypes
from .genotypes import Genotypes, GenotypesRefAlt, GenotypesPLINK
from .genotypes import Genotypes, GenotypesVCF, GenotypesPLINK
90 changes: 44 additions & 46 deletions haptools/data/genotypes.py
Original file line number Diff line number Diff line change
Expand Up @@ -491,7 +491,10 @@ def check_phase(self):
self.log.warning("Phase information has already been removed from the data")
return
# check: are there any variants that are heterozygous and unphased?
unphased = (self.data[:, :, 0] ^ self.data[:, :, 1]) & (~self.data[:, :, 2])
data = self.data
if data.dtype != np.bool_:
data = self.data.astype(np.bool_)
unphased = (data[:, :, 0] ^ data[:, :, 1]) & (~data[:, :, 2])
if np.any(unphased):
samp_idx, variant_idx = np.nonzero(unphased)
raise ValueError(
Expand Down Expand Up @@ -591,11 +594,11 @@ def check_sorted(self):
)


class GenotypesRefAlt(Genotypes):
class GenotypesVCF(Genotypes):
"""
A class for processing genotypes from a file
Unlike the base Genotypes class, this class also includes REF and ALT alleles in
the variants array
Unlike the base Genotypes class, this class also includes REF and ALT alleles as
a list of alleles in the variants array
Attributes
----------
Expand All @@ -610,24 +613,15 @@ class GenotypesRefAlt(Genotypes):
1. ID
2. CHROM
3. POS
4. REF
5. ALT
4. [REF, ALT1, ALT2, ...]
log: Logger
See documentation for :py:attr:`~.Genotypes.log`
"""

def __init__(self, fname: Path | str, log: Logger = None):
super().__init__(fname, log)
self.variants = np.array(
[],
dtype=[
("id", "U50"),
("chrom", "U10"),
("pos", np.uint32),
("ref", "U100"),
("alt", "U100"),
],
)
dtype = {k: v[0] for k, v in self.variants.dtype.fields.items()}
self.variants = np.array([], dtype=list(dtype.items()) + [("alleles", object)])

def _variant_arr(self, record: Variant):
"""
Expand All @@ -638,15 +632,14 @@ def _variant_arr(self, record: Variant):
record.ID,
record.CHROM,
record.POS,
record.REF,
record.ALT[0],
(record.REF, *record.ALT),
),
dtype=self.variants.dtype,
)

def write(self):
"""
Write the variants in this class to a VCF at :py:attr:`~.GenotypesRefAlt.fname`
Write the variants in this class to a VCF at :py:attr:`~.GenotypesVCF.fname`
"""
vcf = VariantFile(str(self.fname), mode="w")
# make sure the header is properly structured
Expand All @@ -671,13 +664,15 @@ def write(self):
for sample in self.samples:
vcf.header.add_sample(sample)
self.log.info("Writing VCF records")
phased = self._prephased or (self.data.shape[2] < 3)
missing_val = np.iinfo(np.uint8).max
for var_idx, var in enumerate(self.variants):
rec = {
"contig": var["chrom"],
"start": var["pos"],
"stop": var["pos"] + len(var["ref"]) - 1,
"stop": var["pos"] + len(var["alleles"][0]) - 1,
"qual": None,
"alleles": tuple(var[["ref", "alt"]]),
"alleles": var["alleles"],
"id": var["id"],
"filter": None,
}
Expand All @@ -686,37 +681,42 @@ def write(self):
# parse the record into a pysam.VariantRecord
record = vcf.new_record(**rec)
for samp_idx, sample in enumerate(self.samples):
# TODO: make this work when there are missing values
record.samples[sample]["GT"] = tuple(self.data[samp_idx, var_idx, :2])
# TODO: add proper phase info
record.samples[sample].phased = True
record.samples[sample]["GT"] = tuple(
None if val == missing_val else val
for val in self.data[samp_idx, var_idx, :2]
)
# add proper phasing info
if phased:
record.samples[sample].phased = True
else:
record.samples[sample].phased = self.data[samp_idx, var_idx, 2]
# write the record to a file
vcf.write(record)
vcf.close()


class GenotypesPLINK(GenotypesRefAlt):
class GenotypesPLINK(GenotypesVCF):
"""
A class for processing genotypes from a PLINK ``.pgen`` file
Attributes
----------
data : np.array
See documentation for :py:attr:`~.GenotypesRefAlt.data`
See documentation for :py:attr:`~.GenotypesVCF.data`
samples : tuple
See documentation for :py:attr:`~.GenotypesRefAlt.data`
See documentation for :py:attr:`~.GenotypesVCF.data`
variants : np.array
See documentation for :py:attr:`~.GenotypesRefAlt.data`
See documentation for :py:attr:`~.GenotypesVCF.data`
log: Logger
See documentation for :py:attr:`~.GenotypesRefAlt.data`
See documentation for :py:attr:`~.GenotypesVCF.data`
chunk_size: int, optional
The max number of variants to fetch from and write to the PGEN file at any
given time
If this value is provided, variants from the PGEN file will be loaded in
chunks so as to use less memory
_prephased: bool
See documentation for :py:attr:`~.GenotypesRefAlt.data`
See documentation for :py:attr:`~.GenotypesVCF.data`
Examples
--------
Expand Down Expand Up @@ -746,7 +746,7 @@ def read_samples(self, samples: list[str] = None):
Parameters
----------
samples : list[str], optional
See documentation for :py:attr:`~.GenotypesRefAlt.read`
See documentation for :py:attr:`~.GenotypesVCF.read`
Returns
-------
Expand Down Expand Up @@ -839,14 +839,12 @@ def _variant_arr(
npt.NDArray
A row from the :py:attr:`~.GenotypesPLINK.variants` array
"""
# TODO: remove the AAF column; right now, we just set it to 0.5 arbitrarily
return np.array(
(
record[cid["ID"]],
record[cid["CHROM"]],
record[cid["POS"]],
record[cid["REF"]],
record[cid["ALT"]],
(record[cid["REF"]], record[cid["ALT"]]),
),
dtype=self.variants.dtype,
)
Expand All @@ -865,9 +863,9 @@ def _iterate_variants(
Parameters
----------
region : str, optional
See documentation for :py:attr:`~.GenotypesRefAlt.read`
See documentation for :py:attr:`~.GenotypesVCF.read`
variants : set[str], optional
See documentation for :py:attr:`~.GenotypesRefAlt.read`
See documentation for :py:attr:`~.GenotypesVCF.read`
Yields
------
Expand Down Expand Up @@ -940,11 +938,11 @@ def read_variants(
Parameters
----------
region : str, optional
See documentation for :py:attr:`~.GenotypesRefAlt.read`
See documentation for :py:attr:`~.GenotypesVCF.read`
variants : set[str], optional
See documentation for :py:attr:`~.GenotypesRefAlt.read`
See documentation for :py:attr:`~.GenotypesVCF.read`
max_variants : int, optional
See documentation for :py:attr:`~.GenotypesRefAlt.read`
See documentation for :py:attr:`~.GenotypesVCF.read`
Returns
-------
Expand Down Expand Up @@ -997,13 +995,13 @@ def read(
Parameters
----------
region : str, optional
See documentation for :py:attr:`~.GenotypesRefAlt.read`
See documentation for :py:attr:`~.GenotypesVCF.read`
samples : list[str], optional
See documentation for :py:attr:`~.GenotypesRefAlt.read`
See documentation for :py:attr:`~.GenotypesVCF.read`
variants : set[str], optional
See documentation for :py:attr:`~.GenotypesRefAlt.read`
See documentation for :py:attr:`~.GenotypesVCF.read`
max_variants : int, optional
See documentation for :py:attr:`~.GenotypesRefAlt.read`
See documentation for :py:attr:`~.GenotypesVCF.read`
"""
super(Genotypes, self).read()
import pgenlib
Expand Down Expand Up @@ -1199,9 +1197,9 @@ def write_variants(self):
rec = {
"contig": var["chrom"],
"start": var["pos"],
"stop": var["pos"] + len(var["ref"]) - 1,
"stop": var["pos"] + len(var["alleles"][0]) - 1,
"qual": None,
"alleles": tuple(var[["ref", "alt"]]),
"alleles": var["alleles"],
"id": var["id"],
"filter": None,
}
Expand Down
Loading

0 comments on commit b57f91f

Please sign in to comment.