fix: regression in multiallelic support for simgenotype (#195)

Co-authored-by: mlamkin7 <[email protected]>
CAST-genomics · Mar 5, 2023 · b57f91f · b57f91f
1 parent 85bd494
commit b57f91f
Show file tree

Hide file tree

Showing 16 changed files with 327 additions and 186 deletions.
diff --git a/docs/api/data.rst b/docs/api/data.rst
@@ -49,7 +49,7 @@ This module supports reading and writing files that follow `the VCF <https://gat
 Documentation
 -------------
 The :ref:`genotypes.py API docs <api-haptools-data-genotypes>` contain example usage of the :class:`Genotypes` class.
-See the documentation for the :class:`GenotypesRefAlt` class for an example of extending the :class:`Genotypes` class so that it loads REF and ALT alleles as well.
+See the documentation for the :class:`GenotypesVCF` class for an example of extending the :class:`Genotypes` class so that it loads REF and ALT alleles as well.
 
 Classes
 -------
@@ -152,15 +152,15 @@ You can index into a loaded :class:`Genotypes` instance using the ``subset()`` f
 
 By default, the ``subset()`` method returns a new :class:`Genotypes` instance. The samples and variants in the new instance will be in the order specified.
 
-GenotypesRefAlt
-+++++++++++++++
-The :class:`Genotypes` class can be easily *extended* (sub-classed) to load extra fields into the ``variants`` structured array. The :class:`GenotypesRefAlt` class is an example of this where I extended the :class:`Genotypes` class to add REF and ALT fields from the VCF to the columns of the structured array. So the ``variants`` array will have named columns: "id", "chrom", "pos", "ref", and "alt".
+GenotypesVCF
+++++++++++++
+The :class:`Genotypes` class can be easily *extended* (sub-classed) to load extra fields into the ``variants`` structured array. The :class:`GenotypesVCF` class is an example of this where I extended the :class:`Genotypes` class to add REF and ALT fields from the VCF as a new column of the structured array. So the ``variants`` array will have named columns: "id", "chrom", "pos", "alleles". The new "alleles" column contains lists of alleles designed such that the first element in the list is the REF allele, the second is ALT1, the third is ALT2, etc.
 
-All of the other methods in the :class:`Genotypes` class are inherited, but the :class:`GenotypesRefAlt` class implements an additional method ``write()`` for dumping the contents of the class to the provided file.
+All of the other methods in the :class:`Genotypes` class are inherited, but the :class:`GenotypesVCF` class implements an additional method ``write()`` for dumping the contents of the class to the provided file.
 
 .. code-block:: python
 
-	genotypes = data.GenotypesRefAlt.load('tests/data/simple.vcf')
+	genotypes = data.GenotypesVCF.load('tests/data/simple.vcf')
 	# make the first sample homozygous for the alt allele of the fourth variant
 	genotypes.data[0, 3] = (1, 1)
 	genotypes.write()
@@ -182,13 +182,13 @@ The :class:`GenotypesPLINK` class offers experimental support for reading and wr
 
 		pip install haptools[files]
 
-The :class:`GenotypesPLINK` class inherits from the :class:`GenotypesRefAlt` class, so it has all the same methods and properties. Loading genotypes is the exact same, for example.
+The :class:`GenotypesPLINK` class inherits from the :class:`GenotypesVCF` class, so it has all the same methods and properties. Loading genotypes is the exact same, for example.
 
 .. code-block:: python
 
 	genotypes = data.GenotypesPLINK.load('tests/data/simple.pgen')
 	genotypes.data     # a numpy array of shape n x p x 2
-	genotypes.variants # a numpy structured array of shape p x 6
+	genotypes.variants # a numpy structured array of shape p x 5
 	genotypes.samples  # a tuple of strings of length n
 
 In addition to the ``read()`` and ``load()`` methods, the :class:`GenotypesPLINK` class also has methods for reading (or writing) PVAR or PSAM files separately, without having to read (or write) the PGEN file as well.
@@ -198,7 +198,7 @@ In addition to the ``read()`` and ``load()`` methods, the :class:`GenotypesPLINK
 	genotypes = data.GenotypesPLINK('tests/data/simple.pgen')
 
 	genotypes.read_variants()
-	genotypes.variants # a numpy structured array of shape p x 6
+	genotypes.variants # a numpy structured array of shape p x 5
 
 	genotypes.read_samples()
 	genotypes.samples  # a tuple of strings of length n
@@ -305,14 +305,14 @@ To write to a **.hap** file, you must first initialize a :class:`Haplotypes` obj
 
 Obtaining haplotype "genotypes"
 *******************************
-Using the ``transform()`` function, you can obtain a full instance of the :class:`GenotypesRefAlt` class where haplotypes from a :class:`Haplotypes` object are encoded as the variants in the genotype matrix.
+Using the ``transform()`` function, you can obtain a full instance of the :class:`GenotypesVCF` class where haplotypes from a :class:`Haplotypes` object are encoded as the variants in the genotype matrix.
 
 .. code-block:: python
 
 	haplotypes = data.Haplotypes.load('tests/data/example.hap.gz')
-	genotypes = data.GenotypesRefAlt.load('tests/data/example.vcf.gz')
+	genotypes = data.GenotypesVCF.load('tests/data/example.vcf.gz')
 	hap_gts = haplotypes.transform(genotypes)
-	hap_gts   # a GenotypesRefAlt instance where haplotypes are variants
+	hap_gts   # a GenotypesVCF instance where haplotypes are variants
 
 Haplotype
 +++++++++

diff --git a/docs/api/examples.rst b/docs/api/examples.rst
@@ -24,7 +24,7 @@ As an example, let's say we would like to convert the following ``.blocks.det``
 
     # load the genotypes file
     # you can use either a VCF or PGEN file
-    gt = data.GenotypesRefAlt.load("input.vcf.gz")
+    gt = data.GenotypesVCF.load("input.vcf.gz")
     gt = data.GenotypesPLINK.load("input.pgen")
 
     # load the haplotypes
@@ -46,9 +46,9 @@ As an example, let's say we would like to convert the following ``.blocks.det``
 
             # create variant lines for each haplotype
             # Note that the .blocks.det file doesn't specify an allele, so
-            # we simply choose the REF allele for this example
+            # we simply choose the first allele (ie the REF allele) for this example
             hp.data[hap_id].variants = tuple(
-                data.Variant(start=v["pos"], end=v["pos"]+len(v["ref"]), id=v["id"], allele=v["ref"])
+                data.Variant(start=v["pos"], end=v["pos"]+len(v["alleles"][0]), id=v["id"], allele=v["alleles"][0])
                 for v in snp_gts.variants
             )
 
@@ -84,7 +84,7 @@ You can easily use the :ref:`data API <api-data>` and the :ref:`simphenotype API
 
     # load the genotypes file
     # you can use either a VCF or PGEN file
-    gt = data.GenotypesRefAlt("tests/data/apoe.vcf.gz")
+    gt = data.GenotypesVCF("tests/data/apoe.vcf.gz")
     gt.read(variants=variants)
     # the advantage of using a PGEN file is that you can use read_variants() to load
     # the variants quickly w/o having to load the genotypes, too
@@ -96,14 +96,14 @@ You can easily use the :ref:`data API <api-data>` and the :ref:`simphenotype API
     hp.data = {}
 
     for variant in gt.variants:
-        ID, chrom, pos, alt = variant[["id", "chrom", "pos", "alt"]]
-        end = pos + len(alt)
+        ID, chrom, pos, alleles = variant[["id", "chrom", "pos", "alleles"]]
+        end = pos + len(alleles[1])
 
         # create a haplotype line in the .hap file
         # you should fill out "beta" with your own value
         hp.data[ID] = Haplotype(chrom=chrom, start=pos, end=end, id=ID, beta=0.5)
 
         # create variant lines for each haplotype
-        hp.data[ID].variants = (data.Variant(start=pos, end=end, id=ID, allele=alt),)
+        hp.data[ID].variants = (data.Variant(start=pos, end=end, id=ID, allele=alleles[1]),)
 
     hp.write()
diff --git a/haptools/data/__init__.py b/haptools/data/__init__.py
@@ -3,4 +3,4 @@
 from .covariates import Covariates
 from .breakpoints import Breakpoints, HapBlock
 from .haplotypes import Extra, Variant, Haplotype, Haplotypes
-from .genotypes import Genotypes, GenotypesRefAlt, GenotypesPLINK
+from .genotypes import Genotypes, GenotypesVCF, GenotypesPLINK
diff --git a/haptools/data/genotypes.py b/haptools/data/genotypes.py
@@ -491,7 +491,10 @@ def check_phase(self):
             self.log.warning("Phase information has already been removed from the data")
             return
         # check: are there any variants that are heterozygous and unphased?
-        unphased = (self.data[:, :, 0] ^ self.data[:, :, 1]) & (~self.data[:, :, 2])
+        data = self.data
+        if data.dtype != np.bool_:
+            data = self.data.astype(np.bool_)
+        unphased = (data[:, :, 0] ^ data[:, :, 1]) & (~data[:, :, 2])
         if np.any(unphased):
             samp_idx, variant_idx = np.nonzero(unphased)
             raise ValueError(
@@ -591,11 +594,11 @@ def check_sorted(self):
                 )
 
 
-class GenotypesRefAlt(Genotypes):
+class GenotypesVCF(Genotypes):
     """
     A class for processing genotypes from a file
-    Unlike the base Genotypes class, this class also includes REF and ALT alleles in
-    the variants array
+    Unlike the base Genotypes class, this class also includes REF and ALT alleles as
+    a list of alleles in the variants array
 
     Attributes
     ----------
@@ -610,24 +613,15 @@ class GenotypesRefAlt(Genotypes):
             1. ID
             2. CHROM
             3. POS
-            4. REF
-            5. ALT
+            4. [REF, ALT1, ALT2, ...]
     log: Logger
         See documentation for :py:attr:`~.Genotypes.log`
     """
 
     def __init__(self, fname: Path | str, log: Logger = None):
         super().__init__(fname, log)
-        self.variants = np.array(
-            [],
-            dtype=[
-                ("id", "U50"),
-                ("chrom", "U10"),
-                ("pos", np.uint32),
-                ("ref", "U100"),
-                ("alt", "U100"),
-            ],
-        )
+        dtype = {k: v[0] for k, v in self.variants.dtype.fields.items()}
+        self.variants = np.array([], dtype=list(dtype.items()) + [("alleles", object)])
 
     def _variant_arr(self, record: Variant):
         """
@@ -638,15 +632,14 @@ def _variant_arr(self, record: Variant):
                 record.ID,
                 record.CHROM,
                 record.POS,
-                record.REF,
-                record.ALT[0],
+                (record.REF, *record.ALT),
             ),
             dtype=self.variants.dtype,
         )
 
     def write(self):
         """
-        Write the variants in this class to a VCF at :py:attr:`~.GenotypesRefAlt.fname`
+        Write the variants in this class to a VCF at :py:attr:`~.GenotypesVCF.fname`
         """
         vcf = VariantFile(str(self.fname), mode="w")
         # make sure the header is properly structured
@@ -671,13 +664,15 @@ def write(self):
             for sample in self.samples:
                 vcf.header.add_sample(sample)
         self.log.info("Writing VCF records")
+        phased = self._prephased or (self.data.shape[2] < 3)
+        missing_val = np.iinfo(np.uint8).max
         for var_idx, var in enumerate(self.variants):
             rec = {
                 "contig": var["chrom"],
                 "start": var["pos"],
-                "stop": var["pos"] + len(var["ref"]) - 1,
+                "stop": var["pos"] + len(var["alleles"][0]) - 1,
                 "qual": None,
-                "alleles": tuple(var[["ref", "alt"]]),
+                "alleles": var["alleles"],
                 "id": var["id"],
                 "filter": None,
             }
@@ -686,37 +681,42 @@ def write(self):
             # parse the record into a pysam.VariantRecord
             record = vcf.new_record(**rec)
             for samp_idx, sample in enumerate(self.samples):
-                # TODO: make this work when there are missing values
-                record.samples[sample]["GT"] = tuple(self.data[samp_idx, var_idx, :2])
-                # TODO: add proper phase info
-                record.samples[sample].phased = True
+                record.samples[sample]["GT"] = tuple(
+                    None if val == missing_val else val
+                    for val in self.data[samp_idx, var_idx, :2]
+                )
+                # add proper phasing info
+                if phased:
+                    record.samples[sample].phased = True
+                else:
+                    record.samples[sample].phased = self.data[samp_idx, var_idx, 2]
             # write the record to a file
             vcf.write(record)
         vcf.close()
 
 
-class GenotypesPLINK(GenotypesRefAlt):
+class GenotypesPLINK(GenotypesVCF):
     """
     A class for processing genotypes from a PLINK ``.pgen`` file
 
     Attributes
     ----------
     data : np.array
-        See documentation for :py:attr:`~.GenotypesRefAlt.data`
+        See documentation for :py:attr:`~.GenotypesVCF.data`
     samples : tuple
-        See documentation for :py:attr:`~.GenotypesRefAlt.data`
+        See documentation for :py:attr:`~.GenotypesVCF.data`
     variants : np.array
-        See documentation for :py:attr:`~.GenotypesRefAlt.data`
+        See documentation for :py:attr:`~.GenotypesVCF.data`
     log: Logger
-        See documentation for :py:attr:`~.GenotypesRefAlt.data`
+        See documentation for :py:attr:`~.GenotypesVCF.data`
     chunk_size: int, optional
         The max number of variants to fetch from and write to the PGEN file at any
         given time
 
         If this value is provided, variants from the PGEN file will be loaded in
         chunks so as to use less memory
     _prephased: bool
-        See documentation for :py:attr:`~.GenotypesRefAlt.data`
+        See documentation for :py:attr:`~.GenotypesVCF.data`
 
     Examples
     --------
@@ -746,7 +746,7 @@ def read_samples(self, samples: list[str] = None):
         Parameters
         ----------
         samples : list[str], optional
-            See documentation for :py:attr:`~.GenotypesRefAlt.read`
+            See documentation for :py:attr:`~.GenotypesVCF.read`
 
         Returns
         -------
@@ -839,14 +839,12 @@ def _variant_arr(
         npt.NDArray
             A row from the :py:attr:`~.GenotypesPLINK.variants` array
         """
-        # TODO: remove the AAF column; right now, we just set it to 0.5 arbitrarily
         return np.array(
             (
                 record[cid["ID"]],
                 record[cid["CHROM"]],
                 record[cid["POS"]],
-                record[cid["REF"]],
-                record[cid["ALT"]],
+                (record[cid["REF"]], record[cid["ALT"]]),
             ),
             dtype=self.variants.dtype,
         )
@@ -865,9 +863,9 @@ def _iterate_variants(
         Parameters
         ----------
         region : str, optional
-            See documentation for :py:attr:`~.GenotypesRefAlt.read`
+            See documentation for :py:attr:`~.GenotypesVCF.read`
         variants : set[str], optional
-            See documentation for :py:attr:`~.GenotypesRefAlt.read`
+            See documentation for :py:attr:`~.GenotypesVCF.read`
 
         Yields
         ------
@@ -940,11 +938,11 @@ def read_variants(
         Parameters
         ----------
         region : str, optional
-            See documentation for :py:attr:`~.GenotypesRefAlt.read`
+            See documentation for :py:attr:`~.GenotypesVCF.read`
         variants : set[str], optional
-            See documentation for :py:attr:`~.GenotypesRefAlt.read`
+            See documentation for :py:attr:`~.GenotypesVCF.read`
         max_variants : int, optional
-            See documentation for :py:attr:`~.GenotypesRefAlt.read`
+            See documentation for :py:attr:`~.GenotypesVCF.read`
 
         Returns
         -------
@@ -997,13 +995,13 @@ def read(
         Parameters
         ----------
         region : str, optional
-            See documentation for :py:attr:`~.GenotypesRefAlt.read`
+            See documentation for :py:attr:`~.GenotypesVCF.read`
         samples : list[str], optional
-            See documentation for :py:attr:`~.GenotypesRefAlt.read`
+            See documentation for :py:attr:`~.GenotypesVCF.read`
         variants : set[str], optional
-            See documentation for :py:attr:`~.GenotypesRefAlt.read`
+            See documentation for :py:attr:`~.GenotypesVCF.read`
         max_variants : int, optional
-            See documentation for :py:attr:`~.GenotypesRefAlt.read`
+            See documentation for :py:attr:`~.GenotypesVCF.read`
         """
         super(Genotypes, self).read()
         import pgenlib
@@ -1199,9 +1197,9 @@ def write_variants(self):
                 rec = {
                     "contig": var["chrom"],
                     "start": var["pos"],
-                    "stop": var["pos"] + len(var["ref"]) - 1,
+                    "stop": var["pos"] + len(var["alleles"][0]) - 1,
                     "qual": None,
-                    "alleles": tuple(var[["ref", "alt"]]),
+                    "alleles": var["alleles"],
                     "id": var["id"],
                     "filter": None,
                 }