diff --git a/docs/api/data.rst b/docs/api/data.rst
index 7b2d0904..3c9e9d14 100644
--- a/docs/api/data.rst
+++ b/docs/api/data.rst
@@ -49,7 +49,7 @@ This module supports reading and writing files that follow `the VCF <https://gat
 Documentation
 -------------
 The :ref:`genotypes.py API docs <api-haptools-data-genotypes>` contain example usage of the :class:`Genotypes` class.
-See the documentation for the :class:`GenotypesRefAlt` class for an example of extending the :class:`Genotypes` class so that it loads REF and ALT alleles as well.
+See the documentation for the :class:`GenotypesVCF` class for an example of extending the :class:`Genotypes` class so that it loads REF and ALT alleles as well.
 
 Classes
 -------
@@ -152,15 +152,15 @@ You can index into a loaded :class:`Genotypes` instance using the ``subset()`` f
 
 By default, the ``subset()`` method returns a new :class:`Genotypes` instance. The samples and variants in the new instance will be in the order specified.
 
-GenotypesRefAlt
-+++++++++++++++
-The :class:`Genotypes` class can be easily *extended* (sub-classed) to load extra fields into the ``variants`` structured array. The :class:`GenotypesRefAlt` class is an example of this where I extended the :class:`Genotypes` class to add REF and ALT fields from the VCF to the columns of the structured array. So the ``variants`` array will have named columns: "id", "chrom", "pos", "ref", and "alt".
+GenotypesVCF
+++++++++++++
+The :class:`Genotypes` class can be easily *extended* (sub-classed) to load extra fields into the ``variants`` structured array. The :class:`GenotypesVCF` class is an example of this where I extended the :class:`Genotypes` class to add REF and ALT fields from the VCF as a new column of the structured array. So the ``variants`` array will have named columns: "id", "chrom", "pos", "alleles". The new "alleles" column contains lists of alleles designed such that the first element in the list is the REF allele, the second is ALT1, the third is ALT2, etc.
 
-All of the other methods in the :class:`Genotypes` class are inherited, but the :class:`GenotypesRefAlt` class implements an additional method ``write()`` for dumping the contents of the class to the provided file.
+All of the other methods in the :class:`Genotypes` class are inherited, but the :class:`GenotypesVCF` class implements an additional method ``write()`` for dumping the contents of the class to the provided file.
 
 .. code-block:: python
 
-	genotypes = data.GenotypesRefAlt.load('tests/data/simple.vcf')
+	genotypes = data.GenotypesVCF.load('tests/data/simple.vcf')
 	# make the first sample homozygous for the alt allele of the fourth variant
 	genotypes.data[0, 3] = (1, 1)
 	genotypes.write()
@@ -182,13 +182,13 @@ The :class:`GenotypesPLINK` class offers experimental support for reading and wr
 
 		pip install haptools[files]
 
-The :class:`GenotypesPLINK` class inherits from the :class:`GenotypesRefAlt` class, so it has all the same methods and properties. Loading genotypes is the exact same, for example.
+The :class:`GenotypesPLINK` class inherits from the :class:`GenotypesVCF` class, so it has all the same methods and properties. Loading genotypes is the exact same, for example.
 
 .. code-block:: python
 
 	genotypes = data.GenotypesPLINK.load('tests/data/simple.pgen')
 	genotypes.data     # a numpy array of shape n x p x 2
-	genotypes.variants # a numpy structured array of shape p x 6
+	genotypes.variants # a numpy structured array of shape p x 5
 	genotypes.samples  # a tuple of strings of length n
 
 In addition to the ``read()`` and ``load()`` methods, the :class:`GenotypesPLINK` class also has methods for reading (or writing) PVAR or PSAM files separately, without having to read (or write) the PGEN file as well.
@@ -198,7 +198,7 @@ In addition to the ``read()`` and ``load()`` methods, the :class:`GenotypesPLINK
 	genotypes = data.GenotypesPLINK('tests/data/simple.pgen')
 
 	genotypes.read_variants()
-	genotypes.variants # a numpy structured array of shape p x 6
+	genotypes.variants # a numpy structured array of shape p x 5
 
 	genotypes.read_samples()
 	genotypes.samples  # a tuple of strings of length n
@@ -305,14 +305,14 @@ To write to a **.hap** file, you must first initialize a :class:`Haplotypes` obj
 
 Obtaining haplotype "genotypes"
 *******************************
-Using the ``transform()`` function, you can obtain a full instance of the :class:`GenotypesRefAlt` class where haplotypes from a :class:`Haplotypes` object are encoded as the variants in the genotype matrix.
+Using the ``transform()`` function, you can obtain a full instance of the :class:`GenotypesVCF` class where haplotypes from a :class:`Haplotypes` object are encoded as the variants in the genotype matrix.
 
 .. code-block:: python
 
 	haplotypes = data.Haplotypes.load('tests/data/example.hap.gz')
-	genotypes = data.GenotypesRefAlt.load('tests/data/example.vcf.gz')
+	genotypes = data.GenotypesVCF.load('tests/data/example.vcf.gz')
 	hap_gts = haplotypes.transform(genotypes)
-	hap_gts   # a GenotypesRefAlt instance where haplotypes are variants
+	hap_gts   # a GenotypesVCF instance where haplotypes are variants
 
 Haplotype
 +++++++++
diff --git a/docs/api/examples.rst b/docs/api/examples.rst
index c6ae9dcf..764c739f 100644
--- a/docs/api/examples.rst
+++ b/docs/api/examples.rst
@@ -24,7 +24,7 @@ As an example, let's say we would like to convert the following ``.blocks.det``
 
     # load the genotypes file
     # you can use either a VCF or PGEN file
-    gt = data.GenotypesRefAlt.load("input.vcf.gz")
+    gt = data.GenotypesVCF.load("input.vcf.gz")
     gt = data.GenotypesPLINK.load("input.pgen")
 
     # load the haplotypes
@@ -46,9 +46,9 @@ As an example, let's say we would like to convert the following ``.blocks.det``
 
             # create variant lines for each haplotype
             # Note that the .blocks.det file doesn't specify an allele, so
-            # we simply choose the REF allele for this example
+            # we simply choose the first allele (ie the REF allele) for this example
             hp.data[hap_id].variants = tuple(
-                data.Variant(start=v["pos"], end=v["pos"]+len(v["ref"]), id=v["id"], allele=v["ref"])
+                data.Variant(start=v["pos"], end=v["pos"]+len(v["alleles"][0]), id=v["id"], allele=v["alleles"][0])
                 for v in snp_gts.variants
             )
 
@@ -84,7 +84,7 @@ You can easily use the :ref:`data API <api-data>` and the :ref:`simphenotype API
 
     # load the genotypes file
     # you can use either a VCF or PGEN file
-    gt = data.GenotypesRefAlt("tests/data/apoe.vcf.gz")
+    gt = data.GenotypesVCF("tests/data/apoe.vcf.gz")
     gt.read(variants=variants)
     # the advantage of using a PGEN file is that you can use read_variants() to load
     # the variants quickly w/o having to load the genotypes, too
@@ -96,14 +96,14 @@ You can easily use the :ref:`data API <api-data>` and the :ref:`simphenotype API
     hp.data = {}
 
     for variant in gt.variants:
-        ID, chrom, pos, alt = variant[["id", "chrom", "pos", "alt"]]
-        end = pos + len(alt)
+        ID, chrom, pos, alleles = variant[["id", "chrom", "pos", "alleles"]]
+        end = pos + len(alleles[1])
 
         # create a haplotype line in the .hap file
         # you should fill out "beta" with your own value
         hp.data[ID] = Haplotype(chrom=chrom, start=pos, end=end, id=ID, beta=0.5)
 
         # create variant lines for each haplotype
-        hp.data[ID].variants = (data.Variant(start=pos, end=end, id=ID, allele=alt),)
+        hp.data[ID].variants = (data.Variant(start=pos, end=end, id=ID, allele=alleles[1]),)
 
     hp.write()
diff --git a/haptools/data/__init__.py b/haptools/data/__init__.py
index dbd3fca5..51f0e019 100644
--- a/haptools/data/__init__.py
+++ b/haptools/data/__init__.py
@@ -3,4 +3,4 @@
 from .covariates import Covariates
 from .breakpoints import Breakpoints, HapBlock
 from .haplotypes import Extra, Variant, Haplotype, Haplotypes
-from .genotypes import Genotypes, GenotypesRefAlt, GenotypesPLINK
+from .genotypes import Genotypes, GenotypesVCF, GenotypesPLINK
diff --git a/haptools/data/genotypes.py b/haptools/data/genotypes.py
index 57b01747..6bd11329 100644
--- a/haptools/data/genotypes.py
+++ b/haptools/data/genotypes.py
@@ -491,7 +491,10 @@ def check_phase(self):
             self.log.warning("Phase information has already been removed from the data")
             return
         # check: are there any variants that are heterozygous and unphased?
-        unphased = (self.data[:, :, 0] ^ self.data[:, :, 1]) & (~self.data[:, :, 2])
+        data = self.data
+        if data.dtype != np.bool_:
+            data = self.data.astype(np.bool_)
+        unphased = (data[:, :, 0] ^ data[:, :, 1]) & (~data[:, :, 2])
         if np.any(unphased):
             samp_idx, variant_idx = np.nonzero(unphased)
             raise ValueError(
@@ -591,11 +594,11 @@ def check_sorted(self):
                 )
 
 
-class GenotypesRefAlt(Genotypes):
+class GenotypesVCF(Genotypes):
     """
     A class for processing genotypes from a file
-    Unlike the base Genotypes class, this class also includes REF and ALT alleles in
-    the variants array
+    Unlike the base Genotypes class, this class also includes REF and ALT alleles as
+    a list of alleles in the variants array
 
     Attributes
     ----------
@@ -610,24 +613,15 @@ class GenotypesRefAlt(Genotypes):
             1. ID
             2. CHROM
             3. POS
-            4. REF
-            5. ALT
+            4. [REF, ALT1, ALT2, ...]
     log: Logger
         See documentation for :py:attr:`~.Genotypes.log`
     """
 
     def __init__(self, fname: Path | str, log: Logger = None):
         super().__init__(fname, log)
-        self.variants = np.array(
-            [],
-            dtype=[
-                ("id", "U50"),
-                ("chrom", "U10"),
-                ("pos", np.uint32),
-                ("ref", "U100"),
-                ("alt", "U100"),
-            ],
-        )
+        dtype = {k: v[0] for k, v in self.variants.dtype.fields.items()}
+        self.variants = np.array([], dtype=list(dtype.items()) + [("alleles", object)])
 
     def _variant_arr(self, record: Variant):
         """
@@ -638,15 +632,14 @@ def _variant_arr(self, record: Variant):
                 record.ID,
                 record.CHROM,
                 record.POS,
-                record.REF,
-                record.ALT[0],
+                (record.REF, *record.ALT),
             ),
             dtype=self.variants.dtype,
         )
 
     def write(self):
         """
-        Write the variants in this class to a VCF at :py:attr:`~.GenotypesRefAlt.fname`
+        Write the variants in this class to a VCF at :py:attr:`~.GenotypesVCF.fname`
         """
         vcf = VariantFile(str(self.fname), mode="w")
         # make sure the header is properly structured
@@ -671,13 +664,15 @@ def write(self):
             for sample in self.samples:
                 vcf.header.add_sample(sample)
         self.log.info("Writing VCF records")
+        phased = self._prephased or (self.data.shape[2] < 3)
+        missing_val = np.iinfo(np.uint8).max
         for var_idx, var in enumerate(self.variants):
             rec = {
                 "contig": var["chrom"],
                 "start": var["pos"],
-                "stop": var["pos"] + len(var["ref"]) - 1,
+                "stop": var["pos"] + len(var["alleles"][0]) - 1,
                 "qual": None,
-                "alleles": tuple(var[["ref", "alt"]]),
+                "alleles": var["alleles"],
                 "id": var["id"],
                 "filter": None,
             }
@@ -686,29 +681,34 @@ def write(self):
             # parse the record into a pysam.VariantRecord
             record = vcf.new_record(**rec)
             for samp_idx, sample in enumerate(self.samples):
-                # TODO: make this work when there are missing values
-                record.samples[sample]["GT"] = tuple(self.data[samp_idx, var_idx, :2])
-                # TODO: add proper phase info
-                record.samples[sample].phased = True
+                record.samples[sample]["GT"] = tuple(
+                    None if val == missing_val else val
+                    for val in self.data[samp_idx, var_idx, :2]
+                )
+                # add proper phasing info
+                if phased:
+                    record.samples[sample].phased = True
+                else:
+                    record.samples[sample].phased = self.data[samp_idx, var_idx, 2]
             # write the record to a file
             vcf.write(record)
         vcf.close()
 
 
-class GenotypesPLINK(GenotypesRefAlt):
+class GenotypesPLINK(GenotypesVCF):
     """
     A class for processing genotypes from a PLINK ``.pgen`` file
 
     Attributes
     ----------
     data : np.array
-        See documentation for :py:attr:`~.GenotypesRefAlt.data`
+        See documentation for :py:attr:`~.GenotypesVCF.data`
     samples : tuple
-        See documentation for :py:attr:`~.GenotypesRefAlt.data`
+        See documentation for :py:attr:`~.GenotypesVCF.data`
     variants : np.array
-        See documentation for :py:attr:`~.GenotypesRefAlt.data`
+        See documentation for :py:attr:`~.GenotypesVCF.data`
     log: Logger
-        See documentation for :py:attr:`~.GenotypesRefAlt.data`
+        See documentation for :py:attr:`~.GenotypesVCF.data`
     chunk_size: int, optional
         The max number of variants to fetch from and write to the PGEN file at any
         given time
@@ -716,7 +716,7 @@ class GenotypesPLINK(GenotypesRefAlt):
         If this value is provided, variants from the PGEN file will be loaded in
         chunks so as to use less memory
     _prephased: bool
-        See documentation for :py:attr:`~.GenotypesRefAlt.data`
+        See documentation for :py:attr:`~.GenotypesVCF.data`
 
     Examples
     --------
@@ -746,7 +746,7 @@ def read_samples(self, samples: list[str] = None):
         Parameters
         ----------
         samples : list[str], optional
-            See documentation for :py:attr:`~.GenotypesRefAlt.read`
+            See documentation for :py:attr:`~.GenotypesVCF.read`
 
         Returns
         -------
@@ -839,14 +839,12 @@ def _variant_arr(
         npt.NDArray
             A row from the :py:attr:`~.GenotypesPLINK.variants` array
         """
-        # TODO: remove the AAF column; right now, we just set it to 0.5 arbitrarily
         return np.array(
             (
                 record[cid["ID"]],
                 record[cid["CHROM"]],
                 record[cid["POS"]],
-                record[cid["REF"]],
-                record[cid["ALT"]],
+                (record[cid["REF"]], record[cid["ALT"]]),
             ),
             dtype=self.variants.dtype,
         )
@@ -865,9 +863,9 @@ def _iterate_variants(
         Parameters
         ----------
         region : str, optional
-            See documentation for :py:attr:`~.GenotypesRefAlt.read`
+            See documentation for :py:attr:`~.GenotypesVCF.read`
         variants : set[str], optional
-            See documentation for :py:attr:`~.GenotypesRefAlt.read`
+            See documentation for :py:attr:`~.GenotypesVCF.read`
 
         Yields
         ------
@@ -940,11 +938,11 @@ def read_variants(
         Parameters
         ----------
         region : str, optional
-            See documentation for :py:attr:`~.GenotypesRefAlt.read`
+            See documentation for :py:attr:`~.GenotypesVCF.read`
         variants : set[str], optional
-            See documentation for :py:attr:`~.GenotypesRefAlt.read`
+            See documentation for :py:attr:`~.GenotypesVCF.read`
         max_variants : int, optional
-            See documentation for :py:attr:`~.GenotypesRefAlt.read`
+            See documentation for :py:attr:`~.GenotypesVCF.read`
 
         Returns
         -------
@@ -997,13 +995,13 @@ def read(
         Parameters
         ----------
         region : str, optional
-            See documentation for :py:attr:`~.GenotypesRefAlt.read`
+            See documentation for :py:attr:`~.GenotypesVCF.read`
         samples : list[str], optional
-            See documentation for :py:attr:`~.GenotypesRefAlt.read`
+            See documentation for :py:attr:`~.GenotypesVCF.read`
         variants : set[str], optional
-            See documentation for :py:attr:`~.GenotypesRefAlt.read`
+            See documentation for :py:attr:`~.GenotypesVCF.read`
         max_variants : int, optional
-            See documentation for :py:attr:`~.GenotypesRefAlt.read`
+            See documentation for :py:attr:`~.GenotypesVCF.read`
         """
         super(Genotypes, self).read()
         import pgenlib
@@ -1199,9 +1197,9 @@ def write_variants(self):
                 rec = {
                     "contig": var["chrom"],
                     "start": var["pos"],
-                    "stop": var["pos"] + len(var["ref"]) - 1,
+                    "stop": var["pos"] + len(var["alleles"][0]) - 1,
                     "qual": None,
-                    "alleles": tuple(var[["ref", "alt"]]),
+                    "alleles": var["alleles"],
                     "id": var["id"],
                     "filter": None,
                 }
diff --git a/haptools/data/haplotypes.py b/haptools/data/haplotypes.py
index 2ce48fb9..30e66dcc 100644
--- a/haptools/data/haplotypes.py
+++ b/haptools/data/haplotypes.py
@@ -11,7 +11,7 @@
 from pysam import TabixFile
 
 from .data import Data
-from .genotypes import GenotypesRefAlt
+from .genotypes import GenotypesVCF
 
 
 @dataclass
@@ -453,7 +453,7 @@ def extras_order(cls) -> tuple[str]:
         """
         return tuple(extra.name for extra in cls._extras)
 
-    def transform(self, genotypes: GenotypesRefAlt) -> npt.NDArray[bool]:
+    def transform(self, genotypes: GenotypesVCF) -> npt.NDArray[bool]:
         """
         Transform a genotypes matrix via the current haplotype
 
@@ -462,7 +462,7 @@ def transform(self, genotypes: GenotypesRefAlt) -> npt.NDArray[bool]:
 
         Parameters
         ----------
-        genotypes : GenotypesRefAlt
+        genotypes : GenotypesVCF
             The genotypes which to transform using the current haplotype
 
             If the genotypes have not been loaded into the Genotypes object yet, this
@@ -490,7 +490,7 @@ def transform(self, genotypes: GenotypesRefAlt) -> npt.NDArray[bool]:
         allele_arr = np.array(
             [
                 [
-                    [int(var.allele != gts.variants[i]["ref"])]
+                    [int(var.allele != gts.variants[i]["alleles"][0])]
                     for i, var in enumerate(self.variants)
                 ]
             ]
@@ -1053,9 +1053,9 @@ def write(self):
 
     def transform(
         self,
-        gts: GenotypesRefAlt,
-        hap_gts: GenotypesRefAlt = None,
-    ) -> GenotypesRefAlt:
+        gts: GenotypesVCF,
+        hap_gts: GenotypesVCF = None,
+    ) -> GenotypesVCF:
         """
         Transform a genotypes matrix via the current haplotype
 
@@ -1064,23 +1064,23 @@ def transform(
 
         Parameters
         ----------
-        gts : GenotypesRefAlt
+        gts : GenotypesVCF
             The genotypes which to transform using the current haplotype
-        hap_gts: GenotypesRefAlt
-            An empty GenotypesRefAlt object into which the haplotype genotypes should
+        hap_gts: GenotypesVCF
+            An empty GenotypesVCF object into which the haplotype genotypes should
             be stored
 
         Returns
         -------
-        GenotypesRefAlt
+        GenotypesVCF
             A Genotypes object composed of haplotypes instead of regular variants.
         """
-        # Initialize GenotypesRefAlt return value
+        # Initialize GenotypesVCF return value
         if hap_gts is None:
-            hap_gts = GenotypesRefAlt(fname=None, log=self.log)
+            hap_gts = GenotypesVCF(fname=None, log=self.log)
         hap_gts.samples = gts.samples
         hap_gts.variants = np.array(
-            [(hap.id, hap.chrom, hap.start, "A", "T") for hap in self.data.values()],
+            [(hap.id, hap.chrom, hap.start, ("A", "T")) for hap in self.data.values()],
             dtype=hap_gts.variants.dtype,
         )
         # build a fast data structure for querying the alleles in each haplotype:
@@ -1104,7 +1104,7 @@ def transform(
         # with shape (1, gts.data.shape[1], 1) for broadcasting later
         allele_arr = np.array(
             [
-                int(allele != gts.variants[i]["ref"])
+                int(allele != gts.variants[i]["alleles"][0])
                 for i, (vID, allele) in enumerate(alleles)
             ],
             dtype=gts.data.dtype,
diff --git a/haptools/ld.py b/haptools/ld.py
index 1c4f1896..6e3b33c0 100644
--- a/haptools/ld.py
+++ b/haptools/ld.py
@@ -152,7 +152,7 @@ def calc_ld(
         gt = data.GenotypesPLINK(fname=genotypes, log=log, chunk_size=chunk_size)
     else:
         log.info("Loading genotypes from VCF/BCF file")
-        gt = data.GenotypesRefAlt(fname=genotypes, log=log)
+        gt = data.GenotypesVCF(fname=genotypes, log=log)
     # gt._prephased = True
     gt.read(region=region, samples=samples, variants=variants)
     gt.check_missing(discard_also=discard_missing)
@@ -180,7 +180,7 @@ def calc_ld(
 
     if not from_gts:
         log.info("Transforming genotypes via haplotypes")
-        hp_gt = data.GenotypesRefAlt(fname=None, log=log)
+        hp_gt = data.GenotypesVCF(fname=None, log=log)
         hp.transform(gt, hp_gt)
 
     log.info("Obtaining target genotypes")
diff --git a/haptools/sim_genotype.py b/haptools/sim_genotype.py
index 6405aac2..51f83a93 100644
--- a/haptools/sim_genotype.py
+++ b/haptools/sim_genotype.py
@@ -9,7 +9,7 @@
 from pysam import VariantFile
 from collections import defaultdict
 from .admix_storage import GeneticMarker, HaplotypeSegment
-from .data import GenotypesRefAlt, GenotypesPLINK
+from .data import GenotypesVCF, GenotypesPLINK
 from .transform import GenotypesAncestry
 
 
@@ -72,7 +72,7 @@ def output_vcf(
         Outputs messages to the appropriate channel.
     """
 
-    log.info(f"Outputting VCF file {out}.vcf.gz")
+    log.info(f"Outputting file {out}")
 
     # details to know
     # vcf file: how to handle samples and which sample is which haplotype block randomly choose out of current population types
@@ -104,13 +104,17 @@ def output_vcf(
     if variant_file.endswith(".pgen"):
         vcf = GenotypesPLINK(variant_file, log=log)
     else:
-        vcf = GenotypesRefAlt(variant_file, log=log)
+        vcf = GenotypesVCF(variant_file, log=log)
     
     if not region:
         vcf.read()
     else:
         vcf.read(region=f"{region['chr']}:{region['start']}-{region['end']}")
 
+    if variant_file.endswith(".pgen"):
+        # the pgenlib library collapses multiallelic variants into a single allele
+        vcf.check_biallelic()
+
     log.debug(f"Read in variants from {variant_file}")
 
     sample_dict = {}
@@ -212,7 +216,7 @@ def output_vcf(
                 gts.valid_labels = output_labels
 
         if not pop_field and not sample_field:
-            gts = GenotypesRefAlt(out, log=log)
+            gts = GenotypesVCF(out, log=log)
 
     else:
         gts = GenotypesPLINK(out, log=log)
@@ -915,8 +919,8 @@ def validate_params(model, mapdir, chroms, popsize, invcf, sample_info, no_repla
         else:
             total_per_pop[info_pop] += 1
 
-        if sample not in vcf_samples:
-            raise Exception(f"Sample {sample} in sampleinfo file is not present in the vcf file.")
+        if sample not in vcf_samples and info_pop in pops:
+            raise Exception(f"Sample {sample} from population {info_pop} in sampleinfo file is not present in the vcf file.")
     
     # Ensure that all populations from the model file are listed in the sample info file 
     #    and that there is sufficient enough samples when no_replacement is specified
diff --git a/haptools/sim_phenotype.py b/haptools/sim_phenotype.py
index bcb1277d..34f336df 100644
--- a/haptools/sim_phenotype.py
+++ b/haptools/sim_phenotype.py
@@ -14,7 +14,6 @@
     Phenotypes,
     Haplotypes,
     GenotypesPLINK,
-    GenotypesRefAlt,
 )
 
 
diff --git a/haptools/transform.py b/haptools/transform.py
index 482d9ffd..b25b95ad 100644
--- a/haptools/transform.py
+++ b/haptools/transform.py
@@ -28,7 +28,7 @@ class HaplotypeAncestry(data.Haplotype):
         default=(data.Extra("ancestry", "s", "Local ancestry"),),
     )
 
-    def transform(self, genotypes: data.GenotypesRefAlt) -> npt.NDArray[bool]:
+    def transform(self, genotypes: data.GenotypesVCF) -> npt.NDArray[bool]:
         """
         Transform a genotypes matrix via the current haplotype and its ancestral
         population
@@ -51,7 +51,7 @@ def transform(self, genotypes: data.GenotypesRefAlt) -> npt.NDArray[bool]:
         allele_arr = np.array(
             [
                 [
-                    [int(var.allele != gts.variants[i]["ref"])]
+                    [int(var.allele != gts.variants[i]["alleles"][0])]
                     for i, var in enumerate(self.variants)
                 ]
             ]
@@ -91,14 +91,14 @@ def __init__(
     def transform(
         self,
         gts: data.GenotypesAncestry,
-        hap_gts: data.GenotypesRefAlt = None,
-    ) -> data.GenotypesRefAlt:
-        # Initialize GenotypesRefAlt return value
+        hap_gts: data.GenotypesVCF = None,
+    ) -> data.GenotypesVCF:
+        # Initialize GenotypesVCF return value
         if hap_gts is None:
-            hap_gts = data.GenotypesRefAlt(fname=None, log=self.log)
+            hap_gts = data.GenotypesVCF(fname=None, log=self.log)
         hap_gts.samples = gts.samples
         hap_gts.variants = np.array(
-            [(hap.id, hap.chrom, hap.start, "A", "T") for hap in self.data.values()],
+            [(hap.id, hap.chrom, hap.start, ("A", "T")) for hap in self.data.values()],
             dtype=hap_gts.variants.dtype,
         )
         # build a fast data structure for querying the alleles in each haplotype:
@@ -125,7 +125,7 @@ def transform(
         # with shape (1, gts.data.shape[1], 1) for broadcasting later
         allele_arr = np.array(
             [
-                int(allele != gts.variants[i]["ref"])
+                int(allele != gts.variants[i]["alleles"][0])
                 for i, (vID, allele) in enumerate(alleles)
             ],
             dtype=gts.data.dtype,
@@ -147,9 +147,9 @@ def transform(
         return hap_gts
 
 
-class GenotypesAncestry(data.GenotypesRefAlt):
+class GenotypesAncestry(data.GenotypesVCF):
     """
-    Extends the GenotypesRefAlt class for ancestry data
+    Extends the GenotypesVCF class for ancestry data
 
     The ancestry information is stored within the FORMAT field of the VCF
 
@@ -162,7 +162,7 @@ class GenotypesAncestry(data.GenotypesRefAlt):
     samples : tuple[str]
         See documentation for :py:attr:`~.Genotypes.samples`
     variants : np.array
-        See documentation for :py:attr:`~.GenotypesRefAlt.variants`
+        See documentation for :py:attr:`~.GenotypesVCF.variants`
     valid_labels: np.array
         Reference VCF sample and respective variant grabbed for
         each sample.
@@ -479,13 +479,14 @@ def write(self, chroms=None):
             for sample in self.samples:
                 vcf.header.add_sample(sample)
         self.log.info("Writing VCF records")
+        phased = self._prephased or (self.data.shape[2] < 3)
         for var_idx, var in enumerate(self.variants):
             rec = {
                 "contig": var["chrom"],
                 "start": var["pos"],
-                "stop": var["pos"] + len(var["ref"]) - 1,
+                "stop": var["pos"] + len(var["alleles"][0]) - 1,
                 "qual": None,
-                "alleles": tuple(var[["ref", "alt"]]),
+                "alleles": var["alleles"],
                 "id": var["id"],
                 "filter": None,
             }
@@ -507,8 +508,11 @@ def write(self, chroms=None):
                     record.samples[sample]["SAMPLE"] = tuple(
                         self.valid_labels[samp_idx, var_idx, :]
                     )
-                # TODO: add proper phase info
-                record.samples[sample].phased = True
+                # add proper phasing info
+                if phased:
+                    record.samples[sample].phased = True
+                else:
+                    record.samples[sample].phased = self.data[samp_idx, var_idx, 2]
             # write the record to a file
             vcf.write(record)
         vcf.close()
@@ -597,7 +601,7 @@ def transform_haps(
         if ancestry and not bps_file.exists():
             gt = GenotypesAncestry(fname=genotypes, log=log)
         else:
-            gt = data.GenotypesRefAlt(fname=genotypes, log=log)
+            gt = data.GenotypesVCF(fname=genotypes, log=log)
     # gt._prephased = True
     gt.read(region=region, samples=samples, variants=variants)
     gt.check_missing(discard_also=discard_missing)
@@ -631,7 +635,7 @@ def transform_haps(
         bps = data.Breakpoints(fname=bps_file, log=log)
         bps.read(samples=set(gt.samples))
         bps.encode()
-        # convert the GenotypesRefAlt object to a GenotypesAncestry object
+        # convert the GenotypesVCF object to a GenotypesAncestry object
         # TODO: figure out a better solution for this
         # this is just a temp hack to get output from simgenotype to load a bit faster
         gta = GenotypesAncestry(fname=None, log=log)
@@ -647,7 +651,7 @@ def transform_haps(
         hp_gt = data.GenotypesPLINK(fname=output, log=log, chunk_size=chunk_size)
     else:
         out_file_type = "VCF/BCF"
-        hp_gt = data.GenotypesRefAlt(fname=output, log=log)
+        hp_gt = data.GenotypesVCF(fname=output, log=log)
     log.info("Transforming genotypes via haplotypes")
     hp.transform(gt, hp_gt)
 
diff --git a/tests/bench_genotypes.py b/tests/bench_genotypes.py
index 1258c620..734e1d79 100755
--- a/tests/bench_genotypes.py
+++ b/tests/bench_genotypes.py
@@ -10,7 +10,7 @@
 import numpy as np
 import matplotlib.pyplot as plt
 
-from haptools.data import GenotypesRefAlt, GenotypesPLINK
+from haptools.data import GenotypesVCF, GenotypesPLINK
 
 
 # COMMAND FOR GENERATING UKB PLOT:
@@ -32,7 +32,7 @@ def create_variant_files(gts, intervals, num_samps):
         sub.fname = variant_dir / f"{val}.pgen"
         sub.write()
         # write VCF files
-        vcf = GenotypesRefAlt(sub.fname.with_suffix(".vcf"))
+        vcf = GenotypesVCF(sub.fname.with_suffix(".vcf"))
         vcf.data = sub.data
         vcf.samples = sub.samples
         vcf.variants = sub.variants
@@ -51,7 +51,7 @@ def create_sample_files(gts, intervals, num_vars):
         sub.fname = sample_dir / f"{val}.pgen"
         sub.write()
         # write VCF files
-        vcf = GenotypesRefAlt(sub.fname.with_suffix(".vcf"))
+        vcf = GenotypesVCF(sub.fname.with_suffix(".vcf"))
         vcf.data = sub.data
         vcf.samples = sub.samples
         vcf.variants = sub.variants
@@ -60,7 +60,7 @@ def create_sample_files(gts, intervals, num_vars):
 
 
 def time_vcf(vcf, max_variants):
-    GenotypesRefAlt(vcf).read(max_variants=max_variants)
+    GenotypesVCF(vcf).read(max_variants=max_variants)
 
 
 def time_plink(pgen, max_variants):
diff --git a/tests/bench_transform.py b/tests/bench_transform.py
index 16b1d98b..2aaa7753 100755
--- a/tests/bench_transform.py
+++ b/tests/bench_transform.py
@@ -11,7 +11,7 @@
 import numpy as np
 import matplotlib.pyplot as plt
 
-from haptools.data import GenotypesRefAlt, Haplotypes, Haplotype, Variant
+from haptools.data import GenotypesVCF, Haplotypes, Haplotype, Variant
 
 
 # ---------------USAGE----------------
@@ -22,7 +22,7 @@
 
 
 def create_genotypes(log, samples, variants, with_phase=False):
-    gts = GenotypesRefAlt(fname=None, log=log)
+    gts = GenotypesVCF(fname=None, log=log)
     shape = (samples, variants, 2 + with_phase)
     # create a GT matrix with shape: samples x SNPs x (strands+phase)
     gts.data = np.random.choice([0, 1], size=np.prod(shape))
diff --git a/tests/data/dat_files/correct_model.dat b/tests/data/dat_files/correct_model.dat
index 55668724..572d1dda 100644
--- a/tests/data/dat_files/correct_model.dat
+++ b/tests/data/dat_files/correct_model.dat
@@ -1,2 +1,2 @@
-40  Admix   Pop1    Pop2    Pop3
+40  Admix   CEU    YRI    Pop3
 10   0  0   0.95    .05
diff --git a/tests/data/simple-multiallelic.vcf b/tests/data/simple-multiallelic.vcf
new file mode 100644
index 00000000..25821b09
--- /dev/null
+++ b/tests/data/simple-multiallelic.vcf
@@ -0,0 +1,9 @@
+##fileformat=VCFv4.2
+##filedate=20180225
+##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
+##contig=<ID=1>
+#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	HG00096	HG00097	HG00099	HG00100	HG00101
+1	10114	1:10114:T:C	T	C	.	.	.	GT	0|0	0|0	0|0	0|0	0|0
+1	10116	1:10116:A:G	A	G,T	.	.	.	GT	0|1	0|1	2|1	1|1	0|2
+1	10117	1:10117:C:A	C	A	.	.	.	GT	0|0	0|0	0|0	0|0	0|0
+1	10122	1:10122:A:G	A	G,C	.	.	.	GT	0|0	0|0	0|0	0|0	0|0
diff --git a/tests/test_data.py b/tests/test_data.py
index 03059d73..7f882b16 100644
--- a/tests/test_data.py
+++ b/tests/test_data.py
@@ -17,7 +17,7 @@
     Covariates,
     Haplotypes,
     Breakpoints,
-    GenotypesRefAlt,
+    GenotypesVCF,
     GenotypesPLINK,
 )
 
@@ -148,14 +148,15 @@ def test_load_genotypes_example(self):
         assert gts.samples[:25] == samples
 
     def test_load_genotypes_iterate(self, caplog):
-        expected = self._get_expected_genotypes().transpose((1, 0, 2))
-        samples = ("HG00096", "HG00097", "HG00099", "HG00100", "HG00101")
+        expected = self._get_fake_genotypes()
 
         # can we load the data from the VCF?
         gts = Genotypes(DATADIR.joinpath("simple.vcf"))
         for idx, line in enumerate(gts):
-            np.testing.assert_allclose(line.data, expected[idx])
-        assert gts.samples == samples
+            np.testing.assert_allclose(line.data[:, :2], expected.data[:, idx])
+            for col in ("chrom", "pos", "id"):
+                assert line.variants[col] == expected.variants[col][idx]
+        assert gts.samples == expected.samples
 
     def test_load_genotypes_discard_multiallelic(self):
         gts = self._get_fake_genotypes()
@@ -281,7 +282,7 @@ def test_check_sorted(self, caplog):
 class TestGenotypesPLINK:
     def _get_fake_genotypes_plink(self):
         pgenlib = pytest.importorskip("pgenlib")
-        gts_ref_alt = TestGenotypesRefAlt()._get_fake_genotypes_refalt()
+        gts_ref_alt = TestGenotypesVCF()._get_fake_genotypes_refalt()
         gts = GenotypesPLINK(gts_ref_alt.fname)
         gts.data = gts_ref_alt.data
         gts.samples = gts_ref_alt.samples
@@ -299,7 +300,7 @@ def test_load_genotypes(self):
         np.testing.assert_allclose(gts.data, expected.data)
         assert gts.samples == expected.samples
         for i, x in enumerate(expected.variants):
-            for col in ("chrom", "pos", "id", "ref", "alt"):
+            for col in ("chrom", "pos", "id", "alleles"):
                 assert gts.variants[col][i] == expected.variants[col][i]
 
     def test_load_genotypes_chunked(self):
@@ -313,7 +314,7 @@ def test_load_genotypes_chunked(self):
         np.testing.assert_allclose(gts.data, expected.data)
         assert gts.samples == expected.samples
         for i, x in enumerate(expected.variants):
-            for col in ("chrom", "pos", "id", "ref", "alt"):
+            for col in ("chrom", "pos", "id", "alleles"):
                 assert gts.variants[col][i] == expected.variants[col][i]
 
     def test_load_genotypes_prephased(self):
@@ -327,7 +328,7 @@ def test_load_genotypes_prephased(self):
         np.testing.assert_allclose(gts.data, expected.data)
         assert gts.samples == expected.samples
         for i, x in enumerate(expected.variants):
-            for col in ("chrom", "pos", "id", "ref", "alt"):
+            for col in ("chrom", "pos", "id", "alleles"):
                 assert gts.variants[col][i] == expected.variants[col][i]
 
     def test_load_genotypes_iterate(self):
@@ -338,8 +339,11 @@ def test_load_genotypes_iterate(self):
         # check that everything matches what we expected
         for idx, line in enumerate(gts):
             np.testing.assert_allclose(line.data[:, :2], expected.data[:, idx])
-            for col in ("chrom", "pos", "id", "ref", "alt"):
+            for col in ("chrom", "pos", "id"):
                 assert line.variants[col] == expected.variants[col][idx]
+            assert (
+                line.variants["alleles"].tolist() == expected.variants["alleles"][idx]
+            )
         assert gts.samples == expected.samples
 
     def test_load_genotypes_subset(self):
@@ -390,7 +394,7 @@ def test_write_genotypes_chunked(self):
         np.testing.assert_allclose(gts.data, new_gts.data)
         assert gts.samples == new_gts.samples
         for i in range(len(new_gts.variants)):
-            for col in ("chrom", "pos", "id", "ref", "alt"):
+            for col in ("chrom", "pos", "id", "alleles"):
                 assert gts.variants[col][i] == new_gts.variants[col][i]
 
         # clean up afterwards: delete the files we created
@@ -413,7 +417,7 @@ def test_write_genotypes(self):
         np.testing.assert_allclose(gts.data, new_gts.data)
         assert gts.samples == new_gts.samples
         for i in range(len(new_gts.variants)):
-            for col in ("chrom", "pos", "id", "ref", "alt"):
+            for col in ("chrom", "pos", "id", "alleles"):
                 assert gts.variants[col][i] == new_gts.variants[col][i]
 
         # clean up afterwards: delete the files we created
@@ -437,7 +441,7 @@ def test_write_genotypes_prephased(self):
         np.testing.assert_allclose(gts.data, new_gts.data)
         assert gts.samples == new_gts.samples
         for i in range(len(new_gts.variants)):
-            for col in ("chrom", "pos", "id", "ref", "alt"):
+            for col in ("chrom", "pos", "id", "alleles"):
                 assert gts.variants[col][i] == new_gts.variants[col][i]
 
         # clean up afterwards: delete the files we created
@@ -462,7 +466,7 @@ def test_write_genotypes_unphased(self):
         np.testing.assert_allclose(gts.data, new_gts.data)
         assert gts.samples == new_gts.samples
         for i in range(len(new_gts.variants)):
-            for col in ("chrom", "pos", "id", "ref", "alt"):
+            for col in ("chrom", "pos", "id", "alleles"):
                 assert gts.variants[col][i] == new_gts.variants[col][i]
 
         # clean up afterwards: delete the files we created
@@ -897,7 +901,7 @@ def test_hap_transform(self):
         )
 
         hap = list(self._get_dummy_haps().data.values())[0]
-        gens = TestGenotypesRefAlt()._get_fake_genotypes_refalt()
+        gens = TestGenotypesVCF()._get_fake_genotypes_refalt()
         hap_gt = hap.transform(gens)
         np.testing.assert_allclose(hap_gt, expected)
 
@@ -914,10 +918,10 @@ def test_haps_transform(self, return_also=False):
         )
 
         haps = self._get_dummy_haps()
-        gens = TestGenotypesRefAlt()._get_fake_genotypes_refalt()
+        gens = TestGenotypesVCF()._get_fake_genotypes_refalt()
         gens.data[[2, 4], 0, 1] = 1
         gens.data[[1, 4], 2, 0] = 1
-        hap_gt = GenotypesRefAlt(fname=None)
+        hap_gt = GenotypesVCF(fname=None)
         haps.transform(gens, hap_gt)
         np.testing.assert_allclose(hap_gt.data, expected)
 
@@ -998,11 +1002,11 @@ def test_sort(self):
         test_hap1.fname.unlink()
 
 
-class TestGenotypesRefAlt:
+class TestGenotypesVCF:
     def _get_fake_genotypes_refalt(self, with_phase=False):
         base_gts = TestGenotypes()._get_fake_genotypes()
         # copy all of the fields
-        gts = GenotypesRefAlt(fname=None)
+        gts = GenotypesVCF(fname=None)
         gts.data = base_gts.data
         if with_phase:
             data_shape = (gts.data.shape[0], gts.data.shape[1], 1)
@@ -1011,42 +1015,45 @@ def _get_fake_genotypes_refalt(self, with_phase=False):
                 (gts.data, np.ones(data_shape, dtype=gts.data.dtype)), axis=2
             )
         gts.samples = base_gts.samples
-        # add additional ref and alt alleles
-        ref_alt = np.array(
+        base_dtype = {k: v[0] for k, v in base_gts.variants.dtype.fields.items()}
+        ref_alt = [
+            ("T", "C"),
+            ("A", "G"),
+            ("C", "A"),
+            ("A", "G"),
+        ]
+        gts.variants = np.array(
+            [tuple(rec) + (ref_alt[idx],) for idx, rec in enumerate(base_gts.variants)],
+            dtype=(list(base_dtype.items()) + [("alleles", object)]),
+        )
+        return gts
+
+    def _get_fake_genotypes_multiallelic(self, with_phase=False):
+        gts = self._get_fake_genotypes_refalt(with_phase=with_phase)
+        # replace the necessary properties
+        gts.variants["alleles"] = np.array(
             [
                 ("T", "C"),
-                ("A", "G"),
+                ("A", "G", "T"),
                 ("C", "A"),
-                ("A", "G"),
-            ],
-            dtype=[
-                ("ref", "U100"),
-                ("alt", "U100"),
+                ("A", "G", "C"),
             ],
+            dtype=gts.variants["alleles"].dtype,
         )
-        # see https://stackoverflow.com/a/5356137
-        gts.variants = rfn.merge_arrays((base_gts.variants, ref_alt), flatten=True)
+        gts.data[[2, 4], 1, [0, 1]] = 2
         return gts
 
     def test_read_ref_alt(self):
         # simple.vcf
-        gts_ref_alt_read = GenotypesRefAlt(DATADIR.joinpath("simple.vcf"))
-        gts_ref_alt_read.read()
-        expected = np.array(
-            [
-                ("T", "C"),
-                ("A", "G"),
-                ("C", "A"),
-                ("A", "G"),
-            ],
-            dtype=gts_ref_alt_read.variants[["ref", "alt"]].dtype,
-        )
-        for i, x in enumerate(expected):
-            assert gts_ref_alt_read.variants[["ref", "alt"]][i] == x
+        expected = self._get_fake_genotypes_refalt()
+        gts = GenotypesVCF(DATADIR.joinpath("simple.vcf"))
+        gts.read()
+        for i, x in enumerate(expected.variants["alleles"]):
+            assert gts.variants["alleles"][i] == x
 
         # example.vcf.gz
-        gts_ref_alt = GenotypesRefAlt(DATADIR.joinpath("example.vcf.gz"))
-        gts_ref_alt.read()
+        gts = GenotypesVCF(DATADIR.joinpath("example.vcf.gz"))
+        gts.read()
         expected = np.array(
             [
                 ("C", "A"),
@@ -1065,37 +1072,121 @@ def test_read_ref_alt(self):
                 ("T", "C"),
                 ("C", "T"),
             ],
-            dtype=gts_ref_alt.variants[["ref", "alt"]].dtype,
+            dtype=gts.variants["alleles"].dtype,
         )
         for i, x in enumerate(expected):
-            assert gts_ref_alt.variants[["ref", "alt"]][i] == x
+            assert gts.variants["alleles"][i] == tuple(x.tolist())
+
+    def test_read_multiallelic(self):
+        # simple-multiallelic.vcf
+        expected = self._get_fake_genotypes_multiallelic(with_phase=True)
+
+        gts = GenotypesVCF(DATADIR.joinpath("simple-multiallelic.vcf"))
+        gts.read()
+        for i, x in enumerate(expected.variants["alleles"]):
+            assert gts.variants["alleles"][i] == x
+        np.testing.assert_allclose(gts.data, expected.data)
 
-    def test_write_ref_alt(self):
+    def test_write_ref_alt(self, multiallelic=False):
         # strategy is to read in the file, write it, and then read again
-        # read genotypes
-        gts_ref_alt_write = GenotypesRefAlt(DATADIR.joinpath("simple.vcf"))
-        gts_ref_alt_write.read()
-        gts_ref_alt_write.check_phase()
+        if multiallelic:
+            expected = self._get_fake_genotypes_multiallelic()
+            # read genotypes
+            gts = GenotypesVCF(DATADIR.joinpath("simple-multiallelic.vcf"))
+        else:
+            expected = self._get_fake_genotypes_refalt()
+            # read genotypes
+            gts = GenotypesVCF(DATADIR.joinpath("simple.vcf"))
+        gts.read()
+        gts.check_phase()
         # write file to new file
         fname = DATADIR.joinpath("test.vcf")
-        gts_ref_alt_write.fname = fname
-        gts_ref_alt_write.write()
+        gts.fname = fname
+        gts.write()
         # read again
-        gts_ref_alt_write.read()
-        gts_ref_alt_write.check_phase()
+        gts.read()
+        gts.check_phase()
         # compare samples, data, variants, and ref/alt for equality
-        expected = self._get_fake_genotypes_refalt()
-        assert gts_ref_alt_write.samples == expected.samples
-        np.testing.assert_allclose(gts_ref_alt_write.data, expected.data)
+        assert gts.samples == expected.samples
+        np.testing.assert_allclose(gts.data, expected.data)
 
         for i, x in enumerate(expected.variants):
             # dtype.names gives us names of columns in the array
             for col in expected.variants.dtype.names:
                 # each row is a variant
                 # index into col, i gets specific variant, x iterates thru
-                assert gts_ref_alt_write.variants[col][i] == x[col]
+                assert gts.variants[col][i] == x[col]
 
-        os.remove(str(fname))
+        fname.unlink()
+
+    def test_write_multiallelic(self):
+        self.test_write_ref_alt(multiallelic=True)
+
+    def test_write_phase(self, prephased=True):
+        gts = self._get_fake_genotypes_refalt()
+
+        fname = DATADIR.joinpath("test_write_phase.vcf")
+        gts.fname = fname
+        if prephased:
+            gts._prephased = True
+        else:
+            # add phasing information back
+            gts.data = np.dstack(
+                (gts.data, np.ones(gts.data.shape[:2], dtype=np.uint8))
+            )
+            gts.data[:2, 1, 2] = 0
+        gts.write()
+
+        new_gts = GenotypesVCF(fname)
+        if prephased:
+            new_gts._prephased = True
+        new_gts.read()
+
+        # check that everything matches what we expected
+        np.testing.assert_allclose(gts.data, new_gts.data)
+        assert gts.samples == new_gts.samples
+        for i in range(len(new_gts.variants)):
+            for col in ("chrom", "pos", "id", "alleles"):
+                assert gts.variants[col][i] == new_gts.variants[col][i]
+
+        fname.unlink()
+
+    def test_write_unphased(self):
+        self.test_write_phase(prephased=False)
+
+    def test_write_missing(self):
+        gts = self._get_fake_genotypes_refalt()
+        gts.fname = DATADIR.joinpath("test_write_missing.vcf")
+        vals = len(np.unique(gts.data))
+
+        gts.write()
+
+        new_gts = GenotypesVCF(gts.fname)
+        new_gts.read()
+
+        new_gts.check_missing()
+
+        # force two of the samples to have a missing GT
+        gts.data = gts.data.astype(np.int8)
+        gts.data[1, 1, 1] = -1
+        gts.data = gts.data.astype(np.uint8)
+
+        gts.write()
+
+        new_gts = GenotypesVCF(gts.fname)
+        new_gts.read()
+
+        assert len(np.unique(gts.data)) == (vals + 1)
+
+        with pytest.raises(ValueError) as info:
+            gts.check_missing()
+        assert (
+            str(info.value)
+            == "Genotype with ID 1:10116:A:G at POS 1:10116 is missing for sample"
+            " HG00097"
+        )
+
+        gts.fname.unlink()
 
 
 class TestBreakpoints:
@@ -1280,3 +1371,36 @@ def test_breakpoints_to_pop_array_chrom_no_match(self):
         with pytest.raises(ValueError) as info:
             pop_arr = expected.population_array(variants[[0, 1, 3]])
         assert str(info.value).startswith("Chromosome ")
+
+
+class TestDocExamples:
+    def test_gts2hap(self):
+        # which variants do we want to write to the haplotype file?
+        variants = {"rs429358", "rs7412"}
+
+        # load the genotypes file
+        # you can use either a VCF or PGEN file
+        gt = GenotypesVCF("tests/data/apoe.vcf.gz")
+        gt.read(variants=variants)
+
+        # initialize an empty haplotype file
+        hp = Haplotypes("output.hap", haplotype=Haplotype)
+        hp.data = {}
+
+        for variant in gt.variants:
+            ID, chrom, pos, alleles = variant[["id", "chrom", "pos", "alleles"]]
+            end = pos + len(alleles[1])
+
+            # create a haplotype line in the .hap file
+            # you should fill out "beta" with your own value
+            hp.data[ID] = HaptoolsHaplotype(
+                chrom=chrom, start=pos, end=end, id=ID, beta=0.5
+            )
+
+            # create variant lines for each haplotype
+            hp.data[ID].variants = (
+                Variant(start=pos, end=end, id=ID, allele=alleles[1]),
+            )
+
+        hp.write()
+        hp.fname.unlink()
diff --git a/tests/test_outputvcf.py b/tests/test_outputvcf.py
index 74ae0064..07e9739c 100644
--- a/tests/test_outputvcf.py
+++ b/tests/test_outputvcf.py
@@ -90,9 +90,9 @@ def _get_expected_output():
     gts.data = gts.data.transpose((1, 0, 2))
     gts.variants = np.array(
         [
-            ("1:10114:T:C", "1", 10114, "T", "C"),
-            ("1:59423090:A:G", "1", 59423090, "A", "G"),
-            ("2:10122:A:G", "2", 10122, "A", "G"),
+            ("1:10114:T:C", "1", 10114, ("T", "C")),
+            ("1:59423090:A:G", "1", 59423090, ("A", "G")),
+            ("2:10122:A:G", "2", 10122, ("A", "G")),
         ],
         dtype=gts.variants.dtype,
     )
@@ -759,7 +759,10 @@ def test_model_files():
         validate_params(
             model, mapdir, chroms, popsize, vcf_file, faulty_sampleinfo_file, False
         )
-    msg = "Sample HG00022 in sampleinfo file is not present in the vcf file."
+    msg = (
+        "Sample HG00022 from population CEU in sampleinfo file is not present in the"
+        " vcf file."
+    )
     assert (str(e.value)) == msg
 
     faulty_model = DATADIR.joinpath("dat_files/faulty_model_sample_info.dat")
diff --git a/tests/test_transform.py b/tests/test_transform.py
index 930c4cc3..7aeb6a25 100644
--- a/tests/test_transform.py
+++ b/tests/test_transform.py
@@ -12,8 +12,8 @@
     GenotypesAncestry,
 )
 from haptools.__main__ import main
-from .test_data import TestGenotypesRefAlt
-from haptools.data import Variant, GenotypesRefAlt, GenotypesPLINK
+from .test_data import TestGenotypesVCF
+from haptools.data import Variant, GenotypesVCF, GenotypesPLINK
 
 
 DATADIR = Path(__file__).parent.joinpath("data")
@@ -37,7 +37,7 @@ def _get_expected_ancestry(self):
         return expected
 
     def _get_fake_genotypes(self):
-        base_gts = TestGenotypesRefAlt()._get_fake_genotypes_refalt(with_phase=True)
+        base_gts = TestGenotypesVCF()._get_fake_genotypes_refalt(with_phase=True)
         # copy all of the fields
         gts = GenotypesAncestry(fname=None)
         gts.data = base_gts.data
@@ -131,7 +131,7 @@ def test_write_genotypes(self):
         assert False
 
     @pytest.mark.xfail(reason="not implemented yet")
-    def test_write_genotypes_unphased(self):
+    def test_write_genotypes_phase(self, prephased=True):
         assert False
 
 
@@ -199,7 +199,7 @@ def test_haps_transform(self):
         gens = TestGenotypesAncestry()._get_fake_genotypes()
         gens.check_phase()
 
-        hap_gt = GenotypesRefAlt(fname=None)
+        hap_gt = GenotypesVCF(fname=None)
         haps.transform(gens, hap_gt)
         np.testing.assert_allclose(hap_gt.data, expected)
 
@@ -208,7 +208,7 @@ def test_haps_transform(self):
         gens.data[[2, 4], 0, 1] = 1
         gens.data[[1, 4], 2, 0] = 1
 
-        hap_gt = GenotypesRefAlt(fname=None)
+        hap_gt = GenotypesVCF(fname=None)
         haps.transform(gens, hap_gt)
         np.testing.assert_allclose(hap_gt.data, expected)
 
@@ -242,7 +242,7 @@ def test_basic_subset(capfd):
 """
 
     # first, remove the last two variants in the genotypes file
-    gts = GenotypesRefAlt.load("tests/data/simple.vcf")
+    gts = GenotypesVCF.load("tests/data/simple.vcf")
     gts.fname = Path("simple_minus_two.vcf")
     gts.subset(variants=tuple(gts.variants["id"][:-2]), inplace=True)
     gts.write()