Collect known PRDM9 alleles and associated zinc fingers from publications

Literature searches revealed several publications that describe zinc finger (znf) DNA sequences, znf amino acid sequences, allele znf content, allele DNA sequences and/or their accession numbers, allele mutations, and/or allele structural variants (SVs):

Oliver 2009
- Znf DNA sequences
- Allele DNA sequence accession numbers
Thomas 2009
- Znf DNA sequences
Parvanov 2010
- Allele DNA sequence accession numbers
- Allele amino acid sequences
Baudat 2010
- Allele znf content
- Allele DNA sequence accession numbers
Berg 2010
- Znf DNA sequences
- Allele znf content
- Allele DNA sequence accession numbers
Kong 2010
- Allele znf content
Ponting 2011
- Allele znf content
Berg 2011
- Znf DNA sequences
- Allele znf content
Borel 2012
- Znf DNA sequences
- Allele znf content
Jeffreys 2013
- Znf DNA sequences
- Allele znf content
Hussin 2013
- Znf DNA sequences
- Allele znf content
- Allele DNA sequence accession numbers
Beyter 2021
- Allele structural variants
  - Allele DNA sequences (inferred)
Wang 2021
- Allele mutations
  - Allele DNA sequences (inferred)
Alleva 2021
- Znf DNA sequences
- Znf amino acid sequences
- Allele znf content
- Allele DNA sequences

When possible, sequences were copy/pasted from publications and saved as firstauthor-year-type.txt in the copy-paste-files directory. After tidying up, content was saved as firstauthor-year-type.tsv in the intermediate-files directory. Type was one of the following:

znf-sequences
znf-aminos
allele-znf-content
allele-sequence-accessions
allele-sequences
allele-mutations
allele-aminos
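The naming convention above can be sketched as a small bash helper. This is only an illustration: the function name `tidy_path` is hypothetical and is not used elsewhere in this workflow; the accepted types are the ones listed above.

```shell
# hypothetical helper: build the tidy-file path for a first author, year, and type
tidy_path() {
    local author=$1 year=$2 type=$3
    case $type in
        znf-sequences|znf-aminos|allele-znf-content|allele-sequence-accessions|allele-sequences|allele-mutations|allele-aminos) ;;
        *) echo "unknown type: $type" >&2; return 1 ;;
    esac
    echo "intermediate-files/${author}-${year}-${type}.tsv"
}

tidy_path oliver 2009 znf-sequences
# intermediate-files/oliver-2009-znf-sequences.tsv
```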

GenBank accession downloads are described in additional documentation.

Analysis steps:

1. Get allele and znf sequence data from publications
2. Compile known znf sequences and give standardized names
3. Compile allele znf content, get list of unique alleles, and give them standardized names
4. Convert allele DNA sequences to standardized znf names to confirm known alleles and identify ones not in current list

Step 1. Get allele and znf sequence data from publications

Oliver et al. Dec 2009

Accelerated Evolution of the Prdm9 Speciation Gene across Diverse Metazoan Taxa

PMID: 19997497
GenBank Accession Numbers: FJ899863.1 - FJ899912.1

Znf DNA sequences:

Includes znfs 03-14
Copy/paste from Supplementary Dataset S1 to: copy-paste-files/oliver-2009-znf-copy.txt
- Sequences are shifted: each begins with the last 9 nucleotides of the previous zinc finger
Tidy file: intermediate-files/oliver-2009-znf-sequences.tsv

# extract human znfs
sed '/>/ s/$/NEWLINE/' copy-paste-files/oliver-2009-znf-copy.txt | tr -d '\n' | sed 's/>/\n/g' | grep "homo_sapien" | sed 's/NEWLINE/\t/' > intermediate-files/oliver-2009-znf-sequences.tsv

# check if all unique
wc -l intermediate-files/oliver-2009-znf-sequences.tsv
# 12
cut -f2 intermediate-files/oliver-2009-znf-sequences.tsv | sort | uniq | wc -l
# 9

# remove duplicate sequences, keeping lowest ID number
sort -u -k2,2 intermediate-files/oliver-2009-znf-sequences.tsv | sort > intermediate-files/TEMP-oliver-2009-znf-sequences.tsv
rm intermediate-files/oliver-2009-znf-sequences.tsv
mv intermediate-files/TEMP-oliver-2009-znf-sequences.tsv intermediate-files/oliver-2009-znf-sequences.tsv
wc -l intermediate-files/oliver-2009-znf-sequences.tsv
# 9
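The count-versus-unique-count check and the duplicate removal above recur for most znf files below. A minimal sketch on toy data, assuming a two-column tab-separated file (name, sequence); the helper name `count_unique_seqs` and the toy names and sequences are made up for illustration. The `-s` flag is added here to make explicit that the first input line of each duplicate run is the one kept.

```shell
# toy two-column file (name <TAB> sequence); all values are made up
TOY=$(mktemp)
printf 'znf03\tACGT\nznf04\tTTGA\nznf05\tACGT\n' > "$TOY"

# hypothetical helper: print "<total lines> <unique sequences>" for a TSV
count_unique_seqs() {
    local total unique
    total=$(wc -l < "$1")
    unique=$(cut -f2 "$1" | sort -u | wc -l)
    echo "$((total)) $((unique))"
}

count_unique_seqs "$TOY"
# 3 2

# drop duplicate sequences, keeping the first (lowest) name for each, then re-sort by name
DEDUP=$(sort -t $'\t' -s -u -k2,2 "$TOY" | sort)
echo "$DEDUP"
```

If the two numbers printed by the helper differ, the file contains duplicate sequences and the dedup step applies.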

Allele DNA sequence accession numbers:

Save accession numbers to: genbank-records/oliver-2009-allele-sequence-accessions.txt

# generate sequence of numbers representing genbank accession numbers
for i in $(seq 899863 899912)
do
echo "FJ$i.1" >> genbank-records/oliver-2009-allele-sequence-accessions.txt
done
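The same accession loop is repeated for several publications below. A minimal sketch of a reusable version, assuming bash; the function name `accession_range` and its arguments are hypothetical. Redirecting its output with `>` rather than `>>` would avoid duplicated lines if a step is rerun.

```shell
# hypothetical helper: print accessions PREFIX<number>.VERSION for an inclusive numeric range
# usage: accession_range PREFIX FIRST LAST [WIDTH] [VERSION]
accession_range() {
    local prefix=$1 first=$2 last=$3 width=${4:-6} version=${5:-1}
    local i
    for i in $(seq "$first" "$last")
    do
        # zero-pad the number to WIDTH digits
        printf '%s%0*d.%s\n' "$prefix" "$width" "$i" "$version"
    done
}

accession_range FJ 899863 899865
# FJ899863.1
# FJ899864.1
# FJ899865.1

# zero-padding keeps the leading 0 in accessions such as JQ044371.1
accession_range JQ 44371 44371
# JQ044371.1
```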

Thomas et al. Dec 2009

Extraordinary Molecular Evolution in the PRDM9 Fertility Gene

PMID: 20041164
GenBank Accession Numbers: None

Znf DNA sequences:

Includes znfs ZF1-ZF12
Znf sequences depicted in Figure 4
Not copy/pastable

Parvanov et al. Feb 2010

PRDM9 controls activation of mammalian recombination hotspots

PMID: 20044538
GenBank Accession Numbers: GU183914.1 - GU183919.1

Allele DNA sequence accession numbers:

Accession numbers are listed on Supplementary Material page 3 (Material and Methods, Gene Bank Numbers) as GU183909-GU183919, but only GU183914-GU183919 are human (the others are mouse)
Save accession numbers to: genbank-records/parvanov-2010-allele-sequence-accessions.txt

# generate sequence of numbers representing genbank accession numbers
for i in $(seq 183914 183919)
do
echo "GU$i.1" >> genbank-records/parvanov-2010-allele-sequence-accessions.txt
done

Allele amino acid sequences:

Includes alleles AA1-AA11, CH1-CH3, M1, M2
Copy/paste from Supplementary Figure S3A to: copy-paste-files/parvanov-2010-aminos.txt
- Moved blocks to one-per-column and put names on top in this file for easier tidying
Tidy file: intermediate-files/parvanov-2010-allele-aminos.tsv

# remove extra white space, separate columns, save to temp file
sed 's/^[ \t]*//;s/[ \t]*$//' copy-paste-files/parvanov-2010-aminos.txt | sed -e '/\* [A-Z]/ s/\* /\*\t/' -e '/) [A-Z]/ s/) /)\t/' -e 's/K /K\t/' | grep -E $'\t[A-Z]' | cut -f2 > intermediate-files/parvanov-2010-allele-aminos-temp.txt
sed 's/^[ \t]*//;s/[ \t]*$//' copy-paste-files/parvanov-2010-aminos.txt | sed -e '/\* [A-Z]/ s/\* /\*\t/' -e '/) [A-Z]/ s/) /)\t/' -e 's/K /K\t/' | cut -f1 >> intermediate-files/parvanov-2010-allele-aminos-temp.txt

# put sequences on one line, move name to beginning, remove aminos before start (LYVCRE before CGR) & after end (DE* after CRE), sort
awk '/\*/ {printf("%s,", $0); next}1' intermediate-files/parvanov-2010-allele-aminos-temp.txt| tr -d '\n' | sed 's/)/)\n/g' | awk -F "," '{print $2 "\t" $1}' | sed -e 's/\tLYVCRE/\t/' -e 's/DE\*$//' -e 's/ (.*)\t/\t/' | sort -k1,1V > intermediate-files/parvanov-2010-allele-aminos.tsv

Baudat et al. Feb 2010

PRDM9 is a major determinant of meiotic recombination hotspots in humans and mice

PMID: 20044539
GenBank Accession Numbers: GU216222.1 - GU216229.1

Allele znf content:

Includes alleles A-E, F, H-I, K
Image in Figure 2b depicts znf content as unnamed blocks
- One block (-NHR in the legend) appears to represent more than one znf, as it occurs at positions 11 & 12 in alleles A and B
- Additionally, one block in alleles F and K (--HR) and one block in allele H (-RVS) are not present in alleles A-E
Type znf content out by hand, triple-checking for accuracy, to: copy-paste-files/baudat-2010-allele-copy.txt
- Name blocks by four-character code in figure legend (orange = NTOR; grey = NTGR), with | to separate blocks
Tidy file: intermediate-files/baudat-2010-allele-znf-content-unnamed.tsv

# tidy file
awk '{print $1 "\t" $3}' copy-paste-files/baudat-2010-allele-copy.txt > intermediate-files/baudat-2010-allele-znf-content-unnamed.tsv

Allele DNA sequence accession numbers:

Includes alleles A-F, H-I
Sequences in Supplementary Figure S3B are screenshots instead of copy/pastable text (!)
Save accession numbers to: genbank-records/baudat-2010-allele-sequence-accessions.txt

# generate sequence of numbers representing genbank accession numbers
for i in $(seq 216222 216229)
do
echo "GU$i.1" >> genbank-records/baudat-2010-allele-sequence-accessions.txt
done

Berg et al. Oct 2010

PRDM9 variation strongly influences recombination hot-spot activity and meiotic instability in humans

PMID: 20818382
GenBank Accession Numbers: HM210983.1 - HM211006.1

Znf DNA sequences:

Includes znfs a-t
Copy/paste from Supplementary Figure 1a to: copy-paste-files/berg-2010-znf-copy.txt
Tidy file: intermediate-files/berg-2010-znf-sequences.tsv

# tidy file
sed -e 's/\s/\t/' -e '$a\' copy-paste-files/berg-2010-znf-copy.txt > intermediate-files/berg-2010-znf-sequences.tsv

# check if all unique
wc -l intermediate-files/berg-2010-znf-sequences.tsv
# 20
cut -f2 intermediate-files/berg-2010-znf-sequences.tsv | sort | uniq | wc -l
# 20

Allele znf content:

Includes alleles A-E, L1-L24
Copy/paste from Supplementary Figure 1b to: copy-paste-files/berg-2010-allele-copy.txt
Tidy file: intermediate-files/berg-2010-allele-znf-content.tsv

# remove extra columns, tidy file, and sort alphabetically
awk '{print $1 "\t" $3}' copy-paste-files/berg-2010-allele-copy.txt | sort -k1,1V > intermediate-files/berg-2010-allele-znf-content.tsv

Allele DNA sequence accession numbers:

Accessions for alleles L1-L24
Save accession numbers to: genbank-records/berg-2010-allele-sequence-accessions.txt

# generate sequence of numbers representing genbank accession numbers
for i in $(seq 210983 211006)
do
echo "HM$i.1" >> genbank-records/berg-2010-allele-sequence-accessions.txt
done

Kong et al. Oct 2010

Fine-scale recombination rate differences between sexes, populations and individuals

PMID: 20981099
GenBank Accession Numbers: None

Allele znf content:

Includes alleles Decode01-Decode07, YRI01-YRI09 (publication did not provide allele names, so name them here)
Image in Supplementary Figure 4 depicts znf content as blocks named with the amino acids at repeat positions -1, 3 and 6 of the alpha helix
Type znf content out by hand, triple-checking for accuracy, to: copy-paste-files/kong-2010-allele-copy.txt
- Add - to separate blocks
Tidy file: intermediate-files/kong-2010-allele-znf-content-unnamed.tsv

# remove gaps, tidy file, give allele names (first 7 are Decode sample, rest are HapMap YRI)
awk '{print $2}' copy-paste-files/kong-2010-allele-copy.txt | sed -e 's/-\{2,\}/-/g' | grep "-" | sort | uniq | awk '{if (NR<8) printf "Decode%02i\t%s\n", NR, $1; else printf "YRI%02i\t%s\n", NR-7, $1}' > intermediate-files/kong-2010-allele-znf-content-unnamed.tsv

Ponting May 2011

What are the genomic drivers of the rapid evolution of PRDM9?

PMID: 21388701
GenBank Accession Numbers: None

Allele znf content:

Includes alleles A-L24
Review paper
Image in Figure 4 depicts znf content as named blocks
- Cites Berg et al. 2010 for allele znf content
- However, znf content for allele L24 does not match that for Berg et al. 2010
- Via email communication with Ponting (Aug 2021), confirmed that L24 znf content depicted in Figure 4 is incorrect
Type znf content out by hand, triple-checking for accuracy, to: copy-paste-files/ponting-2011-allele-copy.txt
- Include gaps represented with _
Tidy file: intermediate-files/ponting-2011-allele-znf-content.tsv

# remove gaps and sort alphabetically
sed -e 's/\s/\t/g' -e 's/_//g' copy-paste-files/ponting-2011-allele-copy.txt | cut -f1,2 | sort -k1,1V > intermediate-files/ponting-2011-allele-znf-content.tsv

Berg et al. Jul 2011

Variants of the protein PRDM9 differentially regulate a set of human meiotic recombination hotspots highly active in African populations

PMID: 21750151
GenBank Accession Numbers: None

Znf DNA sequences:

Includes znfs a-l, o-v
Copy/paste from Supplementary Figure 1A to: copy-paste-files/berg-2011-znf-copy.txt
Tidy file: intermediate-files/berg-2011-znf-sequences.tsv

# tidy file
sed -e 's/\s/\t/'  -e '$a\' copy-paste-files/berg-2011-znf-copy.txt > intermediate-files/berg-2011-znf-sequences.tsv

# check if all unique
wc -l intermediate-files/berg-2011-znf-sequences.tsv
# 20
cut -f2 intermediate-files/berg-2011-znf-sequences.tsv | sort | uniq | wc -l
# 20

Allele znf content:

Includes alleles A-E, L1-L27
Copy/paste from Supplementary Figure 1B to: copy-paste-files/berg-2011-allele-copy.txt
Tidy file: intermediate-files/berg-2011-allele-znf-content.tsv

# remove extra columns, tidy file, and sort alphabetically
awk '{print $1 "\t" $3}' copy-paste-files/berg-2011-allele-copy.txt | sort -k1,1V > intermediate-files/berg-2011-allele-znf-content.tsv

Borel et al. May 2012

Evaluation of PRDM9 variation as a risk factor for recurrent genomic disorders and chromosomal non-disjunction

PMID: 22643917
GenBank Accession Numbers: None

Znf DNA sequences:

Includes znfs a-m, q
Copy/paste from Supplementary Table S1 to: copy-paste-files/borel-2012-znf-copy.txt
Tidy file: intermediate-files/borel-2012-znf-sequences.tsv

# tidy file
sed -e 's/\s/\t/' copy-paste-files/borel-2012-znf-copy.txt > intermediate-files/borel-2012-znf-sequences.tsv

# check if all unique
wc -l intermediate-files/borel-2012-znf-sequences.tsv
# 14
cut -f2 intermediate-files/borel-2012-znf-sequences.tsv | sort | uniq | wc -l
# 14

Allele znf content:

Includes alleles A-F, I, L1, L19, L28-L31
- Cites Berg et al. 2010 for allele znf content and mentions Baudat et al. 2010 in paper
- However, znf content for allele I does not match that for Baudat et al. 2010
- Have reached out to last author for clarification, but assuming it is a typo
Copy/paste from Supplementary Table S1 to: copy-paste-files/borel-2012-allele-copy.txt
Tidy file: intermediate-files/borel-2012-allele-znf-content.tsv

# remove extra column
awk '{print $1 "\t" $3}' copy-paste-files/borel-2012-allele-copy.txt > intermediate-files/borel-2012-allele-znf-content.tsv

Jeffreys et al. Jan 2013

Recombination regulator PRDM9 influences the instability of its own coding sequence in humans

PMID: 23267059
GenBank Accession Numbers: None

Znf DNA sequences:

Study looks at low-frequency mutations in blood and genotypes of sperm cells; these znfs are not necessarily observed as part of a human genotype
Includes znfs A-L, O-V, a-z, 1-9, !, @, £, $, %, &, §, *, :, ±
Copy/paste from Supplementary Figure S2 to: copy-paste-files/jeffreys-2013-znf-copy.txt
- Icons £, §, and ± may not render properly when copy/pasted (appear as _) depending on language settings; these icons are used to name sequences on lines 57, 61, and 64 respectively
Tidy file: intermediate-files/jeffreys-2013-znf-sequences.tsv

# remove extra characters and tidy file
awk '{if ($1 ~ /[A-T,V-Z]/) print $1 "\t" $3; else print $1 "\t" $2}' copy-paste-files/jeffreys-2013-znf-copy.txt > intermediate-files/jeffreys-2013-znf-sequences.tsv

# check if all unique
wc -l intermediate-files/jeffreys-2013-znf-sequences.tsv
# 64
cut -f2 intermediate-files/jeffreys-2013-znf-sequences.tsv | sort | uniq | wc -l
# 64

Allele znf content:

Study looks at low-frequency mutations in blood and genotypes of sperm cells; these alleles are not necessarily observed as part of a human genotype
Includes alleles Jeffreys001-Jeffreys563 (publication did not provide allele names, so name them here)
Copy/paste from Supplementary Table S1 to: copy-paste-files/jeffreys-2013-allele-copy.txt
Tidy file: intermediate-files/jeffreys-2013-allele-znf-content.tsv

# convert from 2 'text columns' to one, remove extra columns and duplicated alleles, remove gaps, add temporary allele names Jeffreys###
egrep -v "Man|allele|origin|Fig." copy-paste-files/jeffreys-2013-allele-copy.txt | awk '{if (length($5) >= 4) print $1 "\n" $5; else if (length($6) >= 4) print $1 "\n" $6; else if (NF < 5 && length($3) >= 4) print $1 "\n" $3; else if (NF < 6 && length($3) < 4) print $1}' | sed 's/-//g' | sort | uniq | awk '{printf "Jeffreys%03i\t%s\n", NR, $1}' > intermediate-files/jeffreys-2013-allele-znf-content.tsv
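The last stages of the pipeline above (remove gaps, deduplicate, assign temporary names) can be seen on toy input. A minimal sketch: the znf-content strings and the `Toy` prefix are made up, and `LC_ALL=C` is an added assumption so that the sort order, and therefore the numbering, does not depend on locale.

```shell
# toy znf-content strings (made up); '-' marks an alignment gap
NAMED=$(printf 'ABCD\nAB-CD\nABDD\n' |
    sed 's/-//g' |
    LC_ALL=C sort | uniq |
    awk '{printf "Toy%03d\t%s\n", NR, $1}')
echo "$NAMED"
# Toy001	ABCD
# Toy002	ABDD
```

Because names are assigned by row number after sorting, they are arbitrary labels and change if the input set changes.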

Hussin et al. Mar 2013

Rare allelic forms of PRDM9 associated with childhood leukemogenesis

PMID: 23222848
GenBank Accession Numbers: JQ044371.1 - JQ044377.1

Znf DNA sequences:

Includes znfs a-x
Copy/paste from Supplementary Figure S6 to: copy-paste-files/hussin-2013-znf-copy.txt
Tidy file: intermediate-files/hussin-2013-znf-sequences.tsv

# remove extra lines and tidy file
grep -v "Zinc" copy-paste-files/hussin-2013-znf-copy.txt | sed 's/\s/\t/' > intermediate-files/hussin-2013-znf-sequences.tsv

# check if all unique
wc -l intermediate-files/hussin-2013-znf-sequences.tsv
# 24
cut -f2 intermediate-files/hussin-2013-znf-sequences.tsv | sort | uniq | wc -l
# 24

Allele znf content:

Includes alleles L32-L38
Copy/paste from Supplementary Material Page 7 (Supplementary Results, Description of PRDM9 Alleles and Novel ZnF Types) to: copy-paste-files/hussin-2013-allele-copy.txt
Tidy file: intermediate-files/hussin-2013-allele-znf-content.tsv

# tidy file and sort alphabetically
sed -e 's/ is /\t/' -e 's/[,=]/\t/' copy-paste-files/hussin-2013-allele-copy.txt | sort -k1,1V | grep . > intermediate-files/hussin-2013-allele-znf-content.tsv

Allele DNA sequence accession numbers:

Accessions for alleles L32-L38
Save accession numbers to: genbank-records/hussin-2013-allele-sequence-accessions.txt
Znf content derived from the DNA sequences for alleles L35 and L38 does not match the content described in the publication
- Via email communication with Hussin (Aug 2021), confirmed that the DNA sequences for L35 and L38 deposited in GenBank are incorrect

# generate sequence of numbers representing genbank accession numbers
for i in $(seq 44371 44377)
do
echo "JQ0$i.1" >> genbank-records/hussin-2013-allele-sequence-accessions.txt
done

Beyter et al. May 2021

Long-read sequencing of 3,622 Icelanders provides insight into the role of structural variants in human diseases and other traits

PMID: 33972781
GenBank Accession Numbers: None

Allele structural variants:

Includes SVs chr5:23526974:DN.1, chr5:23526974:DN.2, chr5:23526974:DN.4, chr5:23526974:XN.5, chr5:23527530:FN.0, chr5:23527530:FN.1, chr5:23527530:FN.4
- chr5:23526974:XN.5 has no alternate allele and is therefore the reference sequence
Download SV vcf from GitHub to: copy-paste-files/beyter-2021-allele-SVs-copy.vcf.gz
Tidy file: intermediate-files/beyter-2021-allele-SVs.vcf

# download vcf
wget https://github.com/DecodeGenetics/LRS_SV_sets/raw/master/ont_sv_high_confidence_SVs.sorted.vcf.gz -O copy-paste-files/beyter-2021-allele-SVs-copy.vcf.gz

# index vcf, subset to PRDM9 znf region
tabix copy-paste-files/beyter-2021-allele-SVs-copy.vcf.gz
tabix -h copy-paste-files/beyter-2021-allele-SVs-copy.vcf.gz chr5:23526673-23527764 > intermediate-files/beyter-2021-allele-SVs.vcf

Allele DNA sequences:

Includes alleles chr5:23526974:DN.1, chr5:23526974:DN.2, chr5:23526974:DN.4, chr5:23526974:XN.5, chr5:23527530:FN.0, chr5:23527530:FN.1, chr5:23527530:FN.4
Replace the reference allele in the GRCh38 reference sequence with each alternate SV sequence from intermediate-files/beyter-2021-allele-SVs.vcf
Tidy file: intermediate-files/beyter-2021-allele-sequences.tsv

# download GRCh38 chr5 fasta
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/chr5.fa.gz -O copy-paste-files/GRCh38-chr5.fa.gz
zcat copy-paste-files/GRCh38-chr5.fa.gz | grep -v ">" | tr -d '\n' | sed 's/$/\n/' | gzip > copy-paste-files/GRCh38-chr5.seq.gz

# replace reference allele sequence (znf domain) with alternate SV sequences; but leave sequence as is for variant with no alt sequence (reference sequence)
while read CHR POS ID REF ALT REMAINING
do
if [[ $ALT == "<ALT>" ]]
then awk -v ID=$ID '{print ID"\t"substr($0,23526673,23527764-23526673+1)}' <(zcat copy-paste-files/GRCh38-chr5.seq.gz) >> intermediate-files/beyter-2021-allele-sequences.tsv
else
awk -v ID=$ID -v ALT=$ALT -v POS=$POS -v REF=$REF '{print ID"\t"substr($0,23526673,POS-23526673)""ALT""substr($0,POS+length(REF),23527764-POS-length(REF)+1)}' <(zcat copy-paste-files/GRCh38-chr5.seq.gz) >> intermediate-files/beyter-2021-allele-sequences.tsv
fi
done < <(grep -v "#" intermediate-files/beyter-2021-allele-SVs.vcf)
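The coordinate arithmetic in the splice above is easier to check on a short sequence. A minimal sketch with made-up values, assuming 1-based inclusive coordinates as in a VCF; the awk variable names `S` and `E` (window start and end) are illustrative only.

```shell
# toy reference and window (1-based, inclusive); all values are made up
REFSEQ="AAAACCCCGGGGTTTT"
START=5; END=12                # window is CCCCGGGG
POS=7; REF="CC"; ALT="TAT"     # replace REF starting at POS with ALT

SPLICED=$(echo "$REFSEQ" | awk -v S="$START" -v E="$END" -v POS="$POS" -v REF="$REF" -v ALT="$ALT" '{
    left  = substr($0, S, POS - S)                                    # window start up to POS-1
    right = substr($0, POS + length(REF), E - POS - length(REF) + 1)  # first base after REF up to window end
    print left ALT right
}')
echo "$SPLICED"
# CCTATGGGG
```

The result is one base longer than the 8-base window because ALT (3 bases) replaces REF (2 bases).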

Wang et al. Jul 2021

Pathogenic variants of meiotic double strand break (DSB) formation genes PRDM9 and ANKRD31 in premature ovarian insufficiency

PMID: 34257419
GenBank Accession Numbers: None

Allele mutations:

Includes point mutations c.229C>T:p.Arg77*, c.638T>G:p.Ile213Ser, c.677A>T:p.Lys226Met relative to allele B (NM_020227.3)
Copy/paste from Table 1 to: copy-paste-files/wang-2021-allele-mutations-copy.txt
- Only copy/pasted from Patient number to E2,pg/mL for patients 1-4 due to merged cells in publication table
Tidy file: intermediate-files/wang-2021-allele-mutations.tsv

# remove extra columns and adjacent duplicate rows, then split position, reference and alternate alleles into separate columns
cut -f3 copy-paste-files/wang-2021-allele-mutations-copy.txt  | sed 's/\s//' | uniq | awk -F: '{print $0 "\t" substr($0, 3,3 ) "\t" substr($0, 6, 1) "\t" substr($0, 8, 1)}' > intermediate-files/wang-2021-allele-mutations.tsv
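The fixed-offset `substr` calls above work because all three positions have three digits. A minimal sketch of a pattern-based alternative, assuming bash (it relies on `[[ =~ ]]` and `BASH_REMATCH`); the function name `parse_cdna_snv` is hypothetical. The inputs are the three mutations listed above.

```shell
# hypothetical helper: split an HGVS-style cDNA substitution into position, ref, alt
parse_cdna_snv() {
    local re='^c\.([0-9]+)([ACGT])>([ACGT])'
    if [[ $1 =~ $re ]]
    then
        printf '%s\t%s\t%s\t%s\n' "$1" "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}" "${BASH_REMATCH[3]}"
    else
        echo "not a simple substitution: $1" >&2
        return 1
    fi
}

for m in 'c.229C>T:p.Arg77*' 'c.638T>G:p.Ile213Ser' 'c.677A>T:p.Lys226Met'
do
    parse_cdna_snv "$m"
done
# c.229C>T:p.Arg77*	229	C	T
# c.638T>G:p.Ile213Ser	638	T	G
# c.677A>T:p.Lys226Met	677	A	T
```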

Reference allele:

Above point mutations relative to allele B (NM_020227.3)
Copy/paste NCBI fasta record to: copy-paste-files/NM_020227.3-copy.txt
Tidy file: intermediate-files/NM_020227.3.fa

# collapse reference sequence to single line
sed '/>/ s/$/NEWLINE/' copy-paste-files/NM_020227.3-copy.txt | tr -d '\n' | sed 's/>/\n>/g' | sed 's/NEWLINE/\n/' > intermediate-files/NM_020227.3.fa

# get start position for the znf region
ZNFA=$(egrep "^A" intermediate-files/berg-2010-znf-sequences.tsv | cut -f2)
FULLSEQ=$(tail -1 intermediate-files/NM_020227.3.fa)
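# length of the sequence preceding the first znf (nested ${#${...}} only works in zsh, so use two steps)
PREFIX=${FULLSEQ%%"$ZNFA"*}
ZNFSTART=${#PREFIX}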

echo $ZNFSTART
# 1772
# mutations observed in Wang 2021 do not occur in the zinc finger domain (they occur at coding positions 229-677, upstream of the znf start); these sequences can be ignored

Allele sequences:

Since the mutations occur before the zinc finger region, the alleles observed in this study are all reference alleles
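The offset lookup used above can be illustrated on a toy sequence. A minimal sketch, assuming bash; the sequence, the motif (standing in for the first znf sequence), and the test positions are all made up. It also guards against the case where the motif is absent, in which the prefix would equal the whole sequence.

```shell
# toy transcript and motif (made up); the motif stands in for the first znf sequence
FULLSEQ="GGGATTTACGTACGT"
MOTIF="ACGT"

PREFIX=${FULLSEQ%%"$MOTIF"*}     # everything before the first occurrence of the motif
if [[ $PREFIX == "$FULLSEQ" ]]
then
    echo "motif not found" >&2
else
    OFFSET=${#PREFIX}            # 0-based offset of the motif
    echo "$OFFSET"
fi
# 7

# a 1-based position lies upstream of the motif when it is <= OFFSET
for POS in 3 7 8
do
    if (( POS <= OFFSET )); then echo "$POS upstream"; else echo "$POS inside or downstream"; fi
done
# 3 upstream
# 7 upstream
# 8 inside or downstream
```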

Alleva et al. Nov 2021

Cataloging human PRDM9 allelic variation using long-read sequencing reveals PRDM9 population specificity and two distinct groupings of related alleles

PMID: 34805134
GenBank Accession Numbers: None

Znf DNA sequences:

Includes !A-!N, :A-:V, |1-|9, |A-|J, |a-|j
Save Supplementary Data File 2 to: copy-paste-files/alleva-2021-SD2-znf-copy.tsv
Tidy file: intermediate-files/alleva-2021-znf-sequences.tsv

# remove extra lines and columns
grep "TGT" copy-paste-files/alleva-2021-SD2-znf-copy.tsv | cut -f1,4 > intermediate-files/alleva-2021-znf-sequences.tsv

# check if all unique
wc -l intermediate-files/alleva-2021-znf-sequences.tsv
# 81
cut -f2 intermediate-files/alleva-2021-znf-sequences.tsv | sort | uniq | wc -l
# 81

Znf amino acid sequences:

Includes !A-!N, :A-:V, |1-|9, |A-|J, |a-|j
From the same Supplementary Data File 2 for znf DNA sequences: copy-paste-files/alleva-2021-SD2-znf-copy.tsv
Tidy file: intermediate-files/alleva-2021-znf-aminos.tsv

# remove extra lines and columns
grep "TGT" copy-paste-files/alleva-2021-SD2-znf-copy.tsv | cut -f1,5 > intermediate-files/alleva-2021-znf-aminos.tsv

Allele znf content:

Includes alleles A-E, F, H-I, L1-L24, L27, M1-M32
- Also includes 542 additional alleles from Jeffreys et al. 2013 observed in sperm or blood
Save Supplementary Data File 3 to: copy-paste-files/alleva-2021-SD3-allele-copy.tsv
Tidy file: intermediate-files/alleva-2021-allele-znf-content.tsv

# remove extra lines and columns
grep "TGT" copy-paste-files/alleva-2021-SD3-allele-copy.tsv | cut -f2,3 > intermediate-files/alleva-2021-allele-znf-content.tsv

Allele DNA sequences:

Includes alleles A-E, F, H-I, L1-L24, L27, M1-M32
- Also includes 542 additional alleles from Jeffreys et al. 2013 observed in sperm or blood
From the same Supplementary Data File 3 for allele znf content: copy-paste-files/alleva-2021-SD3-allele-copy.tsv
Tidy file: intermediate-files/alleva-2021-allele-sequences.tsv

# remove extra lines and columns
grep "TGT" copy-paste-files/alleva-2021-SD3-allele-copy.tsv | cut -f2,6 > intermediate-files/alleva-2021-allele-sequences.tsv
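Since the znf sequences, allele znf content, and allele sequences all come from the same publication, one possible consistency check is to rebuild each allele sequence from its znf content and compare it with the listed allele sequence. A minimal sketch on toy data, under two loud assumptions that are not confirmed here: every znf name is exactly two characters, and the content column is a plain concatenation of those names. All names and sequences below are made up.

```shell
# toy inputs (made up): znf name <TAB> sequence, and allele <TAB> znf content
ZNFS=$(mktemp); CONTENT=$(mktemp)
printf '!A\tACG\n:B\tTTT\n|1\tGGA\n' > "$ZNFS"
printf 'toy1\t!A:B|1\ntoy2\t:B:B\n' > "$CONTENT"

BUILT=$(awk -F'\t' '
    NR == FNR { seq[$1] = $2; next }          # first file: znf name -> sequence
    {
        out = ""
        for (i = 1; i <= length($2); i += 2)  # assumption: two-character znf names
            out = out seq[substr($2, i, 2)]
        print $1 "\t" out
    }
' "$ZNFS" "$CONTENT")
echo "$BUILT"
# toy1	ACGTTTGGA
# toy2	TTTTTT
```

The rebuilt sequences could then be compared with the allele sequence file, for example with `diff` after sorting both by allele name.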
