Literature searches revealed several publications that describe zinc finger (znf) DNA sequences, znf amino acid sequences, allele znf content, allele DNA sequences and/or their accession numbers, allele mutations, and/or allele structural variants (SVs):
- Oliver 2009
- Znf DNA sequences
- Allele DNA sequence accession numbers
- Thomas 2009
- Znf DNA sequences
- Parvanov 2010
- Allele DNA sequence accession numbers
- Allele amino acid sequences
- Baudat 2010
- Allele znf content
- Allele DNA sequence accession numbers
- Berg 2010
- Znf DNA sequences
- Allele znf content
- Allele DNA sequence accession numbers
- Kong 2010
- Allele znf content
- Ponting 2011
- Allele znf content
- Berg 2011
- Znf DNA sequences
- Allele znf content
- Borel 2012
- Znf DNA sequences
- Allele znf content
- Jeffreys 2013
- Znf DNA sequences
- Allele znf content
- Hussin 2013
- Znf DNA sequences
- Allele znf content
- Allele DNA sequence accession numbers
- Beyter 2021
- Allele structural variants
- Allele DNA sequences (inferred)
- Allele structural variants
- Wang 2021
- Allele mutations
- Allele DNA sequences (inferred)
- Allele mutations
- Alleva 2021
- Znf DNA sequences
- Znf amino acid sequences
- Allele znf content
- Allele DNA sequences
When possible, sequences were copy/pasted from publications and saved as firstauthor-year-type.txt
in the copy-paste-files
directory. After tidying up, content was saved as firstauthor-year-type.tsv
in the intermediate-files
directory. Type
was one of the following:
znf-sequences
znf-aminos
allele-znf-content
allele-sequence-accessions
allele-sequences
allele-mutations
allele-aminos
Genbank accession downloads are described in additional documentation.
Analysis steps:
- Get allele and znf sequence data from publications
- Compile known znf sequences and give standardized names
- Compile allele znf content, get list of unique alleles, and give them standardized names
- Convert allele DNA sequences to standardized znf names to confirm known alleles and identify ones not in current list
PMID: 19997497
GenBank Accession Numbers: FJ899863.1 - FJ899912.1
Znf DNA sequences:
- Includes znfs
03
-14
- Copy/paste from Supplementary Dataset S1 to:
copy-paste-files/oliver-2009-znf-copy.txt
- Sequences are shifted and begin with the last 9 nucleotides of previous zinc finger
- Tidy file:
intermediate-files/oliver-2009-znf-sequences.tsv
# extract human znfs
sed '/>/ s/$/NEWLINE/' copy-paste-files/oliver-2009-znf-copy.txt | tr -d '\n' | sed 's/>/\n/g' | grep "homo_sapien" | sed 's/NEWLINE/\t/' > intermediate-files/oliver-2009-znf-sequences.tsv
# check if all unique
wc -l intermediate-files/oliver-2009-znf-sequences.tsv
# 12
cut -f2 intermediate-files/oliver-2009-znf-sequences.tsv | sort | uniq | wc -l
# 9
# remove duplicate sequences, keeping lowest ID number
sort -u -k2,2 intermediate-files/oliver-2009-znf-sequences.tsv | sort > intermediate-files/TEMP-oliver-2009-znf-sequences.tsv
rm intermediate-files/oliver-2009-znf-sequences.tsv
mv intermediate-files/TEMP-oliver-2009-znf-sequences.tsv intermediate-files/oliver-2009-znf-sequences.tsv
wc -l intermediate-files/oliver-2009-znf-sequences.tsv
# 9
Allele DNA sequence accession numbers:
- Save accession numbers to:
genbank-records/oliver-2009-allele-sequence-accessions.txt
# generate sequence of numbers representing genbank accession numbers
for i in $(seq 899863 899912)
do
echo "FJ$i.1" >> genbank-records/oliver-2009-allele-sequence-accessions.txt
done
PMID: 20041164
GenBank Accession Numbers: None
Znf DNA sequences:
- Includes znfs
ZF1
-ZF12
- Znf sequences depicted in Figure 4
- Not copy/pastable
PMID: 20044538
GenBank Accession Numbers: GU183914.1 - GU183919.1
Allele DNA sequence accession numbers:
- Accession numbers listed on Supplementary Material page 3: Material and Methods, Gene Bank Numbers include from GU183909-GU183919, but only GU183914-GU183919 are human (the others are mouse)
- Save accession numbers to:
genbank-records/parvanov-2010-allele-sequence-accessions.txt
# generate sequence of numbers representing genbank accession numbers
for i in $(seq 183914 183919)
do
echo "GU$i.1" >> genbank-records/parvanov-2010-allele-sequence-accessions.txt
done
Allele amino acid sequences:
- Includes alleles
AA1
-AA11
,CH1
-CH3
,M1
,M2
- Copy/paste from Supplementary Figure S3A to:
copy-paste-files/parvanov-2010-allele-aminos.txt
- Moved blocks to one-per-column and put names on top in this file for easier tidying
- Tidy file:
intermediate-files/parvanov-2010-allele-aminos.tsv
# remove extra white space, separate columns, save to temp file
sed 's/^[ \t]*//;s/[ \t]*$//' copy-paste-files/parvanov-2010-aminos.txt | sed -e '/\* [A-Z]/ s/\* /\*\t/' -e '/) [A-Z]/ s/) /)\t/' -e 's/K /K\t/' | egrep "\t[A-Z]" | cut -f2 > intermediate-files/parvanov-2010-allele-aminos-temp.txt
sed 's/^[ \t]*//;s/[ \t]*$//' copy-paste-files/parvanov-2010-aminos.txt | sed -e '/\* [A-Z]/ s/\* /\*\t/' -e '/) [A-Z]/ s/) /)\t/' -e 's/K /K\t/' | cut -f1 >> intermediate-files/parvanov-2010-allele-aminos-temp.txt
# put sequences on one line, move name to beginning, remove aminos before start (LYVCRE before CGR) & after end (DE* after CRE), sort
awk '/\*/ {printf("%s,", $0); next}1' intermediate-files/parvanov-2010-allele-aminos-temp.txt| tr -d '\n' | sed 's/)/)\n/g' | awk -F "," '{print $2 "\t" $1}' | sed -e 's/\tLYVCRE/\t/' -e 's/DE\*$//' -e 's/ (.*)\t/\t/' | sort -k1,1V > intermediate-files/parvanov-2010-allele-aminos.tsv
PMID: 20044539
GenBank Accession Numbers: GU216222.1 - GU216229.1
Allele znf content:
- Includes alleles
A
-E
,F
,H
-I
,K
- Image in Figure 2b depicts znf content as unnamed blocks
- Appears one block (
-NHR
in the legend) represents more than one znf as it appears in positions 11 & 12 in alleles A and B - Additionally, tne block in alleles F and K (
--HR
) and one block in allele H (-RVS
)that are not also present in A-E
- Appears one block (
- Type znf content typed out by hand, triple check for accuracy, to:
copy-paste-files/baudat-2010-allele-copy.txt
- Name blocks by four-character code in figure legend (orange =
NTOR
; grey =NTGR
), with|
to separate blocks
- Name blocks by four-character code in figure legend (orange =
- Tidy file:
intermediate-files/baudat-2010-allele-znf-content-unnamed.tsv
# tidy file
awk '{print $1 "\t" $3}' copy-paste-files/baudat-2010-allele-copy.txt > intermediate-files/baudat-2010-allele-znf-content-unnamed.tsv
Allele DNA sequence accession numbers:
- Includes alleles
A
-F
,H
-I
- Sequences in Supplementary Figure S3B are screenshots instead of copy/pastable text (!)
- Save accession numbers to:
genbank-records/baudat-2010-allele-sequence-accessions.txt
# generate sequence of numbers representing genbank accession numbers
for i in $(seq 216222 216229)
do
echo "GU$i.1" >> genbank-records/baudat-2010-allele-sequence-accessions.txt
done
PRDM9 variation strongly influences recombination hot-spot activity and meiotic instability in humans
PMID: 20818382
GenBank Accession Numbers: HM210983.1 – HM211006.1
Znf DNA sequences:
- Includes znfs
a
-t
- Copy/paste from Supplementary Figure 1a to:
copy-paste-files/berg-2010-znf-copy.txt
- Tidy file:
intermediate-files/berg-2010-znf-sequences.tsv
# tidy file
sed -e 's/\s/\t/' -e '$a\' copy-paste-files/berg-2010-znf-copy.txt > intermediate-files/berg-2010-znf-sequences.tsv
# check if all unique
wc -l intermediate-files/berg-2010-znf-sequences.tsv
# 20
cut -f2 intermediate-files/berg-2010-znf-sequences.tsv | sort | uniq | wc -l
# 20
Allele znf content:
- Includes alleles
A
-E
,L1
-L24
- Copy/paste from Supplementary Figure 1b to:
copy-paste-files/berg-2010-allele-copy.txt
- Tidy file:
intermediate-files/berg-2010-allele-znf-content.tsv
# remove extra columns, tidy file, and sort alphabetically
awk '{print $1 "\t" $3}' copy-paste-files/berg-2010-allele-copy.txt | sort -k1,1V > intermediate-files/berg-2010-allele-znf-content.tsv
Allele DNA sequence accession numbers:
- Accessions for alleles L1-L24
- Save accession numbers to:
genbank-records/berg-2010-allele-sequence-accessions.txt
# generate sequence of numbers representing genbank accession numbers
for i in $(seq 210983 211006)
do
echo "HM$i.1" >> genbank-records/berg-2010-allele-sequence-accessions.txt
done
PMID: 20981099
GenBank Accession Numbers: None
Allele znf content:
- Includes alleles
Decode01
-Decode07
,YRI01
-YRI09
(publication did not provide allele names, so name them here) - Image in Supplementary Figure 4 depicts znf content as blocks named with the amino acids at repeat positions -1, 3 and 6 of the alpha helix
- Type znf content typed out by hand, triple check for accuracy, to:
copy-paste-files/kong-2010-allele-copy.txt
- Add
-
to separate blocks
- Add
- Tidy file:
intermediate-files/kong-2010-allele-znf-content-unnamed.tsv
# remove gaps, tidy file, give allele names (first 7 are Decode sample, rest are HapMap YRI)
awk '{print $2}' copy-paste-files/kong-2010-allele-copy.txt | sed -e 's/-\{2,\}/-/g' | grep "-" | sort | uniq | awk '{if (NR<8) printf "Decode%02i\t%s\n", NR, $1; else printf "YRI%02i\t%s\n", NR-7, $1}' > intermediate-files/kong-2010-allele-znf-content-unnamed.tsv
PMID: 21388701
GenBank Accession Numbers: None
Allele znf content:
- Includes alleles
A
-L24
- Review paper
- Image in Figure 4 depicts znf content as named blocks
- Cites Berg et al. 2010 for allele znf content
- However, znf content for allele L24 does not match that for Berg et al. 2010
- Via email communication with Ponting (Aug 2021), confirmed that L24 znf content depicted in Figure 4 is incorrect
- Type Znf content out by hand, triple check for accuracy, to:
copy-paste-files/ponting-2010-allele-znf-content.txt
- Include gaps represented with
_
- Include gaps represented with
- Tidy file:
intermediate-files/ponting-2011-allele-znf-content.tsv
# remove gaps and sort alphabetically
sed -e 's/\s/\t/g' -e 's/_//g' copy-paste-files/ponting-2011-allele-copy.txt | cut -f1,2 | sort -k1,1V > intermediate-files/ponting-2011-allele-znf-content.tsv
Variants of the protein PRDM9 differentially regulate a set of human meiotic recombination hotspots highly active in African populations
PMID: 21750151
GenBank Accession Numbers: None
Znf DNA sequences:
- Includes znfs
a
-l
,o
-v
- Copy/paste from Supplementary Figure 1A to:
copy-paste-files/berg-2011-znf-copy.txt
- Tidy file:
intermediate-files/berg-2011-znf-sequences.tsv
# tidy file
sed -e 's/\s/\t/' -e '$a\' copy-paste-files/berg-2011-znf-copy.txt > intermediate-files/berg-2011-znf-sequences.tsv
# check if all unique
wc -l intermediate-files/berg-2011-znf-sequences.tsv
# 20
cut -f2 intermediate-files/berg-2011-znf-sequences.tsv | sort | uniq | wc -l
# 20
Allele znf content:
- Includes alleles
A
-E
,L1
-L27
- Copy/paste from Supplementary Figure 1B to:
copy-paste-files/berg-2011-allele-copy.txt
- Tidy file:
intermediate-files/berg-2010-allele-znf-content.tsv
# remove extra columns, tidy file, and sort alphabetically
awk '{print $1 "\t" $3}' copy-paste-files/berg-2011-allele-copy.txt | sort -k1,1V > intermediate-files/berg-2011-allele-znf-content.tsv
Evaluation of PRDM9 variation as a risk factor for recurrent genomic disorders and chromosomal non-disjunction
PMID: 22643917
GenBank Accession Numbers: None
Znf DNA sequences:
- Includes znfs
a
-m
,q
- Copy/paste from Supplementary Table S1 to:
copy-paste-files/borel-2012-znf-copy.txt
- Tidy file:
intermediate-files/borel-2012-znf-sequences.tsv
# tidy file
sed -e 's/\s/\t/' copy-paste-files/borel-2012-znf-copy.txt > intermediate-files/borel-2012-znf-sequences.tsv
# check if all unique
wc -l intermediate-files/borel-2012-znf-sequences.tsv
# 14
cut -f2 intermediate-files/borel-2012-znf-sequences.tsv | sort | uniq | wc -l
# 14
Allele znf content:
- Includes alleles
A
-F
,I
,L1
,L19
,L28
-L31
- Cites Berg et al. 2010 for allele znf content and mentions Baudat et al. 2010 in paper
- However, znf content for allele I does not match that for Baudat et al. 2010
- Have reached out to last author for clarification, but assuming it is a typo
- Copy/paste from Supplementary Table S1 to:
copy-paste-files/borel-2012-allele-copy.txt
- Tidy file:
intermediate-files/borel-2012-allele-znf-content.tsv
# remove extra column
awk '{print $1 "\t" $3}' copy-paste-files/borel-2012-allele-copy.txt > intermediate-files/borel-2012-allele-znf-content.tsv
PMID: 23267059
GenBank Accession Numbers: None
Znf DNA sequences:
- Study looks at low-frequency mutations in blood and genotypes of sperm cells; these znfs not necessarily observed as part of a human genotype
- Includes znfs
A
-L
,O
-V
,a
-z
,1-9
,!
,@
,£
,$
,%
,&
,§
,*
,:
,±
- Copy/paste from Supplementary Figure S2 to:
copy-paste-files/jeffreys-2013-znf-copy.txt
- Icons
£
,§
, and±
may not render properly when copy/pasted (appear as_
) depending on language settings; these icons are used to name sequences on lines 57, 61, and 64 respectively
- Icons
- Tidy file:
intermediate-files/jeffreys-2013-znf-sequences.tsv
# remove extra characters and tidy file
awk '{if ($1 ~ /[A-T,V-Z]/) print $1 "\t" $3; else print $1 "\t" $2}' copy-paste-files/jeffreys-2013-znf-copy.txt > intermediate-files/jeffreys-2013-znf-sequences.tsv
# check if all unique
wc -l intermediate-files/jeffreys-2013-znf-sequences.tsv
# 64
cut -f2 intermediate-files/jeffreys-2013-znf-sequences.tsv | sort | uniq | wc -l
# 64
Allele znf content:
- Study looks at low-frequency mutations in blood and genotypes of sperm cells; these alleles not necessarily observed as part of a human genotype
- Includes alleles
Jeffreys001
-Jeffreys563
(publication did not provide allele names, so name them here) - Copy/paste from Supplementary Table S1 to:
copy-paste-files/jeffreys-2013-allele-copy.txt
- Tidy file:
intermediate-files/jeffreys-2013-allele-znf-content.tsv
# convert from 2 'text columns' to one, remove extra columns and duplicated alleles, remove gaps, add temporary allele names Jeffreys###
egrep -v "Man|allele|origin|Fig." copy-paste-files/jeffreys-2013-allele-copy.txt | awk '{if (length($5) >= 4) print $1 "\n" $5; else if (length($6) >= 4) print $1 "\n" $6; else if (NF < 5 && length($3) >= 4) print $1 "\n" $3; else if (NF < 6 && length($3) < 4) print $1}' | sed 's/-//g' | sort | uniq | awk '{printf "Jeffreys%03i\t%s\n", NR, $1}' > intermediate-files/jeffreys-2013-allele-znf-content.tsv
PMID: 23222848
GenBank Accession Numbers: JQ044371.1 – JQ044377.1
Znf DNA sequences:
- Includes znfs
a
-x
- Copy/paste from Supplementary Figure S6 to:
copy-paste-files/hussin-2013-znf-copy.txt
- Tidy file:
intermediate-files/hussin-2013-znf-sequences.tsv
# remove extra lines and tidy file
grep -v "Zinc" copy-paste-files/hussin-2013-znf-copy.txt | sed 's/\s/\t/' > intermediate-files/hussin-2013-znf-sequences.tsv
# check if all unique
wc -l intermediate-files/hussin-2013-znf-sequences.tsv
# 24
cut -f2 intermediate-files/hussin-2013-znf-sequences.tsv | sort | uniq | wc -l
# 24
Allele znf content:
- Includes alleles
L32
-L38
- Copy/paste from Supplementary Material Page 7: Supplementary Results, Description of PRDM9 Alleles and Novel ZnF Types to
copy-paste-files/hussin-2013-allele-copy.txt
- Tidy file:
intermediate-files/hussin-2013-allele-znf-content.tsv
# tidy file and sort alphabetically
sed -e 's/ is /\t/' -e 's/[,=]/\t/' copy-paste-files/hussin-2013-allele-copy.txt | sort -k1,1V | grep . > intermediate-files/hussin-2013-allele-znf-content.tsv
Allele DNA sequence accession numbers:
- Accessions for alleles L32-L38
- Save accession numbers to:
genbank-records/hussin-2013-allele-sequence-accessions.txt
- Znf content from DNA sequences for alleles L35 and L38 do not match content described in publication
- Via email communication with Hussin (Aug 2021), confirmed that the DNA sequences for L35 and L38 deposited in GenBank are incorrect
# generate sequence of numbers representing genbank accession numbers
for i in $(seq 44371 44377)
do
echo "JQ0$i.1" >> genbank-records/hussin-2013-allele-sequence-accessions.txt
done
Long-read sequencing of 3,622 Icelanders provides insight into the role of structural variants in human diseases and other traits
PMID: 33972781
GenBank Accession Numbers: None
Allele structural variants:
- Includes SVs
chr5:23526974:DN.1
,chr5:23526974:DN.2
,chr5:23526974:DN.4
,chr5:23526974:XN.5
,chr5:23527530:FN.0
,chr5:23527530:FN.1
,chr5:23527530:FN.4
chr5:23526974:XN.5
has no alternate allele and is therefore the reference sequence
- Download SV vcf from github to:
copy-paste-files/beyter-2021-allele-SVs-copy.vcf
- Tidy file:
intermediate-files/beyter-2021-allele-SVs.vcf
# download vcf
wget https://github.com/DecodeGenetics/LRS_SV_sets/raw/master/ont_sv_high_confidence_SVs.sorted.vcf.gz -O copy-paste-files/beyter-2021-allele-SVs-copy.vcf.gz
# index vcf, subset to PRDM9 znf region
tabix copy-paste-files/beyter-2021-allele-SVs-copy.vcf.gz
tabix -h copy-paste-files/beyter-2021-allele-SVs-copy.vcf.gz chr5:23526673-23527764 > intermediate-files/beyter-2021-allele-SVs.vcf
Allele DNA sequences:
- Includes alleles
chr5:23526974:DN.1
,chr5:23526974:DN.2
,chr5:23526974:DN.4
,chr5:23526974:XN.5
,chr5:23527530:FN.0
,chr5:23527530:FN.1
,chr5:23527530:FN.4
- Modify reference sequence to replace reference sequences with alternate SV sequences from
intermediate-files/beyter-2021-allele-SVs.vcf
- Tidy file:
intermediate-files/beyter-2021-allele-sequences.tsv
# download GRCh38 chr5 fasta
wget http://hgdownload.soe.ucsc.edu/goldenPath/hg38/chromosomes/chr5.fa.gz -O copy-paste-files/GRCh38-chr5.fa.gz
zcat copy-paste-files/GRCh38-chr5.fa | grep -v ">" | tr -d '\n' | sed 's/$/\n/' | gzip > copy-paste-files/GRCh38-chr5.seq.gz
# replace reference allele sequence (znf domain) with alternate SV sequences; but leave sequence as is for variant with no alt sequence (reference sequence)
while read CHR POS ID REF ALT REMAINING
do
if [[ $ALT == "<ALT>" ]]
then awk -v ID=$ID '{print ID"\t"substr($0,23526673,23527764-23526673+1)}' <(zcat copy-paste-files/GRCh38-chr5.seq.gz) >> intermediate-files/beyter-2021-allele-sequences.tsv
else
awk -v ID=$ID -v ALT=$ALT -v POS=$POS -v REF=$REF '{print ID"\t"substr($0,23526673,POS-23526673)""ALT""substr($0,POS+length(REF),23527764-POS-length(REF)+1)}' <(zcat copy-paste-files/GRCh38-chr5.seq.gz) >> intermediate-files/beyter-2021-allele-sequences.tsv
fi
done < <(grep -v "#" intermediate-files/beyter-2021-allele-SVs.vcf)
Pathogenic variants of meiotic double strand break (DSB) formation genes PRDM9 and ANKRD31 in premature ovarian insufficiency
PMID: 34257419
GenBank Accession Numbers: None
Allele mutations:
- Includes point mutations
c.229C>T:p.Arg77*
,c.638T>G:p.Ile213Ser
,c.677A>T:p.Lys226Met
relative to allele B (NM_020227.3) - Copy/paste from Table 1 to:
copy-paste-files/wang-2021-allele-mutations-copy.txt
- Only copy/pasted from
Patient number
toE2,pg/mL
for patients 1-4 due to merged cells in publication table
- Only copy/pasted from
- Tidy file:
intermediate-files/wang-2021-allele-mutations.tsv
# remove extra columns, split position, reference and alternate alleles into separate columns, and sort alphabetically
cut -f3 copy-paste-files/wang-2021-allele-mutations-copy.txt | sed 's/\s//' | uniq | awk -F: '{print $0 "\t" substr($0, 3,3 ) "\t" substr($0, 6, 1) "\t" substr($0, 8, 1)}' > intermediate-files/wang-2021-allele-mutations.tsv
Reference allele:
- Above point mutations relative to allele B (NM_020227.3)
- Copy/paste NCBI fasta record to:
copy-paste-files/NM_020227.3-copy.txt
- Tidy file:
intermediate-files/NM_020227.3.fa
# collapse reference sequence to single line
sed '/>/ s/$/NEWLINE/' copy-paste-files/NM_020227.3-copy.txt | tr -d '\n' | sed 's/>/\n>/g' | sed 's/NEWLINE/\n/' > intermediate-files/NM_020227.3.fa
# get start position for the znf region
ZNFA=$(egrep "^A" intermediate-files/berg-2010-znf-sequences.tsv | cut -f2)
FULLSEQ=$(tail -1 intermediate-files/NM_020227.3.fa)
ZNFSTART=${#${FULLSEQ%%$ZNFA*}}
echo $ZNFSTART
# 1772
# mutations observed in wang 2021 do not occur in the zinc finger domain (instead occur between base 229-638); these sequences can be ignored
Allele sequences:
- Since the mutations occur before the zinc finger region, the alleles observed in this study are all reference alleles
Cataloging human PRDM9 allelic variation using long-read sequencing reveals PRDM9 population specificity and two distinct groupings of related alleles
PMID: 34805134
GenBank Accession Numbers: None
Znf DNA sequences:
- Includes
!A
-!N
,:A
-:V
,|1
-|9
,|A
-|J
,|a
-|j
- Save Supplementary Data File 2 to:
copy-paste-files/alleva-2021-SD2-znf-copy.tsv
- Tidy file:
intermediate-files/alleva-2021-znf-sequences.tsv
# remove extra lines and columns
grep "TGT" copy-paste-files/alleva-2021-SD2-znf-copy.tsv | cut -f1,4 > intermediate-files/alleva-2021-znf-sequences.tsv
# check if all unique
wc -l intermediate-files/alleva-2021-znf-sequences.tsv
# 81
cut -f2 intermediate-files/alleva-2021-znf-sequences.tsv | sort | uniq | wc -l
# 81
Znf amino acid sequences:
- Includes
!A
-!N
,:A
-:V
,|1
-|9
,|A
-|J
,|a
-|j
- From the same Supplementary Data File 2 for znf DNA sequences:
copy-paste-files/alleva-2021-SD2-znf-copy.tsv
- Tidy file:
intermediate-files/alleva-2021-znf-aminos.tsv
# remove extra lines and columns
grep "TGT" copy-paste-files/alleva-2021-SD2-znf-copy.tsv | cut -f1,5 > intermediate-files/alleva-2021-znf-aminos.tsv
Allele znf content:
- Includes alleles
A
-E
,F
,H
-I
,L1
-L24
,L27
,M1
-M32
- Also includes 542 additional alleles from Jeffreys et al. 2013 observed in sperm or blood
- Save Supplementary Data File 3 to:
copy-paste-files/alleva-2021-SD3-allele-copy.tsv
- Tidy file:
intermediate-files/alleva-2021-allele-znf-content.tsv
# remove extra lines and columns
grep "TGT" copy-paste-files/alleva-2021-SD3-allele-copy.tsv | cut -f2,3 > intermediate-files/alleva-2021-allele-znf-content.tsv
Allele DNA sequences:
- Includes alleles
A
-E
,F
,H
-I
,L1
-L24
,L27
,M1
-M32
- Also includes 542 additional alleles from Jeffreys et al. 2013 observed in sperm or blood
- From the same Supplementary Data File 3 for allele znf content:
copy-paste-files/alleva-2021-SD3-allele-copy.tsv
- Tidy file:
intermediate-files/alleva-2021-allele-sequences.tsv
# remove extra lines and columns
grep "TGT" copy-paste-files/alleva-2021-SD3-allele-copy.tsv | cut -f2,6 > intermediate-files/alleva-2021-allele-sequences.tsv