**Be aware that VCF files must contain only GT in the per-sample (FORMAT) fields at every variant site; otherwise GTs will be read incorrectly.** This is a known bug that will be fixed eventually. Until it is fixed, you can ensure proper loading of your VCF file by first running:
bcftools annotate -x FMT in.vcf -Ov -o out.vcf
asd is a multi-threaded program that quickly computes pairwise allele sharing distances, genomic relationship matrices (GRMs), or frequency-weighted allele similarities between diploid individuals with an arbitrary number of alleles per locus (GRMs are biallelic only). asd reads TPED/TFAM, STRU, and VCF formatted files and loads gzipped files without requiring decompression first. If your data are not biallelic, you must use the --multiallelic flag. This flag is not compatible with --grm and is set automatically when using --weighted.
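To make the quantity concrete, here is a minimal Python sketch of one common definition of allele sharing distance: at each locus two diploid individuals share 0, 1, or 2 alleles, the proportion shared is that count divided by 2, and the distance is 1 minus the mean proportion over loci where neither individual has missing data. This is an illustration only, not asd's actual implementation; the function names and the genotype values are hypothetical, and the missing code "-9" is simply the stru default.

```python
from collections import Counter

def shared_alleles(g1, g2):
    """Number of alleles (0, 1, or 2) two diploid genotypes share,
    counted as the size of the multiset intersection."""
    return sum((Counter(g1) & Counter(g2)).values())

def allele_sharing_distance(ind1, ind2, missing="-9"):
    """1 - mean proportion of shared alleles over loci where both
    individuals are fully genotyped. Assumed definition, see lead-in."""
    total = 0.0
    used = 0
    for g1, g2 in zip(ind1, ind2):
        if missing in g1 or missing in g2:
            continue  # skip loci with missing data in either individual
        total += shared_alleles(g1, g2) / 2
        used += 1
    if used == 0:
        return float("nan")  # no comparable loci
    return 1.0 - total / used

# Hypothetical genotypes at four loci (alleles coded as strings).
a = [("1", "1"), ("1", "2"), ("-9", "-9"), ("3", "4")]
b = [("1", "2"), ("1", "2"), ("2", "2"), ("4", "4")]
print(allele_sharing_distance(a, b))
```

Here three loci are comparable and the shared proportions are 0.5, 1.0, and 0.5, so the distance is 1 - 2/3.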
usage
./asd --stru <filename> --id-col <column where individual names are>
or
./asd --tped <filename>.tped --tfam <filename>.tfam
or
./asd --vcf <filename>.vcf
optional flags (see Command Line Arguments below): --combine, --grm, --ibs, --log, --long, --long-ibs, --missing-stru, --missing-tped, --multiallelic, --out, --partial, --threads, --weighted
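Because of the VCF caveat at the top of this README, it can be useful to check a VCF before passing it to --vcf. The Python sketch below scans VCF lines and reports any record whose FORMAT column (the ninth tab-separated column in the VCF specification) is not exactly GT. The function name and the example records are hypothetical, and the check is not part of asd.

```python
def non_gt_format_sites(lines):
    """Return (chrom, pos, format) for every record whose FORMAT
    column is not exactly 'GT'."""
    offenders = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip meta-information, header, and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # record has no FORMAT/sample columns
        if fields[8] != "GT":
            offenders.append((fields[0], fields[1], fields[8]))
    return offenders

# Hypothetical two-sample VCF; the second record carries an extra DP field.
example = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tind0\tind1\n",
    "1\t100\t.\tA\tG\t.\tPASS\t.\tGT\t0/1\t1/1\n",
    "1\t200\t.\tC\tT\t.\tPASS\t.\tGT:DP\t0/0:12\t0/1:9\n",
]
print(non_gt_format_sites(example))  # [('1', '200', 'GT:DP')]
```

For a real file, pass an open file handle (or gzip.open(path, "rt") for gzipped input) instead of the list. An empty result suggests the file already satisfies the GT-only requirement.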
CHANGE LOG
----------
18 FEB 2020 - Release of v1.1.0a fixes a bug when reading VCF without --multiallelic set.
14 FEB 2020 - Release of v1.1.0 includes support for VCF files and computing the weighted allele similarity score from Greenbaum et al. 2019 Genome Research 29:2020-2033.
asd v1.1.0a
----------Command Line Arguments----------
--combine <string1> ... <stringN>: Combine several files
    generated with --partial.
    Default: __none

--grm <bool>: Calculate the genomic relationship matrix.
    Default: false

--help <bool>: Prints this help dialog.
    Default: false

--ibs <bool>: Set to output all IBS sharing matrices too.
    Default: false

--id <int>: Column where individual IDs are. (stru files only)
    Default: 1

--log <bool>: Transforms allele sharing distance by -ln(ps) instead of 1-(ps).
    Default: false

--long <bool>: Instead of printing a matrix, print allele sharing distances one
    per row. Formatted <ID1> <ID2> <allele sharing distance>.
    Not compatible with --partial.
    Default: false

--long-ibs <bool>: Instead of printing a matrix, print IBS calculations one
    per row. Formatted <ID1> <ID2> <IBS0/1/2>.
    Not compatible with --partial.
    Default: false

--missing-stru <string>: For stru files, set the missing data value.
    Default: -9

--missing-tped <string>: For tped files, set the missing data value.
    Default: 0

--multiallelic <bool>: Set if there are more than two alleles at at least one locus.
    Computations are less efficient with this flag set.
    Default: false

--out <string>: Basename for output files.
    Default: outfile

--partial <bool>: If set, outputs two nind x nind matrices.
    The first is the number of loci used for each pairwise
    comparison, and the second is the untransformed dist/IBS
    matrix. Useful for splitting up very large jobs.
    Can combine with --combine.
    Default: false

--stru <string>: The input data filename (stru format).
    Change missing data code with --missing-stru.
    Requires two header rows: locus names, and locus positions.
    Default: __none

--tfam <string>: The input data filename (tfam format).
    Requires a .tped file (--tped).
    Default: __none

--threads <int>: Number of threads to spawn for faster calculation.
    Default: 1

--tped <string>: The input data filename (tped format).
    Requires a .tfam file (--tfam).
    Change missing data code with --missing-tped.
    Default: __none

--vcf <string>: The input data filename (vcf format).
    Default: __none

--weighted <bool>: Calculate weighted allele sharing similarity scores from Greenbaum et al. 2019.
    Default: false
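The --log entry above describes two transforms of the proportion of shared alleles ps: the default 1-(ps) and the alternative -ln(ps). The Python sketch below only illustrates that arithmetic. The function name is hypothetical, and returning infinity when ps is 0 is an assumption of this sketch, not documented asd behaviour.

```python
import math

def transform(ps, log=False):
    """Turn a proportion of shared alleles into a distance.

    log=False gives 1 - ps (the default); log=True gives -ln(ps).
    """
    if log:
        if ps <= 0.0:
            return math.inf  # assumption: -ln(0) is treated as infinite
        return -math.log(ps)
    return 1.0 - ps

# ind0 vs ind2 in the first toy example share 3.5 alleles over 5 loci.
ps = 3.5 / 5
print(round(transform(ps), 6))            # 0.3
print(round(transform(ps, log=True), 6))  # 0.356675
```

Both transforms give 0 for identical genotypes (ps = 1), but -ln(ps) grows without bound as sharing approaches zero, whereas 1-(ps) is capped at 1.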
NOTE: Multi-dimensional scaling plots can be created easily in R with the following commands (assuming --partial was not used):
data<-read.table("<outfile>.asd.dist",header=TRUE)
data.mds<-cmdscale(data)
plot(data.mds)
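Before plotting, it can help to confirm that the distance matrix is square and symmetric. The Python sketch below assumes the whitespace-delimited layout shown in the toy examples in this README (a header row of individual IDs, then one row per individual with the ID in the first column). The function names are hypothetical, and the embedded text is the first toy output.

```python
def read_dist_matrix(text):
    """Parse a distance matrix into (ids, {row_id: {col_id: value}})."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    ids = rows[0]  # header row: column IDs only
    matrix = {}
    for row in rows[1:]:
        name, values = row[0], [float(v) for v in row[1:]]
        if len(values) != len(ids):
            raise ValueError("row %s has %d values, expected %d"
                             % (name, len(values), len(ids)))
        matrix[name] = dict(zip(ids, values))
    if sorted(matrix) != sorted(ids):
        raise ValueError("row IDs do not match column IDs")
    return ids, matrix

def is_symmetric(ids, matrix, tol=1e-9):
    """True if d(i, j) equals d(j, i) for every pair, within tol."""
    return all(abs(matrix[i][j] - matrix[j][i]) <= tol
               for i in ids for j in ids)

TOY = """\
ind0 ind1 ind2 ind3 ind4 ind5
ind0 0 0.5 0.3 0.2 0.3 0.5
ind1 0.5 0 0.2 0.3 0.4 0.8
ind2 0.3 0.2 0 0.1 0.2 0.6
ind3 0.2 0.3 0.1 0 0.3 0.5
ind4 0.3 0.4 0.2 0.3 0 0.6
ind5 0.5 0.8 0.6 0.5 0.6 0
"""
ids, dist = read_dist_matrix(TOY)
print(dist["ind1"]["ind5"])     # 0.8
print(is_symmetric(ids, dist))  # True
```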
A toy example with no missing data:
./asd --stru test_data_1.stru --id-col 1
output:
      ind0  ind1  ind2  ind3  ind4  ind5
ind0  0     0.5   0.3   0.2   0.3   0.5
ind1  0.5   0     0.2   0.3   0.4   0.8
ind2  0.3   0.2   0     0.1   0.2   0.6
ind3  0.2   0.3   0.1   0     0.3   0.5
ind4  0.3   0.4   0.2   0.3   0     0.6
ind5  0.5   0.8   0.6   0.5   0.6   0
./asd --stru test_data_1.stru --id-col 1 --partial
output:
6
5 5 5 5 5 5
5 5 5 5 5 5
5 5 5 5 5 5
5 5 5 5 5 5
5 5 5 5 5 5
5 5 5 5 5 5
      ind0  ind1  ind2  ind3  ind4  ind5
ind0  5     2.5   3.5   4     3.5   2.5
ind1  2.5   5     4     3.5   3     1
ind2  3.5   4     5     4.5   4     2
ind3  4     3.5   4.5   5     3.5   2.5
ind4  3.5   3     4     3.5   5     2
ind5  2.5   1     2     2.5   2     5
A toy example with missing data:
./asd --stru test_data_2.stru --id-col 1
output:
      ind0      ind1      ind2      ind3      ind4      ind5
ind0  0         0.5       0.25      0.125     0.333333  0.166667
ind1  0.5       0         0.125     0.25      0.333333  0.666667
ind2  0.25      0.125     0         0.1       0.125     0.5
ind3  0.125     0.25      0.1       0         0.25      0.375
ind4  0.333333  0.333333  0.125     0.25      0         0.625
ind5  0.166667  0.666667  0.5       0.375     0.625     0
./asd --stru test_data_2.stru --id-col 1 --partial
output:
6
4 3 4 4 3 3
3 4 4 4 3 3
4 4 5 5 4 4
4 4 5 5 4 4
3 3 4 4 4 4
3 3 4 4 4 4
      ind0  ind1  ind2  ind3  ind4  ind5
ind0  4     1.5   3     3.5   2     2.5
ind1  1.5   4     3.5   3     2     1
ind2  3     3.5   5     4.5   3.5   2
ind3  3.5   3     4.5   5     3     2.5
ind4  2     2     3.5   3     4     1.5
ind5  2.5   1     2     2.5   1.5   4
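The --partial output relates to the distance matrix by simple arithmetic: each distance is 1 minus the untransformed sharing sum divided by the number of loci used for that pair. The Python sketch below parses the --partial output shown above for test_data_2.stru and recovers the distances. It also shows one plausible way chunks could be merged, by summing both matrices elementwise before transforming. That merging rule is an assumption of this sketch, not a description of how --combine is implemented, and all function names are hypothetical.

```python
PARTIAL = """\
6
4 3 4 4 3 3
3 4 4 4 3 3
4 4 5 5 4 4
4 4 5 5 4 4
3 3 4 4 4 4
3 3 4 4 4 4
ind0 ind1 ind2 ind3 ind4 ind5
ind0 4 1.5 3 3.5 2 2.5
ind1 1.5 4 3.5 3 2 1
ind2 3 3.5 5 4.5 3.5 2
ind3 3.5 3 4.5 5 3 2.5
ind4 2 2 3.5 3 4 1.5
ind5 2.5 1 2 2.5 1.5 4
"""

def parse_partial(text):
    """Return (ids, loci_counts, sharing_sums) from --partial style text."""
    rows = [line.split() for line in text.splitlines() if line.strip()]
    n = int(rows[0][0])                      # number of individuals
    counts = [[float(x) for x in r] for r in rows[1:1 + n]]
    ids = rows[1 + n]                        # header row of the second matrix
    sums = [[float(x) for x in r[1:]] for r in rows[2 + n:2 + 2 * n]]
    return ids, counts, sums

def combine(chunks):
    """Assumed merge rule: add counts and sums elementwise across chunks."""
    ids = chunks[0][0]
    n = len(ids)
    counts = [[0.0] * n for _ in range(n)]
    sums = [[0.0] * n for _ in range(n)]
    for chunk_ids, c, s in chunks:
        if chunk_ids != ids:
            raise ValueError("chunks list individuals in different orders")
        for i in range(n):
            for j in range(n):
                counts[i][j] += c[i][j]
                sums[i][j] += s[i][j]
    return ids, counts, sums

def to_distance(counts, sums):
    """1 - sum/count per pair; NaN where no loci were comparable."""
    return [[1.0 - s / c if c > 0 else float("nan")
             for s, c in zip(srow, crow)]
            for srow, crow in zip(sums, counts)]

ids, counts, sums = parse_partial(PARTIAL)
dist = to_distance(counts, sums)
print(round(dist[0][5], 6))  # 0.166667, matching ind0 vs ind5 above
```

Under this merge rule, combining a chunk with a copy of itself doubles both matrices and leaves every distance unchanged, which is a quick sanity check on the arithmetic.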