This repository contains character-based ngrams (unigrams, bigrams, trigrams) built from programming-language source code (Python, JavaScript, TypeScript, Rust, CSS). They were used for the development of the Granite Layout and are compatible with the Keyboard Layout Optimizer by Dario Götz. Atypical characters were removed from the corpus before the ngrams were created (see below for details).
The corpus is a mixture of several datasets with the following weights:
- 40% Python (94.2 MB text corpus)
- 10% Rust (29.2 MB text corpus)
- 20% TypeScript (80.3 MB text corpus)
- 20% JavaScript (142.6 MB text corpus)
- 10% CSS (33.0 MB text corpus)
Other related repos: granite-english-ngrams, granite-finnish-ngrams and granite-tools.
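The tables below list the share of each character ngram in percent. As a rough illustration of what such a number means, here is a minimal sketch that counts character ngrams in a string. It is not the code used to build this dataset (the Keyboard Layout Optimizer's ngrams binary was used for that); the function name ngram_percentages and the sample string are made up for this example.

```python
from collections import Counter


def ngram_percentages(text: str, n: int, ignore_case: bool = True) -> dict[str, float]:
    """Return the share (in percent) of every character ngram of length n."""
    if ignore_case:
        text = text.lower()
    # Slide a window of n characters over the text and count each window
    counts = Counter(text[i : i + n] for i in range(len(text) - n + 1))
    total = sum(counts.values())
    # most_common() yields the ngrams ordered from most to least frequent
    return {ngram: 100 * count / total for ngram, count in counts.most_common()}


# Hypothetical sample input; a real corpus would be read from a file
sample = "self.value = value\n"
for ngram, share in list(ngram_percentages(sample, 2).items())[:5]:
    print(f"{ngram!r}: {share:.2f}")
```

With n=1 this gives unigram shares, with n=2 bigrams and with n=3 trigrams.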
Most common ngrams, shown with the ngram_show tool from granite-tools, case ignored (unit: percent)
Most common unigrams (Granite Code)
────────────────────── code ──────────────────────
1: ␣ ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 7.85
2: e ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 7.81
3: t ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 5.94
4: a ▇▇▇▇▇▇▇▇▇▇▇▇▇ 4.71
5: r ▇▇▇▇▇▇▇▇▇▇▇▇▇ 4.70
6: s ▇▇▇▇▇▇▇▇▇▇▇▇ 4.46
7: o ▇▇▇▇▇▇▇▇▇▇▇▇ 4.33
8: n ▇▇▇▇▇▇▇▇▇▇▇▇ 4.16
9: i ▇▇▇▇▇▇▇▇▇▇▇ 4.02
10: ⏎ ▇▇▇▇▇▇▇▇▇ 3.30
11: l ▇▇▇▇▇▇▇▇▇ 3.23
12: c ▇▇▇▇▇▇▇ 2.55
13: d ▇▇▇▇▇▇▇ 2.38
14: p ▇▇▇▇▇▇▇ 2.38
15: . ▇▇▇▇▇▇ 2.09
16: u ▇▇▇▇▇▇ 2.02
17: m ▇▇▇▇▇▇ 2.00
18: - ▇▇▇▇▇ 1.92
19: , ▇▇▇▇▇ 1.82
20: f ▇▇▇▇▇ 1.76
21: ) ▇▇▇▇ 1.46
22: ( ▇▇▇▇ 1.46
23: " ▇▇▇▇ 1.44
24: h ▇▇▇▇ 1.33
25: : ▇▇▇▇ 1.25
26: g ▇▇▇ 1.21
27: b ▇▇▇ 1.21
28: _ ▇▇▇ 1.20
29: 0 ▇▇▇ 1.07
30: 1 ▇▇▇ 1.06
31: = ▇▇▇ 1.00
32: y ▇▇ 0.88
33: 2 ▇▇ 0.79
34: x ▇▇ 0.77
35: v ▇▇ 0.76
36: ' ▇▇ 0.70
37: { ▇▇ 0.57
38: } ▇▇ 0.57
39: / ▇▇ 0.56
40: w ▇ 0.53
41: ; ▇ 0.53
42: 3 ▇ 0.49
43: k ▇ 0.48
44: 4 ▇ 0.48
45: 5 ▇ 0.42
46: > ▇ 0.35
47: 6 ▇ 0.34
48: 8 ▇ 0.34
49: [ ▇ 0.32
50: ] ▇ 0.32
51: * ▇ 0.30
52: 9 ▇ 0.28
53: 7 ▇ 0.27
54: q ▇ 0.21
55: # ▇ 0.19
56: < ▇ 0.18
57: ` 0.18
58: j 0.17
59: z 0.17
60: \ 0.12
61: | 0.10
62: + 0.09
63: & 0.09
64: ! 0.07
65: @ 0.06
66: $ 0.06
67: ? 0.05
68: % 0.04
69: ^ 0.01
70: ~ 0.01
71: 0.00
72: € 0.00
Most common bigrams (Granite Code)
────────────────────── code ──────────────────────
1: er ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1.01
2: on ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.98
3: ,␣ ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.98
4: re ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.94
5: in ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.93
6: te ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.88
7: se ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.80
8: or ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.76
9: st ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.72
10: at ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.70
11: es ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.67
12: :␣ ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.66
13: de ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.66
14: en ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.65
15: t␣ ▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.64
16: co ▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.63
17: le ▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.62
18: nt ▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.61
19: e␣ ▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.60
20: ar ▇▇▇▇▇▇▇▇▇▇▇▇ 0.59
21: ti ▇▇▇▇▇▇▇▇▇▇▇▇ 0.56
22: -- ▇▇▇▇▇▇▇▇▇▇▇ 0.55
23: ,⏎ ▇▇▇▇▇▇▇▇▇▇▇ 0.54
24: th ▇▇▇▇▇▇▇▇▇▇▇ 0.53
25: ␣= ▇▇▇▇▇▇▇▇▇▇▇ 0.53
26: al ▇▇▇▇▇▇▇▇▇▇▇ 0.52
27: ro ▇▇▇▇▇▇▇▇▇▇▇ 0.51
28: =␣ ▇▇▇▇▇▇▇▇▇▇ 0.49
29: ␣t ▇▇▇▇▇▇▇▇▇▇ 0.49
30: me ▇▇▇▇▇▇▇▇▇▇ 0.49
31: el ▇▇▇▇▇▇▇▇▇▇ 0.48
32: et ▇▇▇▇▇▇▇▇▇ 0.45
33: ;⏎ ▇▇▇▇▇▇▇▇▇ 0.44
34: as ▇▇▇▇▇▇▇▇▇ 0.43
35: an ▇▇▇▇▇▇▇▇▇ 0.41
36: s␣ ▇▇▇▇▇▇▇▇▇ 0.41
37: ex ▇▇▇▇▇▇▇▇▇ 0.41
38: ⏎⏎ ▇▇▇▇▇▇▇▇ 0.40
39: it ▇▇▇▇▇▇▇▇ 0.40
40: ra ▇▇▇▇▇▇▇▇ 0.39
Most common bigrams (Granite Code), whitespace ignored
────────────────────── code ──────────────────────
1: er ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1.28
2: on ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1.26
3: re ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1.19
4: in ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1.18
5: te ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1.13
6: se ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1.02
7: or ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.96
8: st ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.91
9: at ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.90
10: es ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.86
11: de ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.84
12: en ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.83
13: co ▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.81
14: le ▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.79
15: nt ▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.78
16: ar ▇▇▇▇▇▇▇▇▇▇▇▇ 0.75
17: ti ▇▇▇▇▇▇▇▇▇▇▇▇ 0.72
18: -- ▇▇▇▇▇▇▇▇▇▇▇ 0.70
19: th ▇▇▇▇▇▇▇▇▇▇▇ 0.68
20: al ▇▇▇▇▇▇▇▇▇▇▇ 0.66
21: ro ▇▇▇▇▇▇▇▇▇▇▇ 0.65
22: me ▇▇▇▇▇▇▇▇▇▇ 0.62
23: el ▇▇▇▇▇▇▇▇▇▇ 0.61
24: et ▇▇▇▇▇▇▇▇▇ 0.58
25: as ▇▇▇▇▇▇▇▇▇ 0.55
26: an ▇▇▇▇▇▇▇▇▇ 0.53
27: ex ▇▇▇▇▇▇▇▇▇ 0.52
28: it ▇▇▇▇▇▇▇▇ 0.51
29: ra ▇▇▇▇▇▇▇▇ 0.50
30: ta ▇▇▇▇▇▇▇▇ 0.50
31: is ▇▇▇▇▇▇▇▇ 0.50
32: rt ▇▇▇▇▇▇▇▇ 0.49
33: he ▇▇▇▇▇▇▇▇ 0.49
34: ne ▇▇▇▇▇▇▇▇ 0.48
35: ct ▇▇▇▇▇▇▇▇ 0.48
36: to ▇▇▇▇▇▇▇▇ 0.48
37: ed ▇▇▇▇▇▇▇▇ 0.48
38: tr ▇▇▇▇▇▇▇▇ 0.47
39: nd ▇▇▇▇▇▇▇▇ 0.47
40: io ▇▇▇▇▇▇▇ 0.45
Most common trigrams (Granite Code)
────────────────────── code ──────────────────────
1: --- ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.42
2: ␣=␣ ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.39
3: ion ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.33
4: ent ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.29
5: con ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.28
6: tio ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.27
7: ␣{⏎ ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.25
8: the ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.22
9: ing ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.21
10: ate ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.21
11: sel ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.20
12: ␣th ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.20
13: ␣␣␣ ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.20
14: ass ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.19
15: ect ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.18
16: ons ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.18
17: ort ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.18
18: );⏎ ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.18
19: rt␣ ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.17
20: ser ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.17
21: elf ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.17
22: def ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.16
23: ame ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.16
24: por ▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.16
25: ter ▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.16
26: est ▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.15
27: ␣in ▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.15
28: val ▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.15
29: ⏎}⏎ ▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.15
30: for ▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.15
31: exp ▇▇▇▇▇▇▇▇▇▇▇▇ 0.15
32: ert ▇▇▇▇▇▇▇▇▇▇▇▇ 0.15
33: typ ▇▇▇▇▇▇▇▇▇▇▇▇ 0.15
34: one ▇▇▇▇▇▇▇▇▇▇▇▇ 0.14
35: ype ▇▇▇▇▇▇▇▇▇▇▇▇ 0.14
36: str ▇▇▇▇▇▇▇▇▇▇▇▇ 0.14
37: ext ▇▇▇▇▇▇▇▇▇▇▇ 0.14
38: dat ▇▇▇▇▇▇▇▇▇▇▇ 0.14
39: col ▇▇▇▇▇▇▇▇▇▇▇ 0.13
40: ␣re ▇▇▇▇▇▇▇▇▇▇▇ 0.13
Most common trigrams (Granite Code), whitespace ignored
────────────────────── code ──────────────────────
1: --- ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.61
2: ion ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.47
3: ent ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.42
4: con ▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.41
5: tio ▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.39
6: the ▇▇▇▇▇▇▇▇▇▇ 0.31
7: ing ▇▇▇▇▇▇▇▇▇▇ 0.30
8: ate ▇▇▇▇▇▇▇▇▇▇ 0.30
9: sel ▇▇▇▇▇▇▇▇▇▇ 0.29
10: ass ▇▇▇▇▇▇▇▇▇ 0.27
11: ect ▇▇▇▇▇▇▇▇▇ 0.26
12: ons ▇▇▇▇▇▇▇▇ 0.25
13: ort ▇▇▇▇▇▇▇▇ 0.25
14: ser ▇▇▇▇▇▇▇▇ 0.25
15: elf ▇▇▇▇▇▇▇▇ 0.24
16: def ▇▇▇▇▇▇▇▇ 0.24
17: ame ▇▇▇▇▇▇▇▇ 0.23
18: por ▇▇▇▇▇▇▇▇ 0.23
19: ter ▇▇▇▇▇▇▇ 0.23
20: est ▇▇▇▇▇▇▇ 0.22
21: val ▇▇▇▇▇▇▇ 0.22
22: for ▇▇▇▇▇▇▇ 0.22
23: exp ▇▇▇▇▇▇▇ 0.22
24: ert ▇▇▇▇▇▇▇ 0.21
25: typ ▇▇▇▇▇▇▇ 0.21
26: one ▇▇▇▇▇▇▇ 0.20
27: ype ▇▇▇▇▇▇▇ 0.20
28: str ▇▇▇▇▇▇▇ 0.20
29: ext ▇▇▇▇▇▇▇ 0.20
30: dat ▇▇▇▇▇▇ 0.20
31: col ▇▇▇▇▇▇ 0.19
32: tes ▇▇▇▇▇▇ 0.19
33: der ▇▇▇▇▇▇ 0.19
34: mpo ▇▇▇▇▇▇ 0.19
35: nte ▇▇▇▇▇▇ 0.19
36: ont ▇▇▇▇▇▇ 0.19
37: tur ▇▇▇▇▇▇ 0.19
38: res ▇▇▇▇▇▇ 0.19
39: sta ▇▇▇▇▇▇ 0.18
40: pro ▇▇▇▇▇▇ 0.18
Most common trigrams (Granite Python)
───────────────────── python ─────────────────────
1: --- ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.98
2: ␣=␣ ▇▇▇▇▇▇▇▇▇▇▇ 0.53
3: sel ▇▇▇▇▇▇▇ 0.35
4: ␣␣␣ ▇▇▇▇▇▇▇ 0.34
5: elf ▇▇▇▇▇▇▇ 0.33
6: ass ▇▇▇▇▇▇▇ 0.32
7: ser ▇▇▇▇▇▇▇ 0.32
8: ion ▇▇▇▇▇▇ 0.32
9: the ▇▇▇▇▇▇ 0.27
10: ert ▇▇▇▇▇ 0.25
11: tio ▇▇▇▇▇ 0.25
12: rt␣ ▇▇▇▇▇ 0.25
13: def ▇▇▇▇▇ 0.24
14: ␣th ▇▇▇▇▇ 0.24
15: ␣in ▇▇▇▇▇ 0.24
16: sse ▇▇▇▇▇ 0.23
17: ):⏎ ▇▇▇▇▇ 0.23
18: ate ▇▇▇▇ 0.22
19: est ▇▇▇▇ 0.21
20: ame ▇▇▇▇ 0.21
21: )⏎⏎ ▇▇▇▇ 0.21
22: ing ▇▇▇▇ 0.21
23: lf. ▇▇▇▇ 0.21
24: val ▇▇▇▇ 0.21
25: tes ▇▇▇▇ 0.21
26: ␣no ▇▇▇▇ 0.20
27: ⏎de ▇▇▇▇ 0.20
28: dat ▇▇▇▇ 0.19
29: ef␣ ▇▇▇▇ 0.19
30: ect ▇▇▇▇ 0.18
31: ent ▇▇▇▇ 0.18
32: one ▇▇▇▇ 0.17
33: for ▇▇▇▇ 0.17
34: he␣ ▇▇▇▇ 0.17
35: int ▇▇▇ 0.17
36: typ ▇▇▇ 0.17
37: ",␣ ▇▇▇ 0.17
38: ":␣ ▇▇▇ 0.17
39: ⏎#␣ ▇▇▇ 0.17
40: ␣se ▇▇▇ 0.16
Most common trigrams (Granite TypeScript)
─────────────────── typescript ───────────────────
1: con ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.49
2: ent ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.43
3: ␣{⏎ ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.43
4: ons ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.35
5: ion ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.34
6: ␣=␣ ▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.32
7: ort ▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.31
8: tio ▇▇▇▇▇▇▇▇▇▇▇▇ 0.30
9: ":␣ ▇▇▇▇▇▇▇▇▇▇▇▇ 0.29
10: );⏎ ▇▇▇▇▇▇▇▇▇▇▇ 0.28
11: por ▇▇▇▇▇▇▇▇▇▇▇ 0.28
12: nst ▇▇▇▇▇▇▇▇▇▇▇ 0.27
13: the ▇▇▇▇▇▇▇▇▇▇ 0.26
14: st␣ ▇▇▇▇▇▇▇▇▇▇ 0.26
15: pro ▇▇▇▇▇▇▇▇▇▇ 0.25
16: rt␣ ▇▇▇▇▇▇▇▇▇▇ 0.25
17: ect ▇▇▇▇▇▇▇▇▇▇ 0.25
18: mpo ▇▇▇▇▇▇▇▇▇▇ 0.25
19: ⏎co ▇▇▇▇▇▇▇▇▇▇ 0.24
20: ␣th ▇▇▇▇▇▇▇▇▇▇ 0.24
21: ate ▇▇▇▇▇▇▇▇▇ 0.23
22: ing ▇▇▇▇▇▇▇▇▇ 0.23
23: ,⏎" ▇▇▇▇▇▇▇▇▇ 0.22
24: exp ▇▇▇▇▇▇▇▇▇ 0.22
25: :␣" ▇▇▇▇▇▇▇▇▇ 0.22
26: ;⏎⏎ ▇▇▇▇▇▇▇▇ 0.20
27: rop ▇▇▇▇▇▇▇▇ 0.20
28: rea ▇▇▇▇▇▇▇▇ 0.20
29: ext ▇▇▇▇▇▇▇▇ 0.19
30: rom ▇▇▇▇▇▇▇▇ 0.19
31: :␣' ▇▇▇▇▇▇▇▇ 0.19
32: fro ▇▇▇▇▇▇▇ 0.18
33: ␣re ▇▇▇▇▇▇▇ 0.18
34: ⏎ex ▇▇▇▇▇▇▇ 0.18
35: om␣ ▇▇▇▇▇▇▇ 0.17
36: men ▇▇▇▇▇▇▇ 0.17
37: act ▇▇▇▇▇▇▇ 0.17
38: der ▇▇▇▇▇▇▇ 0.17
39: ont ▇▇▇▇▇▇▇ 0.17
40: ",⏎ ▇▇▇▇▇▇▇ 0.17
The Keyboard Layout Analysis: Creating the Corpus, Bigram Chains, and Shakespeare's Monkeys [1] (here: kla-code) contains an ngram dataset for programming use called code-frequency.txt. The dataset contains only unigrams. One of the main differences is that kla-code is not aimed at optimizing for any specific programming language. The source corpus is taken from acmeism/RosettaCodeData. See, for example, Solve_a_Numbrix_puzzle for some code samples.
kla-code vs Granite Code: unigrams
- The kla-code corpus has clearly not had its indentation removed, as the share of space characters is very high (~25%).
- The Granite Code corpus does not contain any tabs (likely because tabs had been replaced by spaces in the repositories used). IDEs typically jump to the correct indentation level automatically, so the tab key is pressed less often than the number of tabs in a file would suggest.
- Parentheses (( and )) are used less in the corpus used by the Granite Code ngrams.
- The equals sign (=) is used less in the corpus used by the Granite Code ngrams.
─────────────────────kla-code───────────────────── ───────────────────────code───────────────────────
1: ␣ ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 24.87 1 ( +0): ␣ ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 7.85
2: e ▇▇▇▇▇ 5.08 2 ( +0): e ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 7.53
3: t ▇▇▇▇ 4.11 3 ( +0): t ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 5.62
4: ⏎ ▇▇▇ 3.61 4 ( +3): r ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 4.48
5: n ▇▇▇ 3.55 5 ( +3): a ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 4.43
6: i ▇▇▇ 3.40 6 ( +3): o ▇▇▇▇▇▇▇▇▇▇▇▇▇ 4.18
7: r ▇▇▇ 3.34 7 ( +3): s ▇▇▇▇▇▇▇▇▇▇▇▇▇ 4.12
8: a ▇▇▇ 3.00 8 ( -3): n ▇▇▇▇▇▇▇▇▇▇▇▇ 3.96
9: o ▇▇▇ 2.80 9 ( -3): i ▇▇▇▇▇▇▇▇▇▇▇ 3.73
10: s ▇▇▇ 2.77 10 ( -6): ⏎ ▇▇▇▇▇▇▇▇▇▇ 3.30
11: l ▇▇ 2.09 11 ( +0): l ▇▇▇▇▇▇▇▇▇ 3.09
12: ) ▇▇ 1.90 12 ( +3): c ▇▇▇▇▇▇▇ 2.27
13: ( ▇▇ 1.90 13 ( +1): d ▇▇▇▇▇▇▇ 2.18
14: d ▇▇ 1.73 14 ( +4): p ▇▇▇▇▇▇▇ 2.14
15: c ▇ 1.58 15 ( +8): . ▇▇▇▇▇▇ 2.09
16: , ▇ 1.49 16 ( +1): u ▇▇▇▇▇▇ 1.94
17: u ▇ 1.46 17 ( +8): - ▇▇▇▇▇▇ 1.92
18: p ▇ 1.33 18 ( -2): , ▇▇▇▇▇▇ 1.82
19: m ▇ 1.30 19 ( +0): m ▇▇▇▇▇▇ 1.81
20: f ▇ 1.18 20 ( +0): f ▇▇▇▇▇ 1.58
21: = ▇ 1.12 21 ( -9): ) ▇▇▇▇ 1.46
22: " ▇ 1.09 22 ( -9): ( ▇▇▇▇ 1.46
23: . ▇ 1.05 23 ( -1): " ▇▇▇▇ 1.44
24: h ▇ 1.02 24 ( +7): : ▇▇▇▇ 1.25
25: - ▇ 1.01 25 ( -1): h ▇▇▇▇ 1.24
26: 1 ▇ 1.01 26 (+13): _ ▇▇▇▇ 1.20
27: 0 ▇ 0.98 27 ( +1): g ▇▇▇ 1.11
28: g ▇ 0.90 28 ( -1): 0 ▇▇▇ 1.07
29: ; ▇ 0.78 29 ( -3): 1 ▇▇▇ 1.06
30: b ▇ 0.74 30 ( +0): b ▇▇▇ 1.00
31: : ▇ 0.70 31 (-10): = ▇▇▇ 1.00
32: y ▇ 0.61 32 ( +0): y ▇▇▇ 0.86
33: x ▇ 0.58 33 ( +1): 2 ▇▇ 0.79
34: 2 ▇ 0.57 34 ( -1): x ▇▇ 0.73
35: \t 0.53 35 (+10): ' ▇▇ 0.70
36: w 0.52 36 ( +4): v ▇▇ 0.67
37: [ 0.48 37 (+12): { ▇▇ 0.57
38: ] 0.47 38 (+13): } ▇▇ 0.57
39: _ 0.47 39 ( +8): / ▇▇ 0.56
40: v 0.47 40 (-11): ; ▇▇ 0.53
45: ' 0.41 43 ( -7): w ▇ 0.48
47: / 0.37 51 (-14): [ ▇ 0.32
49: { 0.37 52 (-14): ] ▇ 0.32
51: } 0.37 ??? (???): \t 0.00
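The number in parentheses in the right-hand column of the comparison tables appears to be the rank change of an ngram relative to the left-hand dataset, with ??? shown when the ngram does not occur there. The sketch below reproduces that reading for the top ten unigrams of the table above. It is an interpretation of the table, not code taken from granite-tools, and the function name rank_changes is hypothetical.

```python
def rank_changes(reference: list[str], other: list[str]) -> list[tuple[int, str, str]]:
    """For each ngram in `other`, return (rank, change vs. `reference`, ngram).

    Both lists must be ordered from most to least frequent. A positive change
    means the ngram ranks higher in `other` than in `reference`.
    """
    ref_rank = {ngram: rank for rank, ngram in enumerate(reference, start=1)}
    rows = []
    for rank, ngram in enumerate(other, start=1):
        if ngram in ref_rank:
            change = f"{ref_rank[ngram] - rank:+d}"
        else:
            change = "???"  # ngram is missing from the reference ranking
        rows.append((rank, change, ngram))
    return rows


# Top ten unigrams copied from the kla-code and Granite Code columns above
kla_top = [" ", "e", "t", "\n", "n", "i", "r", "a", "o", "s"]
code_top = [" ", "e", "t", "r", "a", "o", "s", "n", "i", "\n"]

for rank, change, ngram in rank_changes(kla_top, code_top):
    print(f"{rank:>3} ({change:>3}): {ngram!r}")
```

For example, r is ranked 7th in kla-code and 4th in Granite Code, which gives the +3 shown in the table.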
Download link: iweb-corpus-samples-cleaned.txt.xz. This is the same as ngrams/shai_english in the Keyboard Layout Optimizer [2], named after Shai Coleman, the creator of the Colemak layout [3].
Note that the iweb dataset contains ngrams for English, not for programming.
iweb vs Granite Code: unigrams
───────────────────────iweb─────────────────────── ───────────────────────code───────────────────────
1: ␣ ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 16.84 1 ( +0): ␣ ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 7.85
2: e ▇▇▇▇▇▇▇▇▇▇▇▇▇ 9.59 2 ( +0): e ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 7.81
3: t ▇▇▇▇▇▇▇▇▇▇ 7.28 3 ( +0): t ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 5.94
4: a ▇▇▇▇▇▇▇▇▇ 6.50 4 ( +0): a ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 4.71
5: o ▇▇▇▇▇▇▇▇ 6.22 5 ( +4): r ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 4.70
6: i ▇▇▇▇▇▇▇▇ 5.81 6 ( +2): s ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 4.46
7: n ▇▇▇▇▇▇▇▇ 5.54 7 ( -2): o ▇▇▇▇▇▇▇▇▇▇▇▇▇ 4.33
8: s ▇▇▇▇▇▇▇ 5.24 8 ( -1): n ▇▇▇▇▇▇▇▇▇▇▇▇▇ 4.16
9: r ▇▇▇▇▇▇▇ 4.93 9 ( -3): i ▇▇▇▇▇▇▇▇▇▇▇▇ 4.02
10: h ▇▇▇▇▇ 3.73 10 (+16): ⏎ ▇▇▇▇▇▇▇▇▇▇ 3.30
11: l ▇▇▇▇▇ 3.38 11 ( +0): l ▇▇▇▇▇▇▇▇▇▇ 3.23
12: d ▇▇▇▇ 2.96 12 ( +1): c ▇▇▇▇▇▇▇▇ 2.55
13: c ▇▇▇ 2.53 13 ( -1): d ▇▇▇▇▇▇▇ 2.38
14: u ▇▇▇ 2.36 14 ( +3): p ▇▇▇▇▇▇▇ 2.38
15: m ▇▇▇ 1.98 15 ( +7): . ▇▇▇▇▇▇ 2.09
16: f ▇▇ 1.74 16 ( -2): u ▇▇▇▇▇▇ 2.02
17: p ▇▇ 1.71 17 ( -2): m ▇▇▇▇▇▇ 2.00
18: g ▇▇ 1.67 18 ( +9): - ▇▇▇▇▇▇ 1.92
19: y ▇▇ 1.57 19 ( +5): , ▇▇▇▇▇▇ 1.82
20: w ▇▇ 1.47 20 ( -4): f ▇▇▇▇▇ 1.76
21: b ▇▇ 1.22 21 (+16): ) ▇▇▇▇ 1.46
22: . ▇ 0.89 22 (+16): ( ▇▇▇▇ 1.46
23: v ▇ 0.87 23 ( +7): " ▇▇▇▇ 1.44
24: , ▇ 0.82 24 (-14): h ▇▇▇▇ 1.33
25: k ▇ 0.65 25 (+14): : ▇▇▇▇ 1.25
26: ⏎ 0.35 26 ( -8): g ▇▇▇▇ 1.21
27: - 0.21 27 ( -6): b ▇▇▇▇ 1.21
28: ' 0.21 28 (???): _ ▇▇▇▇ 1.20
29: x 0.18 29 ( +2): 0 ▇▇▇ 1.07
30: " 0.15 30 ( +3): 1 ▇▇▇ 1.06
31: 0 0.15 31 (+26): = ▇▇▇ 1.00
32: j 0.14 32 (-13): y ▇▇▇ 0.88
33: 1 0.13 33 ( +1): 2 ▇▇ 0.79
34: 2 0.10 34 ( -5): x ▇▇ 0.77
35: q 0.08 35 (-12): v ▇▇ 0.76
36: z 0.08 36 ( -8): ' ▇▇ 0.70
37: ) 0.08 37 (???): { ▇▇ 0.57
38: ( 0.08 38 (???): } ▇▇ 0.57
39: : 0.06 39 (+10): / ▇▇ 0.56
40: 5 0.05 40 (-20): w ▇▇ 0.53
41: 3 0.05 41 ( +9): ; ▇▇ 0.53
42: 9 0.04 42 ( -1): 3 ▇▇ 0.49
43: ? 0.04 43 (-18): k ▇ 0.48
44: 4 0.04 44 ( +0): 4 ▇ 0.48
45: ! 0.04 45 ( -5): 5 ▇ 0.42
46: 6 0.04 46 (+10): > ▇ 0.35
47: 8 0.03 47 ( -1): 6 ▇ 0.34
48: 7 0.03 48 ( -1): 8 ▇ 0.34
49: / 0.02 49 (???): [ ▇ 0.32
50: ; 0.02 50 (???): ] ▇ 0.32
51: $ 0.01 51 ( +4): * ▇ 0.30
52: % 0.01 52 (-10): 9 ▇ 0.28
53: & 0.01 53 ( -5): 7 ▇ 0.27
54: + 0.01 54 (-19): q ▇ 0.21
55: * 0.00 55 ( +3): # ▇ 0.19
56: > 0.00 56 ( +4): < ▇ 0.18
57: = 0.00 57 (???): ` ▇ 0.18
58: # 0.00 58 (-26): j ▇ 0.17
59: @ 0.00 59 (-23): z ▇ 0.17
60: < 0.00 60 (???): \ 0.12
61: 0.00 61 (???): | 0.10
62: ’ 0.00 62 ( -8): + 0.09
63: ” 0.00 63 (-10): & 0.09
64: “ 0.00 64 (-19): ! 0.07
65: — 0.00 65 ( -6): @ 0.06
66: ü 0.00 66 (-15): $ 0.06
67: – 0.00 67 (-24): ? 0.05
68: • 0.00 68 (-16): % 0.04
69: ¢ 0.00 69 (???): ^ 0.01
70: ´ 0.00 70 (???): ~ 0.01
71: é 0.00 71 (???): 0.00
72: ʼ 0.00 72 (???): € 0.00
73: ® 0.00 ??? (???): ö 0.00
74: ¤ 0.00 ??? (???): ° 0.00
75: ‐ 0.00 ??? (???): ⅜ 0.00
76: § 0.00 ??? (???): ⅓ 0.00
77: ä 0.00 ??? (???): ‐ 0.00
78: ⅓ 0.00 ??? (???): ⅔ 0.00
79: ′ 0.00 ??? (???): ü 0.00
80: ć 0.00 ??? (???): ′ 0.00
81: → 0.00 ??? (???): ® 0.00
82: и 0.00 ??? (???): ć 0.00
83: ⅔ 0.00 ??? (???): “ 0.00
84: ⅜ 0.00 ??? (???): 0.00
85: ¬ 0.00 ??? (???): ´ 0.00
86: 0.00 ??? (???): — 0.00
87: › 0.00 ??? (???): ” 0.00
88: ö 0.00 ??? (???): ¬ 0.00
89: ☺ 0.00 ??? (???): › 0.00
90: · 0.00 ??? (???): é 0.00
91: ° 0.00 ??? (???): § 0.00
92: ÷ 0.00 ??? (???): â 0.00
93: â 0.00 ??? (???): · 0.00
94: ⅛ 0.00 ??? (???): и 0.00
???: \ 0.00 ??? (???): ⅛ 0.00
???: 0.00 ??? (???): ÷ 0.00
???: ^ 0.00 ??? (???): ¤ 0.00
???: } 0.00 ??? (???): – 0.00
???: _ 0.00 ??? (???): ☺ 0.00
???: ` 0.00 ??? (???): • 0.00
???: € 0.00 ??? (???): ¢ 0.00
???: | 0.00 ??? (???): 0.00
???: ] 0.00 ??? (???): → 0.00
???: [ 0.00 ??? (???): ’ 0.00
???: { 0.00 ??? (???): ʼ 0.00
???: ~ 0.00 ??? (???): ä 0.00
iweb vs Granite Code: bigrams
───────────────────────iweb─────────────────────── ───────────────────────code───────────────────────
1: e␣ ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.94 1 ( +10): er ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1.01
2: ␣t ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.50 2 ( +15): on ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.98
3: th ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.03 3 ( +23): ,␣ ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.98
4: ␣a ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1.99 4 ( +10): re ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.94
5: s␣ ▇▇▇▇▇▇▇▇▇▇▇▇▇ 1.84 5 ( +2): in ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.93
6: he ▇▇▇▇▇▇▇▇▇▇▇▇ 1.66 6 ( +28): te ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.88
7: in ▇▇▇▇▇▇▇▇▇▇▇ 1.55 7 ( +44): se ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.80
8: t␣ ▇▇▇▇▇▇▇▇▇▇▇ 1.49 8 ( +15): or ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.76
9: d␣ ▇▇▇▇▇▇▇▇▇▇ 1.40 9 ( +28): st ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.72
10: an ▇▇▇▇▇▇▇▇▇ 1.25 10 ( +11): at ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.70
11: er ▇▇▇▇▇▇▇▇▇ 1.21 11 ( +16): es ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.67
12: n␣ ▇▇▇▇▇▇▇▇▇ 1.20 12 ( +295): :␣ ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.66
13: ␣i ▇▇▇▇▇▇▇▇▇ 1.20 13 ( +52): de ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.66
14: re ▇▇▇▇▇▇▇▇ 1.14 14 ( +11): en ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.65
15: ␣s ▇▇▇▇▇▇▇▇ 1.12 15 ( -7): t␣ ▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.64
16: ␣o ▇▇▇▇▇▇▇▇ 1.06 16 ( +44): co ▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.63
17: on ▇▇▇▇▇▇▇ 0.99 17 ( +33): le ▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.62
18: r␣ ▇▇▇▇▇▇▇ 0.97 18 ( +28): nt ▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.61
19: ␣w ▇▇▇▇▇▇▇ 0.95 19 ( -18): e␣ ▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.60
20: ␣c ▇▇▇▇▇▇ 0.86 20 ( +18): ar ▇▇▇▇▇▇▇▇▇▇▇▇ 0.59
21: at ▇▇▇▇▇▇ 0.86 21 ( +15): ti ▇▇▇▇▇▇▇▇▇▇▇▇ 0.56
22: y␣ ▇▇▇▇▇▇ 0.84 22 ( +542): -- ▇▇▇▇▇▇▇▇▇▇▇ 0.55
23: or ▇▇▇▇▇▇ 0.83 23 (+1096): ,⏎ ▇▇▇▇▇▇▇▇▇▇▇ 0.54
24: nd ▇▇▇▇▇▇ 0.83 24 ( -21): th ▇▇▇▇▇▇▇▇▇▇▇ 0.53
25: en ▇▇▇▇▇▇ 0.80 25 ( +811): ␣= ▇▇▇▇▇▇▇▇▇▇▇ 0.53
26: ,␣ ▇▇▇▇▇▇ 0.79 26 ( +16): al ▇▇▇▇▇▇▇▇▇▇▇ 0.52
27: es ▇▇▇▇▇▇ 0.79 27 ( +35): ro ▇▇▇▇▇▇▇▇▇▇▇ 0.51
28: o␣ ▇▇▇▇▇▇ 0.77 28 ( +798): =␣ ▇▇▇▇▇▇▇▇▇▇ 0.49
29: to ▇▇▇▇▇ 0.76 29 ( -27): ␣t ▇▇▇▇▇▇▇▇▇▇ 0.49
30: ou ▇▇▇▇▇ 0.76 30 ( +28): me ▇▇▇▇▇▇▇▇▇▇ 0.49
31: ␣b ▇▇▇▇▇ 0.75 31 ( +53): el ▇▇▇▇▇▇▇▇▇▇ 0.48
32: ng ▇▇▇▇▇ 0.73 32 ( +62): et ▇▇▇▇▇▇▇▇▇ 0.45
33: it ▇▇▇▇▇ 0.72 33 ( +959): ;⏎ ▇▇▇▇▇▇▇▇▇ 0.44
34: te ▇▇▇▇▇ 0.70 34 ( +19): as ▇▇▇▇▇▇▇▇▇ 0.43
35: ␣f ▇▇▇▇▇ 0.70 35 ( -25): an ▇▇▇▇▇▇▇▇▇ 0.41
36: ti ▇▇▇▇▇ 0.69 36 ( -31): s␣ ▇▇▇▇▇▇▇▇▇ 0.41
37: st ▇▇▇▇▇ 0.69 37 ( +150): ex ▇▇▇▇▇▇▇▇▇ 0.41
38: ar ▇▇▇▇▇ 0.68 38 (???): ⏎⏎ ▇▇▇▇▇▇▇▇ 0.40
39: ␣p ▇▇▇▇▇ 0.67 39 ( -6): it ▇▇▇▇▇▇▇▇ 0.40
40: ␣m ▇▇▇▇▇ 0.64 40 ( +34): ra ▇▇▇▇▇▇▇▇ 0.39
42: al ▇▇▇▇▇ 0.63 42 ( -27): ␣s ▇▇▇▇▇▇▇▇ 0.39
46: nt ▇▇▇▇ 0.59 45 ( -39): he ▇▇▇▇▇▇▇▇ 0.38
50: le ▇▇▇▇ 0.55 48 ( -35): ␣i ▇▇▇▇▇▇▇▇ 0.38
51: se ▇▇▇▇ 0.55 49 ( -20): to ▇▇▇▇▇▇▇▇ 0.38
53: as ▇▇▇▇ 0.52 51 ( -39): n␣ ▇▇▇▇▇▇▇▇ 0.37
58: me ▇▇▇▇ 0.50 53 ( -29): nd ▇▇▇▇▇▇▇▇ 0.37
60: co ▇▇▇ 0.47 55 ( -51): ␣a ▇▇▇▇▇▇▇▇ 0.36
62: ro ▇▇▇ 0.46 64 ( -32): ng ▇▇▇▇▇▇▇ 0.32
65: de ▇▇▇ 0.44 69 ( -49): ␣c ▇▇▇▇▇▇ 0.29
74: ra ▇▇▇ 0.37 72 ( -37): ␣f ▇▇▇▇▇▇ 0.29
84: el ▇▇ 0.33 90 ( -72): r␣ ▇▇▇▇▇ 0.25
94: et ▇▇ 0.29 102 ( -72): ou ▇▇▇▇▇ 0.23
187: ex ▇ 0.12 104 ( -95): d␣ ▇▇▇▇▇ 0.23
307: :␣ 0.04 111 ( -80): ␣b ▇▇▇▇ 0.21
564: -- 0.01 113 ( -74): ␣p ▇▇▇▇ 0.21
826: =␣ 0.00 138 ( -122): ␣o ▇▇▇▇ 0.18
836: ␣= 0.00 170 ( -130): ␣m ▇▇▇ 0.15
992: ;⏎ 0.00 225 ( -203): y␣ ▇▇ 0.11
1119: ,⏎ 0.00 229 ( -210): ␣w ▇▇ 0.11
???: ⏎⏎ 0.00 258 ( -230): o␣ ▇▇ 0.09
iweb vs Granite Code: bigrams, whitespace ignored
───────────────────────iweb─────────────────────── ───────────────────────code───────────────────────
1: th ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 3.10 1 ( +4): er ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1.28
2: he ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.53 2 ( +5): on ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1.26
3: in ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.36 3 ( +3): re ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1.19
4: an ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1.90 4 ( -1): in ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1.18
5: er ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1.85 5 ( +12): te ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1.13
6: re ▇▇▇▇▇▇▇▇▇▇▇▇▇ 1.74 6 ( +22): se ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1.02
7: on ▇▇▇▇▇▇▇▇▇▇▇ 1.51 7 ( +2): or ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.96
8: at ▇▇▇▇▇▇▇▇▇▇ 1.31 8 ( +11): st ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.91
9: or ▇▇▇▇▇▇▇▇▇ 1.27 9 ( -1): at ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.90
10: nd ▇▇▇▇▇▇▇▇▇ 1.27 10 ( +2): es ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.86
11: en ▇▇▇▇▇▇▇▇▇ 1.22 11 ( +26): de ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.84
12: es ▇▇▇▇▇▇▇▇▇ 1.20 12 ( -1): en ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.83
13: to ▇▇▇▇▇▇▇▇▇ 1.16 13 ( +20): co ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.81
14: ou ▇▇▇▇▇▇▇▇▇ 1.16 14 ( +13): le ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.79
15: ng ▇▇▇▇▇▇▇▇ 1.11 15 ( +10): nt ▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.78
16: it ▇▇▇▇▇▇▇▇ 1.09 16 ( +4): ar ▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.75
17: te ▇▇▇▇▇▇▇▇ 1.07 17 ( +1): ti ▇▇▇▇▇▇▇▇▇▇▇▇ 0.72
18: ti ▇▇▇▇▇▇▇▇ 1.06 18 (+438): -- ▇▇▇▇▇▇▇▇▇▇▇▇ 0.70
19: st ▇▇▇▇▇▇▇▇ 1.05 19 ( -18): th ▇▇▇▇▇▇▇▇▇▇▇▇ 0.68
20: ar ▇▇▇▇▇▇▇▇ 1.03 20 ( +1): al ▇▇▇▇▇▇▇▇▇▇▇ 0.66
21: al ▇▇▇▇▇▇▇ 0.97 21 ( +14): ro ▇▇▇▇▇▇▇▇▇▇▇ 0.65
22: is ▇▇▇▇▇▇▇ 0.96 22 ( +10): me ▇▇▇▇▇▇▇▇▇▇▇ 0.62
23: ed ▇▇▇▇▇▇▇ 0.96 23 ( +29): el ▇▇▇▇▇▇▇▇▇▇ 0.61
24: ha ▇▇▇▇▇▇▇ 0.93 24 ( +36): et ▇▇▇▇▇▇▇▇▇▇ 0.58
25: nt ▇▇▇▇▇▇▇ 0.90 25 ( +4): as ▇▇▇▇▇▇▇▇▇ 0.55
26: ve ▇▇▇▇▇▇ 0.86 26 ( -22): an ▇▇▇▇▇▇▇▇▇ 0.53
27: le ▇▇▇▇▇▇ 0.84 27 (+117): ex ▇▇▇▇▇▇▇▇▇ 0.52
28: se ▇▇▇▇▇▇ 0.84 28 ( -12): it ▇▇▇▇▇▇▇▇▇ 0.51
29: as ▇▇▇▇▇▇ 0.79 29 ( +14): ra ▇▇▇▇▇▇▇▇▇ 0.50
30: ea ▇▇▇▇▇▇ 0.77 30 ( +26): ta ▇▇▇▇▇▇▇▇▇ 0.50
31: of ▇▇▇▇▇▇ 0.76 31 ( -9): is ▇▇▇▇▇▇▇▇ 0.50
32: me ▇▇▇▇▇▇ 0.76 32 ( +55): rt ▇▇▇▇▇▇▇▇ 0.49
33: co ▇▇▇▇▇ 0.71 33 ( -31): he ▇▇▇▇▇▇▇▇ 0.49
34: ll ▇▇▇▇▇ 0.70 34 ( +2): ne ▇▇▇▇▇▇▇▇ 0.48
35: ro ▇▇▇▇▇ 0.69 35 ( +45): ct ▇▇▇▇▇▇▇▇ 0.48
36: ne ▇▇▇▇▇ 0.69 36 ( -23): to ▇▇▇▇▇▇▇▇ 0.48
37: de ▇▇▇▇▇ 0.67 37 ( -14): ed ▇▇▇▇▇▇▇▇ 0.48
38: hi ▇▇▇▇▇ 0.66 38 ( +44): tr ▇▇▇▇▇▇▇▇ 0.47
39: ri ▇▇▇▇▇ 0.62 39 ( -29): nd ▇▇▇▇▇▇▇▇ 0.47
40: li ▇▇▇▇ 0.60 40 ( +2): io ▇▇▇▇▇▇▇▇ 0.45
42: io ▇▇▇▇ 0.58 41 ( -1): li ▇▇▇▇▇▇▇▇ 0.45
43: ra ▇▇▇▇ 0.57 42 ( -3): ri ▇▇▇▇▇▇▇▇ 0.45
52: el ▇▇▇▇ 0.51 47 ( -32): ng ▇▇▇▇▇▇▇ 0.41
56: ta ▇▇▇▇ 0.49 67 ( -41): ve ▇▇▇▇▇ 0.31
60: et ▇▇▇ 0.45 75 ( -61): ou ▇▇▇▇▇ 0.29
80: ct ▇▇▇ 0.37 76 ( -46): ea ▇▇▇▇▇ 0.29
82: tr ▇▇▇ 0.36 96 ( -62): ll ▇▇▇▇ 0.24
87: rt ▇▇▇ 0.34 120 ( -96): ha ▇▇▇ 0.20
144: ex ▇ 0.19 132 ( -94): hi ▇▇▇ 0.18
456: -- 0.01 207 (-176): of ▇▇ 0.11
iweb vs Granite Code: trigrams
───────────────────────iweb─────────────────────── ───────────────────────code───────────────────────
1: ␣th ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1.57 1 ( +7138): --- ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.42
2: the ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1.29 2 ( +4046): ␣=␣ ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.39
3: he␣ ▇▇▇▇▇▇▇▇▇▇▇▇ 1.03 3 ( +21): ion ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.33
4: ␣an ▇▇▇▇▇▇▇▇ 0.63 4 ( +25): ent ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.29
5: ing ▇▇▇▇▇▇▇ 0.61 5 ( +112): con ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.28
6: nd␣ ▇▇▇▇▇▇▇ 0.61 6 ( +30): tio ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.27
7: and ▇▇▇▇▇▇▇ 0.59 7 (???): ␣{⏎ ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.25
8: ␣to ▇▇▇▇▇▇▇ 0.58 8 ( -6): the ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.22
9: ng␣ ▇▇▇▇▇▇▇ 0.54 9 ( -4): ing ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.21
10: to␣ ▇▇▇▇▇▇ 0.53 10 ( +64): ate ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.21
11: ␣in ▇▇▇▇▇▇ 0.50 11 ( +653): sel ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.20
12: ed␣ ▇▇▇▇▇▇ 0.48 12 ( -11): ␣th ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.20
13: ␣of ▇▇▇▇▇▇ 0.47 13 (???): ␣␣␣ ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.20
14: of␣ ▇▇▇▇▇ 0.43 14 ( +456): ass ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.19
15: ␣a␣ ▇▇▇▇▇ 0.40 15 ( +153): ect ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.18
16: er␣ ▇▇▇▇▇ 0.40 16 ( +121): ons ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.18
17: is␣ ▇▇▇▇ 0.36 17 ( +213): ort ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.18
18: in␣ ▇▇▇▇ 0.35 18 (+12047): );⏎ ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.18
19: ␣co ▇▇▇▇ 0.35 19 ( +375): rt␣ ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.17
20: re␣ ▇▇▇▇ 0.35 20 ( +398): ser ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.17
21: on␣ ▇▇▇▇ 0.35 21 ( +1257): elf ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.17
22: e␣t ▇▇▇▇ 0.34 22 ( +1451): def ▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.16
23: s␣a ▇▇▇▇ 0.33 23 ( +275): ame ▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.16
24: ion ▇▇▇▇ 0.33 24 ( +379): por ▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.16
25: at␣ ▇▇▇▇ 0.32 25 ( +31): ter ▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.16
26: or␣ ▇▇▇▇ 0.32 26 ( +124): est ▇▇▇▇▇▇▇▇▇▇▇▇ 0.15
27: es␣ ▇▇▇▇ 0.30 27 ( -16): ␣in ▇▇▇▇▇▇▇▇▇▇▇▇ 0.15
28: e␣a ▇▇▇▇ 0.30 28 ( +1089): val ▇▇▇▇▇▇▇▇▇▇▇▇ 0.15
29: ent ▇▇▇▇ 0.29 29 (???): ⏎}⏎ ▇▇▇▇▇▇▇▇▇▇▇▇ 0.15
30: ␣re ▇▇▇ 0.29 30 ( +2): for ▇▇▇▇▇▇▇▇▇▇▇▇ 0.15
31: ␣be ▇▇▇ 0.29 31 ( +550): exp ▇▇▇▇▇▇▇▇▇▇▇▇ 0.15
32: for ▇▇▇ 0.28 32 ( +678): ert ▇▇▇▇▇▇▇▇▇▇▇▇ 0.15
33: you ▇▇▇ 0.27 33 ( +1697): typ ▇▇▇▇▇▇▇▇▇▇▇▇ 0.15
34: ␣fo ▇▇▇ 0.27 34 ( +118): one ▇▇▇▇▇▇▇▇▇▇▇ 0.14
35: ␣yo ▇▇▇ 0.27 35 ( +1899): ype ▇▇▇▇▇▇▇▇▇▇▇ 0.14
36: tio ▇▇▇ 0.26 36 ( +236): str ▇▇▇▇▇▇▇▇▇▇▇ 0.14
37: as␣ ▇▇▇ 0.26 37 ( +790): ext ▇▇▇▇▇▇▇▇▇▇▇ 0.14
38: ␣wi ▇▇▇ 0.26 38 ( +888): dat ▇▇▇▇▇▇▇▇▇▇▇ 0.14
39: n␣t ▇▇▇ 0.25 39 ( +710): col ▇▇▇▇▇▇▇▇▇▇▇ 0.13
40: s␣t ▇▇▇ 0.25 40 ( -10): ␣re ▇▇▇▇▇▇▇▇▇▇▇ 0.13
56: ter ▇▇ 0.20 44 ( -41): he␣ ▇▇▇▇▇▇▇▇▇▇▇ 0.13
74: ate ▇▇ 0.18 52 ( -33): ␣co ▇▇▇▇▇▇▇▇▇▇ 0.13
117: con ▇▇ 0.13 81 ( -60): on␣ ▇▇▇▇▇▇▇▇▇ 0.11
137: ons ▇ 0.12 89 ( -63): or␣ ▇▇▇▇▇▇▇▇ 0.10
150: est ▇ 0.11 90 ( -78): ed␣ ▇▇▇▇▇▇▇▇ 0.10
152: one ▇ 0.11 98 ( -91): and ▇▇▇▇▇▇▇ 0.09
168: ect ▇ 0.11 111 ( -94): is␣ ▇▇▇▇▇▇▇ 0.09
230: ort ▇ 0.08 112 ( -96): er␣ ▇▇▇▇▇▇▇ 0.09
272: str ▇ 0.07 147 ( -139): ␣to ▇▇▇▇▇▇ 0.08
298: ame ▇ 0.07 152 ( -134): in␣ ▇▇▇▇▇▇ 0.07
394: rt␣ ▇ 0.06 154 ( -127): es␣ ▇▇▇▇▇▇ 0.07
403: por ▇ 0.05 166 ( -162): ␣an ▇▇▇▇▇▇ 0.07
418: ser ▇ 0.05 180 ( -146): ␣fo ▇▇▇▇▇▇ 0.07
470: ass ▇ 0.05 201 ( -192): ng␣ ▇▇▇▇▇ 0.07
581: exp 0.04 214 ( -204): to␣ ▇▇▇▇▇ 0.06
664: sel 0.03 253 ( -238): ␣a␣ ▇▇▇▇▇ 0.06
710: ert 0.03 262 ( -248): of␣ ▇▇▇▇ 0.06
749: col 0.03 282 ( -276): nd␣ ▇▇▇▇ 0.05
827: ext 0.03 294 ( -281): ␣of ▇▇▇▇ 0.05
926: dat 0.02 308 ( -271): as␣ ▇▇▇▇ 0.05
1117: val 0.02 316 ( -294): e␣t ▇▇▇▇ 0.05
1278: elf 0.02 326 ( -306): re␣ ▇▇▇▇ 0.05
1473: def 0.01 374 ( -351): s␣a ▇▇▇ 0.04
1730: typ 0.01 396 ( -368): e␣a ▇▇▇ 0.04
1934: ype 0.01 407 ( -368): n␣t ▇▇▇ 0.04
4048: ␣=␣ 0.00 416 ( -385): ␣be ▇▇▇ 0.04
7139: --- 0.00 433 ( -395): ␣wi ▇▇▇ 0.04
12065: );⏎ 0.00 442 ( -402): s␣t ▇▇▇ 0.04
???: ⏎}⏎ 0.00 649 ( -624): at␣ ▇▇ 0.03
???: ␣{⏎ 0.00 1546 ( -1513): you ▇ 0.01
???: ␣␣␣ 0.00 3114 ( -3079): ␣yo 0.01
iweb vs Granite Code: trigrams, whitespace ignored
───────────────────────iweb─────────────────────── ───────────────────────code───────────────────────
1: the ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.63 1 (+4702): --- ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.61
2: ing ▇▇▇▇▇▇▇▇▇▇ 1.25 2 ( +2): ion ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.47
3: and ▇▇▇▇▇▇▇▇▇ 1.19 3 ( +2): ent ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.42
4: ion ▇▇▇▇▇ 0.67 4 ( +25): con ▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.41
5: ent ▇▇▇▇ 0.59 5 ( +3): tio ▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.39
6: for ▇▇▇▇ 0.57 6 ( -5): the ▇▇▇▇▇▇▇▇▇▇ 0.31
7: you ▇▇▇▇ 0.55 7 ( -5): ing ▇▇▇▇▇▇▇▇▇▇ 0.30
8: tio ▇▇▇▇ 0.53 8 ( +9): ate ▇▇▇▇▇▇▇▇▇▇ 0.30
9: hat ▇▇▇▇ 0.48 9 ( +316): sel ▇▇▇▇▇▇▇▇▇▇ 0.29
10: tha ▇▇▇ 0.46 10 ( +194): ass ▇▇▇▇▇▇▇▇▇ 0.27
11: her ▇▇▇ 0.45 11 ( +35): ect ▇▇▇▇▇▇▇▇▇ 0.26
12: ter ▇▇▇ 0.41 12 ( +24): ons ▇▇▇▇▇▇▇▇ 0.25
13: all ▇▇▇ 0.39 13 ( +57): ort ▇▇▇▇▇▇▇▇ 0.25
14: ati ▇▇▇ 0.38 14 ( +162): ser ▇▇▇▇▇▇▇▇ 0.25
15: thi ▇▇▇ 0.36 15 ( +733): elf ▇▇▇▇▇▇▇▇ 0.24
16: ver ▇▇▇ 0.36 16 ( +882): def ▇▇▇▇▇▇▇▇ 0.24
17: ate ▇▇▇ 0.36 17 ( +95): ame ▇▇▇▇▇▇▇▇ 0.23
18: our ▇▇▇ 0.36 18 ( +152): por ▇▇▇▇▇▇▇▇ 0.23
19: are ▇▇▇ 0.34 19 ( -7): ter ▇▇▇▇▇▇▇ 0.23
20: ere ▇▇▇ 0.34 20 ( +24): est ▇▇▇▇▇▇▇ 0.22
21: ith ▇▇▇ 0.34 21 ( +617): val ▇▇▇▇▇▇▇ 0.22
22: wit ▇▇ 0.33 22 ( -16): for ▇▇▇▇▇▇▇ 0.22
23: ers ▇▇ 0.33 23 ( +250): exp ▇▇▇▇▇▇▇ 0.22
24: his ▇▇ 0.32 24 ( +328): ert ▇▇▇▇▇▇▇ 0.21
25: pro ▇▇ 0.30 25 (+1063): typ ▇▇▇▇▇▇▇ 0.21
26: rea ▇▇ 0.29 26 ( +19): one ▇▇▇▇▇▇▇ 0.20
27: res ▇▇ 0.27 27 (+1205): ype ▇▇▇▇▇▇▇ 0.20
28: eve ▇▇ 0.27 28 ( +70): str ▇▇▇▇▇▇▇ 0.20
29: con ▇▇ 0.27 29 ( +402): ext ▇▇▇▇▇▇▇ 0.20
30: com ▇▇ 0.27 30 ( +465): dat ▇▇▇▇▇▇ 0.20
31: ill ▇▇ 0.26 31 ( +349): col ▇▇▇▇▇▇ 0.19
32: ive ▇▇ 0.24 32 ( +166): tes ▇▇▇▇▇▇ 0.19
33: out ▇▇ 0.24 33 ( +33): der ▇▇▇▇▇▇ 0.19
34: ess ▇▇ 0.24 34 ( +681): mpo ▇▇▇▇▇▇ 0.19
35: ome ▇▇ 0.24 35 ( +32): nte ▇▇▇▇▇▇ 0.19
36: ons ▇▇ 0.24 36 ( +86): ont ▇▇▇▇▇▇ 0.19
37: ted ▇▇ 0.24 37 ( +67): tur ▇▇▇▇▇▇ 0.19
38: ave ▇▇ 0.24 38 ( -11): res ▇▇▇▇▇▇ 0.19
39: nce ▇▇ 0.24 39 ( +4): sta ▇▇▇▇▇▇ 0.18
40: men ▇▇ 0.24 40 ( -15): pro ▇▇▇▇▇▇ 0.18
43: sta ▇▇ 0.23 41 ( -1): men ▇▇▇▇▇▇ 0.18
44: est ▇▇ 0.23 42 ( -16): rea ▇▇▇▇▇▇ 0.18
45: one ▇▇ 0.23 47 ( -33): ati ▇▇▇▇▇▇ 0.17
46: ect ▇▇ 0.22 60 ( -44): ver ▇▇▇▇▇ 0.16
66: der ▇ 0.17 66 ( -36): com ▇▇▇▇▇ 0.14
67: nte ▇ 0.17 71 ( -68): and ▇▇▇▇ 0.13
70: ort ▇ 0.17 74 ( -37): ted ▇▇▇▇ 0.13
98: str ▇ 0.15 95 ( -82): all ▇▇▇▇ 0.11
104: tur ▇ 0.14 106 ( -83): ers ▇▇▇▇ 0.11
112: ame ▇ 0.14 111 ( -96): thi ▇▇▇ 0.10
122: ont ▇ 0.13 115 ( -81): ess ▇▇▇ 0.10
170: por ▇ 0.11 127 ( -103): his ▇▇▇ 0.10
176: ser ▇ 0.11 141 ( -130): her ▇▇▇ 0.09
198: tes ▇ 0.10 145 ( -124): ith ▇▇▇ 0.09
204: ass ▇ 0.10 155 ( -133): wit ▇▇▇ 0.09
273: exp ▇ 0.08 162 ( -129): out ▇▇▇ 0.08
325: sel ▇ 0.07 179 ( -160): are ▇▇▇ 0.08
352: ert ▇ 0.07 189 ( -169): ere ▇▇ 0.07
380: col 0.06 193 ( -161): ive ▇▇ 0.07
431: ext 0.06 199 ( -160): nce ▇▇ 0.07
495: dat 0.05 243 ( -215): eve ▇▇ 0.06
638: val 0.04 319 ( -284): ome ▇▇ 0.05
715: mpo 0.04 330 ( -312): our ▇▇ 0.05
748: elf 0.03 505 ( -495): tha ▇ 0.04
898: def 0.03 517 ( -486): ill ▇ 0.03
1088: typ 0.02 603 ( -594): hat ▇ 0.03
1232: ype 0.02 826 ( -788): ave ▇ 0.02
4703: --- 0.00 1022 (-1015): you ▇ 0.02
If you want this corpus with slight modifications, such as:
- Converting upper case characters to lower case
- Ignoring whitespace (removing ngrams with whitespace)
- Ignoring ngrams with one or more special symbols or other characters
you can use ngram_show from granite-tools (v0.2.0+).
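To make these modifications concrete, here is a minimal sketch of how they could be applied to a table of ngram counts. It does not show the ngram_show interface or its implementation; the function transform_ngrams, its parameters and the sample counts are assumptions made for this example. The shares are renormalized after filtering, which is consistent with the "whitespace ignored" tables above having larger percentages than the unfiltered ones.

```python
from collections import Counter

WHITESPACE = set(" \t\n")


def transform_ngrams(
    counts: dict[str, float],
    *,
    ignore_case: bool = False,
    ignore_whitespace: bool = False,
    exclude_chars: str = "",
) -> dict[str, float]:
    """Filter and merge ngram counts, then return the shares in percent."""
    excluded = set(exclude_chars)
    if ignore_whitespace:
        excluded |= WHITESPACE
    result: Counter = Counter()
    for ngram, count in counts.items():
        # Drop the whole ngram if any of its characters is excluded
        if any(char in excluded for char in ngram):
            continue
        # Lowercasing merges e.g. "Er" and "er" into a single entry
        key = ngram.lower() if ignore_case else ngram
        result[key] += count
    total = sum(result.values())
    if not total:
        return {}
    return {ngram: 100 * count / total for ngram, count in result.most_common()}


# Hypothetical bigram counts
counts = {"Er": 2, "er": 1, "e ": 3, "r=": 4}
print(transform_ngrams(counts, ignore_case=True, ignore_whitespace=True))
```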
The code ngrams are based on source code from multiple popular open source repositories. The following repositories were cloned on 2024-09-30. The latest commit hash and the dominant language(s) are shown in parentheses:
- github.com/psf/requests (1ae6fc3, Python)
- github.com/psf/black (f1a2f92, Python)
- github.com/sphinx-doc/sphinx (abb3ead, Python)
- github.com/mkdocs/mkdocs (bb7e8b6, Python)
- github.com/django/django (5ed7208, Python)
- github.com/pallets/flask (2fec0b2, Python)
- github.com/fastapi/fastapi (847296e, Python)
- github.com/bokeh/bokeh (e07125d, TypeScript / Python)
- github.com/matplotlib/matplotlib (6404c95, Python)
- github.com/mwaskom/seaborn (b4e5f8d, Python)
- github.com/pandas-dev/pandas (90c26ce, Python)
- github.com/sympy/sympy (dc69be1, Python)
- github.com/pola-rs/polars (53a3590, Rust / Python)
- github.com/astral-sh/ruff (5118166, Rust)
- github.com/boto/boto3 (2879c42, Python)
- github.com/ipython/ipython (6c00cea, Python)
- github.com/python-attrs/attrs (a59c5d7, Python)
- github.com/pydantic/pydantic (01b5929, Python)
- github.com/pydantic/pydantic-core (f389728, Rust / Python)
- github.com/astral-sh/uv (958da0b, Rust)
- github.com/facebook/react (db24098, JavaScript / TypeScript)
- github.com/mui/material-ui (590f4ffe, TypeScript / JavaScript)
- github.com/react-bootstrap/react-bootstrap (eec36e5, TypeScript / JavaScript)
- github.com/jgthms/bulma (4301d4d, CSS / SCSS)
- github.com/microsoft/fluentui (ce750131, TypeScript / JavaScript)
- github.com/palantir/blueprint (3f4a894, TypeScript)
- github.com/primefaces/primereact (5d9098e, CSS / JavaScript)
- github.com/reactstrap/reactstrap (090bc1e, JavaScript / TypeScript)
- github.com/carbon-design-system/carbon (29a4646, JavaScript / TypeScript)
- github.com/vercel/next.js (c3006f6c, JavaScript / TypeScript / Rust)
- github.com/gatsbyjs/gatsby (7bc2dae, JavaScript / TypeScript)
- github.com/styled-components/styled-components (6f6db18, TypeScript / JavaScript)
- github.com/formatjs/formatjs (3f86838, TypeScript)
- github.com/reduxjs/redux (05ba9d1, TypeScript / JavaScript)
- github.com/webpack/webpack (5ac66f4, JavaScript)
- github.com/jestjs/jest (b0eb836, TypeScript / JavaScript)
- github.com/storybookjs/storybook (13db1c7, TypeScript / JavaScript)
All code from .py, .pyi, .rs, .js, .jsx, .ts, .tsx, .css, .scss and .less files in the above repositories was collected into a single corpus file per language. Leading whitespace (indentation) was removed from each line when reading the files, because otherwise the space character would carry most of the weight.
Script for extracting the code corpus from the repos
from pathlib import Path

raw_folder_root = Path(__file__).parent / "raw"
corpus_folder = (Path(__file__).parent / "corpus-not-clean").resolve()
corpus_folder.mkdir(exist_ok=True)

# Map each file extension to the corpus (language) it belongs to
extension_mapping = {
    ".py": "python",
    ".pyi": "python",
    ".rs": "rust",
    ".js": "javascript",
    ".jsx": "javascript",
    ".ts": "typescript",
    ".tsx": "typescript",
    ".css": "css",
    ".scss": "css",
    ".less": "css",
}


def iter_files(folder: Path):
    """Yield (file, language) pairs for all files with a known extension."""
    for file in folder.rglob("*.*"):
        if not file.is_file():
            continue
        if file.suffix in extension_mapping:
            yield file, extension_mapping[file.suffix]


if __name__ == "__main__":
    # One output file per language; set() avoids opening the same file twice
    files = {
        lang: open(corpus_folder / f"{lang}.txt", "w")
        for lang in set(extension_mapping.values())
    }
    for file, lang in iter_files(raw_folder_root):
        try:
            contents = file.read_text()
            # Remove indentation
            contents_pruned = "\n".join(c.lstrip() for c in contents.split("\n"))
            files[lang].write(contents_pruned)
        except Exception as e:
            print(f"Error while reading {file}: {e}. Skipping.")
    for lang in files:
        files[lang].close()
Resulting file structure:
📁 corpus-not-cleaned/
├─📄 javascript.txt # 142.6 MB
├─📄 python.txt # 94.2 MB
├─📄 typescript.txt # 80.3 MB
├─📄 rust.txt # 29.2 MB
└─📄 css.txt # 33.0 MB
After cleaning the data (see below), the ngrams binary from the Keyboard Layout Optimizer was used to form the ngrams for each language (ngrams/javascript, ngrams/python, ngrams/typescript, ngrams/rust, ngrams/css).
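As a rough illustration of what character-based ngrams are, the sketch below counts overlapping character sequences of length 1, 2 and 3. It is not the implementation of the ngrams binary, which may treat details such as line boundaries differently; the function name `count_ngrams` and the sample text are assumptions made for this example.

```python
from collections import Counter


def count_ngrams(text: str, n: int) -> Counter:
    """Count all overlapping character sequences of length `n` in `text`."""
    return Counter(text[i : i + n] for i in range(len(text) - n + 1))


# Hypothetical, already de-indented code sample.
sample = "if x:\nreturn x\n"

unigrams = count_ngrams(sample, 1)
bigrams = count_ngrams(sample, 2)
trigrams = count_ngrams(sample, 3)

print(unigrams.most_common(3))
print(bigrams.most_common(3))
print(trigrams.most_common(3))
```

Whitespace characters such as the space and the newline are counted like any other character, which is why they appear in the frequency tables as ␣ and ⏎.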
The Granite Code ngrams were created from the python, rust, javascript, typescript and css ngrams by first normalizing the ngram files using the normalize.py script from the Keyboard Layout Optimizer:
❯ python ./scripts/ngrams/normalize.py /somepath/ngrams/<corpusname>
and then merging the ngram files with weighting using the ngram_merge binary from the Keyboard Layout Optimizer:
❯ ./target/release/ngram_merge code/ngrams/code code/ngrams/python:0.4 code/ngrams/rust:0.1 code/ngrams/javascript:0.2 code/ngrams/typescript:0.2 code/ngrams/css:0.1
The version of the Keyboard Layout Optimizer used is identified by commit f93bd06.
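The idea behind normalizing first and merging with weights afterwards can be sketched as follows. This is not the code of normalize.py or ngram_merge, and it does not model the ngram file format; the function names and the toy counts are assumptions made for this example. Normalizing each language to relative frequencies first means that the size of a corpus does not affect the result, only the chosen weight does.

```python
from collections import Counter


def normalize(counts: dict[str, float]) -> dict[str, float]:
    """Scale raw counts so that the resulting frequencies sum to 1.0."""
    total = sum(counts.values())
    return {ngram: count / total for ngram, count in counts.items()}


def merge_weighted(sources: list[tuple[dict[str, float], float]]) -> dict[str, float]:
    """Add up normalized frequencies, each scaled by its corpus weight."""
    merged: Counter = Counter()
    for frequencies, weight in sources:
        for ngram, frequency in frequencies.items():
            merged[ngram] += weight * frequency
    return dict(merged)


# Toy unigram counts (made up). The Granite Code ngrams use five languages
# with the weights 0.4, 0.1, 0.2, 0.2 and 0.1 instead.
python_counts = {"e": 60, "_": 40}
css_counts = {"e": 200, "-": 800}

merged = merge_weighted(
    [(normalize(python_counts), 0.8), (normalize(css_counts), 0.2)]
)
print(merged)
```

Although the toy CSS sample is ten times larger than the Python one, it contributes only 20% of the merged frequencies, and the merged values still sum to 1 because the weights do.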
Before creating the ngrams, the corpus was cleaned so that only the following ASCII alphanumerics (62 characters):
qwertyuiopasdfghjklzxcvbnmQWERTYUIOPASDFGHJKLZXCVBNM1234567890
and the following punctuation and special characters (33 characters):
,.!?;:_'"^~#%&/\()[]{}<>=+-*`@$€|
were accepted into the ngrams, together with the whitespace characters (space, tab and newline) that the cleanup script also keeps. The conversion script below was used to either remove or replace the characters belonging to the following set:
¡£¤§«®°²³´¶·¹»½¾¿ÁÃÅÆÉÌÓ×ØßàáâãåæçèéêëìíîïðñòóôõøúûüýāăćčīłŋōŠšūŽžə͜͡αβεηικλμνοπρςστυАЩавдеиклмнорстأابةتخدرسعقلمنهويนรลอาเ‒–—―‘’“”•…₹™→−─│┆┌┐└┘┬┴═╞╡╪♪♫月语𞤫
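A set like the one above can be collected by scanning a corpus for characters outside the accepted set. The sketch below shows one way to do that; it is not the tool that was used to build the list, and the function name `untypable_chars` and the sample text are assumptions made for this example. The accepted characters are copied from the cleanup script.

```python
from collections import Counter

# Accepted characters: 62 alphanumerics, 33 punctuation/special characters
# and three whitespace characters (space, tab, newline).
ALLOWED = set(
    "qwertyuiopasdfghjklzxcvbnmQWERTYUIOPASDFGHJKLZXCVBNM1234567890"
    r""",.!?;:_'"^~#%&/\()[]{}<>=+-*`@$€|"""
    " \t\n"
)


def untypable_chars(text: str) -> Counter:
    """Count the characters in `text` that are not in the accepted set."""
    return Counter(ch for ch in text if ch not in ALLOWED)


# Hypothetical input with a few characters outside the accepted set.
sample = "naïve café – “quoted” text"
print(untypable_chars(sample).most_common())
```

Each character reported this way then needs a decision: either it gets an entry in the replacement table, or it is dropped by the filter.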
Corpora cleanup script
This is the cleanup script which was used to clean the English, Code and Finnish corpora.
from pathlib import Path
root = Path(__file__).parent.parent
TYPABLE_CHARS = "qwertyuiopasdfghjklzxcvbnmQWERTYUIOPASDFGHJKLZXCVBNM1234567890"
TYPABLE_CHARS += r""",.!?;:_'"^~#%&/\()[]{}<>=+-*`@$€|"""
TYPABLE_CHARS += " \t\n"
ALLOWED_CHARACTERS = set(TYPABLE_CHARS)
ALLOWED_CHARACTERS_FINNISH = ALLOWED_CHARACTERS | set("äöÄÖ")
replacements_finnish = {
"¹": "1",
"²": "2",
"³": "3",
"½": "1/2",
"¾": "3/4",
"Á": "A",
"Ã": "A",
"Å": "A",
"Æ": "AE",
"É": "E",
"Ì": "I",
"Ó": "O",
"×": "x",
"Ø": "Ö",
"ß": "ss",
"à": "a",
"á": "a",
"â": "a",
"ã": "a",
"å": "a",
"æ": "ae",
"ç": "c",
"è": "e",
"é": "e",
"ê": "e",
"ë": "e",
"ì": "i",
"í": "i",
"î": "i",
"ï": "i",
"ð": "d",
"ñ": "n",
"ò": "o",
"ó": "o",
"ô": "o",
"õ": "o",
"ø": "ö",
"ú": "u",
"û": "u",
"ü": "u",
"ý": "y",
"ā": "a",
"ă": "a",
"ć": "c",
"č": "c",
"ī": "i",
"ł": "l",
"ŋ": "NG",
"ō": "o",
"Š": "S",
"š": "s",
"ū": "u",
"Ž": "Z",
"ž": "z",
"α": "a",
"β": "b",
"ε": "e",
"η": "e",
"ι": "i",
"κ": "k",
"λ": "l",
"μ": "m",
"ν": "n",
"ο": "o",
"π": "p",
"ρ": "r",
"ς": "s",
"σ": "s",
"τ": "t",
"υ": "u",
"А": "A",
"Щ": "Shch",
"а": "a",
"в": "v",
"д": "d",
"е": "e",
"и": "i",
"к": "k",
"л": "l",
"м": "m",
"н": "n",
"о": "o",
"р": "r",
"с": "s",
"т": "t",
"‒": "-",
"–": "-",
"—": "--",
"―": "--",
"−": "-",
"─": "-",
"‘": "'",
"’": "'",
"´": "'",
"“": '"',
"”": '"',
"«": '"',
"»": '"',
"•": "*",
"·": "*",
"…": "...",
"™": "(tm)",
"®": "(r)",
}
replacements = replacements_finnish.copy()
replacements["Ø"] = "O"
replacements["ø"] = "o"
def process_file(input_path, output_path, replacements, allowed_chars):
    chunk_size = 50 * 1024 * 1024  # 50 MB chunk size
    with open(input_path, "r", encoding="utf-8") as infile, open(
        output_path, "w", encoding="utf-8"
    ) as outfile:
        while True:
            chunk = infile.read(chunk_size)
            if not chunk:
                break  # End of file
            for old_str, new_str in replacements.items():
                chunk = chunk.replace(old_str, new_str)
            chunk = "".join(char for char in chunk if char in allowed_chars)
            outfile.write(chunk)


def cleanup_folder(
    folder: Path,
    folder_out: Path,
    allowed_characters: set[str],
    used_replacements: dict[str, str],
):
    folder_out.mkdir(exist_ok=True)
    for file in folder.glob("*.txt"):
        file_out = folder_out / file.name
        print(f"Processing {file} -> {file_out}")
        process_file(file, file_out, used_replacements, allowed_characters)


if __name__ == "__main__":
    import time

    start = time.time()
    for lang in ("english", "code", "finnish"):
        langfolder = root / lang
        allowed_characters = (
            ALLOWED_CHARACTERS_FINNISH if lang == "finnish" else ALLOWED_CHARACTERS
        )
        used_replacements = replacements_finnish if lang == "finnish" else replacements
        cleanup_folder(
            langfolder / "corpus-not-clean",
            langfolder / "corpus-clean",
            allowed_characters,
            used_replacements,
        )
    print("Done in", time.time() - start, "seconds")
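The two steps performed on each chunk (replace first, then filter) can be tried on a short in-memory string. The sketch below is a simplified illustration, not part of the original script: the helper name `clean_text`, the reduced replacement table and the reduced allowed set are assumptions made for this example.

```python
def clean_text(text: str, replacements: dict[str, str], allowed: set[str]) -> str:
    """Apply the replacements first, then drop every character not in `allowed`."""
    for old, new in replacements.items():
        text = text.replace(old, new)
    return "".join(ch for ch in text if ch in allowed)


# Small subsets of the replacement table and the allowed characters,
# for illustration only.
demo_replacements = {"é": "e", "–": "-", "“": '"', "”": '"', "…": "..."}
demo_allowed = set('abcdefghijklmnopqrstuvwxyz "-.')

print(clean_text("“café” – déjà vu…", demo_replacements, demo_allowed))
# prints: "cafe" - dej vu...
```

The order matters: a character with a replacement entry is converted before the filter runs, while a character with neither a replacement nor a place in the allowed set (here "à", which is missing from the reduced table) is silently removed.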
Footnotes
- Douglas, Ian. “Keyboard Layout Analysis: Creating the Corpus, Bigram Chains, and Shakespeare's Monkeys”. Zenodo, March 29, 2021. doi.org/10.5281/zenodo.5501838.
- It is unclear whether the corpus has since been superseded by larger corpora, as the current version of the Colemak Design page refers to “English Letter Frequency Counts: Mayzner Revisited or ETAOIN SRHLDCU”.