Code corpus (Python, JavaScript, TypeScript, Rust, CSS) used for development of the Granite keyboard layout as character ngrams

fohrloop/granite-code-ngrams

Granite Code Ngrams

This repository contains character-based ngrams (unigrams, bigrams, trigrams) built from a programming-language corpus (Python, JavaScript, TypeScript, Rust, CSS). They were used in the development of the Granite Layout and are compatible with the Keyboard Layout Optimizer by Dario Götz. Atypical characters were removed from the corpus before the ngrams were created (see below for details).
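
As an illustration of what a character-based ngram table contains, the following is a minimal sketch that counts overlapping character ngrams and converts the counts to percentages. The function names `char_ngrams` and `to_percent` are hypothetical and are not taken from granite-tools; the actual tooling may clean and count the corpus differently.

```python
from collections import Counter


def char_ngrams(text: str, n: int, ignore_case: bool = True) -> Counter:
    """Count overlapping character ngrams of length n (hypothetical helper)."""
    if ignore_case:
        text = text.lower()
    # Slide a window of width n over the text, one character at a time.
    return Counter(text[i : i + n] for i in range(len(text) - n + 1))


def to_percent(counts: Counter) -> dict[str, float]:
    """Convert raw counts to percentages of the total count."""
    total = sum(counts.values())
    return {ngram: 100.0 * count / total for ngram, count in counts.items()}


# Toy input, not part of the real corpus.
sample = "def f(x):\n    return x\n"
unigrams = to_percent(char_ngrams(sample, 1))
bigrams = to_percent(char_ngrams(sample, 2))
trigrams = to_percent(char_ngrams(sample, 3))
```

Whitespace and newlines are counted like any other character here, which is why symbols such as ␣ and ⏎ show up in the tables further down.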

The corpus is a mixture of several datasets with the following weights:

  • 40% Python (94.2 MB text corpus)
  • 10% Rust (29.2 MB text corpus)
  • 20% TypeScript (80.3 MB text corpus)
  • 20% JavaScript (142.6 MB text corpus)
  • 10% CSS (33.0 MB text corpus)
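
One way such a weighted mixture could be computed is sketched below. It assumes that each language's ngram table is first normalized to sum to 1 and then scaled by its weight, so that the size of each text corpus does not affect its share. That is an assumption for illustration, not a confirmed description of how this repository's ngrams were built, and `mix_frequencies` is a hypothetical name.

```python
# Weights as listed above.
WEIGHTS = {
    "python": 0.40,
    "rust": 0.10,
    "typescript": 0.20,
    "javascript": 0.20,
    "css": 0.10,
}


def mix_frequencies(
    per_language: dict[str, dict[str, float]],
    weights: dict[str, float],
) -> dict[str, float]:
    """Combine per-language ngram tables into one weighted table."""
    mixed: dict[str, float] = {}
    for lang, freqs in per_language.items():
        total = sum(freqs.values())
        for ngram, value in freqs.items():
            # Normalize within the language, then apply the language weight.
            mixed[ngram] = mixed.get(ngram, 0.0) + weights[lang] * value / total
    return mixed


# Toy counts for two languages, purely illustrative.
toy = {"python": {"se": 3, "on": 1}, "css": {"on": 1, "--": 1}}
toy_weights = {"python": 0.8, "css": 0.2}
mixed = mix_frequencies(toy, toy_weights)
```

With weights that sum to 1, the mixed table also sums to 1 and can be scaled to percentages if needed.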

Other related repos: granite-english-ngrams, granite-finnish-ngrams and granite-tools.

Most common ngrams

The most common ngrams, as shown by the ngram_show tool from granite-tools, with case ignored (unit: percent).

Most common unigrams (Granite Code)
────────────────────── code ──────────────────────
   1: ␣ ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 7.85
   2: e ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 7.81
   3: t ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 5.94
   4: a ▇▇▇▇▇▇▇▇▇▇▇▇▇ 4.71
   5: r ▇▇▇▇▇▇▇▇▇▇▇▇▇ 4.70
   6: s ▇▇▇▇▇▇▇▇▇▇▇▇ 4.46
   7: o ▇▇▇▇▇▇▇▇▇▇▇▇ 4.33
   8: n ▇▇▇▇▇▇▇▇▇▇▇▇ 4.16
   9: i ▇▇▇▇▇▇▇▇▇▇▇ 4.02
  10: ⏎ ▇▇▇▇▇▇▇▇▇ 3.30
  11: l ▇▇▇▇▇▇▇▇▇ 3.23
  12: c ▇▇▇▇▇▇▇ 2.55
  13: d ▇▇▇▇▇▇▇ 2.38
  14: p ▇▇▇▇▇▇▇ 2.38
  15: . ▇▇▇▇▇▇ 2.09
  16: u ▇▇▇▇▇▇ 2.02
  17: m ▇▇▇▇▇▇ 2.00
  18: - ▇▇▇▇▇ 1.92
  19: , ▇▇▇▇▇ 1.82
  20: f ▇▇▇▇▇ 1.76
  21: ) ▇▇▇▇ 1.46
  22: ( ▇▇▇▇ 1.46
  23: " ▇▇▇▇ 1.44
  24: h ▇▇▇▇ 1.33
  25: : ▇▇▇▇ 1.25
  26: g ▇▇▇ 1.21
  27: b ▇▇▇ 1.21
  28: _ ▇▇▇ 1.20
  29: 0 ▇▇▇ 1.07
  30: 1 ▇▇▇ 1.06
  31: = ▇▇▇ 1.00
  32: y ▇▇ 0.88
  33: 2 ▇▇ 0.79
  34: x ▇▇ 0.77
  35: v ▇▇ 0.76
  36: ' ▇▇ 0.70
  37: { ▇▇ 0.57
  38: } ▇▇ 0.57
  39: / ▇▇ 0.56
  40: w ▇ 0.53
  41: ; ▇ 0.53
  42: 3 ▇ 0.49
  43: k ▇ 0.48
  44: 4 ▇ 0.48
  45: 5 ▇ 0.42
  46: > ▇ 0.35
  47: 6 ▇ 0.34
  48: 8 ▇ 0.34
  49: [ ▇ 0.32
  50: ] ▇ 0.32
  51: * ▇ 0.30
  52: 9 ▇ 0.28
  53: 7 ▇ 0.27
  54: q ▇ 0.21
  55: # ▇ 0.19
  56: < ▇ 0.18
  57: `  0.18
  58: j  0.17
  59: z  0.17
  60: \  0.12
  61: |  0.10
  62: +  0.09
  63: &  0.09
  64: !  0.07
  65: @  0.06
  66: $  0.06
  67: ?  0.05
  68: %  0.04
  69: ^  0.01
  70: ~  0.01
  71:     0.00
  72: €  0.00
Most common bigrams (Granite Code)
────────────────────── code ──────────────────────
   1: er ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1.01
   2: on ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.98
   3: ,␣ ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.98
   4: re ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.94
   5: in ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.93
   6: te ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.88
   7: se ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.80
   8: or ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.76
   9: st ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.72
  10: at ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.70
  11: es ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.67
  12: :␣ ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.66
  13: de ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.66
  14: en ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.65
  15: t␣ ▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.64
  16: co ▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.63
  17: le ▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.62
  18: nt ▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.61
  19: e␣ ▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.60
  20: ar ▇▇▇▇▇▇▇▇▇▇▇▇ 0.59
  21: ti ▇▇▇▇▇▇▇▇▇▇▇▇ 0.56
  22: -- ▇▇▇▇▇▇▇▇▇▇▇ 0.55
  23: ,⏎ ▇▇▇▇▇▇▇▇▇▇▇ 0.54
  24: th ▇▇▇▇▇▇▇▇▇▇▇ 0.53
  25: ␣= ▇▇▇▇▇▇▇▇▇▇▇ 0.53
  26: al ▇▇▇▇▇▇▇▇▇▇▇ 0.52
  27: ro ▇▇▇▇▇▇▇▇▇▇▇ 0.51
  28: =␣ ▇▇▇▇▇▇▇▇▇▇ 0.49
  29: ␣t ▇▇▇▇▇▇▇▇▇▇ 0.49
  30: me ▇▇▇▇▇▇▇▇▇▇ 0.49
  31: el ▇▇▇▇▇▇▇▇▇▇ 0.48
  32: et ▇▇▇▇▇▇▇▇▇ 0.45
  33: ;⏎ ▇▇▇▇▇▇▇▇▇ 0.44
  34: as ▇▇▇▇▇▇▇▇▇ 0.43
  35: an ▇▇▇▇▇▇▇▇▇ 0.41
  36: s␣ ▇▇▇▇▇▇▇▇▇ 0.41
  37: ex ▇▇▇▇▇▇▇▇▇ 0.41
  38: ⏎⏎ ▇▇▇▇▇▇▇▇ 0.40
  39: it ▇▇▇▇▇▇▇▇ 0.40
  40: ra ▇▇▇▇▇▇▇▇ 0.39
Most common bigrams (Granite Code), whitespace ignored
────────────────────── code ──────────────────────
   1: er ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1.28
   2: on ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1.26
   3: re ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1.19
   4: in ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1.18
   5: te ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1.13
   6: se ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1.02
   7: or ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.96
   8: st ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.91
   9: at ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.90
  10: es ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.86
  11: de ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.84
  12: en ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.83
  13: co ▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.81
  14: le ▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.79
  15: nt ▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.78
  16: ar ▇▇▇▇▇▇▇▇▇▇▇▇ 0.75
  17: ti ▇▇▇▇▇▇▇▇▇▇▇▇ 0.72
  18: -- ▇▇▇▇▇▇▇▇▇▇▇ 0.70
  19: th ▇▇▇▇▇▇▇▇▇▇▇ 0.68
  20: al ▇▇▇▇▇▇▇▇▇▇▇ 0.66
  21: ro ▇▇▇▇▇▇▇▇▇▇▇ 0.65
  22: me ▇▇▇▇▇▇▇▇▇▇ 0.62
  23: el ▇▇▇▇▇▇▇▇▇▇ 0.61
  24: et ▇▇▇▇▇▇▇▇▇ 0.58
  25: as ▇▇▇▇▇▇▇▇▇ 0.55
  26: an ▇▇▇▇▇▇▇▇▇ 0.53
  27: ex ▇▇▇▇▇▇▇▇▇ 0.52
  28: it ▇▇▇▇▇▇▇▇ 0.51
  29: ra ▇▇▇▇▇▇▇▇ 0.50
  30: ta ▇▇▇▇▇▇▇▇ 0.50
  31: is ▇▇▇▇▇▇▇▇ 0.50
  32: rt ▇▇▇▇▇▇▇▇ 0.49
  33: he ▇▇▇▇▇▇▇▇ 0.49
  34: ne ▇▇▇▇▇▇▇▇ 0.48
  35: ct ▇▇▇▇▇▇▇▇ 0.48
  36: to ▇▇▇▇▇▇▇▇ 0.48
  37: ed ▇▇▇▇▇▇▇▇ 0.48
  38: tr ▇▇▇▇▇▇▇▇ 0.47
  39: nd ▇▇▇▇▇▇▇▇ 0.47
  40: io ▇▇▇▇▇▇▇ 0.45
Most common trigrams (Granite Code)
────────────────────── code ──────────────────────
   1: --- ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.42
   2: ␣=␣ ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.39
   3: ion ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.33
   4: ent ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.29
   5: con ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.28
   6: tio ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.27
   7: ␣{⏎ ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.25
   8: the ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.22
   9: ing ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.21
  10: ate ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.21
  11: sel ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.20
  12: ␣th ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.20
  13: ␣␣␣ ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.20
  14: ass ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.19
  15: ect ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.18
  16: ons ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.18
  17: ort ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.18
  18: );⏎ ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.18
  19: rt␣ ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.17
  20: ser ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.17
  21: elf ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.17
  22: def ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.16
  23: ame ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.16
  24: por ▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.16
  25: ter ▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.16
  26: est ▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.15
  27: ␣in ▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.15
  28: val ▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.15
  29: ⏎}⏎ ▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.15
  30: for ▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.15
  31: exp ▇▇▇▇▇▇▇▇▇▇▇▇ 0.15
  32: ert ▇▇▇▇▇▇▇▇▇▇▇▇ 0.15
  33: typ ▇▇▇▇▇▇▇▇▇▇▇▇ 0.15
  34: one ▇▇▇▇▇▇▇▇▇▇▇▇ 0.14
  35: ype ▇▇▇▇▇▇▇▇▇▇▇▇ 0.14
  36: str ▇▇▇▇▇▇▇▇▇▇▇▇ 0.14
  37: ext ▇▇▇▇▇▇▇▇▇▇▇ 0.14
  38: dat ▇▇▇▇▇▇▇▇▇▇▇ 0.14
  39: col ▇▇▇▇▇▇▇▇▇▇▇ 0.13
  40: ␣re ▇▇▇▇▇▇▇▇▇▇▇ 0.13
Most common trigrams (Granite Code), whitespace ignored
────────────────────── code ──────────────────────
   1: --- ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.61
   2: ion ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.47
   3: ent ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.42
   4: con ▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.41
   5: tio ▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.39
   6: the ▇▇▇▇▇▇▇▇▇▇ 0.31
   7: ing ▇▇▇▇▇▇▇▇▇▇ 0.30
   8: ate ▇▇▇▇▇▇▇▇▇▇ 0.30
   9: sel ▇▇▇▇▇▇▇▇▇▇ 0.29
  10: ass ▇▇▇▇▇▇▇▇▇ 0.27
  11: ect ▇▇▇▇▇▇▇▇▇ 0.26
  12: ons ▇▇▇▇▇▇▇▇ 0.25
  13: ort ▇▇▇▇▇▇▇▇ 0.25
  14: ser ▇▇▇▇▇▇▇▇ 0.25
  15: elf ▇▇▇▇▇▇▇▇ 0.24
  16: def ▇▇▇▇▇▇▇▇ 0.24
  17: ame ▇▇▇▇▇▇▇▇ 0.23
  18: por ▇▇▇▇▇▇▇▇ 0.23
  19: ter ▇▇▇▇▇▇▇ 0.23
  20: est ▇▇▇▇▇▇▇ 0.22
  21: val ▇▇▇▇▇▇▇ 0.22
  22: for ▇▇▇▇▇▇▇ 0.22
  23: exp ▇▇▇▇▇▇▇ 0.22
  24: ert ▇▇▇▇▇▇▇ 0.21
  25: typ ▇▇▇▇▇▇▇ 0.21
  26: one ▇▇▇▇▇▇▇ 0.20
  27: ype ▇▇▇▇▇▇▇ 0.20
  28: str ▇▇▇▇▇▇▇ 0.20
  29: ext ▇▇▇▇▇▇▇ 0.20
  30: dat ▇▇▇▇▇▇ 0.20
  31: col ▇▇▇▇▇▇ 0.19
  32: tes ▇▇▇▇▇▇ 0.19
  33: der ▇▇▇▇▇▇ 0.19
  34: mpo ▇▇▇▇▇▇ 0.19
  35: nte ▇▇▇▇▇▇ 0.19
  36: ont ▇▇▇▇▇▇ 0.19
  37: tur ▇▇▇▇▇▇ 0.19
  38: res ▇▇▇▇▇▇ 0.19
  39: sta ▇▇▇▇▇▇ 0.18
  40: pro ▇▇▇▇▇▇ 0.18
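
In the "whitespace ignored" tables above, each ngram's share is larger than in the corresponding full table by a roughly constant factor (for bigrams, 1.28/1.01 for "er" and 1.26/0.98 for "on" are both close to 1.27 to 1.29). That pattern is consistent with dropping every ngram that contains a whitespace character and renormalizing the rest, but this is an inference from the numbers, not documented behavior of ngram_show. A minimal sketch of that interpretation, with the hypothetical name `drop_whitespace_ngrams`:

```python
def drop_whitespace_ngrams(freqs: dict[str, float]) -> dict[str, float]:
    """Remove ngrams containing whitespace and renormalize to percent.

    Assumes the keys hold real characters (" ", "\n", "\t"), not the
    display symbols used in the tables.
    """
    kept = {
        ngram: value
        for ngram, value in freqs.items()
        if not any(ch.isspace() for ch in ngram)
    }
    total = sum(kept.values())
    return {ngram: 100.0 * value / total for ngram, value in kept.items()}


# Toy values, purely illustrative.
result = drop_whitespace_ngrams({"er": 1.0, ", ": 1.0, "on": 2.0, ";\n": 1.0})
```
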
Most common trigrams (Granite Python)
───────────────────── python ─────────────────────
   1: --- ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.98
   2: ␣=␣ ▇▇▇▇▇▇▇▇▇▇▇ 0.53
   3: sel ▇▇▇▇▇▇▇ 0.35
   4: ␣␣␣ ▇▇▇▇▇▇▇ 0.34
   5: elf ▇▇▇▇▇▇▇ 0.33
   6: ass ▇▇▇▇▇▇▇ 0.32
   7: ser ▇▇▇▇▇▇▇ 0.32
   8: ion ▇▇▇▇▇▇ 0.32
   9: the ▇▇▇▇▇▇ 0.27
  10: ert ▇▇▇▇▇ 0.25
  11: tio ▇▇▇▇▇ 0.25
  12: rt␣ ▇▇▇▇▇ 0.25
  13: def ▇▇▇▇▇ 0.24
  14: ␣th ▇▇▇▇▇ 0.24
  15: ␣in ▇▇▇▇▇ 0.24
  16: sse ▇▇▇▇▇ 0.23
  17: ):⏎ ▇▇▇▇▇ 0.23
  18: ate ▇▇▇▇ 0.22
  19: est ▇▇▇▇ 0.21
  20: ame ▇▇▇▇ 0.21
  21: )⏎⏎ ▇▇▇▇ 0.21
  22: ing ▇▇▇▇ 0.21
  23: lf. ▇▇▇▇ 0.21
  24: val ▇▇▇▇ 0.21
  25: tes ▇▇▇▇ 0.21
  26: ␣no ▇▇▇▇ 0.20
  27: ⏎de ▇▇▇▇ 0.20
  28: dat ▇▇▇▇ 0.19
  29: ef␣ ▇▇▇▇ 0.19
  30: ect ▇▇▇▇ 0.18
  31: ent ▇▇▇▇ 0.18
  32: one ▇▇▇▇ 0.17
  33: for ▇▇▇▇ 0.17
  34: he␣ ▇▇▇▇ 0.17
  35: int ▇▇▇ 0.17
  36: typ ▇▇▇ 0.17
  37: ",␣ ▇▇▇ 0.17
  38: ":␣ ▇▇▇ 0.17
  39: ⏎#␣ ▇▇▇ 0.17
  40: ␣se ▇▇▇ 0.16
Most common trigrams (Granite TypeScript)
─────────────────── typescript ───────────────────
   1: con ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.49
   2: ent ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.43
   3: ␣{⏎ ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.43
   4: ons ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.35
   5: ion ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.34
   6: ␣=␣ ▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.32
   7: ort ▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.31
   8: tio ▇▇▇▇▇▇▇▇▇▇▇▇ 0.30
   9: ":␣ ▇▇▇▇▇▇▇▇▇▇▇▇ 0.29
  10: );⏎ ▇▇▇▇▇▇▇▇▇▇▇ 0.28
  11: por ▇▇▇▇▇▇▇▇▇▇▇ 0.28
  12: nst ▇▇▇▇▇▇▇▇▇▇▇ 0.27
  13: the ▇▇▇▇▇▇▇▇▇▇ 0.26
  14: st␣ ▇▇▇▇▇▇▇▇▇▇ 0.26
  15: pro ▇▇▇▇▇▇▇▇▇▇ 0.25
  16: rt␣ ▇▇▇▇▇▇▇▇▇▇ 0.25
  17: ect ▇▇▇▇▇▇▇▇▇▇ 0.25
  18: mpo ▇▇▇▇▇▇▇▇▇▇ 0.25
  19: ⏎co ▇▇▇▇▇▇▇▇▇▇ 0.24
  20: ␣th ▇▇▇▇▇▇▇▇▇▇ 0.24
  21: ate ▇▇▇▇▇▇▇▇▇ 0.23
  22: ing ▇▇▇▇▇▇▇▇▇ 0.23
  23: ,⏎" ▇▇▇▇▇▇▇▇▇ 0.22
  24: exp ▇▇▇▇▇▇▇▇▇ 0.22
  25: :␣" ▇▇▇▇▇▇▇▇▇ 0.22
  26: ;⏎⏎ ▇▇▇▇▇▇▇▇ 0.20
  27: rop ▇▇▇▇▇▇▇▇ 0.20
  28: rea ▇▇▇▇▇▇▇▇ 0.20
  29: ext ▇▇▇▇▇▇▇▇ 0.19
  30: rom ▇▇▇▇▇▇▇▇ 0.19
  31: :␣' ▇▇▇▇▇▇▇▇ 0.19
  32: fro ▇▇▇▇▇▇▇ 0.18
  33: ␣re ▇▇▇▇▇▇▇ 0.18
  34: ⏎ex ▇▇▇▇▇▇▇ 0.18
  35: om␣ ▇▇▇▇▇▇▇ 0.17
  36: men ▇▇▇▇▇▇▇ 0.17
  37: act ▇▇▇▇▇▇▇ 0.17
  38: der ▇▇▇▇▇▇▇ 0.17
  39: ont ▇▇▇▇▇▇▇ 0.17
  40: ",⏎ ▇▇▇▇▇▇▇ 0.17

Comparison to other corpora

granite-code vs kla-code

The Keyboard Layout Analysis: Creating the Corpus, Bigram Chains, and Shakespeare's Monkeys [1] (here: kla-code) contains an ngram dataset for programming use called code-frequency.txt. The dataset contains only unigrams. One of the main differences is that kla-code is not targeted at any specific programming language. Its source corpus is taken from acmeism/RosettaCodeData; see, for example, Solve_a_Numbrix_puzzle for some code samples.

kla-code vs Granite Code: unigrams
  • kla-code has clearly not had indentation removed, as the share of space characters is very high (~25%).
  • Granite Code does not contain any tabs (tabs have likely been replaced by spaces in the repositories used). IDEs typically jump to the correct indentation level automatically, so far fewer Tab key presses are needed than there are tab characters in a file.
  • Parentheses (( and )) are used less in the corpus behind the Granite Code ngrams.
  • The equals sign (=) is used less in the corpus behind the Granite Code ngrams.
─────────────────────kla-code───────────────────── ───────────────────────code───────────────────────
 1: ␣  ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 24.87                 1 ( +0): ␣  ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 7.85
 2: e  ▇▇▇▇▇ 5.08                                    2 ( +0): e  ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 7.53
 3: t  ▇▇▇▇ 4.11                                     3 ( +0): t  ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 5.62
 4: ⏎  ▇▇▇ 3.61                                      4 ( +3): r  ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 4.48
 5: n  ▇▇▇ 3.55                                      5 ( +3): a  ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 4.43
 6: i  ▇▇▇ 3.40                                      6 ( +3): o  ▇▇▇▇▇▇▇▇▇▇▇▇▇ 4.18
 7: r  ▇▇▇ 3.34                                      7 ( +3): s  ▇▇▇▇▇▇▇▇▇▇▇▇▇ 4.12
 8: a  ▇▇▇ 3.00                                      8 ( -3): n  ▇▇▇▇▇▇▇▇▇▇▇▇ 3.96
 9: o  ▇▇▇ 2.80                                      9 ( -3): i  ▇▇▇▇▇▇▇▇▇▇▇ 3.73
10: s  ▇▇▇ 2.77                                     10 ( -6): ⏎  ▇▇▇▇▇▇▇▇▇▇ 3.30
11: l  ▇▇ 2.09                                      11 ( +0): l  ▇▇▇▇▇▇▇▇▇ 3.09
12: )  ▇▇ 1.90                                      12 ( +3): c  ▇▇▇▇▇▇▇ 2.27
13: (  ▇▇ 1.90                                      13 ( +1): d  ▇▇▇▇▇▇▇ 2.18
14: d  ▇▇ 1.73                                      14 ( +4): p  ▇▇▇▇▇▇▇ 2.14
15: c  ▇ 1.58                                       15 ( +8): .  ▇▇▇▇▇▇ 2.09
16: ,  ▇ 1.49                                       16 ( +1): u  ▇▇▇▇▇▇ 1.94
17: u  ▇ 1.46                                       17 ( +8): -  ▇▇▇▇▇▇ 1.92
18: p  ▇ 1.33                                       18 ( -2): ,  ▇▇▇▇▇▇ 1.82
19: m  ▇ 1.30                                       19 ( +0): m  ▇▇▇▇▇▇ 1.81
20: f  ▇ 1.18                                       20 ( +0): f  ▇▇▇▇▇ 1.58
21: =  ▇ 1.12                                       21 ( -9): )  ▇▇▇▇ 1.46
22: "  ▇ 1.09                                       22 ( -9): (  ▇▇▇▇ 1.46
23: .  ▇ 1.05                                       23 ( -1): "  ▇▇▇▇ 1.44
24: h  ▇ 1.02                                       24 ( +7): :  ▇▇▇▇ 1.25
25: -  ▇ 1.01                                       25 ( -1): h  ▇▇▇▇ 1.24
26: 1  ▇ 1.01                                       26 (+13): _  ▇▇▇▇ 1.20
27: 0  ▇ 0.98                                       27 ( +1): g  ▇▇▇ 1.11
28: g  ▇ 0.90                                       28 ( -1): 0  ▇▇▇ 1.07
29: ;  ▇ 0.78                                       29 ( -3): 1  ▇▇▇ 1.06
30: b  ▇ 0.74                                       30 ( +0): b  ▇▇▇ 1.00
31: :  ▇ 0.70                                       31 (-10): =  ▇▇▇ 1.00
32: y  ▇ 0.61                                       32 ( +0): y  ▇▇▇ 0.86
33: x  ▇ 0.58                                       33 ( +1): 2  ▇▇ 0.79
34: 2  ▇ 0.57                                       34 ( -1): x  ▇▇ 0.73
35: \t  0.53                                        35 (+10): '  ▇▇ 0.70
36: w   0.52                                        36 ( +4): v  ▇▇ 0.67
37: [   0.48                                        37 (+12): {  ▇▇ 0.57
38: ]   0.47                                        38 (+13): }  ▇▇ 0.57
39: _   0.47                                        39 ( +8): /  ▇▇ 0.56
40: v   0.47                                        40 (-11): ;  ▇▇ 0.53
45: '   0.41                                        43 ( -7): w  ▇ 0.48
47: /   0.37                                        51 (-14): [  ▇ 0.32
49: {   0.37                                        52 (-14): ]  ▇ 0.32
51: }   0.37                                       ??? (???): \t  0.00
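
The number in parentheses in the right-hand column appears to be the rank change relative to the left-hand corpus: "r" is rank 7 in kla-code and rank 4 in Granite Code and is shown as +3, while "⏎" is rank 4 in kla-code and rank 10 in Granite Code and is shown as -6. Ngrams missing from the reference are shown as ???. The following sketch reproduces that convention under those assumptions; `rank_changes` is a hypothetical name, not a function from granite-tools.

```python
from typing import Optional


def rank_changes(
    reference: dict[str, float],
    other: dict[str, float],
) -> list[tuple[int, Optional[int], str]]:
    """Return (rank, change, ngram) rows for `other`.

    change = rank in reference - rank in other, so a positive value means
    the ngram ranks higher in `other`. It is None (shown as ??? in the
    tables) when the ngram is absent from the reference.
    """

    def ranks(freqs: dict[str, float]) -> dict[str, int]:
        ordered = sorted(freqs, key=freqs.get, reverse=True)
        return {ngram: i for i, ngram in enumerate(ordered, start=1)}

    ref_ranks = ranks(reference)
    rows = []
    for ngram, rank in ranks(other).items():
        ref_rank = ref_ranks.get(ngram)
        change = None if ref_rank is None else ref_rank - rank
        rows.append((rank, change, ngram))
    return rows


# Toy frequencies, purely illustrative.
reference = {"a": 5.0, "b": 3.0, "c": 1.0}
other = {"b": 4.0, "a": 2.0, "d": 1.0}
rows = rank_changes(reference, other)
```

The same computation applies to the bigram and trigram comparisons, since it only depends on the ordering of each table.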

granite-code vs colemak-iweb

Download link: iweb-corpus-samples-cleaned.txt.xz. This is the same dataset as ngrams/shai_english in the Keyboard Layout Optimizer [2], named after Shai Coleman, the creator of the Colemak layout [3].

Note that the iweb dataset contains ngrams for English, not programming.

iweb vs Granite Code: unigrams
───────────────────────iweb─────────────────────── ───────────────────────code───────────────────────
 1: ␣  ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 16.84                 1 ( +0): ␣ ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 7.85
 2: e  ▇▇▇▇▇▇▇▇▇▇▇▇▇ 9.59                            2 ( +0): e ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 7.81
 3: t  ▇▇▇▇▇▇▇▇▇▇ 7.28                               3 ( +0): t ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 5.94
 4: a  ▇▇▇▇▇▇▇▇▇ 6.50                                4 ( +0): a ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 4.71
 5: o  ▇▇▇▇▇▇▇▇ 6.22                                 5 ( +4): r ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 4.70
 6: i  ▇▇▇▇▇▇▇▇ 5.81                                 6 ( +2): s ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 4.46
 7: n  ▇▇▇▇▇▇▇▇ 5.54                                 7 ( -2): o ▇▇▇▇▇▇▇▇▇▇▇▇▇ 4.33
 8: s  ▇▇▇▇▇▇▇ 5.24                                  8 ( -1): n ▇▇▇▇▇▇▇▇▇▇▇▇▇ 4.16
 9: r  ▇▇▇▇▇▇▇ 4.93                                  9 ( -3): i ▇▇▇▇▇▇▇▇▇▇▇▇ 4.02
10: h  ▇▇▇▇▇ 3.73                                   10 (+16): ⏎ ▇▇▇▇▇▇▇▇▇▇ 3.30
11: l  ▇▇▇▇▇ 3.38                                   11 ( +0): l ▇▇▇▇▇▇▇▇▇▇ 3.23
12: d  ▇▇▇▇ 2.96                                    12 ( +1): c ▇▇▇▇▇▇▇▇ 2.55
13: c  ▇▇▇ 2.53                                     13 ( -1): d ▇▇▇▇▇▇▇ 2.38
14: u  ▇▇▇ 2.36                                     14 ( +3): p ▇▇▇▇▇▇▇ 2.38
15: m  ▇▇▇ 1.98                                     15 ( +7): . ▇▇▇▇▇▇ 2.09
16: f  ▇▇ 1.74                                      16 ( -2): u ▇▇▇▇▇▇ 2.02
17: p  ▇▇ 1.71                                      17 ( -2): m ▇▇▇▇▇▇ 2.00
18: g  ▇▇ 1.67                                      18 ( +9): - ▇▇▇▇▇▇ 1.92
19: y  ▇▇ 1.57                                      19 ( +5): , ▇▇▇▇▇▇ 1.82
20: w  ▇▇ 1.47                                      20 ( -4): f ▇▇▇▇▇ 1.76
21: b  ▇▇ 1.22                                      21 (+16): ) ▇▇▇▇ 1.46
22: .  ▇ 0.89                                       22 (+16): ( ▇▇▇▇ 1.46
23: v  ▇ 0.87                                       23 ( +7): " ▇▇▇▇ 1.44
24: ,  ▇ 0.82                                       24 (-14): h ▇▇▇▇ 1.33
25: k  ▇ 0.65                                       25 (+14): : ▇▇▇▇ 1.25
26: ⏎   0.35                                        26 ( -8): g ▇▇▇▇ 1.21
27: -   0.21                                        27 ( -6): b ▇▇▇▇ 1.21
28: '   0.21                                        28 (???): _ ▇▇▇▇ 1.20
29: x   0.18                                        29 ( +2): 0 ▇▇▇ 1.07
30: "   0.15                                        30 ( +3): 1 ▇▇▇ 1.06
31: 0   0.15                                        31 (+26): = ▇▇▇ 1.00
32: j   0.14                                        32 (-13): y ▇▇▇ 0.88
33: 1   0.13                                        33 ( +1): 2 ▇▇ 0.79
34: 2   0.10                                        34 ( -5): x ▇▇ 0.77
35: q   0.08                                        35 (-12): v ▇▇ 0.76
36: z   0.08                                        36 ( -8): ' ▇▇ 0.70
37: )   0.08                                        37 (???): { ▇▇ 0.57
38: (   0.08                                        38 (???): } ▇▇ 0.57
39: :   0.06                                        39 (+10): / ▇▇ 0.56
40: 5   0.05                                        40 (-20): w ▇▇ 0.53
41: 3   0.05                                        41 ( +9): ; ▇▇ 0.53
42: 9   0.04                                        42 ( -1): 3 ▇▇ 0.49
43: ?   0.04                                        43 (-18): k ▇ 0.48
44: 4   0.04                                        44 ( +0): 4 ▇ 0.48
45: !   0.04                                        45 ( -5): 5 ▇ 0.42
46: 6   0.04                                        46 (+10): > ▇ 0.35
47: 8   0.03                                        47 ( -1): 6 ▇ 0.34
48: 7   0.03                                        48 ( -1): 8 ▇ 0.34
49: /   0.02                                        49 (???): [ ▇ 0.32
50: ;   0.02                                        50 (???): ] ▇ 0.32
51: $   0.01                                        51 ( +4): * ▇ 0.30
52: %   0.01                                        52 (-10): 9 ▇ 0.28
53: &   0.01                                        53 ( -5): 7 ▇ 0.27
54: +   0.01                                        54 (-19): q ▇ 0.21
55: *   0.00                                        55 ( +3): # ▇ 0.19
56: >   0.00                                        56 ( +4): < ▇ 0.18
57: =   0.00                                        57 (???): ` ▇ 0.18
58: #   0.00                                        58 (-26): j ▇ 0.17
59: @   0.00                                        59 (-23): z ▇ 0.17
60: <   0.00                                        60 (???): \  0.12
61:     0.00                                        61 (???): |  0.10
62: ’   0.00                                        62 ( -8): +  0.09
63: ”   0.00                                        63 (-10): &  0.09
64: “   0.00                                        64 (-19): !  0.07
65: —   0.00                                        65 ( -6): @  0.06
66: ü   0.00                                        66 (-15): $  0.06
67: –   0.00                                        67 (-24): ?  0.05
68: •   0.00                                        68 (-16): %  0.04
69: ¢   0.00                                        69 (???): ^  0.01
70: ´   0.00                                        70 (???): ~  0.01
71: é   0.00                                        71 (???):     0.00
72: ʼ   0.00                                        72 (???): €  0.00
73: ®   0.00                                       ??? (???): ö  0.00
74: ¤   0.00                                       ??? (???): °  0.00
75: ‐   0.00                                       ??? (???): ⅜  0.00
76: §   0.00                                       ??? (???): ⅓  0.00
77: ä   0.00                                       ??? (???): ‐  0.00
78: ⅓   0.00                                       ??? (???): ⅔  0.00
79: ′   0.00                                       ??? (???): ü  0.00
80: ć   0.00                                       ??? (???): ′  0.00
81: →   0.00                                       ??? (???): ®  0.00
82: и   0.00                                       ??? (???): ć  0.00
83: ⅔   0.00                                       ??? (???): “  0.00
84: ⅜   0.00                                       ??? (???): ­  0.00
85: ¬   0.00                                       ??? (???): ´  0.00
86: ­   0.00                                       ??? (???): —  0.00
87: ›   0.00                                       ??? (???): ”  0.00
88: ö   0.00                                       ??? (???): ¬  0.00
89: ☺   0.00                                       ??? (???): ›  0.00
90: ·   0.00                                       ??? (???): é  0.00
91: °   0.00                                       ??? (???): §  0.00
92: ÷   0.00                                       ??? (???): â  0.00
93: â   0.00                                       ??? (???): ·  0.00
94: ⅛   0.00                                       ??? (???): и  0.00
???: \  0.00                                       ??? (???): ⅛  0.00
???:      0.00                                       ??? (???): ÷  0.00
???: ^  0.00                                       ??? (???): ¤  0.00
???: }  0.00                                       ??? (???): –  0.00
???: _  0.00                                       ??? (???): ☺  0.00
???: `  0.00                                       ??? (???): •  0.00
???: €  0.00                                       ??? (???): ¢  0.00
???: |  0.00                                       ??? (???):    0.00
???: ]  0.00                                       ??? (???): →  0.00
???: [  0.00                                       ??? (???): ’  0.00
???: {  0.00                                       ??? (???): ʼ  0.00
???: ~  0.00                                       ??? (???): ä  0.00
iweb vs Granite Code: bigrams
───────────────────────iweb─────────────────────── ───────────────────────code───────────────────────
 1: e␣   ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.94                    1 (  +10): er ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1.01
 2: ␣t   ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.50                       2 (  +15): on ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.98
 3: th   ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.03                          3 (  +23): ,␣ ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.98
 4: ␣a   ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1.99                           4 (  +10): re ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.94
 5: s␣   ▇▇▇▇▇▇▇▇▇▇▇▇▇ 1.84                            5 (   +2): in ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.93
 6: he   ▇▇▇▇▇▇▇▇▇▇▇▇ 1.66                             6 (  +28): te ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.88
 7: in   ▇▇▇▇▇▇▇▇▇▇▇ 1.55                              7 (  +44): se ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.80
 8: t␣   ▇▇▇▇▇▇▇▇▇▇▇ 1.49                              8 (  +15): or ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.76
 9: d␣   ▇▇▇▇▇▇▇▇▇▇ 1.40                               9 (  +28): st ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.72
10: an   ▇▇▇▇▇▇▇▇▇ 1.25                               10 (  +11): at ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.70
11: er   ▇▇▇▇▇▇▇▇▇ 1.21                               11 (  +16): es ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.67
12: n␣   ▇▇▇▇▇▇▇▇▇ 1.20                               12 ( +295): :␣ ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.66
13: ␣i   ▇▇▇▇▇▇▇▇▇ 1.20                               13 (  +52): de ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.66
14: re   ▇▇▇▇▇▇▇▇ 1.14                                14 (  +11): en ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.65
15: ␣s   ▇▇▇▇▇▇▇▇ 1.12                                15 (   -7): t␣ ▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.64
16: ␣o   ▇▇▇▇▇▇▇▇ 1.06                                16 (  +44): co ▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.63
17: on   ▇▇▇▇▇▇▇ 0.99                                 17 (  +33): le ▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.62
18: r␣   ▇▇▇▇▇▇▇ 0.97                                 18 (  +28): nt ▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.61
19: ␣w   ▇▇▇▇▇▇▇ 0.95                                 19 (  -18): e␣ ▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.60
20: ␣c   ▇▇▇▇▇▇ 0.86                                  20 (  +18): ar ▇▇▇▇▇▇▇▇▇▇▇▇ 0.59
21: at   ▇▇▇▇▇▇ 0.86                                  21 (  +15): ti ▇▇▇▇▇▇▇▇▇▇▇▇ 0.56
22: y␣   ▇▇▇▇▇▇ 0.84                                  22 ( +542): -- ▇▇▇▇▇▇▇▇▇▇▇ 0.55
23: or   ▇▇▇▇▇▇ 0.83                                  23 (+1096): ,⏎ ▇▇▇▇▇▇▇▇▇▇▇ 0.54
24: nd   ▇▇▇▇▇▇ 0.83                                  24 (  -21): th ▇▇▇▇▇▇▇▇▇▇▇ 0.53
25: en   ▇▇▇▇▇▇ 0.80                                  25 ( +811): ␣= ▇▇▇▇▇▇▇▇▇▇▇ 0.53
26: ,␣   ▇▇▇▇▇▇ 0.79                                  26 (  +16): al ▇▇▇▇▇▇▇▇▇▇▇ 0.52
27: es   ▇▇▇▇▇▇ 0.79                                  27 (  +35): ro ▇▇▇▇▇▇▇▇▇▇▇ 0.51
28: o␣   ▇▇▇▇▇▇ 0.77                                  28 ( +798): =␣ ▇▇▇▇▇▇▇▇▇▇ 0.49
29: to   ▇▇▇▇▇ 0.76                                   29 (  -27): ␣t ▇▇▇▇▇▇▇▇▇▇ 0.49
30: ou   ▇▇▇▇▇ 0.76                                   30 (  +28): me ▇▇▇▇▇▇▇▇▇▇ 0.49
31: ␣b   ▇▇▇▇▇ 0.75                                   31 (  +53): el ▇▇▇▇▇▇▇▇▇▇ 0.48
32: ng   ▇▇▇▇▇ 0.73                                   32 (  +62): et ▇▇▇▇▇▇▇▇▇ 0.45
33: it   ▇▇▇▇▇ 0.72                                   33 ( +959): ;⏎ ▇▇▇▇▇▇▇▇▇ 0.44
34: te   ▇▇▇▇▇ 0.70                                   34 (  +19): as ▇▇▇▇▇▇▇▇▇ 0.43
35: ␣f   ▇▇▇▇▇ 0.70                                   35 (  -25): an ▇▇▇▇▇▇▇▇▇ 0.41
36: ti   ▇▇▇▇▇ 0.69                                   36 (  -31): s␣ ▇▇▇▇▇▇▇▇▇ 0.41
37: st   ▇▇▇▇▇ 0.69                                   37 ( +150): ex ▇▇▇▇▇▇▇▇▇ 0.41
38: ar   ▇▇▇▇▇ 0.68                                   38 (???): ⏎⏎   ▇▇▇▇▇▇▇▇ 0.40
39: ␣p   ▇▇▇▇▇ 0.67                                   39 (   -6): it ▇▇▇▇▇▇▇▇ 0.40
40: ␣m   ▇▇▇▇▇ 0.64                                   40 (  +34): ra ▇▇▇▇▇▇▇▇ 0.39
42: al   ▇▇▇▇▇ 0.63                                   42 (  -27): ␣s ▇▇▇▇▇▇▇▇ 0.39
46: nt   ▇▇▇▇ 0.59                                    45 (  -39): he ▇▇▇▇▇▇▇▇ 0.38
50: le   ▇▇▇▇ 0.55                                    48 (  -35): ␣i ▇▇▇▇▇▇▇▇ 0.38
51: se   ▇▇▇▇ 0.55                                    49 (  -20): to ▇▇▇▇▇▇▇▇ 0.38
53: as   ▇▇▇▇ 0.52                                    51 (  -39): n␣ ▇▇▇▇▇▇▇▇ 0.37
58: me   ▇▇▇▇ 0.50                                    53 (  -29): nd ▇▇▇▇▇▇▇▇ 0.37
60: co   ▇▇▇ 0.47                                     55 (  -51): ␣a ▇▇▇▇▇▇▇▇ 0.36
62: ro   ▇▇▇ 0.46                                     64 (  -32): ng ▇▇▇▇▇▇▇ 0.32
65: de   ▇▇▇ 0.44                                     69 (  -49): ␣c ▇▇▇▇▇▇ 0.29
74: ra   ▇▇▇ 0.37                                     72 (  -37): ␣f ▇▇▇▇▇▇ 0.29
84: el   ▇▇ 0.33                                      90 (  -72): r␣ ▇▇▇▇▇ 0.25
94: et   ▇▇ 0.29                                     102 (  -72): ou ▇▇▇▇▇ 0.23
187: ex  ▇ 0.12                                      104 (  -95): d␣ ▇▇▇▇▇ 0.23
307: :␣   0.04                                       111 (  -80): ␣b ▇▇▇▇ 0.21
564: --   0.01                                       113 (  -74): ␣p ▇▇▇▇ 0.21
826: =␣   0.00                                       138 ( -122): ␣o ▇▇▇▇ 0.18
836: ␣=   0.00                                       170 ( -130): ␣m ▇▇▇ 0.15
992: ;⏎   0.00                                       225 ( -203): y␣ ▇▇ 0.11
1119: ,⏎  0.00                                       229 ( -210): ␣w ▇▇ 0.11
???: ⏎⏎   0.00                                       258 ( -230): o␣ ▇▇ 0.09
iweb vs Granite Code: bigrams, whitespace ignored
───────────────────────iweb─────────────────────── ───────────────────────code───────────────────────
 1: th  ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 3.10                  1 (  +4): er ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1.28
 2: he  ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.53                      2 (  +5): on ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1.26
 3: in  ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.36                       3 (  +3): re ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1.19
 4: an  ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1.90                           4 (  -1): in ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1.18
 5: er  ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1.85                           5 ( +12): te ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1.13
 6: re  ▇▇▇▇▇▇▇▇▇▇▇▇▇ 1.74                            6 ( +22): se ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1.02
 7: on  ▇▇▇▇▇▇▇▇▇▇▇ 1.51                              7 (  +2): or ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.96
 8: at  ▇▇▇▇▇▇▇▇▇▇ 1.31                               8 ( +11): st ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.91
 9: or  ▇▇▇▇▇▇▇▇▇ 1.27                                9 (  -1): at ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.90
10: nd  ▇▇▇▇▇▇▇▇▇ 1.27                               10 (  +2): es ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.86
11: en  ▇▇▇▇▇▇▇▇▇ 1.22                               11 ( +26): de ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.84
12: es  ▇▇▇▇▇▇▇▇▇ 1.20                               12 (  -1): en ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.83
13: to  ▇▇▇▇▇▇▇▇▇ 1.16                               13 ( +20): co ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.81
14: ou  ▇▇▇▇▇▇▇▇▇ 1.16                               14 ( +13): le ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.79
15: ng  ▇▇▇▇▇▇▇▇ 1.11                                15 ( +10): nt ▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.78
16: it  ▇▇▇▇▇▇▇▇ 1.09                                16 (  +4): ar ▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.75
17: te  ▇▇▇▇▇▇▇▇ 1.07                                17 (  +1): ti ▇▇▇▇▇▇▇▇▇▇▇▇ 0.72
18: ti  ▇▇▇▇▇▇▇▇ 1.06                                18 (+438): -- ▇▇▇▇▇▇▇▇▇▇▇▇ 0.70
19: st  ▇▇▇▇▇▇▇▇ 1.05                                19 ( -18): th ▇▇▇▇▇▇▇▇▇▇▇▇ 0.68
20: ar  ▇▇▇▇▇▇▇▇ 1.03                                20 (  +1): al ▇▇▇▇▇▇▇▇▇▇▇ 0.66
21: al  ▇▇▇▇▇▇▇ 0.97                                 21 ( +14): ro ▇▇▇▇▇▇▇▇▇▇▇ 0.65
22: is  ▇▇▇▇▇▇▇ 0.96                                 22 ( +10): me ▇▇▇▇▇▇▇▇▇▇▇ 0.62
23: ed  ▇▇▇▇▇▇▇ 0.96                                 23 ( +29): el ▇▇▇▇▇▇▇▇▇▇ 0.61
24: ha  ▇▇▇▇▇▇▇ 0.93                                 24 ( +36): et ▇▇▇▇▇▇▇▇▇▇ 0.58
25: nt  ▇▇▇▇▇▇▇ 0.90                                 25 (  +4): as ▇▇▇▇▇▇▇▇▇ 0.55
26: ve  ▇▇▇▇▇▇ 0.86                                  26 ( -22): an ▇▇▇▇▇▇▇▇▇ 0.53
27: le  ▇▇▇▇▇▇ 0.84                                  27 (+117): ex ▇▇▇▇▇▇▇▇▇ 0.52
28: se  ▇▇▇▇▇▇ 0.84                                  28 ( -12): it ▇▇▇▇▇▇▇▇▇ 0.51
29: as  ▇▇▇▇▇▇ 0.79                                  29 ( +14): ra ▇▇▇▇▇▇▇▇▇ 0.50
30: ea  ▇▇▇▇▇▇ 0.77                                  30 ( +26): ta ▇▇▇▇▇▇▇▇▇ 0.50
31: of  ▇▇▇▇▇▇ 0.76                                  31 (  -9): is ▇▇▇▇▇▇▇▇ 0.50
32: me  ▇▇▇▇▇▇ 0.76                                  32 ( +55): rt ▇▇▇▇▇▇▇▇ 0.49
33: co  ▇▇▇▇▇ 0.71                                   33 ( -31): he ▇▇▇▇▇▇▇▇ 0.49
34: ll  ▇▇▇▇▇ 0.70                                   34 (  +2): ne ▇▇▇▇▇▇▇▇ 0.48
35: ro  ▇▇▇▇▇ 0.69                                   35 ( +45): ct ▇▇▇▇▇▇▇▇ 0.48
36: ne  ▇▇▇▇▇ 0.69                                   36 ( -23): to ▇▇▇▇▇▇▇▇ 0.48
37: de  ▇▇▇▇▇ 0.67                                   37 ( -14): ed ▇▇▇▇▇▇▇▇ 0.48
38: hi  ▇▇▇▇▇ 0.66                                   38 ( +44): tr ▇▇▇▇▇▇▇▇ 0.47
39: ri  ▇▇▇▇▇ 0.62                                   39 ( -29): nd ▇▇▇▇▇▇▇▇ 0.47
40: li  ▇▇▇▇ 0.60                                    40 (  +2): io ▇▇▇▇▇▇▇▇ 0.45
42: io  ▇▇▇▇ 0.58                                    41 (  -1): li ▇▇▇▇▇▇▇▇ 0.45
43: ra  ▇▇▇▇ 0.57                                    42 (  -3): ri ▇▇▇▇▇▇▇▇ 0.45
52: el  ▇▇▇▇ 0.51                                    47 ( -32): ng ▇▇▇▇▇▇▇ 0.41
56: ta  ▇▇▇▇ 0.49                                    67 ( -41): ve ▇▇▇▇▇ 0.31
60: et  ▇▇▇ 0.45                                     75 ( -61): ou ▇▇▇▇▇ 0.29
80: ct  ▇▇▇ 0.37                                     76 ( -46): ea ▇▇▇▇▇ 0.29
82: tr  ▇▇▇ 0.36                                     96 ( -62): ll ▇▇▇▇ 0.24
87: rt  ▇▇▇ 0.34                                    120 ( -96): ha ▇▇▇ 0.20
144: ex ▇ 0.19                                      132 ( -94): hi ▇▇▇ 0.18
456: --  0.01                                       207 (-176): of ▇▇ 0.11
iweb vs Granite Code: trigrams
───────────────────────iweb─────────────────────── ───────────────────────code───────────────────────
 1: ␣th    ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1.57                     1 ( +7138): --- ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.42
 2: the    ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 1.29                        2 ( +4046): ␣=␣ ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.39
 3: he␣    ▇▇▇▇▇▇▇▇▇▇▇▇ 1.03                            3 (   +21): ion ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.33
 4: ␣an    ▇▇▇▇▇▇▇▇ 0.63                                4 (   +25): ent ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.29
 5: ing    ▇▇▇▇▇▇▇ 0.61                                 5 (  +112): con ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.28
 6: nd␣    ▇▇▇▇▇▇▇ 0.61                                 6 (   +30): tio ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.27
 7: and    ▇▇▇▇▇▇▇ 0.59                                 7 (???): ␣{⏎    ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.25
 8: ␣to    ▇▇▇▇▇▇▇ 0.58                                 8 (    -6): the ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.22
 9: ng␣    ▇▇▇▇▇▇▇ 0.54                                 9 (    -4): ing ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.21
10: to␣    ▇▇▇▇▇▇ 0.53                                 10 (   +64): ate ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.21
11: ␣in    ▇▇▇▇▇▇ 0.50                                 11 (  +653): sel ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.20
12: ed␣    ▇▇▇▇▇▇ 0.48                                 12 (   -11): ␣th ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.20
13: ␣of    ▇▇▇▇▇▇ 0.47                                 13 (???): ␣␣␣    ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.20
14: of␣    ▇▇▇▇▇ 0.43                                  14 (  +456): ass ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.19
15: ␣a␣    ▇▇▇▇▇ 0.40                                  15 (  +153): ect ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.18
16: er␣    ▇▇▇▇▇ 0.40                                  16 (  +121): ons ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.18
17: is␣    ▇▇▇▇ 0.36                                   17 (  +213): ort ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.18
18: in␣    ▇▇▇▇ 0.35                                   18 (+12047): );⏎ ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.18
19: ␣co    ▇▇▇▇ 0.35                                   19 (  +375): rt␣ ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.17
20: re␣    ▇▇▇▇ 0.35                                   20 (  +398): ser ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.17
21: on␣    ▇▇▇▇ 0.35                                   21 ( +1257): elf ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.17
22: e␣t    ▇▇▇▇ 0.34                                   22 ( +1451): def ▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.16
23: s␣a    ▇▇▇▇ 0.33                                   23 (  +275): ame ▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.16
24: ion    ▇▇▇▇ 0.33                                   24 (  +379): por ▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.16
25: at␣    ▇▇▇▇ 0.32                                   25 (   +31): ter ▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.16
26: or␣    ▇▇▇▇ 0.32                                   26 (  +124): est ▇▇▇▇▇▇▇▇▇▇▇▇ 0.15
27: es␣    ▇▇▇▇ 0.30                                   27 (   -16): ␣in ▇▇▇▇▇▇▇▇▇▇▇▇ 0.15
28: e␣a    ▇▇▇▇ 0.30                                   28 ( +1089): val ▇▇▇▇▇▇▇▇▇▇▇▇ 0.15
29: ent    ▇▇▇▇ 0.29                                   29 (???): ⏎}⏎    ▇▇▇▇▇▇▇▇▇▇▇▇ 0.15
30: ␣re    ▇▇▇ 0.29                                    30 (    +2): for ▇▇▇▇▇▇▇▇▇▇▇▇ 0.15
31: ␣be    ▇▇▇ 0.29                                    31 (  +550): exp ▇▇▇▇▇▇▇▇▇▇▇▇ 0.15
32: for    ▇▇▇ 0.28                                    32 (  +678): ert ▇▇▇▇▇▇▇▇▇▇▇▇ 0.15
33: you    ▇▇▇ 0.27                                    33 ( +1697): typ ▇▇▇▇▇▇▇▇▇▇▇▇ 0.15
34: ␣fo    ▇▇▇ 0.27                                    34 (  +118): one ▇▇▇▇▇▇▇▇▇▇▇ 0.14
35: ␣yo    ▇▇▇ 0.27                                    35 ( +1899): ype ▇▇▇▇▇▇▇▇▇▇▇ 0.14
36: tio    ▇▇▇ 0.26                                    36 (  +236): str ▇▇▇▇▇▇▇▇▇▇▇ 0.14
37: as␣    ▇▇▇ 0.26                                    37 (  +790): ext ▇▇▇▇▇▇▇▇▇▇▇ 0.14
38: ␣wi    ▇▇▇ 0.26                                    38 (  +888): dat ▇▇▇▇▇▇▇▇▇▇▇ 0.14
39: n␣t    ▇▇▇ 0.25                                    39 (  +710): col ▇▇▇▇▇▇▇▇▇▇▇ 0.13
40: s␣t    ▇▇▇ 0.25                                    40 (   -10): ␣re ▇▇▇▇▇▇▇▇▇▇▇ 0.13
56: ter    ▇▇ 0.20                                     44 (   -41): he␣ ▇▇▇▇▇▇▇▇▇▇▇ 0.13
74: ate    ▇▇ 0.18                                     52 (   -33): ␣co ▇▇▇▇▇▇▇▇▇▇ 0.13
117: con   ▇▇ 0.13                                     81 (   -60): on␣ ▇▇▇▇▇▇▇▇▇ 0.11
137: ons   ▇ 0.12                                      89 (   -63): or␣ ▇▇▇▇▇▇▇▇ 0.10
150: est   ▇ 0.11                                      90 (   -78): ed␣ ▇▇▇▇▇▇▇▇ 0.10
152: one   ▇ 0.11                                      98 (   -91): and ▇▇▇▇▇▇▇ 0.09
168: ect   ▇ 0.11                                     111 (   -94): is␣ ▇▇▇▇▇▇▇ 0.09
230: ort   ▇ 0.08                                     112 (   -96): er␣ ▇▇▇▇▇▇▇ 0.09
272: str   ▇ 0.07                                     147 (  -139): ␣to ▇▇▇▇▇▇ 0.08
298: ame   ▇ 0.07                                     152 (  -134): in␣ ▇▇▇▇▇▇ 0.07
394: rt␣   ▇ 0.06                                     154 (  -127): es␣ ▇▇▇▇▇▇ 0.07
403: por   ▇ 0.05                                     166 (  -162): ␣an ▇▇▇▇▇▇ 0.07
418: ser   ▇ 0.05                                     180 (  -146): ␣fo ▇▇▇▇▇▇ 0.07
470: ass   ▇ 0.05                                     201 (  -192): ng␣ ▇▇▇▇▇ 0.07
581: exp    0.04                                      214 (  -204): to␣ ▇▇▇▇▇ 0.06
664: sel    0.03                                      253 (  -238): ␣a␣ ▇▇▇▇▇ 0.06
710: ert    0.03                                      262 (  -248): of␣ ▇▇▇▇ 0.06
749: col    0.03                                      282 (  -276): nd␣ ▇▇▇▇ 0.05
827: ext    0.03                                      294 (  -281): ␣of ▇▇▇▇ 0.05
926: dat    0.02                                      308 (  -271): as␣ ▇▇▇▇ 0.05
1117: val   0.02                                      316 (  -294): e␣t ▇▇▇▇ 0.05
1278: elf   0.02                                      326 (  -306): re␣ ▇▇▇▇ 0.05
1473: def   0.01                                      374 (  -351): s␣a ▇▇▇ 0.04
1730: typ   0.01                                      396 (  -368): e␣a ▇▇▇ 0.04
1934: ype   0.01                                      407 (  -368): n␣t ▇▇▇ 0.04
4048: ␣=␣   0.00                                      416 (  -385): ␣be ▇▇▇ 0.04
7139: ---   0.00                                      433 (  -395): ␣wi ▇▇▇ 0.04
12065: );⏎  0.00                                      442 (  -402): s␣t ▇▇▇ 0.04
???: ⏎}⏎    0.00                                      649 (  -624): at␣ ▇▇ 0.03
???: ␣{⏎    0.00                                     1546 ( -1513): you ▇ 0.01
???: ␣␣␣    0.00                                     3114 ( -3079): ␣yo  0.01
iweb vs Granite Code: trigrams, whitespace ignored
───────────────────────iweb─────────────────────── ───────────────────────code───────────────────────
 1: the   ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 2.63                    1 (+4702): --- ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.61
 2: ing   ▇▇▇▇▇▇▇▇▇▇ 1.25                              2 (   +2): ion ▇▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.47
 3: and   ▇▇▇▇▇▇▇▇▇ 1.19                               3 (   +2): ent ▇▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.42
 4: ion   ▇▇▇▇▇ 0.67                                   4 (  +25): con ▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.41
 5: ent   ▇▇▇▇ 0.59                                    5 (   +3): tio ▇▇▇▇▇▇▇▇▇▇▇▇▇ 0.39
 6: for   ▇▇▇▇ 0.57                                    6 (   -5): the ▇▇▇▇▇▇▇▇▇▇ 0.31
 7: you   ▇▇▇▇ 0.55                                    7 (   -5): ing ▇▇▇▇▇▇▇▇▇▇ 0.30
 8: tio   ▇▇▇▇ 0.53                                    8 (   +9): ate ▇▇▇▇▇▇▇▇▇▇ 0.30
 9: hat   ▇▇▇▇ 0.48                                    9 ( +316): sel ▇▇▇▇▇▇▇▇▇▇ 0.29
10: tha   ▇▇▇ 0.46                                    10 ( +194): ass ▇▇▇▇▇▇▇▇▇ 0.27
11: her   ▇▇▇ 0.45                                    11 (  +35): ect ▇▇▇▇▇▇▇▇▇ 0.26
12: ter   ▇▇▇ 0.41                                    12 (  +24): ons ▇▇▇▇▇▇▇▇ 0.25
13: all   ▇▇▇ 0.39                                    13 (  +57): ort ▇▇▇▇▇▇▇▇ 0.25
14: ati   ▇▇▇ 0.38                                    14 ( +162): ser ▇▇▇▇▇▇▇▇ 0.25
15: thi   ▇▇▇ 0.36                                    15 ( +733): elf ▇▇▇▇▇▇▇▇ 0.24
16: ver   ▇▇▇ 0.36                                    16 ( +882): def ▇▇▇▇▇▇▇▇ 0.24
17: ate   ▇▇▇ 0.36                                    17 (  +95): ame ▇▇▇▇▇▇▇▇ 0.23
18: our   ▇▇▇ 0.36                                    18 ( +152): por ▇▇▇▇▇▇▇▇ 0.23
19: are   ▇▇▇ 0.34                                    19 (   -7): ter ▇▇▇▇▇▇▇ 0.23
20: ere   ▇▇▇ 0.34                                    20 (  +24): est ▇▇▇▇▇▇▇ 0.22
21: ith   ▇▇▇ 0.34                                    21 ( +617): val ▇▇▇▇▇▇▇ 0.22
22: wit   ▇▇ 0.33                                     22 (  -16): for ▇▇▇▇▇▇▇ 0.22
23: ers   ▇▇ 0.33                                     23 ( +250): exp ▇▇▇▇▇▇▇ 0.22
24: his   ▇▇ 0.32                                     24 ( +328): ert ▇▇▇▇▇▇▇ 0.21
25: pro   ▇▇ 0.30                                     25 (+1063): typ ▇▇▇▇▇▇▇ 0.21
26: rea   ▇▇ 0.29                                     26 (  +19): one ▇▇▇▇▇▇▇ 0.20
27: res   ▇▇ 0.27                                     27 (+1205): ype ▇▇▇▇▇▇▇ 0.20
28: eve   ▇▇ 0.27                                     28 (  +70): str ▇▇▇▇▇▇▇ 0.20
29: con   ▇▇ 0.27                                     29 ( +402): ext ▇▇▇▇▇▇▇ 0.20
30: com   ▇▇ 0.27                                     30 ( +465): dat ▇▇▇▇▇▇ 0.20
31: ill   ▇▇ 0.26                                     31 ( +349): col ▇▇▇▇▇▇ 0.19
32: ive   ▇▇ 0.24                                     32 ( +166): tes ▇▇▇▇▇▇ 0.19
33: out   ▇▇ 0.24                                     33 (  +33): der ▇▇▇▇▇▇ 0.19
34: ess   ▇▇ 0.24                                     34 ( +681): mpo ▇▇▇▇▇▇ 0.19
35: ome   ▇▇ 0.24                                     35 (  +32): nte ▇▇▇▇▇▇ 0.19
36: ons   ▇▇ 0.24                                     36 (  +86): ont ▇▇▇▇▇▇ 0.19
37: ted   ▇▇ 0.24                                     37 (  +67): tur ▇▇▇▇▇▇ 0.19
38: ave   ▇▇ 0.24                                     38 (  -11): res ▇▇▇▇▇▇ 0.19
39: nce   ▇▇ 0.24                                     39 (   +4): sta ▇▇▇▇▇▇ 0.18
40: men   ▇▇ 0.24                                     40 (  -15): pro ▇▇▇▇▇▇ 0.18
43: sta   ▇▇ 0.23                                     41 (   -1): men ▇▇▇▇▇▇ 0.18
44: est   ▇▇ 0.23                                     42 (  -16): rea ▇▇▇▇▇▇ 0.18
45: one   ▇▇ 0.23                                     47 (  -33): ati ▇▇▇▇▇▇ 0.17
46: ect   ▇▇ 0.22                                     60 (  -44): ver ▇▇▇▇▇ 0.16
66: der   ▇ 0.17                                      66 (  -36): com ▇▇▇▇▇ 0.14
67: nte   ▇ 0.17                                      71 (  -68): and ▇▇▇▇ 0.13
70: ort   ▇ 0.17                                      74 (  -37): ted ▇▇▇▇ 0.13
98: str   ▇ 0.15                                      95 (  -82): all ▇▇▇▇ 0.11
104: tur  ▇ 0.14                                     106 (  -83): ers ▇▇▇▇ 0.11
112: ame  ▇ 0.14                                     111 (  -96): thi ▇▇▇ 0.10
122: ont  ▇ 0.13                                     115 (  -81): ess ▇▇▇ 0.10
170: por  ▇ 0.11                                     127 ( -103): his ▇▇▇ 0.10
176: ser  ▇ 0.11                                     141 ( -130): her ▇▇▇ 0.09
198: tes  ▇ 0.10                                     145 ( -124): ith ▇▇▇ 0.09
204: ass  ▇ 0.10                                     155 ( -133): wit ▇▇▇ 0.09
273: exp  ▇ 0.08                                     162 ( -129): out ▇▇▇ 0.08
325: sel  ▇ 0.07                                     179 ( -160): are ▇▇▇ 0.08
352: ert  ▇ 0.07                                     189 ( -169): ere ▇▇ 0.07
380: col   0.06                                      193 ( -161): ive ▇▇ 0.07
431: ext   0.06                                      199 ( -160): nce ▇▇ 0.07
495: dat   0.05                                      243 ( -215): eve ▇▇ 0.06
638: val   0.04                                      319 ( -284): ome ▇▇ 0.05
715: mpo   0.04                                      330 ( -312): our ▇▇ 0.05
748: elf   0.03                                      505 ( -495): tha ▇ 0.04
898: def   0.03                                      517 ( -486): ill ▇ 0.03
1088: typ  0.02                                      603 ( -594): hat ▇ 0.03
1232: ype  0.02                                      826 ( -788): ave ▇ 0.02
4703: ---  0.00                                     1022 (-1015): you ▇ 0.02

Want a version of this corpus with slight modifications?

To get this corpus with slight modifications, such as:

  • Converting upper case characters to lower case
  • Ignoring whitespace (removing ngrams with whitespace)
  • Ignoring ngrams with one or more special symbols or other characters

you can use the ngram_show tool from granite-tools (v0.2.0+).
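
As a rough illustration of what these modifications mean, the same transformations can be sketched on an in-memory mapping from ngram to frequency. This is not how ngram_show is implemented; the function names are hypothetical, and "special symbols or other characters" is assumed here to mean anything that is not an ASCII letter.

```python
from collections import Counter

# Hypothetical helpers for illustration only; not part of granite-tools.


def lowercase_ngrams(freqs: dict[str, float]) -> dict[str, float]:
    """Merge ngrams that differ only by case, summing their frequencies."""
    merged = Counter()
    for ngram, freq in freqs.items():
        merged[ngram.lower()] += freq
    return dict(merged)


def drop_whitespace_ngrams(freqs: dict[str, float]) -> dict[str, float]:
    """Remove every ngram that contains at least one whitespace character."""
    return {
        ngram: freq
        for ngram, freq in freqs.items()
        if not any(ch.isspace() for ch in ngram)
    }


def drop_symbol_ngrams(freqs: dict[str, float]) -> dict[str, float]:
    """Keep only ngrams made entirely of ASCII letters (assumed definition)."""
    return {
        ngram: freq
        for ngram, freq in freqs.items()
        if all(ch.isascii() and ch.isalpha() for ch in ngram)
    }


# Toy bigram frequencies, not taken from the real corpus.
bigrams = {"Th": 2.0, "th": 5.0, "e ": 4.0, "()": 3.0}
print(drop_whitespace_ngrams(lowercase_ngrams(bigrams)))
```

Note that after dropping ngrams the remaining frequencies no longer sum to the original total, so a renormalization step would usually follow.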

How the ngrams were created

The code ngrams are based on source code from multiple popular open source repositories. The following repositories were cloned on 2024-09-30. The latest commit hash and the dominant language(s) are shown in parentheses:

All the code from .py, .pyi, .rs, .js, .jsx, .ts, .tsx, .css, .scss and .less files was collected from the above repositories into a single corpus file per language. Whitespace on the left side of each line (indentation) was removed when reading the files, because otherwise the space character would carry most of the weight.
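
To give a feel for why indentation is stripped, the following sketch compares the share of space characters in a small made-up snippet before and after removing the leading whitespace. The sample text and the `space_share` helper are assumptions for illustration; the numbers say nothing about the real corpus.

```python
from collections import Counter

# Made-up sample, only to show the effect of indentation on the space count.
sample = (
    "def f(x):\n"
    "    if x:\n"
    "        return 1\n"
    "    return 0\n"
)


def space_share(text: str) -> float:
    """Fraction of the characters in `text` that are spaces."""
    counts = Counter(text)
    return counts[" "] / len(text)


# Same idea as in the extraction script: lstrip() every line.
stripped = "\n".join(line.lstrip() for line in sample.split("\n"))

print(f"with indentation:    {space_share(sample):.2f}")
print(f"without indentation: {space_share(stripped):.2f}")
```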

Script for extracting the code corpus from the repos
from pathlib import Path

raw_folder_root = Path(__file__).parent / "raw"
corpus_folder = (Path(__file__).parent / "corpus-not-clean").resolve()
corpus_folder.mkdir(exist_ok=True)

# Maps a file extension to the language corpus it is collected into.
extension_mapping = {
    ".py": "python",
    ".pyi": "python",
    ".rs": "rust",
    ".js": "javascript",
    ".jsx": "javascript",
    ".ts": "typescript",
    ".tsx": "typescript",
    ".css": "css",
    ".scss": "css",
    ".less": "css",
}


def iter_files(folder: Path):
    """Yield (path, language) for every file with a known extension."""
    for file in folder.rglob("*.*"):
        if not file.is_file():
            continue
        if file.suffix in extension_mapping:
            yield file, extension_mapping[file.suffix]


if __name__ == "__main__":
    # One output file per language. Several extensions map to the same
    # language, so deduplicate the names before opening the files.
    files = {
        lang: open(corpus_folder / f"{lang}.txt", "w")
        for lang in set(extension_mapping.values())
    }

    try:
        for file, lang in iter_files(raw_folder_root):
            try:
                contents = file.read_text()
                # Remove indentation
                contents_pruned = "\n".join(c.lstrip() for c in contents.split("\n"))
                files[lang].write(contents_pruned)
            except Exception as e:
                print(f"Error while reading {file}: {e}. Skipping.")
    finally:
        # Close the output files even if something unexpected goes wrong.
        for f in files.values():
            f.close()

Resulting file structure:

📁 corpus-not-clean/
├─📄 javascript.txt  # 142.6 MB
├─📄 python.txt      # 94.2 MB
├─📄 typescript.txt  # 80.3 MB
├─📄 rust.txt        # 29.2 MB
└─📄 css.txt         # 33.0 MB

After cleaning the data (see below), the ngrams binary from the Keyboard Layout Optimizer was used to form the ngrams for each language (ngrams/javascript, ngrams/python, ngrams/typescript, ngrams/rust, ngrams/css).
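
For readers unfamiliar with character ngrams, the sketch below shows the basic idea of counting overlapping unigrams, bigrams and trigrams in a piece of text. It is only an illustration in Python; the actual ngrams were produced with the ngrams binary mentioned above, whose exact behavior (for example at file boundaries) is not reproduced here. The `count_ngrams` name is hypothetical.

```python
from collections import Counter


def count_ngrams(text: str, n: int) -> Counter:
    """Count all overlapping character ngrams of length `n` in `text`."""
    return Counter(text[i : i + n] for i in range(len(text) - n + 1))


# Toy input, not taken from the real corpus.
text = "self.x = x\n"

unigrams = count_ngrams(text, 1)
bigrams = count_ngrams(text, 2)
trigrams = count_ngrams(text, 3)

print(unigrams.most_common(3))
print(bigrams[" ="])
print(trigrams[" = "])
```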

The Granite Code dataset

The Granite Code ngrams were created from the python, rust, javascript, typescript and the css ngrams by first normalizing the ngram files using the normalize.py script from the Keyboard Layout Optimizer

❯ python ./scripts/ngrams/normalize.py /somepath/ngrams/<corpusname>

and then merging the ngram files with weighting using the ngram_merge binary from the Keyboard Layout Optimizer:

❯ ./target/release/ngram_merge code/ngrams/code code/ngrams/python:0.4 code/ngrams/rust:0.1 code/ngrams/javascript:0.2 code/ngrams/typescript:0.2 code/ngrams/css:0.1

The Keyboard Layout Optimizer version used was commit f93bd06.
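
The normalize-then-merge idea can be sketched in a few lines of Python. This is not the code of normalize.py or ngram_merge. It assumes that normalizing means scaling each corpus so its frequencies sum to 1, and that merging is a weighted sum of the normalized frequencies. The function names and the toy counts are hypothetical.

```python
def normalize(counts: dict[str, float]) -> dict[str, float]:
    """Scale the values so that they sum to 1 (assumed meaning of normalizing)."""
    total = sum(counts.values())
    return {ngram: value / total for ngram, value in counts.items()}


def merge_weighted(
    corpora: dict[str, dict[str, float]],
    weights: dict[str, float],
) -> dict[str, float]:
    """Weighted sum of the normalized ngram frequencies of each corpus."""
    merged: dict[str, float] = {}
    for name, counts in corpora.items():
        for ngram, freq in normalize(counts).items():
            merged[ngram] = merged.get(ngram, 0.0) + weights[name] * freq
    return merged


# Toy bigram counts and weights, not taken from the real corpora.
corpora = {
    "python": {"de": 30, "f ": 10},
    "css": {"de": 5, "x;": 15},
}
weights = {"python": 0.8, "css": 0.2}

merged = merge_weighted(corpora, weights)
for ngram, freq in sorted(merged.items(), key=lambda item: -item[1]):
    print(repr(ngram), round(freq, 3))
```

Under these assumptions, normalizing first means that the size of each text corpus (for example 142.6 MB of JavaScript versus 29.2 MB of Rust) does not influence the result; only the chosen weights do.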

Cleaning the datasets

Before creating the ngrams, only the following ASCII alphanumerics (62 characters):

qwertyuiopasdfghjklzxcvbnmQWERTYUIOPASDFGHJKLZXCVBNM1234567890

and the following punctuation and special characters (33 characters):

,.!?;:_'"^~#%&/\()[]{}<>=+-*`@$€|

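were accepted into the ngrams, together with the whitespace characters space, tab and newline (see TYPABLE_CHARS in the script below). The conversion script below was used to either remove or replace the characters that belong to the following set: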

¡£¤§«­®°²³´¶·¹»½¾¿ÁÃÅÆÉÌÓ×ØßàáâãåæçèéêëìíîïðñòóôõøúûüýāăćčīłŋōŠšūŽžə͜͡αβεηικλμνοπρςστυАЩавдеиклмнорстأابةتخدرسعقلمنهويนรลอาเ​‒–—―‘’“”•…₹™→−─│┆┌┐└┘┬┴═╞╡╪♪♫月语𞤫
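
A set like the one above can be collected by scanning a corpus for characters outside the allowed set. The sketch below shows one way to do that; whether the list above was produced this way is an assumption, and `find_disallowed` is a hypothetical helper.

```python
# Allowed characters, as defined in this README (plus whitespace).
ALLOWED = set(
    "qwertyuiopasdfghjklzxcvbnmQWERTYUIOPASDFGHJKLZXCVBNM1234567890"
    r""",.!?;:_'"^~#%&/\()[]{}<>=+-*`@$€|"""
    " \t\n"
)


def find_disallowed(text: str, allowed: set[str]) -> str:
    """Return the distinct characters of `text` not in `allowed`, sorted by code point."""
    return "".join(sorted(set(text) - allowed))


# Toy input, not taken from the real corpus.
print(find_disallowed("naïve café… 月 x = 1", ALLOWED))
```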
Corpora cleanup script

This is the cleanup script which was used to clean the English, Code and Finnish corpora.

from pathlib import Path

root = Path(__file__).parent.parent


TYPABLE_CHARS = "qwertyuiopasdfghjklzxcvbnmQWERTYUIOPASDFGHJKLZXCVBNM1234567890"
TYPABLE_CHARS += r""",.!?;:_'"^~#%&/\()[]{}<>=+-*`@$€|"""
TYPABLE_CHARS += " \t\n"
ALLOWED_CHARACTERS = set(TYPABLE_CHARS)
ALLOWED_CHARACTERS_FINNISH = ALLOWED_CHARACTERS | set("äöÄÖ")

replacements_finnish = {
    "¹": "1",
    "²": "2",
    "³": "3",
    "½": "1/2",
    "¾": "3/4",
    "Á": "A",
    "Ã": "A",
    "Å": "A",
    "Æ": "AE",
    "É": "E",
    "Ì": "I",
    "Ó": "O",
    "×": "x",
    "Ø": "Ö",
    "ß": "ss",
    "à": "a",
    "á": "a",
    "â": "a",
    "ã": "a",
    "å": "a",
    "æ": "ae",
    "ç": "c",
    "è": "e",
    "é": "e",
    "ê": "e",
    "ë": "e",
    "ì": "i",
    "í": "i",
    "î": "i",
    "ï": "i",
    "ð": "d",
    "ñ": "n",
    "ò": "o",
    "ó": "o",
    "ô": "o",
    "õ": "o",
    "ø": "ö",
    "ú": "u",
    "û": "u",
    "ü": "u",
    "ý": "y",
    "ā": "a",
    "ă": "a",
    "ć": "c",
    "č": "c",
    "ī": "i",
    "ł": "l",
    "ŋ": "NG",
    "ō": "o",
    "Š": "S",
    "š": "s",
    "ū": "u",
    "Ž": "Z",
    "ž": "z",
    "α": "a",
    "β": "b",
    "ε": "e",
    "η": "e",
    "ι": "i",
    "κ": "k",
    "λ": "l",
    "μ": "m",
    "ν": "n",
    "ο": "o",
    "π": "p",
    "ρ": "r",
    "ς": "s",
    "σ": "s",
    "τ": "t",
    "υ": "u",
    "А": "A",
    "Щ": "Shch",
    "а": "a",
    "в": "v",
    "д": "d",
    "е": "e",
    "и": "i",
    "к": "k",
    "л": "l",
    "м": "m",
    "н": "n",
    "о": "o",
    "р": "r",
    "с": "s",
    "т": "t",
    "‒": "-",
    "–": "-",
    "—": "--",
    "―": "--",
    "−": "-",
    "─": "-",
    "‘": "'",
    "’": "'",
    "´": "'",
    "“": '"',
    "”": '"',
    "«": '"',
    "»": '"',
    "•": "*",
    "·": "*",
    "…": "...",
    "™": "(tm)",
    "­®": "(r)",
}
replacements = replacements_finnish.copy()
replacements["Ø"] = "O"
replacements["ø"] = "o"


def process_file(input_path, output_path, replacements, allowed_chars):
    chunk_size = 50 * 1024 * 1024  # 50 MB chunk size
    with open(input_path, "r", encoding="utf-8") as infile, open(
        output_path, "w", encoding="utf-8"
    ) as outfile:
        while True:
            chunk = infile.read(chunk_size)
            if not chunk:
                break  # End of file

            for old_str, new_str in replacements.items():
                chunk = chunk.replace(old_str, new_str)
            chunk = "".join(char for char in chunk if char in allowed_chars)
            outfile.write(chunk)


def cleanup_folder(
    folder: Path,
    folder_out: Path,
    allowed_characters: set[str],
    used_replacements: dict[str, str],
):
    folder_out.mkdir(exist_ok=True)
    for file in folder.glob("*.txt"):
        file_out = folder_out / file.name
        print(f"Processing {file} -> {file_out}")
        process_file(file, file_out, used_replacements, allowed_characters)


if __name__ == "__main__":
    import time

    start = time.time()
    for lang in ("english", "code", "finnish"):
        langfolder = root / lang
        allowed_characters = (
            ALLOWED_CHARACTERS_FINNISH if lang == "finnish" else ALLOWED_CHARACTERS
        )
        used_replacements = replacements_finnish if lang == "finnish" else replacements
        cleanup_folder(
            langfolder / "corpus-not-clean",
            langfolder / "corpus-clean",
            allowed_characters,
            used_replacements,
        )
    print("Done in", time.time() - start, "seconds")
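
The effect of the replace-then-filter step in process_file can be tried on a short string. The sketch below uses a small subset of the replacement table and a hypothetical `clean` helper. It mirrors the logic of the script above but is not the script itself.

```python
# Small subset of the replacement table, for illustration only.
replacements = {
    "“": '"',
    "”": '"',
    "—": "--",
    "…": "...",
    "é": "e",
    "ï": "i",
}

allowed = set(
    "qwertyuiopasdfghjklzxcvbnmQWERTYUIOPASDFGHJKLZXCVBNM1234567890"
    r""",.!?;:_'"^~#%&/\()[]{}<>=+-*`@$€|"""
    " \t\n"
)


def clean(text: str) -> str:
    """Apply the replacements first, then drop every character not in `allowed`."""
    for old, new in replacements.items():
        text = text.replace(old, new)
    return "".join(ch for ch in text if ch in allowed)


# Characters with a replacement are converted; the rest (here 月) are removed.
print(repr(clean("“café” — naïve… 月")))
```

The order matters: filtering before replacing would delete characters such as é instead of converting them to e.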

Footnotes

  1. Douglas, Ian. “Keyboard Layout Analysis: Creating the Corpus, Bigram Chains, and Shakespeare's Monkeys”. Zenodo, March 29, 2021. doi.org/10.5281/zenodo.5501838.

  2. See: dariogoetz/keyboard_layout_optimizer/discussions/78

  3. It is unclear whether the corpus has since been superseded by larger corpora, as the current version of the Colemak Design page refers to English Letter Frequency Counts: Mayzner Revisited or ETAOIN SRHLDCU.
