forked from enz/german-wordlist
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathwords-with-special-characters.txt
79 lines (55 loc) · 2.72 KB
/
words-with-special-characters.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
-----------------------------------------------------------------------
File: words-with-special-characters.txt, Stefan Hippler, 4 January 2023
-----------------------------------------------------------------------
Note: special characters can also be called letters with diacritics...
Analysis is done using the standard Unix comands grep and wc.
The output of wc (word count) shows in the first column the number of newlines.
The second column shows the number of words. The third column shows the number of bytes.
The last column shows the analyzed file name.
Analyzed is the orignal german wordlist file, named words, from Markus Enzberger.
See: [ https://github.com/enz/german-wordlist ]
---> Simple analysis <---
a) Total numbers of words in words:
wc words -> 685683 685689 8642725 words
Note: discrepancy, 6 words more than lines
b) German special character ß (Eszett)
grep ß words |wc
ß: 11974 11974 168134
c) French special lowercase characters '[àâçèéêëîïôùûÿæœ]'
à: 7 9 57
â: 3 3 36
ç: 14 14 165
è: 72 72 788
é: 296 296 3023
ê: 8 8 64
ë: 1 1 6
î: 2 2 17
ï: 0 0 0
ô: 0 0 0
ù: 0 0 0
û: 2 2 19
ÿ: 0 0 0
æ: 0 0 0
œ: 5 5 43
Total: 410 412 4218
grep '[àâçèéêëîïôùûÿæœ]' words |wc
404 406 4131
ie 12 words have 2 special characters:
Abrégé, Abrégés, Déshabillé, Déshabillés, Négligé, Négligés,
Résumé, Résumés, Séparée, Séparées, Variété, Variétés
d) French special uppercase characters [ÀÂÇÈÉËÎÏÔÙÛŸÆŒ]
[ÀÂÇÈ]: 0
É: 3 3 41
[ËÎÏ]: 0
[ÔÙÛŸÆ]: 0
Œ: 2 2 15
grep '[ÀÂÇÈÉËÎÏÔÙÛŸÆŒ]' words |wc -> 5 5 56
Égalité, Étalonsprache, Étalonsprachen, Œuvre, Œuvres
---------------------------------------------------------
Cross-check:
---------------------------------------------------------
wc words -> 685683 685689 8642725 # total lines, words, bytes
grep -v '[ßàâçèéêëîïôùûÿæœÀÂÇÈÉËÎÏÔÙÛŸÆŒ]' words | wc -> 673305 673309 8470474 # words without any of the special chars listed
grep '[ßàâçèéêëîïôùûÿæœÀÂÇÈÉËÎÏÔÙÛŸÆŒ]' words | wc -> 12378 12380 172251 # words with any of the special chars listed
Difference: -> 0 0 0 # ok, very nice
# end of file