Commit

fix: minor edits for the documentation of multi-language stats script
timotheeg committed Feb 18, 2022
1 parent 1997ce3 commit bfdd9ea
Showing 1 changed file with 5 additions and 4 deletions.
9 changes: 5 additions & 4 deletions scripts/202202117_multi-language-stats/readme.md
@@ -1,6 +1,6 @@
## Context

-Script was written to understand multi-language patterns from fromSG use cases. Specifically requireent from Pearly were:
+Script was written to understand multi-language patterns from fromSG use cases. Specifically, requirements from Pearly were:

- how many forms use another language apart from english
- number of submissions for these forms
@@ -9,11 +9,11 @@ Script was written to understand multi-language patterns from fromSG use cases.
- some examples of these forms would be great
- any other related data you think might be interesting!

-Detecting language is hard, especially since many form use multiple language for the same text blocks. Singapore has 4 primary languages, so the script attempt to look for these specifically. Chinese and Tamil are easy to spot thanks to dedicated unicode character ranges. Distinguishing English and Malay is harder, so the script uses a very crude heuristic to locate words from these languages that we expect to see in forms.
+Detecting language is hard, especially since many forms use multiple languages in the same text blocks. Singapore has 4 primary languages though, so the script attempts to look for these specifically. Chinese and Tamil are easy to identify thanks their using dedicated unicode character ranges. Distinguishing English and Malay is harder, since they both use latin characters. For these 2, the script uses a very crude heuristic to locate words from these languages that we see frequently in forms.

-Anything with no match is categorized as `unknown`, and is likely English with funky words, like [this form](https://form.gov.sg/#!/5e0c9534df378700118f3349).
+Anything with no match is categorized as `unknown`, and is likely English with "funky words", like [this form](https://form.gov.sg/#!/5e0c9534df378700118f3349).
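
The detection approach described in this hunk can be sketched roughly as follows. This is an illustration only, not the script's actual code: the Unicode ranges, the word lists, and the `detect_languages` name are all assumptions chosen for the example, and the real script may use different ranges and a much longer vocabulary.

```python
import re

# Assumed Unicode blocks; the actual script may check other ranges.
CHINESE_RE = re.compile(r"[\u4e00-\u9fff]")  # CJK Unified Ideographs
TAMIL_RE = re.compile(r"[\u0b80-\u0bff]")    # Tamil block

# Hypothetical word lists standing in for "words we see frequently in forms".
ENGLISH_WORDS = {"name", "please", "address", "date", "the", "your"}
MALAY_WORDS = {"nama", "sila", "alamat", "tarikh", "anda", "dan"}

LATIN_WORD_RE = re.compile(r"[a-z]+")


def detect_languages(text):
    """Return the set of languages spotted in `text`, or {"unknown"}."""
    found = set()

    # Chinese and Tamil: a single character in the dedicated range is enough.
    if CHINESE_RE.search(text):
        found.add("chinese")
    if TAMIL_RE.search(text):
        found.add("tamil")

    # English and Malay share the Latin alphabet, so fall back to a crude
    # vocabulary lookup on lowercased words.
    words = set(LATIN_WORD_RE.findall(text.lower()))
    if words & ENGLISH_WORDS:
        found.add("english")
    if words & MALAY_WORDS:
        found.add("malay")

    # No match at all: most likely English with unusual vocabulary.
    return found or {"unknown"}
```

A text block mixing scripts, such as `"姓名 / Name"`, yields both `chinese` and `english`, which is what lets a form be flagged as multi-language.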

-The scripts output some form count overall by language, and then generates more details reports (with agency name, and number of submissions) for 2 categories:
+The script outputs some form counts, overall and by language, and then generates more details reports (with agency name, and number of submissions) for 2 categories:

1. Forms where multiple languages are detected
2. Form where only one language is detected, and it is not English
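
As a rough sketch of how the two report categories above could be separated and emitted as TSV: the record fields (`form_id`, `agency`, `submissions`, `languages`), the function names, and the choice to leave `unknown` forms out of the second report are assumptions made for this example, not taken from the script.

```python
import csv
import io


def split_reports(forms):
    """Split forms into (multi-language, single non-English) lists."""
    multi = [f for f in forms if len(f["languages"]) > 1]
    # Assumption: a form tagged only "unknown" has no detected language,
    # so it is excluded here along with English-only forms.
    single_non_english = [
        f for f in forms
        if len(f["languages"]) == 1
        and f["languages"] not in ({"english"}, {"unknown"})
    ]
    return multi, single_non_english


def to_tsv(rows):
    """Render report rows as tab-separated text with a header line."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["form_id", "agency", "submissions", "languages"])
    for f in rows:
        writer.writerow([
            f["form_id"],
            f["agency"],
            f["submissions"],
            ",".join(sorted(f["languages"])),
        ])
    return out.getvalue()
```

Printing `to_tsv(multi)` and `to_tsv(single_non_english)` to stdout gives output that pastes directly into a spreadsheet.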
@@ -23,6 +23,7 @@ The results are printed to stdout as TSV content, so they can be copy/pasted int
## Install and run

```bash
+cd scripts/multi-language-stats
cp .env.template .env.production

# edit .env.production with the correct URU