feat: add script to compute stats for multi-language forms #3458
Context
Adding the script I used to compute multi-language stats for @pearlyong. Below were her requirements:
Detecting language is hard, especially since many forms use multiple languages for the same text blocks. Singapore has 4 primary languages though, so the script looks for these specifically. Chinese and Tamil are easy to identify, thanks to their dedicated Unicode character ranges. Distinguishing English from Malay is harder, since both use Latin characters. For these two, the script uses a very crude heuristic: it looks for words from each language that we see frequently in forms.
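To illustrate the heuristic described above, here is a minimal sketch in Python. It is not the script from this PR: the function name, the word lists, and the exact Unicode ranges are assumptions chosen for illustration (CJK Unified Ideographs for Chinese, the Tamil block for Tamil), and the fallback label mirrors the `unknown` category used in the description.

```python
import re

# Unicode blocks assumed for this sketch:
# CJK Unified Ideographs (U+4E00-U+9FFF) and Tamil (U+0B80-U+0BFF).
CHINESE_RE = re.compile(r"[\u4e00-\u9fff]")
TAMIL_RE = re.compile(r"[\u0b80-\u0bff]")

# Hypothetical word lists; the real script's lists are not shown in this PR text.
ENGLISH_WORDS = {"the", "and", "please", "your", "name"}
MALAY_WORDS = {"dan", "anda", "sila", "yang", "nama"}


def detect_languages(text):
    """Return the set of languages that appear to be present in `text`."""
    found = set()
    if CHINESE_RE.search(text):
        found.add("chinese")
    if TAMIL_RE.search(text):
        found.add("tamil")
    # Crude Latin-script heuristic: look for any frequent word of each language.
    words = set(re.findall(r"[a-z]+", text.lower()))
    if words & ENGLISH_WORDS:
        found.add("english")
    if words & MALAY_WORDS:
        found.add("malay")
    # No match at all falls back to a single "unknown" label.
    return found or {"unknown"}
```

Returning a set rather than a single label reflects the point that one form can mix several languages in the same text block.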
Anything with no language match is categorized as `unknown`, and is likely English with "funky words", like this form.

The script outputs some form counts, overall and by language, and then generates more detailed reports (with agency name and number of submissions) for 2 categories:
The results are printed to stdout as TSV content, so they can be copy/pasted into Excel or Google Docs. Example report here.
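As a rough sketch of what TSV output to stdout can look like, the snippet below writes tab-separated rows with Python's `csv` module. The `write_report` helper, the `form_id` column, and the sample rows are hypothetical; only the agency name and submission count columns come from the description above.

```python
import csv
import sys


def write_report(rows, out=sys.stdout):
    """Write report rows as tab-separated values, one form per line."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    # Header row; `form_id` is a hypothetical column added for illustration.
    writer.writerow(["form_id", "agency", "submissions"])
    writer.writerows(rows)


if __name__ == "__main__":
    # Made-up sample data, only to show the output shape.
    write_report([("form-1", "Agency A", 120), ("form-2", "Agency B", 7)])
```

Tab-separated output pastes into a spreadsheet as separate columns without an import step.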
Install and run