feat: add script to compute stats for multi-language forms #3458

timotheeg · 2022-02-18T03:46:37Z

Context

Adding the script I used to compute multi-language stats for @pearlyong. Below were her requirements:

how many forms use another language apart from English
number of submissions for these forms
what languages are being used
which agencies primarily
some examples of these forms would be great
any other related data you think might be interesting!

Detecting language is hard, especially since many forms use multiple languages for the same text blocks. Singapore has 4 primary languages though, so the script attempts to look for these specifically. Chinese and Tamil are easy to identify, thanks to dedicated Unicode character ranges. Distinguishing English and Malay is harder since they both use Latin characters. For these 2, the script uses a very crude heuristic: it looks for words from each language that we see frequently in forms.

Anything with no language match is categorized as unknown, and is likely English with "funky words", like this form.
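To illustrate the approach described above, here is a minimal sketch, not the code from the script itself. The function name `detectLanguages`, the exact Unicode ranges, and the tiny word lists are all assumptions made for illustration; the real script's lists and ranges may differ.

```typescript
// Hypothetical sketch of the detection heuristic; names and word lists are
// illustrative assumptions, not taken from the actual script.
type Language = 'chinese' | 'tamil' | 'malay' | 'english'

// CJK Unified Ideographs (U+4E00 to U+9FFF) and the Tamil block (U+0B80 to U+0BFF).
const CHINESE_RE = /[\u4E00-\u9FFF]/
const TAMIL_RE = /[\u0B80-\u0BFF]/

// Tiny example word lists; a usable list would be tuned on real form text.
const ENGLISH_WORDS = new Set(['the', 'please', 'name', 'your', 'and'])
const MALAY_WORDS = new Set(['sila', 'nama', 'anda', 'yang', 'untuk'])

// Returns every language detected in the text.
// An empty array corresponds to the "unknown" category.
function detectLanguages(text: string): Language[] {
  const found: Language[] = []

  // Script-based detection: one character in the range is enough.
  if (CHINESE_RE.test(text)) found.push('chinese')
  if (TAMIL_RE.test(text)) found.push('tamil')

  // Word-based detection for the two Latin-script languages.
  const words = text
    .toLowerCase()
    .split(/[^a-z]+/)
    .filter((w) => w.length > 0)
  if (words.some((w) => MALAY_WORDS.has(w))) found.push('malay')
  if (words.some((w) => ENGLISH_WORDS.has(w))) found.push('english')

  return found
}
```

With these example lists, `detectLanguages('请输入姓名 Please enter your name')` reports both Chinese and English, while a string containing none of the listed words or character ranges returns an empty array and would be counted as unknown.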

The script outputs some form counts, overall and by language, and then generates more detailed reports (with agency name and number of submissions) for 2 categories:

Forms where multiple languages are detected
Forms where only one language is detected, and it is not English

The results are printed to stdout as TSV content, so they can be copy/pasted into Excel or Google Docs. Example report here.
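As a rough sketch of how the two report categories and the TSV output could fit together: the `FormStats` shape, its field names, the column order, and the helpers `reportCategory` and `toTsv` are all hypothetical and are not taken from the script.

```typescript
// Hypothetical row shape; field names are assumptions for illustration only.
interface FormStats {
  title: string
  agency: string
  submissions: number
  languages: string[] // e.g. ['chinese', 'english']
}

// Decide which detailed report (if any) a form belongs to.
function reportCategory(
  form: FormStats,
): 'multiple' | 'single-non-english' | null {
  if (form.languages.length > 1) return 'multiple'
  if (form.languages.length === 1 && form.languages[0] !== 'english') {
    return 'single-non-english'
  }
  return null // English-only or unknown: not part of the detailed reports
}

// Render rows as tab-separated text that pastes cleanly into a spreadsheet.
function toTsv(forms: FormStats[]): string {
  const header = ['agency', 'title', 'submissions', 'languages']
  // Tabs and newlines inside a cell would break the column layout.
  const clean = (s: string): string => s.replace(/[\t\r\n]+/g, ' ')
  const rows = forms.map((f) => [
    clean(f.agency),
    clean(f.title),
    String(f.submissions),
    f.languages.join(','),
  ])
  return [header, ...rows].map((r) => r.join('\t')).join('\n')
}
```

Something like `console.log(toTsv(forms.filter((f) => reportCategory(f) === 'multiple')))` would then print one of the two reports to stdout.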

Install and run

cd scripts/multi-language-stats
cp .env.template .env.production

# edit .env.production with the correct URI

npm install
npm run get_data

mantariksh

thanks Tim! previously we've only documented scripts for database migrations but I think this could come in really useful in future

mantariksh · 2022-02-18T03:53:49Z

@timotheeg btw the "feat" in the title needs to be lowercase haha

timotheeg · 2022-02-18T04:01:00Z

Thanks!

I force-pushed to add some edits and update the commit title too.

I had to submit the first commit with --no-verify btw, because after doing the linting and prettier, the precommit hook complained with the error below, which is a false positive 😢

Found patterns for AWS_ACCESS_KEY_ID/AWS_SECRET_ACCESS_KEY

mantariksh approved these changes Feb 18, 2022


timotheeg changed the title ~~Script: Add script to compute sttas for multi-language forms~~ Feat: add script to compute stats for multi-language forms Feb 18, 2022

mantariksh changed the title ~~Feat: add script to compute stats for multi-language forms~~ feat: add script to compute stats for multi-language forms Feb 18, 2022

timotheeg added 2 commits February 18, 2022 11:58

feat: add script to compute stats for multi-language forms

1997ce3

fix: minor edits for the documentation of multi-language stats script

bfdd9ea

timotheeg force-pushed the multi-language-stats branch from 8d0b6f2 to bfdd9ea Compare February 18, 2022 03:58

chore: minor fixes for multi-language stats script

8eca56f

timotheeg merged commit 784a09f into develop Feb 18, 2022

timotheeg deleted the multi-language-stats branch February 18, 2022 07:07

tshuli mentioned this pull request Mar 8, 2022

build: Release 5.44.0 #3554

Merged
