feat: add script to compute stats for multi-language forms #3458

Merged
merged 3 commits into from
Feb 18, 2022

@timotheeg timotheeg commented Feb 18, 2022

Context

Adding the script I used to compute multi-language stats for @pearlyong. Below were her requirements:

  • how many forms use another language apart from English
  • number of submissions for these forms
  • what languages are being used
  • which agencies primarily
  • some examples of these forms would be great
  • any other related data you think might be interesting!

Detecting language is hard, especially since many forms use multiple languages for the same text blocks. Singapore has 4 primary languages though, so the script looks for these specifically. Chinese and Tamil are easy to identify, thanks to their dedicated Unicode character ranges. Distinguishing English from Malay is harder, since both use Latin characters. For these 2, the script uses a very crude heuristic: it looks for words from each language that we see frequently in forms.

Anything with no language match is categorized as unknown, and is likely English with "funky words", like this form.
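For readers who want a feel for the approach, here is a minimal sketch of the detection idea described above. It is not the script from this PR: the function name `detectLanguages`, the `Language` type, and both word lists are hypothetical and chosen only for illustration. The two character ranges are the Unicode CJK Unified Ideographs block (U+4E00 to U+9FFF) and the Tamil block (U+0B80 to U+0BFF).

```typescript
type Language = "chinese" | "tamil" | "malay" | "english" | "unknown";

// Unicode blocks: CJK Unified Ideographs and Tamil.
const CHINESE_RE = /[\u4E00-\u9FFF]/;
const TAMIL_RE = /[\u0B80-\u0BFF]/;

// Hypothetical word lists, for illustration only. The real script's
// lists are not shown in this PR description.
const ENGLISH_WORDS = ["the", "please", "your", "name", "address"];
const MALAY_WORDS = ["sila", "anda", "nama", "alamat", "yang"];

function detectLanguages(text: string): Language[] {
  const found: Language[] = [];

  // Script-based detection: a single character in the range is enough.
  if (CHINESE_RE.test(text)) found.push("chinese");
  if (TAMIL_RE.test(text)) found.push("tamil");

  // Word-based detection for the two Latin-script languages.
  const words = text
    .toLowerCase()
    .split(/[^a-z]+/)
    .filter((w) => w.length > 0);
  if (words.some((w) => MALAY_WORDS.indexOf(w) !== -1)) found.push("malay");
  if (words.some((w) => ENGLISH_WORDS.indexOf(w) !== -1)) found.push("english");

  // No match at all falls into the "unknown" bucket.
  if (found.length === 0) {
    const fallback: Language[] = ["unknown"];
    return fallback;
  }
  return found;
}
```

With these assumed word lists, `detectLanguages("Sila masukkan nama anda")` yields only `malay`, and a text made entirely of uncommon words yields `unknown`, which matches the "funky words" case mentioned above.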

The script outputs some form counts, overall and by language, and then generates more detailed reports (with agency name and number of submissions) for 2 categories:

  1. Forms where multiple languages are detected
  2. Forms where only one language is detected, and it is not English
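The two report categories above could be expressed roughly as follows. This is a sketch under assumptions, not the script's actual logic: `categorize` and `ReportCategory` are hypothetical names, and treating `unknown` as "no detected language" is my assumption about how unmatched forms are handled.

```typescript
type ReportCategory =
  | "multiple-languages"
  | "single-non-english"
  | "not-reported";

// `detected` is the list of languages found in one form,
// e.g. ["chinese", "english"].
function categorize(detected: string[]): ReportCategory {
  // Assumption: "unknown" does not count as a detected language.
  const known = detected.filter((lang) => lang !== "unknown");

  if (known.length > 1) return "multiple-languages";
  if (known.length === 1 && known[0] !== "english") return "single-non-english";

  // English-only and unknown-only forms get no detailed report.
  return "not-reported";
}
```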

The results are printed to stdout as TSV content, so they can be copy/pasted into Excel or Google Docs. Example report here.
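As an illustration of the TSV output, here is a small sketch of how report rows might be serialized. The `FormReportRow` shape, its column names, and the `toTsv` helper are assumptions made for this example; the real script's columns may differ. Tabs and newlines inside a cell are replaced with spaces so that a pasted row stays on one line in a spreadsheet.

```typescript
// Hypothetical row shape; the actual report columns may differ.
interface FormReportRow {
  title: string;
  agency: string;
  languages: string[];
  submissions: number;
}

// Tabs and line breaks would split a cell when pasted, so flatten them.
function cleanCell(value: string): string {
  return value.replace(/[\t\r\n]+/g, " ");
}

function toTsv(rows: FormReportRow[]): string {
  const header = ["title", "agency", "languages", "submissions"].join("\t");
  const lines = rows.map((row) =>
    [
      cleanCell(row.title),
      cleanCell(row.agency),
      row.languages.join(","),
      String(row.submissions),
    ].join("\t")
  );
  return [header].concat(lines).join("\n");
}
```

Printing the result with `console.log(toTsv(rows))` gives text that can be pasted straight into a spreadsheet, one form per row.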

Install and run

cd scripts/multi-language-stats
cp .env.template .env.production

# edit .env.production with the correct URI

npm install
npm run get_data

@mantariksh mantariksh left a comment

thanks Tim! previously we've only documented scripts for database migrations but I think this could come in really useful in future

@timotheeg timotheeg changed the title Script: Add script to compute sttas for multi-language forms Feat: add script to compute stats for multi-language forms Feb 18, 2022
@mantariksh mantariksh changed the title Feat: add script to compute stats for multi-language forms feat: add script to compute stats for multi-language forms Feb 18, 2022
@mantariksh

@timotheeg btw the "feat" in the title needs to be lowercase haha

@timotheeg timotheeg force-pushed the multi-language-stats branch from 8d0b6f2 to bfdd9ea Compare February 18, 2022 03:58
@timotheeg

Thanks!

I force-pushed to add some edits and update the commit title too.

I had to submit the first commit with --no-verify btw, because after linting and running prettier, the pre-commit hook complained with the error below, which is a false positive 😢

Found patterns for AWS_ACCESS_KEY_ID/AWS_SECRET_ACCESS_KEY

@timotheeg timotheeg merged commit 784a09f into develop Feb 18, 2022
@timotheeg timotheeg deleted the multi-language-stats branch February 18, 2022 07:07
@tshuli tshuli mentioned this pull request Mar 8, 2022