First names are not gender-specific #30
Comments
OK @aalexandersson, I understand the situation. I will add this feature as soon as possible; it will be ready in the next version of the package. Thank you very much for collaborating :)
Thank you :-) Sex/gender classification is difficult. For more background, this Julia Discourse topic discusses the two Julia packages: https://discourse.julialang.org/t/rfc-genderinference-jl/22294 I am aware of these two active areas of research. I will try to stay on top of it:
The changes have been added to Faker 0.3.5. Thank you very much for your collaboration.
Describe the bug
First names are not gender-specific, and are therefore often unrealistic. This is a problem when combining name and sex in Faker.profile(). For example, Barbara is typically a female name, whereas Jonathan is typically a male name. A Stack Overflow post suggested this solution for Python Faker:
fake.first_name_male() if gender=="M" else fake.first_name_female()
But I prefer Julia.
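The idea behind that one-liner (choose the sex first, then draw the first name from the matching pool) can be sketched without any Faker dependency. This is only an illustration in plain Python: `FIRST_NAMES` and `fake_profile` are hypothetical names, not part of Faker.jl or Python Faker, and the name pools are tiny placeholders built around the Barbara and Jonathan examples.

```python
import random

# Hypothetical, tiny name pools for illustration only; a real generator
# would draw on a much larger, gender-tagged name list.
FIRST_NAMES = {
    "F": ["Barbara", "Mary", "Linda"],
    "M": ["Jonathan", "James", "Robert"],
}


def fake_profile(rng):
    """Pick the sex first, then a first name from the matching pool."""
    sex = rng.choice(sorted(FIRST_NAMES))
    return {"sex": sex, "first_name": rng.choice(FIRST_NAMES[sex])}


rng = random.Random(0)  # seeded so that repeated runs give the same profiles
profiles = [fake_profile(rng) for _ in range(5)]
```

Because the name is drawn only after the sex is fixed, a profile can never pair "M" with Barbara or "F" with Jonathan.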
To Reproduce
Steps to reproduce the behavior:
Expected behavior
I expected this:
Screenshots
Not applicable because not all profiles are problematic.
Environment
Additional context
SSA provides national, state-specific, and territory-specific data which perhaps could be used:
https://www.ssa.gov/oact/babynames/limits.html
Personally, I need realistic fake datasets for testing record linkage in my work at the Florida cancer registry. Faker currently outputs a single record (observation) rather than a file (dataset). Would it be easy to generate several profile observations and save them as a dataset? In my case, I need two datasets: say, one with 100,000 records and another with 1 million records. If I could create and read a dataset with, for example, just three records, it should be trivial to repeat the procedure for varying numbers of observations and datasets.
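A minimal sketch of the three-record case, assuming plain Python and its standard `csv` module rather than any Faker API. The helpers `make_record`, `write_dataset`, and `read_dataset`, the field names, and the file name `profiles_3.csv` are all hypothetical; `make_record` stands in for whatever profile generator is actually used.

```python
import csv
import os
import random
import tempfile

FIELDS = ["id", "first_name", "sex"]
# Placeholder name pools; stand-ins for a real profile generator.
NAMES = {"F": ["Barbara"], "M": ["Jonathan"]}


def make_record(i, rng):
    """Build one fake record with a sex-consistent first name."""
    sex = rng.choice(["F", "M"])
    return {"id": i, "first_name": rng.choice(NAMES[sex]), "sex": sex}


def write_dataset(path, n, seed=0):
    """Write n records to a CSV file, one row at a time."""
    rng = random.Random(seed)
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        for i in range(1, n + 1):
            writer.writerow(make_record(i, rng))


def read_dataset(path):
    """Read the CSV file back as a list of dicts (all values are strings)."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


out_dir = tempfile.mkdtemp()
path = os.path.join(out_dir, "profiles_3.csv")
write_dataset(path, 3)
rows = read_dataset(path)
```

Calling `write_dataset` again with `n=100_000` and `n=1_000_000` (and different paths and seeds) would produce the two larger datasets; rows are written one at a time, so the writer does not need to hold the whole dataset in memory.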
Edit 1: The SSA data requires a lot of merging. It would be good enough for me to have just one approximate dataset such as "name_gender.csv" from data.world. The dataset has 95,025 rows and 3 columns: "name", "gender", and "probability". According to the dataset, the example names Barbara and Jonathan have probabilities 1 and 0.9957, respectively. The dataset can be accessed here: https://data.world/howarder/gender-by-name
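As a rough sketch of how such a file could feed gender-specific name pools, the Python snippet below keeps only names whose probability clears a threshold and groups them by gender. It makes several assumptions: the header row uses exactly the column names listed above, gender is coded as "F"/"M", and the 0.95 cutoff is an arbitrary choice. `load_names` is a hypothetical helper, and the inline sample stands in for the real file; only the Barbara and Jonathan rows come from the figures quoted above.

```python
import csv
import io
import random

# Stand-in for name_gender.csv. The Barbara and Jonathan rows use the
# probabilities quoted above; the "Jessie" row is invented purely to show
# an ambiguous name being filtered out.
SAMPLE_CSV = """name,gender,probability
Barbara,F,1
Jonathan,M,0.9957
Jessie,F,0.55
"""


def load_names(f, min_probability=0.95):
    """Group names by gender, keeping only confidently gendered ones."""
    pools = {}
    for row in csv.DictReader(f):
        if float(row["probability"]) >= min_probability:
            pools.setdefault(row["gender"], []).append(row["name"])
    return pools


pools = load_names(io.StringIO(SAMPLE_CSV))
rng = random.Random(0)
female_name = rng.choice(pools["F"])
male_name = rng.choice(pools["M"])
```

With the real file, `io.StringIO(SAMPLE_CSV)` would be replaced by an open file handle, for example `open("name_gender.csv", newline="", encoding="utf-8")`.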
There is a Julia package which also might help: NameToGender.jl
Edit 2: There is also another Julia package which seems to be even more useful here: GenderInference.jl