First names are not gender-specific #30

Open
aalexandersson opened this issue Feb 18, 2022 · 4 comments
aalexandersson commented Feb 18, 2022

Describe the bug
Generated first names are not gender-specific, so the resulting profiles are often unrealistic. This is a problem when combining name and sex in Faker.profile(). For example, Barbara is typically a female name, whereas Jonathan is typically a male name. A StackOverflow posting suggested this code solution for Python Faker:
fake.first_name_male() if gender=="M" else fake.first_name_female()
But I prefer Julia.

To Reproduce
Steps to reproduce the behavior:

julia> using Faker
julia> Faker.profile("name", "sex")
Dict{Any, Any} with 2 entries:
  "name" => "Barbara"
  "sex"  => "M"
julia> Faker.profile("name", "sex")
Dict{Any, Any} with 2 entries:
  "name" => "Jonathan"
  "sex"  => "F"

Expected behavior
I expected this:

julia> Faker.profile("name", "sex")
Dict{Any, Any} with 2 entries:
  "name" => "Barbara"
  "sex"  => "F"
julia> Faker.profile("name", "sex")
Dict{Any, Any} with 2 entries:
  "name" => "Jonathan"
  "sex"  => "M"
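
One way to get this behavior is to draw the sex first and then pick the first name from a pool that matches it. Below is a minimal sketch in Python (the language of the StackOverflow snippet above), using only the standard library. The name pools and the `profile` helper are hypothetical and for illustration only; this is not how Faker or Faker.jl is implemented.

```python
import random

# Hypothetical name pools for illustration only; a real generator would
# load much larger, locale-specific lists.
FEMALE_NAMES = ["Barbara", "Linda", "Susan"]
MALE_NAMES = ["Jonathan", "Robert", "David"]


def profile(rng):
    """Draw the sex first, then a first name from the matching pool."""
    sex = rng.choice(["F", "M"])
    pool = FEMALE_NAMES if sex == "F" else MALE_NAMES
    return {"name": rng.choice(pool), "sex": sex}


rng = random.Random(0)  # seeded so that runs are repeatable
profiles = [profile(rng) for _ in range(5)]
for p in profiles:
    print(p)
```

Because the name is chosen conditionally on the sex, a combination such as Barbara with "M" cannot occur in this sketch.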

Screenshots
Not applicable because not all profiles are problematic.

Environment

  • Repository: Julia 1.7.2
  • OS: Win 10
  • IDE: REPL
  • Project/Manifest: I am a Julia beginner. Is this what you want?
(@v1.7) pkg> st
      Status `C:\Users\aalexandersson\.julia\environments\v1.7\Project.toml`
  [a93c6f00] DataFrames v1.3.2
  [0efc519c] Faker v0.3.2
  [be6f12e9] ODBC v1.0.4
  [08abe8d2] PrettyTables v1.3.1

Additional context
SSA provides national, state-specific, and territory-specific data which perhaps could be used:
https://www.ssa.gov/oact/babynames/limits.html

Personally, I need realistic fake datasets for testing record linkage for my work at the Florida cancer registry. The Faker output is only one record (observation), and it is not saved to a file (dataset). Is it easy to generate several profile observations and save them as a dataset? In my case, I need two datasets, say one with 100,000 records and the other with 1 million records. If I could create and read a dataset with, for example, just three records, then it should be trivial to repeat the procedure for varying numbers of observations and datasets.
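
To illustrate the three-record case, here is a rough sketch in Python using only the standard library. The helpers `make_profile`, `write_dataset` and `read_dataset`, the tiny name pools, and the file name are all hypothetical; the same pattern should carry over to Julia with a CSV package, but that is an assumption and not something tested here.

```python
import csv
import os
import random
import tempfile

# Hypothetical name pools, keyed by sex, for illustration only.
NAMES_BY_SEX = {"F": ["Barbara", "Linda"], "M": ["Jonathan", "Robert"]}


def make_profile(rng):
    """Return one record whose first name matches its sex."""
    sex = rng.choice(sorted(NAMES_BY_SEX))
    return {"name": rng.choice(NAMES_BY_SEX[sex]), "sex": sex}


def write_dataset(path, n_records, seed=None):
    """Write n_records profiles to a CSV file, one row per record."""
    rng = random.Random(seed)
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["name", "sex"])
        writer.writeheader()
        for _ in range(n_records):
            writer.writerow(make_profile(rng))


def read_dataset(path):
    """Read the CSV file back as a list of dictionaries."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


path = os.path.join(tempfile.mkdtemp(), "profiles_small.csv")
write_dataset(path, 3, seed=1)
records = read_dataset(path)
print(len(records))  # 3
```

Rows are written one at a time, so changing `3` to `100_000` or `1_000_000` only changes the run time and file size, not the memory needed for writing.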

Edit 1: The SSA data requires lots of merging. It would be good enough for me to have just one approximated dataset such as "name_gender.csv" from data.world. The dataset has 95,025 rows and 3 columns: "name", "gender" and "probability". According to the dataset, the example names Barbara and Jonathan have probabilities 1 and 0.9957, respectively. The dataset can be accessed from here: https://data.world/howarder/gender-by-name
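
A table of this shape could drive the sex assignment directly. The sketch below, in standard-library Python, uses an inline two-row sample built from the two values quoted above instead of the real file. It assumes that "gender" is coded as "F"/"M" and that "probability" is the chance that the name belongs to the listed gender; both are assumptions about the dataset, and the helper names are hypothetical.

```python
import csv
import io
import random

# Two rows taken from the values quoted above; the real file has far more.
# Assumption: gender is coded "F"/"M".
SAMPLE_CSV = """name,gender,probability
Barbara,F,1
Jonathan,M,0.9957
"""


def load_name_table(text):
    """Parse CSV text into (name, gender, probability) tuples."""
    rows = csv.DictReader(io.StringIO(text))
    return [(r["name"], r["gender"], float(r["probability"])) for r in rows]


def draw_profile(table, rng):
    """Pick a name, then keep the listed gender with the given probability."""
    name, gender, prob = rng.choice(table)
    if rng.random() >= prob:
        # Rare case: assign the other sex, as the probability column allows.
        gender = "M" if gender == "F" else "F"
    return {"name": name, "sex": gender}


table = load_name_table(SAMPLE_CSV)
rng = random.Random(42)
sample = [draw_profile(table, rng) for _ in range(1000)]
print(sample[:3])
```

With a probability of 1, Barbara is always assigned "F" here, while Jonathan is assigned "F" only in a small fraction of draws.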

There is a Julia package which also might help: NameToGender.jl

Edit 2: There is also another Julia package which seems to be even more useful here: GenderInference.jl

neomatrixcode (Owner) commented

OK @aalexandersson, I understand the situation.

I will add this feature as soon as possible; it will be ready in the next version of the package.

Thank you very much for collaborating :)

aalexandersson commented Feb 19, 2022

Thank you :-) Sex/gender classification is difficult. For more background, this Julia Discourse topic discusses the two Julia packages: https://discourse.julialang.org/t/rfc-genderinference-jl/22294

I am aware of these two active areas of research, and I will try to stay on top of them:

  1. The North American Association of Central Cancer Registries (NAACCR) has a Sex/Gender Classification Workgroup which is putting together a proposal for a standardized data set:
    https://narrative.naaccr.org/wp-content/uploads/2022/01/Winter-2022-for-PDF.pdf

  2. The University of Washington (UW), together with the U.S. Census Bureau, is putting together standardized fake datasets for record linkage: https://www.census.gov/newsroom/blogs/research-matters/2021/10/four-cooperative-agreements.html

neomatrixcode (Owner) commented

The changes have been added to Faker 0.3.5. Thank you very much for your collaboration.
