First names are not gender-specific #30

aalexandersson · 2022-02-18T18:06:31Z

Describe the bug
First names are not gender-specific, so generated profiles are often unrealistic. This is a problem when combining name and sex in Faker.profile(). For example, Barbara is typically a female name, whereas Jonathan is typically a male name. A Stack Overflow post suggested this solution for Python Faker:
fake.first_name_male() if gender=="M" else fake.first_name_female()
But I prefer Julia.

To Reproduce
Steps to reproduce the behavior:

julia> using Faker
julia> Faker.profile("name", "sex")
Dict{Any, Any} with 2 entries:
  "name" => "Barbara"
  "sex"  => "M"
julia> Faker.profile("name", "sex")
Dict{Any, Any} with 2 entries:
  "name" => "Jonathan"
  "sex"  => "F"

Expected behavior
I expected this:

julia> Faker.profile("name", "sex")
Dict{Any, Any} with 2 entries:
  "name" => "Barbara"
  "sex"  => "F"
julia> Faker.profile("name", "sex")
Dict{Any, Any} with 2 entries:
  "name" => "Jonathan"
  "sex"  => "M"

Screenshots
Not applicable because not all profiles are problematic.

Environment

Repository: Julia 1.7.2
OS: Win 10
IDE: REPL
Project/Manifest: I am a Julia beginner. Is this what you want?

(@v1.7) pkg> st
      Status `C:\Users\aalexandersson\.julia\environments\v1.7\Project.toml`
  [a93c6f00] DataFrames v1.3.2
  [0efc519c] Faker v0.3.2
  [be6f12e9] ODBC v1.0.4
  [08abe8d2] PrettyTables v1.3.1

Additional context
The U.S. Social Security Administration (SSA) provides national, state-specific, and territory-specific baby-name data, which could perhaps be used:
https://www.ssa.gov/oact/babynames/limits.html

Personally, I need realistic fake datasets to test record linkage for my work at the Florida cancer registry. Faker currently returns only one record (observation) at a time, not a file (dataset). Is there an easy way to generate several profile observations and save them as a dataset? In my case, I need two datasets: say, one with 100,000 records and the other with 1 million records. If I could create and read a dataset with just three records, for example, it should be trivial to repeat the procedure for varying numbers of observations and datasets.
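
To make the request concrete, here is a minimal sketch of the workflow, written in Python with only the standard library because the pattern is language-independent. It is not Faker's implementation: the name pools, `make_profile`, and `write_profiles` are hypothetical names introduced for illustration. The idea is to draw the sex first, pick the first name from the matching pool, and write the records out as CSV.

```python
import csv
import io
import random

# Hypothetical, tiny name pools for illustration only; a realistic pool would
# come from a source such as the SSA baby-name files.
FEMALE_NAMES = ["Barbara", "Linda", "Maria"]
MALE_NAMES = ["Jonathan", "Robert", "David"]


def make_profile(rng):
    """Draw the sex first, then pick a first name from the matching pool."""
    sex = rng.choice(["F", "M"])
    pool = FEMALE_NAMES if sex == "F" else MALE_NAMES
    return {"name": rng.choice(pool), "sex": sex}


def write_profiles(out, n, seed=0):
    """Write n profiles as CSV rows to the file-like object `out`."""
    rng = random.Random(seed)
    writer = csv.DictWriter(out, fieldnames=["name", "sex"])
    writer.writeheader()
    for _ in range(n):
        writer.writerow(make_profile(rng))


# Create a three-record dataset in memory, then read it back.
buffer = io.StringIO()
write_profiles(buffer, 3)
rows = list(csv.DictReader(io.StringIO(buffer.getvalue())))
```

Passing an open file instead of the in-memory buffer, and changing `n` to 100,000 or 1,000,000, would give the two datasets described above.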

Edit 1: The SSA data requires a lot of merging. It would be good enough for me to have just one approximate dataset such as "name_gender.csv" from data.world. The dataset has 95,025 rows and the 3 columns "name", "gender" and "probability". According to the dataset, the example names Barbara and Jonathan have probabilities 1 and 0.9957, respectively. The dataset can be accessed here: https://data.world/howarder/gender-by-name
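
As a rough sketch of how such a table might be used (again in standard-library Python, not Faker's or the dataset's own tooling), the snippet below reads rows in the three-column layout described above and assigns a sex to a given name. Several things here are assumptions: the "F"/"M" coding of the gender column, the reading of "probability" as the chance that the name belongs to the listed gender, and the helper names `load_name_table` and `draw_sex`. Only the two probabilities are taken from the issue text.

```python
import csv
import io
import random

# Hypothetical excerpt in the name/gender/probability layout. The two
# probabilities are the ones quoted in the issue; the "F"/"M" codes are assumed.
SAMPLE_CSV = """name,gender,probability
Barbara,F,1
Jonathan,M,0.9957
"""


def load_name_table(text):
    """Map each name to its (listed gender, probability) pair."""
    table = {}
    for row in csv.DictReader(io.StringIO(text)):
        table[row["name"]] = (row["gender"], float(row["probability"]))
    return table


def draw_sex(name, table, rng):
    """Return the listed gender with the listed probability, else the other one."""
    gender, probability = table[name]
    if rng.random() < probability:
        return gender
    return "M" if gender == "F" else "F"


table = load_name_table(SAMPLE_CSV)
rng = random.Random(42)
barbara_draws = [draw_sex("Barbara", table, rng) for _ in range(1000)]
jonathan_draws = [draw_sex("Jonathan", table, rng) for _ in range(10000)]
```

With a probability of 1, every draw for Barbara is "F"; for Jonathan, a small share of draws come out as "F", which keeps the generated data from being unrealistically clean.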

There is a Julia package which also might help: NameToGender.jl

Edit 2: There is also another Julia package which seems to be even more useful here: GenderInference.jl

neomatrixcode · 2022-02-19T01:20:24Z

OK @aalexandersson, I understand the situation.

I will add this feature as soon as possible; it will be ready in the next version of the package.

Thank you very much for collaborating :)

aalexandersson · 2022-02-19T17:22:30Z

Thank you :-) Sex/gender classification is difficult. For more background, this Julia Discourse topic discusses the two Julia packages: https://discourse.julialang.org/t/rfc-genderinference-jl/22294

I am aware of these two active areas of research, and I will try to stay on top of them:

The North American Association of Central Cancer Registries (NAACCR) has a Sex/Gender Classification Workgroup which is putting together a proposal for a standardized data set:
https://narrative.naaccr.org/wp-content/uploads/2022/01/Winter-2022-for-PDF.pdf
The University of Washington (UW), together with the U.S. Census Bureau, is putting together standardized fake datasets for record linkage: https://www.census.gov/newsroom/blogs/research-matters/2021/10/four-cooperative-agreements.html

neomatrixcode · 2022-03-01T22:59:30Z

JuliaRegistries/General#55751

neomatrixcode · 2022-03-04T14:28:52Z

The changes have been added to Faker 0.3.5. Thank you very much for your collaboration.

aalexandersson added the bug label Feb 18, 2022
