Improve matching approach #13

Closed
Hussein-Mahfouz opened this issue Apr 5, 2024 · 4 comments · Fixed by #60
Labels
enhancement New feature or request Task 1 creating activity chains

Comments

@Hussein-Mahfouz
Collaborator

The current approach to matching the SPC to the NTS is:

  1. Categorical matching (exact join) to match households
  2. Propensity score matching to match individuals within households

Categorical matching is inflexible, and some households in the SPC don't have any exact matches in the NTS (see here for matching results). It would be better to do Propensity Score Matching at the household level from the beginning. This would ensure that each household in the SPC is matched to at least one household in the NTS.
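
To illustrate the limitation, here is a minimal sketch of exact categorical matching on toy data. It uses plain Python dicts to stay dependency-free. The function name `match_exact`, the column names, and the household records are all made up for illustration and are not taken from the project code.

```python
from collections import defaultdict


def match_exact(spc_households, nts_households, columns):
    """Map each SPC household id to the NTS household ids that agree on every column."""
    # Index NTS households by their combination of values on the matching columns.
    nts_index = defaultdict(list)
    for hh in nts_households:
        nts_index[tuple(hh[c] for c in columns)].append(hh["id"])
    # Look up each SPC household; households with no exact match get an empty list.
    return {
        hh["id"]: list(nts_index.get(tuple(hh[c] for c in columns), []))
        for hh in spc_households
    }


# Hypothetical toy data (not real SPC or NTS records).
spc = [
    {"id": "spc_1", "num_adults": 2, "num_children": 0, "car_ownership": 1},
    {"id": "spc_2", "num_adults": 1, "num_children": 3, "car_ownership": 0},
]
nts = [
    {"id": "nts_1", "num_adults": 2, "num_children": 0, "car_ownership": 1},
    {"id": "nts_2", "num_adults": 2, "num_children": 0, "car_ownership": 1},
    {"id": "nts_3", "num_adults": 1, "num_children": 3, "car_ownership": 1},
]

matches = match_exact(spc, nts, ["num_adults", "num_children", "car_ownership"])
unmatched = [spc_id for spc_id, found in matches.items() if not found]
```

Here `spc_2` ends up in `unmatched` because the only similar NTS household differs on one column, which is the kind of household an exact join leaves behind.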

Tools

The MatchIt R package is very comprehensive. It has different matching algorithms, and also allows you to specify different calipers for each covariate. This is very handy because we might want to be stricter on some covariates than others (e.g. for households, we may want the household size to match exactly, but be more forgiving on household income).

I didn't find a Python library that has the same functionality as MatchIt. In psmpy, you can only provide one caliper based on the overall distance.
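
To make the per-covariate caliper idea concrete, here is a rough sketch in plain Python. It is not the MatchIt or psmpy API, and it does not estimate a propensity score. It only shows exact matching on some columns combined with a separate tolerance on others. The function name `match_with_calipers`, the column names, and the assumption that calipers are given in raw units (and are greater than zero) are all illustrative.

```python
def match_with_calipers(target, candidates, exact_cols, calipers):
    """Return ids of candidates that equal the target on exact_cols and lie
    within every caliper, ordered from closest to furthest."""
    eligible = []
    for cand in candidates:
        # Strict covariates: any difference rules the candidate out.
        if any(cand[c] != target[c] for c in exact_cols):
            continue
        diffs = {c: abs(cand[c] - target[c]) for c in calipers}
        # Forgiving covariates: each one has its own tolerance (assumed > 0).
        if all(diffs[c] <= width for c, width in calipers.items()):
            # Scale each difference by its caliper so covariates are comparable.
            distance = sum(diffs[c] / width for c, width in calipers.items())
            eligible.append((distance, cand["id"]))
    return [cand_id for _, cand_id in sorted(eligible)]


# Hypothetical toy data.
target = {"id": "spc_1", "hh_size": 3, "income": 30000}
candidates = [
    {"id": "nts_1", "hh_size": 3, "income": 34000},
    {"id": "nts_2", "hh_size": 3, "income": 31000},
    {"id": "nts_3", "hh_size": 4, "income": 30000},  # wrong household size
    {"id": "nts_4", "hh_size": 3, "income": 45000},  # outside the income caliper
]

ranked = match_with_calipers(
    target, candidates, exact_cols=["hh_size"], calipers={"income": 5000}
)
```

With these inputs, household size must match exactly while income may differ by up to 5000, so only `nts_2` and `nts_1` remain, in that order.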

@Hussein-Mahfouz
Collaborator Author

Hussein-Mahfouz commented May 9, 2024

Another solution is to follow the SPC approach: do categorical matching iteratively, and after each iteration relax the constraints slightly. This should result in better matching at the household level. Example:

  • Round 1: Household income | Number of adults | Number of children | Employment status | Car ownership | Type of tenancy | Rural/Urban Classification
  • Round 2: Household income | Number of adults | Number of children | Employment status | Car ownership | Type of tenancy | Rural/Urban Classification
  • Round 3: Household income | Number of adults | Number of children | Employment status | Car ownership | Type of tenancy | Rural/Urban Classification
  • Round 4: Household income | Number of adults | Number of children | Employment status | Car ownership | Type of tenancy | Rural/Urban Classification

This should provide better results than the current match_categorical implementation, where we only match once and have to sacrifice some variables to improve matching (as shown here).

In the final round, all households that are yet to be matched can be matched either randomly, or to a household with values close to the mean.

Additional arguments to pass to match_categorical:

  • optional_columns: these are the columns that we can relax. It could be an ordered list, and at each iteration, we remove the last column from the list before matching

@Hussein-Mahfouz Hussein-Mahfouz changed the title Change Household Level matching from categorical to PSM Improve matching approach May 9, 2024
@Hussein-Mahfouz
Collaborator Author

Hussein-Mahfouz commented May 10, 2024

@sgreenbury the statistical matching approach I mentioned is in the ile-de-france project: link. I haven't tried to use the pipeline in the ile de france project, but it's well documented. Maybe it's something we should explore

@Hussein-Mahfouz
Collaborator Author

I've just found the following description in this paper:

In the next step, each sampled individual is then matched to an observation from the household travel survey, using hot-deck matching ((27), (17)). The idea is to find all source observations (i.e. all samples from the household travel survey) that match the target observations (i.e. synthetic agents previously sampled from the census) on a list of given matching attributes, and then to sample randomly one of those source observations. To avoid over-fitting, if too few source observations are found for a given target observation, some matching attributes are removed to enhance the set of matching source observations.

I like the step taken to avoid overfitting. They do statistical matching, but it can also be applied to categorical matching at the household level, and we would have a threshold for minimum number of matches
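
A minimal sketch of that idea, as I read the description above: keep dropping the last matching attribute until at least a minimum number of source observations match, then sample one at random. This is not the code from the ile-de-france pipeline. The function name `hot_deck_match`, the `min_sources` parameter, and the attribute names are assumptions for illustration.

```python
import random


def hot_deck_match(target, sources, attributes, min_sources, rng):
    """Sample one source observation that matches the target, dropping trailing
    attributes while fewer than min_sources candidates are found."""
    attrs = list(attributes)
    while True:
        pool = [s for s in sources if all(s[a] == target[a] for a in attrs)]
        # Stop once the pool is large enough or there is nothing left to relax.
        if len(pool) >= min_sources or not attrs:
            break
        attrs.pop()  # relax the least important attribute (assumed to be last)
    if not pool:
        return None, attrs
    return rng.choice(pool), attrs


# Hypothetical toy data.
sources = [
    {"id": "nts_1", "age_band": "30-39", "sex": "f", "employed": True},
    {"id": "nts_2", "age_band": "30-39", "sex": "f", "employed": False},
    {"id": "nts_3", "age_band": "30-39", "sex": "m", "employed": True},
]
target = {"age_band": "30-39", "sex": "f", "employed": True}

chosen, used_attrs = hot_deck_match(
    target,
    sources,
    attributes=["age_band", "sex", "employed"],
    min_sources=2,
    rng=random.Random(0),
)
```

In this example only one source matches on all three attributes, so `employed` is dropped and the sample is drawn from the two sources that match on `age_band` and `sex`.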

@Hussein-Mahfouz
Collaborator Author

Notes on implementation of the iterative_match_categorical() function:

  • You define a list of fixed_cols and optional_cols. The optional columns are listed in order of importance.
  • At the first iteration, we try to match on all columns. At each iteration, we pop the last item from the optional_cols list
  • After each iteration, we check the number of matches found for each household. If a household has more than n_matches, we stop adding matches to it during successive iterations.
  • The process terminates when we've gone through all optional_cols
  • It does not guarantee that all households will be matched, but the matching rate is better than our old approach
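
The notes above can be sketched roughly as follows. This is an illustration in plain Python, not the actual iterative_match_categorical() implementation. The name `iterative_match_sketch`, the record layout, and the choice to run a final round on fixed_cols alone are assumptions.

```python
from collections import defaultdict


def iterative_match_sketch(spc, nts, fixed_cols, optional_cols, n_matches):
    """Match SPC to NTS households over several rounds, dropping the last
    optional column after each round."""
    matches = {hh["id"]: [] for hh in spc}
    # Round 1 uses all optional columns; the last round uses none (assumption).
    for n_optional in range(len(optional_cols), -1, -1):
        cols = list(fixed_cols) + list(optional_cols[:n_optional])
        index = defaultdict(list)
        for hh in nts:
            index[tuple(hh[c] for c in cols)].append(hh["id"])
        for hh in spc:
            found = matches[hh["id"]]
            # Households with more than n_matches stop receiving new matches.
            if len(found) > n_matches:
                continue
            for nts_id in index.get(tuple(hh[c] for c in cols), []):
                if nts_id not in found:
                    found.append(nts_id)
    return matches


# Hypothetical toy data.
spc = [
    {"id": "spc_1", "num_adults": 2, "tenure": "owned", "rural_urban": "urban"},
    {"id": "spc_2", "num_adults": 1, "tenure": "rented", "rural_urban": "rural"},
    {"id": "spc_3", "num_adults": 4, "tenure": "owned", "rural_urban": "urban"},
]
nts = [
    {"id": "nts_1", "num_adults": 2, "tenure": "owned", "rural_urban": "urban"},
    {"id": "nts_2", "num_adults": 2, "tenure": "owned", "rural_urban": "rural"},
    {"id": "nts_3", "num_adults": 1, "tenure": "owned", "rural_urban": "urban"},
]

result = iterative_match_sketch(
    spc,
    nts,
    fixed_cols=["num_adults"],
    optional_cols=["tenure", "rural_urban"],
    n_matches=1,
)
unmatched = [spc_id for spc_id, found in result.items() if not found]
```

In this toy run `spc_2` only finds a match once both optional columns have been dropped, and `spc_3` stays unmatched because no NTS household shares its fixed column value, which mirrors the point that full coverage is not guaranteed.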

New approach - Sample of 15,000 households (columns used here):

0.4% of households in the SPC had no match.
