Improve matching approach #13

Hussein-Mahfouz · 2024-04-05T08:53:51Z

The current approach to matching the SPC to the NTS is:

Categorical matching (exact join) to match households
Propensity score matching to match individuals within households

Categorical matching is inflexible, and some households in the SPC don't have any exact matches in the NTS (see here for matching results). It would be better to do propensity score matching at the household level from the beginning. This would ensure that each household in the SPC is matched to at least one household in the NTS.
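To make the problem concrete, here is a minimal sketch of an exact join that leaves some households unmatched. It is not the project's match_categorical code; the column names (`hid`, `HouseholdID`, `num_adults`, `car_ownership`) and the toy values are assumptions for illustration only.

```python
import pandas as pd

# Toy data: column names and values are illustrative, not the real SPC/NTS schema.
spc = pd.DataFrame({
    "hid": [1, 2, 3],
    "num_adults": [1, 2, 2],
    "car_ownership": [0, 1, 2],
})
nts = pd.DataFrame({
    "HouseholdID": [10, 11, 12],
    "num_adults": [1, 2, 2],
    "car_ownership": [0, 1, 1],
})

match_cols = ["num_adults", "car_ownership"]

# Exact join: an SPC household only matches NTS households with identical values.
matches = spc.merge(nts, on=match_cols, how="left", indicator=True)

# Rows tagged "left_only" are SPC households with no NTS counterpart.
unmatched_ids = matches.loc[matches["_merge"] == "left_only", "hid"].unique()
unmatched_share = len(unmatched_ids) / spc["hid"].nunique()
```

In this toy case household 3 has no exact match, because no NTS household has two adults and two cars. Every extra matching column makes this kind of gap more likely.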

Tools

The MatchIt R package is very comprehensive. It offers several matching algorithms and lets you specify a different caliper for each covariate. This is very handy because we might want to be stricter on some covariates than others (e.g. for households, we may want household size to match exactly but be more forgiving on household income).

I didn't find a Python library with the same functionality as MatchIt. In psmpy, you can only provide a single caliper based on the overall distance.
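As a rough illustration of what per-covariate calipers mean, the sketch below filters candidate households by hand with pandas. It does not use the MatchIt or psmpy APIs; the function `candidates_within_calipers`, its parameters, and the column names are hypothetical.

```python
import pandas as pd

def candidates_within_calipers(target, donors, exact_cols, calipers):
    """Return donor rows equal to `target` on exact_cols and within each caliper width."""
    mask = pd.Series(True, index=donors.index)
    for col in exact_cols:
        # Strict covariates: values must be identical.
        mask &= donors[col] == target[col]
    for col, width in calipers.items():
        # Lenient covariates: absolute difference must not exceed the caliper.
        mask &= (donors[col] - target[col]).abs() <= width
    return donors[mask]

# Toy NTS-like donor pool and one SPC-like target household (illustrative values).
nts_households = pd.DataFrame({
    "household_size": [2, 2, 2, 3],
    "income": [30_000, 36_000, 50_000, 31_000],
})
spc_household = {"household_size": 2, "income": 32_000}

close = candidates_within_calipers(
    spc_household,
    nts_households,
    exact_cols=["household_size"],
    calipers={"income": 5_000},
)
```

Here only the first two donors survive: the third is too far away on income, and the fourth differs on household size. A full propensity score workflow would then pick among the survivors by distance, which this sketch leaves out.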

Hussein-Mahfouz · 2024-05-09T15:02:47Z

Another solution is to follow the SPC approach: do categorical matching iteratively, and after each iteration relax the constraints slightly. This should result in better matching at the household level. Example:

Round 1: Household income | Number of adults | Number of children | Employment status | Car ownership | Type of tenancy | Rural/Urban Classification
Round 2: Household income | Number of adults | Number of children | Employment status | Car ownership | ~~Type of tenancy~~ | Rural/Urban Classification
Round 3: ~~Household income~~ | Number of adults | Number of children | Employment status | Car ownership | ~~Type of tenancy~~ | Rural/Urban Classification
Round 4: ~~Household income~~ | Number of adults | Number of children | ~~Employment status~~ | Car ownership | ~~Type of tenancy~~ | Rural/Urban Classification

This should provide better results than the current match_categorical implementation, where we only match once and have to sacrifice some variables to improve matching (as shown here).

In the final round, all households that are yet to be matched can be matched either randomly or to a household with values close to the mean.

Additional arguments to pass to match_categorical:

optional_columns: the columns that we can relax. It could be an ordered list; at each iteration, we remove the last column from the list before matching.

Hussein-Mahfouz · 2024-05-10T08:42:43Z

@sgreenbury the statistical matching approach I mentioned is in the ile-de-france project: link. I haven't tried the pipeline from that project, but it's well documented. Maybe it's something we should explore.

Hussein-Mahfouz · 2024-05-21T16:20:18Z

I've just found the following description in this paper:

In the next step, each sampled individual is then matched to an observation from the household travel survey, using hot-deck matching ((27), (17)). The idea is to find all source observations (i.e. all samples from the household travel survey) that match the target observations (i.e. synthetic agents previously sampled from the census) on a list of given matching attributes, and then to sample randomly one of those source observations. To avoid over-fitting, if too few source observations are found for a given target observation, some matching attributes are removed to enhance the set of matching source observations.

I like the step taken to avoid overfitting. They do statistical matching, but the same idea can be applied to categorical matching at the household level, with a threshold for the minimum number of matches.
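A minimal sketch of that idea, written from the quoted description rather than from the paper's code: attributes are dropped one at a time until a target has at least `min_matches` candidates, and one candidate is then sampled at random. The function name, the order in which attributes are dropped, and the toy columns are all assumptions.

```python
import random
import pandas as pd

def hot_deck_match(target, source, attributes, min_matches, rng):
    """Pick one random source row index for `target`, relaxing attributes if needed."""
    attrs = list(attributes)
    while True:
        mask = pd.Series(True, index=source.index)
        for col in attrs:
            mask &= source[col] == target[col]
        candidates = source.index[mask]
        # Accept once the pool is large enough, or nothing is left to relax.
        if len(candidates) >= min_matches or not attrs:
            break
        attrs.pop()  # assumption: the last attribute is the least important
    if len(candidates) == 0:
        return None, attrs
    return rng.choice(list(candidates)), attrs

# Toy travel-survey records (illustrative values only).
source = pd.DataFrame({
    "age_group": ["18-30", "18-30", "31-50", "18-30"],
    "sex": ["f", "m", "f", "f"],
    "employed": [True, True, True, False],
})
target = {"age_group": "18-30", "sex": "f", "employed": True}

donor, used = hot_deck_match(
    target, source, ["age_group", "sex", "employed"], min_matches=2, rng=random.Random(0)
)
```

With all three attributes only one record matches, which is below the threshold of two, so `employed` is dropped and the donor is drawn from the two records that remain.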

Hussein-Mahfouz · 2024-10-18T07:23:41Z

Notes on the implementation of the iterative_match_categorical() function:

You define a list of fixed_cols and a list of optional_cols. The optional columns are listed in order of importance.
At the first iteration, we try to match on all columns. At each subsequent iteration, we pop the last item from the optional_cols list.
After each iteration, we check the number of matches found for each household. If a household has more than n_matches, we stop adding matches to it during successive iterations.
The process terminates when we've gone through all optional_cols.
It does not guarantee that all households will be matched, but the matching rate is better than with our old approach.
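The notes above can be sketched roughly as follows. This is not the code merged in #60; it is a simplified reading of the notes, and the function name `iterative_match_sketch`, the id column names, and the assumption that the last round matches on fixed_cols only are all mine.

```python
import pandas as pd

def iterative_match_sketch(spc, nts, spc_id, nts_id, fixed_cols, optional_cols, n_matches):
    """Map each SPC id to a list of NTS ids, relaxing optional columns round by round."""
    matches = {hid: [] for hid in spc[spc_id]}
    optional = list(optional_cols)
    while True:
        cols = fixed_cols + optional
        # Households with more than n_matches matches stop receiving new ones.
        open_ids = [hid for hid, found in matches.items() if len(found) <= n_matches]
        open_spc = spc.loc[spc[spc_id].isin(open_ids), [spc_id] + cols]
        joined = open_spc.merge(nts[[nts_id] + cols], on=cols, how="inner")
        for hid, donor in zip(joined[spc_id], joined[nts_id]):
            if donor not in matches[hid]:
                matches[hid].append(donor)
        if not optional:
            break  # assumed final round: fixed columns only
        optional.pop()  # relax the last (assumed least important) optional column
    return matches

# Toy data with hypothetical column names.
spc = pd.DataFrame({
    "hid": [1, 2, 3],
    "num_adults": [1, 2, 2],
    "car_ownership": [0, 1, 2],
    "tenancy": ["own", "rent", "own"],
})
nts = pd.DataFrame({
    "HouseholdID": [10, 11, 12, 13],
    "num_adults": [1, 2, 2, 2],
    "car_ownership": [0, 1, 1, 3],
    "tenancy": ["own", "own", "rent", "rent"],
})

result = iterative_match_sketch(
    spc, nts, "hid", "HouseholdID",
    fixed_cols=["num_adults"],
    optional_cols=["car_ownership", "tenancy"],
    n_matches=1,
)
```

In this toy run household 2 collects two matches by the second round and is then closed, so it never picks up the looser match 13. Household 3 only finds matches in the last round, once both optional columns have been dropped.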

New approach - Sample of 15,000 households (columns used here):

0.4% of households in the SPC had no match.

Hussein-Mahfouz added the enhancement New feature or request label Apr 5, 2024

Hussein-Mahfouz self-assigned this Apr 5, 2024

Hussein-Mahfouz mentioned this issue Apr 18, 2024

Filter NTS data to study area to avoid unrepresentative travel distances or mode share #16

Closed

Hussein-Mahfouz added the Task 1 creating activity chains label Apr 19, 2024

Hussein-Mahfouz changed the title ~~Change Household Level matching from categorical to PSM~~ Improve matching approach May 9, 2024

sgreenbury mentioned this issue May 24, 2024

Speeding up the matching process #25

Closed

Hussein-Mahfouz added a commit that referenced this issue Oct 15, 2024

iterative matching, see #13

1c69de2

Hussein-Mahfouz mentioned this issue Oct 15, 2024

Iterative categorical matching #60

Merged

Hussein-Mahfouz linked a pull request Oct 15, 2024 that will close this issue

Iterative categorical matching #60

Merged

sgreenbury closed this as completed in #60 Nov 1, 2024
