Issues with the "incidence rate" in the "neighborhoods" analysis #6

Open
ghost opened this issue Jan 17, 2019 · 4 comments

@ghost

ghost commented Jan 17, 2019

@ProQuestionAsker @iblind

Inspired by your work I'm in the process of analyzing Yelp data for Berlin's neighborhoods, but I'm a little confused by the "incidence rate" you use. It's obviously the same that Katie Hempenius used. Do you happen to know where it comes from, or if she came up with it herself? The only incidence rate I know of comes from epidemiology and includes a component of time, which Katie's formula definitely does not. I've checked out some business resources and they all use the standard epidemiological definition.

Also, do you do any kind of within-category, between-district comparison in your analysis, i.e. do you rank the categories within each district and the districts within each category? As far as I can tell you don't. I ask because the second half of the formula, the part that normalizes the data, only affects the within-category, between-district rankings, but not the within-district, between-category rankings, since it is simply a constant at the within-district level. If you don't compare between districts then the normalization step is superfluous.
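To make the point about the normalization constant concrete, here is a minimal Python sketch. All counts are made up for illustration, and the form of the first half of the formula (the district's count of a category divided by the city's count) is an assumption, not something confirmed in this thread. The second half is the district-level constant, businesses in the city divided by businesses in the district.

```python
# All numbers below are hypothetical; none come from the Yelp data.
CITY_TOTAL = 10_000  # businesses in the whole city
CITY_COUNTS = {"cafe": 400, "bar": 500, "yoga": 20}  # city-wide count per category

districts = {
    "A": {"total": 200, "counts": {"cafe": 40, "bar": 25, "yoga": 5}},
    "B": {"total": 1000, "counts": {"cafe": 60, "bar": 150, "yoga": 8}},
}

def raw_score(district, category):
    """Assumed first half: district count relative to city count."""
    return districts[district]["counts"][category] / CITY_COUNTS[category]

def normalized_score(district, category):
    """Raw score times the district-level constant (city total / district total)."""
    k = CITY_TOTAL / districts[district]["total"]
    return raw_score(district, category) * k

def rank(keys, score):
    """Order keys from highest to lowest score."""
    return sorted(keys, key=score, reverse=True)

# Within district A, across categories: the constant cannot change the order.
within_raw = rank(CITY_COUNTS, lambda c: raw_score("A", c))
within_norm = rank(CITY_COUNTS, lambda c: normalized_score("A", c))

# Within the "yoga" category, across districts: the constant can change the order.
between_raw = rank(districts, lambda d: raw_score(d, "yoga"))
between_norm = rank(districts, lambda d: normalized_score(d, "yoga"))
```

With these made-up numbers the within-district ranking is identical with and without the constant, while the between-district ranking for "yoga" flips, because the constant differs from district to district.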

Hopefully you can tell me if I've maybe missed something. I plan to post my analysis on my blog in the next week or so. It's likely very few will read it, but I do plan to take up David Robinson on his offer, so if I'm lucky some people will read it. I respect your work and don't want it to seem like I've launched a surprise attack, ergo this "issue".

I have a few other objections to Katie's "incidence rate" beyond the above details. You are welcome to read and comment on them here. You'll want the "Deciding on a Metric" section that starts at line 359. It's still very much a rough draft, so some things will change, but my arguments in this section are fleshed out enough to understand what I'm aiming for.

@ProQuestionAsker
Contributor

Hey @GershTri,

This is so great, thanks for sharing!

I'll do my best to answer your questions, but let me know if you have any others.

I'm not entirely sure where Katie came up with her incidence rate calculation, but to us, it logically made sense as a way to calculate businesses that are uniquely common in a neighborhood compared to the city as a whole. Totally hear you on the semantics of the phrase "incidence rate"; maybe something like "prevalence" (as you suggest) would have been more appropriate to distinguish this type of calculation from the epidemiological usage. Though I believe we only used the phrase in the metadata here in our data repository, not in the article itself.

You're also right that we didn't really need to include the neighborhood level constant (businesses per city / businesses per neighborhood) in our calculation since we only ended up using this value for within-neighborhood calculations instead of between-neighborhood calculations. I believe at one point we had considered doing some between-neighborhood calculations (something like "The Number 1 spot for yogurt shops is x"), but we never ended up following through with that in the final version of the story.

I've also checked out your blog post draft and we totally hear you on most of your issues with the calculation. For us, though, we decided early on against including the reviews in our calculations since we were more interested in what businesses existed, rather than people's experiences with those businesses (including reviews also discounts new businesses, which may have few reviews).

All of that being said, I re-ran our Seattle data using your proposed calculation (without the review step), and I'm a little unsure that it solves all of the problems that you hoped it would solve. Now, almost all of the small neighborhoods are being led by a single business because it is the only one in the whole city and it is located in that neighborhood. This issue is why we implemented the "businesses must make up more than 1% of the neighborhood's businesses" rule when we ran this analysis.

Here's a screenshot of what the analysis looks like on our Seattle data.
[Screenshot: Seattle analysis output, 2019-01-18, 10:37 AM]

The nCount is the number of times that type of business occurs in a neighborhood, whereas cCount is the number of times that type of business occurs in the whole city. relPrev is the first part of your proposed formula (which you refer to as relative prevalence), and proportion is the second part of your proposed formula. altCalc is thus relPrev * proportion. Much of this comes down to the way that we are each considering a business (or type of business) to best describe a neighborhood. For me, I'd have a hard time saying that eatertainment is the best descriptor of Admiral, even though that neighborhood contains the only one in the city. In the end, we landed on businesses that are more popular in a neighborhood than they are in the city, provided they make up at least 1% of that neighborhood's businesses. But, I can totally see why you may choose to use a slightly different description or metric.
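The rule described above (a category is more common in the neighborhood than in the city, and it clears the 1% floor) could be sketched in Python roughly as follows. This is only an illustration, not the code used for the story: the function name, the totals passed in, and all counts are hypothetical, and the threshold is treated as strict, following the "more than 1%" wording.

```python
def characteristic_categories(n_counts, c_counts, n_total, c_total, min_share=0.01):
    """Return categories over-represented in a neighborhood that also clear the share floor.

    n_counts / c_counts map category -> nCount / cCount (neighborhood / city counts).
    n_total / c_total are the total businesses in the neighborhood / city.
    """
    kept = []
    for category, n_count in n_counts.items():
        n_share = n_count / n_total             # share of the neighborhood's businesses
        c_share = c_counts[category] / c_total  # share of the city's businesses
        if n_share > c_share and n_share > min_share:
            kept.append(category)
    return kept

# Hypothetical neighborhood with 200 businesses in a city of 10,000.
n_counts = {"coffee": 30, "pizza": 10, "eatertainment": 1}
c_counts = {"coffee": 600, "pizza": 300, "eatertainment": 1}

result = characteristic_categories(n_counts, c_counts, n_total=200, c_total=10_000)
```

In this made-up example "eatertainment" is the only business of its kind in the city, so it is heavily over-represented, but at 0.5% of the neighborhood's businesses it falls below the floor and is dropped.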

@ghost
Author

ghost commented Jan 19, 2019

Hi @ProQuestionAsker,

Thanks so much for your answer! It really clears up a lot for me. I am certain that there are some kinks in my formula. It seems to perform pretty well for Berlin, but I'm not completely satisfied with the results myself.

The output for Seattle isn't too surprising. There's a step further down in my script (hidden on line 659, no section title or text yet) where I also filter out less frequent businesses: filter(preval_dist > 0.01 | review_dist_prop > 0.01), i.e. the category must make up more than 1% of the district's businesses or more than 1% of its total reviews. I include the latter condition because I think larger institutions (beaches, shopping centers, botanical gardens, etc.) can also characterize a district/neighborhood, but they are by nature infrequent and would be excluded by the 1% of businesses criterion.
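For readers who want to see the effect of the "or" condition, here is a rough Python equivalent of that filter step. The column names preval_dist and review_dist_prop are taken from the filter call above; the rows and their values are invented for illustration.

```python
# Hypothetical per-category rows for a single district.
rows = [
    {"category": "cafe", "preval_dist": 0.08, "review_dist_prop": 0.05},
    {"category": "botanical_garden", "preval_dist": 0.002, "review_dist_prop": 0.04},
    {"category": "notary", "preval_dist": 0.004, "review_dist_prop": 0.001},
]

THRESHOLD = 0.01  # 1% of the district's businesses, or 1% of its reviews

# Keep a category if it passes either condition.
kept = [
    row["category"]
    for row in rows
    if row["preval_dist"] > THRESHOLD or row["review_dist_prop"] > THRESHOLD
]

# For comparison: the businesses-only criterion.
kept_businesses_only = [
    row["category"] for row in rows if row["preval_dist"] > THRESHOLD
]
```

Here the botanical garden survives only because of its review share; under the businesses-only criterion it would be excluded along with the notary.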

I understand why you wouldn't want to include reviews in your calculation, given your (very reasonable) goals. That said, technically I'm not interested in people's experiences with the businesses. If that were the case I would take positive and negative reviews into account. Instead, I'm really only interested in total reviews, positive and negative, as a non-specific indicator of the interest generated by a given category. That way, less prevalent categories that generate a disproportionate amount of interest can compete against slightly more prevalent categories that no one really cares about. I think this is important because we are talking about what characterizes a district/neighborhood, which for me should include some element of human activity, i.e. not just what exists, but what people actually do there. Still, there are definitely some arguments to be made against using reviews.

By the way, I've been following your work since your film dialogue analysis a couple years back and you've been a big inspiration for me. Awesome work!

I'll let you know when the final draft is done so you have a chance to respond, clarify etc. You're probably busy, so no big deal if you don't get around to it. I can also always add edits later. Thanks again for your time!

@khempenius

Hi @GershTri,

Thanks for the email. Yep, none of your objections re: metric selection come as a surprise (I think all of those considerations crossed my mind before settling on that).

I think normalizing data (or even coming up with the metric in the first place) is far from a one-size-fits-all thing. For what I was trying to achieve (and the particular data set) this seemed like the best approach - I wouldn't be surprised if your particular project uses something slightly different.

@ghost
Author

ghost commented Jan 22, 2019

Hi @khempenius,

Thanks for taking the time to answer. You are absolutely right, there are advantages and disadvantages to every solution. Berlin, San Francisco, Seattle and New York are also very different cities. I noticed, for example, that Berlin has a third as many Yelp reviews despite being approx. three times larger than SF, and a majority of those seem to be restaurants and cafés.

I really appreciate your and @ProQuestionAsker's willingness to have this conversation with me. I think my post will be all the better for it!
