Issues with the "incidence rate" in the "neighborhoods" analysis #6
Comments
Thanks so much for your answer! It really clears up a lot for me. I am certain that there are some kinks in my formula. It seems to perform pretty well for Berlin, but I'm not completely satisfied with the results myself. The output for Seattle isn't too surprising. There's a step further down in my script (hidden on line 659, no section title or text yet) where I also filter out less frequent businesses.

I understand why you wouldn't want to include reviews in your calculation, given your (very reasonable) goals. That said, technically I'm not interested in people's experiences with the businesses. If that were the case, I would distinguish between positive and negative reviews. Instead, I'm really only interested in total reviews, positive and negative alike, as a non-specific indicator of the interest generated by a given category. That way, less prevalent categories that generate a disproportionate amount of interest can compete against slightly more prevalent categories that no one really cares about. I think this is important because we are talking about what characterizes a district/neighborhood, which for me should include some element of human activity, i.e. not just what exists, but what people actually do there. Still, there are definitely some arguments to be made against using reviews.

By the way, I've been following your work since your film dialogue analysis a couple of years back, and you've been a big inspiration for me. Awesome work! I'll let you know when the final draft is done so you have a chance to respond, clarify, etc. You're probably busy, so no big deal if you don't get around to it. I can also always add edits later. Thanks again for your time!
Hi @GershTri, Thanks for the email. Yep, none of your objections re: metric selection come as a surprise (I think all of those considerations crossed my mind before I settled on that metric). I think normalizing data (or even coming up with the metric in the first place) is far from a one-size-fits-all thing. For what I was trying to achieve (and the particular data set) this seemed like the best approach - I wouldn't be surprised if your particular project uses something slightly different.
Hi @khempenius, Thanks for taking the time to answer. You are absolutely right, there are advantages and disadvantages to every solution. Berlin, San Francisco, Seattle and New York are also very different cities. I noticed, for example, that Berlin has a third as many Yelp reviews despite being approx. three times larger than SF, and a majority of those seem to be for restaurants and cafés. I really appreciate your and @ProQuestionAsker's willingness to have this conversation with me. I think my post will be all the better for it!
@ProQuestionAsker @iblind
Inspired by your work, I'm in the process of analyzing Yelp data for Berlin's neighborhoods, but I'm a little confused by the "incidence rate" you use. It's obviously the same one that Katie Hempenius used. Do you happen to know where it comes from, or if she came up with it herself? The only incidence rate I know of comes from epidemiology and includes a component of time, which Katie's formula definitely does not. I've checked out some business resources and they all use the standard epidemiological definition.
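For reference, the standard epidemiological incidence rate mentioned above is usually written as follows (the symbol names are only labels chosen here, not taken from any of the analyses under discussion):

```latex
\text{incidence rate} = \frac{\text{number of new cases during the observation period}}{\text{total person-time at risk during that period}}
```

The denominator is measured in units such as person-years, which is the time component that a purely count-based formula lacks.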
Also, do you do any kind of within-category, between-district comparison in your analysis, i.e. do you rank the categories within each district and the districts within each category? As far as I can tell you don't. I ask because the second half of the formula, the part that normalizes the data, only affects the within-category, between-district rankings, but not the within-district, between-category rankings, since it is simply a constant at the within-district level. If you don't compare between districts, then the normalization step is superfluous.
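A minimal sketch of that argument, using made-up numbers. The counts, the district names as used here, and the helper `district_constant` are all hypothetical, and the normalizer (one over the district's total business count) is only a stand-in for "some constant per district"; it is not Katie's actual formula.

```python
# Hypothetical business counts per (district, category); not real Yelp data.
counts = {
    "Mitte": {"cafes": 40, "bars": 25, "galleries": 10},
    "Neukölln": {"cafes": 30, "bars": 15, "galleries": 5},
}


def district_constant(district_counts):
    """Assumed normalizer: 1 / total businesses in the district.

    Any positive factor that depends only on the district behaves the same way.
    """
    return 1.0 / sum(district_counts.values())


def rank_categories(scores):
    """Within-district ranking: categories ordered by score, highest first."""
    return sorted(scores, key=scores.get, reverse=True)


def rank_districts(table, category):
    """Within-category ranking: districts ordered by score, highest first."""
    return sorted(table, key=lambda d: table[d][category], reverse=True)


# Multiply every count in a district by that district's constant.
normalized = {
    district: {
        category: n * district_constant(cats) for category, n in cats.items()
    }
    for district, cats in counts.items()
}

for district in counts:
    # The constant rescales every category in the district equally,
    # so the category order inside the district cannot change.
    print(district, rank_categories(counts[district]),
          rank_categories(normalized[district]))

# Between districts the constants differ, so the order can change.
print(rank_districts(counts, "cafes"))
print(rank_districts(normalized, "cafes"))
```

With these numbers the category ranking inside each district is identical before and after normalization, while the district ranking for cafes flips, which is the only place the normalization step has any effect.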
Hopefully you can tell me if I've maybe missed something. I plan to post my analysis on my blog in the next week or so. It's likely very few will read it, but I do plan to take up David Robinson on his offer, so if I'm lucky some people will read it. I respect your work and don't want it to seem like I've launched a surprise attack, ergo this "issue".
I have a few other objections to Katie's "incidence rate" beyond the above details. You are welcome to read and comment on them here. You'll want the "Deciding on a Metric" section that starts at line 359. It's still very much a rough draft, so some things will change, but my arguments in this section are fleshed out enough to understand what I'm aiming for.