Don't use Wikipedia disambiguation pages as IDs #1
Comments
I believe we addressed this on the collection.cooperhewitt.org site, but were you talking about anything within this repository?
I'm not sure how it was addressed on the web site, but it looks wrong to me in both the repository and the web site. See any of the entries deleted in this commit on my fork: tfmorris@9b06f37
P.S. I don't have an example to hand and am not sure I've corrected them, but I've also seen pages on the web site that contain wikitext like [[redirect ...]], meaning the redirect page itself was fetched instead of the redirect being followed.
Here's an example of a page with redirect text: http://collection.cooperhewitt.org/people/18041625/ You end up at the right page at Wikipedia eventually because browsers follow redirects, but it would be better to use the canonical ID. This particular one is cleaned up on my branch, but I'm not sure I got them all because I wasn't looking specifically for them.
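For anyone wanting to catch these mechanically, here is a minimal sketch of detecting redirect wikitext and walking to the canonical title. It assumes the stored text is raw wikitext beginning with the standard `#REDIRECT [[Target]]` form; `redirect_target`, `canonical_title`, and the `fetch_wikitext` callable are hypothetical names for illustration, not anything from this repository.

```python
import re

# MediaWiki redirect pages begin with "#REDIRECT [[Target]]" (case-insensitive).
# The capture stops at "]", "|" or "#" so section anchors and labels are dropped.
REDIRECT_RE = re.compile(r"^\s*#REDIRECT\s*:?\s*\[\[([^\]|#]+)", re.IGNORECASE)


def redirect_target(wikitext):
    """Return the redirect target title, or None if the text is not a redirect."""
    match = REDIRECT_RE.match(wikitext)
    if match is None:
        return None
    return match.group(1).strip().replace("_", " ")


def canonical_title(title, fetch_wikitext, max_hops=5):
    """Follow redirects until a non-redirect page is reached.

    fetch_wikitext is a hypothetical callable (title -> wikitext); in practice
    it might wrap an API call or a lookup in a local dump.
    """
    seen = {title}
    for _ in range(max_hops):
        target = redirect_target(fetch_wikitext(title))
        if target is None:
            return title  # not a redirect, so this is the canonical page
        if target in seen:
            raise ValueError("redirect loop at %r" % target)
        seen.add(target)
        title = target
    raise ValueError("too many redirects starting from %r" % title)
```

Passing the fetcher in as a parameter keeps the redirect logic testable without network access.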
OK, yeah, the website is working off a different repository; this one is just for the people concordances stuff. If you submit your pull request, we can probably get that all cleaned up!
Yeah, the disambiguation pages are a known-known. They aren't considered a feature but they aren't a bug either, at least in the short term. The effort, for the alpha release, was more about building out the tools/infrastructure around the idea(s) of how to hold hands with data from Wikipedia than it was about being absolutely perfect.

For example, there are probably a whole bunch of matches with multiple pages (distinct from disambiguation pages) that could be easily sorted with a little more code/smarts, but we opted to stay away from those records in the early stages so as not to get dragged into the quicksand of edge cases. As such, the disambiguation pages were tolerated because they still provide a way for humans to jump between our collection records and Wikipedia. The robots, for now, will just have to be confused :D

Now that the alpha site is live we are going to start revisiting a lot of the concordance-related tools and spend more time dealing with things like disambiguation. If you've already got a list of known "ambiguous" concordances – I saw your blog post (thanks!) but haven't had a chance to read it in detail – then I would be happy to merge the changes here and use that as a reference going forward.

Cheers,
Add link to comments about disambiguation pages (issue #1)
Ah, I see you've already cleaned up the disambiguation pages here: tfmorris@9b06f37#concordances.csv

I will spend some time with all this in the next couple of days and, if it's all good (measure twice and all that...), do the merge dance. Thanks!
@straup I see that this repo has been abandoned. Did the changes get merged in before everything was moved to the collection repo?
These pages, by their very nature, represent multiple things rather than a single thing, and should be blacklisted for any exercise like this.
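One way such a blacklist could be built, sketched under assumptions: the MediaWiki Action API can report a `disambiguation` page property when queried with `action=query&prop=pageprops&ppprop=disambiguation`, and the functions below only parse a response of that shape and filter rows. The column name `wikipedia_title` and both function names are hypothetical; the actual layout of concordances.csv may differ.

```python
def disambiguation_titles(api_response):
    """Collect titles flagged as disambiguation pages.

    Expects the JSON (already decoded to a dict) returned by a MediaWiki query
    with prop=pageprops&ppprop=disambiguation, where flagged pages carry a
    "disambiguation" key under "pageprops".
    """
    pages = api_response.get("query", {}).get("pages", {})
    return {
        page["title"]
        for page in pages.values()
        if "disambiguation" in page.get("pageprops", {})
    }


def drop_ambiguous(rows, ambiguous, column="wikipedia_title"):
    """Keep only rows whose Wikipedia title is not in the ambiguous set.

    rows is a list of dicts (e.g. from csv.DictReader); the column name is an
    assumption and should be adjusted to the real CSV header.
    """
    return [row for row in rows if row.get(column) not in ambiguous]
```

Rows dropped this way could be written to a separate file for manual review rather than discarded.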