Don't use Wikipedia disambiguation pages as IDs #1

Open
tfmorris opened this issue Sep 30, 2012 · 8 comments

Comments

@tfmorris

These pages, by their very nature, represent multiple things rather than a single thing, and should be blacklisted for any exercise like this.
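
A minimal sketch of what applying such a blacklist could look like. The column names (`person_id`, `wikipedia_title`), the sample rows, and the blacklist entry are all invented for illustration and are not taken from this repository's concordances.csv.

```python
import csv
import io


def split_by_blacklist(rows, blacklist, column="wikipedia_title"):
    """Return (kept, rejected); a row is rejected when its title is blacklisted."""
    kept, rejected = [], []
    for row in rows:
        (rejected if row[column] in blacklist else kept).append(row)
    return kept, rejected


# Invented sample data; the real file layout may differ.
SAMPLE_CSV = (
    "person_id,wikipedia_title\n"
    "1,Example Person\n"
    "2,Example Name (disambiguation)\n"
)

# Hypothetical list of titles already known to be disambiguation pages.
BLACKLIST = {"Example Name (disambiguation)"}

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
kept, rejected = split_by_blacklist(rows, BLACKLIST)
```

Keeping the rejected rows in a separate list, rather than discarding them, leaves a record that someone can review or re-match by hand later.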

@micahwalter
Contributor

I believe we addressed this on the collection.cooperhewitt.org site, but were you talking about anything within this repository?

@tfmorris
Author

tfmorris commented Oct 3, 2012

I'm not sure how it was addressed on the web site, but it looks wrong to me both in the repository and on the web site. See any of the entries deleted in this commit on my fork: tfmorris@9b06f37
For example, http://collection.cooperhewitt.org/people/18041077/

@tfmorris
Author

tfmorris commented Oct 3, 2012

P.S. I don't have an example to hand and am not sure I've corrected them all, but I've also seen pages on the web site that show wikitext like [[redirect ...]], meaning the redirect page itself was fetched instead of the redirect being followed.

@tfmorris
Author

tfmorris commented Oct 3, 2012

Here's an example of a page with redirect text:
http://collection.cooperhewitt.org/people/18041625/
You end up at the right page at Wikipedia eventually because browsers follow redirects, but it would be better to use the canonical ID. This particular one is cleaned up on my branch, but I'm not sure I got them all because I wasn't looking specifically for them.
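
For illustration, a rough sketch of resolving a redirect to its canonical title before storing it. It relies on the standard MediaWiki redirect syntax, `#REDIRECT [[Target]]`. The helper names and the `fetch_wikitext` callback are hypothetical, and nothing here reflects how the Cooper-Hewitt tools actually fetch pages.

```python
import re

# A MediaWiki redirect page begins with "#REDIRECT [[Target]]"; the keyword
# is case-insensitive. Capture the target, stopping at "|", "#", or "]".
_REDIRECT_RE = re.compile(r"^\s*#REDIRECT\s*:?\s*\[\[([^\]|#]+)", re.IGNORECASE)


def redirect_target(wikitext):
    """Return the redirect target title, or None if this is not a redirect."""
    match = _REDIRECT_RE.match(wikitext)
    if match is None:
        return None
    return match.group(1).strip().replace("_", " ")


def canonical_title(title, fetch_wikitext, max_hops=5):
    """Follow redirects until a non-redirect page is reached.

    fetch_wikitext is a caller-supplied function (an assumption here) that
    maps a page title to its raw wikitext.
    """
    for _ in range(max_hops):
        target = redirect_target(fetch_wikitext(title))
        if target is None:
            return title
        title = target
    raise ValueError("too many redirects starting near %r" % title)
```

With an in-memory stand-in for the fetcher, `canonical_title("Old Name", pages.__getitem__)` returns the final title when `pages` maps `"Old Name"` to a redirect and the target to ordinary article text. The hop limit guards against redirect loops.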

@micahwalter
Contributor

OK, yeah, the website is working off a different repository; this one is just for the people concordances stuff. If you submit your pull request, we can probably get that all cleaned up!

@straup
Collaborator

straup commented Oct 4, 2012

Yeah, the disambiguation pages are a known-known. They aren't considered a feature but they aren't a bug either, at least in the short term.

The effort, for the alpha release, was more about building out the tools/infrastructure around the idea(s) of how to hold hands with data from Wikipedia than it was about being absolutely perfect.

For example, there are probably a whole bunch of matches with multiple pages (distinct from disambiguation pages) that could be easily sorted with a little more code/smarts, but we opted to stay away from those records in the early stages so as not to get dragged into the quicksand of edge cases.

As such the disambiguation pages were tolerated because they still provide a way for humans to jump between our collection records and Wikipedia. The robots, for now, will just have to be confused :D

Now that the alpha site is live we are going to start revisiting a lot of the concordance-related tools and spend more time dealing with things like disambiguation.

If you've already got a list of known "ambiguous" concordances – I saw your blog post (thanks!) but haven't had a chance to read it in detail – then I would be happy to merge the changes here and use that as a reference going forward.

Cheers,

straup pushed a commit that referenced this issue Oct 4, 2012
Add link to comments about disambiguation pages (issue #1)
@straup
Collaborator

straup commented Oct 4, 2012

Ah, I see you've already cleaned up the disambiguation pages here:

tfmorris@9b06f37#concordances.csv

I will spend some time with all this in the next couple of days and if it's all good (measure twice and all that...) do the merge dance.

Thanks!

@tfmorris
Author

@straup I see that this repo has been abandoned. Did the changes get merged in before everything was moved to the collection repo?
