wiki examples: sample code to get from tfidf doc to wikipedia title/uri and vice versa #1322
Comments
@joernhees can you please explain your issue clearly?
i think it's already about as clear as i can...
after following the tutorial, i wasn't able to easily find a link between "topic #0" and its wikipedia article
@joernhees As I remember, Wikipedia doesn't store the URI in the dump (but you can reconstruct it from the "Title"). About your questions: you should pass your page's bag-of-words to any model (TfIdf, Lsi, Lda, etc), for this you should use From vector, you can't retrieve original article (but you can store mapping
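A minimal sketch of rebuilding a URI from a title, as suggested above. It assumes the English Wikipedia base URL and the usual MediaWiki convention of replacing spaces with underscores; `title_to_uri` and `uri_to_title` are hypothetical helper names, not part of gensim.

```python
from urllib.parse import quote, unquote

# Assumption: English Wikipedia. Other language editions use a different host.
WIKI_BASE = "https://en.wikipedia.org/wiki/"


def title_to_uri(title, base=WIKI_BASE):
    """Hypothetical helper: build an article URI from a dump title."""
    # MediaWiki URLs use underscores where titles have spaces;
    # remaining special characters are percent-encoded.
    return base + quote(title.replace(" ", "_"))


def uri_to_title(uri, base=WIKI_BASE):
    """Hypothetical helper: recover the title from an article URI."""
    if not uri.startswith(base):
        raise ValueError("not an article URI under %r: %r" % (base, uri))
    return unquote(uri[len(base):]).replace("_", " ")


uri = title_to_uri("Machine learning")
title = uri_to_title(uri)
```

This only covers the title <-> URI half; linking a vector to its title still needs a stored mapping.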
@joernhees the tutorial doesn't construct any url/id<->vector mapping by default. IIRC there was some discussion around adding this, since it's such a useful and commonly requested feature. In fact, it may have been added in one of the recent notebooks -- @menshikh-iv ?
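For illustration, here is a rough sketch of what such a mapping could look like if built by hand. It is plain Python and does not use gensim; `build_index` and the toy `pages` list are hypothetical, and the bag-of-words here keeps token strings instead of integer ids to stay short. The idea is only that the position of a document in the corpus (its docno) is the key that ties a vector to its title.

```python
from collections import Counter


def build_index(pages):
    """Hypothetical helper.

    pages: iterable of (title, tokens) pairs, in corpus order.
    Returns the bag-of-words vectors plus lookups in both directions.
    """
    bows = []          # docno -> list of (token, count), sorted by token
    docno2title = []   # docno -> title
    title2docno = {}   # title -> docno
    for docno, (title, tokens) in enumerate(pages):
        docno2title.append(title)
        title2docno[title] = docno
        bows.append(sorted(Counter(tokens).items()))
    return bows, docno2title, title2docno


# Toy stand-in for a parsed dump (made-up tokens, for illustration only).
pages = [
    ("Anarchism", ["anarchism", "political", "philosophy"]),
    ("Autism", ["autism", "disorder", "neural", "disorder"]),
]
bows, docno2title, title2docno = build_index(pages)

# Vector -> title: use the vector's position in the corpus.
title_of_second_doc = docno2title[1]
# Title -> vector: look up the docno, then index into the corpus.
vector_for_anarchism = bows[title2docno["Anarchism"]]
```

Any model that preserves document order when transforming the corpus (tf-idf, LSI, LDA) can reuse the same two lookups, since the docno does not change.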
Found it: https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/corpora/wikicorpus.py#L278 @menshikh-iv @joernhees can you open an issue that will set At the same time, fix the language in the docs, because it reads poorly and is not very informative. Related to #1161.
Hello! What is the status of this issue? If nobody is working on it, I'd like to take it up. Is the task to create a mapping from the doc with the Tf-Idf values back to the actual corpus article? Thanks.
Hello @alokdebnath, feel free to contribute! The task is as #1322 (comment) suggested + some kind of mapping with vectors (as an optional thing)
Can I take up this issue?
@naba7 of course, feel free to contribute!
I very much like the LSI and LDA wiki examples, but one aspect that i think is missing is: how to get from tf-idf doc vectors (or later LSI / LDA vecs) back to Wikipedia URIs (or titles if easier) and vice versa?
Am i missing something obvious, or do i have to run another pass over the wiki dump, as the titles aren't saved anywhere?
I'll happily make a PR to extend the examples with this...