wiki examples: sample code to get from tfidf doc to wikipedia title/uri and vice versa #1322

Closed
joernhees opened this issue May 14, 2017 · 9 comments
Labels
  • difficulty easy: Easy issue: required small fix
  • feature: Issue described a new feature
  • good first issue: Issue for new contributors (not required gensim understanding + very simple)

Comments

@joernhees

I very much like the LSI and LDA wiki examples, but one aspect that I think is missing is: how to get from tf-idf doc vectors (or later LSI / LDA vecs) back to Wikipedia URIs (or titles if easier) and vice versa?

Am I missing something obvious, or do I have to run another pass over the wiki dump, as the titles aren't saved anywhere?

I'll happily make a PR to extend the examples with this...

@poornagurram
Contributor

@joernhees can you please explain your issue more clearly?

@joernhees
Author

I think it's already about as clear as I can make it...

  • how to get from Wikipedia URI to vector?
  • how to get from vector to Wikipedia URI?

After following the tutorial, I wasn't able to easily find a link between "topic #0" and its Wikipedia article.

@menshikh-iv
Contributor

@joernhees As far as I remember, the Wikipedia dump doesn't store URIs (but you can reconstruct them from the title).
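
Rebuilding a URI from a title could look roughly like the sketch below. It assumes the English Wikipedia host and the usual convention that spaces in a title become underscores; `title_to_uri` and `uri_to_title` are hypothetical helper names, not part of gensim.

```python
from urllib.parse import quote, unquote

# Assumption: English Wikipedia; swap the host for other language editions.
BASE = "https://en.wikipedia.org/wiki/"


def title_to_uri(title: str) -> str:
    """Build an article URI from a title (spaces become underscores)."""
    return BASE + quote(title.replace(" ", "_"))


def uri_to_title(uri: str) -> str:
    """Invert title_to_uri for URIs that start with BASE."""
    if not uri.startswith(BASE):
        raise ValueError(f"not a {BASE} URI: {uri}")
    return unquote(uri[len(BASE):]).replace("_", " ")


uri = title_to_uri("Albert Einstein")
title = uri_to_title(uri)
```

Titles that themselves contain underscores come back with spaces, which Wikipedia treats as the same title.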

About your questions: to get a vector, pass your page's bag-of-words to any model (TfIdf, Lsi, Lda, etc.) through its __getitem__ method, for example model[my_bow].

From a vector you can't retrieve the original article (but you can store an article <-> vector mapping yourself).
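
A minimal sketch of such a mapping, assuming the titles are available in the same order as the bag-of-words documents. The helper names and the `HalveWeights` class are hypothetical; `HalveWeights` only stands in for a trained model so the example runs without gensim, and any object that supports `model[bow]` could take its place.

```python
from typing import Dict, Iterable, List, Tuple

Vector = List[Tuple[int, float]]  # sparse (feature_id, weight) pairs


def build_title_maps(titles: Iterable[str]) -> Tuple[List[str], Dict[str, int]]:
    """Return (docno -> title list, title -> docno dict) in corpus order."""
    docno_to_title = list(titles)
    title_to_docno = {title: docno for docno, title in enumerate(docno_to_title)}
    return docno_to_title, title_to_docno


def transform_by_title(model, bows: Iterable[Vector],
                       docno_to_title: List[str]) -> Dict[str, Vector]:
    """Apply model[bow] to every document and key the result by title."""
    return {title: model[bow] for title, bow in zip(docno_to_title, bows)}


class HalveWeights:
    """Hypothetical stand-in for a trained model; only supports model[bow]."""

    def __getitem__(self, bow: Vector) -> Vector:
        return [(feature_id, weight / 2) for feature_id, weight in bow]


# Toy data: two documents with made-up feature ids and weights.
titles = ["Anarchism", "Autism"]
bows = [[(0, 2.0), (3, 4.0)], [(1, 6.0)]]

docno_to_title, title_to_docno = build_title_maps(titles)
vectors = transform_by_title(HalveWeights(), bows, docno_to_title)
```

Holding every vector in a dict is only practical for small corpora; for a full dump it is probably enough to keep the two title maps and fetch vectors by document number.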

@piskvorky
Owner

piskvorky commented Aug 15, 2017

@joernhees the tutorial doesn't construct any url/id<->vector mapping by default.

IIRC there was some discussion around adding this, since it's such a useful and commonly requested feature. In fact, it may have been added in one of the recent notebooks -- @menshikh-iv ?

@piskvorky
Owner

piskvorky commented Aug 15, 2017

Found it: https://github.com/RaRe-Technologies/gensim/blob/develop/gensim/corpora/wikicorpus.py#L278

@menshikh-iv @joernhees can you open an issue that will set metadata=True in the make_wiki script? (and test that it works)
The vectors without titles are kinda useless, so True is a better default.

At the same time, fix the language in the docs, because it reads poorly and is not very informative. Related to #1161.
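
As a rough illustration of what the metadata flag makes possible: the sketch below assumes that, with metadata enabled, the corpus yields `(tokens, (page_id, title))` pairs instead of bare token lists. That shape is an assumption here, as is the hypothetical helper `strip_metadata`; the toy `stream` stands in for the real corpus.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

Tokens = List[str]
Meta = Tuple[str, str]  # assumed shape: (page_id, title)


def strip_metadata(stream: Iterable[Tuple[Tokens, Meta]],
                   titles: Dict[int, str]) -> Iterator[Tokens]:
    """Yield token lists only, recording docno -> title in `titles` on the way."""
    for docno, (tokens, (_page_id, title)) in enumerate(stream):
        titles[docno] = title
        yield tokens


# Toy stand-in for a metadata-enabled corpus stream (illustrative values).
stream = [
    (["anarchism", "political", "philosophy"], ("12", "Anarchism")),
    (["autism", "disorder"], ("25", "Autism")),
]

titles: Dict[int, str] = {}
texts = list(strip_metadata(stream, titles))
```

Because the titles are collected in the same pass that produces the token lists, no second pass over the dump is needed, and document number `n` in any derived vector corpus lines up with `titles[n]`.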

@menshikh-iv menshikh-iv added difficulty easy Easy issue: required small fix feature Issue described a new feature test before incubator labels Oct 2, 2017
@menshikh-iv menshikh-iv added good first issue Issue for new contributors (not required gensim understanding + very simple) needs domain knowledge and removed test before incubator labels Oct 16, 2017
@djinn-anthrope

Hello!

What is the status of this issue? If nobody is working on it, I'd like to take it up. Is the task to create a mapping from the doc with the Tf-Idf values to the actual corpus article?

Thanks.

@menshikh-iv
Contributor

Hello @alokdebnath, feel free to contribute! The task is what #1322 (comment) suggested, plus some kind of mapping to vectors (as an optional extra).

@dnabanita7

Can I take up this issue?

@menshikh-iv
Contributor

@naba7 of course, feel free to contribute!

Xinyi2016 added a commit to Xinyi2016/gensim that referenced this issue Oct 31, 2018
Xinyi2016 added a commit to Xinyi2016/gensim that referenced this issue Oct 31, 2018
menshikh-iv pushed a commit that referenced this issue Jan 11, 2019
* Add metadata for wiki examples (#1322)

* Add metadata for wiki examples (#1322)

* update output list of files

* upd docstring

* three -> several
6 participants