A posteriori log-likelihood computed for a new document #3

Open
vressegu opened this issue Jun 27, 2018 · 0 comments

Dear Prof. Blei and collaborators,

I am trying to use your code to apply LDA to anomaly detection in a Bayesian framework.
However, I am not sure that the method "score" of the class LatentDirichletAllocation does what I want.

More specifically, using the notation of the paper Latent Dirichlet Allocation, Blei, Ng & Jordan (2003),
I have a corpus of documents D = { w_1, ..., w_M } for learning.
I would like to use smoothing to handle out-of-vocabulary issues.
So, the hyper-parameters of the LDA model I fit on D are \alpha and \eta.

Then, I want to do anomaly detection by using the LDA model as a Bayesian semi-supervised classifier.
I assume that all documents w_i of the initial corpus D belong to the class "normal" (class 1).
When I see a new document w_{test}, I try to determine whether it belongs to the class "normal" (class 1) or to the class "anomaly" (class -1).
To do this, I would like to compute the a posteriori probability
p ( w_{test} | D , \alpha , \eta ).
If it is too small, w_{test} is considered an anomaly.
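
To make the intended workflow concrete, here is a minimal sketch of what I have in mind, assuming the scikit-learn implementation of LatentDirichletAllocation (with doc_topic_prior and topic_word_prior playing the roles of \alpha and \eta) and a purely hypothetical threshold value; whether score() really returns the quantity I want is exactly my question below.

```python
# Minimal sketch of the intended anomaly-detection workflow.
# Assumption: scikit-learn's LatentDirichletAllocation; threshold is hypothetical.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

normal_docs = ["first normal document", "second normal document"]  # corpus D
test_docs = ["a new document to classify"]                         # w_test

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(normal_docs)
X_test = vectorizer.transform(test_docs)

# doc_topic_prior and topic_word_prior correspond to \alpha and \eta
lda = LatentDirichletAllocation(n_components=10,
                                doc_topic_prior=0.1,    # \alpha
                                topic_word_prior=0.01,  # \eta
                                random_state=0)
lda.fit(X_train)  # fit on the "normal" corpus D

log_likelihood = lda.score(X_test)  # approximate log-likelihood of w_test
threshold = -1000.0                 # hypothetical, to be tuned on held-out normal documents
is_anomaly = log_likelihood < threshold
```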

However, I do not know whether the method "score" of the class LatentDirichletAllocation computes
p ( w_{test} | \alpha , \eta ) (formula 1)
= \int d\beta \, p ( w_{test} | \alpha , \beta ) p ( \beta | \eta )
or
p ( w_{test} | D , \alpha , \eta ) (formula 2)
= \int d\beta \, p ( w_{test} | \alpha , \beta ) p ( \beta | D , \eta ).
My intuition is that the score method may originally be intended for fitting \alpha and \eta on the corpus D, and thus it is not exactly what I need.
I think that p ( \beta | D , \eta ) (the a posteriori pdf of \beta, the distribution of words in each topic) contains much more information about the statistics of the corpus D than p ( \beta | \eta ) (the a priori pdf of \beta). Hence, it would be better suited for my classification problem.
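
In case score does not use the posterior over \beta, my fallback would be a crude plug-in approximation of (formula 2), continuing from the fitted lda and X_test in the sketch above: replace the integral over \beta by the posterior mean of \beta learned on D, and the integral over \theta by the per-document topic proportions returned by transform(). This is only a rough sketch, not the exact marginal likelihood.

```python
# Rough plug-in approximation of (formula 2), continuing from `lda` and `X_test`
# in the previous snippet.
# Assumption: the posterior mean of \beta is components_ normalized row-wise.
import numpy as np

beta_hat = lda.components_ / lda.components_.sum(axis=1, keepdims=True)  # ~ E[\beta | D, \eta]
theta_hat = lda.transform(X_test)                                        # per-document topic proportions

word_probs = theta_hat @ beta_hat  # p(word | doc), shape (n_docs, n_words)
counts = X_test.toarray()          # word counts of the test documents
log_lik = (counts * np.log(word_probs + 1e-12)).sum(axis=1)  # per-document log-likelihood
```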

So, could you please tell me whether the score method implements (formula 1) or (formula 2)?

Thank you in advance.

Kind Regards,
Valentin Resseguier
