To have Lucene query DSL compliant Search API #429

Closed
lalitpagaria opened this issue Sep 24, 2020 · 7 comments

@lalitpagaria
Contributor

lalitpagaria commented Sep 24, 2020

Is your feature request related to a problem? Please describe.
I am trying to integrate Haystack with Confluence (and later extend the work to other Atlassian products). Confluence search works very poorly for us (at my workplace) when we try to find the right page (specifically RFCs), which leads to duplicate work. I am writing a Confluence plugin to route these full-text search queries to Haystack instead of the default Lucene-based system.

Describe the solution you'd like
To have a Lucene query DSL compliant /query endpoint (I am not asking for support of each and every feature). It will make integration with other systems more straightforward. It will also make Haystack a drop-in replacement (in a limited capacity) for any ES or Lucene-based system.

Describe alternatives you've considered
I am currently thinking of modifying LUcene QUery Manipolator to translate Lucene query DSL queries into the format supported by the /doc-qa endpoint.

Additional context
None

@lalitpagaria lalitpagaria added the type:feature New feature or request label Sep 24, 2020
@lalitpagaria
Contributor Author

@tanaysoni @tholor just checking whether this suggested enhancement aligns with your product roadmap?
If yes, I will wait; otherwise I will write a slightly hackish solution to complete the integration with Confluence.

@tanaysoni
Contributor

Hi @lalitpagaria, thank you for the feature request. I think integrations to other systems would be useful for the community!

I like the idea of translating Lucene queries to fit the /doc-qa endpoint. The query string can be converted to a question, along with any additional metadata filters we can extract from the Lucene query.
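
To make the idea concrete, here is a minimal sketch of splitting a Lucene-style query string into a free-text question plus metadata filters. It is a regex-based simplification, not a real Lucene parser (no nesting, ranges, or boosts), and it is not taken from Haystack. The function name `lucene_to_question` and the `question`/`filters` keys are hypothetical and are not confirmed to match the /doc-qa request schema.

```python
import re

# Matches field:value or field:"quoted value" (simplified Lucene field terms).
_FIELD_TERM = re.compile(r'(\w+):(?:"([^"]*)"|([^\s()"]+))')
# Boolean operators are dropped from the free-text part.
_OPERATORS = {"AND", "OR", "NOT"}


def lucene_to_question(query: str) -> dict:
    """Split a simple Lucene query into a question and metadata filters.

    Hypothetical helper: the returned keys are illustrative only.
    """
    filters = {}

    def _collect(match):
        field = match.group(1)
        # group(2) is the quoted form, group(3) the bare-word form.
        value = match.group(2) if match.group(2) is not None else match.group(3)
        filters.setdefault(field, []).append(value)
        return " "  # remove the field term from the free text

    remainder = _FIELD_TERM.sub(_collect, query)
    # Keep the remaining words, ignoring parentheses, quotes and operators.
    words = [w for w in re.findall(r'[^\s()"]+', remainder) if w not in _OPERATORS]
    return {"question": " ".join(words), "filters": filters}


if __name__ == "__main__":
    print(lucene_to_question('space:ENG AND type:"design doc" AND how to rotate keys'))
```

A real implementation would more likely walk the syntax tree produced by a proper Lucene parser, so that grouping and negation are handled correctly.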

I am curious about how you're planning the ingestion of documents from Confluence to a Haystack Document Store?

@lalitpagaria
Contributor Author

lalitpagaria commented Oct 2, 2020

@tanaysoni Currently I am using a very hackish solution: a Confluence crawler fetches the pages and I create txt files from them (cleaning the HTML tags is still a work in progress). Then I call /file-upload to upload each file to Haystack.
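
For the tag-cleaning step, one possible approach is a small extractor built on Python's standard `html.parser`. This is only a sketch of an alternative, not what the crawler above actually does; the names `html_to_text` and `_TextExtractor` are hypothetical, and real Confluence markup (macros, tables) would likely need extra handling.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text and skips <script>/<style> contents."""

    _SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decode entities like &amp;
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self._SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self._SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self) -> str:
        # Join chunks with spaces and collapse runs of whitespace.
        # Note: this also puts a space between adjacent inline elements.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML fragment as a single line."""
    extractor = _TextExtractor()
    extractor.feed(html)
    extractor.close()
    return extractor.text()


if __name__ == "__main__":
    print(html_to_text("<h1>RFC 12</h1><p>Rotate &amp; revoke keys.</p>"))
```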

But I see many issues with this approach (using Haystack as a service), so I will use Haystack as a library to make customisations of the Document Store easier. I am also planning to keep a mapping of page_id -> haystack_doc_id, so that updates and deletions of pages are easy to handle. My design is still very raw; it will take some time to evolve.
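
A rough sketch of how the page_id -> haystack_doc_id mapping could drive updates and deletions. It is kept in memory here and does not call any Haystack API; the class `PageDocMap` and its methods are hypothetical names used only for illustration, and a real version would persist the mapping and perform the actual Document Store writes and deletes.

```python
from typing import Dict, Optional


class PageDocMap:
    """Tracks which document id currently represents each Confluence page."""

    def __init__(self) -> None:
        self._doc_id_by_page: Dict[str, str] = {}

    def upsert(self, page_id: str, new_doc_id: str) -> Optional[str]:
        """Record the new document id for a page.

        Returns the stale document id that should be deleted from the
        Document Store, or None if the page is new or unchanged.
        """
        stale = self._doc_id_by_page.get(page_id)
        self._doc_id_by_page[page_id] = new_doc_id
        return stale if stale != new_doc_id else None

    def remove(self, page_id: str) -> Optional[str]:
        """Forget a deleted page and return its document id, if any."""
        return self._doc_id_by_page.pop(page_id, None)


if __name__ == "__main__":
    mapping = PageDocMap()
    mapping.upsert("123", "doc-a")         # first indexing of page 123
    print(mapping.upsert("123", "doc-b"))  # page edited: old doc id to delete
    print(mapping.remove("123"))           # page deleted: doc id to delete
```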

@tanaysoni
Contributor

Hi @lalitpagaria, thank you for sharing the details. I look forward to knowing how the end-to-end pipeline works out!

For the Lucene query part, what do you think of a new endpoint in the REST API that accepts Lucene DSL & converts to the /doc-qa format like you earlier proposed? Would that work for your use-case?

@lalitpagaria
Contributor Author

lalitpagaria commented Oct 5, 2020

> For the Lucene query part, what do you think of a new endpoint in the REST API that accepts Lucene DSL & converts to the /doc-qa format like you earlier proposed? Would that work for your use-case?

Yes, it will work. For the DSL, I found elasticsearch-dsl, which is better supported and is maintained by Elastic. Can I raise a PR if that is fine?

> I look forward to knowing how the end-to-end pipeline works out!

For me, cleaning is not working well. Confluence returns data in HTML format, and I am trying to clean it via Tika, but for 50% of the docs it either fails or is not able to clean them. BTW, I found a better lib to fetch documents from Confluence/Jira, as it supports OAuth as well. Also, Atlassian is deprecating XML-RPC calls and promoting REST APIs instead.

@tholor tholor added this to the #2 milestone Oct 6, 2020
@lalitpagaria
Contributor Author

You can assign this to me. I will raise a WIP PR to get initial feedback, and then we can go from there.

@lalitpagaria
Contributor Author

Completed by #471
