-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add cursor to cql confluence #2775
Conversation
The latest updates on your projects. Learn more about Vercel for Git ↗︎
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PR Summary
This pull request enhances the Confluence connector by adding cursor support for CQL queries, improving pagination handling for large datasets.
- Modified
danswer_cql
method inDanswerConfluence
class to accept and return cursor information - Updated
_fetch_pages
method to use cursors for efficient pagination - Adjusted
load_from_state
andpoll_source
methods to work with the new cursor-based approach - These changes aim to improve efficiency and reliability when fetching large amounts of data from Confluence
1 file(s) reviewed, no comment(s)
Edit PR Review Bot Settings | Greptile
if include_archived_spaces: | ||
url_suffix += "&includeArchivedSpaces=true" | ||
try: | ||
response = self.get(url_suffix) | ||
return response.get("results", []) | ||
return response | ||
except Exception as e: | ||
raise e | ||
|
||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of doing a bunch of regex, this is now just explained in the instructions not to filter with lastmodified
) | ||
return {} | ||
|
||
def recurse_children_pages( | ||
self, | ||
start_ind: int, | ||
page_id: str, | ||
) -> list[dict[str, Any]]: | ||
pages: list[dict[str, Any]] = [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pablo did a bunch of changes that seem to make the recursive indexing stuff work
elif self.space: | ||
self.cql_query = f"type=page and space={self.space}" | ||
elif space: | ||
self.cql_query = f"type=page and space='{space}'" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
^ this was key to fixing the issue with indexing spaces that start with a tilde
@@ -399,7 +298,6 @@ def __init__( | |||
|
|||
# Remove trailing slash from wiki_base if present | |||
self.wiki_base = wiki_base.rstrip("/") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we remove self.space because the only place its used outside of this function is in the metadata. See the comment therer for more details
attachment_text, unused_page_attachments = self._fetch_attachments( | ||
self.confluence_client, page_id, files_in_used | ||
self.confluence_client, page_id, files_in_use | ||
) | ||
unused_attachments.extend(unused_page_attachments) | ||
|
||
page_text += "\n" + attachment_text if attachment_text else "" | ||
comments_text = self._fetch_comments(self.confluence_client, page_id) | ||
page_text += comments_text |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We use the space retrieved with each page to get the name of the space instead of the key/id because that is more semantically useful for searching
@@ -825,70 +735,48 @@ def _get_attachment_batch( | |||
|
|||
return doc_batch, end_ind - start_ind | |||
|
|||
def load_from_state(self) -> GenerateDocumentsOutput: | |||
unused_attachments: list[dict[str, Any]] = [] | |||
def _handle_batch_retrieval( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
create this function to reduce repeat code from the poll/load connector
Description
Quick fix, but todos for later
How Has This Been Tested?
[Describe the tests you ran to verify your changes]
Accepted Risk
[Any know risks or failure modes to point out to reviewers]
Related Issue(s)
[If applicable, link to the issue(s) this PR addresses]
Checklist: