Add cursor to cql confluence #2775

Merged
merged 5 commits into from
Oct 14, 2024

Conversation

pablonyx
Contributor

@pablonyx pablonyx commented Oct 12, 2024

Description

Quick fix, but todos for later

  • Continue on failure
  • Fully verify CQL (base settings tested)

How Has This Been Tested?

[Describe the tests you ran to verify your changes]

Accepted Risk

[Any known risks or failure modes to point out to reviewers]

Related Issue(s)

[If applicable, link to the issue(s) this PR addresses]

Checklist:

  • All of the automated tests pass
  • All PR comments are addressed and marked resolved
  • If there are migrations, they have been rebased to latest main
  • If there are new dependencies, they are added to the requirements
  • If there are new environment variables, they are added to all of the deployment methods
  • If there are new APIs that don't require auth, they are added to PUBLIC_ENDPOINT_SPECS
  • Docker images build and basic functionalities work
  • Author has done a final read through of the PR right before merge


vercel bot commented Oct 12, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name | Status | Preview | Comments | Updated (UTC)
internal-search | ✅ Ready (Inspect) | Visit Preview | 💬 Add feedback | Oct 13, 2024 11:45pm

@pablonyx pablonyx marked this pull request as ready for review October 12, 2024 19:51

@greptile-apps greptile-apps bot left a comment

PR Summary

This pull request enhances the Confluence connector by adding cursor support for CQL queries, improving pagination handling for large datasets.

  • Modified danswer_cql method in DanswerConfluence class to accept and return cursor information
  • Updated _fetch_pages method to use cursors for efficient pagination
  • Adjusted load_from_state and poll_source methods to work with the new cursor-based approach
  • These changes aim to improve efficiency and reliability when fetching large amounts of data from Confluence
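As a rough illustration of the cursor-based approach summarized above, here is a minimal sketch. It is not the code from this PR: `extract_cursor`, `fetch_all_pages`, and the in-memory `FAKE_RESPONSES` are hypothetical names, and the sketch assumes each response exposes a `_links.next` URL carrying a `cursor` query parameter.

```python
from typing import Any, Callable
from urllib.parse import parse_qs, urlparse


def extract_cursor(next_link: str) -> str:
    """Return the `cursor` query parameter of a next link, or "" if absent."""
    query = parse_qs(urlparse(next_link).query)
    return query.get("cursor", [""])[0]


def fetch_all_pages(
    fetch_batch: Callable[[str], dict[str, Any]],
) -> list[dict[str, Any]]:
    """Call fetch_batch(cursor) until a response carries no next cursor."""
    pages: list[dict[str, Any]] = []
    cursor = ""
    while True:
        response = fetch_batch(cursor)
        pages.extend(response.get("results", []))
        # Assumption: the next link lives under `_links.next`.
        next_link = response.get("_links", {}).get("next", "")
        cursor = extract_cursor(next_link)
        if not cursor:
            break
    return pages


# Hypothetical stand-in for the search endpoint, keyed by cursor value.
FAKE_RESPONSES: dict[str, dict[str, Any]] = {
    "": {
        "results": [{"id": "1"}, {"id": "2"}],
        "_links": {"next": "/rest/api/content/search?cql=type%3Dpage&cursor=abc"},
    },
    "abc": {"results": [{"id": "3"}], "_links": {}},
}

all_pages = fetch_all_pages(lambda cursor: FAKE_RESPONSES[cursor])
page_ids = [page["id"] for page in all_pages]
```

Returning the whole response from the query method, rather than only `results`, is what lets a caller read the next cursor in a loop like this.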

1 file(s) reviewed, no comment(s)

         if include_archived_spaces:
             url_suffix += "&includeArchivedSpaces=true"
         try:
             response = self.get(url_suffix)
-            return response.get("results", [])
+            return response
         except Exception as e:
             raise e


Contributor

@hagen-danswer hagen-danswer Oct 13, 2024

Instead of doing a bunch of regex, the instructions now simply tell users not to filter with lastmodified

            )
            return {}

    def recurse_children_pages(
        self,
        start_ind: int,
        page_id: str,
    ) -> list[dict[str, Any]]:
        pages: list[dict[str, Any]] = []

pablo did a bunch of changes that seem to make the recursive indexing stuff work

-        elif self.space:
-            self.cql_query = f"type=page and space={self.space}"
+        elif space:
+            self.cql_query = f"type=page and space='{space}'"

^ this was key to fixing the issue with indexing spaces that start with a tilde
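A small sketch of the quoting fix described here, assuming a hypothetical helper `build_space_cql` (the PR builds the query inline). The guard against embedded single quotes is an added assumption for illustration and is not something the PR does.

```python
def build_space_cql(space: str) -> str:
    """Return a CQL query for pages in one space, with the space key quoted."""
    # Assumption for this sketch: reject keys that would break the quoting.
    if "'" in space:
        raise ValueError(f"Space key must not contain a single quote: {space!r}")
    # Quoting keeps keys such as "~jdoe" intact as a single string literal.
    return f"type=page and space='{space}'"


personal_space_query = build_space_cql("~jdoe")
```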

@@ -399,7 +298,6 @@ def __init__(

# Remove trailing slash from wiki_base if present
self.wiki_base = wiki_base.rstrip("/")

We remove self.space because the only place it's used outside of this function is in the metadata. See the comment there for more details

             attachment_text, unused_page_attachments = self._fetch_attachments(
-                self.confluence_client, page_id, files_in_used
+                self.confluence_client, page_id, files_in_use
             )
             unused_attachments.extend(unused_page_attachments)

             page_text += "\n" + attachment_text if attachment_text else ""
             comments_text = self._fetch_comments(self.confluence_client, page_id)
             page_text += comments_text

We use the space retrieved with each page to get the name of the space instead of the key/id because that is more semantically useful for searching
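For illustration only, a sketch of reading the space name off a page payload. `space_name_for_page` is a hypothetical helper, and the sketch assumes the page dict includes a nested `space` object with a `name` field, which is not guaranteed unless the space was requested along with the page.

```python
from typing import Any


def space_name_for_page(page: dict[str, Any]) -> str:
    """Return the human-readable space name for a page, or "" if missing."""
    # Assumption: the page was fetched with its space details included.
    return page.get("space", {}).get("name", "")


# Hypothetical page payload, trimmed to the fields used here.
example_page = {"id": "123", "space": {"key": "~jdoe", "name": "Jane Doe"}}
space_name = space_name_for_page(example_page)
```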

@@ -825,70 +735,48 @@ def _get_attachment_batch(

return doc_batch, end_ind - start_ind

def load_from_state(self) -> GenerateDocumentsOutput:
unused_attachments: list[dict[str, Any]] = []
def _handle_batch_retrieval(

Created this function to reduce repeated code between the poll and load paths of the connector
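A rough sketch of the refactor described in this comment, under stated assumptions. The method names `load_from_state`, `poll_source`, and `_handle_batch_retrieval` come from the PR, but the signatures, the `SketchConnector` class, the `updated` field, and the `BATCHES` data are all hypothetical.

```python
from collections.abc import Callable, Iterator
from typing import Any, Optional

Doc = dict[str, Any]
# A fetcher takes a cursor and returns (documents, next_cursor).
FetchBatch = Callable[[str], tuple[list[Doc], str]]


class SketchConnector:
    def __init__(self, fetch_batch: FetchBatch) -> None:
        self._fetch_batch = fetch_batch

    def _handle_batch_retrieval(
        self, start: Optional[float] = None, end: Optional[float] = None
    ) -> Iterator[list[Doc]]:
        """Shared loop: walk the cursor, optionally filtering by time window."""
        cursor = ""
        while True:
            docs, cursor = self._fetch_batch(cursor)
            if start is not None and end is not None:
                docs = [d for d in docs if start <= d["updated"] <= end]
            if docs:
                yield docs
            if not cursor:
                break

    def load_from_state(self) -> Iterator[list[Doc]]:
        # Full load: no time window.
        return self._handle_batch_retrieval()

    def poll_source(self, start: float, end: float) -> Iterator[list[Doc]]:
        # Incremental poll: same loop, restricted to [start, end].
        return self._handle_batch_retrieval(start, end)


# Hypothetical batches keyed by cursor value.
BATCHES: dict[str, tuple[list[Doc], str]] = {
    "": ([{"id": "a", "updated": 10.0}, {"id": "b", "updated": 20.0}], "next"),
    "next": ([{"id": "c", "updated": 30.0}], ""),
}

connector = SketchConnector(lambda cursor: BATCHES[cursor])
full_ids = [d["id"] for batch in connector.load_from_state() for d in batch]
polled_ids = [d["id"] for batch in connector.poll_source(15.0, 30.0) for d in batch]
```

Keeping the cursor loop in one place means the two public entry points differ only in the arguments they pass.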

@Weves Weves added this pull request to the merge queue Oct 14, 2024
Merged via the queue into main with commit a9bcc89 Oct 14, 2024
7 checks passed
@hagen-danswer hagen-danswer deleted the confluence_patch branch October 17, 2024 18:43
3 participants