Add cursor to cql confluence #2775

Merged

Weves merged 5 commits into main from confluence_patch

Oct 14, 2024

Contributor

pablonyx commented Oct 12, 2024

Description

Quick fix, but todos for later

Continue on failure
Fully verify CQL (base settings tested)

How Has This Been Tested?

[Describe the tests you ran to verify your changes]

Accepted Risk

[Any known risks or failure modes to point out to reviewers]

Related Issue(s)

[If applicable, link to the issue(s) this PR addresses]

Checklist:

All of the automated tests pass
All PR comments are addressed and marked resolved
If there are migrations, they have been rebased to latest main
If there are new dependencies, they are added to the requirements
If there are new environment variables, they are added to all of the deployment methods
If there are new APIs that don't require auth, they are added to PUBLIC_ENDPOINT_SPECS
Docker images build and basic functionalities work
Author has done a final read through of the PR right before merge


          add cursor to cql confluence

8543f86

vercel bot commented Oct 12, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name	Status	Preview	Comments	Updated (UTC)
internal-search	✅ Ready (Inspect)	Visit Preview	💬 Add feedback	Oct 13, 2024 11:45pm

pablonyx added 2 commits

October 12, 2024 12:50

46ae660

8b572cb

pablonyx marked this pull request as ready for review

October 12, 2024 19:51

greptile-apps bot reviewed

greptile-apps bot left a comment

PR Summary

This pull request enhances the Confluence connector by adding cursor support for CQL queries, improving pagination handling for large datasets.

Modified danswer_cql method in DanswerConfluence class to accept and return cursor information
Updated _fetch_pages method to use cursors for efficient pagination
Adjusted load_from_state and poll_source methods to work with the new cursor-based approach
These changes aim to improve efficiency and reliability when fetching large amounts of data from Confluence
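
To illustrate the cursor-based approach the summary describes, here is a minimal sketch, not the connector's actual code. It assumes a response shaped like `{"results": [...], "_links": {"next": "...&cursor=..."}}`. The names `extract_cursor`, `iter_pages`, and the `fetch` callable are hypothetical.

```python
from typing import Any, Callable, Iterator, Optional
from urllib.parse import parse_qs, urlparse


def extract_cursor(response: dict[str, Any]) -> Optional[str]:
    """Pull the cursor out of the "next" link, if there is one (assumed response shape)."""
    next_link = response.get("_links", {}).get("next")
    if not next_link:
        return None
    values = parse_qs(urlparse(next_link).query).get("cursor")
    return values[0] if values else None


def iter_pages(
    fetch: Callable[[Optional[str]], dict[str, Any]],
) -> Iterator[dict[str, Any]]:
    """Yield every result, following cursors until no "next" link is returned.

    `fetch` is a hypothetical callable that takes a cursor (None for the first
    request) and returns the full response dict, not only the results list.
    """
    cursor: Optional[str] = None
    while True:
        response = fetch(cursor)
        yield from response.get("results", [])
        cursor = extract_cursor(response)
        if cursor is None:
            break
```

Returning the whole response rather than only `results` is what makes the `next` link available to the caller.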

1 file(s) reviewed, no comment(s)

vercel bot deployed to Preview

October 12, 2024 19:57


          fixed space indexing issue

3f20731

vercel bot deployed to Preview

October 13, 2024 23:29

hagen-danswer reviewed

backend/danswer/connectors/confluence/connector.py

                       if include_archived_spaces:
                           url_suffix += "&includeArchivedSpaces=true"
                       try:
                           response = self.get(url_suffix)
-                          return response.get("results", [])
+                          return response
                       except Exception as e:
                           raise e

Contributor

hagen-danswer Oct 13, 2024

Instead of handling this with a bunch of regex, the instructions now just tell users not to filter with lastmodified

backend/danswer/connectors/confluence/connector.py

                           )
                           return {}
                   def recurse_children_pages(
                       self,
-                      start_ind: int,
                       page_id: str,
                   ) -> list[dict[str, Any]]:
                       pages: list[dict[str, Any]] = []

Contributor

hagen-danswer Oct 13, 2024

pablo did a bunch of changes that seem to make the recursive indexing stuff work

backend/danswer/connectors/confluence/connector.py

    
-                      elif self.space:
-                          self.cql_query = f"type=page and space={self.space}"
+                      elif space:
+                          self.cql_query = f"type=page and space='{space}'"

Contributor

hagen-danswer Oct 13, 2024

^ this was key to fixing the issue with indexing spaces that start with a tilde
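
A small sketch of why the quoting matters, assuming that personal space keys begin with `~` and that an unquoted `~` is read by CQL as its "contains" operator. The helper name `build_space_cql` and its quote check are hypothetical and not part of the connector.

```python
def build_space_cql(space: str) -> str:
    """Build a CQL query for one space, quoting the key so '~' is read literally."""
    if "'" in space:
        # Hypothetical guard: a quote in the key would break out of the literal.
        raise ValueError(f"unexpected quote in space key: {space!r}")
    return f"type=page and space='{space}'"


print(build_space_cql("~jdoe"))  # type=page and space='~jdoe'
```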

backend/danswer/connectors/confluence/connector.py

		@@ -399,7 +298,6 @@ def __init__(

		# Remove trailing slash from wiki_base if present
		self.wiki_base = wiki_base.rstrip("/")

Contributor

hagen-danswer Oct 13, 2024

We remove self.space because the only place it's used outside of this function is in the metadata. See the comment there for more details

backend/danswer/connectors/confluence/connector.py

                           attachment_text, unused_page_attachments = self._fetch_attachments(
-                              self.confluence_client, page_id, files_in_used
+                              self.confluence_client, page_id, files_in_use
                           )
                           unused_attachments.extend(unused_page_attachments)
                           page_text += "\n" + attachment_text if attachment_text else ""
                           comments_text = self._fetch_comments(self.confluence_client, page_id)
                           page_text += comments_text

Contributor

hagen-danswer Oct 13, 2024

We use the space retrieved with each page to get the name of the space instead of the key/id because that is more semantically useful for searching
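
As a rough sketch of that idea, assuming each page dict carries an expanded `space` object with `key` and `name` fields. The helper `space_label` is hypothetical.

```python
from typing import Any, Optional


def space_label(page: dict[str, Any]) -> Optional[str]:
    """Return the human-readable space name for a page, or None if it is absent."""
    # Assumed shape: {"id": ..., "space": {"key": "...", "name": "..."}}
    return page.get("space", {}).get("name")


page = {"id": "1", "space": {"key": "ENG", "name": "Engineering"}}
metadata = {"space": space_label(page)}  # the name, not the key
```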

backend/danswer/connectors/confluence/connector.py

@@ -825,70 +735,48 @@ def _get_attachment_batch(
                       return doc_batch, end_ind - start_ind
-                  def load_from_state(self) -> GenerateDocumentsOutput:
-                      unused_attachments: list[dict[str, Any]] = []
+                  def _handle_batch_retrieval(

Contributor

hagen-danswer Oct 13, 2024

Created this function to reduce repeated code between the connector's poll and load paths
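
One way such a shared helper could look, as a sketch only: the real `_handle_batch_retrieval` signature is not shown in this hunk, so the parameters, the `last_modified` field, and the client-side time filter below are all assumptions made for illustration.

```python
from typing import Any, Iterable, Iterator, Optional

Batch = list[dict[str, Any]]


def handle_batch_retrieval(
    pages: Iterable[dict[str, Any]],
    batch_size: int,
    start: Optional[float] = None,
    end: Optional[float] = None,
) -> Iterator[Batch]:
    """Group pages into batches, optionally keeping only those modified in [start, end]."""
    batch: Batch = []
    for page in pages:
        modified = page["last_modified"]  # hypothetical field name
        if start is not None and modified < start:
            continue
        if end is not None and modified > end:
            continue
        batch.append(page)
        if len(batch) >= batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


def load_from_state(pages: Iterable[dict[str, Any]], batch_size: int = 2) -> Iterator[Batch]:
    # Full load: no time window.
    return handle_batch_retrieval(pages, batch_size)


def poll_source(
    pages: Iterable[dict[str, Any]], start: float, end: float, batch_size: int = 2
) -> Iterator[Batch]:
    # Incremental poll: same helper, restricted to a time window.
    return handle_batch_retrieval(pages, batch_size, start, end)
```

Both entry points then differ only in the arguments they pass, which is the duplication the comment refers to.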


          fixed .get

8546bd4

vercel bot deployed to Preview

October 13, 2024 23:45

Weves approved these changes

Weves added this pull request to the merge queue

Merged via the queue into main with commit a9bcc89

7 checks passed

hagen-danswer deleted the confluence_patch branch

October 17, 2024 18:43

This was referenced Nov 27, 2024

chore/merge upstream 2024112601 mindvalley/danswer#79

Merged

chore/merge upstream 2024112801 mindvalley/danswer#81

Merged
