Knowledge tool appears to rescrape webpages that have already been scraped in the same session #536

Closed
sheng-liang opened this issue Nov 11, 2024 · 6 comments
Labels
bug Something isn't working knowledge

Comments

@sheng-liang

sheng-liang commented Nov 11, 2024

I asked knowledge to scrape and ingest intel.com.

At some point it had scraped 547 pages and ingested 30. Then it somehow started rescraping URLs that had already been scraped earlier, only to find that the timestamp had not changed. See the attached figure. This should not happen: it should not rescrape any URL until we have scraped the whole site and want to do a sync.

image
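
To make the expected behavior concrete, here is a minimal sketch of a crawl loop that never revisits a URL within one session. It is not the knowledge tool's actual code: `crawl_once`, `fetch_links`, and the `SITE` link graph are hypothetical names used only for illustration, and a real scraper would fetch pages over HTTP instead of reading a dictionary.

```python
from collections import deque
from typing import Callable, Iterable


def crawl_once(start_url: str, fetch_links: Callable[[str], Iterable[str]]) -> list[str]:
    """Visit every reachable URL exactly once and return the visit order."""
    visited: set[str] = set()
    queue = deque([start_url])
    order: list[str] = []
    while queue:
        url = queue.popleft()
        if url in visited:
            continue  # already scraped in this session, so skip it
        visited.add(url)
        order.append(url)
        for link in fetch_links(url):
            if link not in visited:
                queue.append(link)
    return order


# Hypothetical link graph standing in for a real site (assumption, not real data).
SITE = {
    "a": ["b", "c"],
    "b": ["a", "c"],
    "c": ["a"],
}

if __name__ == "__main__":
    print(crawl_once("a", lambda url: SITE.get(url, [])))
```

The `visited` set lives only in memory here, so it would be lost if the process restarted mid-crawl.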

@cjellick
Contributor

This was on main and is likely due to the server restarting, because the otto instance tracking main upgrades on every commit. We can hopefully improve scraping to "pick up where it left off" more gracefully.

@cjellick added the bug label on Dec 16, 2024
@cjellick
Contributor

In the event the process does get restarted, perhaps we can have some checkpointing mechanism.
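
One possible shape for such a checkpoint, sketched under assumptions: the scraper periodically writes its set of scraped URLs and its pending queue to a file, and reloads them on startup. The function names, the JSON layout, and the file path are all hypothetical and are not taken from the otto codebase.

```python
import json
import os
import tempfile


def save_checkpoint(path: str, visited: set[str], pending: list[str]) -> None:
    """Write crawl state to `path` so a restarted process can resume."""
    state = {"visited": sorted(visited), "pending": pending}
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temp file in the same directory, then rename over the target,
    # so a crash mid-write does not leave a truncated checkpoint behind.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)


def load_checkpoint(path: str) -> tuple[set[str], list[str]]:
    """Return (visited, pending); both are empty if no checkpoint exists."""
    try:
        with open(path) as f:
            state = json.load(f)
    except FileNotFoundError:
        return set(), []
    return set(state["visited"]), list(state["pending"])
```

On restart, the crawl would seed its visited set and queue from `load_checkpoint` instead of starting from the root URL, so pages scraped before the restart are not fetched again.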

@StrongMonkey
Contributor

The fix is merged, but @sangee2004 I am not sure if you have a good way to kill this. Maybe by restarting the server in the middle of scraping?

@sangee2004

@StrongMonkey I was able to reproduce this issue by restarting the otto server while a website sync was in progress, which was tracked in #1154. I will test this use case.

@sangee2004

@StrongMonkey When I tried the following steps, I saw the sync blocked for a long time after the server restart (a 7-minute delay):

  1. Create an agent with a knowledge file from coral.org.
  2. While the sync is in progress, add a knowledge file from one of the synced folders.
  3. While sync and ingestion are still in progress, redeploy the otto server.

Once the otto server restarted, I noticed that file syncing stalled for a very long time in this state, about 7 minutes in my case.

Agent - https://test.acornlabs.com/admin/agents/a1bsd6m

Image

After about 7 minutes, I saw the sync proceed further.

Reopening this issue to see why it takes 7 minutes for the sync to start making progress in this case.

@sangee2004

This issue is resolved when testing with the latest build.
When the server is restarted while sync/ingestion is in progress, the sync now resumes without any significant delay.
