Knowledge tool appears to rescrape webpages that have already been scraped in the same session #536
Comments
This was in main and was likely due to the server restarting, because otto on main upgrades on every commit. We can hopefully improve scraping to "pick up where it left off" more gracefully.
In the event the process does get restarted, perhaps we can have some checkpointing mechanism.
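The checkpointing idea could look roughly like the sketch below. This is not otto's actual implementation; the function names `save_checkpoint` and `load_checkpoint`, the JSON file layout, and the split into `visited` and `frontier` are all assumptions made for illustration. The point is only that crawl state is written atomically, so a restart mid-scrape can resume instead of starting over.

```python
import json
import os
import tempfile
from collections import deque


def save_checkpoint(path, visited, frontier):
    """Persist crawl state (hypothetical layout) so a restart can resume."""
    state = {"visited": sorted(visited), "frontier": list(frontier)}
    dir_name = os.path.dirname(os.path.abspath(path))
    # Write to a temp file in the same directory, then rename over the
    # target, so a crash mid-write never leaves a truncated checkpoint.
    fd, tmp_path = tempfile.mkstemp(dir=dir_name, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return (visited, frontier); both empty if no checkpoint exists."""
    if not os.path.exists(path):
        return set(), deque()
    with open(path) as f:
        state = json.load(f)
    return set(state["visited"]), deque(state["frontier"])
```

On startup the scraper would call `load_checkpoint` and continue from the saved frontier; during the crawl it would call `save_checkpoint` every N pages, with N chosen to balance write overhead against the amount of work lost on a restart.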
The fix is merged, but @sangee2004 I am not sure if you have a good way to kill this. Maybe restarting the server in the middle of scraping?
@StrongMonkey I was able to reproduce this issue by restarting the otto server while a website sync was in progress, which was tracked in #1154. I will test this use case.
@StrongMonkey When I tried the following steps, I saw the sync blocked for a long time after the server restart (a 7-minute delay):
While sync and ingestion are still in progress, redeploy the otto server. Once the otto server has restarted, syncing of files stalls for a very long time, about 7 minutes in my case. Agent - https://test.acornlabs.com/admin/agents/a1bsd6m
After about 7 minutes, the sync proceeds further. Reopening this issue to see why it takes 7 minutes for syncing to start making progress in this case.
This issue is addressed when testing with the latest build.
I asked knowledge to scrape and ingest intel.com.
At some point it had scraped 547 pages and ingested 30 pages. Then it somehow started rescraping URLs that had already been scraped earlier, only to find that the timestamp had not changed. See the attached figure. This should not happen. It should not rescrape any URL until we have scraped the whole site and want to do a sync.
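The expected behavior can be illustrated with a minimal sketch: a breadth-first crawl that tracks visited URLs for the session and never fetches the same URL twice. This is not the knowledge tool's actual code; `crawl`, `fetch_links`, and `max_pages` are hypothetical names, and `fetch_links` stands in for whatever downloads a page and extracts its links.

```python
from collections import deque


def crawl(start_url, fetch_links, max_pages=1000):
    """Visit each URL at most once per session; return the fetch order."""
    visited = set()
    frontier = deque([start_url])
    order = []
    while frontier and len(order) < max_pages:
        url = frontier.popleft()
        if url in visited:
            # Already scraped in this session: skip instead of refetching.
            continue
        visited.add(url)
        order.append(url)
        for link in fetch_links(url):
            if link not in visited:
                frontier.append(link)
    return order
```

With this shape, a page that is linked from many other pages can be queued more than once, but it is fetched only the first time it is dequeued. Rechecking timestamps would belong to a separate sync pass that runs after the whole site has been scraped.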