Knowledge tool appears to rescrape webpages that have already been scraped in the same session #536

sheng-liang · 2024-11-11T20:48:23Z

I asked knowledge to scrape and ingest intel.com.

At some point it had scraped 547 pages and ingested 30. It then started rescraping URLs that had already been scraped earlier, only to find that their timestamps had not changed. See the attached figure. This should not happen: no URL should be rescraped until the whole site has been scraped and we want to do a sync.
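To illustrate the expected behavior, here is a minimal sketch of a crawl loop that fetches each URL at most once per session. It is not the knowledge tool's actual code; `crawl` and `fetch_links` are hypothetical names used only for this example.

```python
from collections import deque
from urllib.parse import urldefrag


def crawl(start_url, fetch_links):
    """Breadth-first crawl that fetches each URL at most once per session.

    fetch_links is a hypothetical callable that takes a URL and returns
    the list of links found on that page.
    """
    visited = set()
    queue = deque([start_url])
    order = []
    while queue:
        # Drop any "#fragment" so the same page is not treated as a new URL.
        url, _fragment = urldefrag(queue.popleft())
        if url in visited:
            continue
        # Mark the URL before fetching so re-discovered links are skipped.
        visited.add(url)
        order.append(url)
        for link in fetch_links(url):
            queue.append(link)
    return order
```

With a visited set like this, a page that is linked from many other pages is still fetched only once until a later sync resets the set.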

cjellick · 2024-11-12T00:26:17Z

This was on main and was likely caused by the server restarting, since otto on main upgrades on every commit. We can hopefully improve scraping to "pick up where it left off" more gracefully.

cjellick · 2024-12-27T22:19:01Z

In the event the process does get restarted, perhaps we can have some checkpointing mechanism.
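One possible shape for such a checkpoint, as a rough sketch only: persist the set of visited URLs to disk after each page and reload it on startup. The `CrawlCheckpoint` class, its method names, and the JSON file layout are assumptions made for illustration, not the mechanism that was eventually merged.

```python
import json
import os
import tempfile


class CrawlCheckpoint:
    """Hypothetical on-disk record of URLs already scraped in a sync."""

    def __init__(self, path):
        self.path = path
        self.visited = set()
        # On restart, reload whatever the previous process recorded.
        if os.path.exists(path):
            with open(path, encoding="utf-8") as f:
                self.visited = set(json.load(f))

    def mark_visited(self, url):
        self.visited.add(url)
        self._save()

    def _save(self):
        # Write to a temp file and rename it into place, so a restart in
        # the middle of a write cannot leave a truncated checkpoint.
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(sorted(self.visited), f)
        os.replace(tmp_path, self.path)
```

A restarted scraper would construct `CrawlCheckpoint` with the same path and skip any URL already in `visited`. Saving after every page is simple but write-heavy; batching the saves would be a reasonable trade-off for large sites.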

StrongMonkey · 2025-01-10T23:35:15Z

The fix is merged, but @sangee2004 I am not sure if you have a good way to kill this. Maybe by restarting the server in the middle of scraping?

sangee2004 · 2025-01-11T01:16:46Z

@StrongMonkey I was able to reproduce this issue by restarting the otto server while a website sync was in progress, which was tracked in #1154. I will test this use case.

sangee2004 · 2025-01-13T21:34:40Z

@StrongMonkey When I tried the following steps, I saw the sync get blocked for a long time after the server restart (a 7-minute delay):

1. Create an agent with a knowledge file from coral.org.
2. While the sync is in progress, add a knowledge file from one of the synced folders.
3. While sync and ingestion are still in progress, redeploy the otto server.
4. Once the otto server has restarted, notice that syncing of files is stalled for a very long time (about 7 minutes in my case).

Agent - https://test.acornlabs.com/admin/agents/a1bsd6m

After about 7 minutes, I see the sync proceed further.

Reopening this issue to see why it takes 7 minutes for syncing to start making progress in this case.

sangee2004 · 2025-01-22T22:29:26Z

This issue is addressed when testing with the latest build.
There is no long delay before the sync resumes when the server is restarted while sync/ingestion is in progress.

cjellick assigned StrongMonkey Nov 12, 2024

cjellick added the knowledge label Dec 4, 2024

cjellick added the bug label Dec 16, 2024

cjellick unassigned StrongMonkey Dec 26, 2024

cjellick mentioned this issue Jan 9, 2025

Knowledge- Restarting otto server when website sync is in progress results in all the synced files to get resynced. #1154

Closed

StrongMonkey mentioned this issue Jan 20, 2025

Fix: Add visited url properly to memory store obot-platform/tools#354

Merged

sangee2004 closed this as completed Jan 22, 2025
