Fix: Add visited url properly to memory store #354

StrongMonkey · 2025-01-20T21:25:45Z

The URL was not properly added to the memory store, so the scraping process could not be restored after a restart. This change also records PDF files that have already been scraped.

obot-platform/obot#536

Signed-off-by: Daishan Peng <[email protected]>

StrongMonkey · 2025-01-21T16:35:06Z

knowledge/data-sources/website/colly.go

@@ -212,13 +204,25 @@ func scrape(ctx context.Context, logOut *logrus.Logger, output *MetadataOutput,
 				}
 				linkURL = parsedLink
 			}
-			e.Request.Visit(linkURL.String())
+
+			if err := e.Request.Visit(linkURL.String()); err == nil {


Only mark a URL as visited once all the links under it have been visited. Otherwise we could miss URLs on the next restart.
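As a rough illustration of the rule described in this comment, here is a minimal sketch that does not use colly. The names `visitFunc`, `markIfComplete`, and `errVisitFailed` are hypothetical and are not taken from the PR. It assumes a page counts as "visited" only when every link found on it was visited without error.

```go
package main

import (
	"errors"
	"fmt"
)

// errVisitFailed is a hypothetical error used to simulate a failed visit.
var errVisitFailed = errors.New("visit failed")

// visitFunc is a hypothetical stand-in for the scraper's per-link visit call.
type visitFunc func(link string) error

// markIfComplete visits every link found on pageURL and records pageURL in
// visited only if all of those visits succeeded. It reports whether the page
// was marked.
func markIfComplete(pageURL string, links []string, visit visitFunc, visited map[string]string) bool {
	complete := true
	for _, link := range links {
		if err := visit(link); err != nil {
			// Keep going so the remaining links are still attempted,
			// but remember that this page is not fully processed.
			complete = false
		}
	}
	if complete {
		visited[pageURL] = ""
	}
	return complete
}

func main() {
	visited := map[string]string{}
	links := []string{"https://example.com/a", "https://example.com/b"}

	// Simulate an interrupted crawl: one link fails, so the parent page
	// stays unmarked and will be retried after a restart.
	flaky := func(link string) error {
		if link == "https://example.com/b" {
			return errVisitFailed
		}
		return nil
	}
	fmt.Println("marked after failure:", markIfComplete("https://example.com", links, flaky, visited))

	// On the retry every link succeeds, so the parent page is recorded.
	healthy := func(string) error { return nil }
	fmt.Println("marked after retry:", markIfComplete("https://example.com", links, healthy, visited))
}
```

The design choice in this sketch is that a partially processed page is never recorded, so a restart revisits it instead of silently skipping its remaining links.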

StrongMonkey · 2025-01-21T16:35:31Z

knowledge/data-sources/website/main.go

@@ -32,8 +32,7 @@ type State struct {
 }

 type WebsiteCrawlingState struct {
-	CurrentURL  string              `json:"currentURL"`
-	VisitedURLs map[string]struct{} `json:"visitedURLs"`
+	VisitedURLs map[string]string `json:"visitedURLs"`


Changed it to a map of string to string to account for the PDF file path.
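A minimal sketch of how such a state could be saved and restored follows. The `WebsiteCrawlingState` struct and its JSON tag come from the diff; the helper methods `recordVisit` and `alreadyVisited`, the example URLs, and the convention of storing an empty string for pages that have no PDF file are assumptions made for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// WebsiteCrawlingState mirrors the struct shown in the diff: each visited URL
// maps to a string value, here assumed to be the path of a downloaded PDF.
type WebsiteCrawlingState struct {
	VisitedURLs map[string]string `json:"visitedURLs"`
}

// recordVisit is a hypothetical helper that stores a visited URL together
// with the file path written for it (empty when there is no file).
func (s *WebsiteCrawlingState) recordVisit(url, filePath string) {
	if s.VisitedURLs == nil {
		s.VisitedURLs = make(map[string]string)
	}
	s.VisitedURLs[url] = filePath
}

// alreadyVisited is a hypothetical helper that reports whether a URL was
// recorded, regardless of whether a file path is stored for it.
func (s *WebsiteCrawlingState) alreadyVisited(url string) bool {
	_, ok := s.VisitedURLs[url]
	return ok
}

func main() {
	state := &WebsiteCrawlingState{}
	state.recordVisit("https://example.com/docs", "")
	state.recordVisit("https://example.com/guide.pdf", "downloads/guide.pdf")

	// Persist the state, as a crawler would before shutting down.
	data, err := json.Marshal(state)
	if err != nil {
		panic(err)
	}

	// Restore it on the next start and skip what was already scraped.
	var restored WebsiteCrawlingState
	if err := json.Unmarshal(data, &restored); err != nil {
		panic(err)
	}
	fmt.Println("PDF already scraped:", restored.alreadyVisited("https://example.com/guide.pdf"))
	fmt.Println("stored path:", restored.VisitedURLs["https://example.com/guide.pdf"])
}
```

Compared with `map[string]struct{}`, a string value lets the restored state point back at the file produced for a URL rather than only noting that the URL was seen.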

StrongMonkey · 2025-01-21T16:54:32Z

@thedadams @g-linville I changed the implementation quite a bit and would like a re-review.

Fix: Add visited url properly to memory store

77e38ee

Signed-off-by: Daishan Peng <[email protected]>

StrongMonkey requested review from thedadams, njhale, iwilltry42 and g-linville January 20, 2025 21:27

g-linville approved these changes Jan 20, 2025

thedadams approved these changes Jan 20, 2025

Address other changes

76d43da

Signed-off-by: Daishan Peng <[email protected]>

StrongMonkey commented Jan 21, 2025

StrongMonkey requested review from g-linville and thedadams January 21, 2025 16:54

g-linville approved these changes Jan 21, 2025

thedadams approved these changes Jan 21, 2025

StrongMonkey merged commit 34e3340 into obot-platform:main Jan 21, 2025
1 check passed
