Fix: Add visited url properly to memory store #354
Signed-off-by: Daishan Peng <[email protected]>
@@ -212,13 +204,25 @@ func scrape(ctx context.Context, logOut *logrus.Logger, output *MetadataOutput,
 }
 linkURL = parsedLink
 }
-e.Request.Visit(linkURL.String())
+if err := e.Request.Visit(linkURL.String()); err == nil {
Only mark a URL as visited once all the links under that URL have been visited. Otherwise we could miss URLs on the next restart.
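The ordering described in this comment can be sketched without the real scraper. The following is a minimal illustration only: the `crawler` type, its fields, and the in-memory link graph are hypothetical stand-ins and not the colly API or this PR's actual code. A page is recorded as visited only after every link under it has been visited without error, so an interrupted crawl leaves the parent unrecorded and it is retried on the next run.

```go
package main

import (
	"errors"
	"fmt"
)

// crawler is a hypothetical stand-in for the scraper (not the colly API).
type crawler struct {
	links   map[string][]string // page -> outgoing links (assumed in-memory graph)
	failing map[string]bool     // pages whose fetch currently fails
	visited map[string]struct{} // pages whose whole subtree has been processed
	active  map[string]bool     // pages currently in progress, to break link cycles
}

// visit marks url as visited only after all links under it were visited.
func (c *crawler) visit(url string) error {
	if _, done := c.visited[url]; done {
		return nil
	}
	if c.active[url] {
		return nil // already being processed further up the call stack
	}
	if c.failing[url] {
		return errors.New("fetch failed: " + url)
	}
	c.active[url] = true
	defer delete(c.active, url)

	for _, child := range c.links[url] {
		if err := c.visit(child); err != nil {
			// Do not record url: a restart must come back to it.
			return fmt.Errorf("visiting %s: %w", url, err)
		}
	}
	c.visited[url] = struct{}{}
	return nil
}

func main() {
	c := &crawler{
		links:   map[string][]string{"/": {"/a", "/b"}},
		failing: map[string]bool{"/b": true},
		visited: map[string]struct{}{},
		active:  map[string]bool{},
	}
	fmt.Println("first run error:", c.visit("/"))
	_, rootDone := c.visited["/"]
	fmt.Println("root recorded after failed run:", rootDone)

	delete(c.failing, "/b") // simulate a restart where /b is reachable again
	fmt.Println("second run error:", c.visit("/"))
	fmt.Println("pages recorded:", len(c.visited))
}
```

If the parent were recorded before its links were followed, the second run would skip `/` entirely and never reach `/b`.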
@@ -32,8 +32,7 @@ type State struct {
 }

 type WebsiteCrawlingState struct {
 CurrentURL string `json:"currentURL"`
-VisitedURLs map[string]struct{} `json:"visitedURLs"`
+VisitedURLs map[string]string `json:"visitedURLs"`
Changed it to a map from string to string to account for the PDF file path.
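As a rough illustration of why the value type matters, the sketch below round-trips the state through JSON. The struct and its tags are taken from the diff; the `saveState` and `loadState` helpers, the example URLs, the file path, and the convention that a non-PDF page stores an empty string are all assumptions made for this example, not something the PR confirms.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// WebsiteCrawlingState mirrors the struct in the diff above.
type WebsiteCrawlingState struct {
	CurrentURL  string            `json:"currentURL"`
	VisitedURLs map[string]string `json:"visitedURLs"`
}

// saveState is a hypothetical helper that serializes the state.
func saveState(s WebsiteCrawlingState) ([]byte, error) {
	return json.Marshal(s)
}

// loadState is a hypothetical helper that restores the state and
// guarantees a non-nil map so callers can write to it directly.
func loadState(data []byte) (WebsiteCrawlingState, error) {
	var s WebsiteCrawlingState
	if err := json.Unmarshal(data, &s); err != nil {
		return s, err
	}
	if s.VisitedURLs == nil {
		s.VisitedURLs = map[string]string{}
	}
	return s, nil
}

func main() {
	state := WebsiteCrawlingState{
		CurrentURL: "https://example.com/docs",
		VisitedURLs: map[string]string{
			// Assumed convention: empty value for an ordinary page.
			"https://example.com/docs": "",
			// Assumed convention: value is where the PDF was written.
			"https://example.com/guide.pdf": "output/guide.pdf",
		},
	}

	data, err := saveState(state)
	if err != nil {
		panic(err)
	}
	restored, err := loadState(data)
	if err != nil {
		panic(err)
	}

	// After a restart, both "was this URL handled?" and
	// "which file did it produce?" can be answered from the map.
	path, seen := restored.VisitedURLs["https://example.com/guide.pdf"]
	fmt.Println(seen, path)
}
```

With `map[string]struct{}` only membership survives a restart; a string value leaves room to carry the scraped file's location alongside it.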
@thedadams @g-linville I changed the implementation quite a bit and would like a re-review.
The URL was not properly added to the memory store that is used to restore the scraping process. This change also records the PDF files that have been scraped.
obot-platform/obot#536