Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

knownUrls processing logic is incorrectly using underscore #50

Open
Nysosis opened this issue Jan 2, 2018 · 3 comments
Open

knownUrls processing logic is incorrectly using underscore #50

Nysosis opened this issue Jan 2, 2018 · 3 comments
Assignees

Comments

@Nysosis
Copy link

Nysosis commented Jan 2, 2018

crawler.js#L178 and crawler.js#L197 are using underscore to determine whether the provided url is a known url.

This is incorrect, as underscore doesn't handle objects on it's contains method, as such it's always returning false. It should be using the same form as crawler.js#L203.

Changing it breaks a unit test though, and I'm not sure how it's going to impact the flow as expected

@amoilanen
Copy link
Owner

amoilanen commented Jan 10, 2018

Thanks for reviewing the code. Looks like at least in the second case the check always returns false.

  1. In crawler.js#L178 currentUrlsToCrawl is an array and Underscore's contains method should work as expected in this case http://underscorejs.org/#contains

  2. For the second case crawler.js#L197

_.contains(self.knownUrls, url)

should most likely be changed to

_.contains(_.keys(self.knownUrls), url)

as self.knownUrls is an object and for it _.contains will always return false.

Will check this more and fix this/update the tests.

@amoilanen amoilanen added the bug label Jan 10, 2018
@amoilanen amoilanen self-assigned this Jan 10, 2018
@Nysosis
Copy link
Author

Nysosis commented Jan 10, 2018

For the first case, you've got two ||'d checks in the line, the first currentUrlsToCrawl, like you say, is an array and will correctly work with underscore.contains, the second is the knownUrls, so will always return false.

I can't see any reason to keep the knownUrls as an object, it looks like it would all work correct if you switched it over to being an array, and just fixed up the instances where it's being treated like an object, #L203 for example, to use _.contains like currentUrlsToCrawl

As an aside, I wasn't sure if this was still an actively developed repo, and wanted to make some other changes to it, I've forked the repo, and done a load of work to it, as I wanted to integrate it with headless chrome (as mentioned in #49). I pulled out request from being an explict thing, and created two separate 'pluginable' engines, one for request and one for chromium (via puppeteer) (none of this is up on GitHub yet, just on my machine). Which I'm happy to look at merging/passing over, however, I have also converted the whole thing to typescript (which is what highlighted this issue to me in the first place), and I don't know your feelings on whether you'd be happy to make that big a change. Also some slight changes as a result of pulling out request, so there's some breaking changes for the API.

@amoilanen
Copy link
Owner

amoilanen commented Jan 14, 2018

Yes, you are right, will check both cases. Most likely we should switch knownUrls to be an array and make sure the unit tests pass.

Thank you for pointing out the problematic places and finding these issues.

As an aside, I wasn't sure if this was still an actively developed repo, and wanted to make some other changes to it, I've forked the repo, and done a load of work to it, as I wanted to integrate it with headless chrome (as mentioned in #49). I pulled out request from being an explict thing, and created two separate 'pluginable' engines, one for request and one for chromium (via puppeteer) (none of this is up on GitHub yet, just on my machine). Which I'm happy to look at merging/passing over, however, I have also converted the whole thing to typescript (which is what highlighted this issue to me in the first place), and I don't know your feelings on whether you'd be happy to make that big a change.

Not sure about switching to Typescript and adding support for headless Chromium with breaking API changes, these are quite large changes, but they can probably be done at least in a separate experimental branch.

I thought about developing the crawler further and supporting other request engines/re-factoring the implementation. But then it might be a good idea to avoid huge potentially destabilizing changes in the 'master' branch and making sure the implementation is well tested and there are unit and end-to-end tests. Once the new implementation is stable enough/is tested it can be released as a new major version.

From my side not sure how much time it is possible to dedicate to the development of the crawler, this whole process and releasing a new version can take at least a few months. I will try to find some time to do this.

If you need something faster and want to add more experimental features/do more changes, then probably it is a good idea to fork the repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants