- Add the ability to switch the crawling engine: headless Chrome or Node.js requests
- Migrate to ES6 or TypeScript and modern tooling
-- Should be possible to run crawler end-to-end tests - OK
-- Should be possible to run crawler unit tests, both on the console and in the browser - OK
-- Should be possible to run crawler examples - OK
-- Should be possible to publish an NPM package and install it - OK
-- Should be possible to lint TypeScript code - OK
-- Refactor the crawler, extract more classes, add types
--- Extract separate tests for Executor and Response - OK
-- Use Mocha and Chai instead of Jasmine for unit tests? Use headless Chrome or Firefox when running tests.
-- Test a failing request: verify that the number of concurrent requests is decreased as expected
-- Replace PhantomJS with headless Chrome when running unit tests
-- Fix linting errors
-- Source map support
-- Line coverage
- Add more e2e tests, switch from jasmine-node to a standard Node.js test runner?
- Add support for other request engines such as headless Chrome, make the engine configurable (see the request engine sketch at the end of this file)
- Provide an additional reactive-style API for the crawler (RxJS); the urls being crawled naturally form a stream of data (see the RxJS sketch at the end of this file)?
- Enable debugging: when the crawler is passed the debug: true option, log information about crawled urls, content, etc.?
- It should be possible to limit the number of requests not only by the number of requests per second but also
  by the number of concurrent active requests, e.g. maxConcurrentRequests: 5 (see the concurrency sketch at the end of this file).
  This should provide yet another sensible approach to limiting the bandwidth being used.
- Add the ability to avoid crawling non-text urls such as audio and video files, which can take a lot of bandwidth (see the content-type sketch at the end of this file)
- In case there is an error when crawling a url, the url can be queued again and only fully abandoned after another 2 repeated failures (see the retry sketch at the end of this file)
- Limit the section of the page that should be crawled (https://github.com/antivanov/js-crawler/issues/15)
- Normalize urls, so that https://github.com and https://github.com/ are considered to be the same url (see the normalization sketch at the end of this file)
- Refactor the crawler, add unit tests and the build infrastructure
- Add end-to-end tests for the following cases:
-- Redirects
-- Binary content
-- Several pages referencing each other (tree-like structure)
-- Handle cycles: page 1 references page 2 and page 2 references page 1
- Review and test more thoroughly the API for forgetting crawled urls
- Provide a more intuitive API for crawling several urls
- Move sources under the src directory, then bundle during the build
- Code coverage
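
Request engine sketch: a rough idea of what a configurable request engine could look like. The RequestEngine interface, the FetchEngine class and the engine option are assumptions for illustration, not part of the current crawler.

```typescript
// A minimal sketch of a pluggable request engine; names are hypothetical.
interface CrawlResponse {
  url: string;
  status: number;
  body: string;
}

interface RequestEngine {
  get(url: string): Promise<CrawlResponse>;
}

// Plain Node.js engine based on the built-in fetch (Node 18+).
class FetchEngine implements RequestEngine {
  async get(url: string): Promise<CrawlResponse> {
    const response = await fetch(url);
    return { url, status: response.status, body: await response.text() };
  }
}

// A headless Chrome engine could implement the same interface with puppeteer
// (page.goto(url) followed by page.content()), and the crawler could then be
// configured with something like configure({ engine: new FetchEngine() }).
```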
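
RxJS sketch: a rough idea of a reactive wrapper. It assumes the existing callback-based crawl({url, success, failure, finished}) API and simply adapts it to an RxJS Observable; crawl$ and CrawledPage are hypothetical names.

```typescript
import { Observable } from 'rxjs';
// js-crawler ships no typings, so this import is illustrative only.
import Crawler from 'js-crawler';

interface CrawledPage {
  url: string;
  status: number;
  content: string;
}

function crawl$(startUrl: string, depth: number): Observable<CrawledPage> {
  return new Observable<CrawledPage>(subscriber => {
    new Crawler().configure({ depth }).crawl({
      url: startUrl,
      success: (page: CrawledPage) => subscriber.next(page), // each crawled url is an event
      failure: (page: CrawledPage) => subscriber.next(page), // failed urls are events too
      finished: () => subscriber.complete()                  // the whole crawl is done
    });
  });
}

// Usage: the stream of pages can be filtered, throttled, merged with other streams, etc.
crawl$('https://github.com', 2).subscribe(page => console.log(page.status, page.url));
```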
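
Concurrency sketch: a rough idea of limiting the number of concurrent active requests in addition to the requests-per-second limit. ConcurrencyLimiter is a hypothetical helper showing the idea behind a maxConcurrentRequests option, not the actual implementation.

```typescript
type Task<T> = () => Promise<T>;

class ConcurrencyLimiter {
  private active = 0;
  private waiting: Array<() => void> = [];

  constructor(private maxConcurrentRequests: number) {}

  async run<T>(task: Task<T>): Promise<T> {
    while (this.active >= this.maxConcurrentRequests) {
      // Wait until one of the active requests finishes and frees a slot.
      await new Promise<void>(resolve => this.waiting.push(resolve));
    }
    this.active++;
    try {
      return await task();
    } finally {
      this.active--;
      this.waiting.shift()?.(); // wake up the next queued request, if any
    }
  }
}

// Usage: at most 5 urls are fetched at the same time (Node 18+ for built-in fetch).
const limiter = new ConcurrencyLimiter(5);
['https://github.com', 'https://github.com/antivanov/js-crawler']
  .forEach(url => limiter.run(() => fetch(url)));
```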
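
Content-type sketch: a rough idea of skipping non-text urls before downloading them. The extension list and the HEAD-based Content-Type check are illustrative assumptions.

```typescript
const BINARY_EXTENSIONS = ['.mp3', '.mp4', '.avi', '.mkv', '.wav', '.flac', '.mov'];

function looksLikeBinaryUrl(url: string): boolean {
  const path = new URL(url).pathname.toLowerCase();
  return BINARY_EXTENSIONS.some(extension => path.endsWith(extension));
}

// For urls without a telltale extension, a HEAD request can check the Content-Type
// without downloading the body (Node 18+ for built-in fetch).
async function isTextUrl(url: string): Promise<boolean> {
  if (looksLikeBinaryUrl(url)) {
    return false;
  }
  const response = await fetch(url, { method: 'HEAD' });
  const contentType = response.headers.get('content-type') ?? '';
  return contentType.startsWith('text/') || contentType.includes('json');
}
```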
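
Retry sketch: a rough idea of re-queueing failed urls. The queue shape and the MAX_ATTEMPTS name are assumptions, chosen so that a url is abandoned only after the initial failure plus 2 repeated failures.

```typescript
const MAX_ATTEMPTS = 3;

interface QueuedUrl {
  url: string;
  attempts: number;
}

function onCrawlFailure(entry: QueuedUrl, queue: QueuedUrl[]): void {
  if (entry.attempts + 1 < MAX_ATTEMPTS) {
    // Put the url back at the end of the queue to try again later.
    queue.push({ url: entry.url, attempts: entry.attempts + 1 });
  } else {
    console.log(`Giving up on ${entry.url} after ${MAX_ATTEMPTS} failed attempts`);
  }
}
```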
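
Normalization sketch: a rough idea of url normalization so that variants like https://github.com and https://github.com/ map to the same url; the exact rules are an assumption.

```typescript
function normalizeUrl(rawUrl: string): string {
  const url = new URL(rawUrl);
  url.hash = ''; // fragments do not change the fetched page
  // Drop a single trailing slash on non-root paths: /docs/ -> /docs
  if (url.pathname.length > 1 && url.pathname.endsWith('/')) {
    url.pathname = url.pathname.slice(0, -1);
  }
  return url.toString();
}

// normalizeUrl('https://github.com') === normalizeUrl('https://github.com/')
```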