Extend repitition test to span across two files #349

MichaIng · 2021-09-30T11:52:34Z

A first step to address: #348

MichaIng · 2021-09-30T12:04:31Z

Strange, the test succeeds:

test cli::test_repetition ... ok

with contradicts my tests where in this case the URLs were checked and counted once per file. Did I do the syntax right to add the two input files?

Or the issue was fixed in the meantime between my tests and this commit? 🤔
I'll repeat my tests now with lychee compiled from latest master.

lychee-bin/tests/cli.rs

MichaIng · 2021-09-30T13:27:03Z

Okay, "great", no it fails as expected 🙂.

MichaIng · 2021-09-30T13:40:28Z

Here is the part which should filter out cached links. Mandatory for this to work of course is that there is only a single Collector instance:

// Filter out already cached links (duplicates)
links.retain(|l| !self.cache.contains(&l.uri));

mre · 2021-09-30T13:46:52Z

I wonder why we even have that line at all. Given that links is a HashSet<Request>, it should just work if we update it as we do in the line after that.

MichaIng · 2021-09-30T13:57:03Z

Hmm, you mean identical URIs then have an identical hash (=key) so that duplicates shouldn't be possible? I also see only a single let links = Collector::new( Collector instance (outside of examples and tests) in main.rs > async fn run(, so at least there is no obvious reason why multiple instances could exist.

Ah wait, link checks start already (async) after one file has been processed, right? Is it possible that a link is checked before the duplicate has been parsed, and when being parsed a second time the cached result is overridden or so?

MichaIng · 2021-09-30T16:32:33Z

In test_glob_ignore_case it also needs to expect only a single check, I guess, same as in test_glob?
As an alternative, probably there is a way to have different URIs for the mocked servers? Not sure what makes most sense there, whether we want to include testing the URI cache or explicitly want to test whether two different URIs/servers are checked independently.

Otherwise, awesome, the cache now seems to work 👍. Probably links.retain can also be skipped if links.extend + update_cache excludes duplicates already?
Or probably it can be made a little more efficient, to skip/remove duplicates only from new_links instead of checking this of all links on every iteration for the handle loop?

MichaIng · 2021-09-30T18:57:37Z

Failing tests:

test_collect_links
test_require_https

MichaIng · 2021-09-30T19:14:43Z

Okay, I think something has gone wrong. test_collect_links shows that finally only a single URL was returned in responses while there are definitely 5 different URIs added. Probably the test_glob* tests should still show two checked URIs (and hence the two mocked servers have different URIs) and only one is checked for a similar reason than in test_collect_links.

In test_require_https the URL needs to stay plain HTTP as is is explicitly about testing the --require-https option. I'll revert this.

@mre
Okay I think I found the reason for the lost URIs:

links.retain(|l| !self.cache.contains(&l.uri)); now removes the URIs which were just added in the previous loop iteration, as they are now synced to the cache immediately. So only the last URI remains.
Instead of removing everything from links which is in the cache, we need to do this for new_links only. So in every iteration of the loop, only URIs which are not in the cache yet are added to the function-wide links, but nothing which was added to links already must be removed.
Also instead of adding all links to the cache on every loop iteration, we could do this with new_links only, which should reduce overhead a little. I'm just not sure how to adjust update_cache argument and handling when passing new_links.
Now it is also clear why the additional retain is still required. links and cache are both hash sets, so they cannot contain duplicates naturally. But we want to avoid that collect_links > client instances check the same links each. So links itself cannot contain duplicates, but it also must not contain any URI which was added by another collect_links instance to the cache before. Did I understand this right? 😅

lychee-lib/src/client.rs

MichaIng · 2021-10-01T12:53:25Z

I tried to fix the issue that each loop clears links. I hope links and new_links are the same variable type, else it would fail and needs fix by someone who is able to fix e.g. update_cache argument definitions accordingly.
EDIT: Okay does not work as new_links is no HashSet. Shall we make it one or shall we update the methods accordindly?

The whole issue with the cache, why it didn't work as expected, seems to be due to the race condition between multiple collect_links instances when checking their links against the cache and then updating the cache afterwards. Updating the cache directly within the loop and directly after already cached URLs have been removed from new links, minimises this race condition, but does not fully eliminate it. In theory the new links would need to be "locked" in the cache first before being filtered, so that in the meantime a second instance cannot add any of those to the cache. But I have no good idea how achieve this best.
EDIT: Hmm, actually reviewing the previous code, I am not 100% sure anymore whether the race condition really is the issue. I mean links was filtered against cache directly before they were added to the cache. I cannot imagine that there is a notable amount of time between those two steps so that a race condition is relevant? But I have no other explanation why it didn't/doesn't work either. We'll see when having syntax of current state fixed, so that links are filtered and cache updated with a smaller amount of links, potentially reducing the time between both steps.

mre · 2021-10-02T00:10:34Z

Adding the cache to the collector was a mistake. The collector should have a single responsibility: collecting links. If anything then we should have a cache for requests. There it would be a simple lookup if a URL was already requested with the added benefit of tracking the number of duplicate links in the future. In conclusion I'd vote for removing the cache entirely for now and add it back later. 😉

MichaIng · 2021-10-02T13:08:23Z

Just for my understanding:

Based on max concurrency (and number of links, of course), there is a number of concurrent collect_links instances, which return links, so I guess there is the same number of client side request handlers attached to those?
One input cannot be handled by multiple collect instances but basically the inputs are sorted into multiple collect instances?

So there needs to be one global cache, links or requests pool, and it shouldn't matter which one it is, as long as duplicates across all concurrent instances are filtered at some point, in theory the earlier the better (less overhead). I'm not sure why it doesn't work currently (side of the currently false syntax 😉), so if you don't see an easy way to fix it the way it was intended initially, I agree to remove it for now, especially since I'm looking forward for a release with local files support 🙂.

Wouldn't it be possible that all collect instances feed a global links pool, being a hash set to avoid duplicates in the first place, and to have all request handlers then directly pick from this global pool (locking/tagging the entry first so that it is not handled a second time)? That way we don't need a separate cache to check against.
EDIT: Actually you had this idea already: #55
And would address as well: #163
Probably there is a way to store the links/URIs and their response in the same HashSet, so that it is a single pool for everything, which naturally excludes dupes on both ends.

MichaIng · 2021-10-05T13:45:37Z

Okay how shall we proceed? The cache does not work reliably, but if really concurrency/race condition is the issue, then it should at least work partly when using multiple and larger inputs, and I do not see a general issue to filter it in the collector already, at least until we find time for a better implementation. I wonder whether #330 will help 🤔. Probably we could leave things for now and rebase the two-files test extension afterwards.

If anyone knows how to fix the current state so that new_links is filtered against the cache, we could at least see whether it helps. That retain(|l| !self.cache.contains(&l.uri)) generally works has been proven by the previous CI test results, where links was "correctly" cleared as all contained links were found in the cache already: #349 (comment)

But otherwise it is trivial to remove the parts with the cache for now and revert the tests to check for filtered repetitions in a single file only.

Signed-off-by: MichaIng <[email protected]>

lychee-lib/src/collector.rs

mre · 2021-10-06T13:00:32Z

I think there is an option to just add a #[cached] derive to the client check method similar to the one we already have for the path look up. 🤔 That should solve our issues and be way more elegant imho.

MichaIng · 2021-10-06T13:19:07Z

Sounds like a more native solution 👍. Btw, shall I outsource the HTTP => HTTPS changes into an own PR?

Reqwest comes with its own request pool, so there's no need in adding another layer of indirection. This also gets rid of a lot of allocs.

Signed-off-by: MichaIng <[email protected]>

Reqwest comes with its own request pool, so there's no need in adding another layer of indirection. This also gets rid of a lot of allocs.

Signed-off-by: MichaIng <[email protected]>

lychee-bin/tests/cli.rs

mre · 2021-10-07T12:24:01Z

I've removed caching across multiple files as discussed.
Was hoping to add back the caching to client with a #[cached] macro as we do for paths, but no luck because #[cached] is not supported for methods yet. Did some refactoring around the client while I was at it, though. Namely, I removed the deadpool because it's not needed (reqwest comes with its own request pool). So while this is not strictly related to the goal of this PR, I'd still like to keep the changes, as performance is not worse in my tests (probably faster if anything) and we get rid of a dependency and quite a few allocations (~2% in my tests for the sentry docs example).

MichaIng · 2021-10-07T12:30:03Z

How is the collector now tied to the clients? I had the impression that one collector iteration was tied to one client, but when there is no client pool now but a single instance with internal threading/pool, may this even solve the problem of concurrent collector iterations?

mre · 2021-10-07T12:35:10Z

Not sure I understand. The collector and clients were always independent (?). The collector would return a HashSet of requests and those would be fed to the client pool via a channel. So there was no coupling between the two. So there shouldn't have been any race condition. Now that we removed the client pool, we feed the requests to the same client instance. Apart from that nothing changed.

Sorry, I'm pretty sure I misunderstood. 😅

mre · 2021-10-07T12:40:19Z

Apart from that, this is ready to be merged from my point of view.

MichaIng · 2021-10-07T12:47:13Z

Ah okay, was my misunderstanding then. I was just wondering where this Ok(links) of the collector is going to, as this is assured to not contain dupes. Found it now here: https://github.com/lycheeverse/lychee/blob/251332e/lychee-bin/src/main.rs#L210-L213

Actually I don't understand where the concurrency of the link collection comes from, to me it looks like there is a single Collector => collect_links => Ok(links), while only filling the links HashSet is done by concurrent handles. So when there is only a single links HashSet filled and returned, how is it possible that links are checked multiple times? Is the status of a link check stored in this HashSet as well and may links.extend(new_links?); reset this status when one of the new_links is present already?

mre · 2021-10-07T14:22:56Z

Is the status of a link check stored in this HashSet as well and may links.extend(new_links?); reset this status when one of the new_links is present already?

Yeah something like that. TBH I don't really care because the current cache impl is broken and too entangled with the collector. We'll find a better option like using a XOR filter, which is very memory-efficient and fast. But that's for another PR. 😉 Will merge this one now unless anyone has any final comments.

mre · 2021-10-07T16:08:13Z

@MichaIng I've merged the changes now. Let's build a better caching layer in the future. As always, thanks for the initial PR and the feedback during the development. 😃

mre · 2021-10-07T17:19:20Z

Meh, just tested master and it's super slow now. Looks like I misunderstood reqwest's concurrency model. It feels more like it's requesting every page sequentially. Maybe it's a simple fix without having to add back deadpool. 🤔
@MichaIng can you check?

MichaIng · 2021-10-07T19:38:45Z

Yes I can confirm, on our website, with latest master, execution time raised from ~8 seconds to 2 minutes, on the docs from 13 seconds to 13 minutes (!). That client pool definitely needs to stay until another way of concurrency is found.

mre · 2021-10-08T10:54:44Z

Created a PR and will merge it when it gets green. That should take care of it.

A while ago, caching was removed due to some issues (see #349). This is a new implementation with the following improvements: * Architecture: The new implementation is decoupled from the collector, which was a major issue in the last version. Now the collector has a single responsibility: collecting links. This also avoids race-conditions when running multiple collect_links instances, which probably was an issue before. * Performance: Uses DashMap under the hood, which was noticeably faster than Mutex<HashMap> in my tests. * Simplicity: The cache format is a CSV file with two columns: URI and status. I decided to create a new struct called CacheStatus for serialization, because trying to serialize the error kinds in Status turned out to be a bit of a nightmare and at this point I don't think it's worth the pain (and probably isn't idiomatic either). This is an optional feature. Caching only gets used if the `--cache` flag is set.

MichaIng mentioned this pull request Sep 30, 2021

Fix URI cache to cache URIs across multiple files #348

Closed

mre reviewed Sep 30, 2021

View reviewed changes

lychee-bin/tests/cli.rs Outdated Show resolved Hide resolved

mre reviewed Sep 30, 2021

View reviewed changes

lychee-lib/src/client.rs Outdated Show resolved Hide resolved

Extend repitition test to span across two files

69de82f

Signed-off-by: MichaIng <[email protected]>

MichaIng commented Oct 6, 2021

View reviewed changes

lychee-lib/src/collector.rs Outdated Show resolved Hide resolved

mre and others added 9 commits October 7, 2021 12:58

Remove cache from collector

1ec9c59

Remove client pool

f0b2417

Reqwest comes with its own request pool, so there's no need in adding another layer of indirection. This also gets rid of a lot of allocs.

Extend repitition test to span across two files

d92a263

Signed-off-by: MichaIng <[email protected]>

Remove cache from collector

1813a48

Remove client pool

e067505

Reqwest comes with its own request pool, so there's no need in adding another layer of indirection. This also gets rid of a lot of allocs.

Merge remote-tracking branch 'origin/master' into MichaIng/master

456fd3f

Merge branch 'master' of github.com:MichaIng/lychee into MichaIng/master

7519a9c

Add comment about ignored caching test

9d920d1

Satisfy Rust fmt lint

1bc7db3

Signed-off-by: MichaIng <[email protected]>

MichaIng commented Oct 7, 2021

View reviewed changes

lychee-bin/tests/cli.rs Show resolved Hide resolved

Improve error handling and documentation

9ee30cc

mre added 2 commits October 7, 2021 14:09

Merge branch 'master' of github.com:MichaIng/lychee into MichaIng/master

e6e0969

Add back test for request caching in single file

00a0558

mre merged commit 961f12e into lycheeverse:master Oct 7, 2021

mre mentioned this pull request Oct 8, 2021

Add back connection pool #355

Merged

dcroote mentioned this pull request Oct 22, 2021

429 Too Many Requests #367

Closed

mre mentioned this pull request Jan 4, 2022

Add caching functionality (v2) #443

Merged

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Extend repitition test to span across two files #349

Extend repitition test to span across two files #349

MichaIng commented Sep 30, 2021

MichaIng commented Sep 30, 2021 •

edited

Loading

MichaIng commented Sep 30, 2021

MichaIng commented Sep 30, 2021

mre commented Sep 30, 2021

MichaIng commented Sep 30, 2021 •

edited

Loading

MichaIng commented Sep 30, 2021 •

edited

Loading

MichaIng commented Sep 30, 2021

MichaIng commented Sep 30, 2021 •

edited

Loading

MichaIng commented Oct 1, 2021 •

edited

Loading

mre commented Oct 2, 2021

MichaIng commented Oct 2, 2021 •

edited

Loading

MichaIng commented Oct 5, 2021 •

edited

Loading

mre commented Oct 6, 2021

MichaIng commented Oct 6, 2021

mre commented Oct 7, 2021

MichaIng commented Oct 7, 2021

mre commented Oct 7, 2021

mre commented Oct 7, 2021

MichaIng commented Oct 7, 2021

mre commented Oct 7, 2021

mre commented Oct 7, 2021

mre commented Oct 7, 2021

MichaIng commented Oct 7, 2021

mre commented Oct 8, 2021

Extend repitition test to span across two files #349

Extend repitition test to span across two files #349

Conversation

MichaIng commented Sep 30, 2021

MichaIng commented Sep 30, 2021 • edited Loading

MichaIng commented Sep 30, 2021

MichaIng commented Sep 30, 2021

mre commented Sep 30, 2021

MichaIng commented Sep 30, 2021 • edited Loading

MichaIng commented Sep 30, 2021 • edited Loading

MichaIng commented Sep 30, 2021

MichaIng commented Sep 30, 2021 • edited Loading

MichaIng commented Oct 1, 2021 • edited Loading

mre commented Oct 2, 2021

MichaIng commented Oct 2, 2021 • edited Loading

MichaIng commented Oct 5, 2021 • edited Loading

mre commented Oct 6, 2021

MichaIng commented Oct 6, 2021

mre commented Oct 7, 2021

MichaIng commented Oct 7, 2021

mre commented Oct 7, 2021

mre commented Oct 7, 2021

MichaIng commented Oct 7, 2021

mre commented Oct 7, 2021

mre commented Oct 7, 2021

mre commented Oct 7, 2021

MichaIng commented Oct 7, 2021

mre commented Oct 8, 2021

MichaIng commented Sep 30, 2021 •

edited

Loading

MichaIng commented Sep 30, 2021 •

edited

Loading

MichaIng commented Sep 30, 2021 •

edited

Loading

MichaIng commented Sep 30, 2021 •

edited

Loading

MichaIng commented Oct 1, 2021 •

edited

Loading

MichaIng commented Oct 2, 2021 •

edited

Loading

MichaIng commented Oct 5, 2021 •

edited

Loading