Skip to content
This repository has been archived by the owner on May 21, 2024. It is now read-only.

Fixes for fetcher #1032

Merged
merged 6 commits into from
Dec 14, 2018
Merged

Fixes for fetcher #1032

merged 6 commits into from
Dec 14, 2018

Conversation

eu-siemann
Copy link
Contributor

There are several problems with pause-resume fetcher API and the corresponding fetcher test that I see at the moment.

  1. Fetcher test currently uses large_interrupted file for pause/resume. With large_interrupted server aborts after transferring half of a file (1000 bytes). This should lead to the The target's calculated hash did not match the hash in the metadata error and, after that, the upper layer may choose to retry. However, currently, the test succeeds, because while committing the first half to the db, fetcher gets pause event, which makes it retry silently after resume is called. This behavior itself is not a problem, the problem is that it is also correct for the test to fail, if pause comes after the first chunk is committed.

  2. Race condition between fetchVerifyTarget() or DownloadHandler() and setPause():

    http->download()
                                                setPause(true) 
    DownloadHandler() {
    ...
    return written_size + 1
                                                setPause(false)
    http->download() == CURLE_WRITE_ERROR && 
    pause_ == false
    throw OversizedTarget
    
  3. Using a mutex to pause the fetcher is not only non-idiomatic, but it is also an undefined behavior, as the thread which called pause may terminate and/or resume might get called from another thread. See [https://en.cppreference.com/w/cpp/thread/mutex] and [https://en.cppreference.com/w/cpp/thread/mutex/unlock]. It seems to work now, but it can easily break anytime.

  4. The downloading_ guard in fetcher is pretty useless IMO. If I understand correctly its main purpose is to return kNotDownloading, if the Download API wasn't called yet. But the downloading_ flag
    doesn't guarantee anything. The high-level Download API call is asynchronous and the execution of underlying Fetcher::fetchVerifyTarget will be performed at some point in the future on the command queue. Pause, on the other hand, is synchronous, so it's entirely possible that caller gets a kNotDownloading even after he has called Download. That means that caller should track the downloading state himself and provide some retry logic for the aforementioned case.

    So why bother with the kNotDownloading at all?
    Cleaner approach IMO would be to set pause no matter if the Download was called or not. In case Download called after a Pause it should just block until the Resume is called.

This PR attempts to solve issues 1-3, issue 4 is out of scope currently, I just want to start a discussion on it.

@codecov-io
Copy link

codecov-io commented Dec 11, 2018

Codecov Report

Merging #1032 into master will increase coverage by 0.01%.
The diff coverage is 97.77%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1032      +/-   ##
==========================================
+ Coverage   82.31%   82.33%   +0.01%     
==========================================
  Files         189      189              
  Lines       13663    13671       +8     
==========================================
+ Hits        11247    11256       +9     
+ Misses       2416     2415       -1
Impacted Files Coverage Δ
src/libaktualizr/package_manager/ostreemanager.h 83.33% <ø> (-2.39%) ⬇️
src/libaktualizr/http/httpclient.h 100% <ø> (ø) ⬆️
src/libaktualizr/uptane/fetcher.h 86.66% <100%> (+0.45%) ⬆️
src/libaktualizr/package_manager/ostreemanager.cc 54.78% <100%> (+0.08%) ⬆️
src/libaktualizr/http/httpinterface.h 100% <100%> (ø) ⬆️
src/libaktualizr/http/httpclient.cc 91.13% <100%> (ø) ⬆️
src/libaktualizr/uptane/fetcher.cc 96.93% <97.22%> (-0.24%) ⬇️
src/libaktualizr/uptane/fetcher_test.cc 96.52% <97.61%> (+0.6%) ⬆️
src/libaktualizr/primary/reportqueue_test.cc 95.94% <0%> (-1.36%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b5d105c...816c81e. Read the comment docs.

} else {
mt->pause_mutex->unlock(); // Unlock locked by try_lock() mutex.
}
mt->pause();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a comment here, to document that we expect to wait for the download to not be paused anymore?

This callback is passed through several method calls and constructors and the intent is a bit obfuscated here, IMO.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, also the name is a bit confusing, as it will not pause on every call, but I wasn't able to come up with something better. I'll add a comment.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for a comment, but even just checkPause would be a better name.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, I'll rename it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, a bit late again. But it probably wouldn't hurt to have a null check as well before making the call.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lbonn I don't know, std:function will throw in case it's empty. Do we want to proceed if pause callback isn't provided?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just having:

if (checkPause != nullptr) {
  checkPause()
}

Then, users of the class could just pass a null pointer if they don't need the pause functionality. It's only an added functionality, not a part of the core duty of the class.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, that would make sense.

Copy link
Contributor

@OYTIS OYTIS left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apart from what is already mentioned, looks good to me. I never liked that mutex.

Copy link
Collaborator

@pattivacek pattivacek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A lot of good stuff here. Thanks!

} else {
mt->pause_mutex->unlock(); // Unlock locked by try_lock() mutex.
}
mt->pause();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for a comment, but even just checkPause would be a better name.

}
} while (retry);
throw Exception("image", "Could not download file, error: " + response.error_message);
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This cleanup is definitely an improvement, but I think we are losing our retry functionality. We currently retry downloads at a high level if they fail for basically any reason. I think that's worth keeping. Originally, I think we gave it three tries before failing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I forgot to look into sotauptaneclient.cc. The retry indeed will not work. But does it work now? The code on master throws in exactly the same way.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My bad, the actual retry logic is in SotaUptaneClient::downloadImage. All is well, sorry!

src/libaktualizr/uptane/fetcher_test.cc Show resolved Hide resolved
@OYTIS OYTIS mentioned this pull request Dec 13, 2018
Eugene Smirnov added 2 commits December 13, 2018 17:01
Fetcher test currently uses `large_interrupted` file for pause/resume.
With `large_interrupted` server aborts after transferring half of a file
(1000 bytes). This should lead to the `The target's calculated hash did
not match the hash in the metadata` error and after that, the upper layer
may choose to retry.
However, currently, the test succeeds, because while committing the first
half to the db, fetcher gets `pause` event, which makes it retry
silently after `resume` is called.
This behavior itself is not a problem, the problem is that it is also
correct for the test to fail, if `pause` comes after the first chunk
is committed.
To solve this:
* Use `large_file`
* Launch download in a separate thread, so we can timeout,
  currently after 20 sec.
* Pause the fetcher after a certain percent is downloaded,
  rather than after 200 ms.

Signed-off-by: Eugene Smirnov <[email protected]>
We cannot get any progress report on a file of 2000 bytes,
as we receive all file in a single curl callback.

Signed-off-by: Eugene Smirnov <[email protected]>
@eu-siemann eu-siemann force-pushed the fix/pause-resume-race branch 2 times, most recently from 9d8ba4f to a16bb89 Compare December 13, 2018 16:15
@eu-siemann
Copy link
Contributor Author

I've rebased this to the merged sqlite changes, added couple of small fixes from my previous abandoned PR, and resolved the comments, except the one, which I'm not yet sure about.

@eu-siemann eu-siemann force-pushed the fix/pause-resume-race branch from a16bb89 to 16f331e Compare December 13, 2018 16:43
@@ -234,7 +234,7 @@ std::future<HttpResponse> HttpClient::downloadAsync(const std::string& url, curl
curlEasySetoptWrapper(curl_download, CURLOPT_TIMEOUT, 0);
curlEasySetoptWrapper(curl_download, CURLOPT_LOW_SPEED_TIME, speed_limit_time_interval_);
curlEasySetoptWrapper(curl_download, CURLOPT_LOW_SPEED_LIMIT, speed_limit_bytes_per_sec_);
curlEasySetoptWrapper(curl_download, CURLOPT_RESUME_FROM, from);
curlEasySetoptWrapper(curl_download, CURLOPT_RESUME_FROM_LARGE, from);
Copy link
Contributor

@OYTIS OYTIS Dec 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense then to change the type of from to curl_off_t?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, absolutely. Thanks!

Eugene Smirnov added 4 commits December 13, 2018 18:13
Using a mutex to pause the fetcher is not only non-idiomatic, but it is also
an undefined behavior, as the thread which called `pause` may terminate
and/or resume might get called from another thread.
See [https://en.cppreference.com/w/cpp/thread/mutex] and
[https://en.cppreference.com/w/cpp/thread/mutex/unlock].
It seems to work now, but it can easily break anytime.

Also there was a race condition between `fetchVerifyTarget()`
or `DownloadHandler()` and `setPause()`, which should be solved
with this commit:
   http->download()
                                               setPause(true)
   DownloadHandler() {
   ...
   return written_size + 1
                                               setPause(false)
   http->download() == CURLE_WRITE_ERROR &&
   pause_ == false
   throw OversizedTarget

Signed-off-by: Eugene Smirnov <[email protected]>
* Remove prints that are duplicated in hmi_stub/tests
* Don't send progress events with the same percentage.

Signed-off-by: Eugene Smirnov <[email protected]>
@eu-siemann eu-siemann force-pushed the fix/pause-resume-race branch from 16f331e to 816c81e Compare December 13, 2018 17:30
@OYTIS OYTIS merged commit a39ed7b into master Dec 14, 2018
@eu-siemann eu-siemann deleted the fix/pause-resume-race branch December 14, 2018 09:50
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants