Skip to content
This repository has been archived by the owner on May 21, 2024. It is now read-only.

Feat/ostree walker #1020

Merged
merged 20 commits into from
Dec 17, 2018
Merged

Feat/ostree walker #1020

merged 20 commits into from
Dec 17, 2018

Conversation

pattivacek
Copy link
Collaborator

I wrote most of this while trying to debug the missing OSTree object errors, and I figured it was worth polishing up a bit and submitting. Note that I've never successfully gotten a walk to finish, so I don't know what happens if it does. I suspect it will block indefinitely, but I'm not sure. Hopefully sooner or later I'll be able to figure that out.

@pattivacek pattivacek force-pushed the feat/ostree-walker branch 3 times, most recently from c8f7764 to 43e91da Compare December 3, 2018 08:31
@codecov-io
Copy link

codecov-io commented Dec 3, 2018

Codecov Report

Merging #1020 into master will decrease coverage by 0.14%.
The diff coverage is 77.82%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1020      +/-   ##
==========================================
- Coverage   82.48%   82.33%   -0.15%     
==========================================
  Files         195      196       +1     
  Lines       13793    13858      +65     
==========================================
+ Hits        11377    11410      +33     
- Misses       2416     2448      +32
Impacted Files Coverage Δ
src/sota_tools/ostree_object.h 100% <ø> (ø) ⬆️
src/libaktualizr/http/httpinterface.h 100% <ø> (ø) ⬆️
src/sota_tools/treehub_server_test.cc 95.23% <ø> (ø) ⬆️
src/sota_tools/ostree_repo.h 66.66% <ø> (+16.66%) ⬆️
src/libaktualizr/http/httpclient.cc 91.13% <ø> (ø) ⬆️
src/sota_tools/oauth2.cc 86.66% <ø> (ø) ⬆️
src/libaktualizr/http/httpclient.h 100% <ø> (ø) ⬆️
src/sota_tools/ostree_object_test.cc 100% <100%> (ø) ⬆️
src/sota_tools/request_pool.h 100% <100%> (ø) ⬆️
src/sota_tools/ostree_dir_repo.h 100% <100%> (ø) ⬆️
... and 15 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b9825f0...cd3acde. Read the comment docs.

@pattivacek pattivacek force-pushed the feat/ostree-walker branch 4 times, most recently from 16edef4 to 36cf5ce Compare December 7, 2018 10:09
@pattivacek
Copy link
Collaborator Author

By the way, I've solved the halting problem and optimized several parts of the code. I'm looking into more optimizations for garage-deploy, but this work is ready for review. Note that I still haven't gotten to use the garage-push --walk-tree option in full glory yet.

@lbonn
Copy link
Contributor

lbonn commented Dec 7, 2018

So, did you manage to run a full garage-check --walk-tree?

@pattivacek
Copy link
Collaborator Author

So, did you manage to run a full garage-check --walk-tree?

Yes, finally! I've also run a full garage-push --walk-tree, but I still haven't seen a case where it successfully found an uploaded an unexpectedly missing object.

@pattivacek pattivacek force-pushed the feat/ostree-walker branch 5 times, most recently from 5e5be04 to 9ca9ab4 Compare December 12, 2018 15:52
@pattivacek
Copy link
Collaborator Author

pattivacek commented Dec 12, 2018

Added some more work to improve garage-deploy speed.

For a fairly simple change, garage-push (with these changes) took this long (note that this does not include garage-sign, but that is usually trivial), measured with time:

real  0m9,098s
user  0m0,947s
sys   0m0,317s

Using garage-deploy from master to upload the same objects:

real  3m13,168s
user  0m27,168s
sys   0m1,309s

Using garage-deploy from this branch to upload the same objects:

real  1m37,113s
user  0m24,744s
sys   0m1,211s

I haven't been able to compare a larger set of changes/objects because of unreliable connectivity. (I actually had one error with the garage-deploy in this case, but it recovered and was trivial enough that it still outperformed the old garage-deploy by quite a bit.)

I'm also still quite impressed with how fast garage-push is compared with garage-deploy. The advantage of reading from disk instead of over a network is obvious, but I'm still wondering if there are further inefficiencies I haven't yet uncovered.

@pattivacek
Copy link
Collaborator Author

I think this should be reviewed and merged now. I've tested that garage-push will successfully push children objects even if the parents exist when running with -w and I've proven that garage-deploy is now substantially faster than before.

I've had to re-read this code probably a dozen times, so maybe this will
save me time next time around.

Signed-off-by: Patrick Vacek <[email protected]>
Mostly just useful for debugging strange circumstances. WIP.

Signed-off-by: Patrick Vacek <[email protected]>
This is shared across the garage tools. This enabled me to fix the
remaining TODO in OSTreeObject::CurlDone().

Signed-off-by: Patrick Vacek <[email protected]>
Basically just break if the system is idle, which should only happen
once everything is done.

Signed-off-by: Patrick Vacek <[email protected]>
Still check the children if we are walking the whole tree, though.

Signed-off-by: Patrick Vacek <[email protected]>
Thanks to @eu-smirnov and @OYTIS for helping disect this and find the
room for improvement.

Signed-off-by: Patrick Vacek <[email protected]>
Do not wait at all if the timeout return value is 0.

Signed-off-by: Patrick Vacek <[email protected]>
* Only print concurrency messages if it changed.
* Log fetches at debug level. Trace level has the full log from curl,
but that is usually too much, and we printed nothing at debug level.

Signed-off-by: Patrick Vacek <[email protected]>
It isn't particularly useful for garage-deploy, since the fetches are on
a single thread, but it can't hurt.

Signed-off-by: Patrick Vacek <[email protected]>
We know the type when we inspect objects for their children, and if we
store it, when can reuse it when querying the server. Previously, we
were trying all known object extensions, which resulted in a lot of
unnecessary networking calls.

Signed-off-by: Patrick Vacek <[email protected]>
And log at debug level if we do.

Signed-off-by: Patrick Vacek <[email protected]>
bool CheckPoolState(const OSTreeObject::ptr &root_object, const RequestPool &request_pool) {
if (request_pool.run_mode() == RunMode::kWalkTree || request_pool.run_mode() == RunMode::kPushTree) {
return !request_pool.is_idle() && !request_pool.is_stopped();
} else {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clarity: having an exhaustive switch/case on all existing running modes would make this part easier to follow as it wouldn't need to know about the exact enum cases.

/* Verify that constructor does not accept a nonexistent repo. */
/* Verify that constructor does not accept a nonexistent repo.
*
* Note that this will not fail on release builds! */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, but isn't that a bit bad, for example if we want to run our test suite on qemu? This could be a FIXME/TODO.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I'm actually working on fixing that right now! My current solution is to wrap it around NDEBUG so that it doesn't run on Release builds.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok but what does it test then? It's not the actual behaviour of the class.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently it fails in debug builds because of an assert that checks that the repo is valid. Would you prefer that instead threw an exception so it failed in all cases?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Either that or just remove this test. If we just test that the constructor has an assert, I don't think it brings much value.

return it->second;
}

const std::string exts[] = {".filez", ".dirtree", ".dirmeta", ".commit"};
Copy link
Contributor

@lbonn lbonn Dec 17, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

std::map exts{{OstreeObjectType::OSTREE_OBJECT_TYPE_FILE, "filez"}, ...}?

edit: oh, this is old code which was moved, not urgent...
edit2: oh actually not, this is new code sorry :). I was looking at the commit's diff

Just reducing unnecessary variation and cleaning up minor issues.

Signed-off-by: Patrick Vacek <[email protected]>
This also has the benefit of causing the relevant test to fail also in
release builds. (It used to just be an assert.)

Signed-off-by: Patrick Vacek <[email protected]>
timeout.tv_sec = 0;
timeout.tv_usec = 100 * 1000;
// If maxfd == -1, then wait the lesser of timeoutms and 100ms. See:
// https://curl.haxx.se/libcurl/c/curl_multi_fdset.html
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No if?

long nofd_timeoutms = std::min(timeoutms, 100);
LOG_DEBUG << "Waiting " << nofd_timeoutms << " ms for curl";
timeout.tv_sec = nofd_timeoutms / 1000;
timeout.tv_usec = 1000 * (nofd_timeoutms % 1000);

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh and actually, tv_sec is then always 0...

@@ -241,7 +247,24 @@ void OSTreeObject::Upload(const TreehubServer &push_target, CURLM *curl_multi_ha
request_start_time_ = std::chrono::steady_clock::now();
}

void OSTreeObject::PresenceUnknown(RequestPool &pool, const int64_t rescode) {
void OSTreeObject::CheckChildren(RequestPool &pool, const long rescode) { // NOLINT
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This clang-tidy lint is a bit annoying https://clang.llvm.org/extra/clang-tidy/checks/google-runtime-int.html. Even if I agree with the intent, it is a bit strict.

Maybe we could disable it. Otherwise, maybe use the specific lint id: // NOLINT(google-runtime-int)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally, I like it, it's just because of the curl API that we have the problem. You are right about specifying the lint ID, though.

Copy link
Contributor

@lbonn lbonn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I've made some style comments. I'll maybe need a bit more time to follow all the logic but since you've run tests on real repos and that it brings good speed improvements, it's already enough from my perspective.

Also refactor CheckPoolState to use a cleaner switch.

Signed-off-by: Patrick Vacek <[email protected]>
@pattivacek pattivacek force-pushed the feat/ostree-walker branch 2 times, most recently from 1c60c85 to a9a59a2 Compare December 17, 2018 11:18
@pattivacek
Copy link
Collaborator Author

I'll maybe need a bit more time to follow all the logic

I don't blame you, this ended up being a lot of work all thrown in one place. At least I did my best to separate the work into meaningful chunks/commits. The only real new feature is the tree walking, but there is a ton of refactoring, bug fixing, and optimization alongside that.

@pattivacek pattivacek merged commit 24a1ce0 into master Dec 17, 2018
@pattivacek pattivacek deleted the feat/ostree-walker branch December 17, 2018 14:42
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants