Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Startup performance #41

Closed
radu-matei opened this issue Mar 13, 2024 · 12 comments
Closed

Startup performance #41

radu-matei opened this issue Mar 13, 2024 · 12 comments
Labels
help wanted Extra attention is needed

Comments

@radu-matei
Copy link
Contributor

I'd like to understand where we spend time when starting a container; I filed #40 since I assume we spend a non-trivial amount of time when starting, but I'm not sure if there is anywhere else we could look.

@jsturtevant — is the RUST_LOG implementation in the rust-extensions crate helping us with tracing here?

@radu-matei radu-matei added the help wanted Extra attention is needed label Mar 13, 2024
@jsturtevant
Copy link
Contributor

There isn't a ton of logs in the underlying shim that would help.

@Mossaka has been working on adding open telemetry tracing to runwasi. That would be able to show more details on where we are spending time. We has some early results and I think we found it was in the setting up the container NS's that we were spending time.

Cleaning up the disk operations in #40 will add some improvements. I saw an improvement in the wasmtime shim after avoiding disk reads/writes with the OCI artifacts but it was not where were spending most of our time.

@Mossaka
Copy link
Collaborator

Mossaka commented Mar 13, 2024

yeah this is the work @jsturtevant's referring to: containerd/runwasi#10

@radu-matei
Copy link
Contributor Author

I think even only being able to set the RUST_LOG level for tracing throughout the shim dependencies would be a very good start in starting to diagnose this.

@Mossaka
Copy link
Collaborator

Mossaka commented Apr 23, 2024

A quick update on the findings of the startup performance -

After adding traces to runwasi, I found that runwasi isn't deleting the shim socket after the shim gets deleted. (e.g. /run/containerd/<hash>). This socket address causes the shim::spawn function from rust-extensions to hang for over 1 second (e.g. issue containerd/rust-extensions#263), which could be a reason why it takes so much time to start a container using runwasi.

Jeager tracing

image

The wait_socket_working() function loops connection to the ttrpc client using the undeleted socket address from last run 200 times and each time waits for 5ms, which contribtues to the 1s hang on start_shim function.

What's next?

I am following up with rust-extension folks to figure out why exactly are we waiting for such a long time to establish a connection with a socket address. I am also investigating how to properly delete the shim socket address in runwasi after the shim gets deleted, which in turn could make the startup much faster.

In theory, I think start_shim should take less time than task_create request which takes about 40ms.

@Mossaka Mossaka moved this to In Progress in Runwasi 1.0 Apr 23, 2024
@radu-matei
Copy link
Contributor Author

This is excellent info, @Mossaka, thank you so much for digging into this!

@Mossaka
Copy link
Collaborator

Mossaka commented Apr 29, 2024

A quick update:

Abel's comment at containerd/rust-extensions#263 answers why we have to wait for the socket. So the only question remains: why doesn't the runwasi shim delete it's socket address after its get closed?

After digging more into the syscalls, I found that the shim::wait() actually unlinks the entire bundle path (e.g. /run/containerd/io.containerd.runtime.v2.task/<namespace>/<id>) which contains the address file for reading the socket address. This means that the runwasi shims actually crash every time right before they terminate. This explains why the socket address isn't deleted before shim close.

But why does the bundle path gets deleted? I am not sure at the moment. Still trying to figure it out. It might related to youki's libcontainer (e.g. youki-dev/youki#2772).

@radu-matei
Copy link
Contributor Author

@Mossaka is youki-dev/youki#2780 going to affect the startup performance we're seeing here?

Thanks!

@Mossaka
Copy link
Collaborator

Mossaka commented May 6, 2024

@Mossaka is containers/youki#2780 going to affect the startup performance we're seeing here?

Thanks!

Yes, that will make the startup performance much faster

@Mossaka
Copy link
Collaborator

Mossaka commented May 22, 2024

containerd/runwasi#601 once this PR is merged, we can do a release of runwasi and then bump the version in the spin shim. It should make start up much faster

@Mossaka
Copy link
Collaborator

Mossaka commented May 27, 2024

i just bump the newest release version of the runwasi, will do some tests to see the perf gain

@Mossaka
Copy link
Collaborator

Mossaka commented May 27, 2024

Test is performed using spin command trigger on a simply rust program that outputs the system timestamp. if you are interested the script can be found here

Before

#########################################
Containerd + RUNWASI with Spin SHIM
#########################################

Running experiment 1
Start timestamp in milis: 1716853258777 milliseconds
End timestamp in milis: 1716853259917 milliseconds
Time difference: 1140 milliseconds

Running experiment 2
Start timestamp in milis: 1716853260086 milliseconds
End timestamp in milis: 1716853261240 milliseconds
Time difference: 1154 milliseconds

Running experiment 3
Start timestamp in milis: 1716853261307 milliseconds
End timestamp in milis: 1716853262432 milliseconds
Time difference: 1125 milliseconds

Running experiment 4
Start timestamp in milis: 1716853262533 milliseconds
End timestamp in milis: 1716853263661 milliseconds
Time difference: 1128 milliseconds

Running experiment 5
Start timestamp in milis: 1716853264079 milliseconds
End timestamp in milis: 1716853265217 milliseconds
Time difference: 1138 milliseconds

Mean: 1137 milliseconds
Median: 1138 milliseconds
Mode: 1154 milliseconds
Standard Deviation: 10.19 milliseconds
Minimum Time: 1125 milliseconds
Maximum Time: 1154 milliseconds

After

#########################################
Containerd + RUNWASI with Spin SHIM
#########################################

Running experiment 1
Start timestamp in milis: 1716853849599 milliseconds
End timestamp in milis: 1716853849736 milliseconds
Time difference: 137 milliseconds

Running experiment 2
Start timestamp in milis: 1716853849817 milliseconds
End timestamp in milis: 1716853849948 milliseconds
Time difference: 131 milliseconds

Running experiment 3
Start timestamp in milis: 1716853849988 milliseconds
End timestamp in milis: 1716853850117 milliseconds
Time difference: 129 milliseconds

Running experiment 4
Start timestamp in milis: 1716853850152 milliseconds
End timestamp in milis: 1716853850273 milliseconds
Time difference: 121 milliseconds

Running experiment 5
Start timestamp in milis: 1716853850309 milliseconds
End timestamp in milis: 1716853850435 milliseconds
Time difference: 126 milliseconds

Mean: 128 milliseconds
Median: 129 milliseconds
Mode: 137 milliseconds
Standard Deviation: 5.29 milliseconds
Minimum Time: 121 milliseconds
Maximum Time: 137 milliseconds

@Mossaka
Copy link
Collaborator

Mossaka commented May 27, 2024

To close this issue, I am working on a PR to add opentelemetry capability to runwasi and its shims, which helped me to discover suprises and trace where we spend more time on.

@Mossaka Mossaka closed this as completed May 27, 2024
@github-project-automation github-project-automation bot moved this from In Progress to Done in Runwasi 1.0 May 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted Extra attention is needed
Projects
None yet
Development

No branches or pull requests

3 participants