-
Notifications
You must be signed in to change notification settings - Fork 17
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add ability to execute scripts via Kubernetes Jobs #690
Conversation
562aea6
to
58762ac
Compare
b6234e2
to
5069e7a
Compare
{ | ||
public KubernetesClientConfiguration Get() | ||
{ | ||
return KubernetesClientConfiguration.InClusterConfig(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When we are running in a cluster, this will just use the ambient service account auth
{ | ||
var versionInfo = await Client.Version.GetCodeAsync(); | ||
|
||
return new ClusterVersion(int.Parse(versionInfo.Major), int.Parse(versionInfo.Minor)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the patch number of a k8s cluster is not provided by the cluster itself (and isn't important for what we need it for)
} | ||
} | ||
|
||
public async Task Watch(ScriptTicket scriptTicket, Func<V1Job, bool> onChange, Action<Exception> onError, CancellationToken cancellationToken) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than polling the TryGet
endpoint continually, this creates an IAsyncEnumerable
which yields when there is a change
|
||
format() { | ||
now=$(date -u +"%Y-%m-%dT%H:%M:%S.%N%z") | ||
echo "$now|$2" | tee -a "$1" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we use tee
so the logs are written to the job/pod logs as well as the log files
# This ungodly hack is to stop the pod from being killed before the last log has been flushed | ||
sleep 0.250 #250ms |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like the pod is destroyed before the file logs are flushed. This little hack just gives the pod time to flush the last redirected output to the files
This pull request has been linked to Shortcut Story #59580: Add ability for Tentacle to schedule new k8s jobs when running scripts/deployments. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work, It took a bunch of time work to get to this point but it looks fairly clean. And nice that it was done in conjunction with the H&E group.
Just a couple questions here and there, I dont think anything particularly structural.
{ | ||
public static class KubernetesConfig | ||
{ | ||
public static string Namespace => Environment.GetEnvironmentVariable("OCTOPUS__K8STENTACLE__NAMESPACE") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor suggestion, might be a little cleaner with a local helper function since there are a bunch of environment variable lookups with this pattern.
Given its only 3 its borderline that its needed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I chose to pull out a common method in a2a48b7
timeoutSeconds: 1800, //same as the TTL of the job itself | ||
cancellationToken: cancellationToken); | ||
|
||
await foreach (var (type, job) in response.WatchAsync<V1Job, V1JobList>(onError, cancellationToken: cancellationToken)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting pattern 👍
using var writer = ScriptLog.CreateWriter(); | ||
try | ||
{ | ||
using (ScriptIsolationMutex.Acquire(workspace.IsolationLevel, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This presumably works because when running one instance, the mutex is accessable by that one Tentacle process.
If/when we allow multiple instances to run within a cluster we may need to answer if this isolationlevel check is still relevant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Correct. When/if we get to HA, then we'll need a different approach to how to do the script isolation mutex or, as you say, if it's still relevant
}, | ||
ServiceAccountName = KubernetesConfig.JobServiceAccountName, | ||
RestartPolicy = "Never", | ||
Volumes = volumes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice to see that worked :)
}, | ||
Env = new List<V1EnvVar> | ||
{ | ||
new(EnvironmentVariables.TentacleHome, $"/data/tentacle-home/{instanceName}"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some of those other environment variables sound like they could be relevant for Calamari. TentacleJournal
or TentacleInstanceName
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hrm, good call. I'll investigate. I added these ones because they were needed for the health-check
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, I missed TentacleJournal
and TentacleInstanceName
. Added in a2a48b7
new(1, 28, 2), | ||
}; | ||
|
||
async Task<string> GetDefaultContainer() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor, consider pulling some parts of this class such as this worker tool container resolution into its own class that can be tested seperately. ALso makes this class easier to grok.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Created a new KubernetesJobContainerResolver
in a2a48b7
} | ||
|
||
public IScriptExecutor GetExecutor() | ||
{ | ||
return shellScriptExecutor.Value; | ||
return KubernetesConfig.UseJobs switch |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be simpler to push as much of the "if k8s do this" into being container registration concern?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, fair call. I'll move this into the module and only register the appropriate executor based on the env var
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done in a2a48b7
There is no longer a factory and the script executors are now conditionaly registered in the Autofac container
@zentron I've updated the PR with the changes suggested. I did end up removing the I also pulled the container stuff into a new class, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice one, this is a huge piece of the puzzle 👍
Background
This is the big one 😱
This adds support for executing scripts supplied to
ScriptServiceV3Alpha
by Kubernetes Jobs.Results
Adds a new
KubernetesJobScriptExecutor
that is used when the environment variableOCTOPUS__TENTACLE__K8SUSEJOBS
is set toTrue
At a high level, the new
RunningKubernetesJobScript
creates a new k8s Job and executes theBootstrap
script sent from server in theoctopuslabs/k8-workertools
container image. It then starts two background processes, one to monitor the status of the job itself (so it knows when it's finished) and another to read the streamed log files (stdout.log
andstderr.log
) and write them into theOutput.log
. For more info on this process, see belowThe tag of the worker tools image used is linked to the version of k8s the tentacle is running in, which we query via the API.
It's implied that there is also a volume called
tentacle-home
that has access to the tentacle's work directory so it can execute the script. This is supplied as raw yaml in theOCTOPUS__TENTACLE__K8SJOBVOLUMEYAML
environment variable.Script Logs
As Tentacle needs to write the
Output.log
in a thread-safe manner (so logs don't get munged), we execute the bootstrap script using abootstrapRunner.sh
script which redirects thestdout
andstderr
streams from theBootstrap
script into two files in the work directory,stdout.log
andstderr.log
. Then there is a process in theKubernetesJobOutputStreamWriter
class that periodically (every 250ms) tail observes those files for new log entries. When there are, they are then written to the tentacle'sOutput.log
under the same locking mechanism as the rest of the code.Once the job is finished, we perform one last read of those files just in case there was final log items flushed.
Testing
Currently there is no automated tests for this code. I'm still working on refining it so manual testing will suffice for now until the codebase stabilises.
I was able to successfully run a health check for the the
Kubernetes Tentacle
in my local server.Shortcut story: [sc-59580]
After
How to review this PR
Quality ✔️
Pre-requisites