Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ability to execute scripts via Kubernetes Jobs #690

Merged
merged 30 commits into from
Nov 29, 2023

Conversation

APErebus
Copy link
Contributor

@APErebus APErebus commented Nov 27, 2023

Background

This is the big one 😱

This adds support for executing scripts supplied to ScriptServiceV3Alpha by Kubernetes Jobs.

Results

Adds a new KubernetesJobScriptExecutor that is used when the environment variable OCTOPUS__TENTACLE__K8SUSEJOBS is set to True

At a high level, the new RunningKubernetesJobScript creates a new k8s Job and executes the Bootstrap script sent from server in the octopuslabs/k8-workertools container image. It then starts two background processes, one to monitor the status of the job itself (so it knows when it's finished) and another to read the streamed log files (stdout.log and stderr.log) and write them into the Output.log. For more info on this process, see below

The tag of the worker tools image used is linked to the version of k8s the tentacle is running in, which we query via the API.

It's implied that there is also a volume called tentacle-home that has access to the tentacle's work directory so it can execute the script. This is supplied as raw yaml in the OCTOPUS__TENTACLE__K8SJOBVOLUMEYAML environment variable.

Script Logs

As Tentacle needs to write the Output.log in a thread-safe manner (so logs don't get munged), we execute the bootstrap script using a bootstrapRunner.sh script which redirects the stdout and stderr streams from the Bootstrap script into two files in the work directory, stdout.log and stderr.log. Then there is a process in the KubernetesJobOutputStreamWriter class that periodically (every 250ms) tail observes those files for new log entries. When there are, they are then written to the tentacle's Output.log under the same locking mechanism as the rest of the code.

Once the job is finished, we perform one last read of those files just in case there was final log items flushed.

Testing

Currently there is no automated tests for this code. I'm still working on refining it so manual testing will suffice for now until the codebase stabilises.

I was able to successfully run a health check for the the Kubernetes Tentacle in my local server.

Shortcut story: [sc-59580]

After

image

How to review this PR

Quality ✔️

Pre-requisites

  • I have read How we use GitHub Issues for help deciding when and where it's appropriate to make an issue.
  • I have considered informing or consulting the right people, according to the ownership map.
  • I have considered appropriate testing for my change.

@APErebus APErebus force-pushed the ap/k8s-jobs-script-execution branch from b6234e2 to 5069e7a Compare November 27, 2023 02:09
@APErebus APErebus marked this pull request as ready for review November 27, 2023 02:28
@APErebus APErebus requested a review from a team as a code owner November 27, 2023 02:28
{
public KubernetesClientConfiguration Get()
{
return KubernetesClientConfiguration.InClusterConfig();
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we are running in a cluster, this will just use the ambient service account auth

{
var versionInfo = await Client.Version.GetCodeAsync();

return new ClusterVersion(int.Parse(versionInfo.Major), int.Parse(versionInfo.Minor));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the patch number of a k8s cluster is not provided by the cluster itself (and isn't important for what we need it for)

}
}

public async Task Watch(ScriptTicket scriptTicket, Func<V1Job, bool> onChange, Action<Exception> onError, CancellationToken cancellationToken)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than polling the TryGet endpoint continually, this creates an IAsyncEnumerable which yields when there is a change


format() {
now=$(date -u +"%Y-%m-%dT%H:%M:%S.%N%z")
echo "$now|$2" | tee -a "$1"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we use tee so the logs are written to the job/pod logs as well as the log files

Comment on lines +45 to +46
# This ungodly hack is to stop the pod from being killed before the last log has been flushed
sleep 0.250 #250ms
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like the pod is destroyed before the file logs are flushed. This little hack just gives the pod time to flush the last redirected output to the files

Copy link

Copy link
Contributor

@zentron zentron left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work, It took a bunch of time work to get to this point but it looks fairly clean. And nice that it was done in conjunction with the H&E group.

Just a couple questions here and there, I dont think anything particularly structural.

source/Octopus.Tentacle/ExternalInit.cs Show resolved Hide resolved
{
public static class KubernetesConfig
{
public static string Namespace => Environment.GetEnvironmentVariable("OCTOPUS__K8STENTACLE__NAMESPACE")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor suggestion, might be a little cleaner with a local helper function since there are a bunch of environment variable lookups with this pattern.
Given its only 3 its borderline that its needed

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I chose to pull out a common method in a2a48b7

source/Octopus.Tentacle/Kubernetes/KubernetesJobService.cs Outdated Show resolved Hide resolved
source/Octopus.Tentacle/Kubernetes/KubernetesJobService.cs Outdated Show resolved Hide resolved
timeoutSeconds: 1800, //same as the TTL of the job itself
cancellationToken: cancellationToken);

await foreach (var (type, job) in response.WatchAsync<V1Job, V1JobList>(onError, cancellationToken: cancellationToken))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting pattern 👍

using var writer = ScriptLog.CreateWriter();
try
{
using (ScriptIsolationMutex.Acquire(workspace.IsolationLevel,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This presumably works because when running one instance, the mutex is accessable by that one Tentacle process.
If/when we allow multiple instances to run within a cluster we may need to answer if this isolationlevel check is still relevant.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct. When/if we get to HA, then we'll need a different approach to how to do the script isolation mutex or, as you say, if it's still relevant

},
ServiceAccountName = KubernetesConfig.JobServiceAccountName,
RestartPolicy = "Never",
Volumes = volumes
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice to see that worked :)

},
Env = new List<V1EnvVar>
{
new(EnvironmentVariables.TentacleHome, $"/data/tentacle-home/{instanceName}"),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some of those other environment variables sound like they could be relevant for Calamari. TentacleJournal or TentacleInstanceName?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hrm, good call. I'll investigate. I added these ones because they were needed for the health-check

Copy link
Contributor Author

@APErebus APErebus Nov 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, I missed TentacleJournal and TentacleInstanceName. Added in a2a48b7

new(1, 28, 2),
};

async Task<string> GetDefaultContainer()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor, consider pulling some parts of this class such as this worker tool container resolution into its own class that can be tested seperately. ALso makes this class easier to grok.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Created a new KubernetesJobContainerResolver in a2a48b7

}

public IScriptExecutor GetExecutor()
{
return shellScriptExecutor.Value;
return KubernetesConfig.UseJobs switch
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be simpler to push as much of the "if k8s do this" into being container registration concern?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, fair call. I'll move this into the module and only register the appropriate executor based on the env var

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in a2a48b7

There is no longer a factory and the script executors are now conditionaly registered in the Autofac container

@APErebus APErebus requested a review from zentron November 29, 2023 02:17
@APErebus
Copy link
Contributor Author

@zentron I've updated the PR with the changes suggested. I did end up removing the IScriptExecutorFactory and moving it into the module.

I also pulled the container stuff into a new class, IKubernetesJobContainerResolver

Copy link
Contributor

@zentron zentron left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice one, this is a huge piece of the puzzle 👍

@APErebus APErebus merged commit be796e5 into main Nov 29, 2023
2 checks passed
@APErebus APErebus deleted the ap/k8s-jobs-script-execution branch November 29, 2023 04:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants