Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[pull] master from armadaproject:master #12

Closed
wants to merge 70 commits into from

Conversation

pull-bot-infra[bot]
Copy link

@pull-bot-infra pull-bot-infra bot commented Jul 30, 2024

See Commits and Changes for more details.


Created by pull-bot-infra[bot]

Can you help keep this open source service alive? 💖 Please sponsor : )

dave-gantenbein and others added 30 commits June 20, 2024 11:42
* adding Geoff to approvers list

Signed-off-by: David Gantenbein <[email protected]>

* adding Geoff to maintainers list

Signed-off-by: David Gantenbein <[email protected]>

---------

Signed-off-by: David Gantenbein <[email protected]>
* Limit the number of events we put into each Pulsar Message

Currently we can publish very large events per message (100k+ events per message)

This can make the time to process messages quite unpredictable, as they can be anywhere between 1 event and 100000+ events

Now we restrict how many messages we put into each message (via `maxAllowedEventsPerMessage`), which should make how many changes a given message may contain somewhat more predictable

Signed-off-by: JamesMurkin <[email protected]>

* Add unit test

Signed-off-by: JamesMurkin <[email protected]>

* Fix unit test

Signed-off-by: JamesMurkin <[email protected]>

* Fix unit test

Signed-off-by: JamesMurkin <[email protected]>

* Fix passing field through ExecutorAPI

Signed-off-by: JamesMurkin <[email protected]>

---------

Signed-off-by: JamesMurkin <[email protected]>
* Add assignedNodeId field to JobSchedulingContext

This makes it far faster to look up the currently assigned node for a given evicted job
 - Currently it uses a map look up which is comparatively slow

Signed-off-by: JamesMurkin <[email protected]>

* Add test for Get/Set AssignedNodeId

Signed-off-by: JamesMurkin <[email protected]>

* Fix check for assigned node

Signed-off-by: JamesMurkin <[email protected]>

* Fix unit test

Signed-off-by: JamesMurkin <[email protected]>

---------

Signed-off-by: JamesMurkin <[email protected]>
Signed-off-by: Chris Martin <[email protected]>
Co-authored-by: Chris Martin <[email protected]>
* Make ingesters limit how many events they process at once

Currently we limit how many messages are processed at once

However a single message may contain thousands of events, so limit on messages is a bit flawed
 - A limit of 1000, could mean you're processing 1000 or 1000000+ events at once

This should make the ingesters less prone to long pauses if some large messages come along

Also adjusted event ingester config to be more standard with other config naming

Signed-off-by: JamesMurkin <[email protected]>

* Comment improvement

Signed-off-by: JamesMurkin <[email protected]>

* Comment improvement

Signed-off-by: JamesMurkin <[email protected]>

* gofumpt

Signed-off-by: JamesMurkin <[email protected]>

* Improve config descriptions

Signed-off-by: JamesMurkin <[email protected]>

* Limit Pulsar messages to have a configurable max number of events per message

Currently we can publish very large messages (100k+ events per message)

This can make the time to process messages quite unpredictable, as they can be anywhere between 1 event and 100000+ events

Now we restrict how many messages we put into each message (via `maxAllowedEventsPerMessage`), which should make how many changes a given message may contain somewhat more predictable

Signed-off-by: JamesMurkin <[email protected]>

* Revert "Limit Pulsar messages to have a configurable max number of events per message"

This reverts commit 11a8a2a.

* Rename config to MaxOutputMessageSizeBytes

Signed-off-by: JamesMurkin <[email protected]>

---------

Signed-off-by: JamesMurkin <[email protected]>
* Improve performance of selectNodeForJobWithFairPreemption

Testing with the worst case, this makes the function 2-4 times faster
 - Combined with #3678 it is about 3-5 times faster

Even the average case should be improved, as we now do less work

The main "trick" here is to not perform node.UnsafeCopy() and nodeDb.unbindJobFromNodeInPlace until it is needed (once you know which node is the selected one)
 - These functions are expensive, particularly UnsafeCopy

Instead just keep track of the evicted nodes + available resource locally until you've found one that matches.

There are more enhancements that could be made here:
 - Work out when it is better to check static requirements first (possibly even save these and use them for all jobs with the same scheduling key)
 - Improve nodeDb speed to look up a node (use a map?)
 - Improve nodeDb speed to iterate evicted jobs (use a slice? / effectively we just need a sorted hashmap, not a fully fledged radix tree)

Signed-off-by: JamesMurkin <[email protected]>

* Lint

Signed-off-by: JamesMurkin <[email protected]>

* Move struct definition inside function

Signed-off-by: JamesMurkin <[email protected]>

---------

Signed-off-by: JamesMurkin <[email protected]>
* Syncing out Armada Airflow Operator (#164)

* changes

* Fixing tests

* Formatting

* Making integration tests work

* Formatting files

* Formatting, ensuring Armada trigger can cancel

* Fixing docs and adding back examples

* Fixing docs

* Addressing comments, supporting more python versions

* Updating docs

* Removing airflow operator build script from github workflow check

---------

Co-authored-by: Mustafa Ilyas <[email protected]>
Signed-off-by: Chris Martin <[email protected]>
Co-authored-by: Chris Martin <[email protected]>
* wip

Signed-off-by: Chris Martin <[email protected]>

* wip

Signed-off-by: Chris Martin <[email protected]>

* wip

Signed-off-by: Chris Martin <[email protected]>

* wip

Signed-off-by: Chris Martin <[email protected]>

* unit tests for ingester

Signed-off-by: Chris Martin <[email protected]>

* wip

Signed-off-by: Chris Martin <[email protected]>

* wip

Signed-off-by: Chris Martin <[email protected]>

* fix tests

Signed-off-by: Chris Martin <[email protected]>

* lint

Signed-off-by: Chris Martin <[email protected]>

* extra test

Signed-off-by: Chris Martin <[email protected]>

* code review comments

Signed-off-by: Chris Martin <[email protected]>

---------

Signed-off-by: Chris Martin <[email protected]>
Co-authored-by: Chris Martin <[email protected]>
* wip

Signed-off-by: Chris Martin <[email protected]>

* wip

Signed-off-by: Chris Martin <[email protected]>

* fix some tests

Signed-off-by: Chris Martin <[email protected]>

* fix some tests

Signed-off-by: Chris Martin <[email protected]>

* wip

Signed-off-by: Chris Martin <[email protected]>

* extra test

Signed-off-by: Chris Martin <[email protected]>

* allow ability to disable adjusted fair share protection.

Signed-off-by: Chris Martin <[email protected]>

* lint

Signed-off-by: Chris Martin <[email protected]>

* extra test

Signed-off-by: Chris Martin <[email protected]>

---------

Signed-off-by: Chris Martin <[email protected]>
Co-authored-by: Chris Martin <[email protected]>
* Make ingesters limit how many events they process at once

Currently we limit how many messages are processed at once

However a single message may contain thousands of events, so limit on messages is a bit flawed
 - A limit of 1000, could mean you're processing 1000 or 1000000+ events at once

This should make the ingesters less prone to long pauses if some large messages come along

Also adjusted event ingester config to be more standard with other config naming

Signed-off-by: JamesMurkin <[email protected]>

* Comment improvement

Signed-off-by: JamesMurkin <[email protected]>

* Comment improvement

Signed-off-by: JamesMurkin <[email protected]>

* gofumpt

Signed-off-by: JamesMurkin <[email protected]>

* Improve config descriptions

Signed-off-by: JamesMurkin <[email protected]>

* Limit Pulsar messages to have a configurable max number of events per message

Currently we can publish very large messages (100k+ events per message)

This can make the time to process messages quite unpredictable, as they can be anywhere between 1 event and 100000+ events

Now we restrict how many messages we put into each message (via `maxAllowedEventsPerMessage`), which should make how many changes a given message may contain somewhat more predictable

Signed-off-by: JamesMurkin <[email protected]>

* Revert "Limit Pulsar messages to have a configurable max number of events per message"

This reverts commit 11a8a2a.

* Improve logging in ingester pipeline

This should help us understand what is happening in our ingestion pipelines

 - Should log if we are no longer receiving pulsar messages for 2mins
 - Will log a summary of how many messages and event in each "batch"
 - Will log a summary of the types of events in each batch
 - Will log a summary of how long Convert took for each batch

This is admittedly quite a "quick" fix and better long term steps would be:
 - Metrics or spans
  - Some of these could be at the ingseter pipeline level (generic)
  - Some would need to be done in each ingester to expose more detailed information such as which query is all the time being spent on

Signed-off-by: JamesMurkin <[email protected]>

---------

Signed-off-by: JamesMurkin <[email protected]>
* Support state rejected and job errors

Signed-off-by: Robert Smith <[email protected]>

* fix test

Signed-off-by: Robert Smith <[email protected]>

---------

Signed-off-by: Robert Smith <[email protected]>
Signed-off-by: Chris Martin <[email protected]>
Co-authored-by: Chris Martin <[email protected]>
The API for working out if a timer is stopped / needs draining is more complicated than I understood. We could fix what we have to use it properly, but instead use simpler approach that isn't error prone

It also means we can print how long it has been since we last received an event

Signed-off-by: JamesMurkin <[email protected]>
Co-authored-by: Chris Martin <[email protected]>
* wip

Signed-off-by: Chris Martin <[email protected]>

* add tests

Signed-off-by: Chris Martin <[email protected]>

* lint

Signed-off-by: Chris Martin <[email protected]>

* lint

Signed-off-by: Chris Martin <[email protected]>

* lint

Signed-off-by: Chris Martin <[email protected]>

* code review

Signed-off-by: Chris Martin <[email protected]>

---------

Signed-off-by: Chris Martin <[email protected]>
Co-authored-by: Chris Martin <[email protected]>
The function that counts the number of events per event sequence is wrong as it is actually counting the number of event sequences (each of which can contain many events)

This just makes us actually count all the events in all event sequences

Signed-off-by: JamesMurkin <[email protected]>
* wip

Signed-off-by: Chris Martin <[email protected]>

* wip

Signed-off-by: Chris Martin <[email protected]>

* wip

Signed-off-by: Chris Martin <[email protected]>

* code review comment

Signed-off-by: Chris Martin <[email protected]>

---------

Signed-off-by: Chris Martin <[email protected]>
Co-authored-by: Chris Martin <[email protected]>
* initial timestamps

Signed-off-by: Chris Martin <[email protected]>

* initial timestamps

Signed-off-by: Chris Martin <[email protected]>

* initial timestamps

Signed-off-by: Chris Martin <[email protected]>

* tests

Signed-off-by: Chris Martin <[email protected]>

* regen proto

Signed-off-by: Chris Martin <[email protected]>

* fix diff

Signed-off-by: Chris Martin <[email protected]>

* more fixes

Signed-off-by: Chris Martin <[email protected]>

* lint

Signed-off-by: Chris Martin <[email protected]>

* fix tests

Signed-off-by: Chris Martin <[email protected]>

---------

Signed-off-by: Chris Martin <[email protected]>
Co-authored-by: Chris Martin <[email protected]>
* initial timestamps

Signed-off-by: Chris Martin <[email protected]>

* initial timestamps

Signed-off-by: Chris Martin <[email protected]>

* initial timestamps

Signed-off-by: Chris Martin <[email protected]>

* tests

Signed-off-by: Chris Martin <[email protected]>

* regen proto

Signed-off-by: Chris Martin <[email protected]>

* fix diff

Signed-off-by: Chris Martin <[email protected]>

* submit.go

Signed-off-by: Chris Martin <[email protected]>

* submit.go

Signed-off-by: Chris Martin <[email protected]>

* more fixes

Signed-off-by: Chris Martin <[email protected]>

* more fixes

Signed-off-by: Chris Martin <[email protected]>

* lint

Signed-off-by: Chris Martin <[email protected]>

* fix tests

Signed-off-by: Chris Martin <[email protected]>

* fix proto gen

Signed-off-by: Chris Martin <[email protected]>

* fix test

Signed-off-by: Chris Martin <[email protected]>

* comment out test

Signed-off-by: Chris Martin <[email protected]>

* fix

Signed-off-by: Chris Martin <[email protected]>

* fix

Signed-off-by: Chris Martin <[email protected]>

* fix

Signed-off-by: Chris Martin <[email protected]>

* another test

Signed-off-by: Chris Martin <[email protected]>

* another test

Signed-off-by: Chris Martin <[email protected]>

* wip

Signed-off-by: Chris Martin <[email protected]>

* fix merge conflict

Signed-off-by: Chris Martin <[email protected]>

---------

Signed-off-by: Chris Martin <[email protected]>
Co-authored-by: Chris Martin <[email protected]>
Signed-off-by: Chris Martin <[email protected]>
Co-authored-by: Chris Martin <[email protected]>
Signed-off-by: Chris Martin <[email protected]>
Co-authored-by: Chris Martin <[email protected]>
* initial timestamps

Signed-off-by: Chris Martin <[email protected]>

* initial timestamps

Signed-off-by: Chris Martin <[email protected]>

* initial timestamps

Signed-off-by: Chris Martin <[email protected]>

* tests

Signed-off-by: Chris Martin <[email protected]>

* regen proto

Signed-off-by: Chris Martin <[email protected]>

* fix diff

Signed-off-by: Chris Martin <[email protected]>

* submit.go

Signed-off-by: Chris Martin <[email protected]>

* submit.go

Signed-off-by: Chris Martin <[email protected]>

* more fixes

Signed-off-by: Chris Martin <[email protected]>

* more fixes

Signed-off-by: Chris Martin <[email protected]>

* lint

Signed-off-by: Chris Martin <[email protected]>

* fix tests

Signed-off-by: Chris Martin <[email protected]>

* fix proto gen

Signed-off-by: Chris Martin <[email protected]>

* fix test

Signed-off-by: Chris Martin <[email protected]>

* comment out test

Signed-off-by: Chris Martin <[email protected]>

* fix

Signed-off-by: Chris Martin <[email protected]>

* fix

Signed-off-by: Chris Martin <[email protected]>

* fix

Signed-off-by: Chris Martin <[email protected]>

* another test

Signed-off-by: Chris Martin <[email protected]>

* another test

Signed-off-by: Chris Martin <[email protected]>

* initial commit

Signed-off-by: Chris Martin <[email protected]>

* fix test

Signed-off-by: Chris Martin <[email protected]>

---------

Signed-off-by: Chris Martin <[email protected]>
Co-authored-by: Chris Martin <[email protected]>
* initial timestamps

Signed-off-by: Chris Martin <[email protected]>

* initial timestamps

Signed-off-by: Chris Martin <[email protected]>

* initial timestamps

Signed-off-by: Chris Martin <[email protected]>

* tests

Signed-off-by: Chris Martin <[email protected]>

* regen proto

Signed-off-by: Chris Martin <[email protected]>

* fix diff

Signed-off-by: Chris Martin <[email protected]>

* submit.go

Signed-off-by: Chris Martin <[email protected]>

* submit.go

Signed-off-by: Chris Martin <[email protected]>

* more fixes

Signed-off-by: Chris Martin <[email protected]>

* more fixes

Signed-off-by: Chris Martin <[email protected]>

* lint

Signed-off-by: Chris Martin <[email protected]>

* fix tests

Signed-off-by: Chris Martin <[email protected]>

* fix proto gen

Signed-off-by: Chris Martin <[email protected]>

* fix test

Signed-off-by: Chris Martin <[email protected]>

* comment out test

Signed-off-by: Chris Martin <[email protected]>

* fix

Signed-off-by: Chris Martin <[email protected]>

* fix

Signed-off-by: Chris Martin <[email protected]>

* fix

Signed-off-by: Chris Martin <[email protected]>

* another test

Signed-off-by: Chris Martin <[email protected]>

* another test

Signed-off-by: Chris Martin <[email protected]>

* initial commit

Signed-off-by: Chris Martin <[email protected]>

* remove gogo from testsuite

Signed-off-by: Chris Martin <[email protected]>

* executor

Signed-off-by: Chris Martin <[email protected]>

* lint

Signed-off-by: Chris Martin <[email protected]>

* unit test

Signed-off-by: Chris Martin <[email protected]>

* added test

Signed-off-by: Chris Martin <[email protected]>

* lint

Signed-off-by: Chris Martin <[email protected]>

---------

Signed-off-by: Chris Martin <[email protected]>
Co-authored-by: Chris Martin <[email protected]>
* remove timestamp

Signed-off-by: Chris Martin <[email protected]>

* remove timestamp

Signed-off-by: Chris Martin <[email protected]>

* fix import order

Signed-off-by: Chris Martin <[email protected]>

---------

Signed-off-by: Chris Martin <[email protected]>
Co-authored-by: Chris Martin <[email protected]>
* fairness error

Signed-off-by: Chris Martin <[email protected]>

* Update internal/scheduler/scheduler_metrics.go

Co-authored-by: JamesMurkin <[email protected]>

---------

Signed-off-by: Chris Martin <[email protected]>
Co-authored-by: Chris Martin <[email protected]>
Co-authored-by: JamesMurkin <[email protected]>
* [WIP] Add support for mixed pool clusters

Currently an executor (k8s cluster) must belong to a pool.

This can be quite restricting if you have several nodes, all of which you want to be in their own pools. As you have to build a cluster for each group of nodes

With this change we are making it so a Node belongs to a pool (marked via label) - and defaulting to the cluster pool
 - At some point after this migration, we may remove the cluster having a pool and instead have a static "default" pool

As part of this change I've also removed the minimumJobSize functionality as it is unused
 - If we need it back, we'll need to add it as config at the scheduling level (minimum per pool)

Signed-off-by: JamesMurkin <[email protected]>

* Fix imports

Signed-off-by: JamesMurkin <[email protected]>

* gofumpt

Signed-off-by: JamesMurkin <[email protected]>

* Remove minimumJobSize, add tests, pass through Pool via JobRunLeased

Signed-off-by: JamesMurkin <[email protected]>

* Add tests + tidy

Signed-off-by: JamesMurkin <[email protected]>

* Format

Signed-off-by: JamesMurkin <[email protected]>

* Remove unused field

Signed-off-by: JamesMurkin <[email protected]>

* Default pool

Signed-off-by: JamesMurkin <[email protected]>

* Improve comment

---------

Signed-off-by: JamesMurkin <[email protected]>
d80tb7 and others added 28 commits July 12, 2024 14:35
Signed-off-by: Chris Martin <[email protected]>
Co-authored-by: Chris Martin <[email protected]>
Signed-off-by: Chris Martin <[email protected]>
Co-authored-by: Chris Martin <[email protected]>
Co-authored-by: robertdavidsmith <[email protected]>
Signed-off-by: Chris Martin <[email protected]>
Co-authored-by: Chris Martin <[email protected]>
* remove jobservice

Signed-off-by: Chris Martin <[email protected]>

* go mod tidy

Signed-off-by: Chris Martin <[email protected]>

* remove grpc pool

Signed-off-by: Chris Martin <[email protected]>

* remove json marshaller

Signed-off-by: Chris Martin <[email protected]>

* Removing remaining jobservice references

* Using built in JSON marshaller to avoid compile errors

* Go mod tidy

---------

Signed-off-by: Chris Martin <[email protected]>
Co-authored-by: Chris Martin <[email protected]>
Co-authored-by: Mustafa <[email protected]>
Signed-off-by: Chris Martin <[email protected]>
Co-authored-by: Chris Martin <[email protected]>
This deepcopy isn't actually doing anything useful and is quite slow
 - In tests where the scheduler was only evicting/rescheduling jobs, it was 10% of the scheduling round

It is no longer needed, it was originally added as the scheduler would modify the job it was scheduling. This is no longer the case and that is now added to the JobSchedulingContext instead

I've added a very brief test to confirm it no longer modifies the original jobs scheduling details
 - Ideally we'd check the job hasn't changed at all - but it isn't completely straight forward

Signed-off-by: JamesMurkin <[email protected]>
* [armadactl] Remove client side podspec validation

Reasons for removal are:
 - It isn't very useful, kubernetes client side validation is pretty weak
 - It can't be used without access to the internet

Alternatively we could add it back and embed the podspec schema, so it can be used offline.

However I think it provides too little value in reality to worth doing right now.

Signed-off-by: JamesMurkin <[email protected]>

* go mod tidy

Signed-off-by: JamesMurkin <[email protected]>

---------

Signed-off-by: JamesMurkin <[email protected]>
* implement cap

Signed-off-by: Chris Martin <[email protected]>

* lint

Signed-off-by: Chris Martin <[email protected]>

* lint

Signed-off-by: Chris Martin <[email protected]>

---------

Signed-off-by: Chris Martin <[email protected]>
Co-authored-by: Chris Martin <[email protected]>
…ore profiling details in sched config (#3802)

Co-authored-by: Robert Smith <[email protected]>
Signed-off-by: Robert Smith <[email protected]>
Signed-off-by: Chris Martin <[email protected]>
Co-authored-by: Chris Martin <[email protected]>
…rror handling (#3809)

Co-authored-by: Robert Smith <[email protected]>
Signed-off-by: Robert Smith <[email protected]>
* Add test for pulsarBatchSize on ingestion pipeline

This test just confirms pulsarBatchSize is limit the number of events we process at once correctly

Signed-off-by: JamesMurkin <[email protected]>

* Merge master - fix proto time

Signed-off-by: JamesMurkin <[email protected]>

* Gofumpt

Signed-off-by: JamesMurkin <[email protected]>

---------

Signed-off-by: JamesMurkin <[email protected]>
…3801)

We pass priorityClasses around a lot, but largely these are unused at the bottom of the stack
 - I've removed this passing where the priorityClass is not used

Removed a few other minor unused parameters

Signed-off-by: JamesMurkin <[email protected]>
* Use Publisher Eveywhere

Use the pulsarutils.Publisher everywhere we publish eventsequences

Motivations for this are:
 - Same code used to publish to pulsar used in all places
 - Better testing on that piece of code
 - Less work if we need to work on publisher in future (only 1 place to update)

Signed-off-by: JamesMurkin <[email protected]>

* Fix config validation

Signed-off-by: JamesMurkin <[email protected]>

* imports

Signed-off-by: JamesMurkin <[email protected]>

* gofumpt

Signed-off-by: JamesMurkin <[email protected]>

* Fix producer names

Signed-off-by: JamesMurkin <[email protected]>

---------

Signed-off-by: JamesMurkin <[email protected]>
* fix node label bug (#181)

* lint

Signed-off-by: Chris Martin <[email protected]>

* lint

Signed-off-by: Chris Martin <[email protected]>

---------

Signed-off-by: Chris Martin <[email protected]>
Co-authored-by: Christopher Martin <[email protected]>
Co-authored-by: Chris Martin <[email protected]>
…elease-rc workflows and conditions in build workflow (#3810)

Signed-off-by: Ivan Pavlovic <[email protected]>
* wip

Signed-off-by: Chris Martin <[email protected]>

* fix some tests

Signed-off-by: Chris Martin <[email protected]>

* fix some tests

Signed-off-by: Chris Martin <[email protected]>

* fix some tests

Signed-off-by: Chris Martin <[email protected]>

* lint

Signed-off-by: Chris Martin <[email protected]>

* update proto

Signed-off-by: Chris Martin <[email protected]>

---------

Signed-off-by: Chris Martin <[email protected]>
Co-authored-by: Chris Martin <[email protected]>
* wip

Signed-off-by: Chris Martin <[email protected]>

* add test

Signed-off-by: Chris Martin <[email protected]>

---------

Signed-off-by: Chris Martin <[email protected]>
Co-authored-by: Chris Martin <[email protected]>
* wip

Signed-off-by: Chris Martin <[email protected]>

* wip

Signed-off-by: Chris Martin <[email protected]>

* lint

Signed-off-by: Chris Martin <[email protected]>

* lint

Signed-off-by: Chris Martin <[email protected]>

---------

Signed-off-by: Chris Martin <[email protected]>
Co-authored-by: Chris Martin <[email protected]>
* remove stringinterner from nodedb

Signed-off-by: Chris Martin <[email protected]>

* fix case of labels being nil

Signed-off-by: Chris Martin <[email protected]>

---------

Signed-off-by: Chris Martin <[email protected]>
Co-authored-by: Chris Martin <[email protected]>
* Fixing unfeasible scheduling keys bug

* Renaming test parameter to be more consistent with other parameters

* Renaming constraints

Co-authored-by: Mustafa Ilyas <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants