forked from armadaproject/armada
-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[pull] master from armadaproject:master #17
Merged
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
* adding Geoff to approvers list Signed-off-by: David Gantenbein <[email protected]> * adding Geoff to maintainers list Signed-off-by: David Gantenbein <[email protected]> --------- Signed-off-by: David Gantenbein <[email protected]>
Signed-off-by: JamesMurkin <[email protected]>
* Limit the number of events we put into each Pulsar Message Currently we can publish very large events per message (100k+ events per message) This can make the time to process messages quite unpredictable, as they can be anywhere between 1 event and 100000+ events Now we restrict how many messages we put into each message (via `maxAllowedEventsPerMessage`), which should make how many changes a given message may contain somewhat more predictable Signed-off-by: JamesMurkin <[email protected]> * Add unit test Signed-off-by: JamesMurkin <[email protected]> * Fix unit test Signed-off-by: JamesMurkin <[email protected]> * Fix unit test Signed-off-by: JamesMurkin <[email protected]> * Fix passing field through ExecutorAPI Signed-off-by: JamesMurkin <[email protected]> --------- Signed-off-by: JamesMurkin <[email protected]>
* Add assignedNodeId field to JobSchedulingContext This makes it far faster to look up the currently assigned node for a given evicted job - Currently it uses a map look up which is comparatively slow Signed-off-by: JamesMurkin <[email protected]> * Add test for Get/Set AssignedNodeId Signed-off-by: JamesMurkin <[email protected]> * Fix check for assigned node Signed-off-by: JamesMurkin <[email protected]> * Fix unit test Signed-off-by: JamesMurkin <[email protected]> --------- Signed-off-by: JamesMurkin <[email protected]>
Signed-off-by: Chris Martin <[email protected]> Co-authored-by: Chris Martin <[email protected]>
* Make ingesters limit how many events they process at once Currently we limit how many messages are processed at once However a single message may contain thousands of events, so limit on messages is a bit flawed - A limit of 1000, could mean you're processing 1000 or 1000000+ events at once This should make the ingesters less prone to long pauses if some large messages come along Also adjusted event ingester config to be more standard with other config naming Signed-off-by: JamesMurkin <[email protected]> * Comment improvement Signed-off-by: JamesMurkin <[email protected]> * Comment improvement Signed-off-by: JamesMurkin <[email protected]> * gofumpt Signed-off-by: JamesMurkin <[email protected]> * Improve config descriptions Signed-off-by: JamesMurkin <[email protected]> * Limit Pulsar messages to have a configurable max number of events per message Currently we can publish very large messages (100k+ events per message) This can make the time to process messages quite unpredictable, as they can be anywhere between 1 event and 100000+ events Now we restrict how many messages we put into each message (via `maxAllowedEventsPerMessage`), which should make how many changes a given message may contain somewhat more predictable Signed-off-by: JamesMurkin <[email protected]> * Revert "Limit Pulsar messages to have a configurable max number of events per message" This reverts commit 11a8a2a. * Rename config to MaxOutputMessageSizeBytes Signed-off-by: JamesMurkin <[email protected]> --------- Signed-off-by: JamesMurkin <[email protected]>
* Improve performance of selectNodeForJobWithFairPreemption Testing with the worst case, this makes the function 2-4 times faster - Combined with #3678 it is about 3-5 times faster Even the average case should be improved, as we now do less work The main "trick" here is to not perform node.UnsafeCopy() and nodeDb.unbindJobFromNodeInPlace until it is needed (once you know which node is the selected one) - These functions are expensive, particularly UnsafeCopy Instead just keep track of the evicted nodes + available resource locally until you've found one that matches. There are more enhancements that could be made here: - Work out when it is better to check static requirements first (possibly even save these and use them for all jobs with the same scheduling key) - Improve nodeDb speed to look up a node (use a map?) - Improve nodeDb speed to iterate evicted jobs (use a slice? / effectively we just need a sorted hashmap, not a fully fledged radix tree) Signed-off-by: JamesMurkin <[email protected]> * Lint Signed-off-by: JamesMurkin <[email protected]> * Move struct definition inside function Signed-off-by: JamesMurkin <[email protected]> --------- Signed-off-by: JamesMurkin <[email protected]>
* Syncing out Armada Airflow Operator (#164) * changes * Fixing tests * Formatting * Making integration tests work * Formatting files * Formatting, ensuring Armada trigger can cancel * Fixing docs and adding back examples * Fixing docs * Addressing comments, supporting more python versions * Updating docs * Removing airflow operator build script from github workflow check --------- Co-authored-by: Mustafa Ilyas <[email protected]>
Signed-off-by: Chris Martin <[email protected]> Co-authored-by: Chris Martin <[email protected]>
* wip Signed-off-by: Chris Martin <[email protected]> * wip Signed-off-by: Chris Martin <[email protected]> * wip Signed-off-by: Chris Martin <[email protected]> * wip Signed-off-by: Chris Martin <[email protected]> * unit tests for ingester Signed-off-by: Chris Martin <[email protected]> * wip Signed-off-by: Chris Martin <[email protected]> * wip Signed-off-by: Chris Martin <[email protected]> * fix tests Signed-off-by: Chris Martin <[email protected]> * lint Signed-off-by: Chris Martin <[email protected]> * extra test Signed-off-by: Chris Martin <[email protected]> * code review comments Signed-off-by: Chris Martin <[email protected]> --------- Signed-off-by: Chris Martin <[email protected]> Co-authored-by: Chris Martin <[email protected]>
* wip Signed-off-by: Chris Martin <[email protected]> * wip Signed-off-by: Chris Martin <[email protected]> * fix some tests Signed-off-by: Chris Martin <[email protected]> * fix some tests Signed-off-by: Chris Martin <[email protected]> * wip Signed-off-by: Chris Martin <[email protected]> * extra test Signed-off-by: Chris Martin <[email protected]> * allow ability to disable adjusted fair share protection. Signed-off-by: Chris Martin <[email protected]> * lint Signed-off-by: Chris Martin <[email protected]> * extra test Signed-off-by: Chris Martin <[email protected]> --------- Signed-off-by: Chris Martin <[email protected]> Co-authored-by: Chris Martin <[email protected]>
* Make ingesters limit how many events they process at once Currently we limit how many messages are processed at once However a single message may contain thousands of events, so limit on messages is a bit flawed - A limit of 1000, could mean you're processing 1000 or 1000000+ events at once This should make the ingesters less prone to long pauses if some large messages come along Also adjusted event ingester config to be more standard with other config naming Signed-off-by: JamesMurkin <[email protected]> * Comment improvement Signed-off-by: JamesMurkin <[email protected]> * Comment improvement Signed-off-by: JamesMurkin <[email protected]> * gofumpt Signed-off-by: JamesMurkin <[email protected]> * Improve config descriptions Signed-off-by: JamesMurkin <[email protected]> * Limit Pulsar messages to have a configurable max number of events per message Currently we can publish very large messages (100k+ events per message) This can make the time to process messages quite unpredictable, as they can be anywhere between 1 event and 100000+ events Now we restrict how many messages we put into each message (via `maxAllowedEventsPerMessage`), which should make how many changes a given message may contain somewhat more predictable Signed-off-by: JamesMurkin <[email protected]> * Revert "Limit Pulsar messages to have a configurable max number of events per message" This reverts commit 11a8a2a. * Improve logging in ingester pipeline This should help us understand what is happening in our ingestion pipelines - Should log if we are no longer receiving pulsar messages for 2mins - Will log a summary of how many messages and event in each "batch" - Will log a summary of the types of events in each batch - Will log a summary of how long Convert took for each batch This is admittedly quite a "quick" fix and better long term steps would be: - Metrics or spans - Some of these could be at the ingseter pipeline level (generic) - Some would need to be done in each ingester to expose more detailed information such as which query is all the time being spent on Signed-off-by: JamesMurkin <[email protected]> --------- Signed-off-by: JamesMurkin <[email protected]>
* Support state rejected and job errors Signed-off-by: Robert Smith <[email protected]> * fix test Signed-off-by: Robert Smith <[email protected]> --------- Signed-off-by: Robert Smith <[email protected]> Signed-off-by: Chris Martin <[email protected]> Co-authored-by: Chris Martin <[email protected]>
The API for working out if a timer is stopped / needs draining is more complicated than I understood. We could fix what we have to use it properly, but instead use simpler approach that isn't error prone It also means we can print how long it has been since we last received an event Signed-off-by: JamesMurkin <[email protected]> Co-authored-by: Chris Martin <[email protected]>
* wip Signed-off-by: Chris Martin <[email protected]> * add tests Signed-off-by: Chris Martin <[email protected]> * lint Signed-off-by: Chris Martin <[email protected]> * lint Signed-off-by: Chris Martin <[email protected]> * lint Signed-off-by: Chris Martin <[email protected]> * code review Signed-off-by: Chris Martin <[email protected]> --------- Signed-off-by: Chris Martin <[email protected]> Co-authored-by: Chris Martin <[email protected]>
The function that counts the number of events per event sequence is wrong as it is actually counting the number of event sequences (each of which can contain many events) This just makes us actually count all the events in all event sequences Signed-off-by: JamesMurkin <[email protected]>
* wip Signed-off-by: Chris Martin <[email protected]> * wip Signed-off-by: Chris Martin <[email protected]> * wip Signed-off-by: Chris Martin <[email protected]> * code review comment Signed-off-by: Chris Martin <[email protected]> --------- Signed-off-by: Chris Martin <[email protected]> Co-authored-by: Chris Martin <[email protected]>
* initial timestamps Signed-off-by: Chris Martin <[email protected]> * initial timestamps Signed-off-by: Chris Martin <[email protected]> * initial timestamps Signed-off-by: Chris Martin <[email protected]> * tests Signed-off-by: Chris Martin <[email protected]> * regen proto Signed-off-by: Chris Martin <[email protected]> * fix diff Signed-off-by: Chris Martin <[email protected]> * more fixes Signed-off-by: Chris Martin <[email protected]> * lint Signed-off-by: Chris Martin <[email protected]> * fix tests Signed-off-by: Chris Martin <[email protected]> --------- Signed-off-by: Chris Martin <[email protected]> Co-authored-by: Chris Martin <[email protected]>
* initial timestamps Signed-off-by: Chris Martin <[email protected]> * initial timestamps Signed-off-by: Chris Martin <[email protected]> * initial timestamps Signed-off-by: Chris Martin <[email protected]> * tests Signed-off-by: Chris Martin <[email protected]> * regen proto Signed-off-by: Chris Martin <[email protected]> * fix diff Signed-off-by: Chris Martin <[email protected]> * submit.go Signed-off-by: Chris Martin <[email protected]> * submit.go Signed-off-by: Chris Martin <[email protected]> * more fixes Signed-off-by: Chris Martin <[email protected]> * more fixes Signed-off-by: Chris Martin <[email protected]> * lint Signed-off-by: Chris Martin <[email protected]> * fix tests Signed-off-by: Chris Martin <[email protected]> * fix proto gen Signed-off-by: Chris Martin <[email protected]> * fix test Signed-off-by: Chris Martin <[email protected]> * comment out test Signed-off-by: Chris Martin <[email protected]> * fix Signed-off-by: Chris Martin <[email protected]> * fix Signed-off-by: Chris Martin <[email protected]> * fix Signed-off-by: Chris Martin <[email protected]> * another test Signed-off-by: Chris Martin <[email protected]> * another test Signed-off-by: Chris Martin <[email protected]> * wip Signed-off-by: Chris Martin <[email protected]> * fix merge conflict Signed-off-by: Chris Martin <[email protected]> --------- Signed-off-by: Chris Martin <[email protected]> Co-authored-by: Chris Martin <[email protected]>
Signed-off-by: Chris Martin <[email protected]> Co-authored-by: Chris Martin <[email protected]>
Signed-off-by: Chris Martin <[email protected]> Co-authored-by: Chris Martin <[email protected]>
* initial timestamps Signed-off-by: Chris Martin <[email protected]> * initial timestamps Signed-off-by: Chris Martin <[email protected]> * initial timestamps Signed-off-by: Chris Martin <[email protected]> * tests Signed-off-by: Chris Martin <[email protected]> * regen proto Signed-off-by: Chris Martin <[email protected]> * fix diff Signed-off-by: Chris Martin <[email protected]> * submit.go Signed-off-by: Chris Martin <[email protected]> * submit.go Signed-off-by: Chris Martin <[email protected]> * more fixes Signed-off-by: Chris Martin <[email protected]> * more fixes Signed-off-by: Chris Martin <[email protected]> * lint Signed-off-by: Chris Martin <[email protected]> * fix tests Signed-off-by: Chris Martin <[email protected]> * fix proto gen Signed-off-by: Chris Martin <[email protected]> * fix test Signed-off-by: Chris Martin <[email protected]> * comment out test Signed-off-by: Chris Martin <[email protected]> * fix Signed-off-by: Chris Martin <[email protected]> * fix Signed-off-by: Chris Martin <[email protected]> * fix Signed-off-by: Chris Martin <[email protected]> * another test Signed-off-by: Chris Martin <[email protected]> * another test Signed-off-by: Chris Martin <[email protected]> * initial commit Signed-off-by: Chris Martin <[email protected]> * fix test Signed-off-by: Chris Martin <[email protected]> --------- Signed-off-by: Chris Martin <[email protected]> Co-authored-by: Chris Martin <[email protected]>
* initial timestamps Signed-off-by: Chris Martin <[email protected]> * initial timestamps Signed-off-by: Chris Martin <[email protected]> * initial timestamps Signed-off-by: Chris Martin <[email protected]> * tests Signed-off-by: Chris Martin <[email protected]> * regen proto Signed-off-by: Chris Martin <[email protected]> * fix diff Signed-off-by: Chris Martin <[email protected]> * submit.go Signed-off-by: Chris Martin <[email protected]> * submit.go Signed-off-by: Chris Martin <[email protected]> * more fixes Signed-off-by: Chris Martin <[email protected]> * more fixes Signed-off-by: Chris Martin <[email protected]> * lint Signed-off-by: Chris Martin <[email protected]> * fix tests Signed-off-by: Chris Martin <[email protected]> * fix proto gen Signed-off-by: Chris Martin <[email protected]> * fix test Signed-off-by: Chris Martin <[email protected]> * comment out test Signed-off-by: Chris Martin <[email protected]> * fix Signed-off-by: Chris Martin <[email protected]> * fix Signed-off-by: Chris Martin <[email protected]> * fix Signed-off-by: Chris Martin <[email protected]> * another test Signed-off-by: Chris Martin <[email protected]> * another test Signed-off-by: Chris Martin <[email protected]> * initial commit Signed-off-by: Chris Martin <[email protected]> * remove gogo from testsuite Signed-off-by: Chris Martin <[email protected]> * executor Signed-off-by: Chris Martin <[email protected]> * lint Signed-off-by: Chris Martin <[email protected]> * unit test Signed-off-by: Chris Martin <[email protected]> * added test Signed-off-by: Chris Martin <[email protected]> * lint Signed-off-by: Chris Martin <[email protected]> --------- Signed-off-by: Chris Martin <[email protected]> Co-authored-by: Chris Martin <[email protected]>
* remove timestamp Signed-off-by: Chris Martin <[email protected]> * remove timestamp Signed-off-by: Chris Martin <[email protected]> * fix import order Signed-off-by: Chris Martin <[email protected]> --------- Signed-off-by: Chris Martin <[email protected]> Co-authored-by: Chris Martin <[email protected]>
* fairness error Signed-off-by: Chris Martin <[email protected]> * Update internal/scheduler/scheduler_metrics.go Co-authored-by: JamesMurkin <[email protected]> --------- Signed-off-by: Chris Martin <[email protected]> Co-authored-by: Chris Martin <[email protected]> Co-authored-by: JamesMurkin <[email protected]>
* [WIP] Add support for mixed pool clusters Currently an executor (k8s cluster) must belong to a pool. This can be quite restricting if you have several nodes, all of which you want to be in their own pools. As you have to build a cluster for each group of nodes With this change we are making it so a Node belongs to a pool (marked via label) - and defaulting to the cluster pool - At some point after this migration, we may remove the cluster having a pool and instead have a static "default" pool As part of this change I've also removed the minimumJobSize functionality as it is unused - If we need it back, we'll need to add it as config at the scheduling level (minimum per pool) Signed-off-by: JamesMurkin <[email protected]> * Fix imports Signed-off-by: JamesMurkin <[email protected]> * gofumpt Signed-off-by: JamesMurkin <[email protected]> * Remove minimumJobSize, add tests, pass through Pool via JobRunLeased Signed-off-by: JamesMurkin <[email protected]> * Add tests + tidy Signed-off-by: JamesMurkin <[email protected]> * Format Signed-off-by: JamesMurkin <[email protected]> * Remove unused field Signed-off-by: JamesMurkin <[email protected]> * Default pool Signed-off-by: JamesMurkin <[email protected]> * Improve comment --------- Signed-off-by: JamesMurkin <[email protected]>
Signed-off-by: Chris Martin <[email protected]> Co-authored-by: Chris Martin <[email protected]>
Signed-off-by: Robert Smith <[email protected]>
Signed-off-by: Chris Martin <[email protected]> Co-authored-by: Chris Martin <[email protected]> Co-authored-by: robertdavidsmith <[email protected]>
Signed-off-by: Chris Martin <[email protected]> Co-authored-by: Chris Martin <[email protected]>
* remove jobservice Signed-off-by: Chris Martin <[email protected]> * go mod tidy Signed-off-by: Chris Martin <[email protected]> * remove grpc pool Signed-off-by: Chris Martin <[email protected]> * remove json marshaller Signed-off-by: Chris Martin <[email protected]> * Removing remaining jobservice references * Using built in JSON marshaller to avoid compile errors * Go mod tidy --------- Signed-off-by: Chris Martin <[email protected]> Co-authored-by: Chris Martin <[email protected]> Co-authored-by: Mustafa <[email protected]>
Signed-off-by: Chris Martin <[email protected]> Co-authored-by: Chris Martin <[email protected]>
This deepcopy isn't actually doing anything useful and is quite slow - In tests where the scheduler was only evicting/rescheduling jobs, it was 10% of the scheduling round It is no longer needed, it was originally added as the scheduler would modify the job it was scheduling. This is no longer the case and that is now added to the JobSchedulingContext instead I've added a very brief test to confirm it no longer modifies the original jobs scheduling details - Ideally we'd check the job hasn't changed at all - but it isn't completely straight forward Signed-off-by: JamesMurkin <[email protected]>
* [armadactl] Remove client side podspec validation Reasons for removal are: - It isn't very useful, kubernetes client side validation is pretty weak - It can't be used without access to the internet Alternatively we could add it back and embed the podspec schema, so it can be used offline. However I think it provides too little value in reality to worth doing right now. Signed-off-by: JamesMurkin <[email protected]> * go mod tidy Signed-off-by: JamesMurkin <[email protected]> --------- Signed-off-by: JamesMurkin <[email protected]>
Co-authored-by: Robert Smith <[email protected]>
* implement cap Signed-off-by: Chris Martin <[email protected]> * lint Signed-off-by: Chris Martin <[email protected]> * lint Signed-off-by: Chris Martin <[email protected]> --------- Signed-off-by: Chris Martin <[email protected]> Co-authored-by: Chris Martin <[email protected]>
Signed-off-by: Robert Smith <[email protected]>
…ore profiling details in sched config (#3802) Co-authored-by: Robert Smith <[email protected]> Signed-off-by: Robert Smith <[email protected]>
Signed-off-by: Chris Martin <[email protected]> Co-authored-by: Chris Martin <[email protected]>
…rror handling (#3809) Co-authored-by: Robert Smith <[email protected]> Signed-off-by: Robert Smith <[email protected]>
* Add test for pulsarBatchSize on ingestion pipeline This test just confirms pulsarBatchSize is limit the number of events we process at once correctly Signed-off-by: JamesMurkin <[email protected]> * Merge master - fix proto time Signed-off-by: JamesMurkin <[email protected]> * Gofumpt Signed-off-by: JamesMurkin <[email protected]> --------- Signed-off-by: JamesMurkin <[email protected]>
…3801) We pass priorityClasses around a lot, but largely these are unused at the bottom of the stack - I've removed this passing where the priorityClass is not used Removed a few other minor unused parameters Signed-off-by: JamesMurkin <[email protected]>
* Use Publisher Eveywhere Use the pulsarutils.Publisher everywhere we publish eventsequences Motivations for this are: - Same code used to publish to pulsar used in all places - Better testing on that piece of code - Less work if we need to work on publisher in future (only 1 place to update) Signed-off-by: JamesMurkin <[email protected]> * Fix config validation Signed-off-by: JamesMurkin <[email protected]> * imports Signed-off-by: JamesMurkin <[email protected]> * gofumpt Signed-off-by: JamesMurkin <[email protected]> * Fix producer names Signed-off-by: JamesMurkin <[email protected]> --------- Signed-off-by: JamesMurkin <[email protected]>
* fix node label bug (#181) * lint Signed-off-by: Chris Martin <[email protected]> * lint Signed-off-by: Chris Martin <[email protected]> --------- Signed-off-by: Chris Martin <[email protected]> Co-authored-by: Christopher Martin <[email protected]> Co-authored-by: Chris Martin <[email protected]>
…elease-rc workflows and conditions in build workflow (#3810) Signed-off-by: Ivan Pavlovic <[email protected]>
* wip Signed-off-by: Chris Martin <[email protected]> * fix some tests Signed-off-by: Chris Martin <[email protected]> * fix some tests Signed-off-by: Chris Martin <[email protected]> * fix some tests Signed-off-by: Chris Martin <[email protected]> * lint Signed-off-by: Chris Martin <[email protected]> * update proto Signed-off-by: Chris Martin <[email protected]> --------- Signed-off-by: Chris Martin <[email protected]> Co-authored-by: Chris Martin <[email protected]>
* wip Signed-off-by: Chris Martin <[email protected]> * add test Signed-off-by: Chris Martin <[email protected]> --------- Signed-off-by: Chris Martin <[email protected]> Co-authored-by: Chris Martin <[email protected]>
* wip Signed-off-by: Chris Martin <[email protected]> * wip Signed-off-by: Chris Martin <[email protected]> * lint Signed-off-by: Chris Martin <[email protected]> * lint Signed-off-by: Chris Martin <[email protected]> --------- Signed-off-by: Chris Martin <[email protected]> Co-authored-by: Chris Martin <[email protected]>
* remove stringinterner from nodedb Signed-off-by: Chris Martin <[email protected]> * fix case of labels being nil Signed-off-by: Chris Martin <[email protected]> --------- Signed-off-by: Chris Martin <[email protected]> Co-authored-by: Chris Martin <[email protected]>
* Fixing unfeasible scheduling keys bug * Renaming test parameter to be more consistent with other parameters * Renaming constraints Co-authored-by: Mustafa Ilyas <[email protected]>
pull-bot-infra
bot
merged commit Jul 30, 2024
8a64c0f
into
gr-oss-devops:master
92 of 101 checks passed
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
See Commits and Changes for more details.
Created by pull-bot-infra[bot]
Can you help keep this open source service alive? 💖 Please sponsor : )