Add bundle data snapshotting for resolution #433
Conversation
It's not clear to me how this resolves the substantive issue at hand, which also manifests when the resolution calls out to read cluster state more than once.
I would suggest a different approach, restructuring the resolution library top-down instead: don't pass in delegated / enclosed sources that fetch data on their own time; instead, expect as input to the resolution pass a full set of variables over which resolution will occur. The client/caller must assemble these intentionally and run them through the resolver.
This would make it much clearer what the lifetimes of the objects are, and much less likely that other mistakes occur somewhere in the myriad layers of abstraction that exist today in sources.
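A minimal sketch of the shape being suggested, in Go. All names here (ResolutionInputs, Resolve, the field types) are hypothetical illustrations, not the actual operator-controller API:

package resolution

// Hypothetical illustration types, not the real operator-controller ones.
type Bundle struct{ Package, Version string }
type Operator struct{ Name string }
type Solution struct{ Selected []Bundle }

// ResolutionInputs is the full set of variables resolution runs over.
// The caller assembles these once, up front, before invoking the resolver.
type ResolutionInputs struct {
	Bundles   []Bundle
	Operators []Operator
}

// Resolve is a pure function over its inputs: it never fetches anything,
// so it cannot observe two different views of the world in one pass.
func Resolve(in ResolutionInputs) (Solution, error) {
	// ... solver logic operating only on `in` ...
	return Solution{}, nil
}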
Codecov Report

@@            Coverage Diff             @@
##             main     #433      +/-   ##
==========================================
- Coverage   84.00%   83.68%    -0.32%
==========================================
  Files          23       23
  Lines         844      852        +8
==========================================
+ Hits          709      713        +4
- Misses         93       95        +2
- Partials       42       44        +2
==========================================
This aims to address #418, which is about bundle information being retrieved and processed multiple times. We no longer do that: we retrieve and process it once and hold it for the duration of resolution.
I tried out a few different approaches, and one of them was fetching bundles somewhere before this line:
And passing a slice of all bundles down to all variable sources, somehow like this: solution, err := r.NewSolver(allBundles).Solve(ctx). I think what I ended up with wasn't better than having a client¹.
I think variable sources are a bit messy today and can be simplified. But I do not see how passing a slice of bundles around, instead of a thing¹ which gives me bundles, makes it clearer. Also, why would I want to worry about the lifetime of an object within variable sources? From a variable source's perspective, the lifetime of the objects doesn't matter. For the whole resolution process it is important that variable sources see the same view of catalogs, and we achieve that. So my thinking is that:
Footnotes
Force-pushed from a29337a to 42f2555
Devil's advocate: what is the problem with updating your functions as they require new arguments? A clear list of arguments for a static transformation (variable inputs -> resolution output) is easy to read and defines all the requirements up front. How often do you think you need to change the set of inputs to your resolution? How much work is it to add the new data source to the function signature? Is this something that's happening so often that you need to find a way to optimize the amount of time it takes to write this code?
It's very simple: if I give you a static reference to the data you operate over, and your operation is a pure function, it's impossible to ever get into multiple-fetch / incoherent states, and trivial to unit test. Passing in objects that allow you to fetch data on demand is how the issues arise. Sure, you can wrap every single one of those to only ever return a slice of data, but to what end? Passing the slice itself is simple, and it's impossible to mess up and accidentally add a source in the future that changes during resolution.
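To illustrate the "trivial to unit test" claim, a sketch of a test against the hypothetical Resolve/ResolutionInputs from the earlier sketch (again, illustrative names only, not the real API):

package resolution

import "testing"

func TestResolveIsPureOverStaticInputs(t *testing.T) {
	// The entire world the resolver sees is this value:
	// no cluster, no fakes, no cache wiring.
	in := ResolutionInputs{
		Bundles: []Bundle{{Package: "pkg-a", Version: "1.0.0"}},
	}
	first, err := Resolve(in)
	if err != nil {
		t.Fatal(err)
	}
	second, err := Resolve(in)
	if err != nil {
		t.Fatal(err)
	}
	// Same inputs must yield the same output; there is nothing to re-fetch.
	if len(first.Selected) != len(second.Selected) {
		t.Fatal("resolution is not deterministic over static inputs")
	}
}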
+1 - make the caller assemble the inputs, and pass a static set of inputs in/around.
I think we need to be more prescriptive than "assemble inputs, pass them around" because, technically, that's exactly what already happens, right? The resolution code asks all of its variable sources for the inputs, and then it passes the static set of variables to the resolver. There are multiple levels of inputs:
In that PoC I did a while back, collecting all the cluster state up front is essentially what I did: https://github.com/joelanford/operator-controller/blob/31082f42fa9fbe61a4c9155683d0956e2425b774/internal/resolution/v2/variablesources/operator.go#L29-L42 In that PR, the only variable source was that one.
You have to pass arguments and store a reference even in places where you don't need them (intermediate components), at least with the current state of variable sources. When you pass around a struct/an interface, you add a new method and can immediately use it deep in the call stack. There is also a readability aspect: it is easier to read a shorter list of arguments, and when you have a "client" it is intuitively clear where the data is coming from. If something is not clear, you can read a comment in a single place (on the "client") which explains it, and you don't have to follow the breadcrumbs of slices passed from component to component.
In variable sources we very often filter input data. It is easy to pass a filtered slice of data down to the next function instead of the unfiltered slice of all bundles and get the wrong solution. It is harder to do this when you pass the "client", which always returns unfiltered data. I can see a number of potential bugs / failure scenarios with both approaches, and I don't think that one approach is substantially better than the other.
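A sketch of the hazard being described, with hypothetical names: both functions accept []Bundle, so handing the filtered slice to a step that expects all bundles still type-checks.

package main

import "fmt"

type Bundle struct{ Package string }

// pickPackage filters the full bundle set down to one package.
func pickPackage(all []Bundle, pkg string) []Bundle {
	var filtered []Bundle
	for _, b := range all {
		if b.Package == pkg {
			filtered = append(filtered, b)
		}
	}
	return filtered
}

// solve expects the *full* bundle set to search for dependencies in.
func solve(all []Bundle) {
	fmt.Println("solving over", len(all), "bundles")
}

func main() {
	all := []Bundle{{Package: "a"}, {Package: "b"}}
	filtered := pickPackage(all, "a")
	// Bug: both are []Bundle, so passing the filtered slice where the
	// full set was expected compiles fine and silently narrows the search.
	solve(filtered)
}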
I would like to reduce the number of variable sources. I created #437 for this topic.
We now snapshot bundle data in the client for the duration of a single resolution. This reduces reads (from the network or cache) and unmarshalling, since the data is in memory. It also ensures that the different variable sources involved in the resolution process have the same view of catalogs and bundles. Signed-off-by: Mikalai Radchuk <[email protected]>
Force-pushed from 42f2555 to a7f848f
Marking this ready for review. I'm not convinced so far that we should take another approach here: I do not think that passing slices around is substantially better than passing a client. Issue #418 is not on my priority list at the moment, and I have already spent more time on it than initially planned. I'm happy to address feedback with the current approach, but if we decide that we need to do it differently, I'll leave it for another time.
I think this statement is entirely predicated on how the code is factored - to what extent are intermediate components even necessary? Why? In any case, this seems like a small price to pay.
I'll be honest, I just got familiar with the v0 resolver implementation - which, if I understand correctly, is the basis for this one - and the myriad layers of abstraction are way, way harder to read and understand. I think you are really an expert in this domain, so take this with a grain of salt, but your experience is not universal.
This doesn't seem convincing. The resolver continues to expect that callers provide it the correct data. The surface of this problem has not changed one iota; a static approach simply calls that expectation out explicitly.
The critical thing is that one class of bug - multiple fetches leading to performance and correctness issues that cannot be mitigated - is entirely impossible. This is an incredibly valuable property and critical to a system that can be trusted to be correct over time. Adding dynamism to an inherently static process (the transform of resolution over inputs) in order to aid readability at the cost of correctness seems like a bad trade-off. The dozen open-forever bugs against v0 resolution that cannot be closed without rearchitecting the system are the result. If you would like to propose a different approach that meets your readability bar without sacrificing correctness, I'm all ears! I just suggested the static data approach as it's the simplest one I thought of at the time.
They are not necessary, and I do not like how variable sources are currently structured. It was quite hard for me to follow what was going on when I was learning the operator-controller codebase. There is #437 to simplify variable sources.
Variable sources are only invoked once per resolution: we only call out for the underlying data once.
Why impossible? In both approaches programmer error leads to correctness issues - be it someone accidentally overwriting a slice of "all bundles" or someone accidentally deleting caching code. The result is an incoherent view of the world in both approaches.
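For symmetry, a sketch of the "accidentally deleting caching code" failure mode (hypothetical names): without the snapshotting guard, two calls within one resolution can observe different data.

package main

import "fmt"

// liveBundles simulates catalog state that changes between reads.
var liveBundles = [][]string{
	{"a-v1", "b-v1"},
	{"a-v1"}, // catalog changed between the two reads
}

type Client struct{ reads int }

// Bundles used to cache its first result; imagine the caching code
// was accidentally deleted, so every call now hits the "live" state.
func (c *Client) Bundles() []string {
	out := liveBundles[c.reads]
	c.reads++
	return out
}

func main() {
	c := &Client{}
	fmt.Println("source 1 sees:", c.Bundles()) // [a-v1 b-v1]
	fmt.Println("source 2 sees:", c.Bundles()) // [a-v1] - incoherent view
}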
Where specifically do we undermine correctness here? As far as I understand, you are talking about potential bugs and regressions. But as we discussed above, both approaches have this potential:
Exactly. I think we cannot say for sure that with approach 1 programmers are less likely to mess up compared to approach 2.
@m1kola sorry, I am not sure where I am failing to explain this. Given two functions:

func ResolveDynamic(client client.Client) {}
func ResolveStatic(data []Item) {}

Programmer error always occurs. Using "but it's possible to horrendously clobber changes to this code" as an argument is not useful. I am making a fundamental point about what is and is not possible within the architectural constraints of each approach. You claim that variable resolution is done only once - if that's true, two things:
@stevekuznetsov I'll start with these questions:
Think of the client as a shared informer (again, "client" is probably not the best name for this).
We were talking about the possibility of accidentally introducing correctness issues. Here is one example of how it is possible with the static approach:

package main

import "fmt"

type Item struct {
	name string
}

func ResolveStatic(data []Item) {
	Step1(data)
	// Oops: accidentally damaged the slice, and the
	// rest of the steps get the wrong data.
	data = data[:1]
	Step2(data)
}

func Step1(data []Item) {
	fmt.Println("Step1", data)
}

func Step2(data []Item) {
	fmt.Println("Step2", data)
}

func main() {
	data := []Item{
		{name: "1"},
		{name: "2"},
	}
	ResolveStatic(data)
	fmt.Println("end", data)
}

I do think that there is no material difference in this aspect.
What you were referring to was due to incoherent caches (we had multiple caches where we only needed one, if I remember correctly). I see how this supports the "correctness is important" argument, and I totally agree that correctness is important. However, I do not see how it supports the "passing a slice around is better than a client/other data structure" argument. I do not think there is a substantial difference: a programmer can mess up either of these accidentally, as I illustrated above.
I'm just trying to show that there is no substantial difference between the two approaches:
Yes, it is true. But adding another call to a client/store doesn't matter if we make sure at a certain level that we return the correct objects (like we do in this PR). In this PR we dealt with the correctness & performance issues (same view on all levels, no extra unmarshalling), so it all boils down to how we pass the data: in slices, or encapsulated in a client/store.
That is only partially correct. The v0 code - and, likely, the v1 code - fetches data (either from live clients or from listers) more than once. Even parts of the system that do not use a cache suffer from correctness problems simply because they access data more than once from the cluster. The system is not guaranteed to see the same list of CSVs - or catalog data, or whatever - every time it is fetched. If the system calls for the same data more than once during a resolution, your sources can be incoherent and the output of the resolution will not be correct.
This is a pedantic distinction that makes no difference from the point of view of correctness. It does not matter how many layers of abstraction are put on top of the data fetching. If data fetching is allowed to happen more than once, the issue arises. My apologies if my imprecise language is getting in the way of communicating this in the specific terms of this library, but I do hope the broader point is clear.
No. Working around the issue (with the sync package, or approaches like that in this PR) may be possible, but my point has been - from the beginning - that it is inherently unnecessary. The problem statement of resolution runs against the state of the system at a particular point in time. Designing resolution to allow for fetching the state of the system more than once introduces these failure modes. Time is better spent forming a design that better mirrors the problem domain and makes those failure modes impossible.
@stevekuznetsov I agree. In v1 we deal with catalog data in this PR, but we still need to do the same for kube API data. I have some ideas for #437, and I think that as a side effect we will be fetching cluster data once as well. But we can create a separate issue to track that (if we don't already have one; I need to check).
I agree here too. That's why we prevent fetching the same data more than once in this PR. One resolution - one fetch (either from the network or from the FS cache).
🤷♂️ Just answering questions.
Absolutely. If we merge this PR, we will not be allowing fetching more than once during the same resolution.
With this PR it won't be possible to fetch the state of the system (the catalog) more than once. It will be exactly as you say: "resolution runs against the state of the system at a particular point in time". But good point on fetching the state of the cluster. I'm wrapping up for today, but tomorrow I'll check whether we already have an issue for this, so we don't rely on resolving it as a side effect of #437.
Description
We now snapshot bundle data in the client for the duration of a single resolution.
This reduces reads (from the network or cache) and unmarshalling, since the data stays in memory. It also ensures that the different variable sources involved in the resolution process have the same view of catalogs and bundles.
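A minimal sketch of the snapshotting idea, assuming a sync.Once-guarded fetch (the thread above mentions the sync package); the names here (SnapshottingClient, CatalogFetcher, FetchBundles) are hypothetical, and the actual PR code differs:

package client

import (
	"context"
	"sync"
)

type Bundle struct{ Package, Version string }

// CatalogFetcher is a stand-in for whatever actually reads catalog
// data from the network or the FS cache.
type CatalogFetcher interface {
	FetchBundles(ctx context.Context) ([]Bundle, error)
}

// SnapshottingClient fetches bundle data at most once and serves every
// subsequent call from the in-memory snapshot, so all variable sources
// see the same view for the duration of one resolution.
type SnapshottingClient struct {
	fetcher CatalogFetcher

	once    sync.Once
	bundles []Bundle
	err     error
}

func (c *SnapshottingClient) Bundles(ctx context.Context) ([]Bundle, error) {
	c.once.Do(func() {
		c.bundles, c.err = c.fetcher.FetchBundles(ctx)
	})
	return c.bundles, c.err
}

In this sketch a fresh SnapshottingClient would be created per resolution, so the snapshot's lifetime matches exactly one resolution pass.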
Closes #418
Reviewer Checklist