Add compute agent store for ncproxy reconnect #1097
Conversation
Force-pushed from 920ce94 to dda49a1.
Force-pushed from 2a506df to 00296f9.
internal/uvm/network.go (Outdated)

```go
if err != nil {
	return nil, errors.Wrap(err, "failed to connect to ncproxy service")
}
client := ttrpc.NewClient(conn, ttrpc.WithOnClose(func() { conn.Close() }))
```
How does the underlying conn get cleaned up? I see we will close it when the client is closed here, but then we cast it to a type which doesn't implement io.Closer.
Ping on this
You're correct; reading through the code, it does get leaked. I'll work on a fix for this.
I've fixed this in recent iterations
Keeping this open so this can be reviewed and validated as such :)
It would be helpful to have a comment or commit description explaining how the overall ncproxy reconnect scenario will work.
Force-pushed from 00296f9 to 927da7c.
Made some changes in #1126 that I will wait to merge so I can make relevant changes in this PR.
Force-pushed from a117708 to 11c3c61.
Force-pushed from f76d5a1 to e4152d9.
Updated this PR to rebase and add new tests now that the above is merged.
cmd/ncproxy/server.go (Outdated)

```go
// reconnectComputeAgents creates new compute agent connections from the database of
// active compute agent addresses and adds them to the compute agent client cache.
// This MUST be called before the server starts serving anything so that we can
// ensure that the cache is ready when it does.
```
This comment or some other location could use additional detail on how ncproxy reconnect is intended to work. Documenting the database schema would be helpful too.
What additional information do you think would be helpful to detail how the reconnect flow works?
Something like:
reconnectComputeAgents
Ncproxy maintains a cache of active compute agents in order to reestablish connections if the service is restarted. The cache is persisted in a bolt database with the following schema:
On restart ncproxy will attempt to create new compute agent connections from the database of active compute agent addresses and add them to its compute agent client cache. Reconnect MUST be called before the server is allowed to start serving anything so that we can ensure that the cache is ready. Reconnections are performed in parallel to improve service startup performance.
There are a few failure modes for reconnect. If a compute agent entry is stale the reconnect will fail in the following way: xxxxx. Other failure modes are possible but not expected. In those cases we log the failures but allow the service start to proceed. We chose this approach vs just failing service start for the following reasons xxxx.
Added!
Force-pushed from e4152d9 to e4dd6a1.
Signed-off-by: Kathryn Baldauf <[email protected]>
Force-pushed from e4dd6a1 to 6d440d1.
lgtm
cmd/ncproxy/ncproxy.go (Outdated)

```diff
@@ -599,14 +649,29 @@ func (s *grpcService) GetNetworks(ctx context.Context, req *ncproxygrpc.GetNetwo
 // TTRPC service exposed for use by the shim.
 type ttrpcService struct {
 	containerIDToComputeAgent *computeAgentCache
 	agentStore                *computeAgentStore
```
It would be good to add a comment here clarifying what the difference is between a "store" and a "cache". That distinction is a bit confusing currently.
done
cmd/ncproxy/run.go (Outdated)

```diff
@@ -89,6 +107,7 @@ func run(clicontext *cli.Context) error {
 	var (
 		configPath = clicontext.GlobalString("config")
 		logDir     = clicontext.GlobalString("log-directory")
+		dbPath     = clicontext.GlobalString("log-directory")
```
Should be database-path instead of log-directory?
Good catch :) fixed
Couple of changes, otherwise looking good
Signed-off-by: Kathryn Baldauf <[email protected]>
Force-pushed from 6d440d1 to bf9daee.
LGTM assuming CI is green :)
Related work items: microsoft#1067, microsoft#1097, microsoft#1119, microsoft#1170, microsoft#1176, microsoft#1180, microsoft#1181, microsoft#1182, microsoft#1183, microsoft#1184, microsoft#1185, microsoft#1186, microsoft#1187, microsoft#1188, microsoft#1189, microsoft#1191, microsoft#1193, microsoft#1194, microsoft#1195, microsoft#1196, microsoft#1197, microsoft#1200, microsoft#1201, microsoft#1202, microsoft#1203, microsoft#1204, microsoft#1205, microsoft#1206, microsoft#1207, microsoft#1209, microsoft#1210, microsoft#1211, microsoft#1218, microsoft#1219, microsoft#1220, microsoft#1223
…store Add compute agent store for ncproxy reconnect
This PR adds a bolt database store for ncproxy to store container IDs to compute agent addresses so we can reconnect to all compute agents if/when ncproxy restarts.
Context:
Ncproxy runs as a proxy service. It has one client connection to the node network agent and runs as a server for the node network agent in return. It additionally runs as one server for all compute agents and as a client to many compute agent servers.
Restart logic:
On restart, ncproxy will create a new client connection to the node network agent. It will then start up a new server for the node network agent. When the node network agent attempts to create a new client to ncproxy, it will be reconnected to this new server. Then ncproxy will create a new server for compute agent connections. When compute agents attempt to create new clients to ncproxy, they will be reconnected to this new server. Lastly, ncproxy will attempt to create new clients to the compute agent servers by reading the new computeAgentStore database for the container IDs and corresponding compute agent server addresses.

Signed-off-by: Kathryn Baldauf [email protected]