-
Notifications
You must be signed in to change notification settings - Fork 163
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Discovery: Initial connection check #2459
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 16 files reviewed, 1 unresolved discussion (waiting on @scrye)
a discussion (no related file):
full run: https://buildkite.com/scionproto/scionproto/builds/1974
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 9 of 9 files at r1, 2 of 2 files at r2, 5 of 5 files at r3, 1 of 1 files at r4.
Reviewable status: all files reviewed, 11 unresolved discussions (waiting on @oncilla)
acceptance/discovery_util/util.sh, line 99 at r3 (raw file):
check_br_fail_action() { stop_mock_ds # Check that services continue if fail action is not set.
s/services/BR (continues)
acceptance/discovery_util/util.sh, line 105 at r3 (raw file):
local err="$( check_stopped_count "br$IA_FILE-1" 1 )" || fail "Error: br$IA_FILE-1 not running err=$err" # Check that services exit if fail action is fatal
s/services/BS (exits)
acceptance/discovery_util/util.sh, line 82 at r4 (raw file):
./tools/dc scion restart "scion_ps$IA_FILE-1" "scion_cs$IA_FILE-1" "scion_sd$IA_FILE" sleep 10 local err="$( check_stopped_count "ps$IA_FILE-1" 1 )" || fail "Error: ps$IA_FILE-1 not running err=$err"
Instead of log counting, wouldn't checking for process state be simpler?
acceptance/discovery_util/util.sh, line 118 at r4 (raw file):
check_stopped_count() { local c="$( grep "\[INFO\] =====================> Service stopped" "logs/$1.log" | wc -l )" [ "$c" == "$2" ] || echo "Invalid count expected=$2 actual=$c"
Doesn't the echo mean that this function is always successful, and thus none of the || fail
above can ever happen?
(I dislike bash so much...)
go/lib/infra/modules/idiscovery/config.go, line 92 at r1 (raw file):
} type Connect struct {
I'd call this ConnectParams
, Connect
is verb-y and looks odd as a type name.
go/lib/infra/modules/idiscovery/config.go, line 73 at r2 (raw file):
// Interval specifies the time between two queries. Interval util.DurWrap // Timeout specifies the timout for a single query.
s/timout/timeout
go/lib/infra/modules/idiscovery/idiscovery.go, line 188 at r2 (raw file):
// startPeriodicFetcher starts a runner that periodically fetches the topology. // If during the InitialPeriod no topology is successfully fetched, the process takes // the configrued FailAction.
s/configrued/configured
go/lib/infra/modules/idiscovery/idiscovery.go, line 203 at r2 (raw file):
r.fetcher = periodic.StartPeriodicTask(fetcher, periodic.NewTicker(cfg.Interval.Duration), cfg.Timeout.Duration) // Start periodic checker and trigger immediate fetch from discovery service.
Can probably delete this, code is self-explanatory.
go/lib/infra/modules/idiscovery/idiscovery.go, line 205 at r2 (raw file):
// Start periodic checker and trigger immediate fetch from discovery service. r.fetcher.TriggerRun() r.checker = periodic.StartPeriodicTask(connChecker{runner: r}, periodic.NewTicker(time.Second),
I might still not fully grasp how this works, but bear with me. Suppose we have the following config:
discovery.static.Timeout = 5s
discovery.static.Interval = 5s
discovery.static.connect.InitialPeriod = 20s
The checker will attempt to tick every second, but the fetcher might still be running the older iteration of the task. This will cause the checker goroutine to block until the trigger channel of the fetcher is drained (which is not deterministic if a new tick interval elapsed on the fetcher).
This is because TriggerRun
blocks until the run is actually started.
The outcome is that the periodicity of the initial fetch is affected, and will not behave as promised: On start up, the discovery service is queried every second until a valid response is received or the InitialPeriod has passed.
.
go/lib/infra/modules/idiscovery/idiscovery.go, line 208 at r2 (raw file):
time.Second) // Kill periodic checker after initial period has passed. time.AfterFunc(cfg.Connect.InitialPeriod.Duration, func() {
On success, this AfterFunc
uses up a no-op goroutine until the cfg.Connect.InitialPeriod.Duration
elapses. This is because the timer returned by the AfterFunc
call is never stopped.
It's not a big issue, since it happens rarely and isn't too costly.
go/lib/infra/modules/idiscovery/idiscoverytest/config.go, line 22 at r1 (raw file):
. "github.com/smartystreets/goconvey/convey" . "github.com/scionproto/scion/go/lib/infra/modules/idiscovery"
Do you need the .
import? We tend to avoid it in the codebase, and there are only a few types from that package in here.
213331c
to
658b722
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 7 of 16 files reviewed, 10 unresolved discussions (waiting on @scrye)
acceptance/discovery_util/util.sh, line 105 at r3 (raw file):
Previously, scrye (Sergiu Costea) wrote…
s/services/BS (exits)
Done.
acceptance/discovery_util/util.sh, line 82 at r4 (raw file):
Previously, scrye (Sergiu Costea) wrote…
Instead of log counting, wouldn't checking for process state be simpler?
Let's try if it works on ci.
acceptance/discovery_util/util.sh, line 118 at r4 (raw file):
Previously, scrye (Sergiu Costea) wrote…
Doesn't the echo mean that this function is always successful, and thus none of the
|| fail
above can ever happen?(I dislike bash so much...)
Done.
go/lib/infra/modules/idiscovery/config.go, line 92 at r1 (raw file):
Previously, scrye (Sergiu Costea) wrote…
I'd call this
ConnectParams
,Connect
is verb-y and looks odd as a type name.
Done.
go/lib/infra/modules/idiscovery/config.go, line 73 at r2 (raw file):
Previously, scrye (Sergiu Costea) wrote…
s/timout/timeout
Done.
go/lib/infra/modules/idiscovery/idiscovery.go, line 188 at r2 (raw file):
Previously, scrye (Sergiu Costea) wrote…
s/configrued/configured
Done.
go/lib/infra/modules/idiscovery/idiscovery.go, line 203 at r2 (raw file):
Previously, scrye (Sergiu Costea) wrote…
Can probably delete this, code is self-explanatory.
Done.
go/lib/infra/modules/idiscovery/idiscovery.go, line 205 at r2 (raw file):
Previously, scrye (Sergiu Costea) wrote…
I might still not fully grasp how this works, but bear with me. Suppose we have the following config:
discovery.static.Timeout = 5s discovery.static.Interval = 5s discovery.static.connect.InitialPeriod = 20s
The checker will attempt to tick every second, but the fetcher might still be running the older iteration of the task. This will cause the checker goroutine to block until the trigger channel of the fetcher is drained (which is not deterministic if a new tick interval elapsed on the fetcher).
This is because
TriggerRun
blocks until the run is actually started.The outcome is that the periodicity of the initial fetch is affected, and will not behave as promised:
On start up, the discovery service is queried every second until a valid response is received or the InitialPeriod has passed.
.
You are right. Refactored the code quite a bit to fulfill the promise.
Done.
go/lib/infra/modules/idiscovery/idiscovery.go, line 208 at r2 (raw file):
Previously, scrye (Sergiu Costea) wrote…
On success, this
AfterFunc
uses up a no-op goroutine until thecfg.Connect.InitialPeriod.Duration
elapses. This is because the timer returned by theAfterFunc
call is never stopped.It's not a big issue, since it happens rarely and isn't too costly.
Gone after the refactor.
go/lib/infra/modules/idiscovery/idiscoverytest/config.go, line 22 at r1 (raw file):
Previously, scrye (Sergiu Costea) wrote…
Do you need the
.
import? We tend to avoid it in the codebase, and there are only a few types from that package in here.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 2 files at r6, 4 of 5 files at r7, 1 of 4 files at r9.
Reviewable status: 10 of 16 files reviewed, 10 unresolved discussions (waiting on @scrye)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 4 files at r9, 4 of 4 files at r10.
Reviewable status: 15 of 16 files reviewed, 1 unresolved discussion (waiting on @scrye)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 1 files at r11.
Reviewable status:complete! all files reviewed, all discussions resolved
Add initial connect period to the discovery.Config. Adjust the sample files accordingly.
idiscovery now tracks whether a topology has been successfully fetched from the discovery service in the initial period. During this period the discovery service is queried every second. The applicable actions are `Fatal` and `Continue`.
Check that the services respect the configured fail action.
e9a42e6
to
cb71a5d
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 2 files at r20.
Reviewable status: 6 of 17 files reviewed, 1 unresolved discussion (waiting on @oncilla and @scrye)
acceptance/common.sh, line 92 at r20 (raw file):
####################################### # Returns whether is running in docker
The syntax is a bit weird (I thought a word was missing, until I realized it is referring to the script itself).
Maybe reword to Returns whether this script is running in docker
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 2 files at r13, 4 of 5 files at r14, 2 of 4 files at r16, 4 of 4 files at r17.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @oncilla)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 16 of 17 files reviewed, 1 unresolved discussion (waiting on @scrye)
acceptance/common.sh, line 92 at r20 (raw file):
Previously, scrye (Sergiu Costea) wrote…
The syntax is a bit weird (I thought a word was missing, until I realized it is referring to the script itself).
Maybe reword to
Returns whether this script is running in docker
.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 1 files at r21.
Reviewable status:complete! all files reviewed, all discussions resolved
Enables the
idiscovery
library to initially check that topologies can be successfully fetched from the discovery service.On start up, the discovery service is queried every second until a valid response is received or the
InitialPeriod
has passed. If that period has passed, and no valid topology has been received,the process will take the configured fail action:
Fatal
: Exit process with errorIgnore
: Ignore, and keep the periodic queries runningfixes #2431
This change is![Reviewable](https://camo.githubusercontent.com/1541c4039185914e83657d3683ec25920c672c6c5c7ab4240ee7bff601adec0b/68747470733a2f2f72657669657761626c652e696f2f7265766965775f627574746f6e2e737667)