Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Discovery: Initial connection check #2459

Merged
merged 10 commits into from
Feb 15, 2019

Conversation

oncilla
Copy link
Contributor

@oncilla oncilla commented Feb 14, 2019

Enables the idiscovery library to initially check that topologies can be successfully fetched from the discovery service.
On start up, the discovery service is queried every second until a valid response is received or the InitialPeriod has passed. If that period has passed, and no valid topology has been received,
the process will take the configured fail action:

  • Fatal: Exit process with error
  • Ignore: Ignore, and keep the periodic queries running

fixes #2431


This change is Reviewable

@oncilla oncilla added this to the Q1S2 milestone Feb 14, 2019
@oncilla oncilla requested a review from scrye February 14, 2019 15:52
Copy link
Contributor Author

@oncilla oncilla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 16 files reviewed, 1 unresolved discussion (waiting on @scrye)

a discussion (no related file):
full run: https://buildkite.com/scionproto/scionproto/builds/1974


Copy link
Contributor

@scrye scrye left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 9 of 9 files at r1, 2 of 2 files at r2, 5 of 5 files at r3, 1 of 1 files at r4.
Reviewable status: all files reviewed, 11 unresolved discussions (waiting on @oncilla)


acceptance/discovery_util/util.sh, line 99 at r3 (raw file):

check_br_fail_action() {
    stop_mock_ds
    # Check that services continue if fail action is not set.

s/services/BR (continues)


acceptance/discovery_util/util.sh, line 105 at r3 (raw file):

    local err="$( check_stopped_count "br$IA_FILE-1" 1 )" || fail "Error: br$IA_FILE-1 not running err=$err"

    # Check that services exit if fail action is fatal

s/services/BS (exits)


acceptance/discovery_util/util.sh, line 82 at r4 (raw file):

    ./tools/dc scion restart "scion_ps$IA_FILE-1" "scion_cs$IA_FILE-1" "scion_sd$IA_FILE"
    sleep 10
    local err="$( check_stopped_count "ps$IA_FILE-1" 1 )" || fail "Error: ps$IA_FILE-1 not running err=$err"

Instead of log counting, wouldn't checking for process state be simpler?


acceptance/discovery_util/util.sh, line 118 at r4 (raw file):

check_stopped_count() {
    local c="$( grep "\[INFO\] =====================> Service stopped" "logs/$1.log" | wc -l )"
    [ "$c" == "$2" ] || echo "Invalid count expected=$2 actual=$c"

Doesn't the echo mean that this function is always successful, and thus none of the || fail above can ever happen?

(I dislike bash so much...)


go/lib/infra/modules/idiscovery/config.go, line 92 at r1 (raw file):

}

type Connect struct {

I'd call this ConnectParams, Connect is verb-y and looks odd as a type name.


go/lib/infra/modules/idiscovery/config.go, line 73 at r2 (raw file):

	// Interval specifies the time between two queries.
	Interval util.DurWrap
	// Timeout specifies the timout for a single query.

s/timout/timeout


go/lib/infra/modules/idiscovery/idiscovery.go, line 188 at r2 (raw file):

// startPeriodicFetcher starts a runner that periodically fetches the topology.
// If during the InitialPeriod no topology is successfully fetched, the process takes
// the configrued FailAction.

s/configrued/configured


go/lib/infra/modules/idiscovery/idiscovery.go, line 203 at r2 (raw file):

	r.fetcher = periodic.StartPeriodicTask(fetcher, periodic.NewTicker(cfg.Interval.Duration),
		cfg.Timeout.Duration)
	// Start periodic checker and trigger immediate fetch from discovery service.

Can probably delete this, code is self-explanatory.


go/lib/infra/modules/idiscovery/idiscovery.go, line 205 at r2 (raw file):

	// Start periodic checker and trigger immediate fetch from discovery service.
	r.fetcher.TriggerRun()
	r.checker = periodic.StartPeriodicTask(connChecker{runner: r}, periodic.NewTicker(time.Second),

I might still not fully grasp how this works, but bear with me. Suppose we have the following config:

discovery.static.Timeout = 5s
discovery.static.Interval = 5s
discovery.static.connect.InitialPeriod = 20s

The checker will attempt to tick every second, but the fetcher might still be running the older iteration of the task. This will cause the checker goroutine to block until the trigger channel of the fetcher is drained (which is not deterministic if a new tick interval elapsed on the fetcher).

This is because TriggerRun blocks until the run is actually started.

The outcome is that the periodicity of the initial fetch is affected, and will not behave as promised: On start up, the discovery service is queried every second until a valid response is received or the InitialPeriod has passed. .


go/lib/infra/modules/idiscovery/idiscovery.go, line 208 at r2 (raw file):

		time.Second)
	// Kill periodic checker after initial period has passed.
	time.AfterFunc(cfg.Connect.InitialPeriod.Duration, func() {

On success, this AfterFunc uses up a no-op goroutine until the cfg.Connect.InitialPeriod.Duration elapses. This is because the timer returned by the AfterFunc call is never stopped.

It's not a big issue, since it happens rarely and isn't too costly.


go/lib/infra/modules/idiscovery/idiscoverytest/config.go, line 22 at r1 (raw file):

	. "github.com/smartystreets/goconvey/convey"

	. "github.com/scionproto/scion/go/lib/infra/modules/idiscovery"

Do you need the . import? We tend to avoid it in the codebase, and there are only a few types from that package in here.

@oncilla oncilla force-pushed the infra-initial-connect branch from 213331c to 658b722 Compare February 15, 2019 13:49
Copy link
Contributor Author

@oncilla oncilla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 7 of 16 files reviewed, 10 unresolved discussions (waiting on @scrye)


acceptance/discovery_util/util.sh, line 105 at r3 (raw file):

Previously, scrye (Sergiu Costea) wrote…

s/services/BS (exits)

Done.


acceptance/discovery_util/util.sh, line 82 at r4 (raw file):

Previously, scrye (Sergiu Costea) wrote…

Instead of log counting, wouldn't checking for process state be simpler?

Let's try if it works on ci.


acceptance/discovery_util/util.sh, line 118 at r4 (raw file):

Previously, scrye (Sergiu Costea) wrote…

Doesn't the echo mean that this function is always successful, and thus none of the || fail above can ever happen?

(I dislike bash so much...)

Done.


go/lib/infra/modules/idiscovery/config.go, line 92 at r1 (raw file):

Previously, scrye (Sergiu Costea) wrote…

I'd call this ConnectParams, Connect is verb-y and looks odd as a type name.

Done.


go/lib/infra/modules/idiscovery/config.go, line 73 at r2 (raw file):

Previously, scrye (Sergiu Costea) wrote…

s/timout/timeout

Done.


go/lib/infra/modules/idiscovery/idiscovery.go, line 188 at r2 (raw file):

Previously, scrye (Sergiu Costea) wrote…

s/configrued/configured

Done.


go/lib/infra/modules/idiscovery/idiscovery.go, line 203 at r2 (raw file):

Previously, scrye (Sergiu Costea) wrote…

Can probably delete this, code is self-explanatory.

Done.


go/lib/infra/modules/idiscovery/idiscovery.go, line 205 at r2 (raw file):

Previously, scrye (Sergiu Costea) wrote…

I might still not fully grasp how this works, but bear with me. Suppose we have the following config:

discovery.static.Timeout = 5s
discovery.static.Interval = 5s
discovery.static.connect.InitialPeriod = 20s

The checker will attempt to tick every second, but the fetcher might still be running the older iteration of the task. This will cause the checker goroutine to block until the trigger channel of the fetcher is drained (which is not deterministic if a new tick interval elapsed on the fetcher).

This is because TriggerRun blocks until the run is actually started.

The outcome is that the periodicity of the initial fetch is affected, and will not behave as promised: On start up, the discovery service is queried every second until a valid response is received or the InitialPeriod has passed. .

You are right. Refactored the code quite a bit to fulfill the promise.
Done.


go/lib/infra/modules/idiscovery/idiscovery.go, line 208 at r2 (raw file):

Previously, scrye (Sergiu Costea) wrote…

On success, this AfterFunc uses up a no-op goroutine until the cfg.Connect.InitialPeriod.Duration elapses. This is because the timer returned by the AfterFunc call is never stopped.

It's not a big issue, since it happens rarely and isn't too costly.

Gone after the refactor.


go/lib/infra/modules/idiscovery/idiscoverytest/config.go, line 22 at r1 (raw file):

Previously, scrye (Sergiu Costea) wrote…

Do you need the . import? We tend to avoid it in the codebase, and there are only a few types from that package in here.

Done.

Copy link
Contributor

@scrye scrye left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 2 files at r6, 4 of 5 files at r7, 1 of 4 files at r9.
Reviewable status: 10 of 16 files reviewed, 10 unresolved discussions (waiting on @scrye)

Copy link
Contributor

@scrye scrye left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 4 files at r9, 4 of 4 files at r10.
Reviewable status: 15 of 16 files reviewed, 1 unresolved discussion (waiting on @scrye)

Copy link
Contributor

@scrye scrye left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 1 of 1 files at r11.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved

Add initial connect period to the discovery.Config.
Adjust the sample files accordingly.
idiscovery now tracks whether a topology has been successfully fetched
from the discovery service in the initial period.

During this period the discovery service is queried every second.

The applicable actions are `Fatal` and `Continue`.
Check that the services respect the configured fail action.
@oncilla oncilla force-pushed the infra-initial-connect branch from e9a42e6 to cb71a5d Compare February 15, 2019 16:09
Copy link
Contributor

@scrye scrye left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 2 files at r20.
Reviewable status: 6 of 17 files reviewed, 1 unresolved discussion (waiting on @oncilla and @scrye)


acceptance/common.sh, line 92 at r20 (raw file):

#######################################
# Returns whether is running in docker

The syntax is a bit weird (I thought a word was missing, until I realized it is referring to the script itself).

Maybe reword to Returns whether this script is running in docker.

Copy link
Contributor

@scrye scrye left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 2 files at r13, 4 of 5 files at r14, 2 of 4 files at r16, 4 of 4 files at r17.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @oncilla)

Copy link
Contributor Author

@oncilla oncilla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 16 of 17 files reviewed, 1 unresolved discussion (waiting on @scrye)


acceptance/common.sh, line 92 at r20 (raw file):

Previously, scrye (Sergiu Costea) wrote…

The syntax is a bit weird (I thought a word was missing, until I realized it is referring to the script itself).

Maybe reword to Returns whether this script is running in docker.

Done.

Copy link
Contributor

@scrye scrye left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 1 files at r21.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved

@oncilla oncilla merged commit 12cff7e into scionproto:master Feb 15, 2019
@oncilla oncilla deleted the infra-initial-connect branch February 15, 2019 16:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Discovery: Enable initial connection check
2 participants