Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Race condition when executing continual tests with upgrade framework #2026

Closed
mgencur opened this issue Feb 17, 2021 · 1 comment · Fixed by knative/serving#10815
Closed

Comments

@mgencur
Copy link
Contributor

mgencur commented Feb 17, 2021

Expected Behavior

Tests executed via the upgrade framework in https://github.com/knative/pkg/tree/master/test/upgrade properly run the "continual" tests which run in background during an upgrade.

Actual Behavior

The test framework occasionally detects a race condition when tests are executed with -race flag. The example error can be seen here: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift-knative_serverless-operator/804/pull-ci-openshift-knative-serverless-operator-main-4.6-upgrade-tests-aws-ocp-46/1361689216857149440
The stacktrace looks like this:

2021-02-16T15:47:08.236Z	INFO	upgrade/steps.go:57	3) 🔄 Starting continual tests. 3 tests are registered.
2021-02-16T15:47:08.236Z	INFO	upgrade/steps.go:60	3.1) Starting continual tests of "ProbeTest".
    testing.go:954: race detected during execution of test
2021-02-16T15:47:20.489Z	ERROR	upgrade/functions.go:41	💣🤬💔️ Upgrade suite have failed!
knative.dev/pkg/test/upgrade.(*Suite).Execute
	/go/src/github.com/openshift-knative/serverless-operator/vendor/knative.dev/pkg/test/upgrade/functions.go:41
github.com/openshift-knative/serverless-operator/test/upgrade.TestServerlessUpgrade
	/go/src/github.com/openshift-knative/serverless-operator/test/upgrade/upgrade_test.go:50
testing.tRunner
	/usr/local/go/src/testing/testing.go:1039

Steps to Reproduce the Problem

  1. Define multiple continual tests as defined below
  2. Execute tests with -race flag
suite := pkgupgrade.Suite{
	Tests: pkgupgrade.Tests{
		PreUpgrade:    preUpgradeTests(),
		PostUpgrade:   postUpgradeTests(),
		Continual:     []pkgupgrade.BackgroundOperation{
			servingupgrade.ProbeTest(),
			servingupgrade.AutoscaleSustainingTest(),
			eventingupgrade.ContinualTest(),
		},
	},
  ...
}

Additional Info

mgencur added a commit to mgencur/serving-1 that referenced this issue Feb 19, 2021
Fixes knative/pkg#2026

The actual issue is that the test context expires between individual
stages run by the upgrade framework. This fix passes an external logger
that survives the stages.
mgencur added a commit to mgencur/serving-1 that referenced this issue Feb 19, 2021
Fixes knative/pkg#2026

The actual issue is that the test context expires between individual
stages run by the upgrade framework. This fix passes an external logger
that survives the stages.
mgencur added a commit to mgencur/serving-1 that referenced this issue Feb 19, 2021
Fixes knative/pkg#2026

The actual issue is that the test context expires between individual
stages run by the upgrade framework. This fix passes an external logger
that survives the stages.
@mgencur
Copy link
Contributor Author

mgencur commented Feb 19, 2021

More details about the root cause from verbose logs:

=== RUN   TestServerlessUpgrade/ContinualTests/SetupProbeTest
    service.go:98: Creating a new Service service upgrade-probe
    crd.go:36:  resource {<nil> <nil> <*>{&TypeMeta{Kind:,APIVersion:,} &ObjectMeta{Name:upgrade-probe,GenerateName:,Namespace:,SelfLink:,UID:,ResourceVersion:,Generation:0,CreationTimestamp:0001-01-01 00:00:00 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{},Annotations:map[string]string{},OwnerReferences:[]OwnerReference{},Finalizers:[],ClusterName:,ManagedFields:[]ManagedFieldsEntry{},} {{&ObjectMeta{Name:,GenerateName:,Namespace:,SelfLink:,UID:,ResourceVersion:,Generation:0,CreationTimestamp:0001-01-01 00:00:00 +0000 UTC,DeletionTimestamp:<nil>,DeletionGracePeriodSeconds:nil,Labels:map[string]string{},Annotations:map[string]string{},OwnerReferences:[]OwnerReference{},Finalizers:[],ClusterName:,ManagedFields:[]ManagedFieldsEntry{},}} {<nil>}} {{0 <nil> <nil>} { } {<nil> <nil> <nil>}}} <nil>}
    service.go:116: Waiting for Service to transition to Ready. service upgrade-probe
    service.go:121: Checking to ensure Service Status is populated for Ready service
    service.go:127: Getting latest objects Created by Service
    service.go:130: Successfully created Service upgrade-probe
    spoof.go:103: Spoofing upgrade-probe-serving-tests.apps.ci-op-k6xfg5sw-113e4.origin-ci-int-aws.dev.rhcloud.com -> upgrade-probe-serving-tests.apps.ci-op-k6xfg5sw-113e4.origin-ci-int-aws.dev.rhcloud.com
    prober.go:123: Starting Route prober for http://upgrade-probe-serving-tests.apps.ci-op-k6xfg5sw-113e4.origin-ci-int-aws.dev.rhcloud.com.
==================
WARNING: DATA RACE
Read at 0x00c000edc3a3 by goroutine 97:
  testing.(*common).logDepth()
      /usr/local/go/src/testing/testing.go:710 +0xa7
  testing.(*common).log()
      /usr/local/go/src/testing/testing.go:703 +0x8f
  testing.(*common).Logf()
      /usr/local/go/src/testing/testing.go:749 +0x21
  testing.(*common).Logf-fm()
      /usr/local/go/src/testing/testing.go:749 +0x73
  knative.dev/pkg/test/spoof.New()
      /go/src/github.com/openshift-knative/serverless-operator/vendor/knative.dev/pkg/test/spoof/spoof.go:103 +0x206
  knative.dev/pkg/test.NewSpoofingClient()
      /go/src/github.com/openshift-knative/serverless-operator/vendor/knative.dev/pkg/test/clients.go:47 +0x1df
  knative.dev/serving/test.(*manager).Spawn.func1()
      /go/src/github.com/openshift-knative/serverless-operator/vendor/knative.dev/serving/test/prober.go:142 +0x21b
  golang.org/x/sync/errgroup.(*Group).Go.func1()
      /go/src/github.com/openshift-knative/serverless-operator/vendor/golang.org/x/sync/errgroup/errgroup.go:57 +0x85
Previous write at 0x00c000edc3a3 by goroutine 115:
  testing.tRunner.func1()
      /usr/local/go/src/testing/testing.go:1025 +0x467
  testing.tRunner()
      /usr/local/go/src/testing/testing.go:1043 +0x20d
  knative.dev/pkg/test/upgrade.(*suiteExecution).startContinualTests.func1.1()
      /go/src/github.com/openshift-knative/serverless-operator/vendor/knative.dev/pkg/test/upgrade/steps.go:67 +0x4f
  testing.tRunner()
      /usr/local/go/src/testing/testing.go:1039 +0x1eb
Goroutine 97 (running) created at:
  golang.org/x/sync/errgroup.(*Group).Go()
      /go/src/github.com/openshift-knative/serverless-operator/vendor/golang.org/x/sync/errgroup/errgroup.go:54 +0x73
  knative.dev/serving/test.(*manager).Spawn()
      /go/src/github.com/openshift-knative/serverless-operator/vendor/knative.dev/serving/test/prober.go:141 +0x69f
  knative.dev/serving/test.RunRouteProber()
      /go/src/github.com/openshift-knative/serverless-operator/vendor/knative.dev/serving/test/prober.go:228 +0x198
  knative.dev/serving/test/upgrade.ProbeTest.func1()
      /go/src/github.com/openshift-knative/serverless-operator/vendor/knative.dev/serving/test/upgrade/probe.go:54 +0x71e
  knative.dev/serving/test/v1.CreateServiceReady()
      /go/src/github.com/openshift-knative/serverless-operator/vendor/knative.dev/serving/test/v1/service.go:99 +0x256
  knative.dev/serving/test/upgrade.ProbeTest.func1()
      /go/src/github.com/openshift-knative/serverless-operator/vendor/knative.dev/serving/test/upgrade/probe.go:45 +0x267
  knative.dev/pkg/test/upgrade.(*suiteExecution).startContinualTests.func1.1()
      /go/src/github.com/openshift-knative/serverless-operator/vendor/knative.dev/pkg/test/upgrade/steps.go:67 +0x4f
  testing.tRunner()
      /usr/local/go/src/testing/testing.go:1039 +0x1eb
Goroutine 115 (running) created at:
  testing.(*T).Run()
      /usr/local/go/src/testing/testing.go:1090 +0x700
  knative.dev/pkg/test/upgrade.(*suiteExecution).startContinualTests.func1()
      /go/src/github.com/openshift-knative/serverless-operator/vendor/knative.dev/pkg/test/upgrade/steps.go:66 +0x547
  testing.tRunner()
      /usr/local/go/src/testing/testing.go:1039 +0x1eb

openshift-merge-robot pushed a commit to openshift/knative-serving that referenced this issue Mar 1, 2021
Fixes knative/pkg#2026

The actual issue is that the test context expires between individual
stages run by the upgrade framework. This fix passes an external logger
that survives the stages.
knative-prow-robot pushed a commit to knative/serving that referenced this issue Mar 1, 2021
Fixes knative/pkg#2026

The actual issue is that the test context expires between individual
stages run by the upgrade framework. This fix passes an external logger
that survives the stages.
markusthoemmes pushed a commit to markusthoemmes/knative-serving that referenced this issue Apr 7, 2021
Fixes knative/pkg#2026

The actual issue is that the test context expires between individual
stages run by the upgrade framework. This fix passes an external logger
that survives the stages.
openshift-merge-robot pushed a commit to openshift/knative-serving that referenced this issue Apr 7, 2021
* Fix race condition with Prober logger in upgrade tests (#670)

Fixes knative/pkg#2026

The actual issue is that the test context expires between individual
stages run by the upgrade framework. This fix passes an external logger
that survives the stages.

* Only use exec probe at startup time (knative#10741)

* Only use exec probe at startup time

Now that StartupProbe is available, we can avoid using spawning the exec
probe other than at startup time. For requests after startup this
directly uses the same endpoint as the exec probe in the QP as the
target of a HTTP readiness probe.

Following on from this I think we may want to rework quite a bit of how
our readiness probe stuff works (e.g. it'd be nice to keep the probes on
the user container so failures are on the right object, and we currently
ignore probes ~entirely after startup if periodSeconds>0), but this is a
minimal change that should be entirely backwards-compatible and saves
quite a few cpu cycles.

* Use ProgressDeadline as failure timeout for startup probe

- Also just drop exec probe entirely for periodSeconds > 1 since these
  can just use the readiness probe now. (Easier than figuring out how to
  do square ProgressDeadline with a custom period).

* See if flag is what's making upgrades unhappy

* reorganize comments

* Default PeriodSeconds of the readiness probe to 1 if unset (knative#10992)

Co-authored-by: Martin Gencur <[email protected]>
Co-authored-by: Julian Friedman <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant