Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make cancel() synchronous in DBMAsyncJob #14717

Merged
merged 13 commits into from
Jun 9, 2023

Conversation

edengorevoy
Copy link
Contributor

@edengorevoy edengorevoy commented Jun 8, 2023

What does this PR do?

This PR adds some logic in the cancel() method of DMBAsyncJob to ensure that cancel() is a synchronous operation.

This PR also edits some tests in the postgres, MySQL, and sqlserver agent integrations (those who inherit from DMBAsyncJob) to try to make them less flaky. The change adds a run_one_check function in the postgres test utils to run a check and then cancel it + wait for all threads to cancel, so that assertions on metrics happen in a single-threaded environment.

Motivation

We are running into some testing issues in this PR. The errors are potentially race conditions that arise at runtime due to the metrics list in the Aggregator being iterated over at the same time as they're being added to in separate threads owned by DBM.
We should make sure cancellation is synchronous in DBMAsyncJob to further debug this issue and verify that it is the source of these errors.
DBMAsyncJob is only used by objects owned by DBM
, so making this change should not cause issues in other teams.

Additional Notes

I intend to rebase the above PR on this branch so I can see how changing the condition for starting the job loop affects the tests, and I want to use this modified cancel() function to facilitate that investigation.

Review checklist (to be filled by reviewers)

  • Feature or bugfix MUST have appropriate tests (unit, integration, e2e)
  • PR title must be written as a CHANGELOG entry (see why)
  • Files changes must correspond to the primary purpose of the PR as described in the title (small unrelated changes should have their own PR)
  • PR must have changelog/ and integration/ labels attached
  • If the PR doesn't need to be tested during QA, please add a qa/skip-qa label.

@ghost ghost added the base_package label Jun 8, 2023
@edengorevoy edengorevoy changed the title fully cancel in dbmasyncjob Make cancel() synchronous in DBMAsyncJob Jun 8, 2023
@edengorevoy edengorevoy added category/bugfix For use during Agent Release period changelog/Added and removed category/bugfix For use during Agent Release period labels Jun 8, 2023
@codecov
Copy link

codecov bot commented Jun 8, 2023

Codecov Report

Merging #14717 (919e208) into master (5f522e9) will decrease coverage by 0.17%.
The diff coverage is 50.00%.

Flag Coverage Δ
activemq_xml 82.31% <ø> (ø)
amazon_msk 89.07% <ø> (ø)
apache 95.08% <ø> (ø)
arangodb 98.23% <ø> (ø)
argocd 88.43% <ø> (ø)
aspdotnet 100.00% <ø> (+26.19%) ⬆️
avi_vantage 92.54% <ø> (ø)
azure_iot_edge 82.08% <ø> (ø)
boundary 100.00% <ø> (ø)
btrfs 82.91% <ø> (ø)
calico 83.54% <ø> (ø)
cilium 75.46% <ø> (+0.92%) ⬆️
citrix_hypervisor 87.50% <ø> (ø)
cloud_foundry_api 96.35% <ø> (+0.12%) ⬆️
cloudera 99.82% <ø> (ø)
cockroachdb 91.90% <ø> (ø)
consul 91.65% <ø> (ø)
coredns 94.57% <ø> (ø)
crio 89.79% <ø> (ø)
datadog_checks_base 89.57% <50.00%> (+0.33%) ⬆️
datadog_checks_dev 80.16% <ø> (-2.54%) ⬇️
datadog_checks_downloader 81.65% <ø> (ø)
datadog_cluster_agent 90.19% <ø> (ø)
ddev 98.69% <ø> (-0.51%) ⬇️
directory 95.87% <ø> (+0.43%) ⬆️
disk 89.19% <ø> (-2.14%) ⬇️
dns_check 93.90% <ø> (ø)
druid 98.47% <ø> (ø)
eks_fargate 94.05% <ø> (ø)
envoy 95.02% <ø> (+0.42%) ⬆️
etcd 95.56% <ø> (ø)
fluentd 94.77% <ø> (ø)
gitlab_runner 91.94% <ø> (ø)
go_expvar 92.73% <ø> (ø)
gunicorn 92.85% <ø> (ø)
harbor 80.04% <ø> (ø)
hazelcast 92.39% <ø> (ø)
hdfs_datanode 89.74% <ø> (ø)
hdfs_namenode 86.72% <ø> (ø)
http_check 96.09% <ø> (+2.15%) ⬆️
ibm_ace 91.79% <ø> (ø)
ibm_i 81.95% <ø> (ø)
istio 77.43% <ø> (+0.55%) ⬆️
kong 87.56% <ø> (ø)
kube_apiserver_metrics 98.10% <ø> (ø)
kube_controller_manager 96.00% <ø> (ø)
kube_dns 95.97% <ø> (ø)
kube_scheduler 96.53% <ø> (ø)
kubelet 90.90% <ø> (ø)
linkerd 85.14% <ø> (+1.14%) ⬆️
mapreduce 81.35% <ø> (ø)
marathon 83.43% <ø> (ø)
mesos_master 89.75% <ø> (ø)
mesos_slave 93.63% <ø> (ø)
network 63.19% <ø> (-29.71%) ⬇️
nfsstat 95.20% <ø> (ø)
nginx 95.24% <ø> (+0.54%) ⬆️
nginx_ingress_controller 98.36% <ø> (ø)
openmetrics 98.08% <ø> (ø)
openstack 51.45% <ø> (ø)
openstack_controller 91.12% <ø> (ø)
pdh_check 97.82% <ø> (ø)
pgbouncer 91.33% <ø> (ø)
postfix 88.04% <ø> (ø)
process 85.42% <ø> (+0.28%) ⬆️
prometheus 94.17% <ø> (ø)
rethinkdb 97.93% <ø> (ø)
riak 99.22% <ø> (ø)
silk 93.82% <ø> (ø)
singlestore 90.81% <ø> (ø)
snowflake 96.76% <ø> (ø)
spark 93.63% <ø> (ø)
squid 100.00% <ø> (ø)
statsd 87.36% <ø> (+1.05%) ⬆️
supervisord 92.69% <ø> (ø)
system_core 90.90% <ø> (ø)
tcp_check 90.23% <ø> (-2.70%) ⬇️
teamcity 88.35% <ø> (+2.87%) ⬆️
teradata 94.24% <ø> (ø)
tls 92.18% <ø> (+0.82%) ⬆️
tokumx 58.40% <ø> (ø)
traffic_server 96.13% <ø> (ø)
twemproxy 79.45% <ø> (ø)
varnish 84.39% <ø> (+0.26%) ⬆️
vertica 98.50% <ø> (ø)
vsphere 90.58% <ø> (+0.08%) ⬆️
win32_event_log 86.40% <ø> (+0.27%) ⬆️
windows_performance_counters 98.36% <ø> (ø)
yarn 89.50% <ø> (ø)
zk 82.62% <ø> (+1.78%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

@edengorevoy edengorevoy requested a review from justiniso June 8, 2023 21:40
@edengorevoy edengorevoy marked this pull request as ready for review June 8, 2023 21:40
@edengorevoy edengorevoy requested review from a team as code owners June 8, 2023 21:40
justiniso
justiniso previously approved these changes Jun 8, 2023
@github-actions
Copy link

github-actions bot commented Jun 8, 2023

Test Results

     945 files       945 suites   6h 18m 44s ⏱️
  5 417 tests   5 332 ✔️      80 💤 5
20 370 runs  16 907 ✔️ 3 458 💤 5

For more details on these failures, see this check.

Results for commit d084f7c.

♻️ This comment has been updated with latest results.

@@ -809,9 +809,6 @@ def test_async_job_cancel(aggregator, dd_run_check, dbm_instance):
mysql_check = MySql(common.CHECK_NAME, {}, [dbm_instance])
dd_run_check(mysql_check)
mysql_check.cancel()
# wait for it to stop and make sure it doesn't throw any exceptions
mysql_check._statement_samples._job_loop_future.result()
mysql_check._statement_metrics._job_loop_future.result()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice

@edengorevoy edengorevoy merged commit 3aca91b into master Jun 9, 2023
@edengorevoy edengorevoy deleted the cleanly-cancel-in-dbmasyncjob branch June 9, 2023 21:12
self._cancel_event.set()

if self._job_loop_future is not None:
self._job_loop_future.result(10)

Copy link
Member

@FlorentClarret FlorentClarret Jun 12, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @jmeunier28 @edengorevoy 👋 Keep in mind that you'll have to bump the minimum base package version in your integrations if they now rely on this change 🙂 (e.g. https://github.com/DataDog/integrations-core/blob/master/mysql/pyproject.toml#L33)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants