This repository was archived by the owner on Oct 23, 2024. It is now read-only.

[DCOS-41388] Fix bug where v1 mesos client instantiation defaults to v0 #2655

Merged (10 commits) on Sep 25, 2018

@takirala takirala commented Sep 5, 2018

The scheduler tries to instantiate a V1 Mesos client here: https://github.com/mesosphere/dcos-commons/blob/9f1bc016c2ad2a7f5d9632f3a8fa0e4c412ec570/sdk/scheduler/src/main/java/com/mesosphere/sdk/framework/SchedulerDriverFactory.java#L112
However, because it calls super.startInternal(), the libmesos adapter creates a V0 client instead: https://github.com/mesosphere/mesos-http-adapter/blob/5965d467c86ce7a1092bda5624dd2da9b0776334/src/main/java/com/mesosphere/mesos/http-adapter/MesosToSchedulerDriverAdapter.java#L316-L318

We removed the environment variables from config.json and the Marathon mustache template in #2566. We should fix this without reintroducing the variable in config.json. This intentionally keeps users from easily choosing between V0 and V1, while retaining the ability to create a V0 client if needed.

The ideal way to test this would have been to ask upstream whether the client is speaking V0 or V1 (but should upstream know this?). Instead, I have written a unit test that pins down the following behavior:

  • If upstream supports V1 and V1 is specified in the environment (this is the default in SDK 50+), then we create a V1 client.
  • In all other cases, we default to V0.
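
The rule in the two bullets above can be sketched as a small pure function. This is only an illustration of the decision table: the class, enum, and parameter names are hypothetical and are not taken from the SDK or from mesos-http-adapter.

```java
public final class MesosApiVersionSelector {

    /** API versions a client can speak (hypothetical enum, for illustration only). */
    public enum ApiVersion { V0, V1 }

    private MesosApiVersionSelector() {}

    /**
     * Returns V1 only when upstream supports V1 AND the environment asks for V1.
     * Every other combination, including a missing value, falls back to V0.
     *
     * @param upstreamSupportsV1 assumed capability flag reported for the cluster
     * @param requestedVersion   assumed raw value read from the environment, may be null
     */
    public static ApiVersion select(boolean upstreamSupportsV1, String requestedVersion) {
        // equalsIgnoreCase on a literal is null-safe: it returns false for null.
        boolean v1Requested = "V1".equalsIgnoreCase(requestedVersion);
        return (upstreamSupportsV1 && v1Requested) ? ApiVersion.V1 : ApiVersion.V0;
    }
}
```

Keeping the decision in one side-effect-free method makes the four input combinations easy to cover in a unit test without mocking any Mesos classes.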

I have upgraded Mockito because I wanted to mock final classes. While I was here, I also upgraded the Gradle version that we use. These are the steps I took to upgrade Gradle:

  • Performed the upgrade using ./gradlew wrapper --gradle-version=4.10 --distribution-type=bin

  • Ran ./gradlew clean build --warning-mode all and fixed the warnings printed, except for the following one, which is caused by GradleUp/shadow#336:

    Registering invalid inputs and outputs via TaskInputs and TaskOutputs methods has been deprecated. This is scheduled to be removed in Gradle 5.0. A problem was found with the configuration of task ':keystore-app:shadowJar'.
    No value has been specified for property 'mainClassName'.

  • Removed the auto-generated gradlew.bat file
  • Cleaned up some of the build.gradle files
    • Removed unused variables
    • Removed the inherited plugins
    • Removed the inherited repositories
    • We can always put the above two back when we need something done differently for a specific framework.

Upgrading Gradle was not strictly necessary for DCOS-41388, but I did it because it will be useful for the upcoming SDK refactoring (see Gradle's implementation and api dependency configurations).

@takirala takirala self-assigned this Sep 5, 2018
@takirala takirala changed the title [DCOS-41388] Fix bug for v1 mesos client instantiation defaulting to v0 [DCOS-41388] Fix bug where v1 mesos client instantiation defaults to v0 Sep 5, 2018
@takirala takirala added the wip label Sep 6, 2018
@takirala takirala force-pushed the fix-v1-mesos-client branch 4 times, most recently from 3fee424 to 8bc7631 Compare September 11, 2018 01:41
@@ -166,7 +167,7 @@ dependencies {

 testCompile 'org.hamcrest:hamcrest-all:1.3' // note: must be above junit
 testCompile 'junit:junit:4.12'
-testCompile 'org.mockito:mockito-all:1.10.19'
+testCompile 'org.mockito:mockito-core:2.10.0'
Contributor Author

Mockito does not produce the mockito-all artifact anymore ; this one was primarily aimed at ant users, and contained other dependencies. We felt it was time to move on and remove such artifacts as they cause problems in dependency management system like maven or gradle.

https://github.com/mockito/mockito/wiki/What%27s-new-in-Mockito-2#incompatible-changes-with-110

classpath "com.github.jengelman.gradle.plugins:shadow:2.0.1"
}
plugins {
id "com.github.johnrengelman.shadow" version "2.0.4"
Contributor Author

Remove usage of Gradle internal AbstractFileCollection.

https://github.com/johnrengelman/shadow/releases/tag/2.0.4

@@ -0,0 +1 @@
mock-maker-inline

final Capabilities capabilities,
final FrameworkInfo frameworkInfo,
final String masterUrl,
@Nullable final Credential credential,
Contributor

How about switching this to an Optional<Credential>? Both here and at L139

Contributor Author

I tried that, but I feel there is very little value in adding it, considering I would still have to do credential.map(EvolverDevolver.evolve(_)).getOrElse(null), as null is what gets passed to the V0Mesos and V1Mesos constructors. I could use the constructors that do not take a Credential, but that would end up duplicating code.

Either way, I thought @Nullable is less costly than Optional<?>, and Optional would add very little value in this case while complicating a simple ternary operation.
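
To make the trade-off concrete, here is a minimal sketch of the two styles side by side. It is not the SDK code: a plain String stands in for Credential, and EVOLVE is a hypothetical stand-in for EvolverDevolver.evolve. Note that on java.util.Optional the fallback method is orElse, so the Java form of the expression quoted above would end in .orElse(null).

```java
import java.util.Optional;
import java.util.function.Function;

public final class CredentialPassing {

    // Hypothetical stand-in for EvolverDevolver.evolve(...); a String plays the role of Credential.
    static final Function<String, String> EVOLVE = c -> "v1:" + c;

    private CredentialPassing() {}

    /** Nullable style: a single ternary, null flows straight through to the constructor. */
    public static String withNullable(String credential) {
        return credential == null ? null : EVOLVE.apply(credential);
    }

    /** Optional style: the value still has to be unwrapped back to null at the call site. */
    public static String withOptional(Optional<String> credential) {
        return credential.map(EVOLVE).orElse(null);
    }
}
```

Both methods hand the same value (possibly null) to whatever consumes it, which is the point being made in the comment: when the downstream constructor expects null, the Optional wrapper is unwrapped again immediately.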

data.middle
)
.apply(mock(MesosToSchedulerDriverAdapter.class))
.getClass()
Contributor

Could this be done as assertTrue([...].getClass() instanceof data.right)? Or is that the issue?

Contributor Author

I get a compile-time error if I use instanceof, because data.right is of type ? extends Class<? extends Mesos>.
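
For context, instanceof needs a type name that is known at compile time on its right-hand side, so a Class value held in a variable cannot be used there. Class.isInstance performs the equivalent check at runtime. The sketch below uses hypothetical stand-in types named after the ones in this PR; it is not the actual test code.

```java
public final class ClassChecks {

    // Hypothetical stand-ins for the real Mesos client types.
    public interface Mesos {}
    public static final class V0Mesos implements Mesos {}
    public static final class V1Mesos implements Mesos {}

    private ClassChecks() {}

    /**
     * Runtime equivalent of "candidate instanceof Expected" when the expected
     * type is only available as a Class object (e.g. a field of a test-data tuple).
     */
    public static boolean isInstanceOf(Object candidate, Class<? extends Mesos> expected) {
        return expected.isInstance(candidate);
    }
}
```

Comparing candidate.getClass() against the expected Class with equals is stricter than isInstance, because it rejects subclasses; which one fits depends on whether the test should accept subtypes.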

@takirala takirala requested a review from nickbp September 19, 2018 18:53
@@ -423,14 +423,20 @@ def _check_tasks_updated():
_check_tasks_updated()


-def check_tasks_not_updated(service_name, prefix, old_task_ids):
+def check_tasks_not_updated(
Contributor

@nickbp nickbp Sep 24, 2018

Did some digging around and I think I know what's happening:

  1. I'd guess that the V1 API is faster to mark the framework as inactive, or maybe it's instead slower to mark the framework active again after the scheduler has re-registered?
  2. sdk_tasks.py:L204 filters out frameworks with !active when fetching the list of all frameworks.
  3. Given 1+2, get_task_ids() would return an empty list if the scheduler hasn't completed re-registering.

Given this, I think a better solution here would be to explicitly check for the active framework after restarting the scheduler. This would make the expected behavior a bit more explicit:

  • Implement a function (sdk_tasks.wait_for_active_framework()?) that waits for at least one framework with the specified service name to be active. This could be built using logic similar to the /mesos/frameworks fetch in sdk_tasks._get_service_tasks(), where it just waits for fwk["name"] == service_name and fwk["active"] on any framework entry.
  • Update the hdfs test to invoke that new function before calling check_tasks_not_updated().
  • check_tasks_not_updated() could be reverted to its previous state without the retry.
  • Couldn't hurt to also add a wait_for_active_framework()+check_tasks_not_updated() check to hello-world test_zzzrecovery.kill_scheduler too, because it isn't currently checking for this. Would explain why the failure was only showing up in HDFS to begin with.

@takirala
Contributor Author

This is a sample timeline of events from one of the CI runs:

Scheduler's view:

INFO  2018-09-24 10:09:26,115 [main] RawServiceSpec:build(86): Rendered ServiceSpec from /var/lib/mesos/slave/slaves/bb9538b9-81de-43e5-94b6-4448c7721a9e-S5/frameworks/bb9538b9-81de-43e5-94b6-4448c7721a9e-0000/executors/test_integration_hdfs.e80887d9-bfe1-11e8-8ad7-7672825174fd/runs/0b514650-901f-467e-996b-d42463fb2809/./hdfs-scheduler/svc.yml:
.
.
INFO  2018-09-24 10:09:59,744 [main] DefaultConfigurationUpdater:printConfigDiff(253): Difference between configs:
.
.
INFO  2018-09-24 10:10:00,230 [main] SchedulerBuilder:getDefaultScheduler(409): Plan: deploy (COMPLETE)
.
.
INFO  2018-09-24 10:10:03,032 [Thread-3] ApiServer:run(184): API server started at port 18062
INFO  2018-09-24 10:10:04,724 [pool-12-thread-1] com.mesosphere.mesos.HTTPAdapter.MesosToSchedulerDriverAdapter:subscribe(115): Sending SUBSCRIBE call
INFO  2018-09-24 10:10:04,741 [pool-12-thread-1] com.mesosphere.mesos.HTTPAdapter.MesosToSchedulerDriverAdapter:scheduleNextSubscription(134): Backing off for: 1975
INFO  2018-09-24 10:10:04,841 [Thread-6] com.mesosphere.mesos.HTTPAdapter.MesosToSchedulerDriverAdapter:received(181): Received event of type: SUBSCRIBED
INFO  2018-09-24 10:10:04,844 [Thread-6] FrameworkScheduler:registered(126): Registered framework with frameworkId: bb9538b9-81de-43e5-94b6-4448c7721a9e-0002

.
.

pytest's view:

.
.
[2018-09-24 10:09:20,343|sdk_cmd-_ssh(372)|INFO]: SSH command: ssh -oBatchMode=yes -oStrictHostKeyChecking=no -oConnectTimeout=60 -A -q -l core 52.34.52.20 -- "ssh -oBatchMode=yes -oStrictHostKeyChecking=no -oConnectTimeout=60 -A -q -l core 10.0.0.173 -- \"sudo pkill -9 -f -U nobody -o ./hdfs-scheduler/bin/hdfs && echo Successfully killed process by user nobody containing ./hdfs-scheduler/bin/hdfs || (echo Process containing ./hdfs-scheduler/bin/hdfs under user nobody not found: && ps aux && exit 1)\""
.
.
.
(polling the deploy HTTP endpoint returns 5xx)
.
.
[2018-09-24 10:10:03,634|sdk_plan-fn(175)|INFO]: Waiting for COMPLETE deploy plan:
deploy (COMPLETE):
- journal (COMPLETE): journal-0:[node]=COMPLETE, journal-1:[node]=COMPLETE, journal-2:[node]=COMPLETE
- name (COMPLETE): name-0:[node, zkfc]=COMPLETE, name-1:[node, zkfc]=COMPLETE
- data (COMPLETE): data-0:[node]=COMPLETE, data-1:[node]=COMPLETE, data-2:[node]=COMPLETE
[2018-09-24 10:10:03,644|sdk_cmd-_cluster_request(125)|INFO]: (HTTP GET) /dcos-history-service/history/last => 200 (0.010s)
[2018-09-24 10:10:03,654|sdk_cmd-_cluster_request(125)|INFO]: (HTTP GET) /dcos-history-service/history/last => 200 (0.009s)
[2018-09-24 10:10:03,696|sdk_cmd-_cluster_request(125)|INFO]: (HTTP GET) /service/test/integration/hdfs/v1/plans/recovery => 200 (0.042s)
[2018-09-24 10:10:03,697|sdk_plan-fn(175)|INFO]: Waiting for COMPLETE recovery plan:
recovery (COMPLETE):
[2018-09-24 10:10:03,713|sdk_cmd-_cluster_request(125)|INFO]: (HTTP GET) /mesos/slaves => 200 (0.016s)
[2018-09-24 10:10:03,778|sdk_cmd-_cluster_request(125)|INFO]: (HTTP GET) /mesos/frameworks => 200 (0.064s)
[2018-09-24 10:10:03,779|sdk_tasks-get_task_ids(180)|INFO]: Inside get_task_ids : []
[2018-09-24 10:10:03,780|sdk_tasks-d(439)|INFO]: Checking tasks starting with "" have not been updated:
- Old tasks: ['test.integration.hdfs__data-0-node__45e6d6a4-8af8-4c0a-82cd-efd7801b11dc', 'test.integration.hdfs__data-1-node__7c2b492b-84f6-4de4-aa3d-7c33ff1398e4', 'test.integration.hdfs__data-2-node__6ddd1dcb-4abb-4ff4-b6e9-4b132381aa6c', 'test.integration.hdfs__journal-0-node__d5b087bb-afd9-409d-a208-2449804f4eea', 'test.integration.hdfs__journal-1-node__d6485e51-9008-41f2-b423-e923e6f507c0', 'test.integration.hdfs__journal-2-node__55a91ec3-5bab-43f7-8f97-26881a3798d6', 'test.integration.hdfs__name-0-node__c6f22d5e-3aee-478f-8585-636dcd9bfcbd', 'test.integration.hdfs__name-0-zkfc__1f6e8e5c-a246-47ec-9c8a-af38b1cdb380', 'test.integration.hdfs__name-1-node__7a43c513-7af6-419f-8195-ff9e6c87763e', 'test.integration.hdfs__name-1-zkfc__8e790416-2652-46d4-aa3f-dd690950a806']
- Current tasks: []
[2018-09-24 10:10:04,798|sdk_cmd-_cluster_request(125)|INFO]: (HTTP GET) /mesos/slaves => 200 (0.018s)
[2018-09-24 10:10:04,864|sdk_cmd-_cluster_request(125)|INFO]: (HTTP GET) /mesos/frameworks => 200 (0.064s)
[2018-09-24 10:10:04,866|sdk_tasks-get_task_ids(180)|INFO]: Inside get_task_ids : []
[2018-09-24 10:10:04,866|sdk_tasks-d(439)|INFO]: Checking tasks starting with "" have not been updated:
- Old tasks: ['test.integration.hdfs__data-0-node__45e6d6a4-8af8-4c0a-82cd-efd7801b11dc', 'test.integration.hdfs__data-1-node__7c2b492b-84f6-4de4-aa3d-7c33ff1398e4', 'test.integration.hdfs__data-2-node__6ddd1dcb-4abb-4ff4-b6e9-4b132381aa6c', 'test.integration.hdfs__journal-0-node__d5b087bb-afd9-409d-a208-2449804f4eea', 'test.integration.hdfs__journal-1-node__d6485e51-9008-41f2-b423-e923e6f507c0', 'test.integration.hdfs__journal-2-node__55a91ec3-5bab-43f7-8f97-26881a3798d6', 'test.integration.hdfs__name-0-node__c6f22d5e-3aee-478f-8585-636dcd9bfcbd', 'test.integration.hdfs__name-0-zkfc__1f6e8e5c-a246-47ec-9c8a-af38b1cdb380', 'test.integration.hdfs__name-1-node__7a43c513-7af6-419f-8195-ff9e6c87763e', 'test.integration.hdfs__name-1-zkfc__8e790416-2652-46d4-aa3f-dd690950a806']
- Current tasks: []
[2018-09-24 10:10:05,891|sdk_cmd-_cluster_request(125)|INFO]: (HTTP GET) /mesos/slaves => 200 (0.024s)
[2018-09-24 10:10:05,964|sdk_cmd-_cluster_request(125)|INFO]: (HTTP GET) /mesos/frameworks => 200 (0.070s)
[2018-09-24 10:10:05,965|sdk_tasks-get_task_ids(180)|INFO]: Inside get_task_ids : [Task[name="data-1-node"	state=TASK_RUNNING	id=test.integration.hdfs__data-1-node__7c2b492b-84f6-4de4-aa3d-7c33ff1398e4	host=10.0.3.162	framework_id=bb9538b9-81de-43e5-94b6-4448c7721a9e-0002	agent_id=bb9538b9-81de-43e5-94b6-4448c7721a9e-S6], Task[name="name-1-zkfc"	state=TASK_RUNNING	id=test.integration.hdfs__name-1-zkfc__8e790416-2652-46d4-aa3f-dd690950a806	host=10.0.0.173	framework_id=bb9538b9-81de-43e5-94b6-4448c7721a9e-0002	agent_id=bb9538b9-81de-43e5-94b6-4448c7721a9e-S7], Task[name="data-2-node"	state=TASK_RUNNING	id=test.integration.hdfs__data-2-node__6ddd1dcb-4abb-4ff4-b6e9-4b132381aa6c	host=10.0.0.5	framework_id=bb9538b9-81de-43e5-94b6-4448c7721a9e-0002	agent_id=bb9538b9-81de-43e5-94b6-4448c7721a9e-S9], Task[name="name-0-zkfc"	state=TASK_RUNNING	id=test.integration.hdfs__name-0-zkfc__1f6e8e5c-a246-47ec-9c8a-af38b1cdb380	host=10.0.0.5	framework_id=bb9538b9-81de-43e5-94b6-4448c7721a9e-0002	agent_id=bb9538b9-81de-43e5-94b6-4448c7721a9e-S9], Task[name="name-1-node"	state=TASK_RUNNING	id=test.integration.hdfs__name-1-node__7a43c513-7af6-419f-8195-ff9e6c87763e	host=10.0.0.173	framework_id=bb9538b9-81de-43e5-94b6-4448c7721a9e-0002	agent_id=bb9538b9-81de-43e5-94b6-4448c7721a9e-S7], Task[name="name-0-node"	state=TASK_RUNNING	id=test.integration.hdfs__name-0-node__c6f22d5e-3aee-478f-8585-636dcd9bfcbd	host=10.0.0.5	framework_id=bb9538b9-81de-43e5-94b6-4448c7721a9e-0002	agent_id=bb9538b9-81de-43e5-94b6-4448c7721a9e-S9], Task[name="journal-0-node"	state=TASK_RUNNING	id=test.integration.hdfs__journal-0-node__d5b087bb-afd9-409d-a208-2449804f4eea	host=10.0.1.251	framework_id
=bb9538b9-81de-43e5-94b6-4448c7721a9e-0002	agent_id=bb9538b9-81de-43e5-94b6-4448c7721a9e-S8], Task[name="data-0-node"	state=TASK_RUNNING	id=test.integration.hdfs__data-0-node__45e6d6a4-8af8-4c0a-82cd-efd7801b11dc	host=10.0.0.173	framework_id=bb9538b9-81de-43e5-94b6-4448c7721a9e-0002	agent_id=bb9538b9-81de-43e5-94b6-4448c7721a9e-S7], Task[name="journal-1-node"	state=TASK_RUNNING	id=test.integration.hdfs__journal-1-node__d6485e51-9008-41f2-b423-e923e6f507c0	host=10.0.3.162	framework_id=bb9538b9-81de-43e5-94b6-4448c7721a9e-0002	agent_id=bb9538b9-81de-43e5-94b6-4448c7721a9e-S6], Task[name="journal-2-node"	state=TASK_RUNNING	id=test.integration.hdfs__journal-2-node__55a91ec3-5bab-43f7-8f97-26881a3798d6	host=10.0.3.127	framework_id=bb9538b9-81de-43e5-94b6-4448c7721a9e-0002	agent_id=bb9538b9-81de-43e5-94b6-4448c7721a9e-S5]]
[2018-09-24 10:10:05,965|sdk_tasks-d(439)|INFO]: Checking tasks starting with "" have not been updated:
- Old tasks: ['test.integration.hdfs__data-0-node__45e6d6a4-8af8-4c0a-82cd-efd7801b11dc', 'test.integration.hdfs__data-1-node__7c2b492b-84f6-4de4-aa3d-7c33ff1398e4', 'test.integration.hdfs__data-2-node__6ddd1dcb-4abb-4ff4-b6e9-4b132381aa6c', 'test.integration.hdfs__journal-0-node__d5b087bb-afd9-409d-a208-2449804f4eea', 'test.integration.hdfs__journal-1-node__d6485e51-9008-41f2-b423-e923e6f507c0', 'test.integration.hdfs__journal-2-node__55a91ec3-5bab-43f7-8f97-26881a3798d6', 'test.integration.hdfs__name-0-node__c6f22d5e-3aee-478f-8585-636dcd9bfcbd', 'test.integration.hdfs__name-0-zkfc__1f6e8e5c-a246-47ec-9c8a-af38b1cdb380', 'test.integration.hdfs__name-1-node__7a43c513-7af6-419f-8195-ff9e6c87763e', 'test.integration.hdfs__name-1-zkfc__8e790416-2652-46d4-aa3f-dd690950a806']
- Current tasks: ['test.integration.hdfs__data-0-node__45e6d6a4-8af8-4c0a-82cd-efd7801b11dc', 'test.integration.hdfs__data-1-node__7c2b492b-84f6-4de4-aa3d-7c33ff1398e4', 'test.integration.hdfs__data-2-node__6ddd1dcb-4abb-4ff4-b6e9-4b132381aa6c', 'test.integration.hdfs__journal-0-node__d5b087bb-afd9-409d-a208-2449804f4eea', 'test.integration.hdfs__journal-1-node__d6485e51-9008-41f2-b423-e923e6f507c0', 'test.integration.hdfs__journal-2-node__55a91ec3-5bab-43f7-8f97-26881a3798d6', 'test.integration.hdfs__name-0-node__c6f22d5e-3aee-478f-8585-636dcd9bfcbd', 'test.integration.hdfs__name-0-zkfc__1f6e8e5c-a246-47ec-9c8a-af38b1cdb380', 'test.integration.hdfs__name-1-node__7a43c513-7af6-419f-8195-ff9e6c87763e', 'test.integration.hdfs__name-1-zkfc__8e790416-2652-46d4-aa3f-dd690950a806']
[2018-09-24 10:10:05,965|tests.test_sanity-test_kill_scheduler(175)|INFO]: Task(s) are not updated ['test.integration.hdfs__data-1-node__7c2b492b-84f6-4de4-aa3d-7c33ff1398e4', 'test.integration.hdfs__name-1-zkfc__8e790416-2652-46d4-aa3f-dd690950a806', 'test.integration.hdfs__data-2-node__6ddd1dcb-4abb-4ff4-b6e9-4b132381aa6c', 'test.integration.hdfs__name-0-zkfc__1f6e8e5c-a246-47ec-9c8a-af38b1cdb380', 'test.integration.hdfs__name-1-node__7a43c513-7af6-419f-8195-ff9e6c87763e', 'test.integration.hdfs__name-0-node__c6f22d5e-3aee-478f-8585-636dcd9bfcbd', 'test.integration.hdfs__journal-0-node__d5b087bb-afd9-409d-a208-2449804f4eea', 'test.integration.hdfs__data-0-node__45e6d6a4-8af8-4c0a-82cd-efd7801b11dc', 'test.integration.hdfs__journal-1-node__d6485e51-9008-41f2-b423-e923e6f507c0', 'test.integration.hdfs__journal-2-node__55a91ec3-5bab-43f7-8f97-26881a3798d6']

@takirala
Contributor Author

The diff of the changes since the last approval: 2f383a0...76ecf45
(cc: @kaiwalyajoshi @kvish @nickbp)

@takirala takirala merged commit 51e650c into master Sep 25, 2018
@takirala takirala deleted the fix-v1-mesos-client branch September 25, 2018 18:31