This repository has been archived by the owner on Jan 11, 2023. It is now read-only.

Sovereign cloud support for AzureGermanCloud and AzureUSGovCloud. #499

Merged
merged 10 commits into Azure:master from ChinaCloudGroup:sovereign_cloud on Aug 28, 2017

Conversation

wangtt03
Contributor

@wangtt03 wangtt03 commented Apr 16, 2017

Changes:

  • Extract all the sovereign cloud configurations to defaults.go
  • Cleanup hard coded URL suffix in shell scripts.
  • Remove FQDN suffix related code from azureconst.go
  • Fix SwarmMode China deployment failure bug by extracting docker engine download repo and docker compose download URL.
  • Update testcases.


Fixes #1066

@msftclas

@wangtt03,
Thanks for your contribution as a Microsoft full-time employee or intern. You do not need to sign a CLA.
Thanks,
Microsoft Pull Request Bot

@acs-bot

acs-bot commented Apr 16, 2017

Can one of the admins verify this patch?

@wangtt03
Contributor Author

@colemickens Could you please review this PR? Thanks!

@raulfiru

I'm back in the office only on Tuesday. At the moment I have crappy internet, so by Tuesday morning at the latest (Monday evening for you) I'll regen and deploy.

@raulfiru

raulfiru commented Apr 17, 2017

I might be wrong, but I guess you are making that pull request from a private repo, so I can't get the branch... and I can't merge into a new branch here since I don't have access (obviously). So I'm stuck: I can't get the changes locally.

git remote add -- does not work
git fetch origin pull/ID/head:BRANCHNAME -- does not work either

Another question: do you use the location to determine the right FQDN, or is there another attribute?

this is what i set:

"location": "germanycentral",

@colemickens
Contributor

The repo is open, something like this should work:

git remote add ccg git@github.com:ChinaCloudGroup/acs-engine.git
git remote update
git checkout sovereign_cloud
git reset --hard ccg/sovereign_cloud
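
As an aside, GitHub also exposes each pull request as a read-only ref, so the branch can be fetched without adding the fork as a remote. A minimal sketch (the fetch itself needs network access, so the command is only printed here; the local branch name after the colon is arbitrary):

```shell
# Construct the read-only PR ref that `git fetch` accepts (499 is this PR's number).
pr=499
ref="pull/${pr}/head:sovereign_cloud"
# Printed rather than executed, since the real fetch requires network access:
echo "git fetch https://github.com/Azure/acs-engine.git ${ref}"
```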

@raulfiru

worked!

Didn't have to change the VM image name, and now it works like a charm... I still SSH-ed into the master (11 PM... will try remote tomorrow).

azureuser@k8s-master-12051246-0:~$ kubectl get services
NAME         CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
kubernetes   10.0.0.1     <none>        443/TCP   7m
azureuser@k8s-master-12051246-0:~$ kubectl get nodes
NAME                        STATUS                     AGE
k8s-agentpool1-12051246-0   Ready                      7m
k8s-agentpool1-12051246-1   Ready                      7m
k8s-agentpool1-12051246-2   Ready                      7m
k8s-master-12051246-0       Ready,SchedulingDisabled   7m

Great, thanks guys! You can merge from my point of view ;)

PS: I didn't do any regression testing for non-Azure.de.

@wangtt03
Contributor Author

@raulfiru I am so happy to hear that! Thanks! 👍

@benkoller

I had to change the osImageVersion, as 16.04.201703070 does not seem to be available in germanycentral (?):

~ ❯❯❯ az vm image list --all --location germanycentral | jq  '[ .[] | select( .sku | contains ("16.04")) ]'
[
  {
    "offer": "UbuntuServer",
    "publisher": "Canonical",
    "sku": "16.04-LTS",
    "urn": "Canonical:UbuntuServer:16.04-LTS:16.04.201701130",
    "version": "16.04.201701130"
  },
  {
    "offer": "UbuntuServer",
    "publisher": "Canonical",
    "sku": "16.04.0-LTS",
    "urn": "Canonical:UbuntuServer:16.04.0-LTS:16.04.201604203",
    "version": "16.04.201604203"
  },
  {
    "offer": "UbuntuServer",
    "publisher": "Canonical",
    "sku": "16.04.0-LTS",
    "urn": "Canonical:UbuntuServer:16.04.0-LTS:16.04.201605161",
    "version": "16.04.201605161"
  },
  {
    "offer": "UbuntuServer",
    "publisher": "Canonical",
    "sku": "16.04.0-LTS",
    "urn": "Canonical:UbuntuServer:16.04.0-LTS:16.04.201609071",
    "version": "16.04.201609071"
  },
  {
    "offer": "UbuntuServer",
    "publisher": "Canonical",
    "sku": "16.04.0-LTS",
    "urn": "Canonical:UbuntuServer:16.04.0-LTS:16.04.201610200",
    "version": "16.04.201610200"
  }
]

After correcting the image version, the master VM extension deployment fails because the custom script exits with 1.

Checking /var/log/azure/cluster-provision.log, I see a lot of these:

+ /usr/local/bin/kubectl cluster-info
The connection to the server localhost:8080 was refused - did you specify the right host or port?
+ '[' 1 = 0 ']'
+ sleep 1
+ for i in '{1..600}'
+ '[' -e /usr/local/bin/kubectl ']'

Any hints where I should look to troubleshoot further?

@wangtt03
Contributor Author

It seems the apiserver did not start correctly. Please check whether the hyperkube Docker image was pulled; could you attach the docker ps output?

@benkoller

I suspect I mangled the location configuration. Which points are relevant for setting the target environment?

@benkoller

So, now I have manually adjusted the targetEnvironment and fqdnEndpointSuffix in azuredeploy.parameters.json to get this to work, as I still don't understand how and where you derive containerService.Location. I would really like to dig deeper into this, but right now I lack the time. Sorry about that.

How did you plan on setting the location during the template generation?

@raulfiru

@benkoller it worked like a charm for me... did you get the pull request as @colemickens wrote above (I had to change some paths), or did you use the master branch?

You seem to know what you are doing (much more than me, to be honest), but what I did was get the pull locally (overwriting master), then follow the install guide, where I overwrote:

%GOPATH%\src\github.com\Azure\acs-engine

with the git folder that has the pull, then followed all the other steps, and it worked...

@benkoller

I am on the sovereign_cloud branch and followed the above instructions to the letter. In the end all I had to do was manually change both the FQDN and the TargetEnv after template generation, but that doesn't seem like the method intended by @wangtt03.

I unfortunately can't deploy the first workloads today, but I'll deploy some stuff tomorrow. So far all seems fine though.

@raulfiru

That is strange indeed... I tried on 2 Azure Germany subscriptions.

On one it worked every time (tried 3 times). cluster-provision.log:

+ for i in '{1..600}'
+ '[' -e /usr/local/bin/kubectl ']'
+ /usr/local/bin/kubectl cluster-info
The connection to the server localhost:8080 was refused - did you specify the right host or port?
+ '[' 1 = 0 ']'
+ sleep 1
+ for i in '{1..600}'
+ '[' -e /usr/local/bin/kubectl ']'
+ /usr/local/bin/kubectl cluster-info
Kubernetes master is running at http://localhost:8080

On the other, also tried 3 times... side by side with different DNS names, and then I reused the exact same file and deleted the one in the other subscription (I thought it could be the names/certs):

+ sleep 1
+ for i in '{1..600}'
+ '[' -e /usr/local/bin/kubectl ']'
+ /usr/local/bin/kubectl cluster-info
The connection to the server localhost:8080 was refused - did you specify the right host or port?
+ '[' 1 = 0 ']'

Failed every time... same as @benkoller.

@colemickens
Contributor

It's almost surely because of ServicePrincipal problems. I'm guessing the SP that you are using only has permissions on one of the subscriptions.

@raulfiru

I'm using a different SP for each.

And everything else gets provisioned except k8s-master-12051246-0/cse0, which looks like a script that runs on the VM:

"commandToExecute": "[concat('/usr/bin/nohup /bin/bash -c \"/bin/bash /opt/azure/containers/provision.sh ',variables('tenantID'),' ',variables('subscriptionId'),' ',variables('resourceGroup'),' ',variables('location'),' ',variables('subnetName'),' ',variables('nsgName'),' ',variables('virtualNetworkName'),' ',variables('routeTableName'),' ',variables('primaryAvailablitySetName'),' ',variables('servicePrincipalClientId'),' ',variables('servicePrincipalClientSecret'),' ',variables('clientPrivateKey'),' ',variables('targetEnvironment'),' ',variables('networkPolicy'),' ',variables('fqdnEndpointSuffix'), ' >> /var/log/azure/cluster-provision.log 2>&1 &\" &')]"

It runs fine on the agents but not on the master. It hangs there until the loop (600 s) times out.

@colemickens
Contributor

@raulfiru Yes, it will fail if the SP is invalid. Please follow the troubleshooting steps here to rule out the SP credentials being invalid or having wrong permissions: https://github.com/Azure/acs-engine/blob/master/docs/kubernetes.md#misconfigured-service-principal

@raulfiru

You were right:

Apr 18 23:42:39 k8s-master-12051246-0 docker[6315]: E0418 23:42:39.796008    6659 kubelet_node_status.go:70] Unable to construct api.Node object for kubelet: failed to get external ID from cloud provider: autorest#WithErrorUnlessStatusCode: POST https://login.microsoftonline.de/b1904b97-e54a-4cb1-8031-6e578f3ad607/oauth2/token?api-version=1.0 failed with 400 Bad Request: StatusCode=400

I'll check the SP tomorrow and retry.

Thanks, and good night for now.
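
The 400 in the log above comes from the AAD token endpoint, which differs per sovereign cloud. A minimal illustrative mapping (login.microsoftonline.de matches the log above; the other hosts are the well-known login endpoints for the other clouds, and the helper name is not part of acs-engine):

```shell
# Map a target environment name to its AAD login host (illustrative helper).
aad_login_host() {
  case "$1" in
    AzureGermanCloud)       echo "login.microsoftonline.de" ;;
    AzureChinaCloud)        echo "login.chinacloudapi.cn" ;;
    AzureUSGovernmentCloud) echo "login.microsoftonline.us" ;;
    *)                      echo "login.microsoftonline.com" ;;
  esac
}

aad_login_host AzureGermanCloud
```

An SP credential that is valid against one cloud's endpoint will be rejected (as above) when the token request goes to another cloud's login host, which is why the targetEnvironment plumbing in this PR matters.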

@raulfiru

It works!

It was my mistake: in the other case I was using the SP name and not the ID.

Now both subscriptions work.

@colemickens
Contributor

@raulfiru and to confirm, you didn't have to edit anything for FQDN or ubuntu image? I'd like to understand why this isn't working for @benkoller, if that is indeed still the case.

@raulfiru

Regarding the Azure VM image: while the default image in azuredeploy.json is 16.04.201703070, the one in azuredeploy.parameters.json is 16.04.201701130, so it worked and I didn't change anything.

@benkoller

@colemickens I have to power down the current deployment and will be out of the office until Monday, so I can't check again. Until then I'd like to rule out that I missed something during template generation: there are no additional steps to perform apart from ./acs-engine examples/kubernetes.json, correct? Do I need to add a "location": "AzureGermanCloud" or similar to the kubernetes.json?

@wangtt03
Contributor Author

There should be a location field set to 'germanycentral' or 'germanynortheast'.

@benkoller

Can you share / add an updated examples/kubernetes.json?

@wangtt03
Contributor Author

{
  "apiVersion": "vlabs",
  "location": "germanycentral",
  "properties": {
    "orchestratorProfile": {
      "orchestratorType": "Kubernetes"
    },
    "masterProfile": {
      "count": 1,
      "dnsPrefix": "germany-m01",
      "vmSize": "Standard_D2_v2"
    },
    "agentPoolProfiles": [
      {
        "name": "agentpool1",
        "count": 3,
        "vmSize": "Standard_D2_v2",
        "availabilityProfile": "AvailabilitySet"
      }
    ],
    "linuxProfile": {
      "adminUsername": "azureuser",
      "ssh": {
        "publicKeys": [
          {
            "keyData": "<your_ssh_public_key>"
          }
        ]
      }
    },
    "servicePrincipalProfile": {
      "servicePrincipalClientID": "<your_client_id>",
      "servicePrincipalClientSecret": "<your_client_secret>"
    }
  }
}

Hope it helps!😊

@benkoller

Thanks @wangtt03, the deployment went without a hitch with the location field. I had used AzureGermanCloud instead of germanycentral; thanks for your example.

@jackfrancis
Member

@wangtt03 could you rebase this on master, or give me admin perms on ChinaCloudGroup:sovereign_cloud and I'd be happy to do it myself. Thanks either way!

@wangtt03
Contributor Author

@jackfrancis, I will rebase this PR, and I will also give you admin permission.

Your Name and others added 2 commits August 28, 2017 11:08
change default kubernetes image base to crproxy for China and delete template.go
@jackfrancis
Member

@wangtt03 I checked out the ChinaCloudGroup fork of acs-engine, but don't have permissions to push (I have some minor changes that fix make test-style warnings).

@ghost ghost assigned jackfrancis Aug 28, 2017
@jackfrancis
Member

@wangtt03 false alarm: I hadn't accepted the collaboration invite. All works now, thanks!

anhowe
anhowe previously requested changes Aug 28, 2017
Contributor

@anhowe anhowe left a comment


one change remaining, otherwise looks good.

fi
sleep 10
done
apt-get update
Contributor


All "apt" commands should have retries around them. Please see other areas of the code where we do this.

@@ -159,18 +161,18 @@ echo "$HOSTADDR $VMNAME" | sudo tee -a /etc/hosts

echo "Installing and configuring docker"

# simple general command retry function
retrycmd_if_failure() { for i in 1 2 3 4 5; do $@; [ $? -eq 0 ] && break || sleep 5; done ; }
Member


@anhowe See this function and the below invocations to address your retry apt commands suggestion.
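
For reference, the retry helper quoted above can be exercised like this (a sketch; the echo stands in for an apt-get invocation, which is how the script wraps it):

```shell
# Retry helper from the diff: try a command up to 5 times, sleeping 5s
# between failures, and stop on the first success.
retrycmd_if_failure() { for i in 1 2 3 4 5; do $@; [ $? -eq 0 ] && break || sleep 5; done ; }

# A command that succeeds immediately, standing in for e.g. `apt-get update`:
retrycmd_if_failure echo "apt-get update succeeded"
```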

@jackfrancis jackfrancis dismissed anhowe’s stale review August 28, 2017 22:41

retry for apt commands in place

@jackfrancis jackfrancis merged commit 678223c into Azure:master Aug 28, 2017
@ghost ghost removed the in progress label Aug 28, 2017
@ghost

ghost commented Aug 29, 2017

🚀

@raulfiru

Checking it today/tomorrow. Awesome!
