WinRM: Bails out with "[Errno 111] Connection refused" #25532

Closed
dagwieers opened this issue Jun 9, 2017 · 21 comments · May be fixed by diyan/pywinrm#174
Labels
affects_2.4 This issue/PR affects Ansible v2.4 bug This issue/PR relates to a bug. has_pr This issue has an associated PR. support:core This issue/PR relates to code supported by the Ansible Engineering Team. traceback This issue/PR includes a traceback. windows Windows community

Comments

@dagwieers
Contributor

dagwieers commented Jun 9, 2017

ISSUE TYPE
  • Bug Report
COMPONENT NAME

WinRM

ANSIBLE VERSION

v2.4

OS / ENVIRONMENT

Control master: RHEL7
Target nodes: Windows 2012R2 (with Powershell 4.0, also tried Powershell 5.1)

SUMMARY

I just hit another Connection refused. The task was waiting for 3 VMs to appear (wait_for_connection doing a win_ping test); the last VM to come online then gave me a Connection refused in the next task, which runs setup.

We are using CredSSP.

I wonder if we could retry longer, or with a delay, on Connection refused so that we survive such intermittent issues better.

It is not unlikely that during the first boot the WinRM service starts, stops and then starts again, causing the "Connection refused". However, we should recover from this situation when it occurs.
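To illustrate the start/stop/start scenario described above, here is a minimal Python sketch of a probe that only reports the port as ready after several consecutive successful TCP connections. This is not what wait_for_connection does; the function names, the number of checks and the intervals are all assumptions chosen for illustration.

```python
import socket
import time


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds, False otherwise."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers ConnectionRefusedError (errno 111 on Linux) as well as timeouts.
        return False


def wait_until_stable(host, port, checks=3, interval=5.0, deadline=900.0,
                      probe=port_is_open, sleep=time.sleep, clock=time.monotonic):
    """Wait until `checks` consecutive probes succeed.

    Returns True once the port has been reachable `checks` times in a row,
    or False if `deadline` seconds pass first. A single failed probe resets
    the streak, so a service that comes up, goes down and comes up again
    is not reported as ready too early.
    """
    start = clock()
    streak = 0
    while clock() - start < deadline:
        if probe(host, port):
            streak += 1
            if streak >= checks:
                return True
        else:
            streak = 0
        sleep(interval)
    return False
```

A call such as `wait_until_stable('1.2.3.103', 5986)` would then only return True after three successful probes five seconds apart. Note that this checks the TCP port only, not whether WinRM authentication works.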

TASK [Gathering Facts] ***********************************************************************************
Using module file /home/user/ansible.git/lib/ansible/modules/windows/setup.ps1
<1.2.3.101> ESTABLISH WINRM CONNECTION FOR USER: Administrator on PORT 5986 TO 1.2.3.101
Using module file /home/user/ansible.git/lib/ansible/modules/windows/setup.ps1
<1.2.3.103> ESTABLISH WINRM CONNECTION FOR USER: Administrator on PORT 5986 TO 1.2.3.103
Using module file /home/user/ansible.git/lib/ansible/modules/windows/setup.ps1
<1.2.3.102> ESTABLISH WINRM CONNECTION FOR USER: Administrator on PORT 5986 TO 1.2.3.102
EXEC (via pipeline wrapper)
EXEC (via pipeline wrapper)
EXEC (via pipeline wrapper)
The full traceback is:
Traceback (most recent call last):
  File "/home/user/ansible.git/lib/ansible/executor/task_executor.py", line 125, in run
    res = self._execute()
  File "/home/user/ansible.git/lib/ansible/executor/task_executor.py", line 526, in _execute
    result = self._handler.run(task_vars=variables)
  File "/home/user/ansible.git/lib/ansible/plugins/action/normal.py", line 45, in run
    results = merge_hash(results, self._execute_module(tmp=tmp, task_vars=task_vars, wrap_async=wrap_async))
  File "/home/user/ansible.git/lib/ansible/plugins/action/__init__.py", line 743, in _execute_module
    res = self._low_level_execute_command(cmd, sudoable=sudoable, in_data=in_data)
  File "/home/user/ansible.git/lib/ansible/plugins/action/__init__.py", line 892, in _low_level_execute_command
    rc, stdout, stderr = self._connection.exec_command(cmd, in_data=in_data, sudoable=sudoable)
  File "/home/user/ansible.git/lib/ansible/plugins/connection/winrm.py", line 337, in exec_command
    result = self._winrm_exec(cmd_parts[0], cmd_parts[1:], from_exec=True, stdin_iterator=self._wrapper_payload_stream(payload))
  File "/home/user/ansible.git/lib/ansible/plugins/connection/winrm.py", line 294, in _winrm_exec
    self.protocol.cleanup_command(self.shell_id, command_id)
  File "/usr/lib/python2.7/site-packages/winrm/protocol.py", line 307, in cleanup_command
    res = self.send_message(xmltodict.unparse(req))
  File "/usr/lib/python2.7/site-packages/winrm/protocol.py", line 207, in send_message
    return self.transport.send_message(message)
  File "/usr/lib/python2.7/site-packages/winrm/transport.py", line 184, in send_message
    response = self.session.send(prepared_request, timeout=self.read_timeout_sec)
  File "/usr/lib/python2.7/site-packages/requests/sessions.py", line 609, in send
    r = adapter.send(request, **kwargs)
  File "/usr/lib/python2.7/site-packages/requests/adapters.py", line 487, in send
    raise ConnectionError(e, request=request)
ConnectionError: HTTPSConnectionPool(host='38.38.12.3', port=5986): Max retries exceeded with url: /wsman (Caused by NewConnectionError('<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x38f9a50>: Failed to establish a new connection: [Errno 111] Connection refused',))
fatal: [VM3]: FAILED! => {
    "failed": true,
    "msg": "Unexpected failure during module execution.",
    "stdout": ""
}
ok: [VM1]
ok: [VM2]

This relates to #23320 (more examples from others there)

The playbook looks like this, and it fails on the setup task.

- name: Clone VM
  vmware_guest:
    hostname: '{{ vcenter_ipaddress }}'
    username: '{{ vcenter_username }}'
    password: '{{ vcenter_password }}'
    datacenter: '{{ vcenter_datacenter }}'
    resource_pool: '{{ vcenter_resource_pool }}'
    cluster: '{{ vcenter_cluster }}'
    folder: '{{ vcenter_folder }}'
    template: '{{ vcenter_template }}'
    name: '{{ inventory_hostname_short }}'
    state: poweredon
    validate_certs: no
    networks:
    - name: 'VLAN {{ vcenter_portgroup_prefix }}{{ pod_id }}'
      ip: '{{ ansible_host }}'
      netmask: '{{ vm_netmask }}'
      gateway: '{{ gw_ip }}'
      domain: '{{ domain }}'
      dns_servers: [ '{{ dns_ip }}' ]
    customization:
      autologon: yes
      #fullname: Administrator
      hostname: '{{ windows_shortname }}'
      orgname: '{{ windows_organization }}'
      password: '{{ windows_admin_password }}'
      productid: '{{ windows_product_id }}'
      runonce:
        - powershell.exe -ExecutionPolicy Unrestricted -File C:\Windows\Temp\ConfigureRemotingForAnsible.ps1 -CertValidityDays 3650 -EnableCredSSP -ForceNewSSLCert
      timezone: 105
  register: vm
  delegate_to: localhost

# Wait for system(s) to become reachable
- name: Wait for VM customizations
  wait_for_connection:
    delay: 240
    sleep: 15
    timeout: 900

- setup:
@ansibot ansibot added affects_2.4 This issue/PR affects Ansible v2.4 bug_report needs_triage Needs a first human triage before being processed. labels Jun 9, 2017
@jctanner jctanner added the windows Windows community label Jun 9, 2017
@jctanner jctanner removed the needs_triage Needs a first human triage before being processed. label Jun 9, 2017
@dagwieers
Contributor Author

The same issue appears consistently when installing SCVMM (in async mode). If you have the following tasks in a playbook:

- name: Transfer System-Center ISO
  win_get_url:
    url: '{{ binaries_source }}/mu_system_center_2012_r2_virtual_machine_manager_x86_and_x64_dvd_2913737.iso'
    dest: C:\Windows\Temp\mu_system_center_2012_r2_virtual_machine_manager_x86_and_x64_dvd_2913737.iso
    force: no
    skip_certificate_validation: yes

- name: Mount System-Center ISO image
  win_disk_image:
    image_path: 'C:\Windows\Temp\mu_system_center_2012_r2_virtual_machine_manager_x86_and_x64_dvd_2913737.iso'
    state: present
  register: iso

- name: Run System-Center installer
  win_command: >
    {{ iso.mount_path }}setup.exe /server /i /f "C:\Windows\Temp\VMServer.ini"
    /SqlDBAdminDomain "{{ dc }}" /SqlDBAdminName "{{ windows_admin_user }}" /SqlDBAdminPassword "{{ windows_admin_password }}"
    /VmmServiceDomain "{{ dc }}" /VmmServiceUserName "scvmmsvc" /VmmServiceUserPassword "{{ windows_admin_password }}"
    /IACCEPTSCEULA
  args:
    creates: 'C:\Program Files\Microsoft System Center 2012 R2\Virtual Machine Manager\bin\VmmAdminUi.exe'
  vars:
    ansible_user: '{{ dc }}\{{ windows_admin_user }}'
  when: not vmadminui.stat.exists
  register: systemcenter
  async: 1000
  poll: 15
  ignore_errors: yes

- name: Run System-Center installer (again)
  win_command: >
    {{ iso.mount_path }}setup.exe /server /i /f "C:\Windows\Temp\VMServer.ini"
    /SqlDBAdminDomain "{{ dc }}" /SqlDBAdminName "{{ windows_admin_user }}" /SqlDBAdminPassword "{{ windows_admin_password }}"
    /VmmServiceDomain "{{ dc }}" /VmmServiceUserName "scvmmsvc" /VmmServiceUserPassword "{{ windows_admin_password }}"
    /IACCEPTSCEULA
  args:
    creates: 'C:\Program Files\Microsoft System Center 2012 R2\Virtual Machine Manager\bin\VmmAdminUi.exe'
  vars:
    ansible_user: '{{ dc }}\{{ windows_admin_user }}'
  when: not vmadminui.stat.exists and systemcenter|failed
  async: 1000
  poll: 15

It fails consistently with a Connection refused:

TASK [windows-scvmm-server : Run System-Center installer] ******************************************************************
fatal: [bdsol-aci51-scvmm-01]: FAILED! => {"failed": true, "msg": "credssp: HTTPSConnectionPool(host='38.38.51.3', port=5986): Max retries exceeded with url: /wsman (Caused by NewConnectionError('<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x3b6f3d0>: Failed to establish a new connection: [Errno 111] Connection refused',)), ssl: HTTPSConnectionPool(host='38.38.51.3', port=5986): Max retries exceeded with url: /wsman (Caused by NewConnectionError('<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x3b65b10>: Failed to establish a new connection: [Errno 111] Connection refused',)), plaintext: HTTPSConnectionPool(host='38.38.51.3', port=5986): Max retries exceeded with url: /wsman (Caused by NewConnectionError('<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x3b5d9d0>: Failed to establish a new connection: [Errno 111] Connection refused',))"}
...ignoring

TASK [windows-scvmm-server : Run System-Center installer (again)] **********************************************************
fatal: [bdsol-aci51-scvmm-01]: UNREACHABLE! => {"changed": false, "msg": "credssp: HTTPSConnectionPool(host='38.38.51.3', port=5986): Max retries exceeded with url: /wsman (Caused by NewConnectionError('<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x3b68c50>: Failed to establish a new connection: [Errno 111] Connection refused',)), ssl: HTTPSConnectionPool(host='38.38.51.3', port=5986): Max retries exceeded with url: /wsman (Caused by NewConnectionError('<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x3b971d0>: Failed to establish a new connection: [Errno 111] Connection refused',)), plaintext: HTTPSConnectionPool(host='38.38.51.3', port=5986): Max retries exceeded with url: /wsman (Caused by NewConnectionError('<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x3b97350>: Failed to establish a new connection: [Errno 111] Connection refused',))", "unreachable": true}

@dagwieers
Contributor Author

dagwieers commented Jun 14, 2017

So I wrote a simple implementation that would retry 5 times for 5 seconds when dealing with Connection refused, and I ended up with reproducible HTTP 500 errors after that (with Windows 2012R2). When I then upgraded WMF/PS 4.0 to WMF/PS 5.1, the HTTP 500 errors were a thing of the past and the task succeeded!

But the task would only succeed if it was run in async mode; if not, I would get a problem related to a None value being provided to the ElementTree parser. I haven't looked into that issue.

So it seems that the HTTP 500 issues with WinRM (likely due to the WinRM service being restarted) disappear with WMF 5.1! (Potentially in other scenarios as well.)
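The kind of fixed-delay retry described in this comment could look roughly like the sketch below. It is not the actual patch: the function name is hypothetical, it assumes "5 times for 5 seconds" means five retries with a five-second pause between them, and it catches Python's built-in ConnectionRefusedError, whereas the traceback in this issue shows a requests ConnectionError, which a real implementation would have to catch instead.

```python
import time


def call_with_retries(func, retries=5, delay=5.0, sleep=time.sleep):
    """Call func(); on ConnectionRefusedError wait `delay` seconds and try again.

    Makes one initial attempt plus up to `retries` further attempts.
    If the last attempt is still refused, the exception is re-raised.
    """
    attempt = 0
    while True:
        try:
            return func()
        except ConnectionRefusedError:
            if attempt >= retries:
                raise
            attempt += 1
            sleep(delay)
```

With the defaults, a target that stays down makes the caller wait about 25 seconds before the error surfaces. Any other exception (an HTTP 500, for instance) is not retried and propagates immediately.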

@ansibot ansibot added the support:core This issue/PR relates to code supported by the Ansible Engineering Team. label Jun 29, 2017
@dagwieers dagwieers changed the title WinRM: During use bails out with "Connection refused" WinRM: Bails out with "[Errno 111] Connection refused" Jul 14, 2017
@ansibot
Contributor

ansibot commented Jan 8, 2018

@dagwieers
Contributor Author

dagwieers commented Oct 7, 2018

I updated my patch to pywinrm to recover from this at: diyan/pywinrm#174

Now you can set reconnection_retries and reconnection_backoff (e.g. to 4 retries and 2.0 seconds, respectively) to recover from temporary Connection refused situations. This can recover from e.g. installing SCVMM (which apparently makes WinRM unavailable for a short while). The backoff periods are then 2, 4, 8 and 16 seconds (30 seconds in total).
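The arithmetic behind those backoff periods can be reproduced with a few lines of Python. This only mirrors the numbers quoted in the comment (each delay is double the previous one, starting at the backoff value); the function name is hypothetical, and the real schedule depends on the urllib3 version in use and may differ, for example by not waiting before the first retry.

```python
def backoff_schedule(retries, backoff):
    """Delays in seconds before each retry: backoff * 2**n for n = 0 .. retries-1."""
    return [backoff * (2 ** n) for n in range(retries)]


delays = backoff_schedule(4, 2.0)
print(delays)       # [2.0, 4.0, 8.0, 16.0]
print(sum(delays))  # 30.0
```

So with 4 retries and a 2.0 second backoff, a target that stays unreachable costs roughly 30 seconds of waiting on top of the connection attempts themselves.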

@dagwieers
Contributor Author

I also implemented the same solution for pypsrp now at: jborean93/pypsrp#10

@dagwieers
Contributor Author

dagwieers commented Oct 9, 2018

Here is a quick-fix for hand-editing your pywinrm installation:

--- a/winrm/transport.py
+++ b/winrm/transport.py
@@ -158,6 +158,16 @@ class Transport(object):
         settings = session.merge_environment_settings(url=self.endpoint, proxies={}, stream=None,
                                                       verify=None, cert=None)
 
+        # Retry on connection errors, with a backoff factor
+        retries = requests.packages.urllib3.util.retry.Retry(total=4,
+                                                             connect=4,
+                                                             status=4,
+                                                             read=0,
+                                                             backoff_factor=2.0,
+                                                             status_forcelist=(413, 425, 429, 503))
+        session.mount('http://', requests.adapters.HTTPAdapter(max_retries=retries))
+        session.mount('https://', requests.adapters.HTTPAdapter(max_retries=retries))
+
         # get proxy settings from env
         # FUTURE: allow proxy to be passed in directly to supersede this value
         session.proxies = settings['proxies']

Or your pypsrp installation:

--- a/pypsrp/wsman.py
+++ b/pypsrp/wsman.py
@@ -773,6 +773,18 @@ class _TransportHTTP(object):
         elif self.no_proxy:
             session.proxies = orig_proxy
 
+        # Retry on connection errors, with a backoff factor
+        retries = requests.packages.urllib3.util.retry.Retry(
+            total=4,
+            connect=4,
+            status=4,
+            read=0,
+            backoff_factor=2.0,
+            status_forcelist=(413, 425, 429, 503),
+        )
+        session.mount('http://', requests.adapters.HTTPAdapter(max_retries=retries))
+        session.mount('https://', requests.adapters.HTTPAdapter(max_retries=retries))
+
         # set cert validation config
         session.verify = self.cert_validation

This will implement 4 retries with an exponential back-off, using a back-off factor of 2.0.

Please test and report back.

@deeco

deeco commented Jul 27, 2019

I updated my patch to pywinrm to recover from this at: diyan/pywinrm#174

Now you can set reconnection_retries and reconnection_backoff (e.g. resp to 4 retries and 2.0 seconds) to recover from temporary Connection Refused situations. This can recover from e.g. installing SCVMM (which apparently makes WinRM unavailable for a short while). The backoff period is 2, 4, 8, 16 (=30) seconds.

How can I implement or ensure 3-5 retries for winrm playbooks? I am getting random 104 errors across deploys in Azure.

I can only see reconnection_retries under the psrp options.

@ansibot ansibot added the has_pr This issue has an associated PR. label Jul 27, 2019
@ullibo

ullibo commented Sep 3, 2019

Thanks to @dagwieers!
I had the same problem with winrm (0.3.0) and ansible (2.8.4). First I increased ansible_winrm_connection_timeout, ansible_winrm_read_timeout_sec and ansible_winrm_operation_timeout_sec, but that didn't solve the connection retry problem, e.g. after a reboot.
After patching ../ansible-venv/lib/python2.7/site-packages/winrm/transport.py (CentOS 7) the reconnect works fine. Hopefully this will be patched in future releases of pywinrm so it can be configured through variables.

@AL71B

AL71B commented Dec 31, 2019

@ullibo You mentioned you patched transport.py. What was the patch?

Many Thanks
Alan

@ullibo

ullibo commented Jan 17, 2020

@AL71B: the lines between "#UKI" and "# UKI END" in
.../ansible-venv/lib/python2.7/site-packages/winrm/transport.py

....

    # configure proxies from HTTP/HTTPS_PROXY envvars
    session.trust_env = True
    settings = session.merge_environment_settings(url=self.endpoint, proxies={}, stream=None,
                                                  verify=None, cert=None)

    #UKI
    # Retry on connection errors, with a backoff factor
    retries = requests.packages.urllib3.util.retry.Retry(total=4,
                                                         connect=4,
                                                         status=4,
                                                         read=0,
                                                         backoff_factor=2.0,
                                                         status_forcelist=(413, 425, 429, 503))
    session.mount('http://', requests.adapters.HTTPAdapter(max_retries=retries))
    session.mount('https://', requests.adapters.HTTPAdapter(max_retries=retries))

    # UKI END

    # we're only applying proxies and/or verify from env, other settings are ignored
    session.proxies = settings['proxies']

@marshall-brown

Hello everyone. Since SSH can now be installed on Windows, my team is in the process of switching over to OpenSSH on Windows, as these issues are resolved by switching to SSH. I just wanted to let everyone know, to ease their pain with WinRM. Running Ansible over SSH to Windows is a dramatic improvement, with everything that comes with SSH. Please also check out Mitogen for Ansible, a performance add-on for SSH.

Mitogen: https://mitogen.networkgenomics.com/ansible_detailed.html

Read Up: https://docs.microsoft.com/en-us/windows-server/administration/openssh/openssh_overview

GitHub Project: https://github.com/PowerShell/openssh-portable

@jborean93
Contributor

Closing, as we have a few workarounds for winrm and the retry options for psrp, and ssh is now a solution as well. Unfortunately there is not much else we can do for this issue.

@ansible ansible locked and limited conversation to collaborators Mar 3, 2020