Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug] zos_operator_action_query reported UnicodeDecodeError when the outstanding messages contain special characters #776

Closed
2 tasks done
cdlwhui opened this issue May 22, 2023 · 22 comments
Assignees
Labels
Bug Something isn't working as designed. In Plan Issue has been accepted put into a planned release Waiting for Response Waiting for a response from the issue author

Comments

@cdlwhui
Copy link

cdlwhui commented May 22, 2023

Is there an existing issue for this?

  • There are no existing issues.

Are the dependencies a supported version?

  • The dependencies are supported.

IBM Z Open Automation Utilities

v1.2.2

IBM Enterprise Python

v3.11.x

IBM z/OS Ansible core Version

v1.5.0

ansible-version

v2.12.x

z/OS version

v2.5

Ansible module

zos_operator_action_query

Bug description

It failed when using zos_operator_action_query to check a wtor in the following playbook:

  - name: check wtor
    zos_operator_action_query:
      message_id: GEO045A

At the failure time, the outstanding messages in the sysplex are as below.

3414 R K31      *3414 GEO045A Op-assist OK/NOK: ENSURE ALL     
                WORKLOADS ARE DOWN                             
3026 R TP3      *3026 DSI802A SRVS9    REPLY WITH VALID NCCF   
                SYSTEM OPERATOR COMMAND                        
3177 R ST8      *3177 DSI802A SRVS8    REPLY WITH VALID NCCF   
                SYSTEM OPERATOR COMMAND                        
4847 R ST7      *4847 DSI802A SRVS7    REPLY WITH VALID NCCF   
                SYSTEM OPERATOR COMMAND                        
3193 R ST6      *3193 DSI802A SRVS6    REPLY WITH VALID NCCF   
                SYSTEM OPERATOR COMMAND                        
5252 R ST5      *5252 DSI802A SRVS5    REPLY WITH VALID NCCF   
                SYSTEM OPERATOR COMMAND                        
3800 R ST4      *3800 DSI802A SRVS4    REPLY WITH VALID NCCF   
                SYSTEM OPERATOR COMMAND                        
5770 R ST3      *5770 DSI802A SRVS3    REPLY WITH VALID NCCF   
                SYSTEM OPERATOR COMMAND                        
3814 R ST2      *3814 DSI802A SRVS2    REPLY WITH VALID NCCF        
                SYSTEM OPERATOR COMMAND                             
4997 R ST1      *4997 DSI802A SRVS1    REPLY WITH VALID NCCF        
                SYSTEM OPERATOR COMMAND                             
3424 R K32      *3424 CHKLEVEL2: TESTCASE ACTIVE, REPLY TO THIS     
                WTOR TO CANCEL (V03)                                
3423 R K32      *3423 DSI802A SRV32    REPLY WITH VALID NCCF        
                SYSTEM OPERATOR COMMAND                             
3408 R K31      *3408 CHKLEVEL2: TESTCASE ACTIVE, REPLY TO THIS     
                WTOR TO CANCEL (V03)                                
3407 R K31      *3407 DSI802A SRV31    REPLY WITH VALID NCCF        
                SYSTEM OPERATOR COMMAND                             
3392 R ST3      *3392 CHKLEVEL2: TESTCASE ACTIVE, REPLY TO THIS     
                WTOR TO CANCEL (V03)                                
3158 R ST1      *3158  0-     i0    -      a&}  & } q1}  Q&-        
                &   &  y   {P  { {k  CoF {&  H   4&  }              
                 $    $       -                                     
2982 R ST7      *2982 DFS996I *IMS READY*  IMS7                     
2949 R ST5      *2949 DFS996I *IMS READY*  IMS5                     
2935 R ST8      *2935 DFS996I *IMS READY*  IMS8                     
2966 R ST6      *2966 DFS996I *IMS READY*  IMS6                     
2997 R ST2      *2997 DFS996I *IMS READY*  IMS2                    
3043 R ST1      *3043 DFS996I *IMS READY*  IMS1                    
3040 R ST1      *3040 CHKLEVEL2: TESTCASE ACTIVE, REPLY TO THIS    
                WTOR TO CANCEL (V03)                               
2960 R ST6      *2960 CHKLEVEL2: TESTCASE ACTIVE, REPLY TO THIS    
                WTOR TO CANCEL (V03)                               
2992 R ST2      *2992 CHKLEVEL2: TESTCASE ACTIVE, REPLY TO THIS    
                WTOR TO CANCEL (V03)                               
2929 R ST8      *2929 CHKLEVEL2: TESTCASE ACTIVE, REPLY TO THIS    
                WTOR TO CANCEL (V03)                               
2944 R ST5      *2944 CHKLEVEL2: TESTCASE ACTIVE, REPLY TO THIS    
                WTOR TO CANCEL (V03)                               
3008 R ST4      *3008 CHKLEVEL2: TESTCASE ACTIVE, REPLY TO THIS    
                WTOR TO CANCEL (V03)                               
3024 R TP3      *3024 CHKLEVEL2: TESTCASE ACTIVE, REPLY TO THIS    
                WTOR TO CANCEL (V03)                               
2977 R ST7      *2977 CHKLEVEL2: TESTCASE ACTIVE, REPLY TO THIS    
                WTOR TO CANCEL (V03)                               

The playbook can be run successfully when there are no following messages which should be the root cause. The special characters caused the error and it block the whole playbook running and caused the automation job break.
Is there a way to handle the message containing the special charaters without breaking the task running?

3158 R ST1      *3158  0-     i0    -      a&}  & } q1}  Q&-        
                &   &  y   {P  { {k  CoF {&  H   4&  }              
                 $    $    

Playbook verbosity output.

{
  "changed": false,
  "msg": "An unexpected error occurred: UnicodeDecodeError('utf-8', b'ST1        2023142  01:34:54.00             ISF031I CONSOLE ANSI0000 ACTIVATED\\nST1        2023142  01:34:54.00            -D R,A,S \\nST1        2023142  01:34:54.00             IEE112I 01.34.54 PENDING REQUESTS 929\\n                                           RM=31   IM=0     CEM=0     EM=0     RU=0    IR=0    AMRF\\n                                           ID:R/K     T SYSNAME  JOB ID   MESSAGE TEXT\\n                                                 3414 R K31      STC03501 *3414 GEO045A Op-assist OK/NOK: ENSURE\\n                                                                          ALL WORKLOADS ARE DOWN\\n                                                 3026 R TP3               *3026 DSI802A SRVS9    REPLY WITH VALID\\n                                                                           NCCF SYSTEM OPERATOR COMMAND\\n                                                 3177 R ST8               *3177 DSI802A SRVS8    REPLY WITH VALID\\n                                                                           NCCF SYSTEM OPERATOR COMMAND\\n                                                 4847 R ST7               *4847 DSI802A SRVS7    REPLY WITH VALID\\n                                                                           NCCF SYSTEM OPERATOR COMMAND\\n                                                 3193 R ST6               *3193 DSI802A SRVS6    REPLY WITH VALID\\n                                                                           NCCF SYSTEM OPERATOR COMMAND\\n                                                 5252 R ST5               *5252 DSI802A SRVS5    REPLY WITH VALID\\n                                                                           NCCF SYSTEM OPERATOR COMMAND\\n                                                 3800 R ST4      STC57876 *3800 DSI802A SRVS4    REPLY WITH VALID\\n                                                                           NCCF SYSTEM OPERATOR COMMAND\\n                                                 5770 R ST3               *5770 DSI802A SRVS3    REPLY WITH VALID\\n                                                                           NCCF SYSTEM OPERATOR COMMAND\\n                                                 3814 R ST2               *3814 DSI802A SRVS2    REPLY WITH VALID\\n                                                                           NCCF SYSTEM OPERATOR COMMAND\\n                                                 4997 R ST1               *4997 DSI802A SRVS1    REPLY WITH VALID\\n                                                                           NCCF SYSTEM OPERATOR COMMAND\\n                                                 3424 R K32      STC18246 *3424 CHKLEVEL2: TESTCASE ACTIVE, REPLY\\n                                                                           TO THIS WTOR TO CANCEL (V03)\\n                                                 3423 R K32               *3423 DSI802A SRV32    REPLY WITH VALID\\n                                                                           NCCF SYSTEM OPERATOR COMMAND\\n                                                 3408 R K31      STC03412 *3408 CHKLEVEL2: TESTCASE ACTIVE, REPLY\\n                                                                           TO THIS WTOR TO CANCEL (V03)\\n                                                 3407 R K31               *3407 DSI802A SRV31    REPLY WITH VALID\\n                                                                           NCCF SYSTEM OPERATOR COMMAND\\n                                                 3392 R ST3      STC33784 *3392 CHKLEVEL2: TESTCASE ACTIVE, REPLY\\n                                                                           TO THIS WTOR TO CANCEL (V03)\\n                                                 3158 R ST1               *3158 \\xe50-\\x94  \\x02 i0 \\x97\\xd7\\xd9-\\x90\\x8f\\x11\\x8e\\xcc\\x18a&}\\xd8\\x9c&\\xd8}\\x97q1}\\n                                                                          \\x10\\x18Q&-\\xd8\\xa9\\x87\\x93&\\x90\\xd8\\xbd&\\x90\\xd8y\\xa0\\x10\\xd8{P\\x13\\xd8{\\xd8{k\\x01\\xd8CoF\\xd8{&\\xd8\\xd8H\\n                                                                          \\xa0\\xf8\\xc24&\\xf8\\xd8}\\x87 \\xec\\xf8 \\x10\\xec\\xb0\\xcb\\x9c\\xec\\\\\\xb0\\xac\\xa5\\x18\\\\ \\x18\\xf5\\x12\\xf6\\xe5\\xf8-\\xb4\\xa0\\x10\\xc0\\xb5\\x87\\n                                                                           \\n                                                 2982 R ST7      IMS7PL3  *2982 DFS996I *IMS READY*  IMS7\\n                                                 2949 R ST5      STC95265 *2949 DFS996I *IMS READY*  IMS5\\n                                                 2935 R ST8      IMS8PL3  *2935 DFS996I *IMS READY*  IMS8\\n                                                 2966 R ST6      IMS6PL3  *2966 DFS996I *IMS READY*  IMS6\\n                                                 3013 R ST4      STC95255 *3013 DFS996I *IMS READY*  IMS4\\n                                                 2997 R ST2      STC95225 *2997 DFS996I *IMS READY*  IMS2\\n                                                 3043 R ST1      STC95218 *3043 DFS996I *IMS READY*  IMS1\\n                                                 3040 R ST1      STC90525 *3040 CHKLEVEL2: TESTCASE ACTIVE, REPLY\\n                                                                           TO THIS WTOR TO CANCEL (V03)\\n                                                 2960 R ST6      CHKLEVL  *2960 CHKLEVEL2: TESTCASE ACTIVE, REPLY\\n                                                                           TO THIS WTOR TO CANCEL (V03)\\n                                                 2992 R ST2      STC87177 *2992 CHKLEVEL2: TESTCASE ACTIVE, REPLY\\n                                                                           TO THIS WTOR TO CANCEL (V03)\\n                                                 2929 R ST8      CHKLEVL  *2929 CHKLEVEL2: TESTCASE ACTIVE, REPLY\\n                                                                           TO THIS WTOR TO CANCEL (V03)\\n                                                 2944 R ST5      STC85276 *2944 CHKLEVEL2: TESTCASE ACTIVE, REPLY\\n                                                                           TO THIS WTOR TO CANCEL (V03)\\n                                                 3008 R ST4      STC87793 *3008 CHKLEVEL2: TESTCASE ACTIVE, REPLY\\n                                                                           TO THIS WTOR TO CANCEL (V03)\\n                                                 3024 R TP3      STC88164 *3024 CHKLEVEL2: TESTCASE ACTIVE, REPLY\\n                                                                           TO THIS WTOR TO CANCEL (V03)\\n                                                 2977 R ST7      CHKLEVL  *2977 CHKLEVEL2: TESTCASE ACTIVE, REPLY\\n                                                                           TO THIS WTOR TO CANCEL (V03)\\n', 3743, 3744, 'invalid continuation byte')",
  "invocation": {
    "module_args": {
      "message_id": "GEO045A",
      "system": null,
      "job_name": null,
      "message_filter": null
    }
  },
  "_ansible_no_log": false
}

Ansible configuration.

No response

Contents of the inventory

No response

Contents of group_vars or host_vars

No response

@cdlwhui cdlwhui added the Bug Something isn't working as designed. label May 22, 2023
@ddimatos
Copy link
Collaborator

ddimatos commented Jun 28, 2023

@cdlwhui - do you happen to have a full traceback (-vvvv) log, I wanted to see where in the stack it occurred because we have come across this (not for this module) in ansible-core later versions and are looking into it. A full traceback would really help pinpoint which part of the stack to focus on.

I suppose there is more to this block we can't see, eg a x01 (Start of Header) because the below seems to have been mapped to a UTF-8 code point.

*3158  0-     i0    -      a&}  & } q1}  Q&-        
                &   &  y   {P  { {k  CoF {&  H   4&  }              
                 $    $       -   

Although not a solution and you probably know this, but you could add ignore_errors: true to avoid the playbook failure in this case.

@ddimatos ddimatos added Needs Triage Issue need assessment by a team member(s) Waiting for Response Waiting for a response from the issue author labels Jun 28, 2023
@cdlwhui
Copy link
Author

cdlwhui commented Jun 28, 2023

I do not have full traceback log currently but I can collect it when we hit the issue next time. I think I can also try creating an outstanding message with the same special content above to recreate the problem.
I will provide the log if I can get.

Yes, ignore_errors: true would avoid the playbook failure, but in our playbook the next step will reply the queried outstanding message. So if the query have no normal response, the next step will have to be pending or notify people for manual reply. So we want to see if it could ignore the message containing with special characters and return the other normal message match the query.

@ddimatos
Copy link
Collaborator

ddimatos commented Jun 28, 2023

I could also inject the noted block (below) but those seem to be mapped to a utf-8 code set, its what i can't see that is likely what is causing the issue, for example, you can't see x01.

Any byte stream which cannot be assigned to a UTF-8 text encoding will break the reading of it, this is why I think there is more not visible.

I have an idea where it is happening, we can work from that for now, but if you do get a log that would be super helpful.

*3158  0-     i0    -      a&}  & } q1}  Q&-        
                &   &  y   {P  { {k  CoF {&  H   4&  }              
                 $    $       -   

My assumption without a recreate or backtrace is that this occurred when the response came back and was read by python which encountered a byte that can not be assigned to UTF-8. Thus adding a 'waiting on dependency' tag.

@ddimatos ddimatos added Waiting on dependency This issue requires a change on a dependnecy out of scop for this team but tracked by this team and removed Waiting for Response Waiting for a response from the issue author Needs Triage Issue need assessment by a team member(s) labels Jun 28, 2023
@ddimatos
Copy link
Collaborator

ddimatos commented Jun 28, 2023

@cdlwhui - the output you shared above, it did not obviously come back from Ansible since it broke, how did you get that output? Is that from a x3270? Next time it happens in addition to a traceback could you also run these commands from x3270 d r,a,s and d r,a,jn?

@AnthonyGiorgio
Copy link

I'm curious as to how you retrieved the sysplex messages. Did you issue an operator command somehow and copy and paste the output? Was it from SDSF, ZOAU opercmd, or from the console directly? It appears that a corrupted message is on the queue somehow, awaiting a response.

@ddimatos ddimatos added the Waiting for Response Waiting for a response from the issue author label Jun 28, 2023
@cdlwhui
Copy link
Author

cdlwhui commented Jun 29, 2023

Yes, I had a test to generate the message but it did not hit the issue, query using zos_operator_action_query successed.
The message caused the issue was from one of the old workloads in our enviroment. Alought the visible content between it and the message I generated are the same, the difference is the message I generated can be replied and clear, the message caused the issue can not be replied until the system is shutted down and restarted.

I got the message using 'd r,r' command. I can run the commands d r,a,s and d r,a,jn to get the output when hit next time.

@cdlwhui
Copy link
Author

cdlwhui commented Jun 29, 2023

I'm curious as to how you retrieved the sysplex messages. Did you issue an operator command somehow and copy and paste the output? Was it from SDSF, ZOAU opercmd, or from the console directly? It appears that a corrupted message is on the queue somehow, awaiting a response.

I issued operator command 'd r,r' from the console and copied the output. The corrupted message is on the queue but there is nothing happened when just replying it. It disappeared until the LPAR is shutted down and restarted.

@AnthonyGiorgio
Copy link

Thank you, that's good to know. It means that ZOAU isn't corrupting anything, but instead is just processing whatever is on the message queue.

I think there is a bigger question as to what can be done with "unprintable" characters here. ZOAU opercmd just prints them to the terminal as-is, other than converting NULL (0x0) bytes to EBCDIC spaces (0x40). I think an argument could be made to handle this better from Python, which doesn't like characters that can't be converted into a string.

@ddimatos
Copy link
Collaborator

The teams will be working on resolving this; whenever on z/OS we always have to take into account any byte stream which cannot be assigned to a UTF-8 text encoding will break the contract imposed on us by python which text must be UTF-8 assignable.

Thank you @cdlwhui , and if you encounter this again, the requested information could be of help to us while the teams try to resolve this. We understand the importance of this.

@cdlwhui
Copy link
Author

cdlwhui commented Jun 30, 2023

The teams will be working on resolving this; whenever on z/OS we always have to take into account any byte stream which cannot be assigned to a UTF-8 text encoding will break the contract imposed on us by python which text must be UTF-8 assignable.

Thank you @cdlwhui , and if you encounter this again, the requested information could be of help to us while the teams try to resolve this. We understand the importance of this.

You are welcome.
Sure, I will remember to collect the requested info when hitting this again.

@ddimatos ddimatos added this to the [Backlog] Bugs milestone Jul 12, 2023
@ddimatos ddimatos removed the Waiting for Response Waiting for a response from the issue author label Jul 12, 2023
@ddimatos ddimatos moved this from ⚙ Backlog to 📗In plan in IBM Ansible z/OS Core Collection Sep 26, 2023
@ddimatos ddimatos added the In Plan Issue has been accepted put into a planned release label Sep 26, 2023
@ddimatos
Copy link
Collaborator

ddimatos commented Oct 3, 2023

See the following issues below, they are all related to non-printable UTF-8 chars, essentially chararcetors that don't don't correspond to a UTF-8 value. This is being addressed as noted in ZOAU 1.3 and IBM z/OS Core collection will adopt ZOAU 1.3 in Q1 2024. There is a recreate involving a job submission you can see here. This work is also being tracked in JIRA 9687.

@ddimatos
Copy link
Collaborator

ddimatos commented Jan 3, 2024

@cdlwhui - This work has been scheduled to be tested in iteration 4 (Feb 26th) approximately, reading over the history I don't see that we were able to recreate this, if we knew the unprintable character, we could simulate the response.

When it comes time to evaluate this issue in iteration 4 we have 2 choices, recreate by understanding the unprintable char or if you could take an early development build and evaluate it yourself.

Otherwise, we will have to close this issue and assume since this overlaps with other code we were able to resolve it.

@ddimatos ddimatos added the Waiting for Response Waiting for a response from the issue author label Jan 3, 2024
@cdlwhui
Copy link
Author

cdlwhui commented Jan 3, 2024

Thank you for the update @ddimatos .
Yes, we have not hit this issue again by now. We can take the build and evaluate ourselves but can not guarantee the evaluation complete in iteration 4 as it need to hit the issue.

If you could not recreate, I think it should be ok to close this and we will keep monitoring in our environment with the fix.

@ddimatos ddimatos removed Waiting on dependency This issue requires a change on a dependnecy out of scop for this team but tracked by this team Waiting for Response Waiting for a response from the issue author labels Jan 3, 2024
@ddimatos
Copy link
Collaborator

ddimatos commented Jan 8, 2024

@cdlwhui - all I can think of is putting a WTO on the console using some C then reading them, such a test case would probably have to go into our backlog for the time being.

@AnthonyGiorgio
Copy link

The ZOAU command opercmd "d ssi" sometimes includes unprintable characters in the output. I don't know how consistent that behavior is, but I see at least on my test system.

@rexemin
Copy link
Collaborator

rexemin commented Mar 27, 2024

Was unable to test this on our systems. Display commands didn't return unprintable characters and seems like WTO messages automatically filter any that we may introduce ourselves:

When a program issues a WTO macro, the system translates the text; only standard printable EBCDIC characters are passed to MCS-managed display devices. The EBCDIC characters that can be displayed are listed in z/OS MVS Programming: Assembler Services Guide. All other characters are replaced by blanks.

From the WTO docs under the Parameters section.

For now I would ask you @cdlwhui if you could take an early build of v1.10.0-beta.1 and test on your end, if you have the opportunity.

@rexemin rexemin moved this from 🏗 In progress to ❓ Waiting response in IBM Ansible z/OS Core Collection Mar 27, 2024
@rexemin rexemin added the Waiting for Response Waiting for a response from the issue author label Mar 27, 2024
@AnthonyGiorgio
Copy link

I've managed to get unprintable characters to display to the console, but that's usually because the offending code has a bug and is displaying raw memory or something. I think there is some filtering going on, but it's not at all layers of the stack. ZOAU has recently started filtering opercmd output to remove unprintable characters. There are also some recent APARs that filter all characters coming across the Python interface.

@cdlwhui
Copy link
Author

cdlwhui commented Apr 1, 2024

Was unable to test this on our systems. Display commands didn't return unprintable characters and seems like WTO messages automatically filter any that we may introduce ourselves:

When a program issues a WTO macro, the system translates the text; only standard printable EBCDIC characters are passed to MCS-managed display devices. The EBCDIC characters that can be displayed are listed in z/OS MVS Programming: Assembler Services Guide. All other characters are replaced by blanks.

From the WTO docs under the Parameters section.

For now I would ask you @cdlwhui if you could take an early build of v1.10.0-beta.1 and test on your end, if you have the opportunity.

@rexemin We can take an early build v1.10.0-beta.1 and test on our environment. We have not hit the WTO message for months but it was just the time it occurred again last week. We suspected it might be caused by a RRS test case but has not determined which one. We opened a case to ask for help to identify the cause. If we can recreate it, we will be able to test soon, otherwise, we may need to keep monitoring and test when it occurred again.

For now, I would ask how to get the early build v1.10.0-beta.1? And does it require a specific ZOAU version?

@ddimatos
Copy link
Collaborator

ddimatos commented Apr 3, 2024

@cdlwhui - our dev always has the next release, in this case 1.10.0-beta.1, it still has about 3 weeks of development left for the quarter but is stable, for this offfering, dev is always stable.

You can install directly from our repo with either of these commands:

ansible-galaxy collection install git+https://github.com/ansible-collections/ibm_zos_core.git,dev
ansible-galaxy collection install [email protected]:ansible-collections/ibm_zos_core.git,dev

This version will minimally require zoau 1.3.0.
I hope this helps.

@cdlwhui
Copy link
Author

cdlwhui commented Apr 3, 2024

Thank you @ddimatos. Is it able to install this dev version through collections/requirements.yml in ansible AWX?

@ddimatos
Copy link
Collaborator

ddimatos commented Apr 4, 2024

@cdlwhui - while I have not tried to install a development branch using collections/requirements.yml this is possible, I think you would need to clone our repo and adjust the collections/requirements.yml file to something like:

collections:
  - name: https://github.com/cdlwhui/ansible-collections/ibm_zos_core.git
    type: git
    version: dev

Steps:

  1. git clone [email protected]:ansible-collections/ibm_zos_core.git
  2. git checkout dev
  3. <edit file the file collections/requirments.yml to what is shared above>
  4. git add .
  5. git commit -m "update collections file to point to dev"
  6. git push

I believe the current behavior of collections/requirements.yml is to install the latest release, the dev branch is not released or tagged yet so I don't think you can do that with out current collections/requirements.yml.

@fernandofloresg
Copy link
Collaborator

This issue has been without update from the user for more than 3 months and still waiting on response, hence I'm closing it due to inactivity.

@ddimatos ddimatos moved this from ❓ Waiting response to ❌ Cancelled in IBM Ansible z/OS Core Collection Aug 15, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Something isn't working as designed. In Plan Issue has been accepted put into a planned release Waiting for Response Waiting for a response from the issue author
Projects
Development

No branches or pull requests

6 participants