Reading the mainframe files data as text field irrespective of the copybook data type #291

Closed
ghost opened this issue May 14, 2020 · 17 comments
Assignees
Labels
accepted Accepted for implementation enhancement New feature or request

Comments

@ghost

ghost commented May 14, 2020

Background

When Cobrix reads data that does not match the data type declared in the copybook, the invalid value becomes null.

Feature

Can we read all the data as text fields, irrespective of the copybook data type, so that no data is lost while reading? It would be just like reading a CSV file with every column typed as a string.


@ghost ghost added the enhancement New feature or request label May 14, 2020
@ghost ghost changed the title Reading the mainframe files data all as text field irrespective of the copybook data type Reading the mainframe files data as text field irrespective of the copybook data type May 14, 2020
@yruslan
Collaborator

yruslan commented May 14, 2020

This is a tricky question. The simple answer is no, we currently do not support this, and I'm not sure which way of supporting it would actually be helpful.

COBOL has a system of types and formats and some of these types have values that do not have meaning (semantic mapping).

Let's consider a few approaches.

  1. Decoding the data type format. For instance, binary-coded decimal (BCD) maps the HEX bytes 54 33 21 to the number 543321. So if the copybook defines the field as a BCD-encoded number, Cobrix converts it to a number. But if the value 54 A3 21 is encountered, it is invalid from the BCD perspective since 'A' is not a decimal digit, so there is no mapping from that value to a number.

  2. Leaving raw values. If a field is defined as a BCD number but we always put raw values into the field, the BCD number '54 33 21' becomes a string of 3 characters: 0x54, 0x33, 0x21. And while a string can hold any byte sequence, the field loses its meaning as a number.

  3. Leaving raw values only for incorrect data. We can write numbers as numbers when decoding is possible and write raw values otherwise. But in this case we lose the ability to identify which values are correct.

  4. Additional debugging fields. We have .option("debug", "true"), which generates additional fields containing the original data in HEX encoding. We can add an option to create these fields as actual raw values instead of HEX. This might be a viable solution.
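To make approach 1 concrete, here is a minimal Python sketch of unsigned BCD decoding. It is not Cobrix code: the function name `decode_bcd` is hypothetical, and returning `None` for an undecodable value is only meant to mirror the null described in this issue.

```python
from typing import Optional


def decode_bcd(raw: bytes) -> Optional[int]:
    """Decode unsigned BCD bytes (two decimal digits per byte).

    Returns None when any nibble is not a decimal digit, mirroring the
    null that this thread describes for values that cannot be decoded.
    """
    value = 0
    for byte in raw:
        high, low = byte >> 4, byte & 0x0F
        if high > 9 or low > 9:
            return None  # e.g. 0xA3: 'A' is not a decimal digit
        value = value * 100 + high * 10 + low
    return value


print(decode_bcd(bytes.fromhex("543321")))  # 543321
print(decode_bcd(bytes.fromhex("54A321")))  # None
```

The second call shows why no number can be produced for 54 A3 21: the decoder has nothing sensible to return once it meets a nibble above 9.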

Please try the debug option and let us know whether it is sufficient for your use case or whether you would prefer raw values in the debugging fields.

@ghost
Author

ghost commented May 15, 2020

Thanks for the quick reply. I tried option 4 and set debug=true. When converting the HEX value to ASCII, do I always need to use the Cp037 character set? Cp037 is one of IBM's EBCDIC character sets; are there other EBCDIC character sets that we may encounter?

@ghost
Author

ghost commented May 15, 2020

For a packed decimal (COMP-3) field defined as S9(2)V999, the value 12.000 gives the HEX 00000000012000C. But converting it to a string with the Cp037 character set does not work as intended; I suppose I need to look up the field's data type in the copybook and convert accordingly.

That is, the C here means the value is positive, and V999 means 3 digits after the decimal point. Please suggest whether there are other ways to get the raw value in Cobrix.
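The following Python sketch illustrates why a character set such as Cp037 does not help with a packed decimal field, and how the sign nibble and the implied decimal point can be applied instead. It is not Cobrix code. The function name `decode_comp3` is hypothetical, only the most common sign nibbles (C, D, F) are accepted, and the input is assumed to be the three bytes 12000C, which is the packed size for five digits plus a sign rather than the longer HEX string quoted above.

```python
from decimal import Decimal


def decode_comp3(raw: bytes, scale: int) -> Decimal:
    """Decode a packed decimal (COMP-3) field.

    Every nibble is a decimal digit except the last one, which is the sign.
    This sketch accepts 0xD as negative and 0xC or 0xF as positive/unsigned.
    `scale` is the number of digits after the implied decimal point (V).
    """
    nibbles = []
    for byte in raw:
        nibbles.extend((byte >> 4, byte & 0x0F))
    *digits, sign = nibbles
    if any(d > 9 for d in digits) or sign not in (0x0C, 0x0D, 0x0F):
        raise ValueError(f"not a valid packed decimal: {raw.hex().upper()}")
    unscaled = int("".join(map(str, digits)))
    if sign == 0x0D:
        unscaled = -unscaled
    return Decimal(unscaled).scaleb(-scale)


raw = bytes.fromhex("12000C")  # assumed 3-byte layout for S9(2)V999

# Treating the bytes as EBCDIC text yields control characters, not digits.
as_text = raw.decode("cp037")
print(repr(as_text))

# Applying the packed decimal rules recovers the number.
print(decode_comp3(raw, scale=3))  # 12.000
```

The scale comes from the copybook (three 9s after the V), which is why the HEX value alone is not enough to reconstruct the number.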

@yruslan
Collaborator

yruslan commented May 15, 2020

The raw values are presented exactly the same way as they were in the original file, no encoding conversion is happening.

I'm trying to understand what you want to achieve. Could you please provide a made-up example? Something like: "for fields having raw values so and so we want values to be so and so".

@ghost
Author

ghost commented May 15, 2020

For example my copybook file is as below

01 StudentDetail
10 Name PIC A(10)
10 ID PIC S9(4) COMP
10 Mark PIC S9(10)V999 COMP-3

After enabling the debug option while loading the EBCDIC file, I get the following DataFrame:

Name  Name_debug  ID  ID_debug  Mark    Mark_debug
John  D1D6C8D5    1   0001      15.000  0000000015000C

If I want to reverse engineer the Mark_debug value 0000000015000C back to 15.000, how would I know whether the actual ASCII value of Mark was 15.000 or 015.000? Or is it the case that in the EBCDIC binary format the data never comes as 015.000? That is, can the data look like this in the Mark column, which is also valid and aligned with the data type?

Name  Name_debug  ID  ID_debug  Mark     Mark_debug
John  D1D6C8D5    1   0001      015.000  0000000015000C

@yruslan
Collaborator

yruslan commented May 15, 2020

If I want to reverse engineer the Mark_debug value 0000000015000C back to 15.000, how would I know whether the actual ASCII value of Mark was 15.000 or 015.000?

The difference between 15.000 and 015.000 is a matter of how the incoming binary data is interpreted. The copybook does not describe whether a leading zero should be shown when the value is displayed. Since that is not described in the copybook, Cobrix cannot do much about it.

Another way to look at it is that 15.000 and 015.000 are the same value, since they semantically map to the same mathematical number 15.
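The point can be checked from the opposite direction by packing the candidate strings. The Python sketch below is not Cobrix code: the function name `encode_comp3` and its parameters are hypothetical, and it assumes the usual COMP-3 layout of one digit per nibble followed by a sign nibble. For PIC S9(10)V999 that is 13 digits plus the sign.

```python
from decimal import Decimal


def encode_comp3(text: str, digits: int, scale: int) -> str:
    """Pack a decimal string into COMP-3 and return it as uppercase HEX.

    `digits` is the total digit count of the PIC clause (13 for S9(10)V999)
    and `scale` is the number of digits after the implied decimal point.
    Fractional digits beyond `scale` are truncated in this sketch.
    """
    unscaled = int(Decimal(text).scaleb(scale))  # 15.000 -> 15000
    sign = "D" if unscaled < 0 else "C"
    body = str(abs(unscaled)).zfill(digits)      # always padded to full width
    if len(body) > digits:
        raise ValueError(f"{text} does not fit into {digits} digits")
    packed = body + sign
    if len(packed) % 2:
        packed = "0" + packed                    # pad to a whole number of bytes
    return packed


for text in ("15.000", "015.000", "0015.000"):
    print(text, "->", encode_comp3(text, digits=13, scale=3))
```

All three inputs print 0000000015000C. The field always stores every digit position, so the number of leading zeros in a textual rendering is not recorded anywhere in the bytes.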

@ghost
Author

ghost commented May 15, 2020

That's correct, but our requirement is to read the raw value as it is. In this scenario, if I read the Mark field as a text field, then 15.000 and 015.000 are two different things. So basically we need to read the value as it is, without manipulating anything.

@yruslan
Collaborator

yruslan commented May 15, 2020

Okay. I have a question for you too. What in the copybook says that 15.000 is modified while 015.000 is the original value?

@ghost
Author

ghost commented May 15, 2020

No, it is not mentioned. But I am assuming that if the value in my data file is 015.000, 15.000, or 0015.000, in all cases the HEX-encoded value will be 0000000015000C. So, given my requirement, how should I get the actual raw value from the HEX encoding?

@yruslan
Collaborator

yruslan commented May 15, 2020

Just a suggestion: 15.000 should count as the raw value in your example, that is, the number with all leading zeros removed. Cobrix won't ever unpack COMP-3 encoded value as '015.000'.

But then again, your requirement depends on what you mean by 'raw value', and your notion of 'raw value' depends on your requirements. So it is completely up to you.

@ghost
Author

ghost commented May 15, 2020

Cobrix won't ever unpack COMP-3 encoded value as '015.000'

Then it should be fine. For now, with version 2.0.7, I am planning to use the debug option and reverse engineer the HEX value to the raw value so that I can use the data as it is, without any changes.

In the future, it would be really good if raw values could be provided instead of HEX. In the addDebugFields function, I guess val debugDataType = AlphaNumeric(s"X($size)", size, None, None, None) will keep the raw value. Please suggest.

@yruslan yruslan added the accepted Accepted for implementation label May 15, 2020
@yruslan yruslan self-assigned this May 15, 2020
@yruslan
Collaborator

yruslan commented May 15, 2020

Yes, we can easily add an option to generate raw values for debugging. Adding it to the backlog.

AlphaNumeric(s"X($size)", size, None, None, None) will keep the raw value. Please suggest.

Yes, and also you need to change this method to return raw values instead of HEX:

@ghost
Author

ghost commented May 15, 2020

Thanks a lot. It was great having this discussion with you. Do you want me to close this issue, or will it be tracked as part of the backlog?

@yruslan
Collaborator

yruslan commented May 15, 2020

No problem 😄

Let's leave this issue open. I'll use it to make the change to support raw values.

@ghost
Author

ghost commented May 17, 2020

It seems that for the packed decimal field 10 Mark PIC S9(10)V999 COMP-3

Mark    Mark_debug
15.000  0000000015000C

Cobrix does not convert to the actual HEX. The value in Mark_debug looks like the data was converted for storage in an ASCII environment. Please confirm.

@yruslan
Collaborator

yruslan commented May 18, 2020

Data in Mark_debug should contain HEX of the raw data without any conversion.

Remember, BCD is a binary encoding format. The notion of encoding (EBCDIC vs ASCII) is not applicable here.

@ghost
Author

ghost commented May 18, 2020

Got it, Thanks

yruslan added a commit that referenced this issue May 29, 2020
@yruslan yruslan closed this as completed Jun 12, 2020