NUL characters in US ASCII getting converted to space character i.e " " instead of empty value. #481

Closed
rohitavantsa opened this issue Mar 21, 2022 · 10 comments
Labels
question Further information is requested

Comments

@rohitavantsa

We have a US-ASCII fixed-length file.
File contents:
1234   t ----> this row has three spaces
4567NULNULNULf ----> this row has three NUL characters

CopyBook Contents:

01 tablename
05 record_ID PIC x(3)
05 record_status PIC x(3)
05 record_flag PIC x(1)

expected output:

[Row(record_ID='1234', record_status='   ', record_flag='t'),
Row(record_ID='4567', record_status='', record_flag='f')]

Actual Output :
[Row(record_ID='1234', record_status='   ', record_flag='t'),
Row(record_ID='4567', record_status='   ', record_flag='f')]

We expected an empty value, but instead we are getting three spaces. In our on-prem system this field is an empty value. Can you please help us understand why we are seeing this issue?

@yruslan Can you please help us on this.

@rohitavantsa rohitavantsa added the question Further information is requested label Mar 21, 2022
@yruslan
Collaborator

yruslan commented Mar 21, 2022

Hi, thanks for the issue report. Could you please add

  • An example US-ASCII file with NUL characters.
  • The code snippet you are using to read the file.

Btw, does this option help remove the extra spaces: .option("string_trimming_policy", "both") ?

@rohitavantsa
Author

rohitavantsa commented Mar 22, 2022

Hi @yruslan

string_trimming_policy is set to none in our case, as we need to preserve the spaces while reading the file.

Here are the options we are using to read the file:
spark.read.format('cobol').options(copybook_contents=copybook, encoding='ASCII', ebcdic_code_page='CP037', string_trimming_policy='none', debug_ignore_file_size='true').load('filepath')
(where copybook holds the copybook text shown above)

Please find the Sample file below:
sampleUS-ASCII file.txt

Please open this file in Notepad++ to see the NUL characters.

Expected Output:
The field containing NULNULNUL should appear as an empty string instead of '   ' (three spaces), which is what we currently get in our dataframe. The on-prem system provides this field as '' (an empty field).

@yruslan
Collaborator

yruslan commented Mar 22, 2022

Hi,

Before looking deeper please try:

  • Removing 'ebcdic_code_page':'CP037' since it is applicable only for EBCDIC and
  • Adding .option("improved_null_detection", "true")

ASCII charset is set using this option:
.option("ascii_charset", "US_ASCII") (UTF_8 is the default)
(you can specify a different charset, of course)
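The suggestions above could be collected in PySpark roughly as follows. This is only a sketch: the helper names `ascii_reader_options` and `read_ascii_file` are hypothetical, the option name `copybook_contents` is an assumption about the spelling the reader expects, and the remaining option names and values are copied from the comments in this thread.

```python
from typing import Dict


def ascii_reader_options(copybook_contents: str) -> Dict[str, str]:
    """Collect the reader options discussed in this thread for an ASCII file."""
    return {
        # Assumed spelling of the copybook option; adjust if your version differs.
        "copybook_contents": copybook_contents,
        "encoding": "ASCII",
        "ascii_charset": "US_ASCII",  # value as quoted in the comment above
        "improved_null_detection": "true",
        "string_trimming_policy": "none",  # keep spaces, as the reporter requires
        "debug_ignore_file_size": "true",
        # 'ebcdic_code_page' is intentionally left out: it applies only to EBCDIC.
    }


def read_ascii_file(spark, path: str, copybook_contents: str):
    """Read a fixed-length ASCII file; needs spark-cobol on the Spark classpath."""
    return (
        spark.read.format("cobol")
        .options(**ascii_reader_options(copybook_contents))
        .load(path)
    )
```

Keeping the options in one dictionary makes it easy to confirm that the EBCDIC-only setting is gone before the file is loaded.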

@rohitavantsa
Author

Sure @yruslan will try this.

@rohitavantsa
Author

Hi @yruslan

We have tried removing the option 'ebcdic_code_page':'CP037' and adding .option("improved_null_detection", "true"), but it is still not working as we expect.

To be more clear:
The NUL character I am referring to is the hex value \x00, which is not being read properly. When reading, we expect an empty field but get space characters instead. The file I have provided contains these NUL characters.
Could you try it and check whether this is the normal behavior or whether a fix is needed?

Thanks in advance
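For readers who want to confirm that a record really contains \x00 bytes without opening it in an editor, a minimal Python check might look like this. The helper name `nul_offsets` and the sample record are illustrative only; the record mirrors the second row described in the report.

```python
from typing import List


def nul_offsets(data: bytes) -> List[int]:
    """Return the byte offsets at which a NUL (0x00) byte occurs."""
    return [i for i, b in enumerate(data) if b == 0x00]


# Sample record modelled on the second row in the report: 4567, three NULs, f.
record = b"4567\x00\x00\x00f"
print(nul_offsets(record))  # [4, 5, 6]
print(record.hex(" "))      # 34 35 36 37 00 00 00 66
```

The same function can be applied to the bytes of the real file, for example after reading it with `open(path, "rb").read()`.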

@yruslan
Collaborator

yruslan commented Mar 23, 2022

Currently, all characters that are lower than 0x20 are replaced by spaces. If all characters in a field are 0x00, and improved_null_detection = true, the field becomes null.

Will check your file. Probably the correct behavior for ASCII would be not replacing lower characters with spaces and always skipping 0x00. This is something that needs to be implemented on our side.
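The behavior described in this comment can be modelled in a few lines of Python. This is a simplified sketch for illustration, not the Cobrix source: both function names are hypothetical, input is assumed to be single-byte ASCII, and the second function models only the proposal as worded here (skip 0x00, leave other characters alone), which may differ from what the eventual fix does.

```python
from typing import Optional


def decode_current(raw: bytes, improved_null_detection: bool = False) -> Optional[str]:
    """Model of the described behavior: bytes below 0x20 become spaces."""
    # With improved null detection, a field made only of 0x00 bytes becomes null.
    if improved_null_detection and raw and all(b == 0x00 for b in raw):
        return None
    return "".join(" " if b < 0x20 else chr(b) for b in raw)


def decode_proposed(raw: bytes) -> str:
    """Model of the proposal: always skip 0x00 and do not replace other bytes."""
    return "".join(chr(b) for b in raw if b != 0x00)


field = b"\x00\x00\x00"
print(repr(decode_current(field)))        # '   '
print(repr(decode_current(field, True)))  # None
print(repr(decode_proposed(field)))       # ''
```

Under this model, a field of three real spaces stays `'   '` in both functions, so only NUL-filled fields change.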

@rohitavantsa
Author

Sure thanks.

yruslan added a commit that referenced this issue Mar 24, 2022
yruslan added a commit that referenced this issue Mar 24, 2022
@yruslan
Collaborator

yruslan commented Mar 24, 2022

This should be fixed in this branch:
https://github.com/AbsaOSS/cobrix/tree/bugfix/481-ignore-control-characters

You can test it by building that branch.

yruslan added a commit that referenced this issue Mar 24, 2022
@rohitavantsa
Author

Thanks @yruslan. This fix is helping us resolve the issue.

@yruslan
Collaborator

yruslan commented Mar 25, 2022

Great! It will be released as a new version sometime next week

yruslan added a commit that referenced this issue Mar 25, 2022
@yruslan yruslan closed this as completed Apr 19, 2022