Missing proper numeric(little endian type) values after parsing #740

Open
jaysara opened this issue Jan 10, 2025 · 6 comments
Labels
enhancement New feature or request


jaysara commented Jan 10, 2025

## Background

I am trying to parse a file with fixed-size records in which numeric fields are stored in little-endian format. I changed COMP-5 in the copybook to COMP-9. All text fields parse correctly, but some numeric fields do not come out correctly.
The example copybook: Copybook
The sample input file: Input File
The file produced by Cobrix: Cobrix Output
The expected output file from the original system: Expected
This is my Java code:

```java
// Assumes an existing SparkSession `spark` and the variables configPath, inputFile and outputPath.
String custInputCopyBook = readCopyBook(configPath);

Dataset<Row> df1 = spark.read()
        .format("za.co.absa.cobrix.spark.cobol.source")
        .option("copybook_contents", custInputCopyBook)
        .option("encoding", "ascii")
        .option("schema_retention_policy", "collapse_root")
        .option("record_start_offset", "2")
        .load(inputFile);

df1.printSchema();
df1.show(false);
System.out.println("Count " + df1.count());

// Write the parsed records to a single CSV file with a header row.
df1.repartition(1).write().mode("overwrite").option("header", "true").csv(outputPath);
```

## Question

Can you please help me identify why some numeric values are missing (they come out as null)?

Examples of missing values:
Column 1 -> Row 11,
Column 3 -> Row 11,
Column 4 -> Row 10,
Column 5 -> Row 6,
Column 8 -> Row 7, Row 9
All missing values are in numeric fields only. All character fields seem to populate correctly.

Thanks
@jaysara jaysara added the question Further information is requested label Jan 10, 2025
Collaborator

yruslan commented Jan 20, 2025

Hi @jaysara ,

First of all, thanks a lot for such a detailed explanation of the issue and for providing the copybook and examples!

From what I can see, the copybook has PICs like 9(9), i.e. the maximum length is 9 digits. But Cobrix encounters numbers that have 10 digits, so it drops them (they come out as null).

Unfortunately, I don't see a way to work around this at the moment: if you change the PIC to 9(10), Cobrix will read the field as an 8-byte long instead of a 4-byte integer.

We can add a special option to treat binary fields according to their full range and ignore the maximum length declared in the PIC for binary fields.
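To make the mismatch concrete, here is a minimal standalone sketch in plain Java. It is not Cobrix code: the class and method names are hypothetical, and the size rule (up to 4 digits in 2 bytes, 5 to 9 digits in 4 bytes, 10 to 18 digits in 8 bytes) is the usual COBOL binary layout, assumed here for illustration. It shows that a 4-byte little-endian field can hold a 10-digit value, which a strict check against PIC 9(9) rejects.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class BinaryFieldDemo {
    /** Storage size in bytes for a binary field declared with the given number of PIC digits (assumed rule). */
    public static int storageBytes(int picDigits) {
        if (picDigits <= 4) return 2;
        if (picDigits <= 9) return 4;
        return 8; // 10 to 18 digits
    }

    /** Decodes a 4-byte little-endian signed integer (full range of the storage). */
    public static int decodeLittleEndianInt(byte[] bytes) {
        return ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).getInt();
    }

    /** Strict decoding: returns null when the value has more decimal digits than the PIC allows. */
    public static Integer decodeStrict(byte[] bytes, int picDigits) {
        int value = decodeLittleEndianInt(bytes);
        // Widen to long before abs() so Integer.MIN_VALUE is handled correctly.
        int digits = Long.toString(Math.abs((long) value)).length();
        return digits > picDigits ? null : value;
    }

    public static void main(String[] args) {
        // 1,000,000,000 = 0x3B9ACA00, stored least significant byte first.
        byte[] raw = {0x00, (byte) 0xCA, (byte) 0x9A, 0x3B};

        System.out.println(storageBytes(9));            // 4
        System.out.println(storageBytes(10));           // 8
        System.out.println(decodeStrict(raw, 9));       // null: 10 digits exceed PIC 9(9)
        System.out.println(decodeLittleEndianInt(raw)); // 1000000000
    }
}
```

Declaring the field as 9(10) does not help in this sketch either, because the assumed size rule would then expect 8 bytes where the file only has 4.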

Converting this question to a feature request.

@yruslan yruslan added enhancement New feature or request and removed question Further information is requested labels Jan 20, 2025
Collaborator

yruslan commented Jan 20, 2025

Proposed option to add:
(just a proposal, can change)

Current behavior:

.option("binary_numbers", "strict")

New behavior in order to get expected values:

.option("binary_numbers", "full_range")

Author

jaysara commented Jan 21, 2025

Hi @yruslan, thanks for the reply. I tried with

.option("binary_numbers", "full_range")

However, that has not changed anything. Is there a specific version of Cobrix I should use? I am using 2.6.9.
For the enhancement, approximately how long will it take? Is there any temporary solution I can implement?
Thanks,

Collaborator

yruslan commented Jan 21, 2025

Yes, the feature is not implemented yet. The option is just a proposal.

But after looking at the code, we won't add the option; we will just change the default behavior to support the wider range of values. This is almost a bugfix.

The feature is going to be implemented soon.

Author

jaysara commented Jan 21, 2025

Thanks @yruslan. Sorry for pushing on this a little, but it will help with our internal planning. Can you clarify what "soon" means for this kind of bug fix? Should we plan for 1 week, 1 month, or less than that? Also, will you let us know here when the bug is fixed and the new version is available? Thanks again for your help.

Collaborator

yruslan commented Jan 21, 2025

Sure. The plan is to implement it this week and release a new version next week. Will let you know when the change is available in the master branch so you can also test it on your use case.
