-
Notifications
You must be signed in to change notification settings - Fork 79
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Null vs Blank Issue #372
Comments
Hi, Cobrix does distinguish between nulls and empty values. You can achieve this by specifying a value that should be written to a CSV field when the input value is null. df.write
.option("nullValue", "NULL")
.csv("/some/path") |
Hi, Thanks for the reply. I tested the writing mode you recommended, but the problem did not go away. Executing a show on the dataframe created by the read operation and comparing it with the source table on DB2, it was verified that the null cells on the source are read blank in the dataframe. So the error is being read not writing. In fact, since null cells are not created in the dataframe, the option option ("nullValue", "NULL") does not generate null values in the csv written on HDFS. Examble_DB2vsDataframePostRead.xlsx This is read operation: Is it possible that I need to add special options to fix the problem? |
I see. What are the data types of the problematic columns? How their PICs look like in teh copybook? I have a suspicion that it might be the limitation of the mainframe format itself. It might be that empty strings are encoded exactly the same as nulls on the export from DB2. If this is the case, there is nothing we can do. |
Data tyoes and PICs of columns are: |
That's what I was suspected. Fields that are encoded as 'PIC X' don't support the notion of NULLs. So when you export a table from DB2 to a mainframe file, both empty strings and nulls are encoded as "40404040" (40 is space in EBCDIC). If encoded values are different, it might be possible to differentiate it. Could I ask you to run with .option("debug", "true") to see how empty strings and nulls are encoded? |
Is this option valid for spark-cobol version 0.2.5? I have spark 2.2.0 |
No, this option has been introduced quite recently. But you can try |
Unfortunately, with version 2.2.0 I always have this error which I cannot resolve: |
This might be easy to solve. :) |
I did this, but error is |
It's strange. What's your command line? |
I checked. It seems older Spark versions used different ANTLR and they provide it in runtime. But that won't solve nulls issue. Alternatively, you can use a HEX editor to look into the data file and check if nulls are encoded differently than empty strings. |
If i replace FILLER with this format copybook: |
No additional options are necessary here. But what are you trying to achieve? |
I created the copybook file starting from DDL and LOAD file which indicates datatype, number of occupied positions and NULLIF position for each field. I translated NULLIF with FILLER in copybook, and I want and I would like to try to use REDEFINES instead of FILLER, to understand if the problem is the copybook file used. |
Hi, Should I enter additional options for this copybook? |
Hi, Sorry for the late reply. What version of Cobrix are you using? and what exact options are you passing to Spark? This is a very strange error. |
These are dependencies:
Spark read command is: |
Before I take a look. Why do you use Also, you don't need to include scodec and cobol-parser deendencies. They are transitive and will be added automatically. |
Thanks a lot for the tip. I changed the version and the error changed. The mistake is: There are some files in hdfs://path/fileBin that are NOT DIVISIBLE by the RECORD SIZE calculated from the copybook (594 bytes per record)._ I used this command: This copybook file I use is the official one of the DB2 table. For this reason it is strange that it does not match the binary file. |
You can try
to suppress the message to fetch first several records (e.g. df.show()). But it won't work when reading the full file. It's possible that tolls that transfer mainframe files to a PC add additional headers to records (RDW, for instance). So without looking at the actual file it is hard to tell why the data doesn't match the copybook. Maybe the above workaround will give you a clue. I'd strange that you didn't have the same error message when loading via '0.2.5'. I addition what you can do is compare the layout of fields with the copybook or a similar layout produced in the mainframe to ensure they match. |
From hex editor the file size is 626. In fact with a different copybook (using fillers instead of refefines) I read the binary file correctly and the size from cobrix is 626. |
We might include the saded version of ANTLR Runtime in the future version of Cobrix so people could use the latest spark-cobol with older Spark versions. But back to the original issue. Can you tell me if nulls are encoded differently than blanks? Maybe you can send a record (or a HEX dump screenshot) and I'll take a look. I need a HEX dump of a record and a description which field should be null and which should be blank. |
Hi, this is binary file with two records that contain blank and null: This file contains datatype of columns in DB2: This is copybook file with FILLER: and this is copybook file with REDEFINES: Thanks for the support. |
Great. I've tried to load it using the latest Cobrix:
Is this the expected output or are there unexpected blanks? I used the following options:
Here is the same with debugging:
|
Here is Cobrix 2.2.0 that embeds ANTLR and can be used with Spark 2.2.0 :) |
Hi, thanks for the jar. The result is not correct. Here I report the data on DB2 corresponding to the dump file sent previously. How did you calculate the value of record_lenght? |
Hi, The previously submitted dump was inconsistent with the submitted copybook. For this reason the resulting data was incorrect. |
Great, that's the difference we can use to distinguish between banks and nulls.
I can send you a snapshot version to test before releasing a new version. |
Thank you very much. If you can send it, it can be useful to test my data. |
Cool, will send as soon as it is ready |
Here is the snapshot: Use
Use 'pedantic' to make Cobrix complain if there are redundant options. |
Have you had a chance to check the new option? |
Sorry, it seems the fix wasn't included in 2.2.1. Will be included in 2.2.3 |
I'm using the cobrix library to read an EBDCID binary file, obtained as a table extraction on DB2. The reading operation output is writed as CSV file on HDFS.
For parsing the binary file I'm using a copybook file present on HDFS.
Here is the spark reading command:
spark.read.
format (sourceFileFormat).
option ("copybook", copybookStagingHDFSPath).
option ("schema_retention_policy", "collapse_root").
option ("string_trimming_policy", "none").
load (sourceFileStagingHDFSPath + file)
The resulting output on CSV file does not allow the distinction between BLANK ("") and NULL cells but are both treated as empty strings.
Is there a way to treat BLANK ("") value and NULL value differently, in order to have an output corresponding to the data present on the DB2 source database?
The text was updated successfully, but these errors were encountered: