Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Validate should throw record length error when record delimiter does not occur in correct location #519

Closed
jordanpadams opened this issue Jun 23, 2022 · 8 comments · Fixed by #589

Comments

@jordanpadams
Copy link
Member

jordanpadams commented Jun 23, 2022

🐛 Describe the bug

Per #518, validate appears to be missing checking that the record delimiter occurs at the end of the expected record length. When reading a record, if the record length is read, and it does not end with the expected record delimiter, validate should throw an error.

📜 To Reproduce

See #518

🕵️ Expected behavior

Validate should throw a record length error

📚 Version of Software Used

2.2.3

🩺 Test Data / Additional context

https://github.com/NASA-PDS/validate/files/8959903/Archive.zip


🦄 Related requirements

⚙️ Engineering Details

Medium severity because it will most likely fail elsewhere in most cases, but it is not at all intuitive what the issue is, and it does not appear that we are properly checking for record lengths.

@al-niessner
Copy link
Contributor

@tloubrieu-jpl

Cannot find the file ECM000XXX_2022173T165156_HK_LR_SOLARWIND_CAL_ASC_RAW010.xml and associated data mentioned in #518. Please attach to this ticket.

@jordanpadams
Copy link
Member Author

@tloubrieu-jpl hopefully adding this to the top of your stack. can you please provide the test data @al-niessner has requested?

@jordanpadams
Copy link
Member Author

@al-niessner
Copy link
Contributor

al-niessner commented Jan 25, 2023

@jordanpadams @tloubrieu-jpl

Seems much has changed. I do not get the errors in #518 nor does the bad table looks the same:

      ERROR  [error.label.table_definition_problem]   line 1: Data Object #1 (Identifier Not Specified): Data object is truncated. Expected bytes as defined by label: 645696 (944 records), Actual bytes remaining in file: 513353 (750 records)

Seems clear enough that something in the table definition and file are not right. I am not sure that we can make sense of it from there either. What I find super interesting and think is the real bug, if I change both 944 references to 750 it runs with 10s of thousands of errors because types are not right. However if we take the file size and divide by the number of rows 513353/944 = 543.8061440677966. That is not rounding error. If I do it for the expected 750, 513353/750 = 684.4706666666667 which is also not rounding error. However, the file size check thinks it is right with XML defined record length of 684 and 750 rows which is 513000 but really is off by 353 bytes. Really? Cannot even blame it on the header because there is no header.

Using python3, loaded the table and then found that CR-NL were used to delimit the file. There are 944 rows with that delimiter. They start off with a length of 541(139), drop to 540(134), climb to 542(447), then plateau at 543(224). The number in () is the number of those rows of that value out of the 944 rows. This brings me to the second problem, why should the table be processed if the XML is not even close to the actual file?

The fact that it blew the size check twice (file then again table) allows it to grab 684 characters. Of course it is going to be wrong. I can put a check before it checks every cell to see if the line is properly terminated (making it a row) before doing cell processing. Is that the desirement of this ticket?

@jordanpadams
Copy link
Member Author

@al-niessner what happens if we don't trigger this file size error? in other words, what if we change the label to say the number of records is 750 (like the error message reads), and keep on trying to validate the file? do we then throw an error that the record delimiter is invalid?

@al-niessner
Copy link
Contributor

al-niessner commented Jan 25, 2023

@al-niessner what happens if we don't trigger this file size error? in other words, what if we change the label to say the number of records is 750 (like the error message reads), and keep on trying to validate the file? do we then throw an error that the record delimiter is invalid?

@jordanpadams
10s of thousands of cell check errors because the string segment does not look like anything.

@al-niessner
Copy link
Contributor

al-niessner commented Jan 25, 2023

@jordanpadams

Here is the first set of offending code:

https://github.com/NASA-PDS/pds4-jparser/blob/5e1137162dfe4e327bba5a393b08162f04c41917/src/main/java/gov/nasa/pds/objectAccess/ByteWiseFileAccessor.java#L191-L201

I take it the first check is to allow for extra bytes at the end of the file. The second allows allows for junk at the end of the file too. Is this desired?

The second error message is far less informative about why it is short on byte. Should its error message be expanded like other check?

Lastly, I have changed the first error set to be this:

          if ((fileSizeMinusOffset / length) * length == fileSizeMinusOffset)
            throw new InvalidTableException(
                "Data object is truncated. Expected bytes as defined by label: " + expectedBytesToRead
                    + " (" + records + " records times " + length + " bytes per record)" + ", Actual bytes in file: "
                    + fileSizeMinusOffset + " (" + (fileSizeMinusOffset / length) + " records times " + length + " bytes per record)");
          if ((fileSizeMinusOffset / records) * records == fileSizeMinusOffset)
              throw new InvalidTableException(
                      "Data object is truncated. Expected bytes as defined by label: " + expectedBytesToRead
                          + " (" + records + " records times " + length + " bytes per record)" + ", Actual bytes in file: "
                          + fileSizeMinusOffset + " (" + records + " records times " + (fileSizeMinusOffset/records) + " bytes per record)");
          throw new InvalidTableException(
                  "Data object is truncated. Expected bytes as defined by label: " + expectedBytesToRead
                      + " (" + records + " records times " + length + " bytes per record)" + ", Actual bytes in file: "
                      + fileSizeMinusOffset + " (" + ((float)(fileSizeMinusOffset) / (float)length) + " records times " + length + " bytes per record)"
                      + " OR (" + records + " records times " + ((float)fileSizeMinusOffset / (float)records) + " bytes per record)");

That bit of code says that if there is an even multiplier for too many rows or different integer length that works those messages give you clean integers all the way around (help with missing row or miscount of length but not both). If there are no integer resolutions it gives this message:

      ERROR  [error.label.table_definition_problem]   line 1: Data Object #1 (Identifier Not Specified): Data object is truncated. Expected bytes as defined by label: 645696 (944 records times 684 bytes per record), Actual bytes in file: 513353 (750.51605 records times 684 bytes per record)  OR (944 records times 543.80615 bytes per record)

Helps the user figure out what is wrong. I think this same block should be used for the not enough bytes short message as well. Let me know what you want.

@miguelp1986
Copy link

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants