Validate should throw record length error when record delimiter does not occur in correct location #519

jordanpadams · 2022-06-23T16:38:43Z

🐛 Describe the bug

Per #518, validate does not appear to check that the record delimiter occurs at the end of each record of the expected length. When a record of the declared record length is read and it does not end with the expected record delimiter, validate should throw an error.

📜 To Reproduce

See #518

🕵️ Expected behavior

Validate should throw a record length error

📚 Version of Software Used

2.2.3

🩺 Test Data / Additional context

https://github.com/NASA-PDS/validate/files/8959903/Archive.zip

🦄 Related requirements

⚙️ Engineering Details

Medium severity because it will most likely fail elsewhere in most cases, but it is not at all intuitive what the issue is, and it does not appear that we are properly checking for record lengths.

al-niessner · 2022-12-22T19:02:06Z

@tloubrieu-jpl

Cannot find the file ECM000XXX_2022173T165156_HK_LR_SOLARWIND_CAL_ASC_RAW010.xml and associated data mentioned in #518. Please attach to this ticket.

jordanpadams · 2023-01-25T14:37:28Z

@tloubrieu-jpl hopefully adding this to the top of your stack. can you please provide the test data @al-niessner has requested?

jordanpadams · 2023-01-25T14:39:26Z

@al-niessner https://github.com/NASA-PDS/validate/files/8959903/Archive.zip

al-niessner · 2023-01-25T18:31:03Z

@jordanpadams @tloubrieu-jpl

Seems much has changed. I do not get the errors in #518, nor does the bad table look the same:

      ERROR  [error.label.table_definition_problem]   line 1: Data Object #1 (Identifier Not Specified): Data object is truncated. Expected bytes as defined by label: 645696 (944 records), Actual bytes remaining in file: 513353 (750 records)

It seems clear enough that the table definition and the file do not agree, though I am not sure we can make sense of it from that message alone. What I find super interesting, and think is the real bug: if I change both 944 references to 750, it runs with tens of thousands of errors because the types are not right. However, if we take the file size and divide by the number of rows, 513353/944 = 543.8061440677966. That is not rounding error. If I do it for the expected 750, 513353/750 = 684.4706666666667, which is also not rounding error. Yet the file size check thinks the file is right with the XML-defined record length of 684 and 750 rows, which is 513000 bytes, when it is really off by 353 bytes. Really? We cannot even blame it on the header, because there is no header.
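
To make the arithmetic above easy to re-run, here is a minimal sketch. `SizeCheck` and its methods are hypothetical names for illustration, not part of validate or pds4-jparser; the numbers are the ones quoted in this comment and in the error message.

```java
// Hypothetical helper for illustration; not part of validate or pds4-jparser.
final class SizeCheck {
    /** Bytes left over after filling as many whole records as fit in the file. */
    static long leftoverBytes(long fileSize, long recordLength) {
        return fileSize % recordLength;
    }

    /** True only when records * recordLength equals the file size exactly. */
    static boolean matchesLabel(long fileSize, long records, long recordLength) {
        return records * recordLength == fileSize;
    }

    public static void main(String[] args) {
        long fileSize = 513353L; // actual bytes in the file, from the error message
        System.out.println(matchesLabel(fileSize, 944, 684)); // false: label expects 645696
        System.out.println(matchesLabel(fileSize, 750, 684)); // false: 750 * 684 = 513000
        System.out.println(leftoverBytes(fileSize, 684));     // 353
    }
}
```

An exact-equality test like `matchesLabel` would reject the 750-record label as well, which is the point being made here: 353 bytes are unaccounted for.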

Using python3, I loaded the table and found that CR-LF is used to delimit the rows. There are 944 rows with that delimiter. They start off with a length of 541 (139 rows), drop to 540 (134), climb to 542 (447), then plateau at 543 (224). The number in parentheses is how many of the 944 rows have that length. This brings me to the second problem: why should the table be processed at all if the XML is not even close to the actual file?
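
For anyone who wants to repeat that row-length count, here is a rough Java sketch (Java to match the other code in this thread; the original check was done in python3). `RowLengths` is a hypothetical helper, and it assumes the whole file fits in memory and that rows are delimited by CR-LF. The second method checks that the reported counts reconcile with the file size if the reported lengths exclude the two delimiter bytes.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper for illustration; not part of validate or pds4-jparser.
final class RowLengths {
    /** Maps row length (CR-LF excluded) to the number of CR-LF terminated rows of that length. */
    static Map<Integer, Integer> histogram(byte[] data) {
        Map<Integer, Integer> counts = new TreeMap<>();
        int start = 0;
        for (int i = 0; i + 1 < data.length; i++) {
            if (data[i] == '\r' && data[i + 1] == '\n') {
                counts.merge(i - start, 1, Integer::sum);
                start = i + 2; // next row begins after the delimiter
                i++;           // do not re-examine the LF
            }
        }
        return counts; // bytes after the last CR-LF are not counted as a row
    }

    /** Total bytes implied by a histogram when every row carries a delimiter. */
    static long totalBytes(Map<Integer, Integer> counts, int delimiterLength) {
        long total = 0;
        for (Map.Entry<Integer, Integer> e : counts.entrySet()) {
            total += (long) (e.getKey() + delimiterLength) * e.getValue();
        }
        return total;
    }

    public static void main(String[] args) {
        byte[] sample = "ab\r\ncd\r\nefg\r\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(histogram(sample)); // {2=2, 3=1}

        // Counts reported in this comment: length -> number of rows.
        Map<Integer, Integer> reported = new TreeMap<>();
        reported.put(541, 139);
        reported.put(540, 134);
        reported.put(542, 447);
        reported.put(543, 224);
        System.out.println(totalBytes(reported, 2)); // 513353
    }
}
```

With a 2-byte delimiter added to each of the 944 rows, the reported counts sum to exactly 513353 bytes, the file size in the error message. No row comes anywhere near the 684-byte record length in the label.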

The fact that it blew the size check twice (file, then again table) allows it to grab 684 characters. Of course it is going to be wrong. I can add a check, before any cell is examined, that the line is properly terminated (making it a row), and only then do cell processing. Is that the intent of this ticket?
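
As a sketch of what such a pre-check could look like (not the change that was actually made for this ticket): walk the data in fixed-length records and confirm each one ends with the delimiter before any cell is examined. `RecordDelimiterCheck` is a hypothetical name, and the code assumes CR-LF as the record delimiter and the table bytes already in memory.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not part of validate or pds4-jparser.
final class RecordDelimiterCheck {
    /**
     * Returns the 1-based number of the first complete record that does not
     * end with CR-LF, or -1 if every complete record is properly terminated.
     */
    static int firstBadRecord(byte[] data, int recordLength) {
        if (recordLength < 2) {
            throw new IllegalArgumentException("record length must include the 2-byte delimiter");
        }
        int records = data.length / recordLength; // trailing partial record is ignored here
        for (int r = 0; r < records; r++) {
            int end = (r + 1) * recordLength;
            if (data[end - 2] != '\r' || data[end - 1] != '\n') {
                return r + 1;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] good = "ab\r\ncd\r\n".getBytes(StandardCharsets.US_ASCII);
        byte[] bad = "abc\r\nd\r\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(firstBadRecord(good, 4)); // -1
        System.out.println(firstBadRecord(bad, 4));  // 1
    }
}
```

Failing on the first badly terminated record would replace tens of thousands of cell errors with one message that points at the record length.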

jordanpadams · 2023-01-25T18:46:53Z

@al-niessner what happens if we don't trigger this file size error? in other words, what if we change the label to say the number of records is 750 (like the error message reads), and keep on trying to validate the file? do we then throw an error that the record delimiter is invalid?

al-niessner · 2023-01-25T19:07:47Z

> @al-niessner what happens if we don't trigger this file size error? in other words, what if we change the label to say the number of records is 750 (like the error message reads), and keep on trying to validate the file? do we then throw an error that the record delimiter is invalid?

@jordanpadams
10s of thousands of cell check errors because the string segment does not look like anything.

al-niessner · 2023-01-25T19:20:50Z

@jordanpadams

Here is the first set of offending code:

https://github.com/NASA-PDS/pds4-jparser/blob/5e1137162dfe4e327bba5a393b08162f04c41917/src/main/java/gov/nasa/pds/objectAccess/ByteWiseFileAccessor.java#L191-L201

I take it the first check is to allow for extra bytes at the end of the file. The second allows for junk at the end of the file too. Is this desired?

The second error message is far less informative about why it is short on bytes. Should its error message be expanded like the other check's?

Lastly, I have changed the first error set to be this:

          // Case 1: the file holds a whole number of records of the labeled length,
          // so the record count in the label is the likely mistake.
          if ((fileSizeMinusOffset / length) * length == fileSizeMinusOffset)
            throw new InvalidTableException(
                "Data object is truncated. Expected bytes as defined by label: " + expectedBytesToRead
                    + " (" + records + " records times " + length + " bytes per record)" + ", Actual bytes in file: "
                    + fileSizeMinusOffset + " (" + (fileSizeMinusOffset / length) + " records times " + length + " bytes per record)");
          // Case 2: the file divides evenly by the labeled record count,
          // so the record length in the label is the likely mistake.
          if ((fileSizeMinusOffset / records) * records == fileSizeMinusOffset)
            throw new InvalidTableException(
                "Data object is truncated. Expected bytes as defined by label: " + expectedBytesToRead
                    + " (" + records + " records times " + length + " bytes per record)" + ", Actual bytes in file: "
                    + fileSizeMinusOffset + " (" + records + " records times " + (fileSizeMinusOffset / records) + " bytes per record)");
          // Case 3: neither divides evenly, so report both fractional interpretations.
          throw new InvalidTableException(
              "Data object is truncated. Expected bytes as defined by label: " + expectedBytesToRead
                  + " (" + records + " records times " + length + " bytes per record)" + ", Actual bytes in file: "
                  + fileSizeMinusOffset + " (" + ((float) fileSizeMinusOffset / (float) length) + " records times " + length + " bytes per record)"
                  + " OR (" + records + " records times " + ((float) fileSizeMinusOffset / (float) records) + " bytes per record)");

That bit of code says: if the file size is an exact multiple of the labeled record length (wrong row count), or divides evenly by the labeled row count (wrong record length), the message gives you clean integers all the way around. That helps with a missing row or a miscount of length, but not both. If there is no integer resolution, it gives this message:

      ERROR  [error.label.table_definition_problem]   line 1: Data Object #1 (Identifier Not Specified): Data object is truncated. Expected bytes as defined by label: 645696 (944 records times 684 bytes per record), Actual bytes in file: 513353 (750.51605 records times 684 bytes per record)  OR (944 records times 543.80615 bytes per record)

This helps the user figure out what is wrong. I think this same block should be used for the second, less informative "not enough bytes" message as well. Let me know what you want.

miguelp1986 · 2023-03-09T21:42:32Z

Testrail: https://cae-testrail.jpl.nasa.gov/testrail/index.php?/cases/view/1273421

jordanpadams added bug, needs:triage labels Jun 23, 2022

jordanpadams assigned jordanpadams and ZarehGorjianJPL and unassigned jordanpadams Jun 23, 2022

jordanpadams added B13.0 s.medium and removed needs:triage labels Jun 23, 2022

jordanpadams assigned al-niessner and unassigned ZarehGorjianJPL Dec 20, 2022

jordanpadams added B13.1 p.must-have labels Dec 20, 2022

al-niessner mentioned this issue Dec 21, 2022

B13.1 Fix Must-Have Priority Bugs #578

Closed

jordanpadams added sprint-backlog needs:receivable labels Jan 2, 2023

jordanpadams removed the needs:receivable label Jan 25, 2023

jordanpadams mentioned this issue Jan 25, 2023

issue #1: command-line option to support content validation for every N products #587

Merged

This was referenced Jan 26, 2023

changes needed to support validate #519 NASA-PDS/pds4-jparser#84

Merged

issue #519: tweak table processing #589

Merged

jordanpadams mentioned this issue Jan 31, 2023

Regression in validate no longer enabling CRLF to be embedded within a Table_Character record #593

Closed

jordanpadams closed this as completed in #589 Jan 31, 2023

jordanpadams added a commit that referenced this issue Jan 31, 2023

Merge pull request #589 from NASA-PDS/issue_519

8e83040

issue #519: tweak table processing

jordanpadams removed the sprint-backlog label Feb 23, 2023

miguelp1986 added the i&t.done label Mar 9, 2023
