
Newline in fixed length COBOL source files #59

Closed
wiztigers opened this issue Aug 13, 2015 · 11 comments
Assignees
Labels
Bug rfc Specifications are not complete. Comment are welcomed. Tools

Comments

@wiztigers
Contributor

Seen in a production source file: some lines contain newline characters (although these normally cannot exist in such sources, it seems this one was edited manually with a hex editor).
This ruins the indexing and results in missing PeriodSeparator errors on the first "half" of the line (the second "half" being whatever follows the newline character).
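A minimal sketch of the symptom (illustrative only, not the TypeCobol implementation): when a stray newline is edited into the middle of a fixed-length 80-column record, a reader that splits on line endings sees two short source lines instead of one, so the statement before the newline appears to end without its period.

```python
# One 80-char fixed-length record with a '\n' edited into it by hand.
record = 'MOVE X'.ljust(40) + '\n' + 'TO Y.'.ljust(39)
assert len(record) == 80

# A naive, line-ending-based indexer now sees two lines instead of one.
lines = record.splitlines()
print(len(lines))  # 2 -- the first "half" no longer ends with a period
```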

@wiztigers wiztigers added Bug To Analyze Tests Our tests for CI/CD labels Aug 13, 2015
@wiztigers wiztigers added this to the alpha milestone Aug 13, 2015
@wiztigers wiztigers added Tools and removed Tests Our tests for CI/CD labels Aug 13, 2015
@laurentprudhon
Contributor

PROBLEM 1:
Some COBOL programs on the mainframe contain non-printable EBCDIC characters in alphanumericLiterals.
Such alphanumericLiterals are used in VALUE clauses of the DATA DIVISION to initialize tables of bytes. At runtime they are not interpreted as character strings, but as tables of numeric elements.
If the text of these programs is converted to another character set, the character representation of the alphanumericLiteral may be preserved, but the numeric values in the table change, and the program no longer works.
There is NO SOLUTION to this problem, because we cannot know which parts of the program text were encoded as EBCDIC strings with this goal in mind.
We cannot know when we must preserve the textual representation and when we must preserve the numeric representation.
This pattern must be strictly forbidden in all our programs: if a field is initialized as text, it must be interpreted as text at runtime.
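The problem can be sketched with Python's built-in cp037 codec as a stand-in for the mainframe EBCDIC code page (an assumption; the actual code page varies by shop): the characters survive the conversion, but the byte values a program may depend on do not.

```python
# A "table of bytes" alphanumeric literal, as stored in EBCDIC.
ebcdic_bytes = bytes([0x01, 0x02, 0x03, 0x25])

# Convert the program text to Unicode, then store it in another charset.
text = ebcdic_bytes.decode('cp037')     # EBCDIC 0x25 becomes '\n' (LF)
ascii_bytes = text.encode('latin-1')

# The character representation is preserved, but the numeric values in
# the table changed, so a program relying on the byte values is broken.
print(list(ebcdic_bytes) == list(ascii_bytes))  # False
```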

@laurentprudhon
Contributor

PROBLEM 2:
In the context described in PROBLEM 1, we can find an EBCDIC character which maps to a Unicode endOfLine character (\r or \n) in the middle of an alphanumericLiteral.
When the TextDocument class reads the Stream of Unicode chars from an ASCII source file (with explicit line endings), it cannot know whether this endOfLine char really signals an end of line, or whether it is a char inside a literal.
Other languages avoid this problem by forbidding endOfLine chars in literals and defining a special escape sequence, for example "\r\n", to represent the forbidden characters for the compiler.
As there is no such escape sequence in Cobol, there is NO perfect SOLUTION to this problem either.

@laurentprudhon
Contributor

FIX ?
Here is the best thing we could do to reduce the probability of PROBLEM 2 occurring:

  • drop the support for files with Unix/Linux-style single-character line endings
  • make the Windows-style two-character line ending \r\n mandatory
    Then if only one endOfLine char is found in an alphanumericLiteral, we will know it is not the end of the line. But the problem remains if the alphanumericLiteral contains the character sequence \r\n.
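A sketch of the proposed fix and its residual weakness (illustrative; not the actual TextDocument code): splitting only on the two-character sequence leaves a lone \n inside a literal intact, but a literal containing the full \r\n pair is still split incorrectly.

```python
# If \r\n is the only accepted line ending, a lone '\n' inside an
# alphanumeric literal no longer breaks the line split.
source = 'MOVE "A\nB" TO X.\r\nDISPLAY X.\r\n'
lines = source.split('\r\n')[:-1]       # drop the trailing empty element
print(lines)   # ['MOVE "A\nB" TO X.', 'DISPLAY X.'] -- literal intact

# ...but the fix fails if the literal itself contains the '\r\n' pair.
bad = 'MOVE "A\r\nB" TO X.\r\n'
print(len(bad.split('\r\n')))           # 3 -- the literal is split anyway
```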

I implemented and tested this fix successfully on our sample files.
But do you think it is worth dropping the support for Unix/Linux-style source files?
I'll wait for your answers before committing this fix ...

prudholu added a commit that referenced this issue Aug 13, 2015
… I don't know if it is a good idea to merge this fix into master
@wiztigers
Contributor Author

Dropping support for UNIX-style files without correcting the bug would indeed be disappointing.

As I understand the problem, TextDocument takes a Stream of chars. This stream is built by other objects, which take various source file formats as their input. Couldn't we:

  • either give TextDocument another (non-character-based) way to discriminate between lines? Like a custom-created Stream object, or an array of line Streams, ...
  • or modify the File > Stream conversion objects so they return a special way to differentiate lines (for example EOF, with two EOFs in a row marking the end of the source file), independent from the input format, this discrimination mechanism being of course known to TextDocument?
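The first suggestion can be sketched as follows (all names are hypothetical, not TypeCobol API): the format-aware producer hands TextDocument discrete lines, so the line boundary is a property of the iterator rather than a character the consumer must recognize.

```python
# Hypothetical sketch: the File layer yields whole lines; TextDocument
# consumes them without ever scanning for line-ending characters.
def lines_from_stream(chars):
    """Producer side: knows the storage format, emits complete lines."""
    buf = []
    for c in chars:
        if c == '\n':            # format knowledge lives here only
            yield ''.join(buf)
            buf = []
        else:
            buf.append(c)
    if buf:
        yield ''.join(buf)

# The consumer sees discrete lines, whatever the input format was.
print(list(lines_from_stream('ONE\nTWO')))  # ['ONE', 'TWO']
```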

@smedilol
Contributor

RDZ has the same problem and can't correctly interpret such a file.

But ...
Cobol files located inside partitioned data sets contain a fixed number of characters per line.
For our organization it's 80 chars.

So instead of looking for line-ending chars, I think it's better to read 80 chars and consider that a whole line.

Of course this should be one behavior of the parser and must be configurable (use fixed line length or use line-ending chars).
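This suggestion can be sketched as simple slicing (illustrative; the 80-char record length is this organization's convention and should be a parameter):

```python
# Treat every RECORD_LENGTH characters as one line; line-ending
# characters, if any, are just ordinary characters inside the record.
RECORD_LENGTH = 80

def fixed_length_lines(text, record_length=RECORD_LENGTH):
    return [text[i:i + record_length]
            for i in range(0, len(text), record_length)]

source = 'A' * 80 + 'B' * 80        # two records, no delimiters at all
print(len(fixed_length_lines(source)))  # 2
```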

@wiztigers
Contributor Author

Yup, but this 80-char limit makes no sense in free format, and the nice thing is that TextDocument currently doesn't know anything about file formats.
Wouldn't including this notion in TextDocument break the SOLID principles?

@wiztigers wiztigers added the rfc Specifications are not complete. Comment are welcomed. label Aug 14, 2015
@smedilol
Contributor

Maybe one solution is to have 2 implementations of ITextDocument:

  • the current one with line ending chars
  • a new one for fixed line length
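A hypothetical sketch of this two-implementation idea (class and method names mirror the discussion, not the actual TypeCobol interfaces):

```python
from abc import ABC, abstractmethod

class ITextDocument(ABC):
    @abstractmethod
    def lines(self, text): ...

class DelimitedTextDocument(ITextDocument):
    """Current behavior: lines are separated by line-ending chars."""
    def lines(self, text):
        return text.splitlines()

class FixedLengthTextDocument(ITextDocument):
    """New behavior: every N chars is one line; delimiters are ignored."""
    def __init__(self, record_length=80):
        self.record_length = record_length
    def lines(self, text):
        n = self.record_length
        return [text[i:i + n] for i in range(0, len(text), n)]
```

The caller would pick the implementation from configuration, so TextDocument itself never needs to know which storage format produced the text.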

@laurentprudhon
Contributor

Our friend Regis is right here: the idea was to restrict the knowledge of the text storage format (encoding and line endings) to the File namespace, i.e. for now the CobolFile class. The CobolFile implementation normalizes the input as a Stream of Unicode chars with \r, \n, or \r\n line endings. The later phases of the compiler, notably the Text namespace / TextDocument class, don't need to worry about the storage format anymore.
The consequence of this choice is that we need a line-ending character (or character sequence), and that we cannot allow such a character (or character sequence) in character literals in our parser, while the original Cobol specification has no such restriction.
But in fact the architecture we chose for our compiler to read files from disk does not matter: this limitation will always be present if we want to allow free-format Cobol programs in our visual text editor in memory. All the text editors from Eclipse or other IDEs will internally detect a line ending (and display it on screen) if they find such line-ending characters in the string representing one of our program lines.
After thinking a bit more about that, I devised a different fix for this issue, which I committed this morning in the same issue-59 branch:
We recognized above that we will in any case be unable to support Unicode line-ending chars in alphanumericLiterals in interactive editing scenarios.
And we know that EBCDIC alphanumeric literals containing non-printable characters will be broken anyway by the Unicode conversion, because the developer relied explicitly on the numeric codes representing them in the original EBCDIC character set.
So I propose the following solution:

  • restore support for single \r and \n characters as line endings in TextDocument (revert to the previous version of the file)
  • update the CobolFile class: when reading a fixed-length line, if we encounter a line-ending character after Unicode conversion of an original EBCDIC character, replace it on the fly with a question mark '?' char

Document clearly two restrictions of our compiler:

  • because of the internal conversion of the program text to Unicode characters in .Net or Java, we do not support alphanumeric literals containing non-printable EBCDIC characters
  • because of the feature allowing free text format and variable line length, we do not support alphanumeric literals containing line-ending characters

NB: when we say we do not support these two cases, it only has an impact if we generate Cobol from a TypeCobol program and then compile it with the IBM compiler. For Cobol code analysis in memory, it has no impact.
In the two cases above, the solution is to modify the original EBCDIC program text before using our tool:

  • initialize numeric tables directly with numbers instead of their corresponding chars
  • set line-ending chars individually inside alphanumeric literals, for example with reference modification
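The proposed CobolFile change can be sketched as follows, again using Python's cp037 codec as a stand-in EBCDIC code page (an assumption), with `decode_record` as a hypothetical name for the conversion step:

```python
# While decoding a fixed-length record, any character that would become
# a Unicode line ending is replaced with '?' on the fly.
LINE_ENDINGS = {'\r', '\n', '\x85'}   # CR, LF, NEL after EBCDIC conversion

def decode_record(record: bytes, encoding='cp037'):
    return ''.join('?' if c in LINE_ENDINGS else c
                   for c in record.decode(encoding))

record = bytes([0xC1, 0x25, 0xC2])    # 'A', EBCDIC LF, 'B' in cp037
print(decode_record(record))          # 'A?B' -- no false line break later
```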

@laurentprudhon
Contributor

NB: this new fix won't resolve the problems found on our sample files in ASCII format, because it corrects the EBCDIC-to-Unicode conversion process, which in this case has already been executed beforehand.
The solution for our test suite is simply to manually replace the offending line-ending characters in the source file with question mark characters, to mimic the new behavior of the CobolFile class.

@laurentprudhon
Contributor

Sorry, I can't push the new commit today, because it appears I can't reach the GitHub server while using the VPN -> I will push it Monday

prudholu added a commit that referenced this issue Aug 17, 2015
…sue #59 :
@wiztigers
Contributor Author

@prudholu largely solved the problem, and I added the identified restrictions to the appropriate wiki page.
