
Newline in fixed length COBOL source files #59

Closed
wiztigers opened this issue Aug 13, 2015 · 11 comments
Assignees
Labels
Bug rfc Specifications are not complete. Comment are welcomed. Tools

Comments

@wiztigers
Contributor

Seen in a production source file: some lines contain newline characters (although these normally cannot exist in such sources, it seems this one was edited manually with a hex editor).
This ruins the indexing and results in missing PeriodSeparator errors on the first "half" of the line (the second "half" being whatever follows the newline character).
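A minimal sketch of the symptom (illustrative only, not the TypeCobol implementation): when a stray newline is edited into the middle of a fixed-length 80-column record, a reader that splits on line endings sees two short source lines instead of one, so the statement before the newline appears to end without its period.

```python
# One 80-char fixed-length record with a '\n' edited into it by hand.
record = 'MOVE X'.ljust(40) + '\n' + 'TO Y.'.ljust(39)
assert len(record) == 80

# A naive, line-ending-based indexer now sees two lines instead of one.
lines = record.splitlines()
print(len(lines))  # 2 -- the first "half" no longer ends with a period
```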

@wiztigers wiztigers added Bug To Analyze Tests Our tests for CI/CD labels Aug 13, 2015
@wiztigers wiztigers added this to the alpha milestone Aug 13, 2015
@wiztigers wiztigers added Tools and removed Tests Our tests for CI/CD labels Aug 13, 2015
@laurentprudhon
Contributor

PROBLEM 1:
Some COBOL programs on the mainframe contain non-printable EBCDIC characters in alphanumericLiterals.
Such alphanumericLiterals are used in VALUE clauses of the DATA DIVISION to initialize tables of bytes. At runtime they are not interpreted as character strings, but as tables of numeric elements.
If the text of these programs is converted to another character set, the character representation of the alphanumericLiteral may be preserved, but the numeric values in the table change, and the program no longer works.
There is NO SOLUTION to this problem, because we cannot know which parts of the program text were encoded as EBCDIC strings with this goal in mind.
We cannot know when we must preserve the textual representation and when we must preserve the numeric representation.
This pattern must be strictly forbidden in all our programs: if a field is initialized as text, it must be interpreted as text at runtime.
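The problem can be sketched with Python's built-in cp037 codec as a stand-in for the mainframe EBCDIC code page (an assumption; the actual code page varies by shop): the characters survive the conversion, but the byte values a program may depend on do not.

```python
# A "table of bytes" alphanumeric literal, as stored in EBCDIC.
ebcdic_bytes = bytes([0x01, 0x02, 0x03, 0x25])

# Convert the program text to Unicode, then store it in another charset.
text = ebcdic_bytes.decode('cp037')     # EBCDIC 0x25 becomes '\n' (LF)
ascii_bytes = text.encode('latin-1')

# The character representation is preserved, but the numeric values in
# the table changed, so a program relying on the byte values is broken.
print(list(ebcdic_bytes) == list(ascii_bytes))  # False
```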

@laurentprudhon
Contributor

PROBLEM 2:
In the context described in PROBLEM 1, we can find an EBCDIC character which maps to a Unicode endOfLine character (\r or \n) in the middle of an alphanumericLiteral.
When the TextDocument class reads the Stream of Unicode chars from an ASCII source file (with explicit line endings), it cannot know whether this endOfLine char really signals an end of line, or whether it is a char inside a literal.
Other languages avoid this problem by forbidding endOfLine chars in literals and defining a special escape sequence, for example "\r\n", to represent the forbidden characters for the compiler.
As there is no such escape sequence in Cobol, there is NO perfect SOLUTION to this problem either.

@laurentprudhon
Contributor

FIX ?
Here is the best thing we could do to reduce the probability of PROBLEM 2 occurring:

  • drop the support for files with Unix/Linux-style single-character line endings
  • make the Windows-style two-character line ending \r\n mandatory
    Then if only one endOfLine char is found in an alphanumericLiteral, we will know it is not the end of the line. But the problem remains if the alphanumericLiteral contains the character sequence \r\n.
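A sketch of the proposed fix and its residual weakness (illustrative; not the actual TextDocument code): splitting only on the two-character sequence leaves a lone \n inside a literal intact, but a literal containing the full \r\n pair is still split incorrectly.

```python
# If \r\n is the only accepted line ending, a lone '\n' inside an
# alphanumeric literal no longer breaks the line split.
source = 'MOVE "A\nB" TO X.\r\nDISPLAY X.\r\n'
lines = source.split('\r\n')[:-1]       # drop the trailing empty element
print(lines)   # ['MOVE "A\nB" TO X.', 'DISPLAY X.'] -- literal intact

# ...but the fix fails if the literal itself contains the '\r\n' pair.
bad = 'MOVE "A\r\nB" TO X.\r\n'
print(len(bad.split('\r\n')))           # 3 -- the literal is split anyway
```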

I implemented and tested this fix successfully on our sample files.
But do you think it is worth dropping the support for Unix/Linux-style source files?
I'll wait for your answers before committing this fix ...

prudholu added a commit that referenced this issue Aug 13, 2015
… I don't know if it is a good idea to merge this fix into master
@wiztigers
Contributor Author

Dropping support for UNIX-style files without correcting the bug would indeed be disappointing.

As I understand the problem, TextDocument takes a Stream of chars. This stream is built by other objects, which take various source file formats as their input. Couldn't we:

  • either give TextDocument another (non-character-based) way to discriminate between lines? Like a custom-created Stream object, or an array of line Streams, ...
  • or modify the File > Stream conversion objects so they return a special way to differentiate lines (for example EOF, with two EOFs in a row marking the end of the source file), independent from the input format, this discrimination mechanism being of course known to TextDocument?
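The first suggestion can be sketched as follows (all names are hypothetical, not TypeCobol API): the format-aware producer hands TextDocument discrete lines, so the line boundary is a property of the iterator rather than a character the consumer must recognize.

```python
# Hypothetical sketch: the File layer yields whole lines; TextDocument
# consumes them without ever scanning for line-ending characters.
def lines_from_stream(chars):
    """Producer side: knows the storage format, emits complete lines."""
    buf = []
    for c in chars:
        if c == '\n':            # format knowledge lives here only
            yield ''.join(buf)
            buf = []
        else:
            buf.append(c)
    if buf:
        yield ''.join(buf)

# The consumer sees discrete lines, whatever the input format was.
print(list(lines_from_stream('ONE\nTWO')))  # ['ONE', 'TWO']
```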

@smedilol
Contributor

RDZ has the same problem and can't correctly interpret such a file.

But ...
Cobol files located inside partitioned data sets contain a fixed number of characters per line.
For our organization it's 80 chars.

So instead of looking for line-ending chars, I think it's better to read 80 chars and consider that a whole line.

Of course this should be one behavior of the parser and must be configurable (use fixed line length or use line-ending chars).
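This suggestion can be sketched as simple slicing (illustrative; the 80-char record length is this organization's convention and should be a parameter):

```python
# Treat every RECORD_LENGTH characters as one line; line-ending
# characters, if any, are just ordinary characters inside the record.
RECORD_LENGTH = 80

def fixed_length_lines(text, record_length=RECORD_LENGTH):
    return [text[i:i + record_length]
            for i in range(0, len(text), record_length)]

source = 'A' * 80 + 'B' * 80        # two records, no delimiters at all
print(len(fixed_length_lines(source)))  # 2
```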

@wiztigers
Contributor Author

Yup, but this 80-char limit makes no sense in free format, and the nice thing is that TextDocument currently doesn't know anything about file formats.
Wouldn't including this notion in TextDocument break the SOLID principles?

@wiztigers wiztigers added the rfc Specifications are not complete. Comment are welcomed. label Aug 14, 2015
@smedilol
Contributor

Maybe one solution is to have 2 implementations of ITextDocument:

  • the current one with line ending chars
  • a new one for fixed line length
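A hypothetical sketch of this two-implementation idea (class and method names mirror the discussion, not the actual TypeCobol interfaces):

```python
from abc import ABC, abstractmethod

class ITextDocument(ABC):
    @abstractmethod
    def lines(self, text): ...

class DelimitedTextDocument(ITextDocument):
    """Current behavior: lines are separated by line-ending chars."""
    def lines(self, text):
        return text.splitlines()

class FixedLengthTextDocument(ITextDocument):
    """New behavior: every N chars is one line; delimiters are ignored."""
    def __init__(self, record_length=80):
        self.record_length = record_length
    def lines(self, text):
        n = self.record_length
        return [text[i:i + n] for i in range(0, len(text), n)]
```

The caller would pick the implementation from configuration, so TextDocument itself never needs to know which storage format produced the text.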

@laurentprudhon
Contributor

Our friend Regis is right here: the idea was to restrict the knowledge of the text storage format (encoding and line endings) to the File namespace, i.e. for now the CobolFile class. The CobolFile implementation normalizes the input as a Stream of Unicode chars with \r, \n, or \r\n line endings. The later phases of the compiler, notably the Text namespace / TextDocument class, don't need to worry about the storage format anymore.
The consequence of this choice is that we need a line-ending character (or character sequence), and that we cannot allow such a character (or character sequence) in character literals in our parser, while the original Cobol specification has no such restriction.
But in fact the architecture we chose for our compiler to read files from disk does not matter: this limitation will always be present if we want to allow free-format Cobol programs in our visual text editor in memory. All the text editors from Eclipse or other IDEs will internally detect a line ending (and display it on screen) if they find such line-ending characters in the string representing one of our program lines.
After thinking a bit more about that, I devised a different fix for this issue, which I committed this morning in the same issue-59 branch:
We recognized above that we will in any case be unable to support Unicode line-ending chars in alphanumericLiterals in interactive editing scenarios.
And we know that EBCDIC alphanumeric literals containing non-printable characters will be broken anyway by the Unicode conversion, because the developer relied explicitly on the numeric codes representing them in the original EBCDIC character set.
So I propose the following solution:

  • restore support for single \r and \n characters as line endings in TextDocument (revert to the previous version of the file)
  • update the CobolFile class: when reading a fixed-length line, if we encounter a line-ending character after Unicode conversion of an original EBCDIC character, replace it on the fly with a question mark '?' char

Document clearly two restrictions of our compiler:

  • because of the internal conversion of the program text to Unicode characters in .Net or Java, we do not support alphanumeric literals containing non-printable EBCDIC characters
  • because of the feature allowing free text format and variable line length, we do not support alphanumeric literals containing line-ending characters

NB: when we say we do not support these two cases, it only has an impact if we generate Cobol from a TypeCobol program and then compile it with the IBM compiler. For Cobol code analysis in memory, it has no impact.
In the two cases above, the solution is to modify the original EBCDIC program text before using our tool:

  • initialize numeric tables directly with numbers instead of their corresponding chars
  • set line-ending chars individually inside alphanumeric literals, for example with reference modification
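The proposed CobolFile change can be sketched as follows, again using Python's cp037 codec as a stand-in EBCDIC code page (an assumption), with `decode_record` as a hypothetical name for the conversion step:

```python
# While decoding a fixed-length record, any character that would become
# a Unicode line ending is replaced with '?' on the fly.
LINE_ENDINGS = {'\r', '\n', '\x85'}   # CR, LF, NEL after EBCDIC conversion

def decode_record(record: bytes, encoding='cp037'):
    return ''.join('?' if c in LINE_ENDINGS else c
                   for c in record.decode(encoding))

record = bytes([0xC1, 0x25, 0xC2])    # 'A', EBCDIC LF, 'B' in cp037
print(decode_record(record))          # 'A?B' -- no false line break later
```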

@laurentprudhon
Contributor

NB: this new fix won't resolve the problems found on our sample files in ASCII format, because it corrects the EBCDIC-to-Unicode conversion process, which in this case has already been executed beforehand.
The solution for our test suite is simply to manually replace the offending line-ending characters in the source file with question mark characters, to mimic the new behavior of the CobolFile class.

@laurentprudhon
Contributor

Sorry, I can't push the new commit today, because it appears I can't reach the GitHub server while using the VPN -> I will push it Monday

prudholu added a commit that referenced this issue Aug 17, 2015
…sue #59 :
@wiztigers
Contributor Author

@prudholu largely solved the problem, and I added the identified restrictions to the appropriate wiki page.
