Automatically detect (TDL) file encoding from first line #169

goodmami · 2018-08-28T03:29:04Z

TDL files seem to follow the emacs convention of having the first line specify the encoding, e.g.:

# -*- coding: utf-8 -*-

But often there's this odd string:

;;; Hey, emacs(1), this is -*- Mode: TDL; Coding: utf-8; -*- got it?

If we could detect these encoding statements, we could (re)open the file with the specified encoding, instead of making the user specify it themselves. Other settings on this line can probably be ignored, such as Mode: TDL and indent-tabs-mode: nil.

This may be useful for non-TDL files as well, so it could be implemented in delphin.util. Here's some pseudocode:

def detect_encoding(filename, default_encoding='utf-8', comment_char=b';'):
    # Open filename in binary mode and read one line
    # If the line has a comment_char and b"coding:", extract the encoding
    # If the encoding is valid (e.g., using codecs.lookup()) then select it for use
    # Otherwise read and check the second line for b"coding:" and select it if it's valid
    # If there's a BOM: if a non-UTF-8 encoding is selected, raise an error, otherwise select UTF-8
    # Finally select the default encoding if none has been selected yet
    # Close filename
    # Return the selected encoding

Also see here: https://www.python.org/dev/peps/pep-0263/#defining-the-encoding

Edit: changed function to return encoding instead of an open file object

The text was updated successfully, but these errors were encountered:

goodmami · 2018-09-13T03:53:59Z

@mcmillanmajora I think this is pretty straightforward, but let me know if anything is unclear.

mcmillanmajora · 2018-09-14T22:15:44Z

For extracting the encoding from the line, can I assume that the encoding will be bounded by "coding:" and a comment_char, or are there other cases as well?

goodmami · 2018-09-14T22:33:42Z

Not quite. The comment_char is just for the first non-space character on the line. The ; you see after the encoding is just a normal semicolon acting as a delimiter. E.g., if the comment_char were #, you might see:

# -*- coding: utf-8; mode: tdl -*-

Also, I think those -*- things are just decorative. Also note that the line might have encoding: instead of coding:, but both end with coding: so it's easy to match both. The encoding's name shouldn't contain any spaces, so after matching coding: followed by optional spaces, look for the longest sequence of characters valid for encoding names. The PEP I linked to has a nice regular expression for these. Also note that the re module can work with bytestrings instead of (regular) unicode strings, so you can use it on a file opened in binary mode.

mcmillanmajora · 2018-09-15T01:02:41Z

Okay, I think I've got a solution. What's the best way to go about setting up tests for this? I'm not entirely sure how to test it without making a bunch of test case files.

goodmami · 2018-09-15T01:14:42Z

Create a PyTest fixture that setups up a temporary directory and writes some files to it, e.g., one with the ; coding: ... comment, one without, one with a BOM, etc., and make sure your function detects all of them as you'd expect. See here: https://docs.pytest.org/en/latest/fixture.html

And this might be helpful if the documentation is confusing: https://stackoverflow.com/questions/36070031/creating-a-temporary-directory-in-pytest

goodmami · 2018-09-15T01:15:07Z

Btw I already have a solution for #172, so don't worry about that part.

goodmami · 2018-09-15T03:20:50Z

Sorry I rushed those comments before I went for dinner. To clarify, the pytest fixture should create a temporary directory and write the files each time the tests are run, and they will be cleaned up afterwards. Except for creating the files, much of that should be automatically handled by pytest. There will not be any files that are permanently created and checked into the repo.

goodmami · 2018-09-18T19:34:08Z

Here's some more info on using tmpdir with pytest: https://docs.pytest.org/en/latest/tmpdir.html

You can also see how it's used in the mrs_semi_test.py unit test file. It looks like magic because the tmpdir parameter is just there without being defined or imported in the file, but it is replaced with the appropriate object when pytest runs the test.

Here's some example test files:

empty file
no comment (default to UTF-8?)
```
avm := *top*.
```

encoding comment: utf-8

; coding: utf-8
a := character & [ ORTH "あ" ].

encoding comment: utf-8 variant 1

; -*- mode: tdl; encoding: UTF-8; foo: bar -*-
a := character & [ ORTH "あ" ].

encoding comment: utf-8 variant 2 (non-TDL)
```
# coding: utf-8
a="あ"
```

encoding comment: latin-1

; coding: iso-8859-1
a := character & [ ORTH "á" ].

invalid 1 (wrong attribute name)
```
; encode: iso-8859-1
á
```
invalid 2 (invalid encoding)
```
; coding: foo
```

goodmami · 2018-09-27T19:41:13Z

@mcmillanmajora did you have some code to commit? If so please submit a PR and we can fix remaining issues there.

Note: temporary files opened with pytest appear to default to an ascii codecs so there's a unicode error when trying to write the non-ascii characters to the temporary files.

Unicode error fixed; was missing a fixture declaration.

goodmami added this to the v0.9.0 milestone Aug 28, 2018

goodmami assigned mcmillanmajora Aug 28, 2018

goodmami mentioned this issue Sep 13, 2018

Open TDL file with encoding #172

Closed

mcmillanmajora added a commit that referenced this issue Oct 1, 2018

Initial commit for issue #169

ea039e1

Note: temporary files opened with pytest appear to default to an ascii codecs so there's a unicode error when trying to write the non-ascii characters to the temporary files.

mcmillanmajora added a commit that referenced this issue Oct 1, 2018

Update to initial commit for issue #169

47a9146

Unicode error fixed; was missing a fixture declaration.

mcmillanmajora added a commit that referenced this issue Oct 1, 2018

Updated CHANGELOG for issue #169

dc9cd8a

goodmami closed this as completed in 2b779bc Oct 11, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Automatically detect (TDL) file encoding from first line #169

Automatically detect (TDL) file encoding from first line #169

goodmami commented Aug 28, 2018 •

edited

Loading

goodmami commented Sep 13, 2018

mcmillanmajora commented Sep 14, 2018

goodmami commented Sep 14, 2018

mcmillanmajora commented Sep 15, 2018

goodmami commented Sep 15, 2018

goodmami commented Sep 15, 2018

goodmami commented Sep 15, 2018

goodmami commented Sep 18, 2018

goodmami commented Sep 27, 2018

Automatically detect (TDL) file encoding from first line #169

Automatically detect (TDL) file encoding from first line #169

Comments

goodmami commented Aug 28, 2018 • edited Loading

goodmami commented Sep 13, 2018

mcmillanmajora commented Sep 14, 2018

goodmami commented Sep 14, 2018

mcmillanmajora commented Sep 15, 2018

goodmami commented Sep 15, 2018

goodmami commented Sep 15, 2018

goodmami commented Sep 15, 2018

goodmami commented Sep 18, 2018

goodmami commented Sep 27, 2018

goodmami commented Aug 28, 2018 •

edited

Loading