Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Automatically detect (TDL) file encoding from first line #169

Closed
goodmami opened this issue Aug 28, 2018 · 9 comments
Closed

Automatically detect (TDL) file encoding from first line #169

goodmami opened this issue Aug 28, 2018 · 9 comments
Assignees
Milestone

Comments

@goodmami
Copy link
Member

goodmami commented Aug 28, 2018

TDL files seem to follow the emacs convention of having the first line specify the encoding, e.g.:

# -*- coding: utf-8 -*-

But often there's this odd string:

;;; Hey, emacs(1), this is -*- Mode: TDL; Coding: utf-8; -*- got it?

If we could detect these encoding statements, we could (re)open the file with the specified encoding, instead of making the user specify it themselves. Other settings on this line can probably be ignored, such as Mode: TDL and indent-tabs-mode: nil.

This may be useful for non-TDL files as well, so it could be implemented in delphin.util. Here's some pseudocode:

def detect_encoding(filename, default_encoding='utf-8', comment_char=b';'):
    # Open filename in binary mode and read one line
    # If the line has a comment_char and b"coding:", extract the encoding
    # If the encoding is valid (e.g., using codecs.lookup()) then select it for use
    # Otherwise read and check the second line for b"coding:" and select it if it's valid
    # If there's a BOM: if a non-UTF-8 encoding is selected, raise an error, otherwise select UTF-8
    # Finally select the default encoding if none has been selected yet
    # Close filename
    # Return the selected encoding

Also see here: https://www.python.org/dev/peps/pep-0263/#defining-the-encoding

Edit: changed function to return encoding instead of an open file object

@goodmami goodmami added this to the v0.9.0 milestone Aug 28, 2018
@goodmami
Copy link
Member Author

@mcmillanmajora I think this is pretty straightforward, but let me know if anything is unclear.

@mcmillanmajora
Copy link
Contributor

For extracting the encoding from the line, can I assume that the encoding will be bounded by "coding:" and a comment_char, or are there other cases as well?

@goodmami
Copy link
Member Author

Not quite. The comment_char is just for the first non-space character on the line. The ; you see after the encoding is just a normal semicolon acting as a delimiter. E.g., if the comment_char were #, you might see:

# -*- coding: utf-8; mode: tdl -*-

Also, I think those -*- things are just decorative. Also note that the line might have encoding: instead of coding:, but both end with coding: so it's easy to match both. The encoding's name shouldn't contain any spaces, so after matching coding: followed by optional spaces, look for the longest sequence of characters valid for encoding names. The PEP I linked to has a nice regular expression for these. Also note that the re module can work with bytestrings instead of (regular) unicode strings, so you can use it on a file opened in binary mode.

@mcmillanmajora
Copy link
Contributor

Okay, I think I've got a solution. What's the best way to go about setting up tests for this? I'm not entirely sure how to test it without making a bunch of test case files.

@goodmami
Copy link
Member Author

Create a PyTest fixture that setups up a temporary directory and writes some files to it, e.g., one with the ; coding: ... comment, one without, one with a BOM, etc., and make sure your function detects all of them as you'd expect. See here: https://docs.pytest.org/en/latest/fixture.html

And this might be helpful if the documentation is confusing: https://stackoverflow.com/questions/36070031/creating-a-temporary-directory-in-pytest

@goodmami
Copy link
Member Author

Btw I already have a solution for #172, so don't worry about that part.

@goodmami
Copy link
Member Author

Sorry I rushed those comments before I went for dinner. To clarify, the pytest fixture should create a temporary directory and write the files each time the tests are run, and they will be cleaned up afterwards. Except for creating the files, much of that should be automatically handled by pytest. There will not be any files that are permanently created and checked into the repo.

@goodmami
Copy link
Member Author

Here's some more info on using tmpdir with pytest: https://docs.pytest.org/en/latest/tmpdir.html

You can also see how it's used in the mrs_semi_test.py unit test file. It looks like magic because the tmpdir parameter is just there without being defined or imported in the file, but it is replaced with the appropriate object when pytest runs the test.

Here's some example test files:

  • empty file

  • no comment (default to UTF-8?)

    avm := *top*.
    
  • encoding comment: utf-8

    ; coding: utf-8
    a := character & [ ORTH "あ" ].
    
  • encoding comment: utf-8 variant 1

    ; -*- mode: tdl; encoding: UTF-8; foo: bar -*-
    a := character & [ ORTH "あ" ].
    
  • encoding comment: utf-8 variant 2 (non-TDL)

    # coding: utf-8
    a="あ"
    
  • encoding comment: latin-1

    ; coding: iso-8859-1
    a := character & [ ORTH "á" ].
    
  • invalid 1 (wrong attribute name)

    ; encode: iso-8859-1
    á
    
  • invalid 2 (invalid encoding)

    ; coding: foo
    

@goodmami
Copy link
Member Author

@mcmillanmajora did you have some code to commit? If so please submit a PR and we can fix remaining issues there.

mcmillanmajora added a commit that referenced this issue Oct 1, 2018
Note: temporary files opened with pytest appear to default to an ascii
codecs so there's a unicode error when trying to write the non-ascii
characters to the temporary files.
mcmillanmajora added a commit that referenced this issue Oct 1, 2018
Unicode error fixed; was missing a fixture declaration.
mcmillanmajora added a commit that referenced this issue Oct 1, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants