Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tdl.parse() should allow a filename or open file as its argument #104

Closed
goodmami opened this issue Apr 18, 2017 · 8 comments
Closed

tdl.parse() should allow a filename or open file as its argument #104

goodmami opened this issue Apr 18, 2017 · 8 comments
Assignees
Milestone

Comments

@goodmami
Copy link
Member

goodmami commented Apr 18, 2017

delphin.tdl.parse() could be more user-friendly if can take a filename as its argument in addition to open file objects.

Note that currently tdl.parse() can work with any argument that iterates lines:

  • open file objects
  • io.StringIO-wrapped strings
  • list of strings

I don't think the last one is really necessary, but if we do a check like the following, it will be excluded:

def parse(f):
    if hasattr(f, 'read'):
        return _parse(f)
    else:
        with open(f) as fh:
            return _parse(fh)

Alternatively the check could be for string types, e.g.,

from delphin.util import stringtypes
...
    if isinstance(f, stringtypes):
        with open(f) as fh:
            ...

Update:

  • Don't worry about tdl.lex() for now.
@goodmami goodmami modified the milestone: v0.7.0 May 17, 2017
@goodmami goodmami modified the milestones: v0.7.0, v0.8.0 Apr 8, 2018
@goodmami goodmami changed the title tdl.parse() and tdl.lex() only allow open file-like objects; this is too restrictive tdl.parse() should allow a filename or open file as its argument Aug 2, 2018
mcmillanmajora added a commit that referenced this issue Aug 8, 2018
@goodmami
Copy link
Member Author

It seems we didn't test this adequately. Calling tdl.parse() with a filename argument leads to the following error:

Traceback (most recent call last):
  File "tt.py", line 3, in <module>
    list(tdl.parse('/home/goodmami/grammars/erg-trunk/lexicon.tdl'))
  File "/home/goodmami/repos/pydelphin/delphin/tdl.py", line 265, in _parse
    for line_no, event, data in lex(f):
  File "/home/goodmami/repos/pydelphin/delphin/tdl.py", line 201, in lex
    lines = enumerate(stream)
ValueError: I/O operation on closed file

The bug is a bit sneaky:

def parse(f):
    if hasattr(f, 'read'):
        return _parse(f)
    else:
        with open(f) as fh:
            return _parse(fh)

The _parse() function returns a generator which doesn't evaluate until it is iterated over, but the else case returns the generator instead of yielding its values, so the file gets closed as soon as the function returns (i.e., before anything is parsed). There are several ways around this, but one of the nicer ones (yield from) is not available until we remove Python 2 support.

@goodmami goodmami reopened this Aug 28, 2018
@mcmillanmajora
Copy link
Contributor

mcmillanmajora commented Sep 4, 2018

I've been having trouble recreating this error. The tdl file I have parses without errors with both python 2 and 3 as long as the comment strings are properly placed in the file.

@goodmami
Copy link
Member Author

goodmami commented Sep 4, 2018

This is not a Python 2 vs 3 issue, as it affects both. If you just run tdl.parse(filename) it returns a generator object without error, but that generator hasn't actually parsed anything yet. See below:

>>> from delphin import tdl
>>> defs = tdl.parse('/home/goodmami/grammars/erg-trunk/fundamentals.tdl')
>>> defs  # no error
<generator object _parse at 0x7fa4d3bc2af0>
>>> next(defs)  # once you run the generator, you find the file is closed
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "delphin/tdl.py", line 465, in _parse
    for line_no, event, data in lex(f):
  File "delphin/tdl.py", line 393, in lex
    lines = enumerate(stream)
ValueError: I/O operation on closed file

Does the above code (up to next(defs), and using an actual filename on your machine) work for you?

@mcmillanmajora
Copy link
Contributor

Okay, now I'm getting the error! Thanks.

From what I've been reading, it seems that the with keyword is the problem. Do you foresee any issues with changing the else call to:

else:
    fh = open(f, 'r')
    return _parse(fh)

@goodmami
Copy link
Member Author

goodmami commented Sep 4, 2018

The with statement closes the file when execution exits its scope (e.g., after the return). What you propose leaves the file handle open after the function call, which causes a file handle leak. Rather than returning after creating the generator, the parse() function should not return until the generator has been exhausted. After that, the file is closed and the function returns. This can be done by making parse() itself a generator, rather than just returning one. Once this is done, be careful of the next bug: if a function yields values, it cannot also return non-None values, and there's more than one return statement in parse().

@goodmami
Copy link
Member Author

goodmami commented Sep 5, 2018

I should clarify: Python closes open files when there are no more references to the open file object. If you store that fh in some place that persists after the function returns, then the file stays open, otherwise the file will be closed when garbage collection picks up the object without any references (sometime after the function returns). But it is better practice to use the with statement so that it is clear when the file is closed.

And were my hints above about turning parse() into a generator enough to go on, or would you like some clearer instructions?

@mcmillanmajora
Copy link
Contributor

I don't think I understand your last comment about also returning non-None values.

My understanding at the moment is that the items from the _parse() generator need to be iteratively yielded in a new parse() generator rather than returning in both the file argument case and the string argument case. Is this correct?

@goodmami
Copy link
Member Author

goodmami commented Sep 8, 2018

I don't think I understand your last comment about also returning non-None values.

Generators use yield to define when a value should be emitted (when the generator is iterated over), and execution resumes at that point in the next iteration. A return statement may be used in a generator to stop iteration early. In Python 2, this return may not have any value after it:

>>> def gen1():
...   yield 1
... 
>>> def gen2():
...   yield 1
...   return  # this is exactly equivalent to gen1()
...   yield 2  # this value is never yielded
... 
>>> def gen3():
...   yield 1
...   return None  # this is also equivalent, but Python2 doesn't like it
... 
  File "<stdin>", line 3
SyntaxError: 'return' with argument inside generator

In regular functions, reaching the end of the function without a return statement is the same as one ending with the bare return, which is the same as ending with return None. They all return None. In Python 3, you may have return X in a generator with any value of X, but it will still stop iteration early (like a bare return in Python 2), and the X will become the value of the StopIteration exception that is raised:

>>> def gen4():
...   yield 1
...   return 1
... 
>>> g = gen4()
>>> next(g)
1
>>> next(g)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration: 1

So look again at parse():

def parse(f):
    if hasattr(f, 'read'):
        return _parse(f)
    else:
        with open(f) as fh:
            return _parse(fh)

If you change that second return to some kind of yield, then the function becomes a generator function and the first return will raise a StopIteration with the value of _parse(f), so you'll need to change both to yield somehow.

My understanding at the moment is that the items from the _parse() generator need to be iteratively yielded in a new parse() generator rather than returning in both the file argument case and the string argument case. Is this correct?

Yes. So the returns in both spots will need to be changed to something like:

    ...
    for event in _parse(fh):
        yield event

If we drop Python 2 support (which I'm thinking we could do for the next release... I'm getting tired of supporting both), then there's a better form:

    ...
    yield from _parse(fh)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants