File containing a single unescaped " out-of-sample is read incorrectly #1036

Closed
st-pasha opened this issue May 11, 2018 · 7 comments · Fixed by #2708

@st-pasha
Contributor

st-pasha commented May 11, 2018

When fill=False, this example raises an error:

>>> src = "A,B,C\n" + "q,f,r\n" * 100 + "foo,\"bar,bza\n" + "a,bb,ccc\n" * 200
>>> dt.fread(src, fill=False)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/pasha/github/datatable/datatable/fread.py", line 69, in fread
    return freader.read()
  File "/Users/pasha/github/datatable/datatable/fread.py", line 715, in read
    _dt = core.gread(self)
RuntimeError: Too few fields on line 102: expected 3 but found only 2 (with sep=','). Set fill=True to ignore this error.  <<foo,"bar,bza>>

even though the file can be read with QR=3 (and in fact, if we move the problematic line to the beginning of the file, it is parsed correctly).

With fill=True, reading the file does not raise an error, but it returns a 101x3 Frame whose last row contains the value "bar,bza\na,bb,ccc\n..., i.e. the content of the rest of the file.
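The two behaviours described above can be illustrated without datatable. The sketch below is only an analogy: it uses Python's standard `csv` module, not fread's parser, and it assumes that "QR=3" means "treat quote characters as ordinary text", which `csv.QUOTE_NONE` approximates. The names `literal_rows` and `quoted_rows` are made up for this illustration.

```python
import csv
import io

# Same input as in the report: a header, 100 clean rows, one row with a
# single unescaped quote, then 200 more clean rows.
src = "A,B,C\n" + "q,f,r\n" * 100 + "foo,\"bar,bza\n" + "a,bb,ccc\n" * 200

# Quotes treated as ordinary characters (assumed analogue of QR=3).
literal_rows = list(csv.reader(io.StringIO(src), quoting=csv.QUOTE_NONE))

# Quotes treated as field delimiters (csv's default, non-strict dialect):
# the unescaped quote opens a field that is never closed.
quoted_rows = list(csv.reader(io.StringIO(src)))

print(len(literal_rows) - 1)   # data rows when quotes are literal
print(literal_rows[101])       # the problematic line, split on every comma
print(len(quoted_rows) - 1)    # data rows when the quote opens a field
print(len(quoted_rows[-1]))    # number of fields in the final row
```

With literal quotes every line splits into three fields, so all 301 data rows survive. With quote-aware parsing the unterminated quote swallows the rest of the input into one field, which is the same shape of failure as the 101-row Frame reported for fill=True.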

In addition, verbose mode shows

  Estimated number of rows: 2419 / 6.00 = 404
  Initial alloc = 444 rows (404 + 9%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
  All rows were sampled since file is small so we know nrows=99 exactly

Obviously not all rows were sampled, and therefore nrows=99 is a gross underestimation (there should be 301 rows).
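The numbers in the verbose log can be checked by hand. The sketch below is not datatable's code; it assumes the estimate is the file size divided by the mean sampled line length, rounded up, and that the initial allocation is the lower clamp of 1.1 times the estimate. Those assumptions happen to reproduce the logged values, and the variable names are invented for the illustration.

```python
import math

src = "A,B,C\n" + "q,f,r\n" * 100 + "foo,\"bar,bza\n" + "a,bb,ccc\n" * 200

file_size = len(src.encode("utf-8"))    # total bytes in the input
mean_line_len = 6.0                     # taken from the verbose log ("/ 6.00")

# Assumption: the estimate is rounded up to a whole number of rows.
estimated_rows = math.ceil(file_size / mean_line_len)

# Assumption: the allocation is the lower clamp, 1.1 * estimate, truncated.
initial_alloc = int(1.1 * estimated_rows)

# Ground truth: every line ends with "\n"; subtract one for the header.
actual_rows = src.count("\n") - 1

print(file_size, estimated_rows, initial_alloc, actual_rows)
```

The estimate (404) and the allocation (444) both exceed the true row count of 301, so the allocation itself is adequate; the problem is only the later claim that nrows=99 is exact.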

@st-pasha st-pasha added bug Any bugs / errors in datatable; however for severe bugs use [segfault] label fread Issues related to parsing any input files via fread function labels May 11, 2018
@st-pasha st-pasha self-assigned this May 11, 2018
@st-pasha st-pasha mentioned this issue Jan 4, 2020
27 tasks
@st-pasha st-pasha removed their assignment Sep 24, 2020
@pradkrish
Collaborator

pradkrish commented Oct 5, 2020

When I run this example on the main branch with fill=True, it does return a 301x3 Frame.

By the way, how did you run it in verbose mode? I did python -v filename.py and did not see the info you mentioned above.

@st-pasha
Contributor Author

st-pasha commented Oct 5, 2020

"verbose mode" refers to supplying parameter verbose=True to fread:

dt.fread(src, fill=False, verbose=True)

@st-pasha
Contributor Author

st-pasha commented Oct 5, 2020

Indeed, this example seems to be working correctly now, so I'm closing the issue.

@st-pasha st-pasha closed this as completed Oct 5, 2020
@st-pasha st-pasha added this to the Release 1.0.0 milestone Oct 5, 2020
@st-pasha st-pasha self-assigned this Oct 5, 2020
@st-pasha
Contributor Author

st-pasha commented Oct 5, 2020

Ok, I just tried creating a test example for this, and it turns out it's not quite working.
See test-fread-issues.py::test_issue1036.

Even though the dataset is created with the correct shape, the value bza from line 101 is somehow missing. Instead there is None.

@st-pasha st-pasha reopened this Oct 5, 2020
@st-pasha st-pasha removed their assignment Oct 5, 2020
@pradkrish
Collaborator

@st-pasha I don't see test-fread-issues.py::test_issue1036 on the main branch. Are you sure you have added that example?

@st-pasha
Contributor Author

st-pasha commented Nov 4, 2020

You're right, I forgot to push that branch earlier. See #2730

@st-pasha
Contributor Author

Closed in #2709

@st-pasha st-pasha linked a pull request Jun 30, 2021 that will close this issue