-
-
Notifications
You must be signed in to change notification settings - Fork 18.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH Read Stata dta file incrementally #9493
Conversation
@kshedden this is a good idea. See the related issue #9496. We want to try to follow a common pattern for the invocations of this. These other 3 io libs that do this already have a pretty well-defined interface (e.g. they take So if you'd take a look and follow this common pattern would be great (separate issue is actually to consolidate this into a base-class). You are welcome to actually start there if its easier. We can then build off your effort. lmk and we can then examine your proposal in more detail. cc @bashtage |
raise ValueError("For format versions < 117, `convert_categoricals` must be set to False. See docstring for details.") | ||
|
||
first_chunk = False | ||
if not hasattr(self, '_not_first_chunk'): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding attributes as a substitute for a bool
doesnt' seem like a good idea. Why not just add _first_chunk=True
in __init__
? If not using the iterator it will just be ignored.
It makes sense to replace this method just to avoid have quote a bit of repeated code. Maybe 'chunksize=None` would indicate the read the entire file? |
typical we have chunksize=None, iterator=False as the default arguments then it's easy to provide an iterator if requested or simply return the entire file if not |
Thanks for the review. I think I have addressed most of the comments. The interfaces to read_stata and StataReader should now match read_table and TextFileReader, at least in terms of features that are common to both file types. Comments in the current StataReader source state that it is not possible to read the value labels from the end of the dta file in a pre 117 version file without first reading through all the data. It's not clear to me that this is the case. Using the record size, it is possible to calculate the total size of the data in bytes, and jump from the data location to the value labels location. I did this and all the tests pass. However I don't know all the ins and outs of Stata files. so it is possible that I am missing something here. I still need to work on docstrings. |
Correction: I think it is possible if that file has variable length strings that this won't work. This said, we should raise on variable length strings since these aren't supported. In fact, a test that verified the error would be a good idea. http://www.stata.com/help.cgi?dta 5.10 and 5.11 I think correct reading value labels would require scanning backward for the start of the block: |
There was already a function called "_read_strls" that appears to try to read the variable length strings. However this test suite seems to never enter this function. I'll try to deal with this. strls are not mentioned in the stata format 115 docs (http://www.stata.com/help.cgi?dta_115), so I'm guessing that they were introduced in format 117. We don't have a problem seeking to the value labels section in 117 files, the comments in the code that mentioned this limitation were explicit about it being only an issue for format <=115. So putting this all together it looks like we might be safe jumping to the value labels based on the size of the data table (the part with fixed length records) in format <=115, and in format 117 the position of the value labels is explicitly given in the header. |
Follow-up: the strl reader seems not to work and there are checks elsewhere that raise if strls are present. Do we still want to keep _read_strls? |
I would remove the code if it both can't be hit and is broken. |
I got read_strls working, and cleaned up the docstrings. I don't have anything else on my to-do list for this, so anyone who was waiting to review this more carefully can now proceed. |
|
|
||
# legacy | ||
@Appender(_data_method_doc) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be marked for eventual deprecation? data()
as a method doesn't make much sense, and read()
is far clearer. (data
as a property would be more reasonable).
The for the two docs hint at the deprecation.
6ba3997
to
84b683f
Compare
The vbench output is below. Am I supposed to commit something for this?
|
|
||
reader = pd.read_stata('stata.dta', chunksize=1000) | ||
for df in reader: | ||
do_something(df) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
make this a code-block as this will actually execute (and error)
@kshedden vbench is good. Really just a check to see how optimization worked (or if an issue). Good on that. Just some minor stylist points. |
Does anyone know about Stata date formats like "%tdD_m_Y"? We are not currently converting these to dates (they remain as int32) because the check for date formats needs to match "%td" exactly. It appears to convert correctly if we treat the format as if it were simply "%td". |
It appears that %tdD_m_Y should be treated the same as %td. The additional characters are not relevant for reading the file, and are confusing our date conversion routines. So I am now stripping them out. |
@kshedden can you squash this down to a single commit. I think ok to merge after that. |
@@ -55,6 +55,8 @@ performance improvements along with a large number of bug fixes. | |||
|
|||
Highlights include: | |||
|
|||
- Allow Stata files to be read incrementally, support for long strings in Stata files issue(:`9493`) | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pls add a ref link back to the docs
99235ab
to
0f22d8d
Compare
Remove testing code Use partition in null_terminate Manage warnings better in test Further warning management in testing; add skip_data argument Major refactoring to address code review Fix strl reading, templatize docstrings Fix bug in attaching docstring Add new test file Add release note Call read instead of data when calling pandas.read_stata various small issues following code review Improve performance of %td processing Docs edit (minor)
0f22d8d
to
709c034
Compare
ENH Read Stata dta file incrementally
@kshedden thanks for this! pls review the built docs (take 1-2 hours) here: http://pandas-docs.github.io/pandas-docs-travis/ for any corrections. pls submit a follow up pr if necessary. |
This PR adds a get_chunk method to StataReader to allow files to be read incrementally. This is quite useful when processing large files that can barely fit into memory.
The interface is modeled after the analogous method in TextFileReader.
There are some limitations when incrementally converting categoricals in pre 117 version files. I have a work-around that involves reading the file through once to get the label information, then re-reading and converting using a new function called convert_categoricals.
In a fair amount of testing, I always get identical results using get_chunk and reading the file completely.
I left the existing data(...) method for reading the complete file intact. Potentially it could be changed to simply call get_chunk.
A few additional issues: