DataFrame - Add chunksize to from_records() #13818
Comments
Can you say a bit more about why this would be useful to you? Typically …
@TomAugspurger - I can do my best. The idea of using a cursor-like object with a … Thanks
Thanks, that's perfectly reasonable. The other option is to chunk your dataset before passing it to …

    stream = ...
    dfs = (pd.DataFrame(chunk, ...) for chunk in toolz.partition_all(chunksize, stream))

Is adding the parameter worth the extra complexity / maintenance?
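For context, a fuller version of that workaround might look like the sketch below; the gen_records generator and the column names are made up for illustration, and toolz is assumed to be installed.

    import pandas as pd
    import toolz

    def gen_records(n):
        # Hypothetical stand-in for any lazy record source (cursor, generator, ...).
        for i in range(n):
            yield (i, i * 2)

    chunksize = 5000
    stream = gen_records(1000000)

    # One DataFrame per chunk; only a single chunk is materialized at a time.
    dfs = (pd.DataFrame(list(chunk), columns=["a", "b"])
           for chunk in toolz.partition_all(chunksize, stream))

    for df in dfs:
        pass  # process each chunk-sized DataFrame here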
@TomAugspurger - I think it would be worth it in the long run. All the supported datasets should probably allow for chunking, not just a couple.
What is the actual use case for this? If you have a rec-array, by definition it's in memory, so I'm not sure how this helps at all.
There are other 3rd-party objects out there that implement iterators and generators. For example, for spatial data there is arcpy.da.SearchCursor. This object is a standard iterator over a table, but I may not want to load the whole table into memory at once to do my work; rather, I want to process it piece by piece, much like with csv files. The concept here is to process large generators/iterators in smaller pieces to be more efficient with memory consumption.
And you can easily do that. Why should pandas add this very narrow case?
@jreback because iterators and generators are common ways to get data, and it allows a more generic method to load data into DataFrames. Can you provide guidance on how to implement it if it is so easy?
@achapkowski this would be extremely inefficient, but I suppose a use case exists. I said it's easy to do externally.
@jreback but aren't you just loading the whole dataset into memory then? |
Of course. A DataFrame is a fixed size. Expanding it requires reallocation and copying, which is quite expensive.
you can see #5902 if you want to see the discussion |
But chunksize, as in read_csv, returns an iterator of DataFrames of size x. Why couldn't that be done for an iterator object?
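For reference, this is the read_csv behaviour the analogy is drawing on; the file name below is just illustrative:

    import pandas as pd

    # With chunksize, read_csv returns an iterator rather than a single DataFrame;
    # each iteration yields a DataFrame of at most `chunksize` rows.
    reader = pd.read_csv("large_table.csv", chunksize=5000)
    for chunk in reader:
        print(len(chunk))  # only one 5000-row DataFrame is in memory per iteration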
@achapkowski one issue I have with adding it is that all the … That, plus the fact that using something like …
@TomAugspurger @achapkowski API things apart, the way pandas creates a DataFrame from an iterator/generator is by putting the iterator contents into a list and then building the DataFrame from that list. As jreback points out, you could read the discussion on #5902.
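In other words (a minimal illustration, not pandas internals verbatim), passing a generator to from_records still materializes everything at once:

    import pandas as pd

    gen = ((i, i * 2) for i in range(10))
    # The generator is exhausted into an in-memory list before the frame is
    # built, so the full dataset ends up in memory regardless.
    df = pd.DataFrame.from_records(gen, columns=["a", "b"])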
@tinproject - the nrows parameter seems like it will only take, say, 5000 rows from the data source and then stop reading it. Is that not correct?
I just came across a use case where this feature is useful. In my case, I am using ijson to read a large JSON file and want to convert it into a DataFrame to be written to a DB. At no point in my workflow do I want the whole dataset to be held in memory.
A few solutions I could think of: …
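One way to do that today, sketched under the assumption that the JSON file is a top-level array of objects (hence the "item" prefix) and that toolz and SQLAlchemy are available; the file, database, and table names are made up:

    import ijson
    import pandas as pd
    import toolz
    from sqlalchemy import create_engine

    engine = create_engine("sqlite:///output.db")  # illustrative target DB

    with open("large_file.json", "rb") as f:
        records = ijson.items(f, "item")  # lazily yields one JSON object at a time
        for chunk in toolz.partition_all(10000, records):
            df = pd.DataFrame(list(chunk))
            df.to_sql("records", engine, if_exists="append", index=False)
            # only ~10000 records are held in memory at any point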
Code Sample, a copy-pastable example if possible
Currently, from_records() does not take a chunksize parameter.
Enhancement
I would like to see a chunksize option like in read_csv().
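Pandas does not provide this today, but a small wrapper can emulate what a chunksize= argument on from_records() might do; the helper name below is hypothetical:

    from itertools import islice
    import pandas as pd

    def from_records_chunked(records, chunksize, **kwargs):
        # Hypothetical helper: yields DataFrames of at most `chunksize` rows
        # from any iterator of records, mimicking read_csv's chunksize.
        it = iter(records)
        while True:
            chunk = list(islice(it, chunksize))
            if not chunk:
                break
            yield pd.DataFrame.from_records(chunk, **kwargs)

    # Usage: iterate over chunk-sized DataFrames instead of one big frame.
    rows = ((i, i * 2) for i in range(25000))
    for df in from_records_chunked(rows, chunksize=10000, columns=["a", "b"]):
        print(df.shape)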
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 2.7.10.final.0
python-bits: 32
OS: Windows
OS-release: 8
machine: AMD64
processor: Intel64 Family 6 Model 94 Stepping 3, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
pandas: 0.16.1
nose: 1.3.7
Cython: None
numpy: 1.9.2
scipy: 0.15.1
statsmodels: None
IPython: 4.2.0
sphinx: None
patsy: None
dateutil: 2.5.3
pytz: 2016.4
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.4.3
openpyxl: None
xlrd: 0.9.3
xlwt: 1.0.0
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: 0.9.2
apiclient: 1.5.1
sqlalchemy: None
pymysql: None
psycopg2: 2.6.2 (dt dec pq3 ext)