Problem selecting major_axis of a panel in HDFstore from a list if data_columns are used. #5717
Comments
How did you create this store? Please post a complete code example that I can reproduce (with a link to the store). |
Sorry, but I cannot share the vendor's data. The ids are strings. It's just end-of-day finance data generated from the vendor's CSVs. I create a DataFrame, set the index, and create a Panel. I append using the following code:
|
And give me an example of the exact selection code. Why do you have data_columns? Are you planning on selecting based on close (say)? |
So the data is standard (open, high, low, close, volume) data. I will do something such as close > 5 or volume > 1e6. The only term I am trying right now is major_axis. My list is a list of strings for the id: [major_axis=['N03TRC-S-US' 'M89FLH-S-US' 'P7KT40-S-US' 'TXX3S4-S-US' 'QDX491-S-US' |
give me a sample of the df before you created the panel |
each df looks like this:
|
also, thanks so much for the help |
So if you don't use data_columns then this will all work. I have never used data_columns with a Panel, nor is it tested (I guess no one else has either!). You can use a multi-indexed frame, which is basically the same idea. |
Thank you. I will regenerate my files and test again. I appreciate it. (I had thought it would be some string issue.) |
ok...that was a real bug....should be good to go now |
It works now, but it takes up quite a bit of memory. I needed to do this:
|
You have to understand how this works: if you put in a lot of ids, then you are going to end up selecting the ENTIRE list and then doing a reindex. There is no way around this; it's 'sort of' a bug in numexpr (really it has to do with how things are selected). So you have several options:
- Chunk through your list, limiting the id selection to about 30 at a time (e.g. do multiple selections and concat them together). This is probably the fastest.
- Chunk through like you are doing (it will select the entire chunk), so that's your max memory.
You have to really think carefully about how you are using this. I use panels a lot like this, but I append data one date at a time and, more importantly, generally select all of the data for a single date at a time. You have other options that might make sense, e.g. you can store multiple tables that each hold only a limited id set (e.g. AB, CD, EF, etc.; store by letter). I'm not exactly sure how you are using this, so you may have to experiment (I'm also not sure of the size of what you are doing, or why a temporary increase in memory actually matters). |
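For readers weighing the "store by letter" option, here is a minimal sketch of how ids might be grouped into two-letter buckets (AB, CD, EF, ...) before writing each group to its own table. The function names `bucket_for` and `group_ids_by_bucket`, and the `"other"` bucket for ids that do not start with a letter, are assumptions made for illustration; nothing in the thread prescribes this exact scheme.

```python
import string
from collections import defaultdict


def bucket_for(item_id):
    """Return a two-letter bucket name ("AB", "CD", ...) from the id's first character.

    Ids that are empty or do not start with an ASCII letter go to "other"
    (a hypothetical catch-all, not something from the thread).
    """
    first = item_id[:1].upper()
    if len(first) != 1 or first not in string.ascii_uppercase:
        return "other"
    pos = string.ascii_uppercase.index(first)
    start = pos - (pos % 2)  # round down to the first letter of the pair
    return string.ascii_uppercase[start:start + 2]


def group_ids_by_bucket(ids):
    """Group ids by bucket, so each group can be stored in (or read from) its own table."""
    buckets = defaultdict(list)
    for item_id in ids:
        buckets[bucket_for(item_id)].append(item_id)
    return dict(buckets)
```

Each bucket name could then serve as part of a table key, so a query only touches the tables whose buckets contain requested ids. Whether this helps depends on how evenly the vendor's ids spread across letters.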
Thank you very much. That makes sense. Essentially, I am building a number of different quant finance models to combine together. The id's are relatively random. (They are made by my data provider with no rhyme or reason.) Essentially, I need to choose id, then choose dates, then build model. I have inputs that I map to my data provider ids then I pull the data. I am not sure the optimal structure for this. I could do whatever more experienced people think are best. |
The size is on the order of 70 million rows, so a temporary increase in memory could crash the server it's running on, especially if I am trying to run multiple at the same time. |
Just select a limited (< 30) number of ids at a time, then you will have no problem. If you need to select more, then just do multiple queries and concat. |
Thanks, that works. It's slower than I hoped, but it definitely works. |
Is there any way to incorporate major_axis in a meaningful way? It seems like everything gets stored into a giant PyTables table object. By the way, it turned out to be much faster to just store a DataFrame, read it back, and reindex myself. |
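The "multiple queries and concat" advice can be sketched roughly as follows. This is only an illustration: `chunked` and `select_in_chunks` are hypothetical helper names, the chunk size of 30 comes from the suggestion above, and the actual store query is passed in as a callable so the sketch does not assume a particular pandas version or `where` syntax.

```python
def chunked(ids, size=30):
    """Yield successive slices of `ids`, each with at most `size` items."""
    ids = list(ids)
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def select_in_chunks(select, ids, size=30):
    """Call `select(chunk)` once per chunk of ids and return the results as a list.

    `select` is any callable that runs one query for a small list of ids.
    """
    return [select(chunk) for chunk in chunked(ids, size)]


# Hypothetical usage (the names `store` and "prices" are assumptions, and the
# Term form mirrors the selection style discussed in this thread):
#
#   pieces = select_in_chunks(
#       lambda chunk: store.select("prices", [Term("major_axis", "=", chunk)]),
#       ids,
#   )
#   result = pd.concat(pieces)
```

Peak memory is then bounded by the largest single selection plus the accumulated pieces, rather than by one query over the full id list.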
not sure what you mean - can u elaborate? |
on both points:
|
there is no iteration at all; a panel looks almost exactly like a frame; just flattened in 2-d. |
Maybe I am just confused. I thought that there's iteration when reading it out because it's stored in one giant table under the group, so it just iterates through the HDF table. |
You can store a table in many ways; you could do it like you suggest, but there is no benefit. Since I haven't seen comparable code, it is impossible for me to guess what your best setup is. |
Is there any way to use Term to select membership in a large list for a Panel in HDFstore?
I have a hdfstore of the following format: https://gist.github.com/MichaelWS/7997225
My major axis is "id" and minor axis is "date". I am trying to give a large python list into Term and select if an id is in it. This works for a small list and does not work for a larger one. This fails for a list of 338 items.
Here is the traceback: