BUG: issue in HDFStore with too many selectors in a where #2755
Conversation
@jreback I read about the facility to map object -> (e.g.) int16 in io.rst. Will this happen through a dtype header argument to HDFStore.put/append? This is crucial for my project, where I have incoming data rates of ~12Gb/hr at 2 bytes per data point. Storing these integers using 8 bytes (int64) is the greatest barrier to using HDFStore instead of a custom-made interface to PyTables. EDIT: I see now that this note is in reference to the separate PR #2708. Thanks
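For reference, a minimal sketch of the downcasting that already works today, independent of any new dtype argument: convert the columns with astype before appending, so the table stores 2-byte integers. The file path, table name, and column name here are illustrative, not from the discussion.

```python
import numpy as np
import pandas as pd

# illustrative data: values that fit comfortably in 2 bytes
df = pd.DataFrame({'counts': np.arange(1000)})

# downcast to int16 before storing, so each value costs 2 bytes
# on disk instead of the 8 an int64 column would take
df['counts'] = df['counts'].astype(np.int16)

store = pd.HDFStore('data.h5')    # placeholder file name
store.append('measurements', df)  # table format preserves the int16 dtype
store.close()
```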
Thanks @jreback. My workflow:

- the heavy (raw or preprocessed) data
- the metadata and computed tables

For computed tables, I have a derived class of HDFStore that catches a failed getitem and looks up the appropriate compute function (a sketch of this pattern follows below). This way you can empty your 'cache' and have the results reconstructed. Ideally these functions should report the git SHA1 of the repo they live in, along with the parameters used, somewhere (not implemented).

the resources: single machine with 12 cores, 32GB of RAM, data on SATA3-connected spinning disks.

As to how I can help: solo developer for this application (scientific, not compsci background :/), grok python, functional bias, willing to contribute to pandas on HDFStore, relational extensions, interval algebra and spectral tools, and currently working out the git workflow.

What next? Sorry for the long post. I contemplated enhancing HDFStore in Q3 2012 and am extremely happy to see your progress (and grateful for this work!). Please tell me if you'd like to move some of these points to private / pydata list discussion.
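A minimal sketch of that compute-on-miss pattern, assuming a missing key raises KeyError (as HDFStore does); ComputingStore and compute_registry are hypothetical names, not pandas API.

```python
import pandas as pd

class ComputingStore(pd.HDFStore):
    """Hypothetical store that rebuilds tables missing from the 'cache'."""

    def __init__(self, path, compute_registry, **kwargs):
        # compute_registry: assumed mapping of key -> zero-argument
        # function that recomputes the corresponding table
        super().__init__(path, **kwargs)
        self._registry = compute_registry

    def __getitem__(self, key):
        try:
            return super().__getitem__(key)
        except KeyError:
            # cache miss: recompute, store for next time, and return
            result = self._registry[key]()
            self[key] = result
            return result
```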
sounds interesting....prob should move this discussion off-list....contact me [email protected]

here are some thoughts about data organization, which is of course the key thing in PyTables:

- your 'channels' are an index with a small number of columns. this makes sense to group each ...
- I have found that having very 'wide' tables, meaning a large number of columns, reduces performance; try it out with the dtypes PR; I think you should be able to read in and store your tables in ...
- What I mean about manipulations is essentially this: if you do an operation which (possibly) introduces ...
- I believe blosc only actually has either an on/off mode. It doesn't actually do variable compression (I read ...
- Always glad to have help on expanding HDFStore; I use it on a daily basis, but I don't really ...

Jeff
Please do not move this conversation to a private location. I would like to be involved; I have a somewhat similar workflow to meteore's and am continuing to hammer on the pandas PyTables interface.
you are right!
merged, thank you sirs |
this is very hard to reproduce, and can give a ValueError or just crash (somewhere in numexpr or hdf5 it has trouble parsing the expression, and can run out of memory); the easiest solution is just to limit the number of selectors
just to be clear, you really have to do something like this:

    store.select('df', [Term('index', ['A', 'B', 'C', ...])])

where the Term has more than 31 selectors (and fewer than 61) to produce this error, and even then it only triggers sometimes.
This doesn't change functionality at all: if there are more than the specified number of selectors, it just does a filter (i.e. brings in the whole table and then reindexes). This is really only an issue if you actually try to specify many values that the index can take (which usually isn't a string anyhow), but I did hit it! So it's a 'buglet'.
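A hedged sketch of the kind of call that can trigger it, mirroring the Term usage above; the file name, table name, and index labels are made up, and the Term import path reflects pandas of this era.

```python
import pandas as pd
from pandas.io.pytables import Term  # where Term lived at the time

# made-up frame whose index holds many distinct string labels
labels = ['A%02d' % i for i in range(40)]
df = pd.DataFrame({'x': range(40)}, index=labels)

store = pd.HDFStore('test.h5')  # placeholder file name
store.append('df', df)

# 40 selectors falls in the 31-61 range that could fail
# intermittently; with the fix, a where this long is handled as a
# filter (read the whole table, then reindex) rather than being
# passed to numexpr as one giant expression
result = store.select('df', [Term('index', labels)])
store.close()
```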
here's the issue in PyTables (but it is actually somewhere in numexpr):
http://sourceforge.net/mailarchive/message.php?msg_id=30390757
added cleanup code to force removal of output files if testing is interrupted by Ctrl-C
added dotted (attribute) access to stores, e.g. store.df == store['df']
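For illustration, the new attribute access in use (store path and key are made up):

```python
import pandas as pd

store = pd.HDFStore('data.h5')  # placeholder path
store['df'] = pd.DataFrame({'a': [1, 2, 3]})

# dotted (attribute) access now mirrors item access
assert store.df.equals(store['df'])
store.close()
```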