
BUG: issue in HDFStore with too many selectors in a where #2755

Merged 4 commits on Feb 10, 2013

Conversation

jreback
Contributor

@jreback jreback commented Jan 25, 2013

  • this is very hard to reproduce, and can give a ValueError or just crash (somewhere in numexpr or HDF5 it has trouble parsing the expression and can run out of memory); the easiest solution is simply to limit the number of selectors

    just to be clear, you really have to do something like this (a hedged sketch of the triggering call shape follows after this list):

    store.select('df', [ Term('index', ['A','B','C'......]) ])

    where the term has more than 31 selectors (and fewer than 61) to produce this error, and it only triggers sometimes.

    This doesn't change functionality at all: if there are more than the specified number of selectors, it just
    does a filter (e.g. brings in the whole table and then reindexes), and this is really only an issue
    if you actually try to specify many values that the index can be (which usually isn't a string anyhow)

    but I did hit it! so it's a 'buglet'

    here's the issue in PyTables (though it is actually somewhere in numexpr):
    http://sourceforge.net/mailarchive/message.php?msg_id=30390757

  • added cleanup code to force removal of output files if testing is interrupted by Ctrl-C

  • added dotted (attribute) access to stores, e.g. store.df == store['df']
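To make the bug shape and the new attribute access concrete, here is a minimal sketch, assuming the pandas API of that era (Term importable from pandas.io.pytables, two-argument form as quoted above) and a made-up file name; it is illustrative only, not the test added by this PR:

```python
import string
import pandas as pd
from pandas.io.pytables import Term   # where Term lived in this era of pandas

df = pd.DataFrame({'value': range(52)}, index=list(string.ascii_letters))
store = pd.HDFStore('example.h5')
store.append('df', df)

# a where clause listing more than ~31 index values is the shape that could
# upset numexpr/HDF5; above the new limit pandas simply falls back to a filter
# (read the whole table, then reindex) rather than building the huge expression
many = list(string.ascii_letters[:40])
result = store.select('df', [Term('index', many)])

# the dotted (attribute) access added in this PR
assert (store.df == store['df']).all().all()

store.close()
```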

@alvorithm

@jreback I read about the facility to map object->(e.g.)int16 in io.rst.

Will this happen through a dtype header argument to HDFStore.put/append?

This is crucial for my project where I have incoming data rates of ~12Gb/hr at 2 bytes per data point. Storing these integers using 8 bytes (int64) is the greatest barrier for me using HDFStore instead of a custom-made interface to PyTables.

EDIT: I see now that this note is in reference to separate PR #2708. Thanks

@jreback
Contributor Author

jreback commented Feb 7, 2013

if you install PR #2708, then PyTables WILL support int16 (it will store whatever dtypes are passed in). Keep in mind that certain operations on ints may upcast (see Issue #2794). I did a pseudo-example in #2759. Can you give a mini-example of your workflow?
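For anyone following along, a minimal sketch of the dtype round trip being discussed, assuming the change from PR #2708 is installed (the file name and shapes are made up):

```python
import numpy as np
import pandas as pd

# int16 frame, as produced by e.g. an electrophysiology acquisition card
df = pd.DataFrame(
    np.random.randint(-2**15, 2**15, size=(1000, 4)).astype(np.int16),
    columns=list('abcd'),
)

store = pd.HDFStore('ints.h5')
store.append('df', df)        # with the PR, the int16 dtype is written as-is
print(store['df'].dtypes)     # expect int16 for every column
store.close()
```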

@alvorithm

Thanks @jreback
I am currently checking PR #2708. Where would I have to look to help improve support of int16 re. upcasting?

My workflow:

the heavy (raw or preprocessed) data

  • I am ingesting (hopefully via HDFStore) up to ~70Gb daily of binary interleaved 20 or 30kHz 64-channel electrophysiology data in an HDF5 file (flat group structure). These voltages are int16.
  • the "primary key" to these time series is a time column spanning up to a day with 20 or 30kHz resolution.
  • some of the channels can be meaningfully grouped in fours, but only rarely will operations occur on rows across four columns (median/avg); most of the processing is on a single time series, the whole column.
  • the most frequent operation is to obtain ranges of channel data as defined by the time column (which is for now just int64, but should eventually become integer-based datetime64's with microsecond resolution). Better yet if this time index, which is regularly spaced, could be stored generatively (and serialized as a leaf attribute instead of a whole column; this is something the indirection through HDFStore could support).
  • I got the impression that append_to_multiple/select_as_multiple could help in this scenario: all channels referenced to a single index (see the sketch after this list).
  • A fairly frequent operation is to run multithreaded (via FFTW) FFTs or FFT-based convolutions and cache the results in another HDF5 file for future use
    • at the moment each channel is a carray and operations are on a single chunk, which taxes memory too much (casts also occur in FFTW or when calling numexpr).
    • blosc compression attains a factor of ~1.8 and does not improve going from complevel 5 to complevel 9. Blosc is way faster than anything else, and speed is more important here than compression.
  • values are all sacrosanct, i.e. because experimental, once written they should not be modified. In a sense what I need is a column database.
  • another operation is downsampling by factor 4.
  • most useful operations on this raw-data: thresholding, chunk-wise averages/SDs.
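A hedged sketch of the sharded layout hinted at above, assuming made-up file and table names ('raw.h5', 'ch_0_3', ...) and the string where syntax of later pandas versions; it is a rough illustration of append_to_multiple/select_as_multiple, not a tested recipe for this data:

```python
import numpy as np
import pandas as pd

n = 20000 * 60                        # one minute of 20 kHz samples
time = np.arange(n, dtype=np.int64)   # the int64 "primary key" for now
data = pd.DataFrame(
    np.random.randint(-2**15, 2**15, size=(n, 8)).astype(np.int16),
    index=time,
    columns=['ch%02d' % i for i in range(8)],
)

store = pd.HDFStore('raw.h5')

# split the wide frame across tables; 'ch_0_3' acts as the selector table
# (it carries the shared index), the None entry receives all remaining columns
store.append_to_multiple(
    {'ch_0_3': ['ch00', 'ch01', 'ch02', 'ch03'], 'ch_4_7': None},
    data, selector='ch_0_3',
)

# pull a time range back out across all shards with a single where clause
subset = store.select_as_multiple(
    ['ch_0_3', 'ch_4_7'],
    where=['index>=100000', 'index<200000'],
    selector='ch_0_3',
)
store.close()
```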

the metadata and computed tables

  • need to store far more (relationally, dtype-wise) complex tables that are either metadata or computation results, but these can go elsewhere, even to SQL stores if needed. The 'typical worst' of these tables may have about 1E7 rows and 2-5 columns.
  • but these tables are constantly supplemented with synchronized computed columns, so again I think the right approach is to add these computed columns referenced to the original table as selector.

For computed tables, I have a derived class of HDFStore that catches a failed getitem and looks up the appropriate compute function (a rough sketch follows below). This way you can empty your 'cache' and have the results reconstructed. Ideally these functions should report the git SHA1 of the repo they live in, along with the parameters used, somewhere (not implemented).
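Roughly, the pattern described reads something like the following; the class name, registry, and keys are hypothetical, and the only pandas behaviour relied on is that HDFStore raises KeyError for a missing node:

```python
import pandas as pd

compute_registry = {}   # maps store key -> function producing the DataFrame

class ComputingHDFStore(pd.HDFStore):
    """HDFStore that recomputes (and caches) a missing result instead of raising."""

    def __getitem__(self, key):
        try:
            return super(ComputingHDFStore, self).__getitem__(key)
        except KeyError:
            func = compute_registry[key]     # look up the compute function
            result = func()
            self.put(key, result)            # repopulate the 'cache'
            return result

# usage: recomputed on the first miss, read from disk afterwards
compute_registry['/spectra'] = lambda: pd.DataFrame({'power': [1.0, 2.0]})
store = ComputingHDFStore('cache.h5')
df = store['/spectra']
store.close()
```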

the resources

Single machine with 12 cores, 32GB of ram, data on SATA3-connected spinning disks.

As to how I can help: solo developer for this application (scientific, not compsci background :/), grok Python, functional bias, willing to contribute to pandas on HDFStore, relational extensions, interval algebra and spectral tools, and currently working out the git workflow.

what next?

Sorry for the long post. I contemplated enhancing HDFStore in Q3 2012 and am extremely happy to see your progress (and grateful for this work!). Please tell me if you'd like to move some of these points for private / pydata list discussion.

@jreback
Contributor Author

jreback commented Feb 7, 2013

sounds interesting....prob should move this discussion off-list....contact me [email protected]

here's some thoughts about data organization, which is of course the key thing in PyTables:

your 'channels' are an index with a small number of columns, so it makes sense to group each
in a single node; that way it's easy to select by time range or to read them all in.

I have found that very 'wide' tables (a large number of columns) reduce performance;
that is the reason for append_to_multiple/select_as_multiple: to essentially have a 'sharded'
setup

try out with the dtypes PR; I think you should be able to read in and store your tables in
int16.

What I mean about manipulations is essentially this: if you do an operation which (possibly) introduces
a NaN, then there will be upcasting (because pandas doesn't currently have integer NaN, and must
cast to float for various things). I have tried to mitigate this, but depending on exactly what you are
doing you still may get upcasting. That said, there are various ways to deal with this. Best to
take a smallish set of data and try out your workflow, checking at each step.
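As a small, hedged illustration of that upcasting (not jreback's exact example): reindexing an integer Series onto a label it does not have introduces a NaN and forces float64, so it pays to check dtypes after each step.

```python
import numpy as np
import pandas as pd

s = pd.Series(np.array([1, 2, 3], dtype=np.int16))
print(s.dtype)                      # int16

shifted = s.reindex([0, 1, 2, 3])   # the new label has no value -> NaN
print(shifted.dtype)                # float64, since pandas has no integer NaN
```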

I believe blosc actually only has an on/off mode; it doesn't do variable compression (I read
this somewhere on the PyTables website - don't really remember); but it is a great compressor (of
course the underlying data really determines how much it helps).
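For reference, a tiny sketch of how the compression settings are passed to a store (the file name and frame are made up); complevel is the knob the comment above suggests blosc may effectively ignore:

```python
import pandas as pd

df = pd.DataFrame({'ch00': range(1000)})

# compression is configured on the store (or per call via complib=/complevel=)
store = pd.HDFStore('compressed.h5', complib='blosc', complevel=5)
store.append('channels', df)    # rows are blosc-compressed as they are written
store.close()
```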

Always glad to have help on expanding HDFStore; I use it on a daily basis, but I don't really
do out-of-core type computations (nor do I use integer types), so all feedback/changes are welcomed.

Jeff

@scottkidder

Please do not move this conversation to a private location. I would like to be involved, have a somewhat similar workflow to meteore's, and am continuing to hammer on the pandas PyTables interface.

@jreback
Contributor Author

jreback commented Feb 7, 2013

you are right!

@jreback
Contributor Author

jreback commented Feb 10, 2013

@wesm I think you can merge this PR, then cherry-pick meteore's changes from #2824 afterwards

wesm added a commit that referenced this pull request Feb 10, 2013
@wesm wesm merged commit 7065ff0 into pandas-dev:master Feb 10, 2013
@wesm
Member

wesm commented Feb 10, 2013

merged, thank you sirs
