Problem selecting major_axis of a panel in HDFstore from a list if data_columns are used. #5717

Closed · MichaelWS opened this issue Dec 17, 2013 · 23 comments · Fixed by #5718
Labels: Bug, IO HDF5 (read_hdf, HDFStore)

@MichaelWS (Contributor):

Is there any way to use Term to select membership in a large list for a Panel in an HDFStore?

I have an HDFStore of the following format: https://gist.github.com/MichaelWS/7997225

My major axis is "id" and my minor axis is "date". I am trying to pass a large Python list to Term and select the rows whose id is in it. This works for a small list but not for a larger one; it fails for a list of 338 items.

store = pd.HDFStore(store_path, "r")
terms = []
if members is not None:
    terms.append(Term("major_axis", "=", members))
if additional_terms is not None:
    terms += additional_terms
if len(terms) > 0:
    panel = store.select(country, where=terms)
else:
    panel = store.select(country)
store.close()

Here is the traceback:

 File "/home/michael/repos/franklin/factset/reference_data.py", line 1037, in get_frame
  panel = store.select(country, where=terms)
  File "/local/install/pandas_test/lib/python2.7/site- packages/pandas/io/pytables.py", line 637, in select
  auto_close=auto_close).get_values()
  File "/local/install/pandas_test/lib/python2.7/site-packages/pandas/io/pytables.py", line 1311, in get_values
results = self.func(self.start, self.stop)
  File "/local/install/pandas_test/lib/python2.7/site-packages/pandas/io/pytables.py", line 626, in func
columns=columns, **kwargs)
  File "/local/install/pandas_test/lib/python2.7/site-packages/pandas/io/pytables.py", line 3453, in read
sorted_values, items, tuple(N), take_labels)
  File "/local/install/pandas_test/lib/python2.7/site-packages/pandas/core/reshape.py", line 1046, in block2d_to_blocknd
pvalues[i].flat[mask] = values[:, i]
 IndexError: too many indices
@jreback commented Dec 17, 2013:

How did you create this store? Please post a complete code example that I can reproduce (with a link to the store).

@MichaelWS (Author):

Sorry, but I cannot share the vendor's data. The ids are strings.

It's just end-of-day finance data generated from the vendor's CSVs. I create a DataFrame, set the index, and create a Panel. I append using the following code:

    df = df.set_index("date").sort_index()
    panel = df.reset_index().set_index(["id", "date"]).to_panel()
    h5.append(country, panel, format="t",
              data_columns=["liq", "close", "volume"])

@jreback commented Dec 17, 2013:

And give me an example of the exact selection code.

Why do you have data_columns? Are you planning to select based on close, say?

@MichaelWS (Author):

The data is standard (open, high, low, close, volume) data. I will do something such as close > 5 or volume > 1e6. The only term I am trying right now is major_axis; my list is a list of strings for the ids.
Here is the string representation of my terms:

[major_axis=['N03TRC-S-US' 'M89FLH-S-US' 'P7KT40-S-US' 'TXX3S4-S-US' 'QDX491-S-US'
 'VGZC4T-S-US' 'FLDXDF-S-US' 'V84XXD-S-US' 'RDQ6VC-S-US' 'R02HRR-S-US'
 'N03H31-S-US' 'F5YW44-S-US' 'Q4YYS0-S-US' 'QBP1J6-S-US' 'JQXLB4-S-US'
 'MMRHSQ-S-US' 'C33HXG-S-US' 'P7LW50-S-US' 'MZ0PNB-S-US' 'D78CJ8-S-US'
 'BDMWDS-S-US' 'LTSZHS-S-US' 'V3LF0L-S-US' 'M3X6SN-S-US' 'FXJPVX-S-US'
 'GGK810-S-US' 'VFT77X-S-US' 'DBV92C-S-US' 'R6XL7S-S-US' 'MBKTVK-S-US'
 'L6CX6R-S-US' 'TWYMJG-S-US' 'JKD4B7-S-US' 'NCHX4Q-S-US' 'MVKY14-S-US'
 'KQL6C8-S-US' 'XGFGMQ-S-US' 'G6RHZ4-S-US' 'XKSSLM-S-US' 'VK7WP0-S-US'
 'J8BDXS-S-US' 'XMYSCH-S-US' 'MS999V-S-US' 'RBBJ96-S-US' 'JDQZCY-S-US'
 'J9FZW7-S-US' 'S5GXXR-S-US' 'T6DTWR-S-US' 'GTYN0N-S-US' 'JNPC95-S-US'
 'T1PR8N-S-US' 'VBDLSS-S-US' 'NFMJ99-S-US' 'TF1J5S-S-US' 'K15V5Q-S-US'
 'X9RCXK-S-US' 'SW3WLF-S-US' 'D52CVG-S-US' 'G5LTXN-S-US' 'JH4QZ3-S-US'
 'MT931R-S-US' 'D3X1KC-S-US' 'R17LLM-S-US' 'Q1KJQ2-S-US' 'Q8DVNB-S-US'
 'BRWRT6-S-US' 'DBQ20D-S-US' 'T8GLHL-S-US' 'WCC86H-S-US' 'R85KLC-S-US'
 'PQJ6YX-S-US' 'Q3TRMZ-S-US' 'TJF1JW-S-US' 'P8WJT7-S-US' 'H547YV-S-US'
 'S48T3X-S-US' 'K14DFC-S-US' 'T0JCXJ-S-US' 'M647FF-S-US' 'FGX1S8-S-US'
 'R1CXVP-S-US' 'B97HDW-S-US' 'S5G28Y-S-US' 'W38FV3-S-US' 'G25QK2-S-US'
 'HG5T14-S-US' 'KGTC07-S-US' 'XF3QYH-S-US' 'P321KV-S-US' 'HG5HFV-S-US'
 'JS74P6-S-US' 'V601BM-S-US' 'DY0D0T-S-US' 'LKTW6W-S-US' 'PJSZYL-S-US'
 'MCPQD9-S-US' 'T7HSNM-S-US' 'N5N6M6-S-US' 'KQQPYH-S-US' 'N91ZBG-S-US'
 'BYJPFC-S-US' 'F1F0GY-S-US' 'TTHQS7-S-US' 'NV2X6X-S-US' 'DF3K29-S-US'
 'RWHXPV-S-US' 'HVVK2Q-S-US' 'DCZH6D-S-US' 'RWC9XJ-S-US' 'BRWF4W-S-US'
 'C7R3T7-S-US' 'HF1GMZ-S-US' 'L9QXLC-S-US' 'WX72B4-S-US' 'SJWX0Y-S-US'
 'C6M0W9-S-US' 'K6TLGQ-S-US' 'R4SVX3-S-US' 'QQ57WH-S-US' 'RGW4H5-S-US'
 'X69BS3-S-US' 'QSFJYG-S-US' 'DWPYXW-S-US' 'K8YTSW-S-US' 'JS255J-S-US'
 'WC6XYF-S-US' 'G4G3T4-S-US' 'XP3C6W-S-US' 'R4N5ZQ-S-US' 'J1C29R-S-US'
 'X7PFZK-S-US' 'NV30F0-S-US' 'PZFFN1-S-US' 'GC3T03-S-US' 'QR4XC8-S-US'
 'FJ1F7G-S-US' 'N9757B-S-US' 'D1RR34-S-US' 'P8RDYC-S-US' 'VYQXV7-S-US'
 'DBNSL6-S-US' 'JFZJY4-S-US' 'BL9X3Z-S-US' 'NHQ08T-S-US' 'NT47N2-S-US'
 'LQCKJX-S-US' 'GNCWJJ-S-US' 'RPP607-S-US' 'TVR88D-S-US' 'W8XCG8-S-US'
 'L2Y6QM-S-US' 'B16ZFN-S-US' 'JLGFVH-S-US' 'LHLM3W-S-US' 'SPG28F-S-US'
 'KHZRFD-S-US' 'VT5YZY-S-US' 'VCRDJ0-S-US' 'NK0GFS-S-US' 'J5WH3J-S-US'
 'TCWFBX-S-US' 'VMP45M-S-US' 'LY1BXZ-S-US' 'FPVR4P-S-US' 'LQCV25-S-US'
 'B98H9T-S-US' 'MQ18S4-S-US' 'QCS8TB-S-US' 'G7RJ37-S-US' 'D656C6-S-US'
 'NK00LC-S-US' 'NMF5YX-S-US' 'T1T2MS-S-US' 'KFMLXR-S-US' 'L8N68Q-S-US'
 'KFJ9MN-S-US' 'KPMGNC-S-US' 'R5RF8V-S-US' 'FMGYDF-S-US' 'T2YG2Z-S-US'
 'J9GHVQ-S-US' 'DZ3TQ4-S-US' 'FXJMQV-S-US' 'GFJKQD-S-US' 'T56GMP-S-US'
 'XF9TK6-S-US' 'W002HZ-S-US' 'X7GP26-S-US' 'LZ5XSC-S-US' 'PN3GCR-S-US'
 'TKPN21-S-US' 'WSV8FS-S-US' 'W926PT-S-US' 'PKTM88-S-US' 'L9QYND-S-US'
 'WW7382-S-US' 'VWM15D-S-US' 'TN04F4-S-US' 'FBB0WZ-S-US' 'LFBF80-S-US'
 'FWHNRV-S-US' 'KT230F-S-US' 'PKYBBR-S-US' 'P5FR84-S-US' 'P2W5CB-S-US'
 'B987ZQ-S-US' 'MJD4RS-S-US' 'NNL4JN-S-US' 'X9VCMC-S-US' 'T1NHXH-S-US'
 'J6271X-S-US' 'GX96DR-S-US' 'HWZGPH-S-US' 'H2W031-S-US' 'VLCFKF-S-US'
 'SPB6XT-S-US' 'RT7JFS-S-US' 'CX2VS5-S-US' 'JLL842-S-US' 'B1BQL5-S-US'
 'GHWBCK-S-US' 'B3NC6D-S-US' 'LTWRTD-S-US' 'WW5CZG-S-US' 'NNP23D-S-US'
 'Q9S6F2-S-US' 'X6BVNJ-S-US' 'LRJH5R-S-US' 'JJ8RRZ-S-US' 'XN2JJ0-S-US'
 'PHJ4Z6-S-US' 'SC8V1L-S-US' 'F4NDZ5-S-US' 'JMRX4L-S-US' 'K6S3NB-S-US'
 'GB43G9-S-US' 'HCYM26-S-US' 'FS549P-S-US' 'SVZRQK-S-US' 'WPK8Y2-S-US'
 'V9FMMN-S-US' 'XJPTTR-S-US' 'MR4BW7-S-US' 'HGC6B6-S-US' 'DWW3RQ-S-US'
 'H0QZ0Y-S-US' 'V3LWY1-S-US' 'KQKRHV-S-US' 'KHRMVQ-S-US' 'LY23CP-S-US'
 'H9SM9B-S-US' 'QLMMTM-S-US' 'P0X865-S-US' 'QLL3Y6-S-US' 'DR9KSW-S-US'
 'XHJHJN-S-US' 'GX4W4P-S-US' 'D50Z54-S-US' 'BVFQD9-S-US' 'JBFL6Y-S-US'
 'KHRJNM-S-US' 'J0JKSY-S-US' 'DN6FKS-S-US' 'FMG5YQ-S-US' 'WPJ3PZ-S-US'
 'CHNTDT-S-US' 'BL5JSW-S-US' 'DKJ2GF-S-US' 'M518NK-S-US' 'VCSCBX-S-US'
 'DYZ4XW-S-US' 'VVFQYB-S-US' 'R1C7K1-S-US' 'HNZZTB-S-US' 'GPBWRP-S-US'
 'VFZRT3-S-US' 'D9KVWC-S-US' 'BWCC05-S-US' 'T6BMPP-S-US' 'VKDB45-S-US'
 'XGSM09-S-US' 'GW5WZK-S-US' 'THD6V0-S-US' 'X6FXHD-S-US' 'RCGDTY-S-US'
 'G7VKY3-S-US' 'C5QPWS-S-US' 'H8WVGD-S-US' 'VJCJHB-S-US' 'RFV7N7-S-US'
 'SHRWC6-S-US' 'WC0FKB-S-US' 'P21B22-S-US' 'MN48BW-S-US' 'VS43B3-S-US'
 'F3NL57-S-US' 'SKWSYY-S-US' 'Q4ZSBS-S-US' 'QMRMKG-S-US' 'NJ1MJS-S-US'
 'LY6WH5-S-US' 'X7FN7C-S-US' 'PJQNHD-S-US' 'TWS2Z8-S-US' 'L9RFKV-S-US'
 'S8TF9W-S-US' 'GZDYTG-S-US' 'J5XTNS-S-US' 'QH5T7J-S-US' 'D4ZW23-S-US'
 'SP9R0C-S-US' 'JXZR9V-S-US' 'WQPBSZ-S-US' 'FGW52F-S-US' 'JS65V9-S-US'
 'HXZ9HF-S-US' 'MJBXDM-S-US' 'WYJW4F-S-US' 'PQCHZH-S-US' 'BXMDGT-S-US'
 'HVQ5RM-S-US' 'X9R5PJ-S-US' 'W0ZHJJ-S-US' 'HG6Y66-S-US' 'QBS6K5-S-US'
 'STV928-S-US' 'CHSKKB-S-US' 'PQ5T45-S-US' 'SJT3QB-S-US' 'CGLVJW-S-US'
 'NJZYG9-S-US' 'H59PH3-S-US' 'L9L9V2-S-US' 'CGGG4R-S-US' 'QTFPBP-S-US'
 'MG3J1H-S-US' 'PKY98Q-S-US' 'QLQSVL-S-US']]

@jreback commented Dec 17, 2013:

Give me a sample of the df before you create the panel, as in your code above.

@MichaelWS (Author):

Each df looks like this:
https://gist.github.com/MichaelWS/7998356

    print df.head()
    panel = df.reset_index().set_index(["id", "date"]).to_panel()

@MichaelWS (Author):

Also, thanks so much for the help.

@jreback commented Dec 17, 2013:

So if you don't use data_columns, then this will all work.

I have never used (nor is it tested) data_columns with a Panel, and I guess no one else has either!

You can use a multi-indexed frame instead, which is basically the same idea.
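
For reference, a minimal sketch of the multi-indexed frame alternative (the store path and group name here are hypothetical; the ids are taken from the terms above). When a MultiIndex frame is appended in table format, the index levels are stored as queryable columns alongside any data_columns:

    import pandas as pd
    from pandas.io.pytables import Term

    # df has columns ["id", "date", "open", "high", "low", "close", "volume"]
    mi = df.reset_index().set_index(["id", "date"]).sort_index()

    h5 = pd.HDFStore("prices.h5", "w")   # hypothetical path
    h5.append("us", mi, format="t", data_columns=["liq", "close", "volume"])

    # both the "id" index level and the data_columns can be used in `where`
    sub = h5.select("us", where=[Term("id", "=", ["N03TRC-S-US", "M89FLH-S-US"]),
                                 Term("close", ">", 5)])
    h5.close()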

@MichaelWS (Author):

Thank you. I will regenerate my files and test again. I appreciate it. (I had thought it was some string issue.)

@jreback commented Dec 17, 2013:

OK... that was a real bug... should be good to go now.

@MichaelWS (Author):

It works now, but it takes up quite a bit of memory. I needed to do this:

    out = []
    panels = store.select(country, where=terms, chunksize=100000)
    for panel in panels:
        out.append(panel.to_frame().reset_index())
    df = pd.concat(out, ignore_index=True)

@jreback commented Dec 17, 2013:

You have to understand how this works.

If you put in a lot of ids, then you are going to end up selecting the ENTIRE table and then doing a reindex. There is no way around this; it's 'sort of' a bug in numexpr (really it has to do with how things are selected).

So you have several options:

Chunk through your list, limiting the id selection to about 30 at a time (i.e. do multiple selections and concat them together); see the sketch below. This is probably the fastest.

Chunk through like you are doing (it will select the entire chunk at a time), so that bounds your max memory.

You have to think carefully about how you are using this.

I use panels a lot like this, but I append data one date at a time and, more importantly, generally select all of the data for a single date at a time.

You have other options that might make sense, e.g. you can store multiple tables that each hold only a limited id set (e.g. A-B, C-D, E-F), i.e. store by letter.

I am not exactly sure how you are using this, so you may have to experiment (I am also not sure of the size of what you are doing, or why a temporary increase in memory actually matters).
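
A minimal sketch of the first option, reusing `store`, `country`, `members`, and Term from the snippets above (the chunk size of 30 follows the suggestion above; it is not a hard limit):

    import pandas as pd
    from pandas.io.pytables import Term

    CHUNK = 30
    pieces = []
    for i in range(0, len(members), CHUNK):
        ids = members[i:i + CHUNK]
        # each query stays small enough to avoid the full-table scan
        sub = store.select(country, where=[Term("major_axis", "=", ids)])
        pieces.append(sub.to_frame().reset_index())
    df = pd.concat(pieces, ignore_index=True)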

@MichaelWS (Author):

Thank you very much. That makes sense.

Essentially, I am building a number of different quant finance models to combine together. The ids are fairly random (they are assigned by my data provider with no rhyme or reason). I need to choose ids, then choose dates, then build the model. I have inputs that I map to my data provider's ids, and then I pull the data. I am not sure of the optimal structure for this and am open to whatever more experienced people think is best.

@MichaelWS (Author):

The size is on the order of 70 million rows, so a temporary increase in memory could crash the server it's running on, especially if I am running multiple jobs at the same time.

@jreback commented Dec 17, 2013:

Just select a limited (< 30) number of ids at a time and you will have no problem. If you need to select more, then do multiple queries and concat.

@jreback commented Dec 17, 2013:

#2755

@MichaelWS (Author):

Thanks, that works. It's slower than I hoped, but it definitely works.

@MichaelWS (Author):

Is there any way to incorporate the major_axis in a meaningful way? It seems like everything gets stored in one giant PyTables table object.

By the way, it turned out to be much faster to just store a DataFrame, select it, and reindex myself.

@jreback commented Dec 19, 2013:

Not sure what you mean - can you elaborate?

@MichaelWS (Author):

On both points:

1. My point about the major_axis in a panel: essentially, if you create a panel it looks like everything is stored in one PyTables table for the group. Isn't selection just iterating over that table? It seems to me to lose the benefit of the axis. Shouldn't the axes be separated, so you don't have to iterate over all the data in an axis that doesn't match your query?

2. I switched back from a panel to a DataFrame, and I am simply querying the data on the non-string parameters and reindexing myself on the strings afterwards (a sketch of that pattern follows below). The speed difference is 5x for me versus chunking through strings along with the other parameters.
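
A minimal sketch of that pattern, assuming the frame is stored with "id" as a regular column and the numeric fields as data_columns (column names taken from earlier in the thread):

    # query only on the numeric data_columns...
    df = store.select(country, where=[Term("close", ">", 5),
                                      Term("volume", ">", 1e6)])
    # ...then filter the string ids in memory
    df = df[df["id"].isin(members)]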

@jreback commented Dec 19, 2013:

There is no iteration at all; a panel looks almost exactly like a frame, just flattened to 2-d.
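
A rough illustration of the flattening (note that to_frame() by default drops (major, minor) rows that are missing any observation, so the dense case is shown):

    flat = panel.to_frame()   # rows: MultiIndex (major, minor); columns: items
    # for a fully dense panel:
    #   len(flat) == len(panel.major_axis) * len(panel.minor_axis)
    print flat.shape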

@MichaelWS (Author):

Maybe I am just confused. I thought there was iteration when reading it out, because it's stored in one giant table under the group; it just iterates through the HDF table.
I am not super knowledgeable about HDF5, but my essential question (and this may be for Francesc or the HDF Group) is: could we store a separate table for each major-axis value of a panel? It seems like you could reindex on the major axis much faster that way. Also, I may be wrongly assuming that people query mostly on their major axis, then the second axis, and so on.

@jreback commented Dec 19, 2013:

You can store a table in many ways; you could do it as you suggest, but there is no benefit.

Since I haven't seen comparable code, it is impossible for me to guess what your best setup is.
