Problem selecting major_axis of a panel in HDFstore from a list if data_columns are used. #5717

Closed · MichaelWS opened this issue Dec 17, 2013 · 23 comments · Fixed by #5718
Labels: Bug, IO HDF5 (read_hdf, HDFStore)

@MichaelWS (Contributor):

Is there any way to use Term to select membership in a large list for a Panel in an HDFStore?

I have an HDFStore of the following format: https://gist.github.com/MichaelWS/7997225

My major axis is "id" and my minor axis is "date". I am trying to pass a large Python list to Term and select the rows whose id is in it. This works for a small list but not for a larger one; it fails for a list of 338 items.

store = pd.HDFStore(store_path, "r")
terms = []
if members is not None:
    terms.append(Term("major_axis", "=", members))
if additional_terms is not None:
    terms += additional_terms
if len(terms) > 0:
    panel = store.select(country, where=terms)
else:
    panel = store.select(country)
store.close()

Here is the traceback:

 File "/home/michael/repos/franklin/factset/reference_data.py", line 1037, in get_frame
  panel = store.select(country, where=terms)
  File "/local/install/pandas_test/lib/python2.7/site- packages/pandas/io/pytables.py", line 637, in select
  auto_close=auto_close).get_values()
  File "/local/install/pandas_test/lib/python2.7/site-packages/pandas/io/pytables.py", line 1311, in get_values
results = self.func(self.start, self.stop)
  File "/local/install/pandas_test/lib/python2.7/site-packages/pandas/io/pytables.py", line 626, in func
columns=columns, **kwargs)
  File "/local/install/pandas_test/lib/python2.7/site-packages/pandas/io/pytables.py", line 3453, in read
sorted_values, items, tuple(N), take_labels)
  File "/local/install/pandas_test/lib/python2.7/site-packages/pandas/core/reshape.py", line 1046, in block2d_to_blocknd
pvalues[i].flat[mask] = values[:, i]
 IndexError: too many indices
@jreback commented Dec 17, 2013:

How did you create this store? Please post a complete code example that I can reproduce (with a link to the store).

@MichaelWS (Author):

Sorry, but I cannot share the vendor's data. The ids are strings.

It's just end-of-day finance data generated from the vendor's CSVs. I create a DataFrame, set the index, and create a Panel. I append using the following code:

    df = df.set_index("date").sort_index()
    panel = df.reset_index().set_index(["id", "date"]).to_panel()
    h5.append(country, panel, format="t",
              data_columns=["liq", "close", "volume"])

@jreback commented Dec 17, 2013:

And give me an example of the exact selection code.

Why do you have data_columns? Are you planning to select based on close, say?

@MichaelWS (Author):

The data is standard (open, high, low, close, volume) data. I will do something such as close > 5 or volume > 1e6. The only term I am trying right now is major_axis; my list is a list of strings for the ids.
Here is the string representation of my terms:

[major_axis=['N03TRC-S-US' 'M89FLH-S-US' 'P7KT40-S-US' 'TXX3S4-S-US' 'QDX491-S-US'
 'VGZC4T-S-US' 'FLDXDF-S-US' 'V84XXD-S-US' 'RDQ6VC-S-US' 'R02HRR-S-US'
 'N03H31-S-US' 'F5YW44-S-US' 'Q4YYS0-S-US' 'QBP1J6-S-US' 'JQXLB4-S-US'
 'MMRHSQ-S-US' 'C33HXG-S-US' 'P7LW50-S-US' 'MZ0PNB-S-US' 'D78CJ8-S-US'
 'BDMWDS-S-US' 'LTSZHS-S-US' 'V3LF0L-S-US' 'M3X6SN-S-US' 'FXJPVX-S-US'
 'GGK810-S-US' 'VFT77X-S-US' 'DBV92C-S-US' 'R6XL7S-S-US' 'MBKTVK-S-US'
 'L6CX6R-S-US' 'TWYMJG-S-US' 'JKD4B7-S-US' 'NCHX4Q-S-US' 'MVKY14-S-US'
 'KQL6C8-S-US' 'XGFGMQ-S-US' 'G6RHZ4-S-US' 'XKSSLM-S-US' 'VK7WP0-S-US'
 'J8BDXS-S-US' 'XMYSCH-S-US' 'MS999V-S-US' 'RBBJ96-S-US' 'JDQZCY-S-US'
 'J9FZW7-S-US' 'S5GXXR-S-US' 'T6DTWR-S-US' 'GTYN0N-S-US' 'JNPC95-S-US'
 'T1PR8N-S-US' 'VBDLSS-S-US' 'NFMJ99-S-US' 'TF1J5S-S-US' 'K15V5Q-S-US'
 'X9RCXK-S-US' 'SW3WLF-S-US' 'D52CVG-S-US' 'G5LTXN-S-US' 'JH4QZ3-S-US'
 'MT931R-S-US' 'D3X1KC-S-US' 'R17LLM-S-US' 'Q1KJQ2-S-US' 'Q8DVNB-S-US'
 'BRWRT6-S-US' 'DBQ20D-S-US' 'T8GLHL-S-US' 'WCC86H-S-US' 'R85KLC-S-US'
 'PQJ6YX-S-US' 'Q3TRMZ-S-US' 'TJF1JW-S-US' 'P8WJT7-S-US' 'H547YV-S-US'
 'S48T3X-S-US' 'K14DFC-S-US' 'T0JCXJ-S-US' 'M647FF-S-US' 'FGX1S8-S-US'
 'R1CXVP-S-US' 'B97HDW-S-US' 'S5G28Y-S-US' 'W38FV3-S-US' 'G25QK2-S-US'
 'HG5T14-S-US' 'KGTC07-S-US' 'XF3QYH-S-US' 'P321KV-S-US' 'HG5HFV-S-US'
 'JS74P6-S-US' 'V601BM-S-US' 'DY0D0T-S-US' 'LKTW6W-S-US' 'PJSZYL-S-US'
 'MCPQD9-S-US' 'T7HSNM-S-US' 'N5N6M6-S-US' 'KQQPYH-S-US' 'N91ZBG-S-US'
 'BYJPFC-S-US' 'F1F0GY-S-US' 'TTHQS7-S-US' 'NV2X6X-S-US' 'DF3K29-S-US'
 'RWHXPV-S-US' 'HVVK2Q-S-US' 'DCZH6D-S-US' 'RWC9XJ-S-US' 'BRWF4W-S-US'
 'C7R3T7-S-US' 'HF1GMZ-S-US' 'L9QXLC-S-US' 'WX72B4-S-US' 'SJWX0Y-S-US'
 'C6M0W9-S-US' 'K6TLGQ-S-US' 'R4SVX3-S-US' 'QQ57WH-S-US' 'RGW4H5-S-US'
 'X69BS3-S-US' 'QSFJYG-S-US' 'DWPYXW-S-US' 'K8YTSW-S-US' 'JS255J-S-US'
 'WC6XYF-S-US' 'G4G3T4-S-US' 'XP3C6W-S-US' 'R4N5ZQ-S-US' 'J1C29R-S-US'
 'X7PFZK-S-US' 'NV30F0-S-US' 'PZFFN1-S-US' 'GC3T03-S-US' 'QR4XC8-S-US'
 'FJ1F7G-S-US' 'N9757B-S-US' 'D1RR34-S-US' 'P8RDYC-S-US' 'VYQXV7-S-US'
 'DBNSL6-S-US' 'JFZJY4-S-US' 'BL9X3Z-S-US' 'NHQ08T-S-US' 'NT47N2-S-US'
 'LQCKJX-S-US' 'GNCWJJ-S-US' 'RPP607-S-US' 'TVR88D-S-US' 'W8XCG8-S-US'
 'L2Y6QM-S-US' 'B16ZFN-S-US' 'JLGFVH-S-US' 'LHLM3W-S-US' 'SPG28F-S-US'
 'KHZRFD-S-US' 'VT5YZY-S-US' 'VCRDJ0-S-US' 'NK0GFS-S-US' 'J5WH3J-S-US'
 'TCWFBX-S-US' 'VMP45M-S-US' 'LY1BXZ-S-US' 'FPVR4P-S-US' 'LQCV25-S-US'
 'B98H9T-S-US' 'MQ18S4-S-US' 'QCS8TB-S-US' 'G7RJ37-S-US' 'D656C6-S-US'
 'NK00LC-S-US' 'NMF5YX-S-US' 'T1T2MS-S-US' 'KFMLXR-S-US' 'L8N68Q-S-US'
 'KFJ9MN-S-US' 'KPMGNC-S-US' 'R5RF8V-S-US' 'FMGYDF-S-US' 'T2YG2Z-S-US'
 'J9GHVQ-S-US' 'DZ3TQ4-S-US' 'FXJMQV-S-US' 'GFJKQD-S-US' 'T56GMP-S-US'
 'XF9TK6-S-US' 'W002HZ-S-US' 'X7GP26-S-US' 'LZ5XSC-S-US' 'PN3GCR-S-US'
 'TKPN21-S-US' 'WSV8FS-S-US' 'W926PT-S-US' 'PKTM88-S-US' 'L9QYND-S-US'
 'WW7382-S-US' 'VWM15D-S-US' 'TN04F4-S-US' 'FBB0WZ-S-US' 'LFBF80-S-US'
 'FWHNRV-S-US' 'KT230F-S-US' 'PKYBBR-S-US' 'P5FR84-S-US' 'P2W5CB-S-US'
 'B987ZQ-S-US' 'MJD4RS-S-US' 'NNL4JN-S-US' 'X9VCMC-S-US' 'T1NHXH-S-US'
 'J6271X-S-US' 'GX96DR-S-US' 'HWZGPH-S-US' 'H2W031-S-US' 'VLCFKF-S-US'
 'SPB6XT-S-US' 'RT7JFS-S-US' 'CX2VS5-S-US' 'JLL842-S-US' 'B1BQL5-S-US'
 'GHWBCK-S-US' 'B3NC6D-S-US' 'LTWRTD-S-US' 'WW5CZG-S-US' 'NNP23D-S-US'
 'Q9S6F2-S-US' 'X6BVNJ-S-US' 'LRJH5R-S-US' 'JJ8RRZ-S-US' 'XN2JJ0-S-US'
 'PHJ4Z6-S-US' 'SC8V1L-S-US' 'F4NDZ5-S-US' 'JMRX4L-S-US' 'K6S3NB-S-US'
 'GB43G9-S-US' 'HCYM26-S-US' 'FS549P-S-US' 'SVZRQK-S-US' 'WPK8Y2-S-US'
 'V9FMMN-S-US' 'XJPTTR-S-US' 'MR4BW7-S-US' 'HGC6B6-S-US' 'DWW3RQ-S-US'
 'H0QZ0Y-S-US' 'V3LWY1-S-US' 'KQKRHV-S-US' 'KHRMVQ-S-US' 'LY23CP-S-US'
 'H9SM9B-S-US' 'QLMMTM-S-US' 'P0X865-S-US' 'QLL3Y6-S-US' 'DR9KSW-S-US'
 'XHJHJN-S-US' 'GX4W4P-S-US' 'D50Z54-S-US' 'BVFQD9-S-US' 'JBFL6Y-S-US'
 'KHRJNM-S-US' 'J0JKSY-S-US' 'DN6FKS-S-US' 'FMG5YQ-S-US' 'WPJ3PZ-S-US'
 'CHNTDT-S-US' 'BL5JSW-S-US' 'DKJ2GF-S-US' 'M518NK-S-US' 'VCSCBX-S-US'
 'DYZ4XW-S-US' 'VVFQYB-S-US' 'R1C7K1-S-US' 'HNZZTB-S-US' 'GPBWRP-S-US'
 'VFZRT3-S-US' 'D9KVWC-S-US' 'BWCC05-S-US' 'T6BMPP-S-US' 'VKDB45-S-US'
 'XGSM09-S-US' 'GW5WZK-S-US' 'THD6V0-S-US' 'X6FXHD-S-US' 'RCGDTY-S-US'
 'G7VKY3-S-US' 'C5QPWS-S-US' 'H8WVGD-S-US' 'VJCJHB-S-US' 'RFV7N7-S-US'
 'SHRWC6-S-US' 'WC0FKB-S-US' 'P21B22-S-US' 'MN48BW-S-US' 'VS43B3-S-US'
 'F3NL57-S-US' 'SKWSYY-S-US' 'Q4ZSBS-S-US' 'QMRMKG-S-US' 'NJ1MJS-S-US'
 'LY6WH5-S-US' 'X7FN7C-S-US' 'PJQNHD-S-US' 'TWS2Z8-S-US' 'L9RFKV-S-US'
 'S8TF9W-S-US' 'GZDYTG-S-US' 'J5XTNS-S-US' 'QH5T7J-S-US' 'D4ZW23-S-US'
 'SP9R0C-S-US' 'JXZR9V-S-US' 'WQPBSZ-S-US' 'FGW52F-S-US' 'JS65V9-S-US'
 'HXZ9HF-S-US' 'MJBXDM-S-US' 'WYJW4F-S-US' 'PQCHZH-S-US' 'BXMDGT-S-US'
 'HVQ5RM-S-US' 'X9R5PJ-S-US' 'W0ZHJJ-S-US' 'HG6Y66-S-US' 'QBS6K5-S-US'
 'STV928-S-US' 'CHSKKB-S-US' 'PQ5T45-S-US' 'SJT3QB-S-US' 'CGLVJW-S-US'
 'NJZYG9-S-US' 'H59PH3-S-US' 'L9L9V2-S-US' 'CGGG4R-S-US' 'QTFPBP-S-US'
 'MG3J1H-S-US' 'PKY98Q-S-US' 'QLQSVL-S-US']]

@jreback commented Dec 17, 2013:

Give me a sample of the df before you create the panel, as in your code above.

@MichaelWS (Author):

Each df looks like this:
https://gist.github.com/MichaelWS/7998356

    print df.head()
    panel = df.reset_index().set_index(["id", "date"]).to_panel()

@MichaelWS (Author):

Also, thanks so much for the help.

@jreback commented Dec 17, 2013:

So if you don't use data_columns, then this will all work.

I have never used (nor is it tested) data_columns with a Panel, and I guess no one else has either!

You can use a multi-indexed frame instead, which is basically the same idea.
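
For reference, a minimal sketch of the multi-indexed frame alternative (the store path and group name here are hypothetical; the ids are taken from the terms above). When a MultiIndex frame is appended in table format, the index levels are stored as queryable columns alongside any data_columns:

    import pandas as pd
    from pandas.io.pytables import Term

    # df has columns ["id", "date", "open", "high", "low", "close", "volume"]
    mi = df.reset_index().set_index(["id", "date"]).sort_index()

    h5 = pd.HDFStore("prices.h5", "w")   # hypothetical path
    h5.append("us", mi, format="t", data_columns=["liq", "close", "volume"])

    # both the "id" index level and the data_columns can be used in `where`
    sub = h5.select("us", where=[Term("id", "=", ["N03TRC-S-US", "M89FLH-S-US"]),
                                 Term("close", ">", 5)])
    h5.close()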

@MichaelWS (Author):

Thank you. I will regenerate my files and test again. I appreciate it. (I had thought it was some string issue.)

@jreback commented Dec 17, 2013:

OK... that was a real bug... should be good to go now.

@MichaelWS (Author):

It works now, but it takes up quite a bit of memory. I needed to do this:

    out = []
    panels = store.select(country, where=terms, chunksize=100000)
    for panel in panels:
        out.append(panel.to_frame().reset_index())
    df = pd.concat(out, ignore_index=True)

@jreback commented Dec 17, 2013:

You have to understand how this works.

If you put in a lot of ids, then you are going to end up selecting the ENTIRE table and then doing a reindex. There is no way around this; it's 'sort of' a bug in numexpr (really it has to do with how things are selected).

So you have several options:

Chunk through your list, limiting the id selection to about 30 at a time (i.e. do multiple selections and concat them together); see the sketch below. This is probably the fastest.

Chunk through like you are doing (it will select the entire chunk at a time), so that bounds your max memory.

You have to think carefully about how you are using this.

I use panels a lot like this, but I append data one date at a time and, more importantly, generally select all of the data for a single date at a time.

You have other options that might make sense, e.g. you can store multiple tables that each hold only a limited id set (e.g. A-B, C-D, E-F), i.e. store by letter.

I am not exactly sure how you are using this, so you may have to experiment (I am also not sure of the size of what you are doing, or why a temporary increase in memory actually matters).
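
A minimal sketch of the first option, reusing `store`, `country`, `members`, and Term from the snippets above (the chunk size of 30 follows the suggestion above; it is not a hard limit):

    import pandas as pd
    from pandas.io.pytables import Term

    CHUNK = 30
    pieces = []
    for i in range(0, len(members), CHUNK):
        ids = members[i:i + CHUNK]
        # each query stays small enough to avoid the full-table scan
        sub = store.select(country, where=[Term("major_axis", "=", ids)])
        pieces.append(sub.to_frame().reset_index())
    df = pd.concat(pieces, ignore_index=True)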

@MichaelWS (Author):

Thank you very much. That makes sense.

Essentially, I am building a number of different quant finance models to combine together. The ids are fairly random (they are assigned by my data provider with no rhyme or reason). I need to choose ids, then choose dates, then build the model. I have inputs that I map to my data provider's ids, and then I pull the data. I am not sure of the optimal structure for this and am open to whatever more experienced people think is best.

@MichaelWS (Author):

The size is on the order of 70 million rows, so a temporary increase in memory could crash the server it's running on, especially if I am running multiple jobs at the same time.

@jreback commented Dec 17, 2013:

Just select a limited (< 30) number of ids at a time and you will have no problem. If you need to select more, then do multiple queries and concat.

@jreback commented Dec 17, 2013:

#2755

@MichaelWS (Author):

Thanks, that works. It's slower than I hoped, but it definitely works.

@MichaelWS (Author):

Is there any way to incorporate the major_axis in a meaningful way? It seems like everything gets stored in one giant PyTables table object.

By the way, it turned out to be much faster to just store a DataFrame, select it, and reindex myself.

@jreback commented Dec 19, 2013:

Not sure what you mean - can you elaborate?

@MichaelWS (Author):

On both points:

1. My point about the major_axis in a panel: essentially, if you create a panel it looks like everything is stored in one PyTables table for the group. Isn't selection just iterating over that table? It seems to me to lose the benefit of the axis. Shouldn't the axes be separated, so you don't have to iterate over all the data in an axis that doesn't match your query?

2. I switched back from a panel to a DataFrame, and I am simply querying the data on the non-string parameters and reindexing myself on the strings afterwards (a sketch of that pattern follows below). The speed difference is 5x for me versus chunking through strings along with the other parameters.
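
A minimal sketch of that pattern, assuming the frame is stored with "id" as a regular column and the numeric fields as data_columns (column names taken from earlier in the thread):

    # query only on the numeric data_columns...
    df = store.select(country, where=[Term("close", ">", 5),
                                      Term("volume", ">", 1e6)])
    # ...then filter the string ids in memory
    df = df[df["id"].isin(members)]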

@jreback commented Dec 19, 2013:

There is no iteration at all; a panel looks almost exactly like a frame, just flattened to 2-d.
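
A rough illustration of the flattening (note that to_frame() by default drops (major, minor) rows that are missing any observation, so the dense case is shown):

    flat = panel.to_frame()   # rows: MultiIndex (major, minor); columns: items
    # for a fully dense panel:
    #   len(flat) == len(panel.major_axis) * len(panel.minor_axis)
    print flat.shape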

@MichaelWS (Author):

Maybe I am just confused. I thought there was iteration when reading it out, because it's stored in one giant table under the group; it just iterates through the HDF table.
I am not super knowledgeable about HDF5, but my essential question (and this may be for Francesc or the HDF Group) is: could we store a separate table for each major-axis value of a panel? It seems like you could reindex on the major axis much faster that way. Also, I may be wrongly assuming that people query mostly on their major axis, then the second axis, and so on.

@jreback commented Dec 19, 2013:

You can store a table in many ways; you could do it as you suggest, but there is no benefit.

Since I haven't seen comparable code, it is impossible for me to guess what your best setup is.
