Removed block context from objToJSON #27166
Conversation
One failing test I see locally has to do with datetime precision (the change doesn't go down to ns); still debugging.
If it's rounding to 6 digits, that suggests it is casting to pydatetime somewhere along the way.
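For context on that suspicion: `datetime.datetime` only carries microsecond resolution, so any path that round-trips a `Timestamp` through it truncates the last three digits. A minimal sketch of standard pandas behavior (not code from this PR):

```python
import pandas as pd

ts = pd.Timestamp("2013-01-01 20:43:42.123456789")
# datetime.datetime only stores microseconds, so the trailing 789 is lost
print(ts.to_pydatetime())  # 2013-01-01 20:43:42.123456
```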
Yeah, so the fix here isn't too far off. The problem is more or less that the prior block context was responsible for letting the parser know the dtype of objects as they were being converted to JSON. After removal there is some code that infers this, but it does so in one pass over the entire 2D frame, which basically means a mixed frame gets read in as object. If you write out only one datetime column this works fine, and it probably would for a homogeneous frame of datetimes, just not when mixed with other types.

The solution isn't far off: we just need a better way of setting the dtype within the context that an object is getting written out. Stay tuned...
@WillAyd this converts to object before writing things out? tell me this is not so. |
@jreback not quite. Probably best explained via code. This actually works on this branch:

```python
In [7]: df = pd.DataFrame([[pd.Timestamp('20130101 20:43:42.123456789'),
   ...:                     pd.Timestamp('20130101 20:43:42.123456789')]])

In [8]: df.to_json(date_unit='ns')
Out[8]: '{"0":{"0":1357073022123456789},"1":{"0":1357073022123456789}}'
```

This doesn't:

```python
In [9]: df = pd.DataFrame([[pd.Timestamp('20130101 20:43:42.123456789'),
   ...:                     'astring']])

In [10]: df.to_json(date_unit='ns')
Out[10]: '{"0":{"0":1357073022123456000},"1":{"0":"astring"}}'
```

So the problem is that during conversion to JSON there is a call on the 2D object of `to_numpy()`:

```python
In [7]: df = pd.DataFrame([[pd.Timestamp('20130101 20:43:42.123456789'),
   ...:                     pd.Timestamp('20130101 20:43:42.123456789')]])

In [14]: df.to_numpy()
Out[14]:
array([['2013-01-01T20:43:42.123456789', '2013-01-01T20:43:42.123456789']],
      dtype='datetime64[ns]')
```

Whereas the same call on the mixed-dtype frame obviously returns an object-dtype array. So there are just some extra tweaks that need to be made, probably setting the dtype for the writing context per column rather than on a 2D object prior to writing out the individual arrays.
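A rough sketch of that per-column direction (illustrative Python only; the actual change has to happen in the C serializer):

```python
import pandas as pd

df = pd.DataFrame([[pd.Timestamp('20130101 20:43:42.123456789'), 'astring']])

# illustrative only: pulling each column out individually keeps its
# native dtype, so datetime64[ns] never degrades through a 2D object array
for name, col in df.items():
    print(name, col.to_numpy().dtype)  # 0 datetime64[ns], then 1 object
```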
@WillAyd you answered my question.
Why are we operating with a 2-D object at all here? This needs to process column by column (if you don't use blocks); you should never have a 2-D object.
Yep, just an artifact of the old way of doing it, I suppose (the diff here is practically all pure removal of the block manager). So I will have to add a few lines of code to do the correct column-by-column processing. Hoping to get to that over the next day or so, but even with that added, this should net a significant amount of complexity reduction!
Oh, no question it reduces complexity; I just want to do it column by column.
Still WIP, and the latest push will cause segfaults, but I think I had a breakthrough on how to better implement this. Right now there's a mix of block and DataFrame usage. What I'm aiming for is well-defined serialization rules for DataFrame, which would eliminate the need for blocks altogether. This was previously done for the SPLIT orientation, and I have it done for COLUMNS (which is the default); I just need to get the RECORDS and VALUES orients working (table is actually RECORDS behind the scenes).

Also added some documentation to the C code in case anyone else finds it useful.
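For reference, a quick sketch of what those orients produce for a toy frame (standard `to_json` behavior, not specific to this PR):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2]}, index=["x", "y"])

df.to_json(orient="columns")  # '{"a":{"x":1,"y":2}}' (the default)
df.to_json(orient="records")  # '[{"a":1},{"a":2}]'
df.to_json(orient="values")   # '[[1],[2]]'
df.to_json(orient="split")    # '{"columns":["a"],"index":["x","y"],"data":[[1],[2]]}'
```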
OK, all green here, so I think the general approach is reviewable. There's still a lot more that can be done, but I'm trying to balance review complexity versus change. Right now there are some regressions introduced that I'm taking a look at; I'm not surprised by them:

```
       before           after         ratio
     [9bab81e0]       [f60a1398]
     <stash~1^2>   <json-remove-blocks>
+       214±8ms       9.64±0.02s    44.99  io.json.ToJSON.time_float_int_str_lines('index')
+       169±1ms       5.71±0.04s    33.71  io.json.ToJSON.time_float_int_lines('index')
+       168±4ms       5.64±0.02s    33.52  io.json.ToJSON.time_floats_with_dt_index_lines('columns')
+       175±8ms       5.63±0.01s    32.16  io.json.ToJSON.time_floats_with_dt_index_lines('index')
+       172±7ms       5.22±0.04s    30.36  io.json.ToJSON.time_floats_with_int_idex_lines('index')
+      98.9±3ms        2.36±0.1s    23.89  io.json.ToJSON.time_delta_int_tstamp('split')
+       136±5ms          220±3ms     1.61  io.json.ToJSON.time_float_int_str('columns')
+       139±5ms         219±20ms     1.58  io.json.ToJSON.time_float_int('columns')
+       131±4ms          193±5ms     1.47  io.json.ToJSON.time_delta_int_tstamp('columns')
+       107±1ms          120±6ms     1.12  io.json.ToJSON.time_float_int_str('split')
-       135±6ms          106±9ms     0.79  io.json.ToJSON.time_floats_with_dt_index('split')
-       125±3ms         95.5±3ms     0.77  io.json.ToJSON.time_floats_with_int_index('split')
```
```c
int index = GET_TC(tc)->index - 1;

// Also TODO: return a value here rather than having NpyArr_getLabel modify output buf
NpyArr_getLabel(obj, tc, outLen, index, GET_TC(tc)->columnLabels);
```
It took me a while to grok the use of this function, and I find it rather unreadable, but I'm leaving it for a subsequent PR. With ujson key/value pairs you can override the FOO_iterGetName method to provide the key for a particular object, which would be the idiomatic approach.

I guess because blocks don't map cleanly to key/value pairs, the historical approach was to encode the labels separately and use this dedicated method to write directly to the output buffer, rather than returning a char * for ujson to encode into the buffer automatically.

It would take a rather large effort, so I'm leaving it to a separate PR, but I would like to use ujson iteration instead of manually encoding all values up front.
Nice! I'm looking forward to seeing this go
OK, I didn't really see where you are calling iterrows, but you can simply call itertuples if you need it, for much better perf. What are the current benchmarks?
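For anyone following along, the suggested pattern looks like the sketch below (a generic example, not code from this diff); `iterrows` constructs an index/Series pair for every row, while `itertuples` yields plain tuples, which is where the perf difference comes from:

```python
# slow: builds a Series object per row
for idx, row in df.iterrows():
    pass

# fast: yields lightweight tuples instead
for row in df.itertuples(index=False, name=None):
    pass  # row is a plain tuple of the row's values
```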
```python
def setup(self):
    N = 10 ** 5
    ncols = 5
    index = date_range("20000101", periods=N, freq="H")
```
Can you use params here instead? (You can leave setup like this, but the methods then get labeled with the param name.)
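A hedged sketch of the `params` pattern being suggested (class and benchmark names here are illustrative, not the exact benchmark under review):

```python
import numpy as np
from pandas import DataFrame, date_range


class ToJSON:
    # asv runs the benchmark once per parameter value and labels each
    # result with it, e.g. time_to_json('split'), time_to_json('columns')
    params = ["split", "columns", "index", "records", "values"]
    param_names = ["orient"]

    def setup(self, orient):
        N = 10 ** 5
        index = date_range("20000101", periods=N, freq="H")
        self.df = DataFrame(np.random.randn(N, 5), index=index)

    def time_to_json(self, orient):
        self.df.to_json(orient=orient)
```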
```c
    GET_TC(tc)->itemValue = PyObject_GetAttrString(obj, "columns");
} else if (index == 1) {
```
What are these integers referring to?
The SPLIT format has custom iteration to write out (in order):
- 0: the index
- 1: the columns
- 2: the data
This was already in place, so here I just wrapped it in a conditional statement to have the other orients pass through this context as well. We should certainly refactor to align iteration with the other orients (I think that's better as a follow-up, but will see).
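A loose Python analogue of that iteration order (illustrative only; the real dispatch is the C conditional shown above):

```python
# illustrative only: each step of the SPLIT iteration emits one
# piece of the frame, keyed by its position in the sequence
def iter_split_parts(df):
    for index, (key, value) in enumerate(
        [("index", df.index), ("columns", df.columns), ("data", df.values)]
    ):
        yield index, key, value
```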
K, yeah, odd that these are used here (but sure, it could have been in the original).
Audited the benchmarks and noticed they didn't have complete coverage; added better coverage in the last push. So, just to keep things current: the latest code I have introduces serious regressions, listed below:

```
       before           after         ratio
     [28317f5a]       [8b226a69]
       <master>    <json-remove-blocks>
+      80.9±2ms      11.6±0.03s   143.57  io.json.ToJSON.time_delta_int_tstamp('values')
+       106±8ms      11.7±0.08s   110.04  io.json.ToJSON.time_delta_int_tstamp('records')
+       128±5ms      11.6±0.01s    90.91  io.json.ToJSON.time_delta_int_tstamp('index')
+      137±10ms      9.18±0.04s    67.05  io.json.ToJSON.time_float_int_str('index')
+      77.5±3ms      4.98±0.01s    64.22  io.json.ToJSON.time_floats_with_dt_index('values')
+      87.9±3ms      5.23±0.01s    59.55  io.json.ToJSON.time_floats_with_int_index('records')
+      88.3±5ms      5.12±0.01s    57.95  io.json.ToJSON.time_float_int('values')
+    99.4±0.6ms      5.30±0.01s    53.33  io.json.ToJSON.time_floats_with_int_index('index')
+       114±9ms      5.49±0.02s    48.05  io.json.ToJSON.time_float_int('records')
+       127±6ms      5.89±0.01s    46.53  io.json.ToJSONLines.time_floats_with_int_index_lines
+      90.9±3ms      2.40±0.03s    26.37  io.json.ToJSON.time_delta_int_tstamp('split')
+       108±1ms         164±2ms     1.52  io.json.ToJSON.time_delta_int_tstamp('columns')
+      118±10ms         164±1ms     1.39  io.json.ToJSON.time_float_int_str('columns')
+      91.4±1ms         121±1ms     1.32  io.json.ToJSON.time_floats_with_int_index('columns')
+       135±2ms         172±3ms     1.27  io.json.ToJSON.time_float_int('columns')
+          137M            160M     1.17  io.json.ReadJSONLines.peakmem_read_json_lines_concat('datetime')
+          137M            160M     1.17  io.json.ReadJSONLines.peakmem_read_json_lines_concat('int')
+          174M            198M     1.14  io.json.ReadJSONLines.peakmem_read_json_lines('datetime')
+          174M            198M     1.13  io.json.ReadJSONLines.peakmem_read_json_lines('int')
```

Most of the bottlenecks here are values / records / index oriented, which, perhaps not surprisingly, are all items that iterate by rows. This obviously isn't production ready, but I think this basis is a lot simpler, so I will keep updating as I find optimization opportunities to get us back to where we were.
This is getting rather unmanageable to do at once; splitting it up into smaller PRs.
```
git diff upstream/master -u -- "*.py" | flake8 --diff
```
Getting one failure locally, which I'll check later, but pushing to see results on other platforms.