Remove Encoding of values in char** For Labels #27618
Can you give an example of the datetime inputs that are causing trouble?
Sure. I'm not sure how best to handle this:
    >>> df = pd.DataFrame([[1]], columns=[pd.Timestamp("today")])
    >>> df.to_json(date_format="iso")
while also supporting:
    >>> df.columns = df.columns.astype(object)
    >>> df.to_json(date_format="iso")
in a concise manner. AFAICT in the first example, inspecting the object during iteration with PyDateTime_Check doesn't recognize it as a datetime, whereas when passed in as an object it does.
Right now the code handles neither and simply writes out the default str, which I guess is epoch in both cases.
        break;
    }

    PyObject *str = PyObject_Str(item);
@jbrockmendel last comment... here is where I think that check needs to go. We are getting the str of the object, but when it is passed in as a DTA the introspection is different than when it is passed in as an object array. I'm wondering if we have shared utilities that already handle that ambiguity which I could leverage, instead of trying to do it all in this extension.
Off the top of my head, I'd guess that in the non-object case the array you're iterating over is an ndarray[datetime64[ns]] and not a DatetimeArray/DatetimeIndex. As a result, the objects you get when iterating are datetime64 objects, which do not subclass datetime. The check for that would be PyObject_TypeCheck(obj, &PyDatetimeArrType_Type) (at least that's how we do it in tslibs.util).
Thanks! That sounds like the right direction; I'll give that a look.
Sorry, I should clarify: if the columns are all datetimes it appears that [...] Do we have one function that you know of that can convert arbitrary objects from either case, as appropriate, to an ISO format?
In python/cython I would probably just use
    { \
        Buffer_Realloc((__enc), (__len)); \
    } \
does it matter that this is moved from the .h file?
No, this is intentional and matches the ujson source. It essentially reverts part of #17857.
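For readers unfamiliar with the kind of macro whose tail appears in the diff above, here is a rough, self-contained sketch of a "reserve before write" helper. It is an illustration only and not the ujson or pandas code: `ToyEncoder`, `toy_encoder_init`, `toy_buffer_realloc`, `TOY_BUFFER_RESERVE` and `toy_write` are hypothetical names, and the doubling growth policy is an assumption.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical encoder state: a growable output buffer. */
typedef struct {
    char *start;   /* beginning of the output buffer */
    char *offset;  /* next write position */
    char *end;     /* one past the last usable byte */
} ToyEncoder;

/* Allocate an initial buffer of `cap` bytes (cap must be > 0). */
static int toy_encoder_init(ToyEncoder *enc, size_t cap) {
    assert(enc != NULL && cap > 0);
    enc->start = malloc(cap);
    if (enc->start == NULL) return -1;
    enc->offset = enc->start;
    enc->end = enc->start + cap;
    return 0;
}

/* Grow the buffer by doubling until `needed` more bytes fit. */
static int toy_buffer_realloc(ToyEncoder *enc, size_t needed) {
    size_t used = (size_t)(enc->offset - enc->start);
    size_t cap = (size_t)(enc->end - enc->start);
    while (cap - used < needed) cap *= 2;
    char *grown = realloc(enc->start, cap);
    if (grown == NULL) return -1;
    enc->start = grown;
    enc->offset = grown + used;
    enc->end = grown + cap;
    return 0;
}

/* File-local macro: evaluates to 0 if `len` bytes are available
 * (growing the buffer when necessary), or -1 on allocation failure. */
#define TOY_BUFFER_RESERVE(enc, len)                           \
    (((size_t)((enc)->end - (enc)->offset) < (size_t)(len))    \
         ? toy_buffer_realloc((enc), (size_t)(len))            \
         : 0)

/* Append `len` raw bytes, reserving space first. */
static int toy_write(ToyEncoder *enc, const char *data, size_t len) {
    if (TOY_BUFFER_RESERVE(enc, len) != 0) return -1;
    memcpy(enc->offset, data, len);
    enc->offset += len;
    return 0;
}
```

Because such a macro is only used by the encoder's own translation unit, defining it in the .c file rather than a shared header keeps it out of every other file that includes the header.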
    memcpy(enc->offset, labels[idx], sizeof(char) * (*outLen));
    enc->offset += *outLen;
    *outLen = 0;
}
It looks like replacing this function with the simpler version below is orthogonal to everything else. Am I reading it wrong?
Not orthogonal, just simplified. Previously, since the labels would all be encoded into a buffer, this function would help return the unencoded label. Now all labels are instead stored unencoded in the char ** and can easily be accessed by index, leaving the ultimate encoding up to ujson.
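The layout described in this reply can be sketched in plain C. This is a minimal illustration under stated assumptions, not the pandas or ujson implementation: `LabelSet`, `labels_new`, `labels_free` and `encode_label` are hypothetical names, and the escaping covers only quotes, backslashes, newlines and tabs. Labels stay as raw strings in a `char **`, are looked up by index, and are escaped only at the moment they are written.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical container: labels are kept as raw, unencoded C strings. */
typedef struct {
    char **labels;
    size_t num;
} LabelSet;

/* Release every label, the pointer array, and the container itself. */
static void labels_free(LabelSet *set) {
    if (set == NULL) return;
    for (size_t i = 0; i < set->num; i++) free(set->labels[i]);
    free(set->labels);
    free(set);
}

/* Copy `num` strings into a freshly allocated char ** (NULL on failure). */
static LabelSet *labels_new(const char *const *src, size_t num) {
    LabelSet *set = malloc(sizeof *set);
    if (set == NULL) return NULL;
    set->labels = calloc(num ? num : 1, sizeof *set->labels);
    if (set->labels == NULL) { free(set); return NULL; }
    set->num = num;
    for (size_t i = 0; i < num; i++) {
        size_t len = strlen(src[i]) + 1;
        set->labels[i] = malloc(len);
        if (set->labels[i] == NULL) {
            set->num = i;           /* free only what was allocated */
            labels_free(set);
            return NULL;
        }
        memcpy(set->labels[i], src[i], len);
    }
    return set;
}

/* Escape one raw label into `out` as a quoted JSON string.
 * Returns the number of bytes written (excluding NUL), or -1 if it
 * does not fit in `cap` bytes. */
static int encode_label(const char *label, char *out, size_t cap) {
    size_t pos = 0;
    assert(label != NULL && out != NULL);
    if (cap < 3) return -1;         /* need at least "" plus NUL */
    out[pos++] = '"';
    for (const char *p = label; *p != '\0'; p++) {
        const char *esc = NULL;
        switch (*p) {
            case '"':  esc = "\\\""; break;
            case '\\': esc = "\\\\"; break;
            case '\n': esc = "\\n";  break;
            case '\t': esc = "\\t";  break;
            default: break;
        }
        size_t need = esc ? 2 : 1;
        if (pos + need + 2 > cap) return -1;  /* keep room for quote + NUL */
        if (esc) { out[pos++] = esc[0]; out[pos++] = esc[1]; }
        else     { out[pos++] = *p; }
    }
    out[pos++] = '"';
    out[pos] = '\0';
    return (int)pos;
}
```

A caller would build the set once with `labels_new`, pass `set->labels[idx]` to `encode_label` whenever a key is emitted, and call `labels_free` when serialization finishes. Since nothing is pre-encoded, there is no need for a helper that recovers the raw label from an encoded buffer.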
How much of the difficulty here could be avoided by working in cython rather than C? Moving the whole file to cython would be too much, but supposedly it is possible to make cython export a .h file so it can be used in a C file. If I could figure out how to make this work, would that make things easier here?
Good question. I don't think we would want to move the entire extension to Cython, since if done correctly this should theoretically be a very small extension that dictates key / value pairs for pandas objects. Right now there is too much custom logic here (e.g. datetime handling), which, if decoupled, should be moved somewhere else as shared C functions (whether Cython or hand-coded). So the majority can be removed, but I don't consider entire replacement a goal.
Actually this could close #20500 as well. I can add a test and whatsnew for that (though again it wouldn't be round-trippable).
Any objections to merging this? As is, this intentionally causes a regression when writing out a DTI, but I think that is resolvable in a follow-up. This refactor should also enable some of the items linked above. I'm catching a flight back to the US tomorrow, so I should have spare time to focus on follow-ups. Worst case, I think we could revert this before 1.0.0 if those never come to fruition.
What's the DTI regression?
It takes 1-4x as long to write out an index / columns containing 100k datetimes. Actual benchmarks are shown in Appendix A here: #27618 (comment). The reason is that this change constructs a Timestamp object during serialization for each label, whether that label sits in an object array or a DTI. A more performant approach would be to check for a DTI and dispatch to a vectorized stringification of the labels, which I think would be better done in a follow-up.
Seems totally reasonable.
I'm OK with the slowdown if we think this is more maintainable.
@@ -166,6 +166,8 @@ void *initObjToJSON(void)
    cls_index = (PyTypeObject *)PyObject_GetAttrString(mod_pandas, "Index");
    cls_series =
        (PyTypeObject *)PyObject_GetAttrString(mod_pandas, "Series");
    cls_timestamp = PyObject_GetAttrString(mod_pandas, "Timestamp");
    cls_timedelta = PyObject_GetAttrString(mod_pandas, "Timedelta");
Is this indentation off?
I think it is just matching the surrounding block, but indentation is generally off in the C files. We don't have any linting here, so I think that would be a good follow-up for the extension modules in general.
We have cpplint, but I don't think that really covers C code. GNU indent might be worth looking at.
Sounds good. Merging for now; I will formalize follow-ups over the next few days. Worst case, we can always revert.
In reviewing this module, I found a shared function for object keys and values which encodes objects into a separate buffer and subsequently indexes off of that. Instead of encoding values into a buffer, I've updated that function to produce a char ** pointing to string representations of the labels (or index / columns, rather).
This is arguably a precursor to:
to_json method #12004, to add indent support (I tried this previously by vendoring ujson updates, but it didn't work because of this limitation)
The only downside here that I haven't been able to figure out is how to deal with date formatting. Right now all labels are written as epochs. I'm sure there is a way to handle it, but I wasn't clear on the best way to convert arbitrary input (i.e. object or datetime dtypes) into ISO formats element by element where applicable.
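As a rough illustration of what a per-element ISO conversion could look like at the C level, here is a small sketch. It is not what this PR does and not a pandas API: `epoch_ns_to_iso` is a hypothetical helper, and it assumes the label is already available as non-negative int64 nanoseconds since the Unix epoch, in UTC, with millisecond output precision.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: format non-negative epoch nanoseconds as a UTC
 * ISO 8601 string with millisecond precision.
 * Returns 0 on success, -1 on failure or if `out` is too small. */
static int epoch_ns_to_iso(int64_t ns, char *out, size_t cap) {
    assert(out != NULL);
    if (ns < 0) return -1;              /* pre-epoch values not handled here */

    time_t secs = (time_t)(ns / 1000000000LL);
    int millis = (int)((ns % 1000000000LL) / 1000000LL);

    struct tm *utc = gmtime(&secs);     /* note: gmtime is not thread-safe */
    if (utc == NULL) return -1;

    char stamp[32];
    if (strftime(stamp, sizeof stamp, "%Y-%m-%dT%H:%M:%S", utc) == 0) {
        return -1;
    }

    int n = snprintf(out, cap, "%s.%03dZ", stamp, millis);
    return (n > 0 && (size_t)n < cap) ? 0 : -1;
}
```

The harder part raised in this description is upstream of such a helper: deciding, for each label, whether it is a datetime at all, given that labels may arrive from either a datetime dtype or an object array.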
cc @jbrockmendel in case you have insight on that