Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove Encoding of values in char** For Labels #27618

Merged
merged 31 commits into from
Aug 23, 2019

Conversation

WillAyd
Copy link
Member

@WillAyd WillAyd commented Jul 26, 2019

In reviewing this module there is a shared function for object keys and values which encodes objects into a separate buffer and subsequently indexes off of that. Instead of encoding values in a buffer I've updated that function to be a char ** pointing to string representations of the labels (or index / columns, rather).

This is arguably a pre-cursor to:

  1. Disentangle ujson from _lib/src/datetime #19486 to disentangle date time functions from JSON
  2. add indent support to to_json method #12004 to add indent support (tried this previously but vendoring ujson updates but didn't work because of this limitation)
  3. INT: the json C code should not deal with blocks #27164 because the various formats may need column / index labels at different points in time, so encoding up front makes that very difficult to reuse

The only downside here I haven't been able to figure out is how to deal with date formatting. Right now all labels are written as epochs. I'm sure there is a way to handle but I wasn't clear on what the best way to convert arbitrary input (i.e. object or datetime dtypes) into ISO formats by element where applicable.

cc @jbrockmendel in case you have insight on that

@WillAyd WillAyd added IO JSON read_json, to_json, json_normalize Clean labels Jul 26, 2019
@jbrockmendel
Copy link
Member

can you give an example of the datetime inputs that are trouble?

@WillAyd
Copy link
Member Author

WillAyd commented Jul 26, 2019

Sure. I'm not sure how to best handle this:

>>> df = pd.DataFrame([[1]], columns=[pd.Timestamp("today")])
>>> df.to_json(date_format="iso")

While also supporting:

>>> df.columns = df.columns.astype(object)
>>> df.to_json(date_format="iso")

In a concise manner. AFAICT in the first example inspecting the object during iteration with PyDateTime_Check doesn't recognize it as a datetime whereas when passed in as an object it does

@WillAyd
Copy link
Member Author

WillAyd commented Jul 26, 2019

Right now the code handles neither and simply provides out the default str, which I guess is epoch for both cases

break;
}

PyObject *str = PyObject_Str(item);
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jbrockmendel last comment...here is where I think that check needs to go. We are getting the str of the object but when passed in as a DTA the introspection is different than when passed in an object array. Wondering if we have shared utilities to already handle that ambiguity I could leverage instead of trying to do it all in this extension

@WillAyd WillAyd changed the title Replace Label Buffers with char * Pointers in toJSON Remove Encoding of values in char** For Labels Jul 27, 2019
@jbrockmendel
Copy link
Member

AFAICT in the first example inspecting the object during iteration with PyDateTime_Check doesn't recognize it as a datetime whereas when passed in as an object it does

Off the top of my head, I'd guess that in the non-object case the array you're iterating over is a ndarray[datetime64[ns]] and not a DatetimeArray/DatetimeIndex. As a result, the objects you get when iterating are datetime64 objects, which do not subclass datetime. The check for that would be PyObject_TypeCheck(obj, &PyDatetimeArrType_Type) (at least that's how we do it in tslibs.util).

@WillAyd
Copy link
Member Author

WillAyd commented Jul 27, 2019 via email

@WillAyd
Copy link
Member Author

WillAyd commented Jul 27, 2019

Sorry should clarify - so if the columns are all datetimes it appears that PyTypeNum_ISDATETIME(type_num) is true (this was in the pre-existing code as well). Otherwise if mixed objects then PyDateTime_Check and PyDate_Check would work.

Do we have one function that can arbitrary convert objects from either as appropriate to an isoformat that you know of?

@jbrockmendel
Copy link
Member

Do we have one function that can arbitrary convert objects from either as appropriate to an isoformat that you know of?

In python/cython I would probably just use Timestamp(obj).isoformat(). Not sure how big a hassle that would be to do in the c file

@pep8speaks
Copy link

pep8speaks commented Aug 7, 2019

Hello @WillAyd! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-08-23 13:09:40 UTC

{ \
Buffer_Realloc((__enc), (__len));\
} \

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does it matter that this is moved from the .h file?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No this is intentional and matches the ujson source. It essentially reverts part of #17857

memcpy(enc->offset, labels[idx], sizeof(char) * (*outLen));
enc->offset += *outLen;
*outLen = 0;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it looks like replacing this function with the simpler version below is orthogonal ot everything else. am i reading it wrong?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not orthogonal just simplified. Previously since the labels would all be encoded into a buffer this function would help return the unencoded label. Instead now all labels are stored unencoded in the char ** and can easily be accessed by index, leaving the ultimate encoding up to ujson

@jbrockmendel
Copy link
Member

How much of the difficulty here could be avoided by working in cython rather than C? Moving the whole file to cython would be too much, but supposedly it is possible to make cython export a .h file so it can be used in a c file. If I could figure out how to make this work, would that make things easier here?

@WillAyd
Copy link
Member Author

WillAyd commented Aug 8, 2019

Good question. I don't think we would want to move the entire extension to Cython since if done correctly this should theoretically be a very small extension that dictates key / value pairs for pandas objects.

Right now there is too much custom logic here though (ex: date time handling) which if decoupled should be moved somewhere else for shared C functions (whether Cython or hand coded), so the majority can be removed but I don't consider entire replacement a goal

@WillAyd
Copy link
Member Author

WillAyd commented Aug 8, 2019

Actually this could close #20500 as well. I can add a test and whatsnew for that (though again wouldn’t be round trippable)

@WillAyd
Copy link
Member Author

WillAyd commented Aug 23, 2019

Any objections to merging this? As is this intentionally causes a regression when writing out a DTI but I think resolvable in a follow up. This refactor should also enable some of the items linked above.

I'm catching a flight back to the US tomorrow so should have spare time to focus on follow ups. Worst case I think could revert this before 1.0.0 if those never come to fruition

@WillAyd WillAyd added this to the 1.0 milestone Aug 23, 2019
@jbrockmendel
Copy link
Member

what's the DTI regression?

@WillAyd
Copy link
Member Author

WillAyd commented Aug 23, 2019

Takes 1-4x as long to write out index / columns containing 100k date times. Actual benchmarks are shown in Appendix A here #27618 (comment)

The reason is that this change constructs a Timestamp object during serialization for each label, whether or not that label is stuck in an object array or a DTI.

A more performant approach would be to check for a DTI and dispatch to a vectorized stringification for the labels, which I think would just be better done in a follow up

@jbrockmendel
Copy link
Member

Seems totally reasonable.

Copy link
Contributor

@TomAugspurger TomAugspurger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm OK with the slowdown if we think this is more maintainable.

@@ -166,6 +166,8 @@ void *initObjToJSON(void)
cls_index = (PyTypeObject *)PyObject_GetAttrString(mod_pandas, "Index");
cls_series =
(PyTypeObject *)PyObject_GetAttrString(mod_pandas, "Series");
cls_timestamp = PyObject_GetAttrString(mod_pandas, "Timestamp");
cls_timedelta = PyObject_GetAttrString(mod_pandas, "Timedelta");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this indentation off?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think just matching the block but generally indentation is off in the C files. We don’t have any linting here so I think a good follow up generally for extension modules.

We have cpplint but I don’t think that really covers C code. gnu indent might be worth looking at

@WillAyd
Copy link
Member Author

WillAyd commented Aug 23, 2019

Sounds good. Merging for now will formalize follow ups over next few days. Worst case can always revert

@WillAyd WillAyd merged commit d75ee70 into pandas-dev:master Aug 23, 2019
@WillAyd WillAyd deleted the remove-buf-writes branch August 23, 2019 22:38
galuhsahid pushed a commit to galuhsahid/pandas that referenced this pull request Aug 25, 2019
galuhsahid added a commit to galuhsahid/pandas that referenced this pull request Aug 25, 2019
* master: (40 commits)
  DOC: Fix GL01 and GL02 errors in the docstrings (pandas-dev#27988)
  Remove Encoding of values in char** For Labels (pandas-dev#27618)
  TYPING: more type hints for io.formats.printing (pandas-dev#27765)
  TST: fix compression tests when run without virtualenv/condaenv (pandas-dev#28051)
  DOC: Start 0.25.2 (pandas-dev#28111)
  DOC: Fix docstrings lack of punctuation (pandas-dev#28031)
  DOC: Remove alias for numpy.random.randn from the docs (pandas-dev#28082)
  DOC: update GroupBy.head()/tail() documentation (pandas-dev#27844)
  BUG: timedelta merge asof with tolerance (pandas-dev#27650)
  BUG: Series.rename raises error on values accepted by Series construc… (pandas-dev#27814)
  Preserve index when setting new column on empty dataframe. (pandas-dev#26471)
  BUG: Fixed groupby quantile for listlike q (pandas-dev#27827)
  BUG: iter with readonly values, closes pandas-dev#28055 (pandas-dev#28074)
  TST: non-strict xfail for period test (pandas-dev#28072)
  DOC: Update whatsnew (pandas-dev#28073)
  CI: disable codecov (pandas-dev#28065)
  CI: Set SHA for codecov upload (pandas-dev#28067)
  BUG: Correct the previous bug fixing on xlim for plotting (pandas-dev#28059)
  CI: Add pip dependence explicitly (pandas-dev#28008)
  DOC: Change document code prun in a row (pandas-dev#28029)
  ...
proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019
proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019
@rushabh-v rushabh-v mentioned this pull request Jan 25, 2020
3 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Clean IO JSON read_json, to_json, json_normalize
Projects
None yet
Development

Successfully merging this pull request may close these issues.

DataFrame.to_json() produces malformed JSON when DataFrame contains tuples as column
4 participants