Remove Encoding of values in char** For Labels #27618

WillAyd · 2019-07-26T23:35:30Z

closes DataFrame.to_json() produces malformed JSON when DataFrame contains tuples as column #20500

In reviewing this module there is a shared function for object keys and values which encodes objects into a separate buffer and subsequently indexes off of that. Instead of encoding values in a buffer I've updated that function to be a char ** pointing to string representations of the labels (or index / columns, rather).

This is arguably a pre-cursor to:

Disentangle ujson from _lib/src/datetime #19486 to disentangle date time functions from JSON
add indent support to to_json method #12004 to add indent support (tried this previously but vendoring ujson updates but didn't work because of this limitation)
INT: the json C code should not deal with blocks #27164 because the various formats may need column / index labels at different points in time, so encoding up front makes that very difficult to reuse

The only downside here I haven't been able to figure out is how to deal with date formatting. Right now all labels are written as epochs. I'm sure there is a way to handle but I wasn't clear on what the best way to convert arbitrary input (i.e. object or datetime dtypes) into ISO formats by element where applicable.

cc @jbrockmendel in case you have insight on that

jbrockmendel · 2019-07-26T23:38:52Z

can you give an example of the datetime inputs that are trouble?

WillAyd · 2019-07-26T23:43:26Z

Sure. I'm not sure how to best handle this:

>>> df = pd.DataFrame([[1]], columns=[pd.Timestamp("today")])
>>> df.to_json(date_format="iso")

While also supporting:

>>> df.columns = df.columns.astype(object)
>>> df.to_json(date_format="iso")

In a concise manner. AFAICT in the first example inspecting the object during iteration with PyDateTime_Check doesn't recognize it as a datetime whereas when passed in as an object it does

WillAyd · 2019-07-26T23:44:08Z

Right now the code handles neither and simply provides out the default str, which I guess is epoch for both cases

WillAyd · 2019-07-26T23:45:30Z

pandas/_libs/src/ujson/python/objToJSON.c

+	  break;
+	}
+
+	PyObject *str = PyObject_Str(item);


@jbrockmendel last comment...here is where I think that check needs to go. We are getting the str of the object but when passed in as a DTA the introspection is different than when passed in an object array. Wondering if we have shared utilities to already handle that ambiguity I could leverage instead of trying to do it all in this extension

jbrockmendel · 2019-07-27T02:00:40Z

AFAICT in the first example inspecting the object during iteration with PyDateTime_Check doesn't recognize it as a datetime whereas when passed in as an object it does

Off the top of my head, I'd guess that in the non-object case the array you're iterating over is a ndarray[datetime64[ns]] and not a DatetimeArray/DatetimeIndex. As a result, the objects you get when iterating are datetime64 objects, which do not subclass datetime. The check for that would be PyObject_TypeCheck(obj, &PyDatetimeArrType_Type) (at least that's how we do it in tslibs.util).

WillAyd · 2019-07-27T02:09:34Z

Thanks! Sounds in the direction I’ll give that a look

…

Sent from my iPhone

On Jul 26, 2019, at 7:00 PM, jbrockmendel ***@***.***> wrote: AFAICT in the first example inspecting the object during iteration with PyDateTime_Check doesn't recognize it as a datetime whereas when passed in as an object it does Off the top of my head, I'd guess that in the non-object case the array you're iterating over is a ndarray[datetime64[ns]] and not a DatetimeArray/DatetimeIndex. As a result, the objects you get when iterating are datetime64 objects, which do not subclass datetime. The check for that would be PyObject_TypeCheck(obj, &PyDatetimeArrType_Type) (at least that's how we do it in tslibs.util). — You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub, or mute the thread.

WillAyd · 2019-07-27T03:35:25Z

Sorry should clarify - so if the columns are all datetimes it appears that PyTypeNum_ISDATETIME(type_num) is true (this was in the pre-existing code as well). Otherwise if mixed objects then PyDateTime_Check and PyDate_Check would work.

Do we have one function that can arbitrary convert objects from either as appropriate to an isoformat that you know of?

jbrockmendel · 2019-07-27T03:52:04Z

Do we have one function that can arbitrary convert objects from either as appropriate to an isoformat that you know of?

In python/cython I would probably just use Timestamp(obj).isoformat(). Not sure how big a hassle that would be to do in the c file

pep8speaks · 2019-08-07T19:13:09Z

Hello @WillAyd! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-08-23 13:09:40 UTC

jbrockmendel · 2019-08-08T15:58:39Z

pandas/_libs/src/ujson/lib/ultrajsonenc.c

+    {   \
+      Buffer_Realloc((__enc), (__len));\
+    }   \
+


does it matter that this is moved from the .h file?

No this is intentional and matches the ujson source. It essentially reverts part of #17857

jbrockmendel · 2019-08-08T16:01:54Z

pandas/_libs/src/ujson/python/objToJSON.c

-    memcpy(enc->offset, labels[idx], sizeof(char) * (*outLen));
-    enc->offset += *outLen;
-    *outLen = 0;
-}


it looks like replacing this function with the simpler version below is orthogonal ot everything else. am i reading it wrong?

Not orthogonal just simplified. Previously since the labels would all be encoded into a buffer this function would help return the unencoded label. Instead now all labels are stored unencoded in the char ** and can easily be accessed by index, leaving the ultimate encoding up to ujson

jbrockmendel · 2019-08-08T16:06:29Z

How much of the difficulty here could be avoided by working in cython rather than C? Moving the whole file to cython would be too much, but supposedly it is possible to make cython export a .h file so it can be used in a c file. If I could figure out how to make this work, would that make things easier here?

WillAyd · 2019-08-08T16:10:20Z

Good question. I don't think we would want to move the entire extension to Cython since if done correctly this should theoretically be a very small extension that dictates key / value pairs for pandas objects.

Right now there is too much custom logic here though (ex: date time handling) which if decoupled should be moved somewhere else for shared C functions (whether Cython or hand coded), so the majority can be removed but I don't consider entire replacement a goal

WillAyd · 2019-08-08T16:25:57Z

Actually this could close #20500 as well. I can add a test and whatsnew for that (though again wouldn’t be round trippable)

WillAyd · 2019-08-23T12:20:42Z

Any objections to merging this? As is this intentionally causes a regression when writing out a DTI but I think resolvable in a follow up. This refactor should also enable some of the items linked above.

I'm catching a flight back to the US tomorrow so should have spare time to focus on follow ups. Worst case I think could revert this before 1.0.0 if those never come to fruition

jbrockmendel · 2019-08-23T15:31:29Z

what's the DTI regression?

WillAyd · 2019-08-23T15:35:35Z

Takes 1-4x as long to write out index / columns containing 100k date times. Actual benchmarks are shown in Appendix A here #27618 (comment)

The reason is that this change constructs a Timestamp object during serialization for each label, whether or not that label is stuck in an object array or a DTI.

A more performant approach would be to check for a DTI and dispatch to a vectorized stringification for the labels, which I think would just be better done in a follow up

jbrockmendel · 2019-08-23T15:40:09Z

Seems totally reasonable.

TomAugspurger

I'm OK with the slowdown if we think this is more maintainable.

TomAugspurger · 2019-08-23T17:47:43Z

pandas/_libs/src/ujson/python/objToJSON.c

@@ -166,6 +166,8 @@ void *initObjToJSON(void)
        cls_index = (PyTypeObject *)PyObject_GetAttrString(mod_pandas, "Index");
        cls_series =
            (PyTypeObject *)PyObject_GetAttrString(mod_pandas, "Series");
+        cls_timestamp = PyObject_GetAttrString(mod_pandas, "Timestamp");
+	cls_timedelta = PyObject_GetAttrString(mod_pandas, "Timedelta");


Is this indentation off?

I think just matching the block but generally indentation is off in the C files. We don’t have any linting here so I think a good follow up generally for extension modules.

We have cpplint but I don’t think that really covers C code. gnu indent might be worth looking at

WillAyd · 2019-08-23T22:37:50Z

Sounds good. Merging for now will formalize follow ups over next few days. Worst case can always revert

* master: (40 commits) DOC: Fix GL01 and GL02 errors in the docstrings (pandas-dev#27988) Remove Encoding of values in char** For Labels (pandas-dev#27618) TYPING: more type hints for io.formats.printing (pandas-dev#27765) TST: fix compression tests when run without virtualenv/condaenv (pandas-dev#28051) DOC: Start 0.25.2 (pandas-dev#28111) DOC: Fix docstrings lack of punctuation (pandas-dev#28031) DOC: Remove alias for numpy.random.randn from the docs (pandas-dev#28082) DOC: update GroupBy.head()/tail() documentation (pandas-dev#27844) BUG: timedelta merge asof with tolerance (pandas-dev#27650) BUG: Series.rename raises error on values accepted by Series construc… (pandas-dev#27814) Preserve index when setting new column on empty dataframe. (pandas-dev#26471) BUG: Fixed groupby quantile for listlike q (pandas-dev#27827) BUG: iter with readonly values, closes pandas-dev#28055 (pandas-dev#28074) TST: non-strict xfail for period test (pandas-dev#28072) DOC: Update whatsnew (pandas-dev#28073) CI: disable codecov (pandas-dev#28065) CI: Set SHA for codecov upload (pandas-dev#28067) BUG: Correct the previous bug fixing on xlim for plotting (pandas-dev#28059) CI: Add pip dependence explicitly (pandas-dev#28008) DOC: Change document code prun in a row (pandas-dev#28029) ...

WillAyd added 7 commits July 25, 2019 23:16

with segfaults

648d2b5

Merge remote-tracking branch 'upstream/master' into remove-buf-writes

37226d7

Works save datetimes

e890578

Less failures, still not correct dt support

24a585a

Removed encoder attributes in encodeLabels

42cc1ec

Added docstring

7385fd8

whitespace cleanup

4aca787

WillAyd added IO JSON read_json, to_json, json_normalize Clean labels Jul 26, 2019

WillAyd commented Jul 26, 2019

View reviewed changes

WillAyd added 2 commits July 26, 2019 16:52

Removed unused headers

bc5e69a

Moved macro back to vendor c file

d53527f

WillAyd changed the title ~~Replace Label Buffers with char * Pointers in toJSON~~ Remove Encoding of values in char** For Labels Jul 27, 2019

WillAyd added 8 commits August 1, 2019 09:52

Merge remote-tracking branch 'upstream/master' into remove-buf-writes

eefadfa

Merge remote-tracking branch 'upstream/master' into remove-buf-writes

7b2af9f

Working TS conversions

2e25bef

Parametrized test

0820f57

Fixed epoch precision issue for labels

185aa61

Parametrized failing test

221e43d

Parametrized frame test

18d28c8

Removed UTC cast for index

7a5fc9c

Merge remote-tracking branch 'upstream/master' into remove-buf-writes

ad06e9c

jbrockmendel reviewed Aug 8, 2019

View reviewed changes

WillAyd added 7 commits August 8, 2019 10:38

Added test for tuples

01ae178

Whatsnew

0372794

Warnings fixup

e15bede

Simplified object construction

d698255

Macro for int64 copying

d453ed0

Removed unnecessary padding in sprintf

5ab7ecf

Merge remote-tracking branch 'upstream/master' into remove-buf-writes

8ea895e

whitespace fixup

439b695

WillAyd added this to the 1.0 milestone Aug 23, 2019

TomAugspurger reviewed Aug 23, 2019

View reviewed changes

WillAyd merged commit d75ee70 into pandas-dev:master Aug 23, 2019

WillAyd deleted the remove-buf-writes branch August 23, 2019 22:38

galuhsahid pushed a commit to galuhsahid/pandas that referenced this pull request Aug 25, 2019

Remove Encoding of values in char** For Labels (pandas-dev#27618)

8641797

ohad83 mentioned this pull request Nov 26, 2019

Series.to_json - invalid handling of quotes #29806

Closed

proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019

Remove Encoding of values in char** For Labels (pandas-dev#27618)

9f1c856

proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019

Remove Encoding of values in char** For Labels (pandas-dev#27618)

e82093c

WillAyd mentioned this pull request Jan 15, 2020

Series.to_json produces invalid JSON with orient="index" if index is multilevel #31028

Closed

rushabh-v mentioned this pull request Jan 25, 2020

Add test for multiindex json #31307

Merged

3 tasks

simonjayhawkins mentioned this pull request Mar 28, 2020

to_json index with Null Value Broken in 1.0 #31801

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Remove Encoding of values in char** For Labels #27618

Remove Encoding of values in char** For Labels #27618

WillAyd commented Jul 26, 2019 •

edited

Loading

jbrockmendel commented Jul 26, 2019

WillAyd commented Jul 26, 2019

WillAyd commented Jul 26, 2019

WillAyd Jul 26, 2019

jbrockmendel commented Jul 27, 2019

WillAyd commented Jul 27, 2019 via email

WillAyd commented Jul 27, 2019

jbrockmendel commented Jul 27, 2019

pep8speaks commented Aug 7, 2019 •

edited

Loading

jbrockmendel Aug 8, 2019

WillAyd Aug 8, 2019

jbrockmendel Aug 8, 2019

WillAyd Aug 8, 2019

jbrockmendel commented Aug 8, 2019

WillAyd commented Aug 8, 2019

WillAyd commented Aug 8, 2019

WillAyd commented Aug 23, 2019

jbrockmendel commented Aug 23, 2019

WillAyd commented Aug 23, 2019

jbrockmendel commented Aug 23, 2019

TomAugspurger left a comment

TomAugspurger Aug 23, 2019

WillAyd Aug 23, 2019

WillAyd commented Aug 23, 2019

Remove Encoding of values in char** For Labels #27618

Remove Encoding of values in char** For Labels #27618

Conversation

WillAyd commented Jul 26, 2019 • edited Loading

jbrockmendel commented Jul 26, 2019

WillAyd commented Jul 26, 2019

WillAyd commented Jul 26, 2019

WillAyd Jul 26, 2019

Choose a reason for hiding this comment

jbrockmendel commented Jul 27, 2019

WillAyd commented Jul 27, 2019 via email

WillAyd commented Jul 27, 2019

jbrockmendel commented Jul 27, 2019

pep8speaks commented Aug 7, 2019 • edited Loading

Comment last updated at 2019-08-23 13:09:40 UTC

jbrockmendel Aug 8, 2019

Choose a reason for hiding this comment

WillAyd Aug 8, 2019

Choose a reason for hiding this comment

jbrockmendel Aug 8, 2019

Choose a reason for hiding this comment

WillAyd Aug 8, 2019

Choose a reason for hiding this comment

jbrockmendel commented Aug 8, 2019

WillAyd commented Aug 8, 2019

WillAyd commented Aug 8, 2019

WillAyd commented Aug 23, 2019

jbrockmendel commented Aug 23, 2019

WillAyd commented Aug 23, 2019

jbrockmendel commented Aug 23, 2019

TomAugspurger left a comment

Choose a reason for hiding this comment

TomAugspurger Aug 23, 2019

Choose a reason for hiding this comment

WillAyd Aug 23, 2019

Choose a reason for hiding this comment

WillAyd commented Aug 23, 2019

WillAyd commented Jul 26, 2019 •

edited

Loading

pep8speaks commented Aug 7, 2019 •

edited

Loading