[SPARK-23603][SQL]When the length of the json is in a range,get_json_object will result in missing tail data #20739

cxzl25 · 2018-03-05T15:48:17Z

What changes were proposed in this pull request?

Replace writeRaw(char[] text, int offset, int len) with writeRaw(String text)
Although using writeRaw(String text) will lose some performance

spark-shell:

val value = "x" * 3000
val json = s"""{"big": "$value"}"""
spark.sql("select length(get_json_object(\'"+json+"\','$.big'))" ).collect
res0: Array[org.apache.spark.sql.Row] = Array([2991])

expect result : 3000
actual result : 2991

public abstract void writeRaw(char[] text,int offset,int len) throws IOException
Method that will force generator to copy input text verbatim with no modifications (including that no escaping is done and no separators are added even if context [array, object] would otherwise require such). If such separators are desired, use writeRawValue(String) instead.

public abstract void writeRaw(String text) throws IOException
Method that will force generator to copy input text verbatim with no modifications (including that no escaping is done and no separators are added even if context [array, object] would otherwise require such). If such separators are desired, use writeRawValue(String) instead.

Jackson(>=2.7.7) fixes the possibility of missing tail data when the length of the value is in a range
https://github.com/FasterXML/jackson/wiki/Jackson-Release-2.7.7
https://github.com/FasterXML/jackson-core/issues/30

How was this patch tested?

org.apache.spark.sql.catalyst.expressions.JsonExpressionsSuite
test("some big value")

…ng text)

cxzl25 · 2018-03-12T09:22:26Z

Another solution:
#20738

AmplabJenkins · 2018-06-09T00:13:13Z

Can one of the admins verify this patch?

southernriver · 2019-05-10T09:08:53Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala

+        // Now we use the jackson version is 2.6.x
+        // So comment calls the code for this method ( writeRaw(char[] text, int offset, int len) )
+        // Although using writeRaw(String text) will lose some performance
+        g.writeRaw(p.getText)


What is the different between writeRaw(String text) and writeRaw(String text, int offset, int len),I have seen the source code of both functions, but I don't know why this action could solve this? thank you !
code for 2.6.X

public void writeRaw(String text) throws IOException, JsonGenerationException { int start = 0; int len = text.length(); while (len > 0) { char[] buf = _charBuffer; final int blen = buf.length; final int len2 = (len < blen) ? len : blen; text.getChars(start, start+len2, buf, 0); writeRaw(buf, 0, len2); start += len2; len -= len2; } } @Override public void writeRaw(String text, int offset, int len) throws IOException, JsonGenerationException { while (len > 0) { char[] buf = _charBuffer; final int blen = buf.length; final int len2 = (len < blen) ? len : blen; text.getChars(offset, offset+len2, buf, 0); writeRaw(buf, 0, len2); offset += len2; len -= len2; } }

I debug that the value of p.getTextOffset is not zero and which is exactly the missing length of string!

Replace writeRaw(char[] text, int offset, int len) with writeRaw(Stri…

5a83def

…ng text)

cxzl25 mentioned this pull request Mar 12, 2018

[SPARK-23603][SQL]When the length of the json is in a range,get_json_object will result in missing tail data #20738

Closed

cxzl25 closed this Jun 28, 2018

southernriver reviewed May 10, 2019

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SPARK-23603][SQL]When the length of the json is in a range,get_json_object will result in missing tail data #20739

[SPARK-23603][SQL]When the length of the json is in a range,get_json_object will result in missing tail data #20739

cxzl25 commented Mar 5, 2018 •

edited

Loading

cxzl25 commented Mar 12, 2018

AmplabJenkins commented Jun 9, 2018

southernriver May 10, 2019 •

edited

Loading

[SPARK-23603][SQL]When the length of the json is in a range,get_json_object will result in missing tail data #20739

[SPARK-23603][SQL]When the length of the json is in a range,get_json_object will result in missing tail data #20739

Conversation

cxzl25 commented Mar 5, 2018 • edited Loading

What changes were proposed in this pull request?

How was this patch tested?

cxzl25 commented Mar 12, 2018

AmplabJenkins commented Jun 9, 2018

southernriver May 10, 2019 • edited Loading

Choose a reason for hiding this comment

cxzl25 commented Mar 5, 2018 •

edited

Loading

southernriver May 10, 2019 •

edited

Loading