ORC-1741: Respect decimal reader isRepeating flag #1960
Hi @cxzl25, I tested this in my local environment: I backported this patch to the ORC 1.9 branch, built Hive with it, and then ran the Hive query test that had been failing; it passed. Could you please create a new ORC release from branch 1.9 so that we can use it in Hive? We can't use ORC 2.0 or later because it requires Java 17+, and Hive still supports Java 8.
@@ -1562,7 +1563,11 @@ private void nextVector(DecimalColumnVector result,
    }
    setIsRepeatingIfNeeded(result, r);
  }
  if (!preIsRepeating && result.isRepeating) {
Why don't you move this logic inside of `setIsRepeatingIfNeeded` instead of scattering it all over the code?
Also, a question from the Hive ORC PR: where do we handle repeated nulls? It seems that `setIsRepeatingIfNeeded` is only called when `result.noNulls || !result.isNull[0]`.
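For context on the question above: in the column-vector convention, a batch consisting entirely of nulls is represented with `noNulls = false`, `isNull[0] = true`, and `isRepeating = true`, so only entry 0 is meaningful. The following is a minimal self-contained sketch of that convention. `MiniNullableVector` and its methods are hypothetical stand-ins, not the Hive or ORC classes, and the sketch does not show where ORC itself sets these flags.

```java
// Hypothetical stand-in for a column vector; NOT the Hive/ORC class.
final class MiniNullableVector {
    final long[] vector;
    final boolean[] isNull;
    boolean noNulls = true;
    boolean isRepeating = false;

    MiniNullableVector(int size) {
        vector = new long[size];
        isNull = new boolean[size];
    }

    // Marks the whole batch as one repeated null: only entry 0 is meaningful.
    void setRepeatingNull() {
        noNulls = false;
        isNull[0] = true;
        isRepeating = true;
    }

    // Row-level null check that honors both noNulls and isRepeating.
    boolean isNullAt(int row) {
        if (noNulls) {
            return false;
        }
        return isNull[isRepeating ? 0 : row];
    }

    public static void main(String[] args) {
        MiniNullableVector v = new MiniNullableVector(4);
        v.setRepeatingNull();
        System.out.println(v.isNullAt(3));
    }
}
```

A value-comparison helper that runs only when entry 0 is non-null would never see this case, which is why the repeated-null path has to be handled separately.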
This is just a draft PR; the behavior should not be considered a bug.
When a batch in which every value is the same is converted to `isRepeating=true`, the two representations should be equivalent.
I'm not sure why this behavior affects Hive's calculation results. I debugged this repeatedly in Hive, but I didn't find the root cause.
    DecimalColumnVector v1 = new DecimalColumnVector(1024, 10, 2);
    v1.isRepeating = true;
    v1.vector[0] = new HiveDecimalWritable("1.234");

    DecimalColumnVector v2 = new DecimalColumnVector(1024, 10, 2);
    for (int i = 0; i < 1024; i++) {
      v2.vector[i] = new HiveDecimalWritable("1.234");
    }

    StringBuilder sb1 = new StringBuilder();
    for (int i = 0; i < 1024; i++) {
      v1.stringifyValue(sb1, i);
    }

    StringBuilder sb2 = new StringBuilder();
    for (int i = 0; i < 1024; i++) {
      v2.stringifyValue(sb2, i);
    }

    System.out.println(sb1.toString().equals(sb2.toString()));
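The equivalence shown above relies on the consumer reading entry 0 whenever `isRepeating` is set. The following self-contained sketch uses a hypothetical `MiniDecimalVector` (not the real `DecimalColumnVector`) to show the contract, and where a consumer that indexes `vector[row]` directly would diverge. It illustrates the contract only and is not a diagnosis of the Hive issue.

```java
// Hypothetical stand-in for a decimal column vector; NOT the Hive/ORC class.
final class MiniDecimalVector {
    final String[] vector;
    boolean isRepeating;

    MiniDecimalVector(int size) {
        vector = new String[size];
    }

    // Flag-aware read: a repeating vector stores its value only in entry 0.
    String valueAt(int row) {
        return vector[isRepeating ? 0 : row];
    }

    // Flag-unaware read: correct only for a fully populated (flat) vector.
    String rawValueAt(int row) {
        return vector[row];
    }

    public static void main(String[] args) {
        MiniDecimalVector repeating = new MiniDecimalVector(4);
        repeating.isRepeating = true;
        repeating.vector[0] = "1.23";

        MiniDecimalVector flat = new MiniDecimalVector(4);
        for (int i = 0; i < 4; i++) {
            flat.vector[i] = "1.23";
        }

        // Flag-aware reads agree on every row.
        System.out.println(repeating.valueAt(3).equals(flat.valueAt(3)));
        // A flag-unaware read of the repeating vector finds an unset slot.
        System.out.println(repeating.rawValueAt(3) == null);
    }
}
```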
`DoubleTreeReader` and `FloatTreeReader` also have similar behavior. If the entire batch size is repeated, it will set `isRepeating=true`, which is consistent with the behavior of ORC-1266.
DoubleTreeReader
orc/java/core/src/java/org/apache/orc/impl/TreeReaderFactory.java
Lines 1034 to 1046 in 5e6a4cd
    } else {
      // no nulls
      boolean repeating = (batchSize > 1);
      final double d1 = utils.readDouble(stream);
      result.vector[0] = d1;
      // conditions to ensure bounds checks skips
      for (int i = 1; i < batchSize && batchSize <= result.vector.length; i++) {
        final double d2 = utils.readDouble(stream);
        repeating = repeating && (d1 == d2);
        result.vector[i] = d2;
      }
      result.isRepeating = repeating;
    }
FloatTreeReader
orc/java/core/src/java/org/apache/orc/impl/TreeReaderFactory.java
Lines 818 to 830 in 5e6a4cd
    } else {
      // no nulls & > 1 row (check repeating)
      boolean repeating = (batchSize > 1);
      final float f1 = utils.readFloat(stream);
      result.vector[0] = f1;
      // conditions to ensure bounds checks skips
      for (int i = 1; i < batchSize && batchSize <= result.vector.length; i++) {
        final float f2 = utils.readFloat(stream);
        repeating = repeating && (f1 == f2);
        result.vector[i] = f2;
      }
      result.isRepeating = repeating;
    }
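The detection rule in both readers can be isolated as a small standalone check. The sketch below is a simplified illustration: `RepeatScan.allEqual` is a hypothetical helper that works on an already-populated array, so unlike the readers above it can stop at the first mismatch instead of continuing to consume the stream. Two properties carry over from the quoted loops: a single-row batch is never marked repeating, and because the comparison uses `==`, a batch of `NaN` values is not treated as repeating.

```java
// Hypothetical helper; it mirrors the comparison rule of the quoted loops only.
final class RepeatScan {
    // Returns true when the first batchSize entries are all equal under ==.
    // A batch of zero or one rows is reported as not repeating.
    static boolean allEqual(double[] values, int batchSize) {
        boolean repeating = batchSize > 1;
        for (int i = 1; i < batchSize && repeating; i++) {
            repeating = values[0] == values[i];
        }
        return repeating;
    }

    public static void main(String[] args) {
        System.out.println(allEqual(new double[] {1.5, 1.5, 1.5}, 3));
        System.out.println(allEqual(new double[] {1.5, 2.5}, 2));
    }
}
```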
Some tests
  @Test
  public void testDoubleIsRepeatingFlag() throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path testFilePath = new Path(workDir, "testDoubleIsRepeatingFlag.orc");
    fs.delete(testFilePath, true);

    Configuration doubleConf = new Configuration(conf);
    doubleConf.set(OrcConf.STRIPE_ROW_COUNT.getAttribute(), "1024");
    doubleConf.set(OrcConf.ROWS_BETWEEN_CHECKS.getAttribute(), "1");
    String typeStr = "double";
    TypeDescription schema = TypeDescription.fromString("struct<col1:" + typeStr + ">");
    Writer w = OrcFile.createWriter(testFilePath, OrcFile.writerOptions(doubleConf).setSchema(schema));

    VectorizedRowBatch b = schema.createRowBatch();
    DoubleColumnVector f1 = (DoubleColumnVector) b.cols[0];
    for (int i = 0; i < 1024; i++) {
      f1.vector[i] = -119.4594594595D;
    }
    b.size = 1024;
    w.addRowBatch(b);

    b.reset();
    for (int i = 0; i < 1024; i++) {
      f1.vector[i] = 9318.4351351351D;
    }
    b.size = 1024;
    w.addRowBatch(b);

    b.reset();
    for (int i = 0; i < 1024; i++) {
      f1.vector[i] = -4298.1513513514D;
    }
    b.size = 1024;
    w.addRowBatch(b);

    b.reset();
    w.close();

    Reader.Options options = new Reader.Options();
    try (Reader reader = OrcFile.createReader(testFilePath, OrcFile.readerOptions(conf));
         RecordReader rows = reader.rows(options)) {
      VectorizedRowBatch batch = schema.createRowBatch();

      rows.nextBatch(batch);
      assertEquals(1024, batch.size);
      assertTrue(batch.cols[0].isRepeating);

      rows.nextBatch(batch);
      assertEquals(1024, batch.size);
      assertTrue(batch.cols[0].isRepeating);

      rows.nextBatch(batch);
      assertEquals(1024, batch.size);
      assertTrue(batch.cols[0].isRepeating);
    }
  }

  @Test
  public void testFloatIsRepeatingFlag() throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path testFilePath = new Path(workDir, "testFloatIsRepeatingFlag.orc");
    fs.delete(testFilePath, true);

    Configuration floatConf = new Configuration(conf);
    floatConf.set(OrcConf.STRIPE_ROW_COUNT.getAttribute(), "1024");
    floatConf.set(OrcConf.ROWS_BETWEEN_CHECKS.getAttribute(), "1");
    String typeStr = "float";
    TypeDescription schema = TypeDescription.fromString("struct<col1:" + typeStr + ">");
    Writer w = OrcFile.createWriter(testFilePath, OrcFile.writerOptions(floatConf).setSchema(schema));

    VectorizedRowBatch b = schema.createRowBatch();
    DoubleColumnVector f1 = (DoubleColumnVector) b.cols[0];
    for (int i = 0; i < 1024; i++) {
      f1.vector[i] = -119.4594594595D;
    }
    b.size = 1024;
    w.addRowBatch(b);

    b.reset();
    for (int i = 0; i < 1024; i++) {
      f1.vector[i] = 9318.4351351351D;
    }
    b.size = 1024;
    w.addRowBatch(b);

    b.reset();
    for (int i = 0; i < 1024; i++) {
      f1.vector[i] = -4298.1513513514D;
    }
    b.size = 1024;
    w.addRowBatch(b);

    b.reset();
    w.close();

    Reader.Options options = new Reader.Options();
    try (Reader reader = OrcFile.createReader(testFilePath, OrcFile.readerOptions(conf));
         RecordReader rows = reader.rows(options)) {
      VectorizedRowBatch batch = schema.createRowBatch();

      rows.nextBatch(batch);
      assertEquals(1024, batch.size);
      assertTrue(batch.cols[0].isRepeating);

      rows.nextBatch(batch);
      assertEquals(1024, batch.size);
      assertTrue(batch.cols[0].isRepeating);

      rows.nextBatch(batch);
      assertEquals(1024, batch.size);
      assertTrue(batch.cols[0].isRepeating);
    }
  }
If this PR is ready, I can merge it to 1.9.4 as a release manager.
Which version is this PR targeting, @cxzl25?
If the introduction of ORC-1266 causes Hive calculation results to be inaccurate, in order to ensure that Hive can use the 1.9.x version, I hope to merge it into the 1.9 branch.
If that is your intention, please switch it to Anyway, that's too bad because we missed 1.9.4 release train.
+1, LGTM. Thank you.
### What changes were proposed in this pull request?

Decimal type, when `isRepeating` itself is false, do not try to change it.

### Why are the changes needed?

apache/hive#5218 (comment)

[ORC-1266](https://issues.apache.org/jira/browse/ORC-1266): DecimalColumnVector resets the isRepeating flag in the nextVector method

### How was this patch tested?

Add UT

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #1960 from cxzl25/decimal_isRepeating.

Authored-by: sychen <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit e818d56)
Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request?

This PR aims to upgrade ORC to 1.9.5 for Apache Spark 3.5.4.

### Why are the changes needed?

To bring the latest bug fix:
- https://orc.apache.org/news/2024/11/14/ORC-1.9.5/
- apache/orc#1960

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #48845 from dongjoon-hyun/SPARK-50316.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
What changes were proposed in this pull request?
Decimal type, when `isRepeating` itself is false, do not try to change it.
Why are the changes needed?
apache/hive#5218 (comment)
ORC-1266: DecimalColumnVector resets the isRepeating flag in the nextVector method
How was this patch tested?
Add UT
Was this patch authored or co-authored using generative AI tooling?
No
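
The rule stated in the description ("when `isRepeating` itself is false, do not try to change it") can be sketched in isolation as follows. This is a rough illustration under stated assumptions, not the patch itself: `GuardedBatch`, `promoteIfAllEqual`, and `fill` are hypothetical names, the vector holds `long` values rather than decimals, and the body of the guard is an assumption, since the quoted hunk shows only its condition. Here the guard simply restores the flag, which is safe in this sketch because every row has already been written.

```java
// Hypothetical stand-in for a reader filling a column vector; NOT ORC code.
final class GuardedBatch {
    final long[] vector;
    boolean isRepeating;

    GuardedBatch(int size) {
        vector = new long[size];
    }

    // Stand-in for a post-read scan that promotes an all-equal batch.
    void promoteIfAllEqual(int batchSize) {
        boolean same = batchSize > 1;
        for (int i = 1; i < batchSize && same; i++) {
            same = vector[i] == vector[0];
        }
        if (same) {
            isRepeating = true;
        }
    }

    // Copies values in, runs the scan, then applies the guard.
    void fill(long[] values, int batchSize) {
        boolean preIsRepeating = isRepeating;
        System.arraycopy(values, 0, vector, 0, batchSize);
        promoteIfAllEqual(batchSize);
        if (!preIsRepeating && isRepeating) {
            // ASSUMED guard body: undo the promotion. All rows are populated
            // above, so clearing the flag leaves a valid flat vector.
            isRepeating = false;
        }
    }

    public static void main(String[] args) {
        GuardedBatch batch = new GuardedBatch(3);
        batch.fill(new long[] {7L, 7L, 7L}, 3);
        System.out.println(batch.isRepeating);
    }
}
```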