Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ORC-1741: Respect decimal reader isRepeating flag #1960

Closed
wants to merge 3 commits into from

Conversation

cxzl25
Copy link
Contributor

@cxzl25 cxzl25 commented Jun 24, 2024

What changes were proposed in this pull request?

Decimal type, when isRepeating itself is false, do not try to change it.

Why are the changes needed?

apache/hive#5218 (comment)

ORC-1266: DecimalColumnVector resets the isRepeating flag in the nextVector method

How was this patch tested?

Add UT

Was this patch authored or co-authored using generative AI tooling?

No

@difin
Copy link

difin commented Jun 28, 2024

Hi @cxzl25, I did a test on my local env: backported this patch into Orc branch 1.9, built Hive with it then run the Hive query test that had been failing before and it passed. Can you please create a new Orc release from branch 1.9 so that we could use it in Hive? This is because we can't use Orc 2.0 or later because it requires Java 17+ and Hive still support Java 8.

@@ -1562,7 +1563,11 @@ private void nextVector(DecimalColumnVector result,
}
setIsRepeatingIfNeeded(result, r);
}
if (!preIsRepeating && result.isRepeating) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why don't you move this logic inside of setIsRepeatingIfNeeded instead of scattering it all over the code?

Copy link
Member

@deniskuzZ deniskuzZ Jun 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also a question from the Hive Orc PR:

where do we handle repeated nulls? it seems that setIsRepeatingIfNeeded is only called when result.noNulls || !result.isNull[0]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just a draft PR, it should not be considered a bug.

When we convert the same value of the same batch size into isRepeating=true, the two should be equivalent.

I'm not sure why this behavior will affect the calculation results of Hiive. I repeatedly debugged this in Hive, but I didn't find the root cause.

    DecimalColumnVector v1 = new DecimalColumnVector(1024, 10, 2);
    v1.isRepeating = true;
    v1.vector[0] = new HiveDecimalWritable("1.234");

    DecimalColumnVector v2 = new DecimalColumnVector(1024, 10, 2);
    for (int i = 0; i < 1024; i++) {
      v2.vector[i] = new HiveDecimalWritable("1.234");
    }

    StringBuilder sb1 = new StringBuilder();
    for (int i = 0; i < 1024; i++) {
      v1.stringifyValue(sb1, i);
    }
    StringBuilder sb2 = new StringBuilder();
    for (int i = 0; i < 1024; i++) {
      v2.stringifyValue(sb2, i);
    }
    System.out.println(sb1.toString().equals(sb2.toString()));

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DoubleTreeReader and FloatTreeReader also have similar behavior. If the entire batch size is repeated, it will set isRepeating=true, which is consistent with the behavior of ORC-1266.


DoubleTreeReader

} else {
// no nulls
boolean repeating = (batchSize > 1);
final double d1 = utils.readDouble(stream);
result.vector[0] = d1;
// conditions to ensure bounds checks skips
for (int i = 1; i < batchSize && batchSize <= result.vector.length; i++) {
final double d2 = utils.readDouble(stream);
repeating = repeating && (d1 == d2);
result.vector[i] = d2;
}
result.isRepeating = repeating;
}

FloatTreeReader

} else {
// no nulls & > 1 row (check repeating)
boolean repeating = (batchSize > 1);
final float f1 = utils.readFloat(stream);
result.vector[0] = f1;
// conditions to ensure bounds checks skips
for (int i = 1; i < batchSize && batchSize <= result.vector.length; i++) {
final float f2 = utils.readFloat(stream);
repeating = repeating && (f1 == f2);
result.vector[i] = f2;
}
result.isRepeating = repeating;
}


Some tests

  @Test
  public  void testDoubleIsRepeatingFlag() throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path testFilePath = new Path(workDir, "testDoubleIsRepeatingFlag.orc");
    fs.delete(testFilePath, true);

    Configuration doubleConf = new Configuration(conf);
    doubleConf.set(OrcConf.STRIPE_ROW_COUNT.getAttribute(), "1024");
    doubleConf.set(OrcConf.ROWS_BETWEEN_CHECKS.getAttribute(), "1");
    String typeStr = "double";
    TypeDescription schema = TypeDescription.fromString("struct<col1:" + typeStr + ">");
    Writer w = OrcFile.createWriter(testFilePath, OrcFile.writerOptions(doubleConf).setSchema(schema));

    VectorizedRowBatch b = schema.createRowBatch();
    DoubleColumnVector f1 = (DoubleColumnVector) b.cols[0];
    for (int i = 0; i < 1024; i++) {
      f1.vector[i] = -119.4594594595D;
    }
    b.size = 1024;
    w.addRowBatch(b);

    b.reset();
    for (int i = 0; i < 1024; i++) {
      f1.vector[i] = 9318.4351351351D;
    }
    b.size = 1024;
    w.addRowBatch(b);

    b.reset();
    for (int i = 0; i < 1024; i++) {
      f1.vector[i] = -4298.1513513514D;
    }
    b.size = 1024;
    w.addRowBatch(b);

    b.reset();
    w.close();

    Reader.Options options = new Reader.Options();
    try (Reader reader = OrcFile.createReader(testFilePath, OrcFile.readerOptions(conf));
         RecordReader rows = reader.rows(options)) {
      VectorizedRowBatch batch = schema.createRowBatch();

      rows.nextBatch(batch);
      assertEquals(1024, batch.size);
      assertTrue(batch.cols[0].isRepeating);

      rows.nextBatch(batch);
      assertEquals(1024, batch.size);
      assertTrue(batch.cols[0].isRepeating);

      rows.nextBatch(batch);
      assertEquals(1024, batch.size);
      assertTrue(batch.cols[0].isRepeating);

    }
  }

  @Test
  public  void testFloatIsRepeatingFlag() throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path testFilePath = new Path(workDir, "testFloatIsRepeatingFlag.orc");
    fs.delete(testFilePath, true);

    Configuration floatConf = new Configuration(conf);
    floatConf.set(OrcConf.STRIPE_ROW_COUNT.getAttribute(), "1024");
    floatConf.set(OrcConf.ROWS_BETWEEN_CHECKS.getAttribute(), "1");
    String typeStr = "float";
    TypeDescription schema = TypeDescription.fromString("struct<col1:" + typeStr + ">");
    Writer w = OrcFile.createWriter(testFilePath, OrcFile.writerOptions(floatConf).setSchema(schema));

    VectorizedRowBatch b = schema.createRowBatch();
    DoubleColumnVector f1 = (DoubleColumnVector) b.cols[0];
    for (int i = 0; i < 1024; i++) {
      f1.vector[i] = -119.4594594595D;
    }
    b.size = 1024;
    w.addRowBatch(b);

    b.reset();
    for (int i = 0; i < 1024; i++) {
      f1.vector[i] = 9318.4351351351D;
    }
    b.size = 1024;
    w.addRowBatch(b);

    b.reset();
    for (int i = 0; i < 1024; i++) {
      f1.vector[i] = -4298.1513513514D;
    }
    b.size = 1024;
    w.addRowBatch(b);

    b.reset();
    w.close();

    Reader.Options options = new Reader.Options();
    try (Reader reader = OrcFile.createReader(testFilePath, OrcFile.readerOptions(conf));
         RecordReader rows = reader.rows(options)) {
      VectorizedRowBatch batch = schema.createRowBatch();

      rows.nextBatch(batch);
      assertEquals(1024, batch.size);
      assertTrue(batch.cols[0].isRepeating);

      rows.nextBatch(batch);
      assertEquals(1024, batch.size);
      assertTrue(batch.cols[0].isRepeating);

      rows.nextBatch(batch);
      assertEquals(1024, batch.size);
      assertTrue(batch.cols[0].isRepeating);
    }
  }

Copy link
Member

@williamhyun williamhyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this PR is ready, I can merge it to 1.9.4 as a release manager.

cc: @wgtmac @guiyanakuang @deshanxiao

@williamhyun williamhyun added this to the 1.9.4 milestone Jul 2, 2024
@williamhyun williamhyun removed this from the 1.9.4 milestone Jul 10, 2024
@cxzl25 cxzl25 changed the title Respect Decimal isRepeating flag ORC-1741: Respect decimal reader isRepeating flag Jul 11, 2024
@cxzl25 cxzl25 marked this pull request as ready for review July 11, 2024 02:52
@cxzl25 cxzl25 requested a review from deniskuzZ July 11, 2024 02:52
@dongjoon-hyun
Copy link
Member

Which version is this PR targeting, @cxzl25 ?

@dongjoon-hyun
Copy link
Member

If this is not a bug, please set the milestone of this PR to ORC 2.1.0. Otherwise, please set the milestone to 1.9.5 and change the JIRA issue type from Improvement to Bug.

This is just a draft PR, it should not be considered a bug.

Screenshot 2024-07-18 at 22 02 27

@cxzl25
Copy link
Contributor Author

cxzl25 commented Jul 22, 2024

Which version is this PR targeting

If the introduction of ORC-1266 causes Hive calculation results to be inaccurate, in order to ensure that Hive can use the 1.9.x version, I hope to merge it into the 1.9 branch.

@dongjoon-hyun
Copy link
Member

If that is your intention, please switch it to Bug instead of Improvement because we don't backport Improvement in core module.

Anyway, that's too bad because we missed 1.9.4 release train.

@cxzl25 cxzl25 added this to the 1.9.5 milestone Jul 23, 2024
Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM. Thank you.

dongjoon-hyun pushed a commit that referenced this pull request Jul 24, 2024
### What changes were proposed in this pull request?
Decimal type, when `isRepeating` itself is false, do not try to change it.

### Why are the changes needed?
apache/hive#5218 (comment)

[ORC-1266](https://issues.apache.org/jira/browse/ORC-1266): DecimalColumnVector resets the isRepeating flag in the nextVector method

### How was this patch tested?
Add UT

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #1960 from cxzl25/decimal_isRepeating.

Authored-by: sychen <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit e818d56)
Signed-off-by: Dongjoon Hyun <[email protected]>
dongjoon-hyun pushed a commit that referenced this pull request Jul 24, 2024
Decimal type, when `isRepeating` itself is false, do not try to change it.

apache/hive#5218 (comment)

[ORC-1266](https://issues.apache.org/jira/browse/ORC-1266): DecimalColumnVector resets the isRepeating flag in the nextVector method

Add UT

No

Closes #1960 from cxzl25/decimal_isRepeating.

Authored-by: sychen <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit e818d56)
Signed-off-by: Dongjoon Hyun <[email protected]>
dongjoon-hyun added a commit to apache/spark that referenced this pull request Nov 14, 2024
### What changes were proposed in this pull request?

This PR aims to upgrade ORC to 1.9.5 for Apache Spark 3.5.4.

### Why are the changes needed?

To bring the latest bug fix:
- https://orc.apache.org/news/2024/11/14/ORC-1.9.5/
  - apache/orc#1960

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #48845 from dongjoon-hyun/SPARK-50316.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants