
Adds HoodieGlobalBloomIndex #438

Merged
merged 1 commit into apache:master on Sep 29, 2018

Conversation

leletan
Contributor

@leletan leletan commented Aug 14, 2018

WHAT

  • feature(HoodieGlobalBloomIndex): adds a new type of bloom index to allow global record key lookup

@CLAassistant

CLAassistant commented Aug 14, 2018

CLA assistant check
All committers have signed the CLA.

/**
* This filter will only work with hoodie dataset since it will only load partition
* with .hoodie_partition_metadata file in it.
* Created by jiale.tan on 8/13/18.
Contributor

Please remove the author names from the file

@@ -0,0 +1,63 @@
/*
* Copyright (c) 2017 Uber Technologies, Inc. ([email protected])
Contributor

Please change the year to 2018

* with .hoodie_partition_metadata file in it.
* Created by jiale.tan on 8/13/18.
*/
public class HoodieGlobalBloomIndex<T extends HoodieRecordPayload> extends HoodieBloomIndex<T> {
Contributor

Please override the following method here :

  @Override
  public boolean isGlobal() {
    return true;
  }

Member

@vinothchandar vinothchandar left a comment

This is awesome.. Thanks @leletan :) . Have you been able to test this out with any toy datasets for validation?

@n3nash
Contributor

n3nash commented Aug 14, 2018

In general, very concise PR @leletan, just what was needed!

@leletan
Contributor Author

leletan commented Aug 14, 2018

@vinothchandar @n3nash Thanks for the advice, will make changes accordingly. Not yet validated with a dataset; will do that.

@leletan
Contributor Author

leletan commented Aug 14, 2018

A quick test on a small dataset indicates the cross-partition deduping sometimes does not work; will dig deeper.

@leletan leletan changed the title Adds HoodieGlobalBloomIndex [DO NOT MERGE] Adds HoodieGlobalBloomIndex Aug 14, 2018
@leletan
Contributor Author

leletan commented Aug 20, 2018

@vinothchandar wondering if there is any use case other than cross-partition deduping that I could run a dataset against to test this PR.

@vinothchandar
Member

That's the main use case; can't think of anything else off the top of my head.

but there can be several corner cases like

  • ensuring the partition path supplied in the RDD has no bearing, i.e. all keys are checked against all files
  • generating input such that the partition paths in the input RDD are a subset of the total partitions in the dataset (HoodieTestDataGenerator may make them the same by default)

@vinothchandar
Member

@leletan @n3nash I think I may know why global de-duping may not be working as expected

/**
   * For each incoming record, produce N output records, 1 each for each file against which the
   * record's key needs to be checked. For datasets, where the keys have a definite insert order
   * (e.g: timestamp as prefix), the number of files to be compared gets cut down a lot from range
   * pruning.
   */
  // sub-partition to ensure the records can be looked up against files & also prune
  // file<=>record comparisons based on recordKey
  // ranges in the index info.
  @VisibleForTesting
  JavaPairRDD<String, Tuple2<String, HoodieKey>> explodeRecordRDDWithFileComparisons(

this method still only joins input records using the partition path supplied in the HoodieKey; this needs to be overridden and fixed as well.

@n3nash are you able to take a pass here and provide a path for @leletan ? Also please take a look at usages of HoodieIndex::isGlobal and see if they will still be okay..
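
The fix described above, checking each incoming key against candidate files from every partition rather than only the partition carried in the HoodieKey, can be sketched in plain Java. Names and signatures below are illustrative only, not the actual Hudi API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustrative sketch (not the actual Hudi API): a "global" explode pairs
// each incoming record key with candidate files from EVERY partition,
// deliberately ignoring the partition path the record arrived with.
public class GlobalExplodeSketch {

    // Returns "fileId:recordKey" comparison pairs to be checked against
    // each file's bloom filter / key ranges.
    static List<String> explodeGlobally(List<String> recordKeys,
                                        Map<String, List<String>> filesPerPartition) {
        List<String> comparisons = new ArrayList<>();
        for (String key : recordKeys) {
            // Iterate ALL partitions, not just the record's own partition.
            for (List<String> files : filesPerPartition.values()) {
                for (String fileId : files) {
                    comparisons.add(fileId + ":" + key);
                }
            }
        }
        return comparisons;
    }

    public static void main(String[] args) {
        Map<String, List<String>> index = Map.of(
                "2018/08/14", List.of("f1", "f2"),
                "2018/08/15", List.of("f3"));
        // One key is checked against all three files across both partitions.
        System.out.println(explodeGlobally(List.of("k1"), index).size()); // prints 3
    }
}
```

In the real index the outer loop would be a Spark flatMapToPair over the record RDD, and range pruning could still drop files whose key ranges cannot contain the record key.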

@n3nash
Contributor

n3nash commented Aug 20, 2018

@vinothchandar Yes, I can drive this. You are right, JavaPairRDD<String, Tuple2<String, HoodieKey>> explodeRecordRDDWithFileComparisons( needs to be overridden for two reasons: 1) to be able to autoComputeParallelism correctly, and 2) to find the correct matching files (which in this case should be all of them).

@leletan Please override that method as well. Also, I took a closer look at HoodieIndex::isGlobal; I think it's best to return false here. There are a couple of places in the code that make assumptions about what isGlobal=true means, and this implementation either does not satisfy those assumptions or they do not apply to it. There should be no side effect of setting it to false.

@vinothchandar We need to remove those assumptions from the code (they are specifically there because of HbaseIndex), that would make isGlobal understanding clean.

@n3nash
Contributor

n3nash commented Aug 29, 2018

@leletan Just checking back here, were you able to make any progress on this?

@leletan
Contributor Author

leletan commented Aug 30, 2018

@n3nash Was still in the process of making the end-to-end data test work.
First I was stuck by this and found a workaround in that ticket.

Then I was stuck for a while, but just found a way to dedup the later record here, to make sure the cross-partition dedup in this PR keeps the record in the old partition in my end-to-end testing.

And next (likely tomorrow) I am going to continue this work based on your latest input.

But eventually my test will be based on changes on top of this branch, which is based on 0.4.2-SNAPSHOT, since we are using Spark 2.3 in our production.

Wondering if you have any suggestions for end-to-end testing of this PR against master, or maybe just running the PR against HoodieJavaApp is good enough?

@n3nash
Contributor

n3nash commented Aug 31, 2018

@leletan thanks for the update. Definitely run the PR against HoodieJavaApp; also, if you can run the PR against some dataset internally, that will help vet the PR more. Maybe run the dataset against the latest release (without your PR) and then with your PR, and validate that your dataset behaves correctly in both scenarios; that's a good way to vet your code.

We don't have an exhaustive set of end to end testing as such (as we discussed f2f) but unit tests should capture most of the quirks.

@leletan leletan force-pushed the wip-global-bloomfilter-lookup branch from 9139aa8 to bfe573e Compare September 1, 2018 00:36
@leletan
Contributor Author

leletan commented Sep 1, 2018

Made some new modifications based on the comments and also added some unit tests. Will continue with some real-data testing later.

@vinothchandar vinothchandar changed the title [DO NOT MERGE] Adds HoodieGlobalBloomIndex [WIP] Adds HoodieGlobalBloomIndex Sep 4, 2018
@leletan
Contributor Author

leletan commented Sep 10, 2018

Ran data with HoodieJavaApp; it seems to be working as expected. Now I will put this into a real Spark ETL job for a test.

@n3nash
Contributor

n3nash commented Sep 10, 2018

@leletan sounds great, let me know how it goes.

@vinothchandar
Member

nicee!

@leletan
Contributor Author

leletan commented Sep 12, 2018

I had some changes based on the discussion in #441; please let me know if it makes sense to add that code into this PR, or if it is better to make a separate PR for it.

@leletan
Contributor Author

leletan commented Sep 12, 2018

Tested this with our real Spark ETL job along with some changes I made for #441; things are working fine. Let me know if I need to check in those changes here as well or in a separate PR.

@n3nash
Contributor

n3nash commented Sep 16, 2018

@leletan Please open a separate PR for the changes discussed in #441. I'll take another pass at this PR later tonight.

return partitionRecordKeyPairRDD.map(partitionRecordKeyPair -> {
String recordKey = partitionRecordKeyPair._2();

List<Tuple2<String, BloomIndexFileInfo>> indexInfos =
Contributor

We are performing the exact same operation for all recordKeys here. Maybe move this outside?

Contributor Author

Good point
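
The hoisting the reviewer suggests can be sketched as follows; the names are illustrative rather than the PR's actual code, where the invariant is the flattened list of (partition, BloomIndexFileInfo) pairs:

```java
import java.util.List;
import java.util.stream.Collectors;

// Sketch of the reviewer's suggestion: work that is identical for every
// record should be computed once, outside the per-record lambda, and then
// captured by it. Names here are illustrative, not the PR's actual code.
public class HoistInvariantSketch {

    static List<String> tagRecords(List<String> recordKeys, List<String> fileInfos) {
        // Computed ONCE here, instead of once per record inside map().
        String candidateFiles = String.join(",", fileInfos);
        return recordKeys.stream()
                .map(key -> key + " -> [" + candidateFiles + "]")
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> tagged = tagRecords(List.of("k1", "k2"), List.of("f1", "f2"));
        System.out.println(tagged.get(0)); // prints k1 -> [f1,f2]
    }
}
```

In a Spark map() the same rule applies: a value computed outside the lambda is serialized into the closure once per task, not recomputed per record.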

return writeParquetFile(partitionPath, filename, records, schema, filter, createCommitTime);
}

private String writeParquetFile(String partitionPath, String filename, List<HoodieRecord> records, Schema schema,
Contributor

Let's move these methods to HoodieTestUtils. These are reused across TestHoodieBloomIndex..

Contributor Author

Good point, will do.

@n3nash
Contributor

n3nash commented Sep 16, 2018

@leletan Left a few comments, rest looks ok.

@leletan
Contributor Author

leletan commented Sep 18, 2018

Made changes according to the comments.
If everything looks fine, let me know if a final rebase is needed for a cleaner commit log.

BloomFilter filter,
boolean createCommitTime) throws IOException, InterruptedException {
Thread.sleep(1000);
String commitTime = new SimpleDateFormat("yyyyMMddHHmmss").format(new Date());
Contributor

There may already be a util in TestUtils to create a new dataFile; use that instead of creating a new file here?


if (createCommitTime) {
// Also make sure the commit is valid
new File(basePath + "/" + HoodieTableMetaClient.METAFOLDER_NAME).mkdirs();
Contributor

Add utils for these in HoodieTestUtils (createMetadataFolder etc.) and then use them here.
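
A sketch of what such helpers might look like; these are hypothetical names, and the real HoodieTestUtils API may differ:

```java
import java.io.File;
import java.io.IOException;

// Hypothetical sketch of the test utilities the reviewer asks for
// (createMetadataFolder etc.); the real HoodieTestUtils API may differ.
public class MetaFolderSketch {

    // Matches HoodieTableMetaClient.METAFOLDER_NAME in the snippet above.
    static final String METAFOLDER_NAME = ".hoodie";

    // Create the dataset's metadata folder under basePath.
    static File createMetadataFolder(String basePath) {
        File meta = new File(basePath, METAFOLDER_NAME);
        meta.mkdirs();
        return meta;
    }

    // Mark a commit as valid by touching its commit file in the metafolder.
    static File createCommitFile(String basePath, String commitTime) throws IOException {
        File commit = new File(createMetadataFolder(basePath), commitTime + ".commit");
        commit.createNewFile();
        return commit;
    }

    public static void main(String[] args) throws IOException {
        String basePath = System.getProperty("java.io.tmpdir") + "/hoodie-demo";
        File commit = createCommitFile(basePath, "20180929120000");
        System.out.println(commit.exists()); // prints true
    }
}
```

Centralizing this keeps tests from hand-rolling `new File(basePath + "/" + METAFOLDER_NAME).mkdirs()` in several places.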

@n3nash
Contributor

n3nash commented Sep 18, 2018

@leletan Left a few comments on the TestUtils usage, once that is done please squash your commits to a single commit.

@leletan leletan closed this Sep 19, 2018
@leletan leletan reopened this Sep 19, 2018
@leletan leletan closed this Sep 19, 2018
@leletan leletan reopened this Sep 19, 2018
@n3nash
Contributor

n3nash commented Sep 21, 2018

@leletan Please squash the commits and let me know when the diff is ready. I understand the cyclic dependency on HoodieTestUtils so let's avoid that. Please see if there are utils that you can reuse for creating datafiles etc (instead of creating them in the code as you have right now). We can get this diff landed soon.

@leletan
Contributor Author

leletan commented Sep 22, 2018

Good point, will do.

@leletan leletan force-pushed the wip-global-bloomfilter-lookup branch from 19ad7e6 to 2b3ee20 Compare September 24, 2018 05:35
@leletan
Contributor Author

leletan commented Sep 24, 2018

Done

Schema schema,
BloomFilter filter,
boolean createCommitTime) throws IOException, InterruptedException {
Thread.sleep(1000);
Member

@leletan any reason for the sleep?

Contributor Author

Actually, the two Parquet-writing helper functions are mostly just copied from that file.

Contributor

@vinothchandar This was added due to the need to create unique commitTimes, since commit times are at the granularity of a second. We should look at adding a testUtils helper to create unique commit times, since this sleep() is leaking into many classes.
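
A test helper along these lines could hand out unique second-granularity commit times without sleeping. This is a hypothetical utility, not an existing Hudi class:

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper: hands out unique "yyyyMMddHHmmss" commit times
// without Thread.sleep(1000), by bumping the timestamp whenever two
// calls land within the same wall-clock second.
public class UniqueCommitTimes {

    private static final AtomicLong lastSecond = new AtomicLong(0);

    public static String next() {
        long nowSec = System.currentTimeMillis() / 1000;
        // Atomically take max(now, previous + 1) so every call gets a fresh second.
        long unique = lastSecond.updateAndGet(prev -> Math.max(nowSec, prev + 1));
        return new SimpleDateFormat("yyyyMMddHHmmss").format(new Date(unique * 1000));
    }

    public static void main(String[] args) {
        String a = next();
        String b = next(); // no sleep needed, yet still distinct
        System.out.println(a.equals(b)); // prints false
    }
}
```

Each call advances the timestamp by at least one second relative to the previous call, so tests never need Thread.sleep(1000) to guarantee distinct commit times.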

Member

@vinothchandar vinothchandar left a comment

@n3nash good to go?

@vinothchandar vinothchandar changed the title [WIP] Adds HoodieGlobalBloomIndex Adds HoodieGlobalBloomIndex Sep 28, 2018
* (e.g: timestamp as prefix), the number of files to be compared gets cut down a lot from range
* pruning.
*/
// sub-partition to ensure the records can be looked up against files & also prune
Contributor

This style of commenting doesn't look correct. Could you either move everything here into one Javadoc-style comment, or move some of it inside the method if you prefer.

Contributor Author

good point, will fix

import com.uber.hoodie.table.HoodieTable;

Contributor

Maybe try to avoid reshuffling these imports?

Contributor Author

sure.

@n3nash
Contributor

n3nash commented Sep 28, 2018

@leletan Left 2 minor comments after which this is good to merge.

@vinothchandar FYI

@leletan leletan force-pushed the wip-global-bloomfilter-lookup branch from 2b3ee20 to 9709e2b Compare September 28, 2018 21:44
@leletan
Contributor Author

leletan commented Sep 28, 2018

Done. Also rebased and squashed the commits.

@n3nash
Contributor

n3nash commented Sep 29, 2018

@vinothchandar LGTM

@vinothchandar vinothchandar merged commit 98fd97b into apache:master Sep 29, 2018
@vinothchandar
Member

this is an awesome contribution... thanks @leletan ! merged.. :)

@leletan leletan deleted the wip-global-bloomfilter-lookup branch November 29, 2018 07:33
vinishjail97 pushed a commit to vinishjail97/hudi that referenced this pull request Dec 15, 2023