Adds HoodieGlobalBloomIndex #438
Conversation
/**
 * This filter will only work with hoodie dataset since it will only load partition
 * with .hoodie_partition_metadata file in it.
 * Created by jiale.tan on 8/13/18.
Please remove the author names from the file
@@ -0,0 +1,63 @@
/*
 * Copyright (c) 2017 Uber Technologies, Inc. ([email protected])
Please change the year to 2018
 * with .hoodie_partition_metadata file in it.
 * Created by jiale.tan on 8/13/18.
 */
public class HoodieGlobalBloomIndex<T extends HoodieRecordPayload> extends HoodieBloomIndex<T> {
Please override the following method here:

@Override
public boolean isGlobal() {
  return true;
}
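For orientation, here is a minimal sketch of the class with that override in place. It assumes the index is constructed from a single HoodieWriteConfig argument; the actual constructor signature and package layout in the 0.4.x code line may differ, so treat this as illustrative rather than the merged implementation.

```java
package com.uber.hoodie.index.bloom;

import com.uber.hoodie.common.model.HoodieRecordPayload;
import com.uber.hoodie.config.HoodieWriteConfig;

/**
 * Global variant of the bloom index: a record key is treated as unique across the
 * whole dataset rather than within a single partition.
 */
public class HoodieGlobalBloomIndex<T extends HoodieRecordPayload> extends HoodieBloomIndex<T> {

  // Assumed constructor; the real HoodieBloomIndex constructor may take more arguments.
  public HoodieGlobalBloomIndex(HoodieWriteConfig config) {
    super(config);
  }

  // The override requested above: signals to the write path that key uniqueness
  // (and therefore deduplication) spans all partitions.
  @Override
  public boolean isGlobal() {
    return true;
  }
}
```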
(resolved review thread on hoodie-client/src/test/java/com/uber/hoodie/index/bloom/TestHoodieGlobalBloomIndex.java)
This is awesome.. Thanks @leletan :) Have you been able to test this out with any toy datasets for validation?
(two more resolved review threads on hoodie-client/src/test/java/com/uber/hoodie/index/bloom/TestHoodieGlobalBloomIndex.java)
In general, very concise PR @leletan, just what was needed!
@vinothchandar @n3nash Thanks for the advice, will make changes accordingly. Not yet validated with a data set, will do it.
A quick small-data-set test indicates the cross-partition deduping sometimes is not working, will dig deeper.
@vinothchandar wondering if there is any use case, other than cross-partition deduping, that I could run some dataset against to test this PR.
That's the main use case.. Can't think of anything else off the top of my head, but there can be several corner cases like
@leletan @n3nash I think I may know why global de-duping may not be working as expected
this method still only joins input records using the partitionPath supplied in the HoodieKey; this needs to be overridden and fixed as well.. @n3nash are you able to take a pass here and provide a path for @leletan? Also please take a look at usages of
@vinothchandar Yes, I can drive this. You are right. @leletan please override that method as well. Also, I took a closer look at @vinothchandar we need to remove those assumptions from the code (they are specifically there because of HbaseIndex); that would make the isGlobal semantics clean.
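To make the join-key point above concrete, here is a rough, self-contained sketch; the class, field and method names are hypothetical stand-ins, not the HoodieBloomIndex API, and only the keying logic matters.

```java
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import scala.Tuple2;

// Hypothetical stand-in for the incoming records.
class TaggingKeySketch {

  static class IncomingRecord implements java.io.Serializable {
    String recordKey;
    String partitionPath;
  }

  // Per-partition behaviour: records are keyed by (partitionPath, recordKey) before
  // being joined against existing locations, so a key previously written under a
  // different partition path is never matched (no cross-partition dedup).
  static JavaPairRDD<Tuple2<String, String>, IncomingRecord> perPartitionKeyed(
      JavaRDD<IncomingRecord> records) {
    return records.mapToPair(
        r -> new Tuple2<>(new Tuple2<>(r.partitionPath, r.recordKey), r));
  }

  // Global behaviour: key by the record key alone, so the join finds the existing
  // file (and its original partition path) regardless of the partition supplied in
  // the incoming HoodieKey.
  static JavaPairRDD<String, IncomingRecord> globallyKeyed(JavaRDD<IncomingRecord> records) {
    return records.mapToPair(r -> new Tuple2<>(r.recordKey, r));
  }
}
```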
@leletan Just checking back here, were you able to make any progress on this?
@n3nash Was still in the process of making the end-to-end data test work. Then I was kind of stuck, but just found a way to dedup the later record here, to make sure the cross-partition dedup in this PR will keep the record in the old partition in my end-to-end testing. And next step (likely tomorrow) I am going to continue this work based on your latest input. But eventually my test will be based on changes on top of this branch, which is based on 0.4.2-SNAPSHOT, since we are using Spark 2.3 in our production. Wondering if you guys have any suggestions for end-to-end testing for this PR against master, or maybe just running the PR against
@leletan thanks for the update. Definitely run the PR against HoodieJavaApp; also, if you can run the PR against some dataset internally, that will help vet the PR more. Maybe run the dataset against the latest release (without your PR) and then with your PR; you can validate whether your dataset behaves correctly in both scenarios, which is a good way to vet your code. We don't have an exhaustive set of end-to-end tests as such (as we discussed f2f), but unit tests should capture most of the quirks.
(force-pushed from 9139aa8 to bfe573e)
Made some new modifications based on comments and also added some unit tests. Will continue to do some real data testing later.
Ran data with HoodieJavaApp - seems to be working fine as expected. Now will put this into a real Spark ETL job for a test.
@leletan sounds great, let me know how it goes.
nicee!
I had some changes based on the discussion in #441; please let me know if it makes sense to add that code into this PR, or if it's better to make a separate PR for that.
Tested this with our real Spark ETL job along with some changes I made for #441; things are working fine. Let me know if I need to check in those changes here as well or in a separate PR.
(resolved review thread on hoodie-client/src/main/java/com/uber/hoodie/index/bloom/HoodieBloomIndex.java)
return partitionRecordKeyPairRDD.map(partitionRecordKeyPair -> {
  String recordKey = partitionRecordKeyPair._2();

  List<Tuple2<String, BloomIndexFileInfo>> indexInfos =
We are performing the exact same operation for all recordKeys here. Maybe move this outside?
Good point
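A sketch of the refactor being agreed on here; the helper methods and the stub BloomIndexFileInfo below are placeholders, not the real Hoodie classes, and exist only to show the invariant computation being hoisted out of the per-record lambda.

```java
import java.util.Collections;
import java.util.List;
import org.apache.spark.api.java.JavaRDD;
import scala.Tuple2;

class HoistFileInfoSketch {

  // Placeholder for the real BloomIndexFileInfo; serializable so the Spark closure can capture it.
  static class BloomIndexFileInfo implements java.io.Serializable { }

  // Before: the same file-info list is rebuilt inside the lambda for every record key.
  static JavaRDD<List<String>> before(JavaRDD<Tuple2<String, String>> partitionRecordKeyPairRDD) {
    return partitionRecordKeyPairRDD.map(partitionRecordKeyPair -> {
      String recordKey = partitionRecordKeyPair._2();
      List<Tuple2<String, BloomIndexFileInfo>> indexInfos = loadInvolvedFileInfos(); // repeated per record
      return candidateFilesFor(recordKey, indexInfos);
    });
  }

  // After: compute the list once (it is the same for every record) and let the closure capture it.
  static JavaRDD<List<String>> after(JavaRDD<Tuple2<String, String>> partitionRecordKeyPairRDD) {
    List<Tuple2<String, BloomIndexFileInfo>> indexInfos = loadInvolvedFileInfos(); // runs once
    return partitionRecordKeyPairRDD.map(
        partitionRecordKeyPair -> candidateFilesFor(partitionRecordKeyPair._2(), indexInfos));
  }

  // Hypothetical helpers, kept trivial so the sketch compiles on its own.
  static List<Tuple2<String, BloomIndexFileInfo>> loadInvolvedFileInfos() {
    return Collections.emptyList();
  }

  static List<String> candidateFilesFor(String recordKey, List<Tuple2<String, BloomIndexFileInfo>> infos) {
    return Collections.emptyList();
  }
}
```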
  return writeParquetFile(partitionPath, filename, records, schema, filter, createCommitTime);
}

private String writeParquetFile(String partitionPath, String filename, List<HoodieRecord> records, Schema schema,
Let's move these methods to HoodieTestUtils. These are reused across TestHoodieBloomIndex..
Good point, will do.
@leletan Left a few comments, rest looks ok.
Made changes according to the comments.
(resolved review thread on hoodie-client/src/test/java/com/uber/hoodie/common/HoodieClientTestUtils.java)
    BloomFilter filter,
    boolean createCommitTime) throws IOException, InterruptedException {
  Thread.sleep(1000);
  String commitTime = new SimpleDateFormat("yyyyMMddHHmmss").format(new Date());
There may already be a util in TestUtils to create a new dataFile; use that instead of creating a new file here?
if (createCommitTime) {
  // Also make sure the commit is valid
  new File(basePath + "/" + HoodieTableMetaClient.METAFOLDER_NAME).mkdirs();
Add utils for these in HoodieTestUtils - createMetadataFolder etc. - and then use them here.
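A sketch of the kind of helpers being asked for. The method names follow the comment; the ".hoodie" folder name and the ".commit" file suffix are assumptions about the on-disk layout, not the real HoodieTestUtils API.

```java
import java.io.File;
import java.io.IOException;

final class TestMetaHelpersSketch {

  // Assumed metadata folder name (HoodieTableMetaClient.METAFOLDER_NAME in the real code).
  static final String METAFOLDER_NAME = ".hoodie";

  private TestMetaHelpersSketch() { }

  /** Creates the metadata folder under the dataset base path. */
  static File createMetadataFolder(String basePath) {
    File metaFolder = new File(basePath, METAFOLDER_NAME);
    metaFolder.mkdirs();
    return metaFolder;
  }

  /** Drops an empty commit marker so the given commit time is treated as a valid, completed commit. */
  static File createCommitFile(String basePath, String commitTime) throws IOException {
    File commitFile = new File(createMetadataFolder(basePath), commitTime + ".commit");
    commitFile.createNewFile();
    return commitFile;
  }
}
```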
@leletan Left a few comments on the TestUtils usage; once that is done, please squash your commits to a single commit.
@leletan Please squash the commits and let me know when the diff is ready. I understand the cyclic dependency on HoodieTestUtils, so let's avoid that. Please see if there are utils you can reuse for creating datafiles etc. (instead of creating them in the code as you have right now). We can get this diff landed soon.
Good point, will do.
(force-pushed from 19ad7e6 to 2b3ee20)
Done
    Schema schema,
    BloomFilter filter,
    boolean createCommitTime) throws IOException, InterruptedException {
  Thread.sleep(1000);
@leletan any reason for the sleep?
Actually, the two parquet-writing helper functions are mostly just copied from that file.
@vinothchandar This was added due to the need to create unique commitTimes, since commit times are at the granularity of a second. We should look at adding a testUtils method to create unique commit times, since this sleep() is leaking into many classes.
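A sketch of the kind of test utility suggested here, assuming commit times keep the "yyyyMMddHHmmss" format shown in the diff; this is an illustration of the idea, not an existing HoodieTestUtils method.

```java
import java.text.SimpleDateFormat;
import java.util.Date;

final class UniqueCommitTimesSketch {

  private static long lastSecond = 0L;

  private UniqueCommitTimesSketch() { }

  /**
   * Returns a strictly increasing, second-granularity commit time. If called twice
   * within the same wall-clock second, the later call is bumped one second ahead,
   * so callers no longer need Thread.sleep(1000) between writes.
   */
  static synchronized String nextCommitTime() {
    long nowSecond = System.currentTimeMillis() / 1000L;
    lastSecond = Math.max(lastSecond + 1, nowSecond);
    return new SimpleDateFormat("yyyyMMddHHmmss").format(new Date(lastSecond * 1000L));
  }
}
```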
@n3nash good to go?
 * (e.g: timestamp as prefix), the number of files to be compared gets cut down a lot from range
 * pruning.
 */
// sub-partition to ensure the records can be looked up against files & also prune
This style of commenting doesn't look correct. Could you either move everything here into one Javadoc-style comment, or move some of it inside the method if you want to?
good point, will fix
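One way the comments could be consolidated into a single Javadoc block, as suggested; the class and method names are placeholders and the wording is lifted from the diff context above.

```java
class CommentStyleSketch {

  /**
   * Sub-partitions the incoming records so they can be looked up against files and
   * pruned at the same time. When the record keys carry an ordering (e.g. a timestamp
   * prefix), range pruning cuts the number of files to be compared down a lot.
   */
  void lookupAndPrune() {
    // body elided in this sketch
  }
}
```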
import com.uber.hoodie.table.HoodieTable;
Maybe try to avoid reshuffling these imports?
sure.
@leletan Left 2 minor comments, after which this is good to merge. @vinothchandar FYI
…low global record key lookup
(force-pushed from 2b3ee20 to 9709e2b)
Done. Also rebased and squashed the commits.
@vinothchandar LGTM
this is an awesome contribution... thanks @leletan! merged.. :)