
Parallelized read-write operations in Hoodie Merge phase #370

Merged (1 commit, Apr 12, 2018)

Conversation

@n3nash n3nash commented Apr 2, 2018

  1. Parallelized read-write operations in the Hoodie Merge phase.
  2. Made BufferedIterator generic enough to buffer any type of payload, not just HoodieRecord.

// It caches the exception seen while fetching insert value.
public Optional<Exception> exception = Optional.empty();

public BufferedIteratorPayload(T record, Schema schema) {
  this.record = record;
  try {
    this.insertValue = record.getData().getInsertValue(schema);
Contributor:

why do we need to remove this? This is an expensive operation which we want to offload to the reader thread?

Contributor Author:

Please avoid commenting on "WIP" PRs :)

@n3nash n3nash force-pushed the parallelize_merge branch 3 times, most recently from 0b9c336 to 21a9166 Compare April 3, 2018 00:22
n3nash commented Apr 3, 2018

@vinothchandar Please take a pass at the approach; comments, javadocs, and code cleanup are coming soon after we agree on the approach.

@bvaradar (Contributor) left a comment:

Decoupling reading and writing parquet files from/to a potentially remote server makes sense. The code changes look good in general. I have added some minor comments.

logger.info("starting hoodie writer thread");
// Passing parent thread's TaskContext to newly launched thread for it to access original TaskContext
// properties.
TaskContext$.MODULE$.setTaskContext(sparkThreadTaskContext);
Contributor:

Minor: It looks like there is a contract for using threads inside a Spark task, and it is repeated in the update case too. Can we make a first-class type (like SparkTaskThread) and handle this boilerplate in one place?

@n3nash (Contributor Author), Apr 4, 2018:

TaskContext$.MODULE$.setTaskContext(sparkThreadTaskContext); is the only boilerplate code here. The other code around it is a little different in each scenario, unfortunately. Trying to avoid introducing new first-class types if possible.
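For readers unfamiliar with the pattern under discussion, here is a minimal, self-contained sketch of propagating a thread-local context from a parent thread into a spawned worker. A plain ThreadLocal stands in for Spark's TaskContext; the class and variable names are illustrative, not Spark's actual API.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ContextPropagation {
    // Stand-in for Spark's TaskContext thread-local (illustrative, not Spark's API).
    private static final ThreadLocal<String> TASK_CONTEXT = new ThreadLocal<>();

    public static void main(String[] args) throws Exception {
        TASK_CONTEXT.set("task-0");
        // Capture the parent thread's context value before spawning the worker,
        // since ThreadLocal values do not cross thread boundaries on their own.
        final String parentContext = TASK_CONTEXT.get();
        ExecutorService writerService = Executors.newSingleThreadExecutor();
        Future<String> result = writerService.submit(() -> {
            // Re-install the captured context in the worker thread, mirroring
            // TaskContext$.MODULE$.setTaskContext(sparkThreadTaskContext).
            TASK_CONTEXT.set(parentContext);
            return TASK_CONTEXT.get();
        });
        System.out.println(result.get());
        writerService.shutdown();
    }
}
```

The capture-then-reinstall step is exactly the one-line contract the thread above calls boilerplate.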


public BufferedIterator(final Iterator<T> iterator, final long bufferMemoryLimit,
final Schema schema) {
final Schema schema, final PayloadFunction<T, R> payloadFunc) {
Contributor:

Minor: Rename PayloadFunction to BufferedIteratorPayloadFactory as it is used only in the context of BufferedIterator


package com.uber.hoodie.func.payload;

public class AbstractBufferedIteratorPayload<I, O> {
Contributor:

abstract ?

public HoodieRecordBufferedIteratorPayload(HoodieRecord record, Schema schema) {
this.inputPayload = record;
try {
this.outputPayload = record.getData().getInsertValue(schema);
Contributor:

An observation: I am assuming this serialization is CPU intensive, and with your changes this work is decoupled from the parquet write for the new file-group generation case (Hoodie Create Handle). For the merge case, though, the deserialization seems to be tied to the parquet data fetch, but together they are decoupled from the parquet write.

Contributor Author:

Yes, that's true. The issue is that for CreateHandle the process isn't dependent on anything apart from the input, whereas in the MergeHandle, a merge process with the new records takes place. A larger refactor could help address this but I'm hoping to keep that for later.
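The read/write decoupling discussed in this thread can be sketched as a bounded producer-consumer pipeline: a reader thread fills a bounded queue while the writer drains it, so record fetching overlaps with the parquet write. The class, queue size, and sentinel below are illustrative, not Hudi's actual BufferedIterator.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ReadWritePipeline {
    // Sentinel object marking the end of the record stream.
    private static final String EOF = "<eof>";

    public static void main(String[] args) throws Exception {
        List<String> records = Arrays.asList("r1", "r2", "r3");
        // Bounded buffer: caps memory held between the reader and writer threads.
        BlockingQueue<String> buffer = new ArrayBlockingQueue<>(2);
        ExecutorService pool = Executors.newSingleThreadExecutor();

        // Reader thread: simulates fetching records (e.g. from a parquet file).
        pool.submit(() -> {
            try {
                for (String r : records) {
                    buffer.put(r); // blocks when the buffer is full
                }
                buffer.put(EOF);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        // Writer (here: the main thread) drains the buffer as records arrive.
        String r;
        while (!(r = buffer.take()).equals(EOF)) {
            System.out.println("wrote " + r);
        }
        pool.shutdown();
    }
}
```

The bounded queue is what gives the "buffer" its back-pressure: a slow writer eventually blocks the reader instead of letting memory grow unbounded.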

Member:

lets capture the AIs from here into an issue?


@@ -40,8 +38,8 @@
* internally samples every {@link #RECORD_SAMPLING_RATE}th record and adjusts number of records in
* buffer accordingly. This is done to ensure that we don't OOM.
*/
public class BufferedIterator<K extends HoodieRecordPayload, T extends HoodieRecord<K>> implements
Iterator<BufferedIterator.BufferedIteratorPayload<T>> {
public class BufferedIterator<T, R>
Contributor:

Can you add metrics around ingress/egress throughput and latency?
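As background for the javadoc quoted above, here is a minimal sketch of the sampling-based buffer sizing it describes: estimate the average record size from every Nth record, then cap the buffer's capacity so it stays under the memory limit. All names and constants are illustrative, not the actual BufferedIterator implementation.

```java
public class BufferSizing {
    // Illustrative sampling rate: measure every 4th record.
    static final int RECORD_SAMPLING_RATE = 4;

    // Capacity that keeps (capacity * averageRecordSize) under the memory limit.
    static long adjustedCapacity(long memoryLimitBytes, long avgRecordSizeBytes) {
        return Math.max(1, memoryLimitBytes / avgRecordSizeBytes);
    }

    public static void main(String[] args) {
        long memoryLimit = 1024; // illustrative buffer memory limit in bytes
        long samples = 0, totalBytes = 0;
        for (int i = 0; i < 100; i++) {
            if (i % RECORD_SAMPLING_RATE == 0) { // sample every Nth record
                samples++;
                totalBytes += 64; // pretend each sampled record measures 64 bytes
            }
        }
        long avgRecordSize = totalBytes / samples;
        System.out.println(adjustedCapacity(memoryLimit, avgRecordSize));
    }
}
```

Sampling trades a little accuracy for not paying the size-estimation cost on every record, which is the OOM-avoidance idea the javadoc points at.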

import com.uber.hoodie.func.payload.AbstractBufferedIteratorPayload;
import java.util.function.Function;

public abstract class PayloadFunction<T, R>
Contributor:

Minor: Make it an interface as there is no implementation?


n3nash commented Apr 4, 2018

@bvaradar Thanks for your comments. Unfortunately, the PR wasn't ready for review yet apart from just discussing the approach. Since we agree on the approach and hopefully so does @vinothchandar, I'm going to make necessary code changes now. You might need to do another pass because of this, thanks!

@n3nash n3nash force-pushed the parallelize_merge branch from 21a9166 to 3146eaa Compare April 4, 2018 06:27

n3nash commented Apr 4, 2018

@bvaradar @vinothchandar Cleaned up the code, please take a pass at it.

@n3nash n3nash changed the title from "(WIP) Parallelized read-write operations in Hoodie Merge phase" to "Parallelized read-write operations in Hoodie Merge phase" Apr 4, 2018
@n3nash n3nash force-pushed the parallelize_merge branch from 3146eaa to f005b30 Compare April 4, 2018 06:32
@vinothchandar (Member) left a comment:

Overall fine with the approach. On top of the comments I left, I'd like to see if we can introduce a single abstraction, BufferedIteratorTransform<I,O,F>, which contains the input, the output, and also the function. The function can just be a lambda as well.

@Override
public boolean hasNext() {
try {
this.next = parquetReader.read();
Member:

let's handle the case where hasNext is called without next being called, and vice versa..

Member:

i.e. make it true to the iterator contract

Contributor Author:

done
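The iterator-contract fix agreed above can be sketched like this: buffer at most one record so that repeated hasNext() calls are side-effect free and next() works even without a prior hasNext(). The ReaderIterator name and the Supplier standing in for the parquet reader's read() call (null at end of stream) are illustrative, not the actual Hudi class.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.function.Supplier;

public class ReaderIterator<T> implements Iterator<T> {
    private final Supplier<T> reader; // stand-in for parquetReader.read(); null = EOF
    private T next;                   // the buffered, not-yet-returned record

    public ReaderIterator(Supplier<T> reader) {
        this.reader = reader;
    }

    @Override
    public boolean hasNext() {
        if (next == null) {
            next = reader.get(); // fetch at most one record ahead; idempotent after that
        }
        return next != null;
    }

    @Override
    public T next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        T result = next;
        next = null; // consume the buffered record
        return result;
    }

    public static void main(String[] args) {
        Iterator<Integer> src = Arrays.asList(1, 2).iterator();
        ReaderIterator<Integer> it = new ReaderIterator<>(() -> src.hasNext() ? src.next() : null);
        System.out.println(it.hasNext() && it.hasNext()); // repeated hasNext is safe
        System.out.println(it.next());
        System.out.println(it.next()); // next() without a fresh hasNext() still works
        System.out.println(it.hasNext());
    }
}
```

Note this one-slot design assumes the underlying reader never returns null for a real record, which is the usual convention for EOF-signalling readers.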

*/
public interface BufferedIteratorPayloadFunction<I, O>
extends Function<I, AbstractBufferedIteratorPayload<I, O>> {

Member:

remove extra line


@@ -136,7 +136,8 @@ private String init(String fileId, Iterator<HoodieRecord<T>> newRecordsItr) {
// Load the new records in a map
logger.info("MaxMemoryPerPartitionMerge => " + config.getMaxMemoryPerPartitionMerge());
this.keyToNewRecords = new ExternalSpillableMap<>(config.getMaxMemoryPerPartitionMerge(),
Optional.empty(), new StringConverter(), new HoodieRecordConverter(schema, config.getPayloadClass()));
Optional.empty(), new StringConverter(),
Member:

we are here again..
@bvaradar @n3nash what could cause this folding/formatting change, if we have checkstyle & already pre-formatted code?

Contributor:

These are wraps that happened at column width 100. The checkstyle and code-style configurations under the style/ folder have 120 as the wrap point.
@n3nash : If you have not already imported style/intellij-java-google-style.xml, can you use it?

Contributor Author:

I already use them, not sure what happened here.

Contributor:

ok, dug into it a bit. @n3nash: Can you verify whether IntelliJ -> Preferences -> Code Style -> Java -> Wrapping and Braces is set to 100 (instead of 120)?

Found an issue in IntelliJ where if you have existing code-styles (other than default) which are active, importing a new code-style does not seem to change this value. I am not able to reproduce this consistently though. Deleting all non-default code-styles and/or activating the Default Code-Style and then importing our code-style does the trick.

Let me know how it goes.

Balaji.V

Contributor Author:

I have the following:

[screenshot: IntelliJ code style settings, 2018-04-04]

Changing this to 120 fixes it.

Future writerResult =
writerService.submit(
() -> {
logger.info("starting hoodie writer thread");
Member:

this code seems repeated, is there a way to modularize and share it? Also, revisit the logger statements to make them more contextualized?

Contributor Author:

addressed.

@@ -94,7 +95,9 @@ private void initFile(File writeOnlyFileHandle) throws IOException {
}
writeOnlyFileHandle.createNewFile();

log.info("Spilling to file location " + writeOnlyFileHandle.getAbsolutePath());
log.info("Spilling to file location " + writeOnlyFileHandle.getAbsolutePath() + " in machine ("
Member:

nit: rename machine to host

@n3nash n3nash force-pushed the parallelize_merge branch 2 times, most recently from 00c0f8a to 80fa5c6 Compare April 4, 2018 22:32

n3nash commented Apr 4, 2018

@vinothchandar @bvaradar addressed CR comments.

@vinothchandar I started making changes for BufferedIteratorTransform but realized it's easier to reason about the code with the function and payload separated, so I kept it that way.

@n3nash n3nash force-pushed the parallelize_merge branch from 80fa5c6 to e16d6c2 Compare April 4, 2018 22:39
@bvaradar (Contributor) left a comment:

Looks good to me.

handle = new HoodieCreateHandle(hoodieConfig, commitTime, hoodieTable,
payload.record.getPartitionPath());
handle.write(payload.record, payload.insertValue,
handle =
Member:

is this wrapping at 80 or 120?

handle = new HoodieCreateHandle(hoodieConfig, commitTime, hoodieTable, insertPayload.getPartitionPath());

is only 108 chars..

@bvaradar does checkstyle just check for < 120?

@n3nash if it's formatting incorrectly, can you please fix and make a pass on all files again?

Contributor Author:

it's formatting incorrectly, I made a pass on all files..

// Holds the next entry returned by the parquet reader
private T next;
// Holds the current entry returned by the parquet reader
private T prev;
Member:

I think you can implement this using a single variable next ?

Member:

should we add a test for this file.

Contributor Author:

yeah, done.

* needs to be buffered, the runnable function that needs to be executed in the reader thread and
* return the transformed output based on the writer function
*/
public class BufferedIteratorWrapper<I, O, E> {
Member:

lets rename to something more specific.. It feels like we are overloading Wrapper a lot. How about BufferedIteratorExecutor

Member:

Add a test here?

Contributor Author:

Addressed.

* This class wraps a parquet reader and provides an iterator based api to
* read from a parquet file. This is used in {@link BufferedIterator}
*/
public class ParquetReaderWrappedIterator<T> implements Iterator<T> {
Member:

rename : ParquetReaderIterator or ParquetIterator

Contributor Author:

Addressed

* @param <I> input payload data type
* @param <O> output payload data type
*/
public interface BufferedIteratorPayloadFunction<I, O>
Member:

rename : BufferedIteratorTransform
(what it does is to transform I to O)

Member:

I think we should reconsider merging the Payload and the Transform here.. Current way introduces too many classes/interfaces for what we need to achieve.

Contributor Author:

I do agree it introduces extra classes to achieve something fairly simple... I like the cleaner separation of payload and transformer, though.

Member:

Is there a unique advantage to having them separate? In this scenario, you are just looking for a transform to turn I into O, and it seems straightforward to have them together. Even if you look up something like the Adapter pattern (https://en.wikipedia.org/wiki/Adapter_pattern), the I, O, F are together.. (we don't need the target object to reflect source object changes, though)

extends AbstractBufferedIteratorPayload<GenericRecord, GenericRecord> {

public GenericRecordBufferedIteratorPayload(GenericRecord record) {
this.inputPayload = record;
Member:

can we move this constructor to the abstract class above.. that's the typical pattern. Or did you intend for it to be an interface originally?

Contributor Author:

yeah, I missed refactoring that after the other changes, thanks for pointing it out.

implements BufferedIteratorPayloadFunction<GenericRecord, AbstractBufferedIteratorPayload> {

@Override
public AbstractBufferedIteratorPayload apply(GenericRecord t) {
Member:

This class, for example, is why I strongly feel we should merge payload and transform into a single entity. We always use them together and end up with more complex generic interfaces and micro-implementations in the current way.

Member:

in any case, let's name this class something shorter? The Transform rename would make this GenericRecordBufferedIteratorTransform, which is a tad shorter.

Contributor Author:

Addressed rename

@n3nash n3nash force-pushed the parallelize_merge branch 2 times, most recently from 699805e to 90ba1c3 Compare April 6, 2018 06:31
// It caches the exception seen while fetching insert value.
public Optional<Exception> exception = Optional.empty();

public HoodieRecordBufferedIteratorPayload(HoodieRecord record, Schema schema) {
Member:

I think we should consider using a static method lambda if possible to capture the transform. This whole class structure, with constructor chaining/inheritance, seems like overkill for the amount of work that's done in converting I to O. This is one of the reasons function passing was introduced in Java: to avoid such boilerplate code. Can we take advantage of that?

Contributor Author:

Refactored, please take a pass.
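The static-method-lambda refactor suggested above can be sketched as follows. The Payload class and the lengthPayload transform are hypothetical stand-ins for the Hudi types; the point is that a whole payload subclass collapses into one static method passed as a Function.

```java
import java.util.function.Function;

public class LambdaTransform {
    // A minimal input/output pair replacing a payload class hierarchy (illustrative).
    static final class Payload<I, O> {
        final I input;
        final O output;
        Payload(I input, O output) {
            this.input = input;
            this.output = output;
        }
    }

    // Static factory: the work an entire "XxxBufferedIteratorPayload" subclass
    // would do, expressed as one method usable via a method reference.
    static Payload<String, Integer> lengthPayload(String record) {
        return new Payload<>(record, record.length());
    }

    public static void main(String[] args) {
        // The transform is just a Function; no interface/subclass boilerplate needed.
        Function<String, Payload<String, Integer>> transform = LambdaTransform::lengthPayload;
        Payload<String, Integer> p = transform.apply("hoodie");
        System.out.println(p.input + " -> " + p.output);
    }
}
```

A call site that needs a different transform swaps in a different lambda rather than defining a new class.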

@n3nash n3nash force-pushed the parallelize_merge branch from 90ba1c3 to 967a338 Compare April 9, 2018 20:22

n3nash commented Apr 9, 2018

@vinothchandar Refactored, please take a pass.

@vinothchandar vinothchandar merged commit 720e42f into apache:master Apr 12, 2018
vinishjail97 pushed a commit to vinishjail97/hudi that referenced this pull request Dec 15, 2023
4 participants