
Adding documentation for migration guide and COW vs MOR tradeoffs #470

Merged

Conversation

@n3nash (Contributor) commented Sep 25, 2018

No description provided.

@bvaradar (Contributor) left a comment

@n3nash: Made a pass. Left a few comments. Rest looks good.

| Trade-off | CopyOnWrite | MergeOnRead |
|-------------- |------------------| ------------------|
| Parquet File Size | Small (high update (I/O) cost) | Large (low update cost) |
| Write Amplification | High | Low (depending on compaction strategy) |

### Hudi Views
Contributor

Can you mention that this is w.r.t. the Merge-On-Read storage type?

docs/concepts.md Outdated
| Trade-off | CopyOnWrite | MergeOnRead |
|-------------- |------------------| ------------------|
| Data Latency | High | Low |
| Query Latency | Low (raw columnar performance) | High (merge columnar + row based delta) |
Contributor

Compare query latency only w.r.t. views? With storage type, it's a little confusing since query latency is configurable.

Member

yes.. can we remove this from here?

Contributor Author

done


### Approach 2

Import your existing dataset into a Hudi managed dataset using the HDFSParquetImporter tool. As the name suggests, this only works if your existing dataset is in the parquet file format. The tool essentially starts a Spark job that reads the existing parquet dataset and converts it into a Hudi managed dataset by re-writing all the data. Since all the data is Hudi managed, none of the limitations of Approach 1 apply here. Updates spanning any partitions can be applied to this dataset, and Hudi will efficiently make them available to queries. Note that not only do you get to use all the Hudi primitives on this dataset, there are other additional advantages as well. Hudi automatically manages file sizes of a Hudi managed dataset: you can define the desired file size when converting the dataset with the tool, and Hudi will ensure it writes out files adhering to that config. It will also ensure that undersized files get corrected later by routing some new inserts into them rather than writing out new small files.
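For illustration, a minimal sketch of what this one-time conversion could look like through the Spark DataSource API instead of the importer tool (assuming a spark-shell session; the option keys, field names and paths below are assumptions and should be verified against the hoodie datasource configs):

```scala
// Hypothetical one-time conversion: read the entire existing parquet dataset and
// re-write it as a Hudi managed dataset. The field names (_row_key, datestr, ts),
// option keys and target file size are assumptions, not prescribed values.
val existingDF = spark.read.format("parquet").load("/data/existing_dataset")

existingDF.write
  .format("com.uber.hoodie")
  .option("hoodie.table.name", "my_hudi_dataset")
  .option("hoodie.datasource.write.operation", "bulk_insert")           // disk-based bulk load path
  .option("hoodie.datasource.write.recordkey.field", "_row_key")
  .option("hoodie.datasource.write.partitionpath.field", "datestr")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .option("hoodie.parquet.max.file.size", (120 * 1024 * 1024).toString) // desired base file size
  .mode("overwrite")
  .save("/data/hudi_dataset")
```

Doing the load this way gives Hudi control over file sizing from the very first write, which is the main advantage called out above.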
Contributor

This approach is essentially creating a new (bootstrap) table managed by Hudi.

HDFSParquetImporter is one of the options here to import parquet files, right?

Clients are also free to read any Spark DataSource and write to a new location as a Hudi data-source, right?

For huge datasets, this can be as simple as:
for (partitionPath <- partitionsInSourceDataset) {  // partitionsInSourceDataset: placeholder list of partition paths in the source dataset
  val inputDF = spark.read.format("any_input_format").load(partitionPath)
  inputDF.write.format("com.uber.hoodie").options(hoodieWriteOptions).save(basePath)  // hoodieWriteOptions: placeholder for the usual hoodie write options
}

There are other options too, like custom Java/Scala programs using HoodieWriteClient.

Member

+1 I think it would be good to point this out too.. it's very simple to do that with the DataSource API

Contributor Author

done

@n3nash force-pushed the documentation_upgraded_migration_tradeoff branch from b2d31da to b7e7268 on September 26, 2018 23:40
@vinothchandar (Member) left a comment

Left some comments

docs/concepts.md Outdated
## Terminologies

* `Hudi Dataset`
A structured Hive/Spark table managed by Hudi. Hudi supports both partitioned and non-partitioned Hive tables.
Member

A dataset can back multiple tables, right?

Contributor Author

yes, changed slightly


docs/concepts.md Outdated
| Trade-off | CopyOnWrite | MergeOnRead |
|-------------- |------------------| ------------------|
| Data Latency | High | Low |
| Query Latency | Low (raw columnar performance) | High (merge columnar + row based delta) |
| Update cost (I/O) | High (rewrite entire parquet) | Low (append to delta file) |
Member

can you make everything relative terms? "High" -> "Higher", "Low" -> "Lower", "Small" -> "Smaller".. Currently I get the impression that, e.g., COW writes small parquet files.

Contributor Author

done

@@ -158,7 +158,8 @@ summary: "Here we list all possible configurations and what they mean"

Writing data via Hoodie happens as a Spark job, and thus the general rules of Spark debugging apply here too. Below is a list of things to keep in mind if you are looking to improve performance or reliability.

- **Right operations** : Use `bulkinsert` to load new data into a table, and there on use `upsert`/`insert`. Difference between them is that bulk insert uses a disk based write path to scale to load large inputs without need to cache it.
- **Write operations** : Use `bulkinsert` to load new data into a table, and there on use `upsert`/`insert`.
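As a side note, a minimal sketch of how the operation is switched once the initial load is done (the option key and operation values are assumptions based on the hoodie datasource configs, and `newBatchDF`/`basePath` are placeholders):

```scala
// Hypothetical follow-up write: after the initial bulkinsert load, subsequent
// batches reuse the same datasource write but switch the operation to upsert
// (or insert for append-only data). Option key/values are assumptions.
newBatchDF.write
  .format("com.uber.hoodie")
  .option("hoodie.datasource.write.operation", "upsert")
  // ... record key / partition path / precombine / table name options as usual ...
  .mode("append")
  .save(basePath)
```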
Member

that's genuinely funny.. "Right"

Contributor Author

haha yeah

@@ -58,7 +58,9 @@ export SPARK_CONF_DIR=$SPARK_HOME/conf
export PATH=$JAVA_HOME/bin:$HIVE_HOME/bin:$HADOOP_HOME/bin:$SPARK_INSTALL/bin:$PATH
```

### DataSource API
### Two different APIs
Member

Unless there is some explanation about which API to use when (which may be good to add), can we remove the "Two different APIs" heading? It just nests the doc more without clear value.

Contributor Author

Added some explanation. I think this heading gives structure to the 2 subheadings; otherwise, 2 subheadings without an umbrella looked weird when I was reading the whole document.

@@ -215,11 +219,11 @@ ALTER TABLE `hoodie_rt` ADD IF NOT EXISTS PARTITION (datestr='2015-03-17') LOCAT



## Querying The Dataset
### Querying The Dataset
Member

can we find another, more specific heading here maybe? Having "Query a Hoodie dataset" > "Querying The Dataset" as a heading hierarchy seems unclear to me.

Contributor Author

done

@@ -263,7 +267,7 @@ select count(*) from hive.default.hoodie_test



## Incremental Queries
## Incremental Queries of a Hoodie dataset
Member

is that really needed? isn't "Hoodie dataset" obvious from context?

Contributor Author

We generally mention a Hoodie dataset everywhere, so I just added this to be standardized.


### Approach 1

Hudi can be used to manage an existing dataset without affecting/altering the historical data already present in it. Hudi is built to be compatible with such a mixed dataset, with the caveat that a given Hive partition is either completely Hudi managed or not managed at all; thus the lowest granularity at which Hudi manages a dataset is a Hive partition. Start using the datasource API or the WriteClient to write to the dataset, and make sure you start writing to a new partition. Note that since the historical partitions are not managed by Hudi, none of the primitives provided by Hudi work on the data in those partitions. More concretely, one cannot perform upserts or incremental pull on such older partitions that are not managed by the Hudi dataset.
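For illustration, a minimal sketch of what writing only the new partitions through the datasource could look like under this approach (the column names, option keys, partition cutoff and paths are assumptions, not part of the guide):

```scala
import org.apache.spark.sql.functions.col

// Hypothetical: only incoming data for new partitions is routed through Hudi,
// while the older partitions remain plain, non-Hudi Hive partitions. The column
// names (_row_key, datestr, ts) and the partition cutoff are assumptions.
val newPartitionsDF = incomingDF.filter(col("datestr") >= "2018-10-01")

newPartitionsDF.write
  .format("com.uber.hoodie")
  .option("hoodie.table.name", "my_dataset")
  .option("hoodie.datasource.write.operation", "upsert")
  .option("hoodie.datasource.write.recordkey.field", "_row_key")
  .option("hoodie.datasource.write.partitionpath.field", "datestr")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .mode("append")
  .save("/data/dataset_base_path")
```

Upserts and incremental pull then work on these newly written partitions, while the historical partitions behave exactly as before.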
Member

"make sure you start writing to a new partition or convert your last N partitions into Hudi instead of entire table"?

Contributor Author

done



@n3nash force-pushed the documentation_upgraded_migration_tradeoff branch 2 times, most recently from 338abae to fda78e4 on October 1, 2018 19:26
@vinothchandar (Member)

@n3nash is this still WIP?

@n3nash changed the title from "(WIP) Adding documentation for migration guide and COW vs MOR tradeoffs" to "Adding documentation for migration guide and COW vs MOR tradeoffs" on Oct 18, 2018
@n3nash (Contributor Author) commented Oct 18, 2018

@vinothchandar removed WIP

@n3nash force-pushed the documentation_upgraded_migration_tradeoff branch from fda78e4 to 0be3fd6 on October 18, 2018 23:28

5. ["Hudi: Large-Scale, Near Real-Time Pipelines at Uber"](https://databricks
Member

thanks for doing this..

@vinothchandar merged commit 48aa026 into apache:master Oct 19, 2018