
Adding documentation for migration guide and COW vs MOR tradeoffs #470

Merged

Conversation

@n3nash (Contributor) commented Sep 25, 2018

No description provided.

@bvaradar (Contributor) left a comment

@n3nash: Made a pass. Left a few comments. Rest looks good.

| Trade-off | CopyOnWrite | MergeOnRead |
|-------------- |------------------| ------------------|
| Parquet File Size | Small (high update (I/O) cost) | Large (low update cost) |
| Write Amplification | High | Low (depending on compaction strategy) |

### Hudi Views
Contributor

Can you mention that this is w.r.t. the Merge-On-Read storage type?

docs/concepts.md Outdated
| Trade-off | CopyOnWrite | MergeOnRead |
|-------------- |------------------| ------------------|
| Data Latency | High | Low |
| Query Latency | Low (raw columnar performance) | High (merge columnar + row based delta) |
Contributor

Compare query latency only w.r.t. views? With storage type, it's a little confusing since query latency is configurable.

Member

yes.. can we remove this from here?

Contributor Author

done


### Approach 2

Import your existing dataset into a Hudi managed dataset using the HDFSParquetImporter tool. As the name suggests, this only works if your existing dataset is in the parquet file format. The tool essentially starts a Spark job that reads the existing parquet dataset and converts it into a Hudi managed dataset by re-writing all the data. Since all the data is Hudi managed, none of the limitations of Approach 1 apply here. Updates spanning any partitions can be applied to this dataset, and Hudi will efficiently make them available to queries. Note that not only do you get to use all the Hudi primitives on this dataset, there are other additional advantages as well. Hudi automatically manages file sizes of a Hudi managed dataset: you can define the desired file size when converting the dataset with the tool, and Hudi will ensure it writes out files adhering to that config. It will also ensure that undersized files get corrected later by routing some new inserts into them rather than writing out new small files.
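For illustration, a minimal sketch of what this one-time conversion could look like through the Spark DataSource API instead of the importer tool (assuming a spark-shell session; the option keys, field names and paths below are assumptions and should be verified against the hoodie datasource configs):

```scala
// Hypothetical one-time conversion: read the entire existing parquet dataset and
// re-write it as a Hudi managed dataset. The field names (_row_key, datestr, ts),
// option keys and target file size are assumptions, not prescribed values.
val existingDF = spark.read.format("parquet").load("/data/existing_dataset")

existingDF.write
  .format("com.uber.hoodie")
  .option("hoodie.table.name", "my_hudi_dataset")
  .option("hoodie.datasource.write.operation", "bulk_insert")           // disk-based bulk load path
  .option("hoodie.datasource.write.recordkey.field", "_row_key")
  .option("hoodie.datasource.write.partitionpath.field", "datestr")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .option("hoodie.parquet.max.file.size", (120 * 1024 * 1024).toString) // desired base file size
  .mode("overwrite")
  .save("/data/hudi_dataset")
```

Doing the load this way gives Hudi control over file sizing from the very first write, which is the main advantage called out above.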
Contributor

This approach is essentially creating a new (bootstrap) table managed by Hudi.

HDFSParquetImporter is one of the options here to import parquet files, right?

Clients are also free to read any Spark DataSource and write to a new location as a Hudi data-source, right?

For huge datasets, this can be as simple as:
for (partitionPath <- partitionsInSourceDataset) {  // partitionsInSourceDataset: placeholder list of partition paths in the source dataset
  val inputDF = spark.read.format("any_input_format").load(partitionPath)
  inputDF.write.format("com.uber.hoodie").options(hoodieWriteOptions).save(basePath)  // hoodieWriteOptions: placeholder for the usual hoodie write options
}

There are other options too, like custom Java/Scala programs using HoodieWriteClient.

Member

+1 I think it would be good to point this out too.. it's very simple to do that with the DataSource API

Contributor Author

done

@n3nash force-pushed the documentation_upgraded_migration_tradeoff branch from b2d31da to b7e7268 on September 26, 2018 23:40
@vinothchandar (Member) left a comment

Left some comments

docs/concepts.md Outdated
## Terminologies

* `Hudi Dataset`
A structured Hive/Spark table managed by Hudi. Hudi supports both partitioned and non-partitioned Hive tables.
Member

A dataset can back multiple tables, right?

Contributor Author

yes, changed slightly


docs/concepts.md Outdated
| Trade-off | CopyOnWrite | MergeOnRead |
|-------------- |------------------| ------------------|
| Data Latency | High | Low |
| Query Latency | Low (raw columnar performance) | High (merge columnar + row based delta) |
| Update cost (I/O) | High (rewrite entire parquet) | Low (append to delta file) |
Member

can you make everything relative terms? "High" -> "Higher", "Low" -> "Lower", "Small" -> "Smaller".. Currently I get the impression that, e.g., COW writes small parquet files.

Contributor Author

done

@@ -158,7 +158,8 @@ summary: "Here we list all possible configurations and what they mean"

Writing data via Hoodie happens as a Spark job, and thus the general rules of Spark debugging apply here too. Below is a list of things to keep in mind if you are looking to improve performance or reliability.

- **Right operations** : Use `bulkinsert` to load new data into a table, and there on use `upsert`/`insert`. Difference between them is that bulk insert uses a disk based write path to scale to load large inputs without need to cache it.
- **Write operations** : Use `bulkinsert` to load new data into a table, and there on use `upsert`/`insert`.
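As a side note, a minimal sketch of how the operation is switched once the initial load is done (the option key and operation values are assumptions based on the hoodie datasource configs, and `newBatchDF`/`basePath` are placeholders):

```scala
// Hypothetical follow-up write: after the initial bulkinsert load, subsequent
// batches reuse the same datasource write but switch the operation to upsert
// (or insert for append-only data). Option key/values are assumptions.
newBatchDF.write
  .format("com.uber.hoodie")
  .option("hoodie.datasource.write.operation", "upsert")
  // ... record key / partition path / precombine / table name options as usual ...
  .mode("append")
  .save(basePath)
```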
Member

that's genuinely funny.. "Right"

Contributor Author

haha yeah

@@ -58,7 +58,9 @@ export SPARK_CONF_DIR=$SPARK_HOME/conf
export PATH=$JAVA_HOME/bin:$HIVE_HOME/bin:$HADOOP_HOME/bin:$SPARK_INSTALL/bin:$PATH
```

### DataSource API
### Two different APIs
Member

Unless there is some explanation about which API to use when (which may be good to add), can we remove the "Two different APIs" heading? It just nests the doc more without clear value.

Contributor Author

Added some explanation. I think this heading gives structure to the 2 subheadings; otherwise, 2 subheadings without an umbrella looked weird when I was reading the whole document.

@@ -215,11 +219,11 @@ ALTER TABLE `hoodie_rt` ADD IF NOT EXISTS PARTITION (datestr='2015-03-17') LOCAT



## Querying The Dataset
### Querying The Dataset
Member

can we find another, more specific heading here maybe? Having "Query a Hoodie dataset" > "Querying The Dataset" as a heading hierarchy seems unclear to me.

Contributor Author

done

@@ -263,7 +267,7 @@ select count(*) from hive.default.hoodie_test



## Incremental Queries
## Incremental Queries of a Hoodie dataset
Member

is that really needed? isn't "Hoodie dataset" obvious from context?

Contributor Author

We generally mention a Hoodie dataset everywhere, so I just added this to be standardized.


### Approach 1

Hudi can be used to manage an existing dataset without affecting/altering the historical data already present in it. Hudi is built to be compatible with such a mixed dataset, with the caveat that a given Hive partition is either completely Hudi managed or not managed at all; thus the lowest granularity at which Hudi manages a dataset is a Hive partition. Start using the datasource API or the WriteClient to write to the dataset, and make sure you start writing to a new partition. Note that since the historical partitions are not managed by Hudi, none of the primitives provided by Hudi work on the data in those partitions. More concretely, one cannot perform upserts or incremental pull on such older partitions that are not managed by the Hudi dataset.
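For illustration, a minimal sketch of what writing only the new partitions through the datasource could look like under this approach (the column names, option keys, partition cutoff and paths are assumptions, not part of the guide):

```scala
import org.apache.spark.sql.functions.col

// Hypothetical: only incoming data for new partitions is routed through Hudi,
// while the older partitions remain plain, non-Hudi Hive partitions. The column
// names (_row_key, datestr, ts) and the partition cutoff are assumptions.
val newPartitionsDF = incomingDF.filter(col("datestr") >= "2018-10-01")

newPartitionsDF.write
  .format("com.uber.hoodie")
  .option("hoodie.table.name", "my_dataset")
  .option("hoodie.datasource.write.operation", "upsert")
  .option("hoodie.datasource.write.recordkey.field", "_row_key")
  .option("hoodie.datasource.write.partitionpath.field", "datestr")
  .option("hoodie.datasource.write.precombine.field", "ts")
  .mode("append")
  .save("/data/dataset_base_path")
```

Upserts and incremental pull then work on these newly written partitions, while the historical partitions behave exactly as before.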
Member

"make sure you start writing to a new partition or convert your last N partitions into Hudi instead of entire table"?

Contributor Author

done



@n3nash force-pushed the documentation_upgraded_migration_tradeoff branch 2 times, most recently from 338abae to fda78e4 on October 1, 2018 19:26
@vinothchandar (Member)

@n3nash is this still WIP?

@n3nash changed the title from "(WIP) Adding documentation for migration guide and COW vs MOR tradeoffs" to "Adding documentation for migration guide and COW vs MOR tradeoffs" on Oct 18, 2018
@n3nash (Contributor Author) commented Oct 18, 2018

@vinothchandar removed WIP

@n3nash force-pushed the documentation_upgraded_migration_tradeoff branch from fda78e4 to 0be3fd6 on October 18, 2018 23:28

5. ["Hudi: Large-Scale, Near Real-Time Pipelines at Uber"](https://databricks
Member

thanks for doing this..

@vinothchandar merged commit 48aa026 into apache:master Oct 19, 2018