Adding documentation for migration guide and COW vs MOR tradeoffs #470
Conversation
@n3nash : Made a pass. Left a few comments. Rest looks good.
| Parquet File Size | Small (high update (I/O) cost) | Large (low update cost) |
| Write Amplification | High | Low (depending on compaction strategy) |

### Hudi Views
Can you mention that this is w.r.t. the Merge-On-Read storage type?
docs/concepts.md
Outdated
| Trade-off | CopyOnWrite | MergeOnRead |
|-------------- |------------------| ------------------|
| Data Latency | High | Low |
| Query Latency | Low (raw columnar performance) | High (merge columnar + row based delta) |
Compare query latency only w.r.t. views? With storage type, it's a little confusing since query latency is configurable.
yes.. can we remove this from here?
done
docs/migration_guide.md
Outdated
### Approach 2
Import your existing dataset into a Hudi managed dataset using the HDFSParquetImporter tool. As the name suggests, this only works if your existing dataset is in the parquet file format. This tool essentially starts a Spark job to read the existing parquet dataset and convert it into a Hudi managed dataset by re-writing all the data. Since all the data is Hudi managed, none of the limitations of Approach 1 apply here. Updates spanning any partitions can be applied to this dataset, and Hudi will efficiently make the updates available to queries. Note that not only do you get to use all the Hudi primitives on this dataset, there are other additional advantages as well. Hudi automatically manages file sizes of a Hudi managed dataset. You can define the desired file size when converting the dataset using the tool, and Hudi will ensure it writes out files adhering to that config. It will also ensure that files that end up too small are corrected later, by routing some new inserts into those small files rather than writing new small ones.
This approach is essentially creating a new (bootstrap) table managed by Hudi.
HDFSParquetImporter is one of the options here to import parquet files, right?
Clients are also free to read from any Spark DataSource and write to a new location as a Hudi data-source, right?
For huge datasets, this can be as simple as:

    // for each partition in the source dataset, read it with any Spark DataSource
    // and write it back out as a Hudi (com.uber.hoodie) data-source
    for (partition <- listOfPartitionsInSourceDataset) {
      val inputDF = spark.read.format("any_input_format").load("partition_path")
      inputDF.write.format("com.uber.hoodie").option()....save("basePath")
    }

There are other options too, like custom Java/Scala scripts using HoodieWriteClient.
+1 I think it would be good to point this out too.. it's very simple to do that with the DataSource API
done
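To expand the elided `.option()....` in the sketch above, a fuller per-partition import with the DataSource API might look like the following. This is only a rough sketch: the option keys for record key, partition path, precombine field, operation, and table name are assumptions based on the com.uber.hoodie DataSource write options of this era, and the paths and field names (`uuid`, `datestr`, `ts`) are placeholders.

```scala
import org.apache.spark.sql.SaveMode

// Assumes an active SparkSession named `spark` (e.g. in spark-shell).
val sourceBase   = "/data/source_parquet"   // placeholder: existing parquet dataset
val hudiBasePath = "/data/hudi/my_table"    // placeholder: new Hudi base path
val partitions   = Seq("2015-03-16", "2015-03-17")

for (partition <- partitions) {
  val inputDF = spark.read.format("parquet").load(s"$sourceBase/datestr=$partition")

  inputDF.write.format("com.uber.hoodie")
    // Assumed option keys; check DataSourceWriteOptions for the exact constants.
    .option("hoodie.datasource.write.recordkey.field", "uuid")
    .option("hoodie.datasource.write.partitionpath.field", "datestr")
    .option("hoodie.datasource.write.precombine.field", "ts")
    .option("hoodie.datasource.write.operation", "bulk_insert")
    .option("hoodie.table.name", "my_table")
    .mode(SaveMode.Append)
    .save(hudiBasePath)
}
```

Using `bulk_insert` for the one-time import keeps the write path disk-based; subsequent incremental writes to the same base path would typically switch to `upsert`.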
Force-pushed from b2d31da to b7e7268.
Left some comments
docs/concepts.md
Outdated
## Terminologies

* `Hudi Dataset`
  A structured hive/spark table managed by Hudi. Hudi supports both partitioned and non-partitioned Hive tables.
A dataset can back multiple tables, right?
yes, changed slightly
docs/concepts.md
Outdated
| Trade-off | CopyOnWrite | MergeOnRead |
|-------------- |------------------| ------------------|
| Data Latency | High | Low |
| Query Latency | Low (raw columnar performance) | High (merge columnar + row based delta) |
| Update cost (I/O) | High (rewrite entire parquet) | Low (append to delta file) |
Can you make everything relative terms? "High" -> "Higher", "Low" -> "Lower", "Small" -> "Smaller". Currently I get the impression that, e.g., COW writes small parquet files.
done
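As a side note on the table above, the CopyOnWrite/MergeOnRead choice maps to a single write option in the DataSource API. A minimal sketch, assuming the `hoodie.datasource.write.storage.type` key and its `COPY_ON_WRITE`/`MERGE_ON_READ` values, and reusing the placeholder `inputDF`/`hudiBasePath` names from the earlier import sketch:

```scala
// MERGE_ON_READ trades higher query latency (merge columnar + delta) for
// lower data latency and update cost; COPY_ON_WRITE is the default.
inputDF.write.format("com.uber.hoodie")
  .option("hoodie.datasource.write.storage.type", "MERGE_ON_READ")
  .option("hoodie.table.name", "my_table")
  .mode(SaveMode.Append)
  .save(hudiBasePath)
```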
@@ -158,7 +158,8 @@ summary: "Here we list all possible configurations and what they mean"

 Writing data via Hoodie happens as a Spark job and thus general rules of spark debugging applies here too. Below is a list of things to keep in mind, if you are looking to improving performance or reliability.

-- **Right operations** : Use `bulkinsert` to load new data into a table, and there on use `upsert`/`insert`. Difference between them is that bulk insert uses a disk based write path to scale to load large inputs without need to cache it.
+- **Write operations** : Use `bulkinsert` to load new data into a table, and there on use `upsert`/`insert`.
that's genuinely funny.. "Right"
haha yeah
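To make the `bulkinsert` vs `upsert`/`insert` recommendation above concrete, here is a minimal sketch with the DataSource API. The `hoodie.datasource.write.operation` key and its `bulk_insert`/`upsert`/`insert` values are assumptions based on DataSourceWriteOptions; the `writeHoodie` helper, field names, and paths are hypothetical placeholders.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

// Hypothetical helper; option keys are assumed from DataSourceWriteOptions.
def writeHoodie(df: DataFrame, operation: String, basePath: String): Unit =
  df.write.format("com.uber.hoodie")
    .option("hoodie.datasource.write.operation", operation)
    .option("hoodie.datasource.write.recordkey.field", "uuid")        // placeholder field
    .option("hoodie.datasource.write.partitionpath.field", "datestr") // placeholder field
    .option("hoodie.datasource.write.precombine.field", "ts")         // placeholder field
    .option("hoodie.table.name", "my_table")
    .mode(SaveMode.Append)
    .save(basePath)

// Placeholder inputs; assumes an active SparkSession named `spark`.
val initialDF     = spark.read.format("parquet").load("/data/initial_load")
val incrementalDF = spark.read.format("parquet").load("/data/daily_batch")

// Initial load: bulk_insert uses a disk-based write path, so the input does
// not need to be cached. Thereafter use upsert (or insert) for each batch.
writeHoodie(initialDF, "bulk_insert", "/data/hudi/my_table")
writeHoodie(incrementalDF, "upsert", "/data/hudi/my_table")
```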
docs/quickstart.md
Outdated
@@ -58,7 +58,9 @@ export SPARK_CONF_DIR=$SPARK_HOME/conf
export PATH=$JAVA_HOME/bin:$HIVE_HOME/bin:$HADOOP_HOME/bin:$SPARK_INSTALL/bin:$PATH

### DataSource API
### Two different API's
Unless there is some explanation about which API to use when (which may be good to add), can we remove the "Two different APIs" heading? It just nests the doc more without clear value.
Added some explanation. I think this heading gives structure to the 2 subheadings; otherwise, 2 subheadings without an umbrella looked weird when I was reading the whole document.
docs/quickstart.md
Outdated
@@ -215,11 +219,11 @@ ALTER TABLE `hoodie_rt` ADD IF NOT EXISTS PARTITION (datestr='2015-03-17') LOCAT

## Querying The Dataset
### Querying The Dataset
Can we find another, more specific heading here maybe? Having "Query a Hoodie dataset" > "Querying The Dataset" as the heading hierarchy seems unclear to me.
done
@@ -263,7 +267,7 @@ select count(*) from hive.default.hoodie_test

## Incremental Queries
## Incremental Queries of a Hoodie dataset
Is that really needed? Isn't "Hoodie dataset" obvious from context?
We generally mention a Hoodie dataset everywhere, so I just added this to be standardized.
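Since the heading in question covers incremental queries, a minimal sketch of an incremental pull with the DataSource read API may help. The `hoodie.datasource.view.type` and `hoodie.datasource.read.begin.instanttime` keys and the `incremental` value are assumptions based on the com.uber.hoodie DataSource read options of this era; the base path and commit time are placeholders.

```scala
// Read only records written after the given commit time from a Hoodie dataset.
val incrementalDF = spark.read.format("com.uber.hoodie")
  .option("hoodie.datasource.view.type", "incremental")
  .option("hoodie.datasource.read.begin.instanttime", "20150317000000") // placeholder commit time
  .load("/data/hudi/my_table")

incrementalDF.createOrReplaceTempView("hoodie_incremental")
spark.sql("select count(*) from hoodie_incremental").show()
```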
docs/migration_guide.md
Outdated
### Approach 1
Hudi can be used to manage an existing dataset without affecting/altering the historical data already present in the dataset. Hudi is implemented to be compatible with such a mixed dataset, with the caveat that a Hive partition is either completely Hudi managed or not at all. Thus, the lowest granularity at which Hudi manages a dataset is a Hive partition. Start using the datasource API or the WriteClient to write to the dataset, and make sure you start writing to a new partition. Note that since the historical partitions are not managed by Hudi, none of the primitives provided by Hudi work on the data in those partitions. More concretely, one cannot perform upserts or incremental pull on such older partitions not managed by the Hudi dataset.
"make sure you start writing to a new partition or convert your last N partitions into Hudi instead of entire table"?
done
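For Approach 1, the same DataSource write applies, but only to the new partitions of the existing dataset. A short sketch reusing the hypothetical `writeHoodie` helper from the earlier sketch; the paths and partition value are placeholders:

```scala
// Only the new partition flows through Hudi; older partitions stay as plain
// parquet, so upserts and incremental pull apply only to Hudi-managed partitions.
val newPartitionDF = spark.read.format("parquet").load("/data/staging/datestr=2015-03-17")
writeHoodie(newPartitionDF, "upsert", "/data/existing_dataset")
```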
Force-pushed from 338abae to fda78e4.
@n3nash is this still WIP?
@vinothchandar removed WIP
…ving some docs around for more clarity
Force-pushed from fda78e4 to 0be3fd6.
5. ["Hudi: Large-Scale, Near Real-Time Pipelines at Uber"](https://databricks |
thanks for doing this..