1 parent
f7278e0
commit 3073e99
Showing 14 changed files with 183 additions and 23 deletions.
69 changes: 69 additions & 0 deletions
tech-summary/lectures/intro_to_apache_spark_for_java_and_scala_developers.md
@@ -0,0 +1,69 @@
# Intro to Apache Spark for Java and Scala Developers

Originally from OneNote notes recorded on May 24, 2018
[video](https://www.youtube.com/watch?v=x8xXXqvhZq8), by Ted Malaska

## Takeaways

- One driver and many executors, plus a shuffle service
- The DAG is small, so it can be passed to each of the executors

<img src="resources/imgs/spark_ted_malaska_distribute_program.png" alt="spark_ted_malaska_distribute_program" width="600"/>

## Problem with Hadoop
<img src="resources/imgs/spark_ted_malaska_shuffle.png" alt="spark_ted_malaska_shuffle" width="600"/>

- Shuffle transfers = mappers × reducers: every mapper may send data to every reducer (for example, 100 mappers and 50 reducers give up to 5,000 transfers)
- Single point of failure
- Data transfer can become the bottleneck

## Spark

Keywords:
A single **driver** **broadcasts** tasks to the **executors** and **takes** the results back

### RDD
- Immutable data that can be replayed to handle single-point failures
- RDD: in memory, no schema
- DataFrame: in memory, with a schema
- **A DataFrame is an RDD with a schema**
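
The replay idea can be sketched in plain Java without Spark. This is only a toy model under stated assumptions: the class `LineageSketch` and its methods are hypothetical names invented for illustration, not Spark APIs. The dataset never changes; it remembers its source and the steps applied to it, so a lost result can be rebuilt by running the steps again.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

/** Hypothetical toy model of an immutable dataset that remembers how it was derived. */
final class LineageSketch {
    private final List<Integer> source;             // stable input, copied once and never modified
    private final Function<Integer, Integer> step;  // the recorded chain of transformations

    LineageSketch(List<Integer> source, Function<Integer, Integer> step) {
        this.source = new ArrayList<>(source);
        this.step = step;
    }

    /** Returns a new dataset; the receiver is left untouched (immutability). */
    LineageSketch map(Function<Integer, Integer> next) {
        return new LineageSketch(source, step.andThen(next));
    }

    /** Replays the recorded steps from the source, for example after a result was lost. */
    List<Integer> compute() {
        return source.stream().map(step).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        LineageSketch base = new LineageSketch(Arrays.asList(1, 2, 3), x -> x);
        LineageSketch doubled = base.map(x -> x * 2);
        System.out.println(doubled.compute()); // replaying gives the same result every time
        System.out.println(base.compute());    // the original dataset is unchanged
    }
}
```

Because nothing is mutated, calling `compute()` a second time is safe, which is what makes replay a workable recovery strategy in this sketch.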

### DAG
<img src="resources/imgs/spark_ted_malaska_dag.png" alt="spark_ted_malaska_dag" width="600"/>

- Actions: count, take, foreach
- Transformations: map, reduceByKey, groupByKey, join (by key)
- With DAG + RDD, it is easier to recover when anything goes wrong
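
The split between transformations and actions can be illustrated with an analogy from the Java standard library. This is a sketch, not Spark code: in Java Streams, intermediate operations such as `map` only record a plan, much like transformations, while a terminal operation such as `collect` triggers the work, much like an action. The class `LazyPipelineSketch` and its counter are hypothetical names used only for this illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Analogy only: Java Streams standing in for a lazily evaluated plan. */
final class LazyPipelineSketch {
    /** Counts how many times the map function has actually run. */
    static final AtomicInteger mapCalls = new AtomicInteger();

    /** Builds the plan only: the map step is recorded, not executed. */
    static Stream<Integer> plan(List<Integer> input) {
        return input.stream().map(x -> {
            mapCalls.incrementAndGet();
            return x * 10;
        });
    }

    public static void main(String[] args) {
        Stream<Integer> planned = plan(Arrays.asList(1, 2, 3));
        System.out.println("map calls after planning: " + mapCalls.get());

        // The terminal operation plays the role of an action and runs the pipeline.
        List<Integer> result = planned.collect(Collectors.toList());
        System.out.println("map calls after collecting: " + mapCalls.get());
        System.out.println(result);
    }
}
```

The counter stays unchanged until the terminal operation runs, which mirrors how a chain of transformations only describes the DAG until an action asks for a result.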

### FlumeJava
Writing a distributed program is the same as writing a local one

<img src="resources/imgs/spark_ted_malaska_flume_java.png" alt="spark_ted_malaska_flume_java" width="600"/>
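
To make the "same as writing a local one" point concrete, here is a purely local word count written with Java collections. It is a sketch that runs on one machine only; `LocalWordCountSketch` is a hypothetical name. The Spark Java API offers operations of a similar shape (for example `flatMap` and `reduceByKey`), so a distributed version would read much like this pipeline, although the exact calls differ.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

/** Local word count written in the same pipeline style a distributed job would use. */
final class LocalWordCountSketch {
    /** Splits each line into words, then counts occurrences per word. */
    static Map<String, Long> wordCount(List<String> lines) {
        return lines.stream()
                .flatMap(line -> Arrays.stream(line.split("\\s+"))) // one line becomes many words
                .filter(word -> !word.isEmpty())                    // drop empty tokens
                .collect(Collectors.groupingBy(word -> word, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        System.out.println(wordCount(Arrays.asList("spark is fast", "spark is lazy")));
    }
}
```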

### Manage Parallelism

#### A better hash (Skew)
- Usually Math.abs(value.hashCode()) % (number of partitions) works
- If most of the keys are the same, add a SALT: append a random key (dirt) to the original key, for example a value mod 2, to even out the records per reducer
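
The hashing and salting notes above can be sketched as follows. All names here (`SaltedPartitioningSketch`, `partitionFor`, `saltKey`, `unsaltKey`) are hypothetical helpers, not Spark APIs. One detail is worth knowing: `Math.abs(Integer.MIN_VALUE)` is still negative in Java, so this sketch uses `Math.floorMod`, which always returns a non-negative index for a positive partition count.

```java
import java.util.Random;

/** Hypothetical helpers for hash partitioning and key salting. */
final class SaltedPartitioningSketch {
    /** Non-negative partition index, safe even when hashCode() is Integer.MIN_VALUE. */
    static int partitionFor(Object key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    /** Appends a salt in [0, saltBuckets) so one hot key is spread over several salted keys. */
    static String saltKey(String key, int saltBuckets, Random random) {
        return key + "#" + random.nextInt(saltBuckets);
    }

    /** Removes the salt again so partial results can be merged per original key. */
    static String unsaltKey(String saltedKey) {
        return saltedKey.substring(0, saltedKey.lastIndexOf('#'));
    }

    public static void main(String[] args) {
        Random random = new Random();
        int numPartitions = 4;
        for (int i = 0; i < 4; i++) {
            String salted = saltKey("hotKey", 2, random);
            System.out.println(salted + " -> partition " + partitionFor(salted, numPartitions));
        }
    }
}
```

With salting, an aggregation typically runs in two passes: first per salted key, then once more per original key after the salt is stripped. The mod-2 salt from the notes corresponds to `saltBuckets = 2` in this sketch.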

#### Cartesian Join
When you join two tables with a many-to-many relationship, the join generates thousands of keys in the middle
<img src="resources/imgs/spark_ted_malaska_cartesian_join.png" alt="spark_ted_malaska_cartesian_join" width="600"/>

- Nested structures
  + A cell in a table can itself hold rows
  + Example: with a one-to-many relation where bob has 3 cats, joining the two tables gives bob cat1, bob cat2, bob cat3; with nested fields we see only one row for bob, holding 3 cats
  + Reduces the scale of the join
- Windowing
- ReduceByKey
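
The bob-and-cats example can be written out in plain Java to show the difference in row counts. This is a local sketch with hypothetical names (`NestedJoinSketch`, `flatJoin`, `nestedJoin`); each pet is modeled as a two-element array of owner and pet name, which is an assumption made only for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Compares a flat join with a nested result for a one-to-many relation. */
final class NestedJoinSketch {
    /** Flat join: one output row per matching (owner, pet) pair. */
    static List<String> flatJoin(List<String> owners, List<String[]> pets) {
        List<String> rows = new ArrayList<>();
        for (String owner : owners) {
            for (String[] pet : pets) {
                if (pet[0].equals(owner)) {
                    rows.add(owner + " " + pet[1]);
                }
            }
        }
        return rows;
    }

    /** Nested join: one row per owner, with the matching pets held in a nested list. */
    static Map<String, List<String>> nestedJoin(List<String> owners, List<String[]> pets) {
        Map<String, List<String>> rows = new LinkedHashMap<>();
        for (String owner : owners) {
            rows.put(owner, new ArrayList<>());
        }
        for (String[] pet : pets) {
            List<String> nested = rows.get(pet[0]);
            if (nested != null) {
                nested.add(pet[1]);
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        List<String> owners = Arrays.asList("bob");
        List<String[]> pets = Arrays.asList(
                new String[] {"bob", "cat1"},
                new String[] {"bob", "cat2"},
                new String[] {"bob", "cat3"});
        System.out.println(flatJoin(owners, pets));   // three rows, one per cat
        System.out.println(nestedJoin(owners, pets)); // one row for bob with a nested list
    }
}
```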
Binary file added: tech-summary/lectures/resources/imgs/spark_ted_malaska_cartesian_join.png (+519 KB)
Binary file added: tech-summary/lectures/resources/imgs/spark_ted_malaska_distribute_program.png (+218 KB)
Binary file added: tech-summary/papers/resources/pictures/spark_rdd_narrow_wide_transform.png (+45 KB)
Empty file.
@@ -0,0 +1,26 @@
# Spark Information Main page

## Notes
- [Paper of Resilient Distributed Datasets](../papers/rdd.md)
- [Paper of SparkSQL]() todo
- [Intro to Apache Spark By Ted Malaska](../lectures/intro_to_apache_spark_for_java_and_scala_developers.md)
- [RDD, DataFrame, DataSet]() todo

## Docs
- [RDD Programming](https://spark.apache.org/docs/latest/rdd-programming-guide.html)
- [SparkSQL, DataFrame, DataSet](https://spark.apache.org/docs/latest/sql-programming-guide.html)
- [Spark Quick Start](https://spark.apache.org/docs/latest/quick-start.html)

## Examples
- [Spark Examples](https://spark.apache.org/examples.html)
- [Common techniques for Spark data processing](https://blog.csdn.net/eric_sunah/article/details/51822876)
***
- [Load parquet file](https://sparkbyexamples.com/spark/spark-streaming-kafka-consumer-example-in-json-format/)
- [Spark SQL map functions](https://sparkbyexamples.com/spark/spark-sql-map-functions/)
- [Spark streaming, consume data from kafka in JSON format](https://sparkbyexamples.com/spark/spark-streaming-kafka-consumer-example-in-json-format/)
- [Different ways to Create DataFrame in Spark](https://sparkbyexamples.com/spark/different-ways-to-create-a-spark-dataframe/)
***

## Others
- [awesome-spark/awesome-spark](https://github.com/awesome-spark/awesome-spark)
- [Spark study notes: an exhaustive summary](http://chant00.com/2017/07/28/Spark%E5%AD%A6%E4%B9%A0%E7%AC%94%E8%AE%B0/) (so exhaustive you can never finish reading it)