Commit 3073e99: Add notes about spark.
CodeBear801 committed Nov 13, 2019 (1 parent: f7278e0)
Showing 14 changed files with 183 additions and 23 deletions.

69 changes: 69 additions & 0 deletions tech-summary/lectures/intro_to_apache_spark_for_java_and_scala_developers.md
# Intro to Apache Spark for Java and Scala Developers

Initially from my OneNote record on May 24, 2018
[video](https://www.youtube.com/watch?v=x8xXXqvhZq8), by Ted Malaska

## Takeaways

- One driver and many executors, plus a shuffle service
- The DAG is small and can be passed to each executor

<img src="resources/imgs/spark_ted_malaska_distribute_program.png" alt="spark_ted_malaska_distribute_program" width="600"/>


## Problem with Hadoop
<img src="resources/imgs/spark_ted_malaska_shuffle.png" alt="spark_ted_malaska_shuffle" width="600"/>


- Shuffle cost = #mappers × #reducers (every mapper may feed every reducer)
- Single point of failure
- Data transfer can be the bottleneck

## Spark

Keywords:
A single **driver** **broadcasts** tasks to executors via the **scheduler** and **takes** results back

### RDD
Immutable data used for replay (handling single-point failure), kept in memory, without a schema.
**A data frame is an RDD with a schema.**
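
A minimal sketch of the difference (assumes a local SparkSession; names and data are made up):

```
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("rdd-vs-df").getOrCreate()
import spark.implicits._

// An RDD is just partitioned records -- no schema attached.
val rdd = spark.sparkContext.parallelize(Seq(("bob", 3), ("alice", 1)))

// A DataFrame is the same data plus a schema (named, typed columns).
val df = rdd.toDF("name", "cats")
df.printSchema() // name: string, cats: int
```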

### DAG
<img src="resources/imgs/spark_ted_malaska_dag.png" alt="spark_ted_malaska_dag" width="600"/>


What is an action: count, take, foreach
What is a transformation: map, reduceByKey, groupByKey, join by key
DAG + RDD make it much easier to recover when anything goes wrong
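
A hedged sketch of the lazy DAG at work (reuses the `spark` session from the sketch above; `input.txt` is a made-up path):

```
val sc = spark.sparkContext

// Transformations only extend the DAG; nothing executes yet.
val counts = sc.textFile("input.txt")
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)   // still just a plan

// An action finally forces the DAG to run on the executors.
counts.take(5).foreach(println)
```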

### FlumeJava
Writing a distributed program becomes the same as writing a local one

<img src="resources/imgs/spark_ted_malaska_flume_java.png" alt="spark_ted_malaska_flume_java" width="600"/>



### Manage Parallelism

#### A better hash (skew)
Usually `Math.abs(value.hashCode()) % numPartitions` works.
If most of the keys are the same
-> SALT: add a random key component (dirt), e.g., sized to bound the records per reducer
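
A sketch of salting with made-up data (assumes a SparkContext `sc`): spread the hot key over a few sub-keys, aggregate, then strip the salt and merge:

```
import scala.util.Random

val skewed = sc.parallelize(Seq.fill(10000)(("hot", 1)) ++ Seq(("cold", 1)))

val numSalts = 8
// Step 1: append random "dirt" so the hot key spreads over numSalts reducers.
val salted  = skewed.map { case (k, v) => ((k, Random.nextInt(numSalts)), v) }
// Step 2: partial aggregation per (key, salt) pair.
val partial = salted.reduceByKey(_ + _)
// Step 3: strip the salt and merge at most numSalts partial results per key.
val merged  = partial.map { case ((k, _), v) => (k, v) }.reduceByKey(_ + _)
```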

#### Cartesian Join
When you join two tables with a many-to-many relationship, the join can generate thousands of intermediate rows
<img src="resources/imgs/spark_ted_malaska_cartesian_join.png" alt="spark_ted_malaska_cartesian_join" width="600"/>


- Nested structures
+ a cell in a table which itself holds rows
+ Example: with a one-to-many relation where Bob has 3 cats, joining the two tables yields (bob, cat1), (bob, cat2), (bob, cat3); with nested fields we only see one row for Bob carrying all 3 cats (see the sketch after this list)
+ reduces join scale
- Windowing
- ReduceByKey
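
A sketch of the nested-structure point with made-up data (assumes a SparkContext `sc`): group the many side before joining so Bob stays a single row:

```
val owners = sc.parallelize(Seq(("bob", 30)))   // (name, age)
val cats   = sc.parallelize(Seq(("bob", "cat1"), ("bob", "cat2"), ("bob", "cat3")))

// Plain join: one output row per matching pair -> (bob, cat1), (bob, cat2), (bob, cat3)
val flat   = owners.join(cats)

// Nested: group the many side first, so Bob is one row carrying a list of 3 cats.
val nested = owners.join(cats.groupByKey())     // (bob, (30, [cat1, cat2, cat3]))
nested.collect().foreach(println)
```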





100 changes: 78 additions & 22 deletions tech-summary/papers/rdd.md
- [Resilient Distributed Datasets](#resilient-distributed-datasets)
- [Why](#why)
- [Pre-solutions](#pre-solutions)
- [RDD](#rdd)
- [Interface of RDD](#interface-of-rdd)
- [Comments](#comments)
- [RDD vs DSM](#rdd-vs-dsm)
- [Fault handling](#fault-handling)
- [Recovery from faults](#recovery-from-faults)
- [checkpointing](#checkpointing)
- [Example](#example)

# Resilient Distributed Datasets

Notes for [Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing](https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf)
[...]
+ still has rigidness from MR (writes to disk after map, to replicated storage after reduce, instead of keeping data in RAM)


## RDD
RDD stands for Resilient Distributed Dataset; an RDD is a collection of partitions of records.
```
The main challenge in designing RDDs is defining a programming interface
that can provide fault tolerance efficiently.
```
Basically, there are two approaches: replicate the data across machines (or checkpoint it), or log updates across machines. Both are expensive for data-intensive workloads, as they require copying large amounts of data over the cluster network, whose bandwidth is far lower than that of RAM, and they incur substantial storage overhead. **RDDs instead provide an interface based on coarse-grained transformations (e.g., map, filter and join) that apply the same operation to many data items.**

### Interface of RDD

<img src="resources/pictures/spark_rdd_interface.png" alt="spark_rdd_interface" width="500"/> <br/>


- `partitions` -- returns a list of partitions
- `preferredLocations(p)` -- returns the preferred locations of a partition
+ tells you about machines where computation would be faster
- `dependencies`
+ how you depend on other RDDs
- `iterator(p, parentIters)`
+ ask an RDD to compute one of its partitions
- `partitioner`
+ allows you to specify a partitioning function
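
The interface above paraphrased as a Scala trait; this is illustrative only, and the placeholder types are not Spark's real classes:

```
trait Partition
trait Dependency
trait Partitioner

trait RDD[T] {
  def partitions: Seq[Partition]                      // the list of partitions
  def preferredLocations(p: Partition): Seq[String]   // nodes where computing p is faster
  def dependencies: Seq[Dependency]                   // lineage: which RDDs this one is built from
  def iterator(p: Partition,
               parentIters: Seq[Iterator[_]]): Iterator[T] // compute one partition from its parents
  def partitioner: Option[Partitioner]                // hash/range partitioning scheme, if any
}
```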


### Comments

- Transformations & Actions
There are two operations on RDDs (see the sketch after this list):
Transformations compute a new RDD from existing RDDs (flatMap, reduceByKey). A transformation just specifies a plan; the runtime is lazy, so it doesn't have to materialize (compute) anything yet.

<img src="resources/pictures/spark_rdd_narrow_wide_transform.png" alt="spark_rdd_narrow_wide_transform" width="500"/> <br/>



Actions are where some effect is requested: a result to be stored, a specific value to be fetched, etc. Actions cause RDDs to materialize.

- Fault tolerance vs performance
Spark gives the user control over the trade-off between fault tolerance and performance:
if the user frequently persists w/REPLICATE, recovery is fast, but execution is slower;
if infrequently, execution is fast but recovery is slow


- What metadata a partition carries
An RDD carries metadata on its partitioning, so transformations that depend on multiple RDDs know whether they need to shuffle data (wide dependency) or not (narrow).
This gives users control over locality and reduces shuffles; the sketch after this list shows a co-partitioned join.

- What if there is not enough memory
+ LRU (Least Recently Used) eviction of partitions
* first on non-persisted partitions
* then on persisted ones (but they will be available on disk; this makes sure the user cannot overbook RAM)
+ the user can control the order of eviction via "persistence priority"
+ no reason not to discard non-persisted partitions (if they've already been used)
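
A sketch tying these comments together (assumes a SparkContext `sc`): a narrow map, a wide reduceByKey, the lineage via toDebugString, and a join that avoids a shuffle because both sides share a partitioner:

```
import org.apache.spark.HashPartitioner

val nums   = sc.parallelize(1 to 100, numSlices = 4)
val narrow = nums.map(n => (n % 10, n))   // narrow: each output partition has one parent
val wide   = narrow.reduceByKey(_ + _)    // wide: needs a shuffle across all parents
println(wide.toDebugString)               // prints the lineage, with the shuffle boundary

// Partitioning metadata at work: co-partitioned inputs make the join narrow.
val p      = new HashPartitioner(4)
val left   = narrow.partitionBy(p).persist()
val right  = wide.partitionBy(p).persist()
val joined = left.join(right)             // no extra shuffle: both sides already share p
```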



### RDD vs DSM

<img src="resources/pictures/spark_rdd_rdd_vs_dsm.png" alt="spark_rdd_rdd_vs_dsm" width="500"/> <br/>


### Fault handling

When Spark computes, by default it only generates one copy of the result and doesn't replicate it. Without replication, whether the data sits in RAM or on disk, a permanent node failure means the data is gone.
When some partition is lost and needs to be recomputed, the scheduler needs to find a way to recompute it (a fault can be detected by using a heartbeat).
It will need to compute all partitions it depends on, until reaching a partition that is in RAM/disk or in replicated storage.
If the dependency is wide, recomputation needs all partitions of that dependency; if narrow, just the one partition of that RDD.

#### Recovery from faults

So two mechanisms enable recovery from faults: **lineage**, and **policy of what partitions to persist**(either to one node or replicated)

Lineage is represented by Transformations.

The user can call persist on an RDD:
with the RELIABLE flag, Spark will keep multiple copies (in RAM if possible, on disk if RAM is full);
with the REPLICATE flag, it will write to stable storage (HDFS);
without flags, it will try to keep the RDD in RAM (spilling to disk when RAM is full).
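
In the open-source API these choices surface as storage levels. A hedged mapping (rddA/rddB/rddC stand in for existing RDDs; pairing them with the flags above is my reading, not the paper's wording):

```
import org.apache.spark.storage.StorageLevel

// Pick one level per RDD (Spark rejects changing it afterwards):
rddA.persist(StorageLevel.MEMORY_ONLY)      // plain persist(): RAM only, recompute if evicted
rddB.persist(StorageLevel.MEMORY_AND_DISK)  // spill to local disk when RAM fills up
rddC.persist(StorageLevel.MEMORY_ONLY_2)    // two in-RAM replicas: faster recovery, slower writes
```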

#### Checkpointing
Why implement checkpointing, given that it's expensive?
Long lineage could cause large recovery time; and when there are wide dependencies, a single failure might require many partitions to be recomputed.

Checkpointing is like buying insurance: you pay for writing to stable storage so you can recover faster in case of a fault.
Whether it's worth it depends on the frequency of failures and on the cost of slower recovery.
An automatic checkpointing policy would take these into account, together with the size of the data (how long it takes to write) and the computation time.
So a node failure can be handled by recomputing lineage up to partitions that can be read from RAM, disk, or replicated storage.
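
A sketch of cutting a long lineage with a checkpoint (assumes a SparkContext `sc`; the directory is made up):

```
sc.setCheckpointDir("hdfs:///tmp/checkpoints")   // stable storage for checkpoint files

var chain = sc.parallelize(1 to 1000)
for (_ <- 1 to 100) chain = chain.map(_ + 1)     // lineage grows to 100 map steps

chain.checkpoint()   // the next action also writes this RDD to stable storage
chain.count()        // recovery afterwards reads the checkpoint, not 100 maps
```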


### Example
PageRank (todo)
9 changes: 9 additions & 0 deletions tech-summary/tools/elastic/elasticsearch_commands.md

## System

```
# start elasticsearch as a daemon
./elasticsearch -d
# find its pid
ps -ef | grep elastic
```


## DBOP

- Create/Delete index
2 changes: 1 addition & 1 deletion tech-summary/tools/elastic/elasticsearch_template.md
If I want indexes created with names starting with `logstash-` to have specific settings applied, such as
- disabling the `_all` field search
- setting the default attribute to `@message`
- disallowing index creation for source/source_host/source_path etc., to save space and indexing time


I could do it as follows
26 changes: 26 additions & 0 deletions tech-summary/tools/spark_index.md
# Spark Information Main page

## Notes
- [Paper of Resilient Distributed Datasets](../papers/rdd.md)
- [Paper of SparkSQL]() todo
- [Intro to Apache Spark By Ted Malaska](../lectures/intro_to_apache_spark_for_java_and_scala_developers.md)
- [RDD, DataFrame, DataSet]() todo

## Docs
- [RDD Programming](https://spark.apache.org/docs/latest/rdd-programming-guide.html)
- [SparkSQL, DataFrame, DataSet](https://spark.apache.org/docs/latest/sql-programming-guide.html)
- [Spark Quick Start](https://spark.apache.org/docs/latest/quick-start.html)

## Examples
- [Spark Examples](https://spark.apache.org/examples.html)
- [Commonly used tricks for Spark data processing](https://blog.csdn.net/eric_sunah/article/details/51822876)
***
- [Load parquet file](https://sparkbyexamples.com/spark/spark-streaming-kafka-consumer-example-in-json-format/)
- [Spark SQL map functions](https://sparkbyexamples.com/spark/spark-sql-map-functions/)
- [Spark streaming, consume data from kafka in JSON format](https://sparkbyexamples.com/spark/spark-streaming-kafka-consumer-example-in-json-format/)
- [Different ways to Create DataFrame in Spark](https://sparkbyexamples.com/spark/different-ways-to-create-a-spark-dataframe/)
***

## Others
- [awesome-spark/awesome-spark](https://github.com/awesome-spark/awesome-spark)
- [Spark study notes: a super-complete summary](http://chant00.com/2017/07/28/Spark%E5%AD%A6%E4%B9%A0%E7%AC%94%E8%AE%B0/) (so comprehensive you can't finish reading it)
