resources for discussion
- https://github.com/dmlc/xgboost
- http://datascience.la/xgboost-workshop-and-meetup-talk-with-tianqi-chen/
Decision trees are weak learners (individually they are quite inaccurate, but they do better when they work together).
The second tree must provide a positive contribution when combined with the first tree (a minimal boosting sketch follows below).
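A minimal sketch of this idea, not XGBoost itself: gradient boosting with squared loss, where each new weak learner (a one-split decision stump here) is fit to the residuals of the current ensemble, so it only adds what the previous learners still get wrong. All data and parameters below are made up for illustration.

object BoostingSketch {
  // fit a decision stump to the residuals: pick the threshold whose
  // left/right mean predictions give the lowest squared error
  def fitStump(x: Array[Double], r: Array[Double]): Double => Double = {
    val scored = x.distinct.sorted.map { thr =>
      val (left, right) = x.zip(r).partition(_._1 <= thr)
      def mean(a: Array[(Double, Double)]) = if (a.isEmpty) 0.0 else a.map(_._2).sum / a.length
      val (lm, rm) = (mean(left), mean(right))
      val err = x.zip(r).map { case (xi, ri) =>
        val p = if (xi <= thr) lm else rm
        (ri - p) * (ri - p)
      }.sum
      (err, thr, lm, rm)
    }
    val (_, thr, lm, rm) = scored.minBy(_._1)
    xi => if (xi <= thr) lm else rm
  }

  def main(args: Array[String]): Unit = {
    val x = Array(1.0, 2.0, 3.0, 4.0, 5.0)
    val y = Array(1.5, 3.0, 4.5, 6.0, 7.5)
    var ensemble: Double => Double = _ => 0.0          // start from an empty model
    for (_ <- 1 to 20) {
      val residual = x.zip(y).map { case (xi, yi) => yi - ensemble(xi) }
      val stump = fitStump(x, residual)                // new learner targets what is still wrong
      val prev = ensemble
      ensemble = xi => prev(xi) + 0.5 * stump(xi)      // shrinkage / learning rate 0.5
    }
    x.zip(y).foreach { case (xi, yi) =>
      println(f"x=$xi%.1f target=$yi%.1f prediction=${ensemble(xi)}%.2f")
    }
  }
}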
- The dimensions of the original data are reasonable for human beings, but might not be friendly for a decision tree
- Try to represent the data in a different coordinate system
- Make the leading dimension the principal component with the most variation, i.e. the axis along which the spread between the min and max values is largest (a minimal Spark PCA sketch follows below)
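A minimal sketch of this idea using Spark MLlib's PCA transformer; the toy data, k, and column names are made up for illustration.

import org.apache.spark.ml.feature.PCA
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object PcaSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("pca-sketch").master("local[*]").getOrCreate()
    import spark.implicits._

    // toy 2-D points that vary mostly along one diagonal direction
    val data = Seq(
      Vectors.dense(1.0, 1.1),
      Vectors.dense(2.0, 2.2),
      Vectors.dense(3.0, 2.9),
      Vectors.dense(4.0, 4.1)
    ).map(Tuple1.apply).toDF("features")

    // re-express the data in a coordinate system aligned with the directions
    // of largest variation, keeping only the leading principal component
    val pca = new PCA()
      .setInputCol("features")
      .setOutputCol("pcaFeatures")
      .setK(1)
      .fit(data)

    pca.transform(data).select("pcaFeatures").show(truncate = false)
    spark.stop()
  }
}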
- Dremel paper https://storage.googleapis.com/pub-tools-public-publication-data/pdf/36632.pdf
- Google Dremel internals: how it can analyze 1 PB in 3 seconds (in Chinese) https://www.twblogs.net/a/5b82787e2b717766a1e868d0
- Approaching NoSQL Design in DynamoDB https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-general-nosql-design.html#bp-general-nosql-design-approach
- Spark + Parquet In Depth https://www.youtube.com/watch?v=_0Wpwj_gvzg&t=1501s
- https://medium.com/@rajnishtiwari2010/conversion-of-json-to-parquet-format-using-apache-parquet-in-java-b694a0a7487d
- Only load the data that is needed
- Organize data storage based on access patterns (query, filter); see the sketch below
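A minimal sketch of both ideas with Spark on Parquet, assuming a SparkSession named spark with spark.implicits._ imported; the path and column names are invented.

// only the two selected columns are read from disk (column pruning), and the
// filter can skip whole row groups using Parquet's min/max statistics
// (predicate pushdown)
val trips = spark.read.parquet("/data/trips.parquet")   // hypothetical path
trips.select("pickup_time", "fare")
  .filter($"fare" > 10.0)
  .show()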
Flat data vs. Nested data
Writing Parquet with Scala (Spark)
// assumes a SparkSession named `spark`, import spark.implicits._ for .toDF,
// and a user-defined transformRow that returns a case class or tuple
val flatDF = spark.read.option("delimiter", "\t")
  .option("header", "true")
  .csv(flatInput)
  .rdd
  .map(r => transformRow(r))
  .toDF
flatDF.write.option("compression", "snappy")
  .parquet(flatOutput)

val nestedDF = spark.read.json(nestedInput)
nestedDF.write.option("compression", "snappy")
  .parquet(nestedOutput)
By storing data column-oriented, we can compress each column in a different way:
incrementally (record only the diffs between consecutive values), or with a dictionary of distinct values.
More details can be found in the Parquet encodings documentation. A conceptual sketch of the two encodings follows below.
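A conceptual sketch of these two encodings in plain Scala, for illustration only (this is not Parquet's actual implementation).

object EncodingSketch {
  // dictionary encoding: store each distinct value once, then only small indices
  def dictionaryEncode(column: Seq[String]): (Seq[String], Seq[Int]) = {
    val dict = column.distinct
    val index = dict.zipWithIndex.toMap
    (dict, column.map(index))
  }

  // delta (incremental) encoding: store the first value plus successive differences
  // (assumes a non-empty column)
  def deltaEncode(column: Seq[Long]): (Long, Seq[Long]) =
    (column.head, column.zip(column.tail).map { case (prev, next) => next - prev })

  def main(args: Array[String]): Unit = {
    println(dictionaryEncode(Seq("en", "en", "fr", "en")))   // (List(en, fr), List(0, 0, 1, 0))
    println(deltaEncode(Seq(100L, 101L, 103L, 106L)))        // (100, List(1, 2, 3))
  }
}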
Parquet file tree structure
- File metadata: schema, number of rows
- Row group: massive data is not written all at once but piece by piece, one row group at a time
- Column chunk: each column is stored (and read) individually, so if a whole column is not needed it can be skipped directly
- Page header: page size
- Page: stores metadata such as count, max, and min to help filter quickly
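// partitionBy writes a Hive-style directory layout such as
//   <outputFile>/Year=2020/Month=1/Day=1/Hour=0/part-....parquet   (values illustrative)
// so queries filtering on Year/Month/Day/Hour only scan the matching directories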
dataFrame.write
.partitionBy("Year", "Month", "Day", "Hour")
.parquet(outputFile)
https://github.com/apache/parquet-mr
- via source code
- via docker
docker pull nathanhowell/parquet-tools:latest
- https://github.com/apache/drill/blob/master/exec/java-exec/src/test/resources/lateraljoin/nested-customer.json
- https://github.com/apache/drill/blob/master/exec/java-exec/src/test/resources/lateraljoin/nested-customer.parquet
wget https://github.com/apache/drill/raw/master/exec/java-exec/src/test/resources/lateraljoin/nested-customer.parquet   # fetch the raw file, not the HTML blob page
docker run -it --rm nathanhowell/parquet-tools:latest --help
docker run --rm -it -v /yourlocalpath:/test nathanhowell/parquet-tools:latest schema test/nested-customer.parquet
➜ tmp docker run --rm -it -v /Users/xunliu/Downloads/tmp:/test nathanhowell/parquet-tools:latest schema test/nested-customer.parquet
message root {
  optional binary _id (UTF8);
  optional binary c_address (UTF8);
  optional double c_id;
  optional binary c_name (UTF8);
  repeated group orders {
    repeated group items {
      optional binary i_name (UTF8);
      optional double i_number;
      optional binary i_supplier (UTF8);
    }
    optional double o_amount;
    optional double o_id;
    optional binary o_shop (UTF8);
  }
}
- For generating a Parquet file from CSV, I think the best way is to use Apache Spark, Apache Drill, or a cloud platform such as a Databricks notebook (a minimal Spark sketch follows this list). I failed with the following command-line tools:
  - golang https://github.com/xitongsys/parquet-go failed to build
  - python 2.7 https://github.com/redsymbol/csv2parquet failed to build
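A minimal CSV-to-Parquet sketch with Spark; the input/output paths are placeholders.

import org.apache.spark.sql.SparkSession

object CsvToParquet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("csv-to-parquet")
      .master("local[*]")
      .getOrCreate()

    // read the CSV with a header row and let Spark infer column types
    val df = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("input.csv")

    // write the same data as snappy-compressed Parquet
    df.write
      .option("compression", "snappy")
      .parquet("output.parquet")

    spark.stop()
  }
}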
- why rdd https://github.com/CodeBear801/tech_summary/blob/master/tech-summary/papers/rdd.md
- dataframe vs rdd https://github.com/CodeBear801/tech_summary/blob/master/tech-summary/lectures/spark_rdd_dataframe_dataset.md
- https://github.com/CodeBear801/tech_summary/blob/master/tech-summary/papers/flumejava.md
- http://why-not-learn-something.blogspot.com/2016/07/apache-spark-rdd-vs-dataframe-vs-dataset.html
Let's say there are multiple stages of MapReduce:
- how to represent distributed data in a programming language
- if a middle step fails, how to recover
- how to make it easy for programmers to write MapReduce programs
- let's say we want to first add 1 to all numbers and then filter the odd ones: can we optimize the calculation? (see the sketch after this list)
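A minimal sketch of the last question, assuming a SparkContext named sc: map and filter are lazy narrow transformations, so Spark pipelines them into a single stage and each partition is processed in one pass instead of materializing an intermediate dataset.

val numbers = sc.parallelize(1 to 10)
val result = numbers
  .map(_ + 1)          // add 1 to every element (lazy, nothing runs yet)
  .filter(_ % 2 == 1)  // keep the odd numbers (lazy)
result.collect()       // the action triggers one fused pass over each partition
// => Array(3, 5, 7, 9, 11)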
RDD stands for Resilient Distributed Dataset; an RDD is a collection of partitions of records.
The main challenge in designing RDDs is defining a programming interface
that can provide fault tolerance efficiently.
What is RDD
Each RDD contains:
(1) a set of partitions (the atomic pieces of the dataset);
(2) a set of dependencies on parent RDDs, which describe the RDD's lineage;
(3) a function describing what computation to perform on the parent RDDs;
(4) metadata about its partitioning scheme and where the data is placed.
For example, an RDD representing an HDFS file has one partition for each data block and knows which nodes each block lives on.
The result of a map over this RDD has the same partitioning, and the map function is applied to the parent data.
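A simplified sketch of that interface, mirroring the four items above (the names are invented for illustration; the real Spark RDD API differs in details).

trait Partition { def index: Int }   // one atomic piece of the dataset
trait Dependency                     // a link to a parent RDD (lineage)

trait SimpleRDD[T] {
  def partitions: Seq[Partition]                          // (1) the partitions
  def dependencies: Seq[Dependency]                       // (2) lineage, used for recovery
  def compute(split: Partition): Iterator[T]              // (3) how to compute a partition from the parents
  def preferredLocations(split: Partition): Seq[String]   // (4) placement metadata (e.g. HDFS block hosts)
}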
Example code
val rdd = sc.textFile("/mnt/wikipediapagecounts.gz")
val parsedRDD = rdd.flatMap { line =>
  line.split("""\s+""") match {
    // keep only well-formed lines; convert the count so it can be summed later
    case Array(project, page, numRequests, _) => Some((project, page, numRequests.toLong))
    case _ => None
  }
}
// filter only english pages; count pages and requests to it
parsedRDD.filter{case(project, page, numRequests) => project == "en"}
.map{ case(_, page, numRequests) => (page, numRequests)}
.reduceByKey(_+_)
.take(100)
.foreach{case (page, requests) => println(s"$page:$requests")}
Why not RDD
Spark doesn't look into the lambda functions, so it doesn't know the structure or types of the data and cannot optimize the computation.
DataFrame Sample
// convert RDD -> DF with column names (requires import spark.implicits._)
val df = parsedRDD.toDF("project", "page", "numRequests")
// filter, groupBy, then aggregate with sum() (requires import org.apache.spark.sql.functions._)
df.filter($"project" === "en")
  .groupBy($"page")
  .agg(sum($"numRequests").as("count"))
  .limit(100)
  .show(100)
| project | page | numRequests |
| --- | --- | --- |
| en | 23 | 45 |
| en | 24 | 200 |