Decision trees as weak learners (individually they are quite inaccurate, but they do better when they work together)
The second tree must make a positive contribution when combined with the first tree
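A minimal sketch of the point above, assuming the trees are combined the way boosting does it: the second tree is fitted to the errors (residuals) left by the first one, so adding it does not make the training error worse. `Stump`, `fitStump` and the toy data are hypothetical names and values for illustration only; real libraries use deeper trees and a learning rate.

```scala
object BoostingSketch {
  // A decision stump: one threshold on x and one constant prediction per side.
  final case class Stump(threshold: Double, left: Double, right: Double) {
    def predict(x: Double): Double = if (x <= threshold) left else right
  }

  private def mean(v: Seq[Double]): Double = if (v.isEmpty) 0.0 else v.sum / v.size

  // Sum of squared errors between predictions and targets.
  def sse(pred: Seq[Double], target: Seq[Double]): Double =
    pred.zip(target).map { case (p, t) => (p - t) * (p - t) }.sum

  // Try every x as a split point and keep the stump with the lowest squared error.
  def fitStump(xs: Seq[Double], ys: Seq[Double]): Stump = {
    val candidates = xs.distinct.map { t =>
      val (l, r) = xs.zip(ys).partition(_._1 <= t)
      Stump(t, mean(l.map(_._2)), mean(r.map(_._2)))
    }
    candidates.minBy(s => sse(xs.map(s.predict), ys))
  }

  // Toy data (made up for this sketch).
  val xs: Seq[Double] = Seq(1.0, 2.0, 3.0, 4.0, 5.0, 6.0)
  val ys: Seq[Double] = Seq(1.0, 1.0, 3.0, 3.0, 6.0, 6.0)

  val tree1: Stump = fitStump(xs, ys)
  // The second tree learns what the first tree got wrong.
  val residuals: Seq[Double] = xs.zip(ys).map { case (x, y) => y - tree1.predict(x) }
  val tree2: Stump = fitStump(xs, residuals)

  val errorOneTree: Double = sse(xs.map(tree1.predict), ys)
  val errorTwoTrees: Double = sse(xs.map(x => tree1.predict(x) + tree2.predict(x)), ys)
}
```

Comparing `errorOneTree` with `errorTwoTrees` shows whether the second stump contributed anything on this data.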
The dimensions of the original data are reasonable for human beings, but they might not be friendly for a decision tree
Try to represent the data in a different coordinate system
Make the first dimension the principal component, the direction with the most variation (so that the spread between min value and max value along the remaining directions is as small as possible)
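A rough sketch of the coordinate change described above, restricted to 2-D points so that the direction of most variation has a closed form. All names (`Pca2dSketch`, `firstComponent`, `project`) are hypothetical; a real PCA would use a linear algebra library and handle any number of dimensions.

```scala
object Pca2dSketch {
  type Point = (Double, Double)

  private def mean(v: Seq[Double]): Double = v.sum / v.size

  // Population variance of a sequence of values.
  def variance(v: Seq[Double]): Double = {
    val m = mean(v)
    mean(v.map(x => (x - m) * (x - m)))
  }

  // Unit vector of the first principal component of a set of 2-D points.
  def firstComponent(points: Seq[Point]): Point = {
    val mx = mean(points.map(_._1))
    val my = mean(points.map(_._2))
    val covXY = mean(points.map { case (x, y) => (x - mx) * (y - my) })
    val varX = variance(points.map(_._1))
    val varY = variance(points.map(_._2))
    // Closed form for the major axis of a 2x2 covariance matrix.
    val theta = 0.5 * math.atan2(2 * covXY, varX - varY)
    (math.cos(theta), math.sin(theta))
  }

  // Coordinate of every point along the given axis.
  def project(points: Seq[Point], axis: Point): Seq[Double] =
    points.map { case (x, y) => x * axis._1 + y * axis._2 }
}
```

For points lying along the diagonal, the new axis captures more variation than either original axis does on its own, which is what makes a single split along it more informative.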
- Only load needed data
- Organize data storage around how the data will be accessed (query, filter)
Flat data vs. Nested data
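Options for writing Parquet with Scala
// The read call and the input/output paths are missing from these notes; "..." stands in for them.
// spark.read, .csv(...) and .parquet(...) are assumed reconstructions of the lost calls.
val flatDF = spark.read.option("delimiter", "\t")
  .option("header", "true")
  .csv(...)
  .map(r => transformRow(r))
flatDF.write.option("compression", "snappy")
  .parquet(...)
val nestedDF = ... // how the nested DataFrame is built is not shown in the notes
nestedDF.write.option("compression", "snappy")
  .parquet(...)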
Because the data is stored column by column, each column can be compressed in the way that suits its values best:
incrementally (record only the difference from the previous value), or with a dictionary
More details can be found in the Parquet encoding documentation
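A small sketch of the two ideas above on in-memory columns. This is only an illustration with hypothetical helper names; Parquet's actual delta and dictionary encodings work on blocks and bit-packed integers and are more involved.

```scala
object ColumnEncodingSketch {
  // Delta encoding: keep the first value, then only the difference to the previous value.
  def deltaEncode(values: Seq[Long]): Seq[Long] =
    if (values.isEmpty) Seq.empty
    else values.head +: values.zip(values.tail).map { case (prev, next) => next - prev }

  // Rebuild the original values by accumulating the differences.
  def deltaDecode(deltas: Seq[Long]): Seq[Long] =
    if (deltas.isEmpty) Seq.empty
    else deltas.tail.scanLeft(deltas.head)(_ + _)

  // Dictionary encoding: store each distinct value once, then small integer ids.
  def dictEncode(values: Seq[String]): (Vector[String], Seq[Int]) = {
    val dictionary = values.distinct.toVector
    val index = dictionary.zipWithIndex.toMap
    (dictionary, values.map(index))
  }

  // Look every id up in the dictionary again.
  def dictDecode(dictionary: Vector[String], ids: Seq[Int]): Seq[String] =
    ids.map(dictionary)
}
```

Delta encoding pays off for sorted or slowly changing columns such as timestamps, while dictionary encoding pays off for columns with few distinct values such as a country code.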
Tree structure of a Parquet file
- File metadata: schema, number of rows
- Row group: a massive data set is not written all at once but piece by piece; each piece is a row group
- Column chunk: each column is handled individually within a row group (if nothing in the whole chunk can match, it is skipped directly)
- Page header: size
- Page: records metadata such as count, max, and min to allow quick filtering
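The quick filtering that the min/max metadata allows can be sketched as follows. This is a toy in-memory model with hypothetical names (`Page`, `findGreaterThan`), not the Parquet reader API, and it assumes every page holds at least one value.

```scala
object PageSkippingSketch {
  // A toy page: the values plus the statistics kept alongside them.
  final case class Page(values: Vector[Int]) {
    val count: Int = values.size
    val min: Int = values.min
    val max: Int = values.max
  }

  // Returns the matching values and the number of pages that had to be read.
  def findGreaterThan(pages: Seq[Page], bound: Int): (Seq[Int], Int) = {
    // A page whose max is not above the bound cannot contain a match, so skip it.
    val candidates = pages.filter(_.max > bound)
    (candidates.flatMap(_.values.filter(_ > bound)), candidates.size)
  }

  // Made-up sample pages.
  val pages: Seq[Page] = Seq(
    Page(Vector(1, 5, 9)),
    Page(Vector(12, 15, 20)),
    Page(Vector(30, 31, 45))
  )
}
```

With these sample pages, a filter such as "value greater than 25" only needs to open the last page; the other two are ruled out by their statistics alone.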
Partition the output by column values when writing (an option on the DataFrame writer):
.partitionBy("Year", "Month", "Day", "Hour")
via source code
via docker
docker pull nathanhowell/parquet-tools:latest
docker run -it --rm nathanhowell/parquet-tools:latest --help
docker run --rm -it -v /yourlocalpath:/test nathanhowell/parquet-tools:latest schema test/nested-customer.parquet
➜ tmp docker run --rm -it -v /Users/xunliu/Downloads/tmp:/test nathanhowell/parquet-tools:latest schema test/nested-customer.parquet
message root {
  optional binary _id (UTF8);
  optional binary c_address (UTF8);
  optional double c_id;
  optional binary c_name (UTF8);
  repeated group orders {
    repeated group items {
      optional binary i_name (UTF8);
      optional double i_number;
      optional binary i_supplier (UTF8);
    }
    optional double o_amount;
    optional double o_id;
    optional binary o_shop (UTF8);
  }
}
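To make the flat vs. nested distinction concrete, the schema above can be mirrored with case classes and then flattened into one row per item. The field names come from the schema; `FlatRow`, `flatten` and the grouping of `o_*` fields under orders and `i_*` fields under items are assumptions made for this sketch.

```scala
object NestedVsFlatSketch {
  // Nested model: a customer owns orders, and an order owns items.
  final case class Item(i_name: String, i_number: Double, i_supplier: String)
  final case class Order(items: Seq[Item], o_amount: Double, o_id: Double, o_shop: String)
  final case class Customer(
      _id: String,
      c_address: String,
      c_id: Double,
      c_name: String,
      orders: Seq[Order]
  )

  // Flat model: one row per item, with the customer and order fields repeated.
  final case class FlatRow(c_id: Double, c_name: String, o_id: Double, i_name: String, i_number: Double)

  def flatten(customers: Seq[Customer]): Seq[FlatRow] =
    for {
      c <- customers
      o <- c.orders
      i <- o.items
    } yield FlatRow(c.c_id, c.c_name, o.o_id, i.i_name, i.i_number)
}
```

The flat form repeats `c_name` on every row, which is the redundancy the nested form avoids; in exchange, the flat form is simpler to filter and aggregate.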
- To generate a Parquet file from CSV, I think the best way is to use Apache Spark, Apache Drill, or a cloud platform such as a Databricks notebook
- I failed with the following command line tools:
- golang: failed to build
- python 2.7: failed to build
- Why RDD
- DataFrame vs. RDD
Suppose there are multiple stages of MapReduce:
How do we represent distributed data in a programming language?
What if an intermediate step fails; how do we recover?
How do we make it easy for programmers to write MapReduce programs?
Suppose we want to first add 1 to every number and then keep only the odd numbers; can we optimize the calculation?
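One way to look at the last question, sketched with plain Scala collections rather than Spark: if the steps are only recorded and run lazily, both can be applied to each element in a single pass without materializing the intermediate result. The object and method names are hypothetical.

```scala
object LazyPipelineSketch {
  // Eager version: builds a full intermediate collection after the map.
  def eager(nums: Seq[Int]): Seq[Int] =
    nums.map(_ + 1).filter(_ % 2 != 0)

  // Lazy version: the Iterator only records the steps; each element then flows
  // through "+1" and the odd check in one pass, with no intermediate collection.
  def fused(nums: Seq[Int]): Seq[Int] =
    nums.iterator.map(_ + 1).filter(_ % 2 != 0).toList

  // Hand-optimized version: x + 1 is odd exactly when x is even, so the filter
  // can run first and the addition is only done for the elements that are kept.
  def reordered(nums: Seq[Int]): Seq[Int] =
    nums.filter(_ % 2 == 0).map(_ + 1)
}
```

All three give the same result; they differ only in how much intermediate work is done. Spark's lazy transformations make the second kind of optimization possible, since nothing runs until an action asks for a result.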
RDD stands for Resilient Distributed Dataset; an RDD is a collection of partitions of records.
The main challenge in designing RDDs is defining a programming interface
that can provide fault tolerance efficiently.
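A toy model of how such an interface can give fault tolerance cheaply, assuming the approach of remembering how each dataset was derived (its lineage) instead of replicating the data. `ToyRdd`, `Source`, `Mapped` and `Cached` are hypothetical names for this sketch and are not Spark classes.

```scala
object LineageSketch {
  // A toy dataset that knows how to compute any of its partitions on demand.
  trait ToyRdd[A] {
    def numPartitions: Int
    def compute(partition: Int): Vector[A]
    def map[B](f: A => B): ToyRdd[B] = new Mapped(this, f)
  }

  // Source data, standing in for durable input such as a file.
  final class Source[A](partitions: Vector[Vector[A]]) extends ToyRdd[A] {
    def numPartitions: Int = partitions.size
    def compute(partition: Int): Vector[A] = partitions(partition)
  }

  // Lineage: a mapped dataset stores its parent and the function, not the data.
  final class Mapped[A, B](parent: ToyRdd[A], f: A => B) extends ToyRdd[B] {
    def numPartitions: Int = parent.numPartitions
    def compute(partition: Int): Vector[B] = parent.compute(partition).map(f)
  }

  // Computed partitions held in memory; losing an entry simulates a failed worker.
  final class Cached[A](rdd: ToyRdd[A]) {
    private val cache = scala.collection.mutable.Map.empty[Int, Vector[A]]
    var recomputations: Int = 0

    def get(partition: Int): Vector[A] =
      cache.getOrElseUpdate(partition, { recomputations += 1; rdd.compute(partition) })

    def lose(partition: Int): Unit = { cache.remove(partition); () }
  }

  // Made-up sample: two partitions, each value multiplied by 10.
  val source: Source[Int] = new Source(Vector(Vector(1, 2), Vector(3, 4)))
  val tens: Cached[Int] = new Cached(source.map(_ * 10))
}
```

After `tens.lose(1)`, the next `tens.get(1)` rebuilds only that partition from its parent; the other partition is untouched, which is the recovery behaviour the intermediate-step question above is asking about.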
What is RDD
Example code
val rdd = sc.textFile("/mnt/wikipediapagecounts.gz")
val parsedRDD = rdd.flatMap { line =>
  line.split("""\s+""") match {
    // keep only well-formed lines with four fields; the last field is ignored
    case Array(project, page, numRequests, _) => Some((project, page, numRequests))
    case _ => None
  }
}
// filter only English pages; print each page with the requests to it
parsedRDD.filter { case (project, page, numRequests) => project == "en" }
  .map { case (_, page, numRequests) => (page, numRequests) }
  .foreach { case (page, requests) => println(s"$page:$requests") }
why not rdd
Spark does not look inside lambda functions, and it does not know what the data is or what type it has, so it cannot optimize the computation for you
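A small sketch of the difference, in plain Scala and with hypothetical names: a filter written as a lambda can only be called, while the same filter written as a data structure can be walked and analyzed, for example to find out which columns it needs. This is a toy expression tree, not Spark's actual expression classes.

```scala
object ExpressionSketch {
  // A declarative predicate is plain data that an engine can inspect.
  sealed trait Expr
  final case class Col(name: String) extends Expr
  final case class Lit(value: String) extends Expr
  final case class EqualTo(left: Expr, right: Expr) extends Expr
  final case class And(left: Expr, right: Expr) extends Expr

  // Which columns does the predicate read? Knowing this allows loading only those columns.
  def referencedColumns(e: Expr): Set[String] = e match {
    case Col(name)     => Set(name)
    case Lit(_)        => Set.empty
    case EqualTo(l, r) => referencedColumns(l) ++ referencedColumns(r)
    case And(l, r)     => referencedColumns(l) ++ referencedColumns(r)
  }

  // The same filter as an opaque function: it can be run, but not looked into.
  val opaque: Map[String, String] => Boolean = row => row("project") == "en"

  // The same filter as inspectable data.
  val declarative: Expr = EqualTo(Col("project"), Lit("en"))
}
```

From `declarative` an engine can tell that only the `project` column is involved; from `opaque` it can tell nothing without running it on every row.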
DataFrame Sample
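// convert RDD -> DF with column names
val df = parsedRDD.toDF("project", "page", "numRequests")
// filter, groupBy, sum, and then agg()
// (the groupBy/agg calls are missing from these notes; the two below are an assumed
//  reconstruction and need import org.apache.spark.sql.functions.sum)
df.filter($"project" === "en")
  .groupBy($"page")
  .agg(sum($"numRequests").as("count"))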
| project | page | numRequests |
|---------|------|-------------|
| en      | 23   | 45          |
| en      | 24   | 200         |