`DataFrame` was introduced in Spark 1.3.0 to replace and generalize `SchemaRDD`. There is one significant difference between the two classes: while `SchemaRDD` extended `RDD[Row]`, `DataFrame` contains one. In most cases, where your code used a `SchemaRDD` in the past, it can now use a `DataFrame` without additional changes. However, the domain-specific language (DSL) for querying through a `SchemaRDD` without writing SQL has been strengthened considerably in the `DataFrame` API.

Two important additional classes to understand are `Row` and `Column`.
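
As a rough illustration of how `Row` and `Column` relate to a `DataFrame`, here is a minimal sketch. It assumes a Spark 1.6-era API (`SQLContext` on a local master); the `Cust` case class, the sample data and the object name are hypothetical and are not taken from the example files in this directory.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Column, DataFrame, Row, SQLContext}

object RowAndColumnSketch {
  // Hypothetical record type, used only for this illustration.
  case class Cust(id: Int, name: String, sales: Double)

  // Build a small DataFrame from an RDD of case class instances.
  def customers(sqlContext: SQLContext): DataFrame = {
    import sqlContext.implicits._
    sqlContext.sparkContext
      .parallelize(Seq(Cust(1, "Widget Co", 120.0), Cust(2, "Acme", 410.5)))
      .toDF()
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("RowAndColumnSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    val df = customers(sqlContext)

    // The DataFrame contains an RDD[Row], reachable through its rdd member.
    val rows: RDD[Row] = df.rdd
    rows.collect().foreach(r => println(s"${r.getInt(0)}: ${r.getString(1)}"))

    // A Column is an expression; comparisons and arithmetic build new Columns.
    val bigSales: Column = df("sales") > 200.0
    df.filter(bigSales).select(df("name"), (df("sales") * 2).as("doubled")).show()

    sc.stop()
  }
}
```

Each `Row` gives positional, typed access to one record, while a `Column` describes a computation that Spark applies across all records.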

This directory contains examples of using `DataFrame`, focusing on the aspects that are not strictly related to Spark SQL queries. The specifically SQL-oriented aspects are covered in the sql directory, which is a peer of this one.

| File | What's Illustrated |
|------|--------------------|
| Basic.scala | How to create a `DataFrame` from case classes, examine it and perform basic operations. Start here. |
| SimpleCreation.scala | Create essentially the same `DataFrame` as in Basic.scala from an RDD of tuples and explicit column names. |
| FromRowsAndSchema.scala | Create essentially the same `DataFrame` as in Basic.scala from an `RDD[Row]` and a schema. |
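
The three creation routes listed above can be sketched side by side. This is an illustrative sketch rather than the content of the listed files: it assumes a Spark 1.6-era `SQLContext`, and the `Cust` case class, column names and sample values are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, Row, SQLContext}
import org.apache.spark.sql.types.{DoubleType, IntegerType, StringType, StructField, StructType}

object CreationSketch {
  // Hypothetical record type, used only for this illustration.
  case class Cust(id: Int, name: String, sales: Double)

  // 1. From an RDD of case class instances: column names come from the fields.
  def fromCaseClasses(sqlContext: SQLContext): DataFrame = {
    import sqlContext.implicits._
    sqlContext.sparkContext
      .parallelize(Seq(Cust(1, "Widget Co", 120.0), Cust(2, "Acme", 410.5)))
      .toDF()
  }

  // 2. From an RDD of tuples, supplying the column names explicitly.
  def fromTuples(sqlContext: SQLContext): DataFrame = {
    import sqlContext.implicits._
    sqlContext.sparkContext
      .parallelize(Seq((1, "Widget Co", 120.0), (2, "Acme", 410.5)))
      .toDF("id", "name", "sales")
  }

  // 3. From an RDD[Row] plus an explicit schema.
  def fromRowsAndSchema(sqlContext: SQLContext): DataFrame = {
    val rows = sqlContext.sparkContext
      .parallelize(Seq(Row(1, "Widget Co", 120.0), Row(2, "Acme", 410.5)))
    val schema = StructType(Seq(
      StructField("id", IntegerType, nullable = false),
      StructField("name", StringType, nullable = true),
      StructField("sales", DoubleType, nullable = false)))
    sqlContext.createDataFrame(rows, schema)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("CreationSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    // All three should print the same column names and types.
    fromCaseClasses(sqlContext).printSchema()
    fromTuples(sqlContext).printSchema()
    fromRowsAndSchema(sqlContext).printSchema()

    sc.stop()
  }
}
```

The case class route needs the least code. The explicit schema route is the most verbose, but it is the one that still works when the column layout is only known at run time.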

| File | What's Illustrated |
|------|--------------------|
| GroupingAndAggregation.scala | Several different ways to specify aggregation. |
| Select.scala | How to extract data from a `DataFrame`. This is a good place to see how convenient the API can be. |
| DateTime.scala | Functions for querying against `DateType` and `TimestampType`. |
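
To give a flavor of the querying DSL, the following sketch selects a computed column and specifies the same kind of aggregation in two ways. It assumes a Spark 1.6-era `SQLContext`; the sample data, column names and helper names are hypothetical and do not come from the listed files.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.functions.{max, sum}

object QuerySketch {
  // Hypothetical sample data: (state, sales).
  def sales(sqlContext: SQLContext): DataFrame = {
    import sqlContext.implicits._
    sqlContext.sparkContext
      .parallelize(Seq(("CA", 120.0), ("CA", 410.5), ("NY", 99.0)))
      .toDF("state", "sales")
  }

  // Select an existing column plus a computed, renamed column.
  def withDoubledSales(df: DataFrame): DataFrame =
    df.select(df("state"), (df("sales") * 2).as("doubled"))

  // Aggregation specified with Column expressions from the functions object.
  def totalsByState(df: DataFrame): DataFrame =
    df.groupBy("state").agg(sum("sales").as("total"), max("sales").as("largest"))

  // Aggregation specified with (column name -> aggregate name) pairs.
  def averagesByState(df: DataFrame): DataFrame =
    df.groupBy("state").agg("sales" -> "avg")

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("QuerySketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    val df = sales(sqlContext)

    withDoubledSales(df).show()
    totalsByState(df).show()
    averagesByState(df).show()

    sc.stop()
  }
}
```

The `Column`-based form allows each result to be renamed with `as`, whereas the string-pair form leaves the naming of the result column to Spark.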

| File | What's Illustrated |
|------|--------------------|
| DropColumns.scala | Creating a new `DataFrame` that omits some of the original `DataFrame`'s columns. |
| DropDuplicates.scala | Create a `DataFrame` that removes duplicate rows, or optionally removes rows that are identical only on certain specified columns. |
| Range.scala | Using the `range()` methods on `SQLContext` to create simple `DataFrame`s with values from that range. |
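
The utilities in this table can be tried out with a few one-line calls. The sketch below assumes a Spark 1.6-era `SQLContext` that provides `drop`, `dropDuplicates` and `range`; the sample data and helper names are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}

object UtilitySketch {
  // Hypothetical sample data containing one fully duplicated row.
  def orders(sqlContext: SQLContext): DataFrame = {
    import sqlContext.implicits._
    sqlContext.sparkContext
      .parallelize(Seq((1, "a", 10.0), (1, "a", 10.0), (2, "a", 20.0)))
      .toDF("id", "name", "amount")
  }

  // A new DataFrame without the "amount" column; the original is unchanged.
  def withoutAmount(df: DataFrame): DataFrame = df.drop("amount")

  // Remove rows that are identical in every column.
  def distinctRows(df: DataFrame): DataFrame = df.dropDuplicates()

  // Remove rows that are identical on the "name" column only.
  def distinctByName(df: DataFrame): DataFrame = df.dropDuplicates(Seq("name"))

  // A single-column DataFrame holding the values 0 to 4 (the end is exclusive).
  def firstFive(sqlContext: SQLContext): DataFrame = sqlContext.range(0, 5)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("UtilitySketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    val df = orders(sqlContext)

    withoutAmount(df).show()
    distinctRows(df).show()
    distinctByName(df).show()
    firstFive(sqlContext).show()

    sc.stop()
  }
}
```

When duplicates are removed on a subset of columns, which of the matching rows survives should not be relied upon.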

| File | What's Illustrated |
|------|--------------------|
| ComplexSchema.scala | Creating a `DataFrame` with various forms of complex schema. Start with FromRowsAndSchema.scala for a simpler example. |
| Transform.scala | How to transform one `DataFrame` to another: written in response to a question on StackOverflow. |
| UDF.scala | How to use user-defined functions (UDFs) in queries. Note that the use of UDFs in SQL queries is covered separately in the sql directory. |
| UDT.scala | User-defined types from a `DataFrame` perspective. Depends on understanding UDF.scala. |
| UDAF.scala | A simple user-defined aggregation function (UDAF), as introduced in Spark 1.5.0. |
| DatasetConversion.scala | Explore interoperability between `DataFrame` and `Dataset`. Note that `Dataset` is covered in more detail in the dataset directory. |
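
As a small taste of the UDF support mentioned above, the following sketch wraps an ordinary Scala function with `udf` and applies it in a `select`. It assumes a Spark 1.6-era `SQLContext`; the `shout` function, the sample data and the column names are hypothetical and are not taken from UDF.scala.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.functions.udf

object UdfSketch {
  // Hypothetical sample data: (id, name).
  def names(sqlContext: SQLContext): DataFrame = {
    import sqlContext.implicits._
    sqlContext.sparkContext
      .parallelize(Seq((1, "widget"), (2, "acme")))
      .toDF("id", "name")
  }

  // Wrap an ordinary Scala function so it can be applied to Columns.
  val shout = udf((s: String) => s.toUpperCase + "!")

  // Apply the UDF to the "name" column and rename the result.
  def shouted(df: DataFrame): DataFrame =
    df.select(df("id"), shout(df("name")).as("shouted"))

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("UdfSketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)

    shouted(names(sqlContext)).show()

    sc.stop()
  }
}
```

A UDF created this way is used through the DSL only; using a UDF from SQL text requires registering it by name, which is the topic covered in the sql directory.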