A library for querying Google Analytics data with Apache Spark, for Spark SQL and DataFrames.
This library requires Spark 1.4 or later.
You can link against this library in your program at the following coordinates:
groupId: com.crealytics
artifactId: spark-google-analytics_2.10
version: 0.9.0
groupId: com.crealytics
artifactId: spark-google-analytics_2.11
version: 0.9.0
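For sbt builds, the two sets of coordinates above can usually be collapsed into a single dependency line. This is a minimal sketch, assuming the artifact resolves from the repositories your build already uses and that the build's scalaVersion is 2.10.x or 2.11.x:

```scala
// build.sbt
// %% appends the Scala binary version (_2.10 or _2.11) to the artifact name,
// so sbt picks one of the two artifacts listed above based on scalaVersion.
libraryDependencies += "com.crealytics" %% "spark-google-analytics" % "0.9.0"
```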
This package can be added to Spark using the --packages command line option. For example, to include it when starting the Spark shell:
$SPARK_HOME/bin/spark-shell --packages com.crealytics:spark-google-analytics_2.11:0.9.0
$SPARK_HOME/bin/spark-shell --packages com.crealytics:spark-google-analytics_2.10:0.9.0
This package allows querying Google Analytics reports as Spark DataFrames. The API accepts several options (see the Google Analytics developer docs for details):
serviceAccountId: an account ID for accessing the Google Analytics API ([email protected])
keyFileLocation: a key file that you have to generate from the developer console
ids: the ID of the site for which you want to pull the data
startDate: the start date for the report
endDate: the end date for the report
dimensions: the dimensions by which the data will be segmented
queryIndividualDays: fetches each day from the chosen date range individually in order to minimize sampling (only works if date is chosen as a dimension)
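The options above are plain string key/value pairs, so they can be assembled and checked before any request is made. The sketch below is one possible way to do that. `GaOptions` and `build` are hypothetical names that are not part of this library, and the only rule it enforces is the documented one that queryIndividualDays needs date among the dimensions:

```scala
// Hypothetical helper, not part of spark-google-analytics: it only collects
// the documented option names into a Map and checks one documented constraint.
object GaOptions {
  def build(
      serviceAccountId: String,
      keyFileLocation: String,
      ids: String,
      startDate: String,
      endDate: String,
      dimensions: Seq[String],
      queryIndividualDays: Boolean = false): Map[String, String] = {
    // The option list says queryIndividualDays only works when `date`
    // is one of the dimensions, so fail early if it is missing.
    require(
      !queryIndividualDays || dimensions.contains("date"),
      "queryIndividualDays requires 'date' among the dimensions")

    Map(
      "serviceAccountId" -> serviceAccountId,
      "keyFileLocation" -> keyFileLocation,
      "ids" -> ids,
      "startDate" -> startDate,
      "endDate" -> endDate,
      // Dimensions are passed as one comma-separated string.
      "dimensions" -> dimensions.mkString(","),
      "queryIndividualDays" -> queryIndividualDays.toString)
  }
}
```

The resulting map can then be handed to Spark's `DataFrameReader.options(...)` in place of a chain of individual `.option(...)` calls.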
Spark 1.4+:
Automatically infer the schema (data types); otherwise everything is assumed to be a string:
import org.apache.spark.sql.SQLContext

// `sc` is the SparkContext that the Spark shell provides.
val sqlContext = new SQLContext(sc)

// Load a Google Analytics report as a DataFrame.
val df = sqlContext.read
  .format("com.crealytics.google.analytics")
  .option("serviceAccountId", "[email protected]")
  .option("keyFileLocation", "the_key_file.p12")
  .option("ids", "ga:12345678")
  .option("startDate", "7daysAgo")
  .option("endDate", "yesterday")
  .option("dimensions", "date,browser,city")
  .option("queryIndividualDays", "true")
  .load()

df.select("browser", "users").show()
This library is built with SBT, which is automatically downloaded by the included shell script. To build a JAR file, run sbt/sbt package from the project root. The build configuration includes support for both Scala 2.10 and 2.11.