distributed-spark-analysis

Abstract

Purchasing nutritious food products can be overwhelming given the enormous number of items available in grocery stores today. With the rise of ecommerce and data collection, data analytics can help consumers find healthy foods and make the best decisions for their diet. By leveraging data available from the online grocery delivery service Instacart and the United States Department of Agriculture, we analyzed the nutritional makeup of food products and used machine learning techniques to recommend items to users based on their interests. We analyzed the dietary and nutritional strategies of users, created recommendations with collaborative filtering, and broke food down into its primary nutrients, which we explored via principal component analysis. Our findings indicate that Instacart users make relatively healthy choices, that matrix factorization is an effective approach to creating user recommendations, and that categorizing foods by their nutrients opens the door to further research.

[paper]

Getting Started

This project uses Apache Spark for general-purpose cluster computing, the Hadoop Distributed File System (HDFS) for primary storage, and sbt as the build tool for the Scala code.

It is assumed that HDFS is configured before the application is run. Apache Spark must be downloaded from Apache's website; version 2.4.1 was used for the initial commit.

To use the provided run.sh script, add the following environment variables and aliases to the local ~/.bashrc file, then source it:

export SPARK_HOME=${HOME}/spark-2.4.1-bin-hadoop2.7
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:/usr/local/sbt/bin

alias hstartdfs="$HADOOP_HOME/sbin/start-dfs.sh"
alias hstopdfs="$HADOOP_HOME/sbin/stop-dfs.sh"

alias sstartall="$SPARK_HOME/sbin/start-all.sh"
alias sstopall="$SPARK_HOME/sbin/stop-all.sh"
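
Since run.sh depends on these settings, a quick pre-flight check can catch a missing variable before a job is submitted. The following is only a minimal sketch: check_dir and check_env are hypothetical helper names that are not part of this repository, and the sketch assumes that an existing directory for each variable and an sbt executable on PATH are sufficient.

```shell
#!/bin/sh
# Hypothetical pre-flight check; these helpers are NOT part of the repository.

# Report whether the variable named $1, whose value is $2,
# is set and points at an existing directory.
check_dir() {
    if [ -z "$2" ]; then
        echo "missing: $1 is not set"
        return 1
    elif [ ! -d "$2" ]; then
        echo "missing: $1 points to '$2', which is not a directory"
        return 1
    fi
    echo "ok: $1=$2"
}

# Check both home directories and confirm sbt is reachable through PATH.
# Returns non-zero if any check fails.
check_env() {
    rc=0
    check_dir SPARK_HOME "${SPARK_HOME:-}" || rc=1
    check_dir HADOOP_HOME "${HADOOP_HOME:-}" || rc=1
    if command -v sbt >/dev/null 2>&1; then
        echo "ok: sbt found at $(command -v sbt)"
    else
        echo "missing: sbt is not on PATH"
        rc=1
    fi
    return "$rc"
}
```

After sourcing ~/.bashrc, calling check_env before run.sh reports each setting on its own line and returns a non-zero status if anything is missing.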

Data Access

The Instacart dataset contains 3 million Instacart orders and is documented here.

Data from the USDA Food Composition Databases was downloaded as BFPD ASCII CSV files.
