This project was completed for my Data Science Practicum I course. It combines my fascination with Twitter data with my interests in Information Technology, Data Engineering, and Data Science. I have deliberately put Data Engineering before Data Science: I first built an IT platform to work on, and then used that platform to apply Data Science techniques to explore Information Technology through Twitter data.
The intended outcome of this project is twofold:
- Explore topics centered around Information Technology
- Visualize Sentiment around Information Technology
My intent for this project was to run PySpark on Hadoop. For a previous project I ran a cluster of three virtual Linux machines using my Windows laptop as the host machine, and found that this setup barely got me through the class. This time I decided to purchase, at auction, a dual-core machine with a 1 TB hard drive and 16 GB of RAM. On this machine I set up the following:
- Ubuntu Host OS
- Four Ubuntu Virtual Machines
- Virtual machine 1 was used to collect tweets
- Virtual machines 2 through 4 were set up as a Hadoop cluster
- VirtualBox, Ambari-Vagrant, and Anaconda were used to set up the environment
- The environment consists of PySpark running on a 3-node cluster, using Jupyter Notebooks
My dataset consists of tweets containing the following terms:
- data
- java
- machine learning
- iot
- computer
- computer programmer
- database administrator
- network engineer
- network administrator
- data scientist
- systems
- systems engineer
- data analyst
- technology
- data architect
- etl
- etl architect
- web programmer
- automation engineer
- data processing
- devops
- cloud
- application engineer
- software engineer
- software developer
- developer
- information architect
- programmer
- security analyst
- business intelligence
- enterprise architect
- solution architect
- data warehouse
- ai
- robotics
- information technology
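The collection script itself is shown later only in fragments, but the terms above map directly onto the Twitter Streaming API's `track` parameter, which takes a single comma-separated string of keywords. A minimal sketch of building that string (the function name `build_track` is illustrative, not part of the original collector, and only a subset of the terms is shown):

```python
def build_track(terms):
    """Join search terms into the comma-separated 'track' value that the
    Twitter Streaming API uses for keyword filtering."""
    return ",".join(terms)

# Subset of the full search-term list above.
terms = ["data", "machine learning", "iot", "data scientist"]
track = build_track(terms)
print(track)  # data,machine learning,iot,data scientist
```

In the python-twitter-examples script referenced at the end of this writeup, a string like this is passed to the stream's `statuses.filter(track=...)` call.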
My tweets consisted of the following fields:
- timetext
- tweet_id
- tweet_source
- tweet_truncated
- tweet_text
- tweet_user_screen_name
- tweet_user_id
- tweet_user_location
- tweet_user_description
- tweet_user_followers_count
- tweet_user_statuses_count
- tweet_user_time_zone
- tweet_user_geo_enabled
- tweet_user_lang
- tweet_coordinates_coordinates
- tweet_place_country
- tweet_place_country_code
- tweet_place_full_name
- tweet_place_name
- tweet_place_type
For this project I really only needed tweet_text, but I wanted to bring back as much information as I could in case I later wanted to do time-based analysis or use the geo fields to break down sentiment by location.
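The collector flattened each nested tweet JSON object into these columns. A sketch of how that flattening might look (the `get_nested` helper and the sample payload are hypothetical; the original script's exact extraction code is not shown here):

```python
def get_nested(d, *keys):
    """Walk nested dict keys, returning '' when any level is missing,
    so absent geo/place fields become empty columns instead of errors."""
    for k in keys:
        if not isinstance(d, dict) or k not in d or d[k] is None:
            return ""
        d = d[k]
    return d

# Hypothetical, trimmed-down streaming payload.
tweet = {"id_str": "123", "text": "learning about devops",
         "user": {"screen_name": "someone", "lang": "en"}}

fields = [get_nested(tweet, "id_str"),
          get_nested(tweet, "text"),
          get_nested(tweet, "user", "screen_name"),
          get_nested(tweet, "user", "lang"),
          get_nested(tweet, "place", "country")]
line = ",".join(str(f) for f in fields)
print(line)  # 123,learning about devops,someone,en,
```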
Initially, the only data cleaning I did was to remove commas and newlines from within each field, reasoning that users could add commas and newlines to free-form fields. Little did I know that this was just the beginning; in hindsight, I should have known that more data cleaning up front would be needed.
tweet_text = tweet_text.replace(',',' ').replace('\n', ' ')
In addition, I only pulled back English tweets, or so I thought: the filter below actually tests whether the lang field is null rather than comparing its value to "en".
isnull(tweet["user"]["lang"]) == "en"
After pulling back approximately 13 million tweets, I realized further cleaning would be needed. First I updated my original Python script with additional cleaning, and then I created a second script to clean the data I had already retrieved. The updated collector keeps only letters, numbers, and spaces, and, since each record should have 20 fields, writes only records with exactly 20 fields:
tweet_text = re.sub('[^ a-zA-Z0-9]', '', tweet_text)
if tweet_line_len == 20: twitter_jobs_raw_data_file.write(tweet_line + '\n')
The second script simply read the previously accumulated files, applied the above cleaning, and wrote the cleaned files back out to a different directory.
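A sketch of the per-record logic that second pass could apply, assuming the cleaning is done field by field so the comma delimiters survive (`clean_line` is an illustrative name, not the original script):

```python
import re

def clean_line(line, expected_fields=20):
    """Strip everything except letters, digits, and spaces from each
    comma-separated field; drop records without the expected field count."""
    fields = line.rstrip("\n").split(",")
    if len(fields) != expected_fields:
        return None
    cleaned = [re.sub("[^ a-zA-Z0-9]", "", f) for f in fields]
    return ",".join(cleaned)

good = ",".join("fie!ld%d" % i for i in range(20))
print(clean_line(good))            # punctuation stripped, 20 fields kept
print(clean_line("too,few,fields"))  # None: record is dropped
```

The real script would wrap this in a loop that reads each accumulated file and writes the cleaned lines to the new directory.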
The last bit of data cleaning needed was discovered further into the project. The topic modeler used, LDA, would only model single word topics. I had both single and multiple word search terms. I wanted to make sure that my multiple word search terms would be candidates for topics. In order to accomplish this I wrote a third script that read all previously cleaned data and added an underscore to all of my multiple word search terms:
machine learning became machine_learning
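The substitution itself is straightforward; a sketch with a subset of the multi-word terms (`underscore_terms` is an illustrative name, not the original third script):

```python
# Subset of the multi-word search terms from the list above.
MULTI_WORD_TERMS = ["machine learning", "computer programmer", "data scientist"]

def underscore_terms(text, terms=MULTI_WORD_TERMS):
    """Replace each multi-word search term with its underscored form so
    LDA can treat it as a single token."""
    for term in terms:
        text = text.replace(term, term.replace(" ", "_"))
    return text

print(underscore_terms("a machine learning data scientist"))
# a machine_learning data_scientist
```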
The following libraries were used to build a Spark Context on my Hadoop Cluster, import the tweet data, topic model the tweet data, conduct sentiment analysis on the tweet data, and visualize that sentiment analysis:
- import findspark
- import string
- import re as re
- import nltk
- import time
- from pyspark.sql import SQLContext
- from pyspark.sql.types import *
- from pyspark.sql.functions import monotonically_increasing_id
- from pyspark.mllib.util import MLUtils
- from pyspark.ml.feature import RegexTokenizer, Tokenizer, StopWordsRemover, CountVectorizer, CountVectorizerModel
- from pyspark.mllib.clustering import LDA, LDAModel
- from nltk.corpus import stopwords
- from pyspark.mllib.linalg import Vector as oldVector, Vectors as oldVectors
- from pyspark.ml.linalg import Vector as newVector, Vectors as newVectors
- from pyspark.ml.feature import IDF
- import numpy as np
- import matplotlib.pyplot as plt
- import pyspark.sql.functions as func
This will be used for SQL-like distributed data processing. As I get more familiar with which technology to use where, I will switch between PySpark RDDs, PySpark DataFrames, and pandas DataFrames.
sqlContext = SQLContext(sc)
This is a three-node virtual cluster. Read in the data from Hadoop:
ITData = sc.textFile("hdfs:////user/vagrant/practicum/input")
By default, data is partitioned based on the data size. While I have over 8 million cleaned records collected, I have limited this run to approximately 100,000 records because I have had issues with my Hadoop cluster processing the full dataset. I will need to gain a better understanding of sizing needs as they relate to data volume. 100,000 records produces 6 partitions.
ITData.getNumPartitions()
Further clean the tweets, split them into individual words, and number them by adding an index.
tokens = (tweet.map(lambda document: document.strip().lower())
               .map(lambda document: re.split(" ", document))
               .map(lambda word: [x for x in word if x.isalpha()])
               .map(lambda word: [x for x in word if len(x) > 3])
               .map(lambda word: [x for x in word if x not in StopWords])
               .zipWithIndex())
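The per-document logic in that pipeline can be checked locally without Spark. A plain-Python mirror of the same steps (the stop-word set here is a small stand-in for the NLTK English list):

```python
import re

STOP_WORDS = {"this", "about", "with"}  # stand-in for nltk's English stop words

def tokenize(document):
    """Mirror the RDD pipeline on one tweet: lowercase, split on spaces,
    keep alphabetic tokens longer than 3 characters, drop stop words."""
    words = re.split(" ", document.strip().lower())
    words = [w for w in words if w.isalpha()]
    words = [w for w in words if len(w) > 3]
    return [w for w in words if w not in STOP_WORDS]

print(tokenize("Learning about data and devops this week"))
# ['learning', 'data', 'devops', 'week']
```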
cv = CountVectorizer(inputCol="tweet_words", outputCol="raw_features", vocabSize=5000, minDF=10.0)
cvmodel = cv.fit(tweet_df)
result_cv = cvmodel.transform(tweet_df)
idf = IDF(inputCol="raw_features", outputCol="features")
idfModel = idf.fit(result_cv)
result_tfidf = idfModel.transform(result_cv)
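For reference, Spark MLlib's IDF applies the smoothed formula log((m + 1) / (d + 1)), where m is the number of documents and d is the document frequency of a term. Worked out in plain Python, it shows why near-ubiquitous terms are down-weighted toward zero:

```python
import math

def spark_idf(num_docs, doc_freq):
    """Spark MLlib's smoothed IDF: log((m + 1) / (d + 1))."""
    return math.log((num_docs + 1) / (doc_freq + 1))

# A term appearing in nearly every tweet gets a weight near zero,
# while a rare term keeps a large weight.
print(spark_idf(100_000, 99_999))  # close to 0
print(spark_idf(100_000, 9))       # large weight for a rare term
```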
num_topics = 10
max_iterations = 20
lda_model = LDA.train(rs_df['index', 'raw_features'].rdd.map(list), k=num_topics, maxIterations=max_iterations)
for topic in range(len(topics_final)):
    print("Topic" + str(topic) + ":")
    for term in topics_final[topic]:
        print(term)
    print('\n')
The above relates topics to the terms I searched for in Twitter.
For sentiment analysis, I would like to rate the actual search terms. For this I will build a Python array with those search terms:
search_terms = ["machine_learning", "computer_programmer", "database_engineer", "network_engineer",
"data_scientist", "systems_engineer", "data_analyst", "data_architect", "etl_architect",
"web_programmer", "automation_engineer", "data_processing", "application_engineer",
"software_engineer", "software_developer", "information_architect", "security_analyst",
"business_intelligence", "enterprise_architect", "solution_architect", "data_warehouse",
"information_technology", "data", "java", "iot", "computer", "systems", "technology",
"etl", "devops", "cloud", "developer", "programmer", "ai"]
This will relate the search term to the tweet so later I can relate the sentiment of the tweet to the search term.
def SearchTopics(topics, tweet_text):
    for term in topics:
        result = tweet_text.find(term)
        if result > -1:
            return term, tweet_text
    return 'NA', tweet_text
While removing stop words helps obtain valid topics, it does not help with sentiment analysis. So, with topics in hand, we go back to the tweets where stop words have not been removed, search each tweet for topics, and keep only the tweets that match. SearchTopics returns both the topic and the related tweet; sentiment analysis is done on these tweets.
topic_tweet = tweet.map(lambda x: SearchTopics(search_terms, x)).filter(lambda x: x[0] != 'NA')
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')
This function takes a topic and its related tweet as input, performs sentiment analysis, and outputs the topic, the tweet, and the sentiment. Note that it returns only the compound portion of the sentiment scores.
def print_sentiment_scores(topic, sentence):
    snt = SentimentIntensityAnalyzer().polarity_scores(sentence)
    print("{:-<40} {}".format(sentence, str(snt)))
    print(str(snt))
    return (topic, sentence, str(snt.get('compound')))
topic_tweet_sentiment = topic_tweet.map(lambda x: print_sentiment_scores(x[0], x[1]))
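Downstream, visualizing sentiment per search term means aggregating the compound scores by topic. A pandas-free sketch of that aggregation (the sample rows are made up; the real scores come from the RDD above):

```python
from collections import defaultdict

def mean_sentiment_by_topic(rows):
    """Average the compound sentiment scores per topic, mirroring the
    groupby used later for visualization."""
    scores = defaultdict(list)
    for topic, _tweet, compound in rows:
        scores[topic].append(float(compound))
    return {t: sum(v) / len(v) for t, v in scores.items()}

rows = [("devops", "loving devops", "0.6"),
        ("devops", "devops is fine", "0.2"),
        ("cloud", "cloud outage again", "-0.4")]
print(mean_sentiment_by_topic(rows))
```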
While I only pulled back tweets containing the above search terms, I still wanted to pick out topics from those tweets. I used MLlib's LDA to achieve this.
Understanding data volumes, tool choice, and how both affect analysis performance and capability is very important. With the data volumes I initially set out to process, I would have been better served running a very small sample locally and then provisioning a cluster in the cloud to run the complete dataset. Having said that, running a subset of the data did produce results.
For the period that I collected tweets, the predominant topics around Information Technology were Facebook and China. Additionally, I found sentiment to be mostly positive.
Databricks. Topic Modeling with Latent Dirichlet Allocation. In Databricks. Retrieved 18:00, June 27, 2018, from https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/3741049972324885/3783546674231782/4413065072037724/latest.html
Bergvca. (2016) Example on how to do LDA in Spark ML and MLLib with python. In GitHubGist. Retrieved 18:00, June 27, 2018, from https://gist.github.com/Bergvca/a59b127afe46c1c1c479
Soumya Ghosh. (2018, March 17). Topic Modeling with Latent Dirichlet Allocation (LDA) in Pyspark. In Medium. Retrieved 18:00, June 27, 2018, from https://medium.com/@connectwithghosh/topic-modelling-with-latent-dirichlet-allocation-lda-in-pyspark-2cb3ebd5678e
Hacertilbec. (2016, May 7). LDA-spark-python. In GitHub. Retrieved 18:00, June 27, 2018, from https://github.com/hacertilbec/LDA-spark-python/blob/master/SparkLDA.py
Shane Lynn. Summarising, Aggregating, and Grouping data in Python Pandas. In Blog. Retrieved 18:00, June 27, 2018, from https://www.shanelynn.ie/summarising-aggregation-and-grouping-data-in-python-pandas/
Python, R, and Linux Tips. (2018, March 14). How to Make Boxplots in Python with Pandas and Seaborn. In Python, R, and Linux Tips. Retrieved 18:00, June 27, 2018, from http://cmdlinetips.com/2018/03/how-to-make-boxplots-in-python-with-pandas-and-seaborn/
Chris Moffitt. (2014, October 26). Simple Graphing with IPython and Pandas. In Practical Business Python. Retrieved 18:00, June 27, 2018, from http://pbpython.com/simple-graphing-pandas.html
ideoforms. (2015, April 25). python-twitter-examples. In GitHub. Retrieved 18:00, June 27, 2018, from https://github.com/ideoforms/python-twitter-examples/blob/master/twitter-stream-search.py