This is an assignment for the Software Architecture class at the Technische Hochschule Nürnberg.
This assignment covers some of the more advanced concepts of the Java 8 `Stream` API:
- Map-Reduce
- Collecting
- Grouping-By
We'll use Map-Reduce to implement the classic word count example. The sample repository contains about 3,000 tweets by Donald Trump, which we will analyze in this assignment.
A clever data scientist discovered that most of the angrier tweets came from Android, while the nicer ones were written on an iPhone (first article and follow-up article).
We will see whether we can use Java's `Stream`s to group the tweets by the kind of client that was used to create them.
- Create a fork of this repository (button in the upper right corner)
- Clone the project (get the link by clicking the green Clone or download button)
- Import the project into your IntelliJ
- Read the whole assignment spec!
Remark: the given test suite is incomplete and will not succeed right after checkout!
To be able to analyze the tweets, we need a generator that loads the given tweets from a JSON file and exposes them as a `Stream`.
The following UML shows the class structure of the generator and the factory function `fromJson`.
Remark: the UML is not complete but is just meant as an implementation guideline and for orientation.
Hint: the dependency on GSON is already added to your `build.gradle`, and GSON can be used to deserialize tweets:

```java
Gson gson = new Gson();
Tweet[] tweets = gson.fromJson(reader, Tweet[].class);
```

where `reader` is an instance of the abstract class `Reader`.
To access files from the `resources` folder, use the `Class.getResourceAsStream` method:

```java
getClass().getResourceAsStream("/path/to/trump_tweets.json");
```
Side note: this is only possible if you are in a non-`static` context; otherwise you have to write something like this: `MyClass.class.getResourceAsStream(...)`.
Implement the `TweetStreamGenerator` using GSON; the class already contains a `List<Tweet>` to store the tweets and to obtain the stream from.
Advanced variants:
- Use the try-with-resources statement for the deserialization of the tweets (see the sketch below)
- For a more functional way, read up on this Gist.
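Putting the hints together, a minimal sketch of the generator could look like this. It uses the placeholder resource path from above and a hypothetical accessor name `getTweetStream()`; check the given stubs for the actual path and signatures:

```java
import com.google.gson.Gson;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Stream;

public class TweetStreamGenerator {

    private final List<Tweet> tweets;

    public TweetStreamGenerator() {
        // try-with-resources closes the reader even if the deserialization fails
        try (Reader reader = new InputStreamReader(
                getClass().getResourceAsStream("/path/to/trump_tweets.json"))) {
            this.tweets = Arrays.asList(new Gson().fromJson(reader, Tweet[].class));
        } catch (IOException e) {
            throw new IllegalStateException("Could not load the tweets", e);
        }
    }

    public Stream<Tweet> getTweetStream() {
        return tweets.stream();
    }
}
```

Wrapping the checked `IOException` in an unchecked exception keeps the constructor signature clean; the given stubs may of course handle this differently.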
Up to now, `Stream`s are nice to have, but printing all results to the command line with the terminal operation `forEach(...)` is not really practical.
To be able to process the data further, we need to collect it, e.g. into a `List<>`, a `Set<>`, or any other `Collection`.
Fortunately, the `Stream` API already defines a method `.collect(...)`, and the JDK contains the utility class `Collectors`, which defines the most common collectors.
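For example, collecting the results of a stream pipeline into a `List` looks like this (a hypothetical snippet, not part of the assignment code):

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// collect(...) is a terminal operation; Collectors.toList() is the most common collector
List<Integer> lengths = Stream.of("Hello", "World")
        .map(String::length)
        .collect(Collectors.toList());
// lengths is now [5, 5]
```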
Once the `OfflineTweetStreamGenerator` works, we can start to analyze the tweets in the class `TrumpTweetStats`.
The following UML contains the signatures of the methods to be implemented in this assignment.
The method stubs are given already; the UML is only meant to keep the signatures in mind while you're reading the spec.
The first two tasks are to implement the following methods:
- `calculateSourceAppStats`
- `calculateTweetsBySourceApp`
The first method, `calculateSourceAppStats`, groups the tweets by the app they were created with and counts how many tweets were created with each app. In SQL you would write it like this:

```sql
SELECT source, count(*) FROM tweets GROUP BY source
```
If you want to try it on your own: the repository contains a SQL script that creates a `tweet` table and inserts all the tweets into it.
The script has been tested on MSSQL and PostgreSQL, but if you want to use it on MySQL or MariaDB, some additional work might be necessary.
Every `Tweet` instance has two methods to access the source:
- `getSource()`
- `getSourceApp()`

The first one returns a string like this: `<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>`.
The second one extracts the actual name of the app: `Twitter for iPhone`.
It does not matter for the assignment which one you choose, but the second one is a little prettier when the result is printed.
Side note: the task is meant to be solved with `collect`, but it's also possible using `reduce`!
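A minimal sketch using `collect`, assuming a `Map<String, Long>` return type (which `Collectors.counting()` produces; check the given stubs for the actual signature):

```java
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public static Map<String, Long> calculateSourceAppStats(Stream<Tweet> tweets) {
    // groupingBy builds one map entry per source app;
    // counting() aggregates each group, mirroring the SQL "GROUP BY source" with "count(*)"
    return tweets.collect(Collectors.groupingBy(Tweet::getSourceApp, Collectors.counting()));
}
```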
The method `calculateTweetsBySourceApp` is very similar to `calculateSourceAppStats`, but instead of just counting the tweets, it collects them as a `Set<Tweet>` for further analysis.
While `calculateSourceAppStats` was very easy to implement in SQL, this method cannot be expressed in SQL, because SQL has no notion of a map of collections (i.e. nested result sets)!
Side note: the task is meant to be solved with `collect`, but it's also possible using `reduce`!
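Again a minimal sketch under the same assumptions as above; `Collectors.toSet()` replaces `counting()`:

```java
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public static Map<String, Set<Tweet>> calculateTweetsBySourceApp(Stream<Tweet> tweets) {
    // same grouping as before, but toSet() keeps the tweets themselves
    // instead of collapsing each group into a count
    return tweets.collect(Collectors.groupingBy(Tweet::getSourceApp, Collectors.toSet()));
}
```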
A classical algorithm used to process huge amounts of data is Map-Reduce. As the name already indicates, the algorithm consists of two steps:
- Map - transform the data (in parallel)
- Reduce - retrieve all partial results and aggregate them
If you're looking for examples of Map-Reduce, the first hit will most likely be the word count problem. It's relatively simple to implement as there's not much transformation required and it demonstrates the concept very well.
We want to analyze which words are the most common in the given tweets. The following flow chart is meant as orientation for how to implement the Map-Reduce algorithm.
The text of the tweet can be split like this:

```java
String[] split = "Hello World".split("( )+");
```
The `reduce` method requires a so-called accumulator. For this method, an instance of `HashMap<>` or `LinkedHashMap<>` seems to be a good idea.
The next part is the reduction step and should be an instance of `BiFunction<>`: a function where the accumulator and a single value are passed in, and the accumulator is returned after the value has been processed (e.g. inserted into a list).
(The last part is a combiner that combines two accumulator values, but you won't need it this time.)
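Putting the three parts together, a word count via `reduce` could look like the following sketch. It assumes the tweet text is accessible via a `getText()` accessor (check the actual `Tweet` class); note that mutating the identity map like this is only safe for sequential streams, so for parallel processing `collect` would be the cleaner fit:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.stream.Stream;

public static Map<String, Integer> calculateWordCount(Stream<Tweet> tweets) {
    return tweets
            // map step: split every tweet text into its words
            .flatMap(tweet -> Stream.of(tweet.getText().split("( )+")))
            // reduce step: accumulator, reduction function, combiner
            .reduce(new HashMap<>(),
                    (acc, word) -> {
                        // BiFunction: fold a single word into the accumulator
                        acc.merge(word, 1, Integer::sum);
                        return acc;
                    },
                    (left, right) -> {
                        // combiner: merges two partial accumulators
                        // (only invoked for parallel streams)
                        right.forEach((word, count) -> left.merge(word, count, Integer::sum));
                        return left;
                    });
}
```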
Debugging hint: the latest IntelliJ IDEA ships with the plugin Java Stream Debugger. The plugin visualizes how the stream is transformed step by step (including the actual data). That's very helpful if something happens that you don't expect!
Another hint: the given 3225 tweets are a few too many to debug your `Stream` if something is going wrong. It may help to `limit` the `Stream` you're passing to the method `calculateWordCount` to e.g. 200 elements!
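For example (using the hypothetical generator and method names from the sketches above):

```java
// debug with a small sample instead of all 3225 tweets
Stream<Tweet> sample = new TweetStreamGenerator().getTweetStream().limit(200);
Map<String, Integer> counts = TrumpTweetStats.calculateWordCount(sample);
```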
Last but not least: the Map-Reduce algorithm is a little bit tricky when you implement it for the first time. If you're stuck, Google can help you!
There are already some unit tests but it might be a good idea to extend the test suite.