-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathvoopter.html
277 lines (254 loc) · 19.1 KB
/
voopter.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
<!DOCTYPE html>
<html>
<head>
<title>Exploration of voopter airfare data - datawerk</title>
<meta charset="utf-8" />
<link href="https://buhrmann.github.io/theme/css/bootstrap-custom.css" rel="stylesheet"/>
<link href="https://buhrmann.github.io/theme/css/pygments.css" rel="stylesheet"/>
<link href="https://buhrmann.github.io/theme/css/style.css" rel="stylesheet" />
<link href="//maxcdn.bootstrapcdn.com/font-awesome/4.2.0/css/font-awesome.min.css" rel="stylesheet">
<link rel="shortcut icon" type="image/png" href="https://buhrmann.github.io/theme/css/logo.png">
<meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1">
<meta name="author" contents="Thomas Buhrmann"/>
<meta name="keywords" contents="datawerk, R,visualization,exploratory,sql,shiny,"/>
<script>
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','//www.google-analytics.com/analytics.js','ga');
ga('create', 'UA-56071357-1', 'auto');
ga('send', 'pageview');
</script> </head>
<body>
<div class="wrap">
<div class="container-fluid">
<div class="header">
<div class="container">
<nav class="navbar navbar-default navbar-fixed-top" role="navigation">
<div class="navbar-header">
<button type="button" class="navbar-toggle collapsed" data-toggle="collapse" data-target=".navbar-collapse">
<span class="sr-only">Toggle navigation</span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
</button>
<a class="navbar-brand" href="https://buhrmann.github.io">
<!-- <span class="fa fa-pie-chart navbar-logo"></span> datawerk -->
<span class="navbar-logo"><img src="https://buhrmann.github.io/theme/css/logo.png" style=""></img></span>
</a>
</div>
<div class="navbar-collapse collapse">
<ul class="nav navbar-nav">
<!--<li><a href="https://buhrmann.github.io/archives.html">Archives</a></li>-->
<li><a href="https://buhrmann.github.io/posts.html">Blog</a></li>
<li><a href="https://buhrmann.github.io/pages/cv.html">Interactive CV</a></li>
<li class="dropdown">
<a href="#" class="dropdown-toggle" data-toggle="dropdown">Data Reports<span class="caret"></span></a>
<ul class="dropdown-menu" role="menu">
<!--<li class="divider"></li>
<li class="dropdown-header">Data Science Reports</li>-->
<li >
<a href="https://buhrmann.github.io/p2p-loans.html">Interest rates on <span class="caps">P2P</span> loans</a>
</li>
<li >
<a href="https://buhrmann.github.io/activity-data.html">Categorisation of inertial activity data</a>
</li>
<li >
<a href="https://buhrmann.github.io/titanic-survival.html">Titanic survival prediction</a>
</li>
</ul>
</li>
<li class="dropdown">
<a href="#" class="dropdown-toggle" data-toggle="dropdown">Data Apps<span class="caret"></span></a>
<ul class="dropdown-menu" role="menu">
<!--<li class="divider"></li>
<li class="dropdown-header">Data Science Reports</li>-->
<li >
<a href="https://buhrmann.github.io/elegans.html">C. elegans connectome explorer</a>
</li>
<li >
<a href="https://buhrmann.github.io/dash+.html">Dash+ visualization of running data</a>
</li>
</ul>
</li>
</ul>
</div>
</nav>
</div>
</div><!-- header -->
</div><!-- container-fluid -->
<div class="container main-content">
<div class="row row-centered">
<div class="col-centered col-max col-min col-sm-12 col-md-10 col-lg-10 main-content">
<section id="content" class="article content">
<header>
<span class="entry-title-info">Sep 13 · <a href="https://buhrmann.github.io/category/data-posts.html">Data Posts</a></span>
<h2 class="entry-title entry-title-tight">Exploration of voopter airfare data</h2>
</header>
<div class="entry-content">
<p>I’ve recently started working as a data science freelancer for <a href="https://voopter.com.br">voopter.com.br</a>, helping them analyze the data generated by airfare searches on their website. Voopter is a metasearch engine for flights from and to Brazil. The first thing I did was to create an interactive dashboard in R and shiny for some explorative statistics of the millions of seaches performed by users of their website (which has already led to more specific business-driven questions).</p>
<p>The dashboard provides a quick and easy way to filter and aggregate the data, which is stored in
an <span class="caps">SQL</span> database. The idea is to start with a destination of interest (which could be a specific city or country) and optionally a particular point of departure. For the given destination (or route in the latter case), the dashboard then allows for the selection of a particular date range, and level of aggregation (year, month, weekday etc.) and displays the most useful descriptive statistics. Here are some examples of interesting results.</p>
<h2>Overview of origins</h2>
<p>The first graph displayed, after having selected a destination of interest, is an overview of the most popular origins from which people intend to depart:</p>
<p><img src="/images/voopter/top_origins.png" alt="Top n origins"/></p>
<p>Here I’ve removed the origin names on the y-axis, as the data is proprietary. The origins are ordered by their frequency of searches (right), and are shown along with their overall price distribution (left). The violin plot shows the relative number of returned flights at the given price point, their 25 and 75 percentiles (ends of white horizontal bars), and the median price (red). Several observations can be made already from these distributions. For example, the plot is produces after removal of the top 0.5 percentile of prices. Still, prices exhibit a long tail of expensive flights, very different from the majority of flights. A simple average price for a given route is therefore not very informative (the median shown is more useful). Also, some distributions seem to be bimodal. Though this cannot be deduced from this plot alone, this is due to big differences in fares for flights made around popular holidays and those during low season.</p>
<h2>Factors influencing price</h2>
<p>Next we can look at the way a fare depends on the time of year, the airline, or how far in advance one is booking. For example, choosing to aggregate by calendar week, the dashboard will produce the following graph:</p>
<p><img src="/images/voopter/aggr_week.png" alt="Aggregate by calendar week"/></p>
<p>Here, a boxplot summarizes the price distribution for each week, while a background scatter plot shows the actual fares color-coded by the number of days between the booking and departure (“advance”). It is obvious that prices increase around the time of popular holidays, like the summer and christmas. What’s less obvious is the fact that people tend to try and book their christmas flights on relatively short notice, while they seem to plan their summer holidays more in advance (with a probability of overspending on their christmas flights). </p>
<p>Median prices can also be visualized on a per-day basis in a calendar view, which more easily picks out shorter popular holiday periods, like the carnival in Brazil in February:</p>
<p><img src="/images/voopter/price_cal.png" alt="Avg price calendar"/></p>
<p>The dashboard also creates a boxplot similar to that above to summarize prices for different airlines:</p>
<p><img src="/images/voopter/aggr_airline.png" alt="Aggregate by airline"/></p>
<p>Clearly, savings can be found on average by selecting the cheapest airline. Also, for each airline their most expensive flights are those around christmas and the cheapest those during the weeks before the summer holidays.</p>
<p>Next, what is the best time to book, i.e. how far in advance would it be best to start looking for flights to the selected destination? The following scatter plot shows fare prices as a function of advance and color-coded by the calendar week of departure:</p>
<p><img src="/images/voopter/price_adv.png" alt="Price vs. advance"/></p>
<p>A common pattern, independent of the particular destination (in most cases), is that prices tend to decrease the further in advance a booking is made (see average price indicated by red line). However, the best time to book usually seems to be 3 to 4 months in advance (depending on the destination), after which fares tend to increase again. The plot also illustrates again the curious clustering of christmas flights that are being searched for only a few days or weeks before departure (while cheaper christmas flights, on average, can theoretically be found booking months in advance).</p>
<p>Overall, there are some clear trends describing the distribution of prices for flights to a given destination. These are rather simple to capture in a model, which allows both for the prediction of prices ranges, as well as recommendations as to the best time to fly and book (inference). </p>
</div><!-- /.entry-content -->
<footer class="post-info">
Published on <span class="published">September 13, 2015</span><br>
Written by <span class="author">Thomas Buhrmann</span><br>
Posted in <span class="label label-default"><a href="https://buhrmann.github.io/category/data-posts.html">Data Posts</a></span>
~ Tagged
<span class="label label-default"><a href="https://buhrmann.github.io/tag/r.html">R</a></span>
<span class="label label-default"><a href="https://buhrmann.github.io/tag/visualization.html">visualization</a></span>
<span class="label label-default"><a href="https://buhrmann.github.io/tag/exploratory.html">exploratory</a></span>
<span class="label label-default"><a href="https://buhrmann.github.io/tag/sql.html">sql</a></span>
<span class="label label-default"><a href="https://buhrmann.github.io/tag/shiny.html">shiny</a></span>
</footer><!-- /.post-info -->
</section>
<div class="blogItem">
<h2>Comments</h2>
<div id="disqus_thread"></div>
<script type="text/javascript">
var disqus_shortname = 'datawerk';
var disqus_title = 'Exploration of voopter airfare data';
var disqus_identifier = "voopter.html";
(function() {
var dsq = document.createElement('script');
dsq.type = 'text/javascript';
dsq.async = true;
//dsq.src = 'http://' + disqus_shortname + '.disqus.com/embed.js';
dsq.src = '//' + disqus_shortname + '.disqus.com/embed.js';
(document.getElementsByTagName('head')[0] ||
document.getElementsByTagName('body')[0]).appendChild(dsq);
})();
</script>
<noscript>
Please enable JavaScript to view the
<a href="http://disqus.com/?ref_noscript=datawerk">
comments powered by Disqus.
</a>
</noscript>
</div>
</div>
</div><!-- row-->
</div><!-- container -->
<!-- <div class="push"></div> -->
</div> <!-- wrap -->
<div class="container-fluid aw-footer">
<div class="row-centered">
<div class="col-sm-3 col-sm-offset-1">
<h4>Author</h4>
<ul class="list-unstyled my-list-style">
<li><a href="http://www.ias-research.net/people/thomas-buhrmann/">Academic Home</a></li>
<li><a href="http://github.com/synergenz">Github</a></li>
<li><a href="http://www.linkedin.com/in/thomasbuhrmann">LinkedIn</a></li>
<li><a href="https://secure.flickr.com/photos/syngnz/">Flickr</a></li>
</ul>
</div>
<div class="col-sm-3">
<h4>Categories</h4>
<ul class="list-unstyled my-list-style">
<li><a href="https://buhrmann.github.io/category/academia.html">Academia (4)</a></li>
<li><a href="https://buhrmann.github.io/category/data-apps.html">Data Apps (2)</a></li>
<li><a href="https://buhrmann.github.io/category/data-posts.html">Data Posts (9)</a></li>
<li><a href="https://buhrmann.github.io/category/reports.html">Reports (3)</a></li>
</ul>
</div>
<div class="col-sm-3">
<h4>Tags</h4>
<ul class="tagcloud">
<li class="tag-4"><a href="https://buhrmann.github.io/tag/shiny.html">shiny</a></li>
<li class="tag-4"><a href="https://buhrmann.github.io/tag/networks.html">networks</a></li>
<li class="tag-3"><a href="https://buhrmann.github.io/tag/sql.html">sql</a></li>
<li class="tag-3"><a href="https://buhrmann.github.io/tag/hadoop.html">hadoop</a></li>
<li class="tag-4"><a href="https://buhrmann.github.io/tag/mongodb.html">mongodb</a></li>
<li class="tag-1"><a href="https://buhrmann.github.io/tag/visualization.html">visualization</a></li>
<li class="tag-2"><a href="https://buhrmann.github.io/tag/smcs.html">smcs</a></li>
<li class="tag-3"><a href="https://buhrmann.github.io/tag/sklearn.html">sklearn</a></li>
<li class="tag-3"><a href="https://buhrmann.github.io/tag/tf-idf.html">tf-idf</a></li>
<li class="tag-1"><a href="https://buhrmann.github.io/tag/r.html">R</a></li>
<li class="tag-4"><a href="https://buhrmann.github.io/tag/sna.html">sna</a></li>
<li class="tag-2"><a href="https://buhrmann.github.io/tag/nosql.html">nosql</a></li>
<li class="tag-4"><a href="https://buhrmann.github.io/tag/svm.html">svm</a></li>
<li class="tag-4"><a href="https://buhrmann.github.io/tag/java.html">java</a></li>
<li class="tag-4"><a href="https://buhrmann.github.io/tag/hive.html">hive</a></li>
<li class="tag-4"><a href="https://buhrmann.github.io/tag/scraping.html">scraping</a></li>
<li class="tag-4"><a href="https://buhrmann.github.io/tag/lda.html">lda</a></li>
<li class="tag-2"><a href="https://buhrmann.github.io/tag/kaggle.html">kaggle</a></li>
<li class="tag-4"><a href="https://buhrmann.github.io/tag/exploratory.html">exploratory</a></li>
<li class="tag-4"><a href="https://buhrmann.github.io/tag/titanic.html">titanic</a></li>
<li class="tag-2"><a href="https://buhrmann.github.io/tag/classification.html">classification</a></li>
<li class="tag-1"><a href="https://buhrmann.github.io/tag/python.html">python</a></li>
<li class="tag-4"><a href="https://buhrmann.github.io/tag/random-forest.html">random forest</a></li>
<li class="tag-4"><a href="https://buhrmann.github.io/tag/text.html">text</a></li>
<li class="tag-4"><a href="https://buhrmann.github.io/tag/big-data.html">big data</a></li>
<li class="tag-2"><a href="https://buhrmann.github.io/tag/report.html">report</a></li>
<li class="tag-4"><a href="https://buhrmann.github.io/tag/regression.html">regression</a></li>
<li class="tag-2"><a href="https://buhrmann.github.io/tag/graph.html">graph</a></li>
<li class="tag-2"><a href="https://buhrmann.github.io/tag/d3.html">d3</a></li>
<li class="tag-3"><a href="https://buhrmann.github.io/tag/neo4j.html">neo4j</a></li>
<li class="tag-4"><a href="https://buhrmann.github.io/tag/flume.html">flume</a></li>
</ul>
</div>
</div>
</div>
<!-- JavaScript -->
<script src="https://code.jquery.com/jquery-2.1.1.min.js"></script>
<script src="//maxcdn.bootstrapcdn.com/bootstrap/3.2.0/js/bootstrap.min.js"></script>
<script type="text/javascript">
jQuery(document).ready(function($)
{
$("div.collapseheader").click(function () {
$header = $(this).children("span").first();
$codearea = $(this).children(".input_area");
$codearea.slideToggle(500, function () {
$header.text(function () {
return $codearea.is(":visible") ? "Collapse Code" : "Expand Code";
});
});
});
// $(window).resize(function(){
// var footerHeight = $('.aw-footer').outerHeight();
// var stickFooterPush = $('.push').height(footerHeight);
// $('.wrap').css({'marginBottom':'-' + footerHeight + 'px'});
// });
// $(window).resize();
// $(window).bind("load resize", function() {
// var footerHeight = 0,
// footerTop = 0,
// $footer = $(".aw-footer");
// positionFooter();
// function positionFooter() {
// footerHeight = $footer.height();
// footerTop = ($(window).scrollTop()+$(window).height()-footerHeight)+"px";
// console.log(footerHeight, footerTop);
// console.log($(document.body).height()+footerHeight, $(window).height());
// if ( ($(document.body).height()+footerHeight) < $(window).height()) {
// $footer.css({ position: "absolute" }).css({ top: footerTop });
// console.log("Positioning absolute");
// }
// else {
// $footer.css({ position: "static" });
// console.log("Positioning static");
// }
// }
// $(window).scroll(positionFooter).resize(positionFooter);
// });
});
</script>
</body>
</html>