---
knit: bookdown::preview_chapter
---
# Introduction {#intro}
We are living in an age of data: data about us, the world, and our interactions with the world and each other is being collected at an ever-increasing rate. A 2015 estimate held that by 2020 "about 1.7 megabytes of new information will be created every second for every human being on the planet".
https://www.forbes.com/sites/bernardmarr/2015/09/30/big-data-20-mind-boggling-facts-everyone-must-read/#54f6b00a17b1 (Data Strategy, Bernard Marr)
At the same time, far more data is being made available. Political developments, such as the 2007 Open Government amendment to the Freedom of Information Act, address the data used to generate reports and require that agencies "shall make the raw statistical data used in its reports available electronically ...".
This means that data used in reports by any US Department or Agency is publicly available, and in this setting "public" usually means available on the World Wide Web.
Fun facts: did you know that
(1) "'Despacito' is awesome to listen to in any language": in Spring and Summer 2018 it was the most listened-to song on Spotify worldwide, across languages and continents?
(2) the number of BASHs (bird aviation strike hits) is lower in Spring when birds nest?
(3) the wingspans (finger tip to finger tip distance) of NBA players are disproportionately large compared to players' heights and still increasing year by year?
(4) 15-year-old girls score higher than boys in math, on average, in some countries, but boys score lower than girls in reading, on average, in all countries?
After a course based on the material in this book, facts like these can be found (relatively easily) in publicly available data.
## What is exploratory data analysis?
The tools of the trade, described lovingly and in great detail in @eda, have moved on from paper and pencil. R [@R] has become the de facto lingua franca for statisticians and data analysts to communicate methods and findings. The RStudio IDE provides R users and developers with infrastructure and support. To quote Julie Lowndes, NCEAS, ["if R were an airplane, RStudio is the airport."](http://jules32.github.io/2016-07-12-Oxford/R_RStudio/)
Founded by J.J. Allaire in 2009, RStudio provides an organizational framework that facilitates well-organized workflows, integration with version control systems, and collaboration.
The core of RStudio's mission is to provide "open source and enterprise-ready professional software for the R statistical computing environment."
RStudio has been a substantial contributor to the growth of R, and has received the 2015 InfoWorld Technology of the Year Award (together with, among others, GitHub, Docker, and Apache Hadoop).
R, RStudio and markdown, for teams: git and github ...
why (computational) reproducibility is crucial to any data exploration and makes our work more ethical
When learning R, we are dealing with the two distinct questions of 'how?' and 'why?' almost simultaneously.
The first is the purely mechanical problem of getting the code to where we want it to be. Often this is the problem we have in mind when we say that 'we want to learn R'. However, the second question is the 'interesting' one: it touches on the statistical concepts behind why we apply some methods rather than others. In this book we try to highlight problems from both perspectives and add a third question: "what do we learn?".
Third question completes the cycle.
Tie into exploratory model.
## What is **wild** data?
<!-- motivational -->
"Wild data" is the data that's 'out there' in the world and available to us. It's the kind of data that we can go out and find in our quest to make sense of the world.
<!-- difference between sources, analogy to food -->
Data is available all around us. There are many places to find data, and the data found at different sites can vary widely in quality. Data, like food, can be delivered in many forms and conditions.
Like sustainable food, high-quality wild data is traceable to its source, and the source continuously updates the data where this is appropriate. It is possible to determine the trustworthiness of the data and its ethical usability.
Textbook data is like highly processed food. Once upon a time it may have been fresh data, but it has now been processed beyond recognition, molded into a shape that fits the tool being described.
Wild data is sustainable, and will be continuously updated as new measurements arrive. Many data delivery sites, like https://data.gov.au, or even kaggle.com, should be considered data orphanages: the data has been dropped off and abandoned. Fresh and local wild data has meaning and relevance for the audience.
What to look for in wild data
- Traceable source
- Data collection methods and processing procedures are clear and (theoretically) reproducible
- Updates have been made over time
## What is computational statistical thinking?
Statistical thinking is about understanding our world by modeling the variation we care about, and harnessing randomness to "white noise" the variation that is not of primary interest. There is definitely lots of history to statistical thinking. Florence Nightingale employed data collection, computation and data plots to improve hygiene and hence survival rates during the Crimean War. XXX more examples to relate to...
Computational statistical thinking is an exciting new way of doing statistics that makes use of the computational tools of today. To understand randomness, we sample, re-sample, simulate or generate values from a model. We use these to learn how the problem might look if we'd collected different data, or if particular conditions hold. It allows us to create a sandbox to play in, a virtual world to examine randomness and variation.
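To make the idea of a sandbox concrete, here is a minimal sketch in base R. Everything in it is an assumption made for illustration: the "observed" data are hypothetical draws from a normal model, and the names `observed`, `boot_means` and `sim_means` are ours. The sketch re-samples the observed values (a simple bootstrap) and, separately, simulates fresh samples from the assumed model, so that we can compare how much the sample mean varies in each virtual world.

```r
# Minimal sketch with hypothetical data; base R only.
set.seed(2024)  # fix the random number stream so the sketch is reproducible

# Hypothetical "observed" sample: 50 draws from a normal model
# (an assumption, chosen only for illustration).
observed <- rnorm(50, mean = 10, sd = 2)

# Re-sample: draw with replacement from the observed data many times
# and record the mean of each re-sample (a simple bootstrap).
n_rep <- 2000
boot_means <- replicate(n_rep, mean(sample(observed, replace = TRUE)))

# Simulate: generate entirely new samples from the assumed model to see
# how the mean might have looked had we collected different data.
sim_means <- replicate(n_rep, mean(rnorm(50, mean = 10, sd = 2)))

# Summarise the variation of the mean in each virtual world.
boot_se <- sd(boot_means)
sim_se <- sd(sim_means)
boot_ci <- quantile(boot_means, c(0.025, 0.975))  # percentile interval

print(c(bootstrap_se = boot_se, simulated_se = sim_se))
print(boot_ci)
```

The two standard errors should be of similar size, because the re-samples stand in for the new data we did not collect. Changing `mean`, `sd` or the sample size in the simulation is one way to ask how the problem might look if particular conditions hold.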
## Workflow
`here`