# Overview
I want to start the modeling section with the following overview of what role modeling plays in the "life cycle" of a data science project.
:::{.i96}
```{r,echo=F,fig.width=8,fig.height=4,tidy=F}
# import diagram library
library(diagram)
# set margin and create empty diagram
par(mar=c(0,0,1.5,0),cex.main=1.5)
openplotmat(main = "Data science project life cycle")
# set number of elements in each row
nodes = coordinates(c(3,3,5))
# reverse second row boxes order
nodes = nodes[c(1:3,6:4,7:nrow(nodes)),]
# nudge experiment, exploration boxes to right
nodes[2,1] = nodes[2,1] + .02
nodes[5,1] = nodes[5,1] + .02
# shift rows around
nodes[1:3,2] = nodes[1:3,2] + .01
nodes[4:6,2] = nodes[4:6,2] + .04
nodes[7:11,2] = nodes[7:11,2] - .02
# tighten last row boxes
nodes[7:11,1] = mean(nodes[7:11,1]) + (nodes[7:11,1]-mean(nodes[7:11,1]))*.92 - .015
fromto <- matrix(ncol=2,byrow=T,data=c(1,2,2,3,3,4,4,5,5,6,6,7,6,8,6,9,6,10,6,11))
nr <- nrow(fromto)
arrpos <- matrix(ncol=2,nrow=nr)
for (i in 1:nr)
arrpos[i, ] <- straightarrow(to=nodes[fromto[i,2],],from=nodes[fromto[i,1],],lwd=2,arr.pos=0.57,arr.length=0.5)
curvedarrow(nodes[6,],nodes[1,],curve=-.42,lty=2)
text(sum(nodes[1:2,1]*c(.45,.55)),nodes[1,2]+.04,"expand",cex=.9)
text(sum(nodes[2:3,1]*c(.45,.55)),nodes[2,2]+.04,"sample",cex=.9)
text(nodes[6,1]-.082,nodes[6,2]+.14,"iterate",cex=.9)
textrect(nodes[1,],.127,.08,lab="question of interest\n(testable hypothesis)",shadow.size = 0.005)
textrect(nodes[2,],.088,.08,lab="experiment/\nstudy design",shadow.size = 0.005)
textrect(nodes[3,],.065,.06,lab="dataset",shadow.size = 0.005)
textrect(nodes[4,],.1,.06,lab="cleaning/tidying",shadow.size = 0.005)
textrect(nodes[5,],.108,.08,lab="data exploration/\nvisualization",shadow.size = 0.005)
textrect(nodes[6,],.07,.065,lab="modeling",shadow.size = 0.005)
textrect(nodes[7,],.07,.07,lab="parameter\nestimates",shadow.size = 0.005)
textrect(nodes[8,],.07,.07,lab="hypothesis\ntesting",shadow.size = 0.005)
textrect(nodes[9,],.07,.07,lab="confidence\nintervals",shadow.size = 0.005)
textrect(nodes[10,],.07,.06,lab="predictions",shadow.size = 0.005)
textrect(nodes[11,],.075,.07,lab="etc...(further\ninference)",shadow.size = 0.005)
```
:::
Roughly speaking, data science can be divided into 3 phases:
I. In Phase I, you identify a research question, design an experiment, gather a sample, and collect some raw data. These steps correspond to the first row of the "life cycle".
II. In Phase II, you start with your raw data and clean it (this is probably where 60-90% of the time is actually spent), explore it, and figure out the best way to model it. This is the second row.
III. In Phase III, you fine-tune your model, double-check all your work, and interpret the results, which may involve reporting estimates, computing tests/intervals, making predictions, etc. This is the third row.
Usually, this is an iterative process; you start with a question, gather data, analyze it, then either modify the initial inquiry or ask a follow up question, and the cycle continues.
Experiment design & sampling is a more advanced topic and thus not covered in detail in STAT 240; however, we will summarize a few key ideas from Phase I in the next subsection since they are relevant to later topics.
We've spent a lot of time learning the basics of data cleaning, exploration, and visualization, so we're reasonably well covered for Phase II for now, though of course you're always encouraged to explore further on your own.
For most of the remainder of the class, we will focus on Phase III: identifying appropriate models, fitting them well, interpreting the results meaningfully, producing useful further inference such as hypothesis tests & confidence intervals, and communicating the results effectively to a broader audience.
First, we need to briefly summarize a few key concepts relating to experiment design that will greatly enrich our later exploration of models.
## Population vs sample
Statistics is primarily the science of studying **samples** to understand **populations**. Generally, it's impractical to observe every member of a population, but luckily this is usually not necessary: a well-drawn sample is sufficient to answer most questions.
:::{.def}
A **population** can be any large group we want to learn about, e.g. all US mothers (~85 million), all arctic terns (~3 million), all gen-5 Toyota Priuses (~15,000), etc.
A **sample** is a smaller set drawn from (and intended to represent) a population.
:::
There are [MANY ways](https://www.scribbr.com/methodology/sampling-methods) to draw a sample, each with their own pros, cons, and potential [biases](https://www.scribbr.com/research-bias/sampling-bias). A detailed discussion of these is reserved for more advanced courses.
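For example, one common approach is a **simple random sample**, where every member of the population has an equal chance of being selected. Here's a minimal sketch in R using a simulated population (the population values below are made up purely for illustration):
```{r}
# simulate a hypothetical population of 1 million values,
# e.g. ages of all mothers in some region (made-up numbers)
set.seed(42)
population = rnorm(1e6, mean = 30, sd = 6)

# draw a simple random sample of 1000 members with sample()
my_sample = sample(population, size = 1000)

# a well-drawn sample should closely represent the population,
# e.g. the sample mean should be close to the population mean
mean(population)
mean(my_sample)
```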
## Model vs data
Using mathematical reasoning, we can derive different theoretical probability **models** with certain **parameters** that aim to represent real-world phenomena. We can then compare these models with real data, i.e. **fitting**, to evaluate their performance and make further **inferences** and/or **predictions**.
:::{.def}
A **model** is an idealized mathematical representation of a process, e.g. a normal distribution may be used to model the distribution of human heights.
Models often have **parameters** which are values that can be adjusted (or "tuned") to **fit** a model so that it matches real data.
:::
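As a minimal sketch of what fitting can look like in practice, suppose we model human heights with a normal distribution. The heights below are simulated (made-up numbers purely for illustration); fitting here amounts to estimating the model's two parameters, its mean and standard deviation, from the data:
```{r}
# simulate a hypothetical sample of 500 adult heights in cm (made-up numbers)
set.seed(1)
heights = rnorm(500, mean = 170, sd = 8)

# "fit" the normal model by estimating its two parameters from the data
mu_hat    = mean(heights) # estimate of the model's mean parameter
sigma_hat = sd(heights)   # estimate of the model's standard deviation parameter
c(mu_hat, sigma_hat)

# the fitted model can then make predictions, e.g. the modeled
# proportion of people taller than 190cm
pnorm(190, mean = mu_hat, sd = sigma_hat, lower.tail = FALSE)
```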
The distinction between model and data may seem obvious, but the two have analogous features that are easily confused for one another. For example, a model has **theoretical statistics** such as its mean, median, variance, skew, etc. These are related to, but different from, the **sample statistics** of mean, median, variance, and skew computed from a dataset.
:::{.def}
Recall a **statistic** can be any computed numeric summary from a dataset (or from a hypothetical model). All values discussed in chapter \@ref(descriptive) are examples of statistics.
:::
If the sample is sufficiently large and of high quality, and the model is well chosen, these values should match, i.e. the model's theoretical statistics should agree with the sample statistics; any problems along the way can result in discrepancies.
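As a quick illustration (continuing the simulated height example above, so all numbers are made up): the fitted normal model's theoretical median equals its mean, and it should agree closely with the sample median. But if the model is poorly chosen, say a normal model for strongly right-skewed data, analogous statistics can disagree noticeably:
```{r}
# for a well-chosen model, theoretical and sample statistics agree:
mu_hat          # theoretical median of the fitted normal model (= its mean)
median(heights) # sample median of the data

# for a poorly chosen model, they disagree: a normal model always has
# skew 0, but a right-skewed sample does not
skewed = rexp(500, rate = 1/50) # hypothetical right-skewed data
mean((skewed - mean(skewed))^3) / sd(skewed)^3 # sample skew, clearly > 0
```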