---
title: "R Notebook"
output: html_notebook
---

Outline:

1. microeconomics vs macroeconomics
2. regression analysis vs causal inference
3. types of microdata


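```{r, eval=FALSE}
## run once; tidyverse and readxl are also needed by the chunks below
install.packages(c("tidyverse", "readxl", "lmtest", "sandwich", "estimatr", "olsrr"))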
```

```{r}
library(tidyverse) ## data processing
library(readxl) ## reading Excel files
library(lmtest) ## diagnostic tests for lm (heteroskedasticity, autocorrelation) and coeftest
library(olsrr) ## more user-friendly OLS diagnostics
library(sandwich) ## robust standard errors
library(estimatr) ## provides the lm_robust function (fast; implemented in C++)
```

Data from the Moodle platform:

```{r}
labor <- read_excel("data/labor.xlsx") 
summary(labor)
head(labor)
```

```{r}
with(labor, hist(labour, breaks = "fd"))
with(labor, hist(production, breaks = "fd"))
with(labor, hist(capital, breaks = "fd"))
with(labor, hist(wage, breaks = "fd"))

with(labor, plot(production, labour))
with(labor, plot(capital, labour))
with(labor, plot(wage, labour))
```

First model

$$
\text{labour} = \beta_0 + \beta_1\text{capital} + \beta_2\text{production} + \beta_3\text{wage} + \epsilon
$$

```{r}
m1 <- lm(labour ~ I(capital/1000) + I(production/100000) + I(wage/1000), data = labor)
summary(m1)
```

```{r}
plot(m1)
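```

Reference plots: the same diagnostic plots for simulated data that satisfy the linear model assumptions (linear mean, homoskedastic normal errors), for comparison with the plots of `m1` above.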

```{r}
set.seed(123)
n <- 1000
x <- rchisq(n, df = 2)
y <- 1 + 5 * x + rnorm(n)
m2 <- lm(y ~ x)
plot(m2)
```

Second model, multiplicative in the regressors and in the error term:

$$
\text{labour} = \alpha \times \text{capital}^{\beta_1} \times \text{production}^{\beta_2}\times\text{wage}^{\beta_3}\times\epsilon
$$

Taking logarithms of both sides gives the log-transformed model, where $\gamma = \log(\alpha)$ and $\psi = \log(\epsilon)$:

$$
\log(\text{labour}) = \gamma + \beta_1\log(\text{capital}) + \beta_2\log(\text{production}) + \beta_3\log(\text{wage}) + \psi
$$

```{r}
m3 <- lm(log(labour) ~ log(capital) + log(production) + log(wage), data = labor)
summary(m3)
plot(m3)
```
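
In the log-log specification each slope is an elasticity: a 1% increase in a regressor changes the response by approximately $\beta$ percent. The sketch below illustrates this on simulated data, not on the `labor` data set; the data frame `sim_el` and the true elasticity of 0.3 are assumptions made only for the illustration.

```{r}
set.seed(1)
n_el <- 2000

## simulated regressor and multiplicative model with a known elasticity of 0.3
sim_el <- data.frame(capital = rlnorm(n_el, meanlog = 3, sdlog = 1))
sim_el$labour <- exp(1) * sim_el$capital^0.3 * exp(rnorm(n_el, sd = 0.2))

fit_el <- lm(log(labour) ~ log(capital), data = sim_el)
elasticity <- coef(fit_el)[["log(capital)"]]

## percentage change in predicted labour after a 1% increase in capital
pct_change <- 100 * (1.01^elasticity - 1)

elasticity
pct_change
```

The estimated slope should be close to the assumed value of 0.3, and `pct_change` should be close to the slope itself, which is why log-log coefficients are usually read directly as percentages.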

Verification of assumptions using the lmtest package

```{r}
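lmtest::dwtest(m3) ## Durbin-Watson test for autocorrelation of residuals (not appropriate here: this is not time series data)
lmtest::bptest(m3) ## Breusch-Pagan test for heteroskedasticity (studentized version by default)
shapiro.test(resid(m3)) ## Shapiro-Wilk test for normality of residuals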
```

Verification of assumptions using the olsrr package

```{r}
ols_test_normality(m3) ## tests for normality of residuals
ols_test_breusch_pagan(m3) ## Breusch-Pagan test for heteroskedasticity
```

Note that the results from `lmtest::bptest` and `ols_test_breusch_pagan` differ. This is because the two functions perform the test in different ways: `ols_test_breusch_pagan` uses the original version of the test, which assumes normally distributed error terms, while `lmtest::bptest` by default uses the studentized (Koenker) version, which does not rely on that assumption.
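
The two variants of the Breusch-Pagan statistic can be sketched by hand in base R. This is only an illustration on simulated data (the names `sim_bp`, `fit_bp` and the data-generating process are assumptions); both statistics come from an auxiliary regression of the squared residuals on the regressors.

```{r}
set.seed(42)
n_bp <- 500

## simulated data with error variance increasing in x1
sim_bp <- data.frame(x1 = runif(n_bp, 1, 10), x2 = rnorm(n_bp))
sim_bp$y <- 1 + 2 * sim_bp$x1 - sim_bp$x2 + rnorm(n_bp, sd = 0.5 * sim_bp$x1)

fit_bp <- lm(y ~ x1 + x2, data = sim_bp)
sim_bp$e2 <- resid(fit_bp)^2

## studentized (Koenker) version: n * R^2 of the auxiliary regression
aux_student <- lm(e2 ~ x1 + x2, data = sim_bp)
bp_student <- n_bp * summary(aux_student)$r.squared

## original version (assumes normal errors): half of the explained sum of
## squares when regressing e^2 / sigma^2 on the regressors
sim_bp$e2_scaled <- sim_bp$e2 / mean(sim_bp$e2)
aux_orig <- lm(e2_scaled ~ x1 + x2, data = sim_bp)
bp_orig <- 0.5 * sum((fitted(aux_orig) - mean(sim_bp$e2_scaled))^2)

## both statistics are compared with a chi-squared distribution, df = number of regressors
p_student <- pchisq(bp_student, df = 2, lower.tail = FALSE)
p_orig <- pchisq(bp_orig, df = 2, lower.tail = FALSE)

c(studentized = bp_student, original = bp_orig)
c(studentized = p_student, original = p_orig)
```

In `lmtest::bptest` the argument `studentize = FALSE` switches from the first variant to the second.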

To estimate a model with heteroskedasticity-robust standard errors, you may use the `lm_robust` function from the estimatr package.

```{r}
## same specification as m3, with HC2 robust standard errors
m3_r <- lm_robust(log(labour) ~ log(capital) + log(production) + log(wage),
                  data = labor, se_type = "HC2")
summary(m3_r)
```

The same can be done with the sandwich package:

+ `coeftest` -- a function from the lmtest package (coefficient tests with a user-supplied covariance matrix)
+ `vcovHC` -- a function from the sandwich package (calculates HC robust covariance matrices)


```{r}
coeftest(m3, vcov = vcovHC(m3, type = "HC2"))
```
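
To see what the HC2 estimator does, it can be written out by hand as $(X'X)^{-1} X' \text{diag}\left(e_i^2 / (1 - h_{ii})\right) X (X'X)^{-1}$, where $e_i$ are the residuals and $h_{ii}$ the leverages. The chunk below is a minimal sketch on simulated heteroskedastic data; the names `sim_hc` and `fit_hc` are assumptions for the illustration.

```{r}
set.seed(2)
n_hc <- 300

## simulated data: error standard deviation proportional to x
sim_hc <- data.frame(x = runif(n_hc, 1, 10))
sim_hc$y <- 1 + 2 * sim_hc$x + rnorm(n_hc, sd = sim_hc$x)

fit_hc <- lm(y ~ x, data = sim_hc)
X_hc <- model.matrix(fit_hc)
e_hc <- resid(fit_hc)
h_hc <- hatvalues(fit_hc)

bread_hc <- solve(crossprod(X_hc))                     ## (X'X)^{-1}
meat_hc <- crossprod(X_hc * (e_hc / sqrt(1 - h_hc)))   ## X' diag(e^2 / (1 - h)) X
vcov_hc2 <- bread_hc %*% meat_hc %*% bread_hc

se_hc2 <- sqrt(diag(vcov_hc2))
se_classical <- sqrt(diag(vcov(fit_hc)))
cbind(classical = se_classical, HC2 = se_hc2)

## optional cross-check against the sandwich package, if it is installed
if (requireNamespace("sandwich", quietly = TRUE)) {
  print(all.equal(vcov_hc2, sandwich::vcovHC(fit_hc, type = "HC2"),
                  check.attributes = FALSE))
}
```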

Robust standard errors: clustered/panel data


1. clustered data -- observations grouped within e.g. poviats, gminas, school classes, households
2. panel data -- e.g. companies observed between 2010 and 2021; a panel may be balanced or unbalanced
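
Why clustering matters can be sketched with a hand-written cluster-robust estimator. The chunk below uses the simplest variant, CR0 (no small-sample correction), which is not the CR2 variant used later in this notebook; the simulated data and all names (`sim_cl`, `fit_cl`) are assumptions for the illustration. The meat of the sandwich sums the scores $X_g' e_g$ within each cluster before taking outer products.

```{r}
set.seed(3)
G_cl <- 40   ## number of clusters
m_cl <- 25   ## observations per cluster
id_cl <- rep(seq_len(G_cl), each = m_cl)

## both the regressor and the error share a cluster-level component
sim_cl <- data.frame(cluster = id_cl,
                     x = rnorm(G_cl)[id_cl] + rnorm(G_cl * m_cl, sd = 0.5))
sim_cl$y <- 1 + 0.5 * sim_cl$x + rnorm(G_cl)[id_cl] + rnorm(G_cl * m_cl)

fit_cl <- lm(y ~ x, data = sim_cl)
X_cl <- model.matrix(fit_cl)
e_cl <- resid(fit_cl)

bread_cl <- solve(crossprod(X_cl))
scores_cl <- rowsum(X_cl * e_cl, group = sim_cl$cluster)  ## one row X_g' e_g per cluster
meat_cl <- crossprod(scores_cl)
vcov_cr0 <- bread_cl %*% meat_cl %*% bread_cl

se_cr0 <- sqrt(diag(vcov_cr0))
se_iid <- sqrt(diag(vcov(fit_cl)))
cbind(classical = se_iid, CR0 = se_cr0)
```

With within-cluster correlation in both the regressor and the error, the classical standard error of the slope is typically far too small compared with the cluster-robust one.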


```{r}
pzn_rent <- read_excel("data/rent-poznan.xlsx")

pzn_rent_subset <- pzn_rent %>%
  add_count(quarter, name = "quarter_count") %>%
  filter(quarter_count >= 50) %>%
  filter(price >= 500, price <= 15000, flat_area >= 15, flat_area <= 250)
  
pzn_rent_subset
```

```{r}
model_pzn <- lm(formula = price ~ flat_area + flat_rooms + individual + flat_furnished + 
                  flat_for_students +  flat_balcony,
                data = pzn_rent_subset)
summary(model_pzn)
plot(model_pzn)
```

We calculate standard errors using two formulas: HC2 (heteroskedasticity-robust) and CR2 (cluster-robust, with quarters as clusters).

```{r}
model_pzn_hc2 <- lm_robust(formula = price ~ flat_area + flat_rooms + individual + flat_furnished +
                             flat_for_students +  flat_balcony,
                           data = pzn_rent_subset,
                           se_type = "HC2")

model_pzn_cr2 <- lm_robust(formula = price ~ flat_area + flat_rooms + individual + flat_furnished +
                             flat_for_students +  flat_balcony,
                           data = pzn_rent_subset,
                           se_type = "CR2",
                           clusters = quarter)

model_pzn_hc2
model_pzn_cr2
```


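Cluster-robust standard errors can also be obtained with `vcovCL` from the sandwich package. Its small-sample adjustments may differ from those of `lm_robust`, so the results need not match exactly.

```{r}
coeftest(model_pzn, vcov = vcovCL(model_pzn, cluster = ~ quarter, type = "HC2"))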
```


Which standard errors to use:

+ cross-sectional data -> HC2
+ time series data -> HAC (implemented in the sandwich package)
+ clustered / panel data -> CR2
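
The time series case is not covered by the examples above, so here is a minimal base R sketch of a HAC covariance matrix of the Newey-West type with Bartlett weights. The simulated AR(1) data, the names (`sim_ts`, `fit_ts`) and the truncation lag of 4 are assumptions chosen only for the illustration; in practice the sandwich package provides `vcovHAC` and `NeweyWest`, which also handle lag selection.

```{r}
set.seed(4)
n_ts <- 400

## regressor and error both follow an AR(1) process
sim_ts <- data.frame(x = as.numeric(arima.sim(list(ar = 0.7), n = n_ts)))
sim_ts$y <- 1 + 0.5 * sim_ts$x + as.numeric(arima.sim(list(ar = 0.7), n = n_ts))

fit_ts <- lm(y ~ x, data = sim_ts)
X_ts <- model.matrix(fit_ts)
S_ts <- X_ts * resid(fit_ts)    ## n x k matrix of scores x_t * e_t

lag_ts <- 4                     ## truncation lag (arbitrary choice)
meat_ts <- crossprod(S_ts)      ## lag-0 term
for (j in seq_len(lag_ts)) {
  w_j <- 1 - j / (lag_ts + 1)   ## Bartlett weight
  ## sum over t of s_t s_{t-j}'
  G_j <- crossprod(S_ts[(j + 1):n_ts, , drop = FALSE],
                   S_ts[1:(n_ts - j), , drop = FALSE])
  meat_ts <- meat_ts + w_j * (G_j + t(G_j))
}

bread_ts <- solve(crossprod(X_ts))
vcov_nw <- bread_ts %*% meat_ts %*% bread_ts

se_nw <- sqrt(diag(vcov_nw))
cbind(classical = sqrt(diag(vcov(fit_ts))), HAC = se_nw)
```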