forked from rdpeng/RepData_PeerAssessment1
-
Notifications
You must be signed in to change notification settings - Fork 0
/
PA1_template.Rmd
104 lines (88 loc) · 4.1 KB
/
PA1_template.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
---
title: 'Reproducible Research: Peer Assessment 1'
output:
html_document:
keep_md: yes
pdf_document: default
---
## Loading and preprocessing the data
Extract the data from the zipped file and load it as a dataframe, and view data.
```{r}
unzip("activity.zip", exdir="./data")
df <- read.csv("./data/activity.csv")
head(df)
```
Create a new column of the dataframe which is a date-and-time in a native R format.
```{r}
df$minutes <- df$interval %% 100
df$hours <- (df$interval - df$minutes)/100
df$date_and_time <- strptime(paste(df$date, df$hours, df$minutes),
format="%Y-%m-%d %H %M", tz="GMT")
```
Finally we can check the start and finish date for the logging, and that we have continuous records (even though some are recorded as NA):
```{r}
summary(df$date_and_time)
summary(as.numeric(diff(df$date_and_time)))
```
## What is mean total number of steps taken per day?
Aggregate the daily totals of steps. Note some days are not represented in this total as some entire days have NA steps.
```{r}
dailyTotals <- aggregate(steps ~ date, df, FUN = sum)
head(dailyTotals)
```
A histogram of the total number of steps taken each day:
```{r, figure.width=400}
hist(dailyTotals$steps, xlab="# of Steps per Day", main="Histogram of Daily Steps")
```
The mean and median of the number of steps per day are:
```{r}
mean(dailyTotals$steps)
median(dailyTotals$steps)
```
## What is the average daily activity pattern?
Aggregate each of the time-steps into an average value for that time-step across all days, and plot the result.
```{r, figure.width=400}
intervalMeans <- aggregate(steps ~ interval, df, FUN = mean)
head(intervalMeans)
plot(intervalMeans$interval, intervalMeans$steps, type="l", xlab="Interval", ylab="Mean # of Steps")
```
The interval number which has on average the most steps is:
```{r}
intervalMeans[intervalMeans$steps == max(intervalMeans$steps), 'interval']
```
i.e. from 08:25 till 08:30 in the morning, assuming intervals are labelled with their end time.
## Imputing missing values
The total number of intervals with missing data is:
```{r}
sum(is.na(df$steps))
```
Representing the following fraction of all of the data:
```{r}
sum(is.na(df$steps))/length(df$steps)
```
I am imputing missing values based on the mean for that 5-minute interval (this was chosen in preference to the average for the day as some days have no recorded data). Save this in a new column which includes the original steps data and the filled in data.
```{r}
df$steps_filled <- apply(df[,c('steps', 'interval')], 1, function(x) { ifelse(is.na(x[1]),
intervalMeans[intervalMeans$interval == x[2], 'steps'], x[1]) } )
```
Here is a histogram of the daily totals used including the imputed values (NB: all days are now represented, unlike the original histogram).
```{r, figure.width=400}
dailyTotalsFilled <- aggregate(steps_filled ~ date, df, FUN = sum)
hist(dailyTotalsFilled$steps_filled, xlab="# of Steps per Day", main="Histogram of Daily Steps (with NAs imputed)")
```
The mean is unchanged, but including the imputed data has slightly increased the median value:
```{r}
mean(dailyTotalsFilled$steps_filled)
median(dailyTotalsFilled$steps_filled)
```
## Are there differences in activity patterns between weekdays and weekends?
```{r}
df$dayOfWeek <- weekdays(df$date_and_time)
df <- transform(df, dayType = ifelse(dayOfWeek %in% c("Saturday", "Sunday"), "weekend", "weekday"))
intervalMeans$weekDaySteps <- aggregate(steps ~ interval, df[df$dayType=="weekday", ], FUN = mean)$steps
intervalMeans$weekEndSteps <- aggregate(steps ~ interval, df[df$dayType=="weekend", ], FUN = mean)$steps
par(mfrow=c(2,1))
plot(intervalMeans$interval,intervalMeans$weekEndSteps, main="Weekend", type="l", xlab="Interval", ylab="Mean # of Steps")
plot(intervalMeans$interval,intervalMeans$weekDaySteps, main="Weekday", type="l", xlab="Interval", ylab="Mean # of Steps")
```
As can be seen from the plot above, on the weekend there is much more activity throughout the day, whereas on a weekday there are a large number of steps in the morning (perhaps as part of a comute), but then a lower level of activity for the rest of the day.