Skip to content

Latest commit

 

History

History
805 lines (680 loc) · 38.9 KB

project2.md

File metadata and controls

805 lines (680 loc) · 38.9 KB
title author output
Most Harmful Categories of Severe Weather Events in the US
Andrea Schioppa
html_document

Synopsis

In this report we investigate which severe weather events are most harmful to the population and which have the greatest economic consequences. Our analysis is based on data obtained from the US National Oceanic and Atmospheric Administration. From this data we find that the 5 categories of events with most human fatalities are: tornado, excessive heat, heat, flash flood and lightning. The 5 categories of events with most injured subjects are: tornado, thunderstorm wind, flood, excessive heat and lightning. The 5 categories of events which are most harmful to property are: flood, hurricane/typhoon, tornado, coastal flood and flash flood. The 5 categories of events most detrimental to agriculture are: drought, flood, hurricane/typhoon, ice and hail.

Data Processing

We download the data:

file_url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(file_url, destfile = "StormData.bz2", method="curl")

and use bunzip2 (on the shell) to decompress the file. We then load libraries:

library(dplyr)
library(lubridate)
library(xtable)
library(ggplot2)

We report the Sys.info()

info_string <- Sys.info()
cat(paste(paste("sysname:", info_string["sysname"]), 
           paste("release:",info_string["release"]), 
            paste("machine:", info_string["machine"]), sep="\n"))
## sysname: Darwin
## release: 14.4.0
## machine: x86_64
R.version
##                _                           
## platform       x86_64-apple-darwin13.4.0   
## arch           x86_64                      
## os             darwin13.4.0                
## system         x86_64, darwin13.4.0        
## status                                     
## major          3                           
## minor          2.0                         
## year           2015                        
## month          04                          
## day            16                          
## svn rev        68180                       
## language       R                           
## version.string R version 3.2.0 (2015-04-16)
## nickname       Full of Ingredients

and then read the data in R; note that we filter in cleaning_data only the columns we are interested in and use lubridate to properly format the dates.

storm_data <- read.csv("storm_data.csv", header = TRUE, stringsAsFactors=FALSE)
names(storm_data)<-tolower(names(storm_data))
storm_data$evtype <- tolower(storm_data$evtype)
cleaning_data <- select(storm_data, evtype, bgn_date, end_date, fatalities, injuries, propdmg,
                        propdmgexp, cropdmg, cropdmgexp, state)
cleaning_data$bgn_date <- mdy(sub(patter = " 0:00:00", replacement = "", cleaning_data$bgn_date))
cleaning_data$end_date <- mdy(sub(patter = " 0:00:00", replacement = "", cleaning_data$end_date))

The data frame cleaning_data denotes the data that will be cleaned. We need to clean the column evtype, each of whose entries should take value in one of the "main categories" of severe weather events in Section 7 of the Storm Data Documentation. However, we have chosen to keep some additional categories. The details of the cleaning are discussed in the Section R Code. From now on we assume that cleaning_data has been cleaned.

To compute the values of property and crop damage, we need to convert the exponents in the columns propdmgexp, cropdmgexp to integral values. Note that the only values of the exponents which are meaningful are: H/h, k/K, m/M and B.

prop_vec <- c("-"=0, "?"=0, "+"=0, "0"=0, "1"=0, "2"=0, "3"=0, "4"=0, "5"=0, "6"=0, "7"=0,
              "8"=0, "B"=10^9, "h"=10^2, "H"=10^2, "K"=10^3, "M"=10^6, "m"=10^6, "k"=10^3)
cleaning_data$propdmgexp <- prop_vec[cleaning_data$propdmgexp]
cleaning_data$cropdmgexp <- prop_vec[cleaning_data$cropdmgexp]

We then create new columns property and crop which contain the damages created by the events.

cleaning_data <- mutate(cleaning_data, property = propdmg * propdmgexp, crop = cropdmg * cropdmgexp)

We finally create summaries by taking for the columns injuries, fatalities, property and crop the sum, mean and standard deviation across the events in each category. We then join the tables of means and standard deviations.

storm_totals <- group_by(cleaning_data, evtype) %>%
	         summarise(fatalities = sum(fatalities, na.rm=TRUE), injuries = sum(injuries, na.rm=TRUE),
		       property = sum(property, na.rm=TRUE), crop = sum(crop, na.rm=TRUE))
storm_means<- group_by(cleaning_data, evtype) %>%
               summarise(fatalities = mean(fatalities, na.rm=TRUE), injuries = mean(injuries, na.rm=TRUE),
	             property = mean(property, na.rm=TRUE), crop = mean(crop, na.rm=TRUE))
storm_sds<- group_by(cleaning_data, evtype) %>%
             summarise(sdfatalities = sd(fatalities, na.rm=TRUE), sdinjuries = sd(injuries, na.rm=TRUE),
	           sdproperty = sd(property, na.rm=TRUE), sdcrop = sd(crop, na.rm=TRUE))
storm_means <- inner_join(storm_means, storm_sds)

Results

Danger for humans

We select and sort the top 10 causes of total fatalities, and use total injuries to break ties.

topt_fatalities <- arrange(storm_totals, desc(fatalities), desc(injuries))[1:10,1:3]
topt_fatalities$fatalities <- as.integer(topt_fatalities$fatalities)
topt_fatalities$injuries <- as.integer(topt_fatalities$injuries)

We then show the results in a table.

options(xtable.comment=FALSE)
topt_fat_table <- xtable(topt_fatalities)
print(topt_fat_table, include.rownames=FALSE, type='html')
evtype fatalities injuries
tornado 5661 91407
excessive heat 2018 6680
heat 1116 2531
flash flood 1035 1802
lightning 817 5232
thunderstorm wind 730 9542
rip current 572 529
flood 518 6881
high wind 325 1585
cold/wind chill 237 41

We select and sort the top 10 causes of total injuries, and use total fatalities to break ties.

topt_injuries <- arrange(storm_totals, desc(injuries), desc(fatalities))[1:10,c(1,3,2)]
topt_injuries$fatalities <- as.integer(topt_injuries$fatalities)
topt_injuries$injuries <- as.integer(topt_injuries$injuries)

We then show the results in a table.

options(xtable.comment=FALSE)
topt_inj_table <- xtable(topt_injuries)
print(topt_inj_table, include.rownames=FALSE, type='html')
evtype injuries fatalities
tornado 91407 5661
thunderstorm wind 9542 730
flood 6881 518
excessive heat 6680 2018
lightning 5232 817
heat 2531 1116
ice 2159 101
flash flood 1802 1035
wildfire 1608 90
high wind 1585 325

An alternative approach is to find the events which in mean are most harmful. In fact, there might be events which are extremely dangerous, but also rare and so they are penalized in taking the sum of total fatalities / injuries. However, to account for variability, we also report standard deviations.

We find the top 10 events which have most fatalities in mean.

topm_fatalities <- arrange(storm_means, desc(fatalities), desc(injuries))[1:10,c(1,2,3,6,7)]
topm_fat_table <- xtable(topm_fatalities)
print(topm_fat_table, include.rownames=FALSE, type='html')
evtype fatalities injuries sdfatalities sdinjuries
tsunami 1.65 6.45 7.15 28.85
high seas 1.23 1.00 1.09 1.68
heat 1.17 2.65 18.99 19.62
excessive heat 1.06 3.49 4.71 24.59
rip current 0.74 0.68 0.63 2.29
avalanche 0.58 0.44 0.78 1.03
high water 0.50 0.00 0.84 0.00
hurricane/typhoon 0.44 4.44 1.59 48.88
coastal storm 0.36 0.18 0.50 0.40
marine strong wind 0.29 0.46 0.54 1.37

We find the top 10 events which have most injuries in mean.

topm_injuries <- arrange(storm_means, desc(injuries), desc(fatalities))[1:10,c(1,3,2,7,6)]
topm_inj_table <- xtable(topm_injuries)
print(topm_inj_table, include.rownames=FALSE, type='html')
evtype injuries fatalities sdinjuries sdfatalities
tsunami 6.45 1.65 28.85 7.15
glaze 4.80 0.16 12.02 0.77
hurricane/typhoon 4.44 0.44 48.88 1.59
excessive heat 3.49 1.06 24.59 4.71
heat 2.65 1.17 19.62 18.99
tornado 1.51 0.09 17.17 1.42
freezing fog 1.25 0.11 5.77 0.60
ice 1.02 0.05 34.23 0.33
dust storm 1.00 0.05 4.24 0.53
high seas 1.00 1.23 1.68 1.09

The next plot shows the top 10 causes of fatalities.

gfat <- barplot(topt_fatalities$fatalities, main = "Top 10 events by total fatalities",
     xlab = "event types", ylab = "total fatalities")
text(x = gfat, y = rep(2500,10), labels=topt_fatalities$evtype, srt=90)

Figure1: Top 10 causes of fatalities

Danger for property

We select the top 10 causes of damage to property, and use damage to crop to break ties.

topt_property <- arrange(storm_totals, desc(property), desc(crop))[1:10,c(1,4,5)]
topt_property$property <- format(topt_property$property, scientific=TRUE, digits=3)
topt_property$crop <- format(topt_property$crop, scientific=TRUE, digits=3)

We then show the results in a table (the units are USD).

topt_prp_table <- xtable(topt_property)
print(topt_prp_table, include.rownames=FALSE, type='html')
evtype property crop
flood 1.51e+11 1.09e+10
hurricane/typhoon 8.53e+10 5.51e+09
tornado 5.86e+10 4.17e+08
coastal flood 4.33e+10 5.00e+03
flash flood 1.69e+10 1.53e+09
hail 1.60e+10 3.05e+09
thunderstorm wind 1.10e+10 1.27e+09
wildfire 8.50e+09 4.03e+08
tropical storm 7.71e+09 6.95e+08
winter storm 6.69e+09 2.74e+07

We select the top 10 causes of damage to crop, and use damage to property to break ties.

topt_crop <- arrange(storm_totals, desc(crop), desc(property))[1:10,c(1,5,4)]
topt_crop$property <- format(topt_crop$property, scientific=TRUE, digits=3)
topt_crop$crop <- format(topt_crop$crop, scientific=TRUE, digits=3)

We then show the results in a table (the units are USD).

topt_crp_table <- xtable(topt_crop)
print(topt_crp_table, include.rownames=FALSE, type='html')
evtype crop property
drought 1.40e+10 1.05e+09
flood 1.09e+10 1.51e+11
hurricane/typhoon 5.51e+09 8.53e+10
ice 5.02e+09 3.96e+09
hail 3.05e+09 1.60e+10
frost/freeze 2.00e+09 1.93e+07
flash flood 1.53e+09 1.69e+10
extreme cold 1.31e+09 1.24e+08
thunderstorm wind 1.27e+09 1.10e+10
heavy rain 7.96e+08 3.25e+09

As for danger for humans, we also find the events which have greatest economic consequences in mean; for these we report the standard deviations to highlight the variability.

topm_property <- arrange(storm_means, desc(property), desc(crop))[1:10,c(1,4,5,8,9)]
topm_property$property <- format(topm_property$property, scientific=TRUE, digits=3)
topm_property$sdproperty <- format(topm_property$sdproperty, scientific=TRUE, digits=3)
topm_property$crop <- format(topm_property$crop, scientific=TRUE, digits=3)
topm_property$sdcrop <- format(topm_property$sdcrop, scientific=TRUE, digits=3)

We find the top 10 events which have the most property damage in mean:

topm_prp_table <- xtable(topm_property)
print(topm_prp_table, include.rownames=FALSE, type='html')
evtype property crop sdproperty sdcrop
hurricane/typhoon 3.82e+08 4.75e+07 1.59e+09 1.63e+08
coastal flood 2.48e+08 7.14e+02 2.51e+09 1.89e+03
storm tide 3.18e+07 6.25e+03 3.33e+08 6.48e+04
tropical storm 1.35e+07 1.62e+06 2.17e+08 1.20e+07
flood 7.85e+06 7.50e+05 8.31e+08 4.22e+07
tsunami 7.58e+06 1.05e+03 1.90e+07 4.59e+03
wildfire 3.69e+06 2.19e+05 4.70e+07 3.57e+06
ice 2.90e+06 5.20e+06 1.93e+07 1.61e+08
astronomical high tide 1.18e+06 0.00e+00 1.76e+06 NA
tornado 1.13e+06 4.35e+04 1.91e+07 1.04e+06

We find the top 10 events which have most crop damage in mean:

topm_crop <- arrange(storm_means, desc(crop), desc(property))[1:10,c(1,5,4,9,8)]
topm_crop$property <- format(topm_crop$property, scientific=TRUE, digits=3)
topm_crop$sdproperty <- format(topm_crop$sdproperty, scientific=TRUE, digits=3)
topm_crop$crop <- format(topm_crop$crop, scientific=TRUE, digits=3)
topm_crop$sdcrop <- format(topm_crop$sdcrop, scientific=TRUE, digits=3)
topm_crp_table <- xtable(topm_crop)
print(topm_crp_table, include.rownames=FALSE, type='html')
evtype crop property sdcrop sdproperty
hurricane/typhoon 4.75e+07 3.82e+08 1.63e+08 1.59e+09
extreme cold 9.65e+06 8.24e+05 5.58e+07 4.63e+06
drought 9.19e+06 7.43e+05 5.32e+07 1.75e+07
ice 5.20e+06 2.90e+06 1.61e+08 1.93e+07
frost/freeze 1.87e+06 1.87e+04 1.56e+07 2.82e+05
tropical storm 1.62e+06 1.35e+07 1.20e+07 2.17e+08
flood 7.50e+05 7.85e+06 4.22e+07 8.31e+08
excessive heat 6.88e+05 1.08e+04 1.83e+07 1.66e+05
heat 6.40e+05 1.93e+04 1.59e+07 2.10e+05
wildfire 2.19e+05 3.69e+06 3.57e+06 4.70e+07

The following plot illustrates the top 10 events for damage to property. On the y-axis we use a log10 scale.

gfat2 <- barplot(as.numeric(topt_property$property), ylog=TRUE, main = "Top 10 events by property damage",
      xlab="event types", ylab = "damage (logarithmic scale)")
text(x=gfat2, y=rep(5e10,10), labels=topt_property$evtype, srt=90) 

plot of chunk plot_property

Total Frequencies

We also compute the total frequencies and percentages of the events which occurr in one of the top 10 categories discussed in the previous sections.

top_names <- rbind(topm_crop$evtype, topm_fatalities$evtype, topm_property$evtype,
	            topm_injuries$evtype, topt_crop$evtype, topt_fatalities$evtype, topt_property$evtype,
	            topt_injuries$evtype)
top_names <- unique(as.vector(top_names))
top_freq <- filter(cleaning_data, evtype %in% top_names)
top_freq <- mutate(top_freq, freq=1)
total_freq <- group_by(top_freq, evtype) %>% summarise(freq=sum(freq))
total_freq <- mutate(total_freq, percentage = freq/sum(freq)*100)
total_freq <- arrange(total_freq, evtype)
total_freq$freq <- as.integer(total_freq$freq)
tot_freq_table <- xtable(total_freq)
print(tot_freq_table, include.rownames=FALSE, type='html')
evtype freq percentage
astronomical high tide 103 0.01
avalanche 388 0.05
coastal flood 261 0.03
coastal storm 11 0.00
cold/wind chill 1806 0.21
drought 2537 0.30
dust storm 438 0.05
excessive heat 1912 0.22
extreme cold 874 0.10
flash flood 55677 6.52
flood 30445 3.56
freezing fog 587 0.07
frost/freeze 1539 0.18
glaze 45 0.01
hail 288928 33.83
heat 954 0.11
heavy rain 12164 1.42
high seas 13 0.00
high water 6 0.00
high wind 22341 2.62
hurricane/typhoon 300 0.04
ice 2111 0.25
lightning 15778 1.85
marine strong wind 48 0.01
rip current 774 0.09
storm tide 154 0.02
thunderstorm wind 336724 39.43
tornado 60711 7.11
tropical storm 697 0.08
tsunami 20 0.00
wildfire 4239 0.50
winter storm 11436 1.34

R Code

Here is the code for cleaning cleaning_data$evtype.

In the first chunk we attempt to fit most entries in the categories described in Section 7 of the Storm Data Documentation.

## Distinguish tides between "astronomical" and "storm"
indx_astro <- grep("astronomical", cleaning_data$evtype)
indx_tide <- grep("tide", cleaning_data$evtype)
cleaning_data$evtype[setdiff(indx_tide, indx_astro)] <- "storm tide"
## Fix "avalanche" mispellings
indx_aval <- grep("aval", cleaning_data$evtype)
cleaning_data$evtype[indx_aval] <- "avalanche"
## Blizzards 
indx_blizzard <- grep("blizzard", cleaning_data$evtype)
indx_snow <- grep("snow", cleaning_data$evtype)
indx_wind <- grep("wind", cleaning_data$evtype)
cleaning_data$evtype[intersect(indx_blizzard, indx_snow)] <- "heavy snow"
cleaning_data$evtype[intersect(indx_blizzard, indx_wind)] <- "heavy wind"
indx_blizzard <- grep("blizzard", cleaning_data$evtype)
cleaning_data$evtype[indx_blizzard] <- "blizzard"
## Floods
indx_flood <- grep("flood", cleaning_data$evtype)
indx_coastal <- grep("coastal|beach|tidal|cstl", cleaning_data$evtype)
cleaning_data$evtype[intersect(indx_flood,indx_coastal)] <- "coastal flood"
## Find wind chill events
indx_wind <- grep("wind|cold", cleaning_data$evtype)
indx_chill <- grep("chill", cleaning_data$evtype)
cleaning_data$evtype[intersect(indx_wind, indx_chill)] <- "cold/wind chill"
## Debris flow
indx_flow <- grep("debris|flow", cleaning_data$evtype)
cleaning_data$evtype[indx_flow] <- "debris flow"
## Fog events
indx_dense_fog <- grep("dense fog", cleaning_data$evtype)
indx_other_fog <- grep("fog", cleaning_data$evtype)
cleaning_data$evtype[setdiff(indx_other_fog, indx_dense_fog)] <- "freezing fog"
## Decided to keep "dense fog" and "dense smoke" in new categories
cleaning_data$evtype[indx_dense_fog] <- "dense fog"
indx_smoke <- grep("smoke", cleaning_data$evtype)
cleaning_data$evtype[indx_smoke] <- "dense smoke"
## Drought
indx_drought <- grep("drought", cleaning_data$evtype)
indx_heat <- grep("excessive", cleaning_data$evtype)
cleaning_data$evtype[setdiff(indx_drought, indx_heat)] <- "drought"
indx_dev <- grep("dev", cleaning_data$evtype)
## Dust Devil
cleaning_data$evtype[indx_dev] <- "dust devil"
indx_dstorm <- grep("dust", cleaning_data$evtype)
cleaning_data$evtype[setdiff(indx_dstorm,indx_dev)] <- "dust storm"
## Classify "excessive" / "extreme" events
indx_ex <- grep("exc|ext", cleaning_data$evtype)
indx_heat <- grep("heat", cleaning_data$evtype)
cleaning_data$evtype[intersect(indx_ex, indx_heat)] <- "excessive heat"
indx_cold <- grep("cold|chill", cleaning_data$evtype)
cleaning_data$evtype[intersect(indx_ex, indx_cold)] <- "extreme cold"
## Lakeshore and flash floods
indx_flood <- grep("flood", cleaning_data$evtype)
indx_lake <- grep("lake", cleaning_data$evtype)
indx_flash <- grep("flash", cleaning_data$evtype)
cleaning_data$evtype[intersect(indx_flood, indx_lake)] <- "lakeshore flood"
cleaning_data$evtype[intersect(indx_flood, indx_flash)] <- "flash flood"
indx_lake <- union(indx_lake, indx_flash)
indx_flood <- setdiff(indx_flood, indx_lake)
cleaning_data$evtype[indx_flood] <- "flood"
## Classify "frost/freeze"
indx_freeze <- setdiff(grep("freez", cleaning_data$evtype),
                         grep("rain|fog|sleet|pre", cleaning_data$evtype))
indx_frost <- grep("frost", cleaning_data$evtype)
cleaning_data$evtype[union(indx_freeze, indx_frost)] <- "frost/freeze"
## Funnel
indx_funnel <- setdiff(grep("funnel", cleaning_data$evtype),
                          grep("thund|wat", cleaning_data$evtype))
cleaning_data$evtype[indx_funnel] <- "funnel cloud"
## Hail
indx_hail <- setdiff(grep("hail", cleaning_data$evtype),
                      grep("mar|thun|tor|tstm", cleaning_data$evtype))
cleaning_data$evtype[indx_hail] <- "hail"
indx_heat <- setdiff(grep("heat", cleaning_data$evtype), grep("ex", cleaning_data$evtype))
## Heat
cleaning_data$evtype[indx_heat] <- "heat"
indx_rain <- setdiff(grep("rain", cleaning_data$evtype),
                       grep("sleet|low|tstm|snow|uns|light", cleaning_data$evtype))
## Heavy rain		       
cleaning_data$evtype[indx_rain] <- "heavy rain"
indx_snow <- setdiff(grep("snow", cleaning_data$evtype),
                       grep("sleet|late|lake|mode|seas|lack|ligh|ear", cleaning_data$evtype))
## Heavy snow
cleaning_data$evtype[indx_snow] <- "heavy snow"
indx_surf <- grep("surf",cleaning_data$evtype)
## High surf
cleaning_data$evtype[indx_surf] <- "high surf"
indx_wind <- setdiff(grep("wind|wnd", cleaning_data$evtype),
                      grep("rain|tun|tstm|thun|chi|mir|mic|hail|low|thud|mar|strong",
		             cleaning_data$evtype))
## High wind			     
cleaning_data$evtype[indx_wind] <- "high wind"
## Hurricanes
indx_hurri <- grep("hurricane|typh", cleaning_data$evtype)
cleaning_data$evtype[indx_hurri] <- "hurricane/typhoon"
## Ice / Ice storms / Lake-effect snow
indx_icestm <- intersect(grep("ice", cleaning_data$evtype), grep("storm", cleaning_data$evtype))
cleaning_data$evtype[indx_icestm] <- "ice storm"
indx_lakesnow <- intersect(grep("lake", cleaning_data$evtype), grep("snow", cleaning_data$evtype))
cleaning_data$evtype[indx_lakesnow] <- "lake-effect snow"
## Lightning
indx_lig <- grep("ligh|lign", cleaning_data$evtype)
indx_good_lig <- grep("snow|northern|rain", cleaning_data$evtype)
cleaning_data$evtype[setdiff(indx_lig, indx_good_lig)] <- "lightning"
## Marine 
indx_mar_tst <- grep("marine (tst|thun)", cleaning_data$evtype)
cleaning_data$evtype[indx_mar_tst] <- "marine thunderstorm wind"
## Rip current
cleaning_data$evtype[grepl("rip", cleaning_data$evtype)] <- "rip current"
## Sleet
cleaning_data$evtype[grepl("sleet", cleaning_data$evtype)] <- "sleet"
## Strong wind
indx_strong <- setdiff(grep("strong", cleaning_data$evtype), grep("marine", cleaning_data$evtype))
cleaning_data$evtype[indx_strong] <- "strong wind"
## Thunderstorm wind
indx_tswind <- setdiff(grep("down|tst|thun|tun|gust", cleaning_data$evtype),
                          grep("summary|hail", cleaning_data$evtype))
cleaning_data$evtype[indx_tswind] <- "thunderstorm wind"
## Tornado
cleaning_data$evtype[grepl("tornado", cleaning_data$evtype)] <- "tornado"
## Tropical depression
indx_trop <- setdiff(grep("trop", cleaning_data$evtype), grep("depressio", cleaning_data$evtype))
cleaning_data$evtype[indx_trop] <- "tropical storm"
## Volcanic ash
cleaning_data$evtype[grepl("volc", cleaning_data$evtype)] <- "volcanic ash"
## Waterspout
cleaning_data$evtype[grepl("waterspout", cleaning_data$evtype)] <- "waterspout"
## Fires
indx_fire <- grep("fire", cleaning_data$evtype)
indx_wild <- grep("wild|brush|grass|red|forest", cleaning_data$evtype)
cleaning_data$evtype[intersect(indx_fire, indx_wild)] <- "wildfire"
## Winter storms and weather
indx_win <- grep("wint", cleaning_data$evtype)
indx_storm <- grep("storm", cleaning_data$evtype)
cleaning_data$evtype[intersect(indx_win, indx_storm)] <- "winter storm"
cleaning_data$evtype[setdiff(indx_win, indx_storm)] <- "winter weather"

In the second chunk we remove entries which were summaries of other events, since they are duplicated.

indx_sum <- grep("summary", cleaning_data$evtype)
cleaning_data <- cleaning_data[-indx_sum,]

In the third chunk we refine the previous cleaning, mostly by "hand". Some changes involved correcting spelling errors, or looking in the original data set (under storm_data$remarks) to guess how to attribute the events. This was not possible for all events as there were cases where either it did not seem possible to fit them in a single category, or a meaningful description was missing. Events removed include:

  • "black ice";
  • "dam" incidents;
  • Some precipitations events which could not be mapped only to one of: "hail", "rain" or "snow";
  • "wet" and "unseasonal" events;
  • Specific accidents/mishap (either for boats or people dying of hypo/hyperthermia);
  • Some events whose label was like "other" and which lacked a description;
## Uncategorized events are removed
indx_uncat <- grep("\\?", cleaning_data$evtype)
cleaning_data <- cleaning_data[-indx_uncat,]
## Abnormal warmth is excessive heat
indx_abwarm <- grep("abnormal warmth", cleaning_data$evtype)
cleaning_data$evtype[indx_abwarm] <- "excessive heat"
## Bursts are microbursts
cleaning_data$evtype[grep("burst", cleaning_data$evtype)] <- "microburst"
## Unseasonal events are unclear hence removed
indx_unsea <- grep("unsea", cleaning_data$evtype)
cleaning_data <- cleaning_data[-indx_unsea,]
## Most "dry" are drought
indx_drought <- setdiff(grep("dry", cleaning_data$evtype), grep("mild", cleaning_data$evtype))
cleaning_data$evtype[indx_drought] <- "drought"
## Black ice and below normal precipitation are unclear
indx_bel <- grep("below|black", cleaning_data$evtype)
cleaning_data <- cleaning_data[-indx_bel, ]
## Correct typo
indx_coastalstorm <- grep("coastalstorm", cleaning_data$evtype)
cleaning_data$evtype[indx_coastalstorm] <- "coastal storm"
## "apache county" was a "strong wind"
cleaning_data$evtype[grep("apac", cleaning_data$evtype)] <- "strong wind"
## Not clear where to fit beach/coastal erosion
indx_eros <- grep("eros", cleaning_data$evtype)
cleaning_data <- cleaning_data[-indx_eros, ]
## "precipitations" which can be attributed to rain
indx_prec <- setdiff(grep("prec", cleaning_data$evtype), grep("snow|mix|nor", cleaning_data$evtype))
indx_remprec <- intersect(grep("prec", cleaning_data$evtype), grep("snow|mix|nor", cleaning_data$evtype))
cleaning_data$evtype[indx_prec] <- "heavy rain"
cleaning_data <- cleaning_data[-indx_remprec, ]
## Remove "wet" because not clear if it should be fit in "heat" or "rain"
indx_wet <- grep("wet", cleaning_data$evtype)
cleaning_data <- cleaning_data[-indx_wet, ]
## More extreme cold
indx_cold <- setdiff(grep("cold", cleaning_data$evtype), grep("chill", cleaning_data$evtype))
cleaning_data$evtype[indx_cold] <- "extreme cold"
## Events which do not fit or describe accidents
indx_rem <- grep("early|drow|hyp", cleaning_data$evtype)
cleaning_data <- cleaning_data[-indx_rem, ]
## "excessive" where  "excessive heat" (there was just one outlier)
indx_exc <- grep("excessive$", cleaning_data$evtype)
cleaning_data$evtype[indx_exc] <- "excessive heat"
## Hail, rain, glaze
indx_hail <- grep("hails", cleaning_data$evtype)
indx_sho <- grep("show", cleaning_data$evtype)
cleaning_data$evtype[indx_hail] <- "hail"
cleaning_data$evtype[indx_sho] <- "heavy rain"
indx_hail <- grep("hail st", cleaning_data$evtype)
cleaning_data$evtype[indx_hail] <- "hail"
indx_glz <- grep("glaze", cleaning_data$evtype)
cleaning_data$evtype[indx_glz] <- "glaze"
indx_hail <- intersect(grep("thunder",cleaning_data$evtype), grep("hail", cleaning_data$evtype))
cleaning_data$evtype[indx_hail] <- "hail"
## Record heat
indx_rechigh <- intersect(setdiff(grep("record",cleaning_data$evtype),
                           grep("snow|low|cool", cleaning_data$evtype)),
			    grep("warm",cleaning_data$evtype))
cleaning_data$evtype[indx_rechigh] <- "excessive heat"
## "urban floods" are floods
indx_flood <- grep("urban", cleaning_data$evtype)
cleaning_data$evtype[indx_flood] <- "flood"
## "landslides" mapped to "debris flow"
indx_debris <- grep("land", cleaning_data$evtype)
cleaning_data$evtype[indx_debris] <- "debris flow"
## remove "dam" events
indx_dam <- grep("dam", cleaning_data$evtype)
cleaning_data <- cleaning_data[-indx_dam,]
## map to drought
indx_drmn <- grep("driest month", cleaning_data$evtype)
cleaning_data$evtype[indx_drmn] <- "drought"
## typo
indx_floo <- grep("flash floooding", cleaning_data$evtype)
cleaning_data$evtype[indx_floo] <- "flash flood"
## not clear where to fit "cool spells"
indx_cool <- grep("cool spell", cleaning_data$evtype)
cleaning_data <- cleaning_data[-indx_cool, ]
## more rain
indx_hvyrain <- grep("heavy rain/lightning", cleaning_data$evtype)
cleaning_data$evtype[indx_hvyrain] <- "heavy rain"
## Classify these as heat (not clear if they could go to excessive heat)
cleaning_data$evtype[grep("(high temperature record|hot pattern|hot spell| hot weather)",
                       cleaning_data$evtype)] <- "heat"
## seas
cleaning_data$evtype[grep("(high|heavy) seas",cleaning_data$evtype)] <- "high seas"
## Ice is a new category
cleaning_data$evtype[grep("ice|icy", cleaning_data$evtype)] <- "ice"
## remove unclear events
indx_rem <- grep("(^high$|lack)", cleaning_data$evtype)
cleaning_data <- cleaning_data[-indx_rem,]
## merge heavy and high swells
cleaning_data$evtype[grep("heavy swells", cleaning_data$evtype)] <- "high swells"
## "wall clouds" would probably fit in tornadoes
cleaning_data$evtype[grep("wall", cleaning_data$evtype)] <- "tornado"
## not clear were to fit late/light snow
indx_rem <- grep("(late|light) snow", cleaning_data$evtype)
cleaning_data <- cleaning_data[-indx_rem, ]
## typos
cleaning_data$evtype[grep("lightning", cleaning_data$evtype)] <- "lightning"
cleaning_data$evtype[grep("low temp", cleaning_data$evtype)] <- "extreme cold"
## Remove some marine incidents
indx_rem <- grep("marine (acc|mish)",cleaning_data$evtype)
cleaning_data <- cleaning_data[-indx_rem,]
## This was a "thunderstorm wind" according to the original description
cleaning_data$evtype[grep("metro", cleaning_data$evtype)] <- "thunderstorm wind"
## remove moderate/mild and northern lights
indx_rem <- grep("^micr|north|mod|mild", cleaning_data$evtype)
cleaning_data <- cleaning_data[-indx_rem,]
## these were cold
cleaning_data$evtype[grep("monthly temperature", cleaning_data$evtype)] <- "extreme cold"
## mud as debris
cleaning_data$evtype[grep("mud", cleaning_data$evtype)] <- "debris flow"
cleaning_data$evtype[grep("near record snow", cleaning_data$evtype)] <- "heavy snow"
## Unclassified events
indx_rem <- grep("(^no|no sever|other)",cleaning_data$evtype)
cleaning_data <- cleaning_data[-indx_rem,]
## Remove these events which do not fall in a single category or a main category
string_rem <- "(light freezing rain|rapidly rising water|record cool|record low rainfall"
string_rem <- paste(string_rem, "|red flag|severe turb|southeast)", sep="")
indx_rem <- grep(string_rem ,cleaning_data$evtype)
cleaning_data <- cleaning_data[-indx_rem,]
## more record temperatures
cleaning_data$evtype[grep("warm", cleaning_data$evtype)] <- "heat"
cleaning_data$evtype[grep("record high", cleaning_data$evtype)] <- "excessive heat"
cleaning_data$evtype[grep("record low$", cleaning_data$evtype)] <- "extreme cold"
cleaning_data$evtype[grep("record temperature", cleaning_data$evtype)] <- "excessive heat"
## floyd is a hurricane
cleaning_data$evtype[grep("floyd", cleaning_data$evtype)] <- "hurricane/typhoon"
## classify as debris
cleaning_data$evtype[grep("rock", cleaning_data$evtype)] <- "debris flow"
## put in high seas
cleaning_data$evtype[grep("rough seas", cleaning_data$evtype)] <- "high seas"
## 
cleaning_data$evtype[grep("seasonal snowfall", cleaning_data$evtype)] <- "heavy snow"
## These were floods (the name involves a "small")
cleaning_data$evtype[grep("^sm+", cleaning_data$evtype)] <- "flood"
## remove "late" events and wake wind
indx_rem <- grep("^late|^wake", cleaning_data$evtype)
cleaning_data <- cleaning_data[-indx_rem,]
## more corrections and fixing typos
cleaning_data$evtype[grep("rogue", cleaning_data$evtype)] <- "high waves"
cleaning_data$evtype[grep("storm surge", cleaning_data$evtype)] <- "coastal flood"
cleaning_data$evtype[grep("record temperature", cleaning_data$evtype)] <- "excessive heat" 
cleaning_data$evtype[grep("thud|tstm",cleaning_data$evtype)] <- "thunderstorm wind"
cleaning_data$evtype[grep("torndao",cleaning_data$evtype)] <- "tornado"
cleaning_data$evtype[grep("^wa",cleaning_data$evtype)] <- "waterspout"