-
Notifications
You must be signed in to change notification settings - Fork 11
/
Copy pathindex.qmd
193 lines (140 loc) · 10.9 KB
/
index.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
```{r}
#| label: preface-setup
#| include: false
source("R/_common.R")
set_options()
```
# Preface {.unnumbered}
Welcome! This is a work in progress. We want to create a practical guide to developing quality predictive models from tabular data. We'll publish materials here as we create them and welcome community contributions in the form of discussions, suggestions, and edits.
We also want these materials to be reusable and open. The sources are in the source [GitHub repository](https://github.com/aml4td/website) with a Creative Commons license attached (see below).
Our intention is to write these materials and, when we feel we're done, pick a publishing partner to produce a print version.
The book takes a holistic view of the predictive modeling process and focuses on a few areas that are usually left out of similar works. For example, the effectiveness of the model can be driven by how the predictors are represented. Because of this, we tightly couple feature engineering methods with machine learning models. Also, quite a lot of work happens after we have determined our best model and created the final fit. These post-modeling activities are an important part of the model development process and will be described in detail.
We deliberately avoid using the term "artificial intelligence." Eugen Rochko's (`@[email protected]`) comment on [Mastodon](https://mastodon.social/@Gargron/111554885513300997) does a good job of summarizing our reservations regarding the term:
> It’s hard not to say "AI" when everybody else does too, but technically calling it AI is buying into the marketing. There is no intelligence there, and it’s not going to become sentient. It's just statistics, and the danger they pose is primarily through the false sense of skill or fitness for purpose that people ascribe to them.
To cite this website, we suggest:
```{r}
#| label: citation
#| echo: false
#| results: asis
cite <- glue::glue("
@online{aml4td,
author = {Kuhn, M and Johnson, K},
title = {{Applied Machine Learning for Tabular Data}},
year = {2023},
url = { https://aml4td.org},
urldate = {[Sys.Date()]}
}
", .open = "[", .close = "]")
cite <- paste("```", cite, "```", sep = "\n")
cat(cite)
```
## License {.unnumbered}
<p xmlns:cc="http://creativecommons.org/ns#" >This work is licensed under <a href="http://creativecommons.org/licenses/by-nc-sa/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;">CC BY-NC-SA 4.0<img style="height:22px!important;margin-left:3px;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/cc.svg?ref=chooser-v1"><img style="height:22px!important;margin-left:3px;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/by.svg?ref=chooser-v1"><img style="height:22px!important;margin-left:3px;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/nc.svg?ref=chooser-v1"><img style="height:22px!important;margin-left:3px;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/sa.svg?ref=chooser-v1"></a></p>
This license requires that reusers give credit to the creator. It allows reusers to distribute, remix, adapt, and build upon the material in any medium or format, for noncommercial purposes only. If others modify or adapt the material, they must license the modified material under identical terms.
- BY: Credit must be given to you, the creator.
- NC: Only noncommercial use of your work is permitted. Noncommercial means not primarily intended for or directed towards commercial advantage or monetary compensation.
- SA: Adaptations must be shared under the same terms.
Our goal is to have an open book where people can reuse and reference the materials but can't just put their names on them and resell them (without our permission).
## Intended Audience {.unnumbered}
Our intended audience includes data analysts of many types: statisticians, data scientists, professors and instructors of machine learning courses, laboratory scientists, and anyone else who desires to understand how to create a model for prediction. We don't expect readers to be experts in these methods or the math behind them. Instead, our approach throughout this work is applied. That is, we want readers to use this material to build intuition about the predictive modeling process. What are good and bad ideas for the modeling process? What pitfalls should we look out for? How can we be confident that the model will be predictive for new samples? What are advantages and disadvantages of different types of models? These are just some of the questions that this work will address.
Some background in modeling and statistics will be extremely useful. Having seen or used basic regression models is good, and an understanding of basic statistical concepts such as variance, correlation, populations, samples, etc., is needed. There will also be some mathematical notation, so you'll need to be able to grasp these abstractions. But we will keep this to those parts where it is absolutely necessary. There are a few more statistically sophisticated sections for some of the more advanced topics.
If you would like a more theoretical treatment of machine learning models, then we recommend @HastieEtAl2017. Other books for gaining a more in-depth understanding of machine learning are @bishop2006pattern, @arnold2019computational and, for more of a deep learning focus, @goodfellow2016deep and/or @udl2023.
## Is There Code? {.unnumbered}
We definitely want to decouple the content of this work from specific software. [One of our other books](http://appliedpredictivemodeling.com/) on modeling had computing sections. Many people found these sections to be a useful resource at the time of the book's publication. However, code can quickly become outdated in today's computational environment. In addition, this information takes up a lot of page space that would be better used for other topics.
We will create _computing supplements_ to go along with the materials. Since we use R's tidymodels framework for calculations, the supplement currently in-progress is:
- [`tidymodels.aml4td.org`](https://tidymodels.aml4td.org)
If you are interested in working on a python/scikit-learn supplement, please [file an issue](https://github.com//aml4td/website/issues)
When there is code that corresponds to a particular section, there will be one or more icons at the ends of the section to link to the computing supplements, such as this:
`r r_comp("introduction.html")`
## Are There Exercises? {#sec-exercises}
Many readers found the Exercise sections of _Applied Predictive Modeling_ to be helpful for solidifying the concepts presented in each chapter. The current set can be found at [`exercises.aml4td.org`](https://exercises.aml4td.org)
## How can I Ask Questions? {#sec-help}
If you have questions about the content, it is probably best to ask on a public forum, like [cross-validated](https://stats.stackexchange.com/). You'll most likely get a faster answer there if you take the time to ask the questions in the best way possible.
If you want a direct answer from us, you should follow what I call [_Yihui's Rule_](https://yihui.org/en/2017/08/so-gh-email/): add an issue to GitHub (labeled as "Discussion") first. It may take some time for us to get back to you.
If you think there is a bug, please [file an issue](https://github.com//aml4td/website/issues).
## Can I Contribute? {.unnumbered}
There is a [contributing page](chapters/contributing.html) with details on how to get up and running to compile the materials (there are a lot of software dependencies) and suggestions on how to help.
If you just want to fix a typo, you can make a pull request to alter the appropriate `.qmd` file.
Please feel free to improve the quality of this content by submitting **pull requests**. A merged PR will make you appear in the contributor list. It will, however, be considered a donation of your work to this project. You are still bound by the conditions of the license, meaning that you are **not considered an author, copyright holder, or owner** of the content once it has been merged in.
## Computing Notes {.unnumbered}
```{r}
#| label: preface-versions
#| include: false
get_pkg_depends <- function() {
info <- read.dcf("DESCRIPTION")
pkgs <- strsplit(info[, "Imports"], "\\n")[[1]]
pkgs <- purrr::map_chr(pkgs, ~ gsub(",", "", .x))
pkgs <- strsplit(pkgs, " ")
pkgs <- purrr::map_chr(pkgs, ~ .x[1])
pkgs
}
make_matrix <- function(x, cols = 3) {
remainder <- length(x) %% cols
padding <- cols - remainder
if (padding > 0) {
x <- c(x, rep(" ", padding))
}
matrix(x, ncol = 3, byrow = TRUE)
}
write_pkg_list <- function() {
pkgs <- get_pkg_depends()
excld <- c("sessioninfo", "tinytex", "cli", "devtools", "doParallel",
"kableExtra", "knitr", "pak", "renv", "BiocParallel", "magick",
"rsvg", "pillar", "jsonlite", "gifski", "future", "text2vec",
"tibble", "waldo", "xfun", "yaml")
pkgs <- pkgs[!(pkgs %in% excld)]
loaded <-
purrr::map(pkgs,
~ try(
suppressPackageStartupMessages(
library(.x, character.only = TRUE, quietly = TRUE)
),
silent = TRUE
)
)
# Write to repo root
nm <- paste0("session-info-", Sys.info()["user"], "-", Sys.info()["machine"], ".txt")
# sessioninfo::session_info(to_file = nm)
# Save for text
si <-
sessioninfo::session_info()$packages %>%
tibble::as_tibble() %>%
dplyr::filter(package %in% pkgs)
pkgs <- purrr::map2_chr(si$package, si$loadedversion, ~ paste0("`", .x, "` (", .y, ")"))
make_matrix(pkgs)
}
```
[Quarto](https://quarto.org/) was used to compile and render the materials
```{r}
#| label: quarto-info
#| echo: false
#| comment: ""
quarto_info <- function(){
file_out <- tempfile("temp-quarto.txt")
system2(command = "quarto", args = "check", stderr = file_out)
res <- readLines(file_out)
res <- purrr::map_chr(res, cli::ansi_strip)
rms <- c("(|)", "(/)", "(\\)", "(/)", "(-)", "/Users", "Path:", "Install with")
for (pat in rms) {
res <- res[!grepl(pat, res, fixed = TRUE)]
}
res <- res[res != ""]
invisible(res)
}
quarto_res <- quarto_info()
req_quarto_version <- "1.4.533" # in future, use numeric_version()
# if (!any(grepl(req_quarto_version, quarto_res))) {
# cli::cli_abort("Version {req_quarto_version} of Quarto is required.
# See {.url https://quarto.org/docs/download/}")
# }
cat(quarto_res, sep = "\n")
```
[`r R.version.string`](https://en.wikipedia.org/wiki/R_(programming_language)) was used for the majority of the computations. [torch](https://en.wikipedia.org/wiki/Torch_(machine_learning)) `r torch:::torch_version` was also used. The versions of the primary R modeling and visualization packages used here are:
```{r}
#| label: write-pkg-versions
#| echo: false
#| comment: " "
#| results: asis
knitr::kable(write_pkg_list())
```
## Chapter References {.unnumbered}