-
Notifications
You must be signed in to change notification settings - Fork 1
/
README.Rmd
153 lines (111 loc) · 6.78 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
---
output:
md_document:
variant: markdown_github
html_document:
variant: markdown_github
keep_md: true
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
# Store user's options()
old_options <- options()
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```
# tlars
[![CRAN_Status_Badge](https://www.r-pkg.org/badges/version/tlars)](https://CRAN.R-project.org/package=tlars)
[![CRAN Downloads](https://cranlogs.r-pkg.org/badges/tlars)](https://CRAN.R-project.org/package=tlars)
![CRAN Downloads Total](https://cranlogs.r-pkg.org/badges/grand-total/tlars?color=brightgreen)
**Title**: The T-LARS Algorithm: Early-Terminated Forward Variable Selection
**Description**: It computes the solution path of the Terminating-LARS (T-LARS) algorithm. The T-LARS algorithm appends dummy predictors to the original predictor matrix and terminates the forward-selection process after a pre-defined number of dummy variables has been selected.
**Paper**: The package is based on the paper
J. Machkour, M. Muma, and D. P. Palomar, “The terminating-random experiments selector: Fast high-dimensional variable selection with false discovery rate control,” arXiv preprint arXiv:2110.06048, 2022. (<https://doi.org/10.48550/arXiv.2110.06048>)
**Note**: The T-LARS algorithm is a major building block of the T-Rex selector ([Paper](https://arxiv.org/abs/2110.06048) and [R package](https://CRAN.R-project.org/package=TRexSelector)). The T-Rex selector performs terminated-random experiments (T-Rex) using the T-LARS algorithm and fuses the selected active sets of all random experiments to obtain a final set of selected variables. The T-Rex selector provably controls the false discovery rate (FDR), i.e., the expected fraction of selected false positives among all selected variables, at the user-defined target level while maximizing the number of selected variables.
In the following, we show how to use the package and give you an idea of why terminating the solution path early is a reasonable approach in high-dimensional and sparse variable selection: In many applications, most active variables enter the solution path early!
## Installation
You can install the 'tlars' package (stable version) from [CRAN](https://CRAN.R-project.org/package=tlars) with
``` r
install.packages("tlars")
library(tlars)
```
You can install the 'tlars' package (developer version) from [GitHub](https://github.com/jasinmachkour/tlars) with
``` r
install.packages("devtools")
devtools::install_github("jasinmachkour/tlars")
```
You can open the help pages with
```{r, eval=FALSE}
library(tlars)
help(package = "tlars")
?tlars
?tlars_model
?tlars_cpp
?plot.Rcpp_tlars_cpp
?print.Rcpp_tlars_cpp
?Gauss_data
```
To cite the package 'tlars' in publications use:
```{r, eval=FALSE}
citation("tlars")
```
## Quick Start
In the following, we illustrate the basic usage of the 'tlars' package to perform variable selection in sparse and high-dimensional regression settings using the T-LARS algorithm.
1. **First**, we generate a high-dimensional Gaussian data set with sparse support:
```{r}
library(tlars)
# Setup
n <- 75 # Number of observations
p <- 150 # Number of variables
num_act <- 3 # Number of true active variables
beta <- c(rep(1, times = num_act), rep(0, times = p - num_act)) # Coefficient vector
true_actives <- which(beta > 0) # Indices of true active variables
num_dummies <- p # Number of dummy predictors (or dummies)
# Generate Gaussian data
set.seed(123)
X <- matrix(stats::rnorm(n * p), nrow = n, ncol = p)
y <- X %*% beta + stats::rnorm(n)
```
2. **Second**, we generate a dummy matrix containing n rows and num_dummies dummy predictors that are sampled from the standard normal distribution and append it to the original predictor matrix:
```{r}
set.seed(1234)
dummies <- matrix(stats::rnorm(n * num_dummies), nrow = n, ncol = num_dummies)
XD <- cbind(X, dummies)
```
3. **Third**, we generate an object of the C++ class 'tlars_cpp' and supply the information that the last num_dummies predictors in XD are dummy predictors:
```{r}
mod_tlars <- tlars_model(X = XD, y = y, num_dummies = num_dummies)
```
4. **Finally**, we perform three T-LARS steps on 'mod_tlars', i.e., the T-LARS algorithm is run until **T_stop = 3** dummies have entered the solution path and stops there. For comparison, we also compute the whole solution path by setting early_stop = FALSE:
4.1. Perform three T-LARS steps on object 'mod_tlars':
```{r terminated_solution_path, echo=TRUE, fig.align='center', message=TRUE, fig.width = 10, fig.height = 6, out.width = "90%"}
tlars(model = mod_tlars, T_stop = 3, early_stop = TRUE) # Perform three T-LARS steps on object "mod_tlars"
print(mod_tlars) # Print information about the results of the performed T-LARS steps
plot(mod_tlars) # Plot the terminated solution path
```
4.2. Compute the whole solution path:
```{r full_solution_path, echo=TRUE, fig.align='center', message=TRUE, fig.width = 10, fig.height = 6, out.width = "90%"}
tlars(model = mod_tlars, early_stop = FALSE) # Compute the whole solution path
print(mod_tlars) # Print information about the results
plot(mod_tlars) # Plot the whole solution path
```
## Outlook
The T-LARS algorithm is a major building block of the T-Rex selector ([Paper](https://arxiv.org/abs/2110.06048) and [R package](https://CRAN.R-project.org/package=TRexSelector)). The T-Rex selector performs terminated-random experiments (T-Rex) using the T-LARS algorithm and fuses the selected active sets of all random experiments to obtain a final set of selected variables. The T-Rex selector provably controls the FDR at the user-defined target level while maximizing the number of selected variables. If you are working in genomics, financial engineering, or any other field that requires a fast and FDR-controlling variable/feature selection method for large-scale high-dimensional settings, then this is for you. Check it out!
## Documentation
For more information and some examples, please check the
[GitHub-vignette](https://htmlpreview.github.io/?https://github.com/jasinmachkour/tlars/blob/main/vignettes/tlars_variable_selection.html).
## Links
tlars package (stable version): [CRAN-tlars](https://CRAN.R-project.org/package=tlars).
tlars package (developer version): [GitHub-tlars](https://github.com/jasinmachkour/tlars).
README file: [GitHub-readme](https://htmlpreview.github.io/?https://github.com/jasinmachkour/tlars/blob/main/README.html).
Vignette: [GitHub-vignette](https://htmlpreview.github.io/?https://github.com/jasinmachkour/tlars/blob/main/vignettes/tlars_variable_selection.html).
TRexSelector package: [CRAN-TRexSelector](https://CRAN.R-project.org/package=TRexSelector).
T-Rex paper: https://arxiv.org/abs/2110.06048
```{r, include = FALSE}
# Reset user's options()
options(old_options)
```