[REVIEW] Symbolic Regression/Classification C/C++ #3638

Merged
Changes from 86 commits
88 commits
0e9d987
Added fitness functions
vimarsh6739 Mar 15, 2021
2bfdcbd
Added single program execution kernel
vimarsh6739 Mar 18, 2021
e7ce249
Added batched tournament kernel impl
vimarsh6739 Mar 19, 2021
b8fa564
Merge branch 'branch-0.19' into fea-ext-genetic-programming-internals
vimarsh6739 Mar 19, 2021
4d55122
Added point mutations and subtree extraction
vimarsh6739 Apr 1, 2021
befe036
Removed some compilation bugs
vimarsh6739 Apr 1, 2021
07bb099
Added crossover
vimarsh6739 Apr 2, 2021
82ef099
Merge branch 'fea-ext-genetic-programming-internals' of https://githu…
vimarsh6739 Apr 2, 2021
e54c0bd
Added hoist mutations
vimarsh6739 Apr 2, 2021
e7cc561
Build random full depth programs
vimarsh6739 Apr 2, 2021
28fdbba
Pass twister directly when mutating
vimarsh6739 Apr 9, 2021
764abaf
Mutation decision before tournaments
vimarsh6739 Apr 11, 2021
1e9b0ae
Host uses only call by references
vimarsh6739 Apr 11, 2021
800811b
Batched execution done
vimarsh6739 Apr 12, 2021
799ee31
Batched loss function impl left
vimarsh6739 Apr 12, 2021
cd66f94
Added batched version of all loss functions
vimarsh6739 Apr 12, 2021
62830de
Initial generation covered
vimarsh6739 Apr 13, 2021
b71f96c
Testing template + double free bugfix in program
vimarsh6739 Apr 14, 2021
01c4804
Added tests for loss functions
vimarsh6739 Apr 14, 2021
9ae8007
Fixed row broadcasting bug
vimarsh6739 Apr 15, 2021
c5301d1
Increased tolerance to 5%. To optimize log loss
vimarsh6739 Apr 15, 2021
e9887d1
Get Stackd
vimarsh6739 Apr 15, 2021
3c3f1a1
Added all tests
vimarsh6739 Apr 15, 2021
040961c
A small price for stability
vimarsh6739 Apr 15, 2021
9302f1d
2.5% relative error, pearson + logloss :(
vimarsh6739 Apr 16, 2021
3ce11c1
Abs testing for pearson, spearman
vimarsh6739 Apr 17, 2021
1261241
Added minimal fn subset to be exposed
vimarsh6739 Apr 18, 2021
c9d82fb
Added simple fit function
vimarsh6739 Apr 28, 2021
0df50a5
Set return values!
vimarsh6739 Apr 28, 2021
1d142c6
Predict clf reg
vimarsh6739 Apr 29, 2021
ec89793
Added transform function
vimarsh6739 Apr 29, 2021
d1bb1e6
Fixed program initialization
vimarsh6739 Apr 29, 2021
db7fe39
Unrolled stack ops
vimarsh6739 Apr 29, 2021
0024e54
Increase readability
vimarsh6739 Apr 29, 2021
256f2e6
Wrong param fixed!
vimarsh6739 Apr 29, 2021
447a316
Added example, fixed mutation bugs
vimarsh6739 May 17, 2021
f01a74d
Fixed a few initialization bugs
vimarsh6739 May 18, 2021
400d97a
Hoist mutations uses subtree of subtree!
vimarsh6739 May 18, 2021
6a78308
Corrected stringify
vimarsh6739 May 18, 2021
f34dda0
Purge memory leaks
vimarsh6739 May 19, 2021
09d7338
Free device mem in example
vimarsh6739 May 19, 2021
12d8c22
Hoisted crossover | call by ref for dev pointers
vimarsh6739 May 23, 2021
78477e0
Corrected RAII bug in tests
vimarsh6739 May 23, 2021
87efef8
Fixed error for rms loss!
vimarsh6739 May 23, 2021
5781418
Shifted to rmm | Tracked training time
vimarsh6739 May 23, 2021
7ae0e75
Added example readme
vimarsh6739 May 23, 2021
4645051
Update standalone cmake build file
vimarsh6739 May 25, 2021
c6c6a39
Updated README for example
vimarsh6739 May 25, 2021
1ab451c
Added support for custom terminal ratios
vimarsh6739 May 27, 2021
86589f5
GTest for sym reg
vimarsh6739 May 29, 2021
fa90e9c
Increased max_depth to 20(for now)
vimarsh6739 Jun 3, 2021
d98ea0c
Updated exampe to account for stack size
vimarsh6739 Jun 3, 2021
b1dc548
Improve readability + low_memory support
vimarsh6739 Jun 4, 2021
339eb80
Updated README
vimarsh6739 Jun 4, 2021
3cadfd3
Ran clang-format on all source files
vimarsh6739 Jun 12, 2021
007a289
Eliminate genetic.cuh
vimarsh6739 Jun 23, 2021
d36c8d2
Eliminate cuml.hpp in example
vimarsh6739 Jun 23, 2021
e0c54b3
doxygen updates
vimarsh6739 Jun 23, 2021
b6d9c7e
Fixed bug for 16M row fitness computation
vimarsh6739 Jun 27, 2021
c208ffc
Training fix for classification
vimarsh6739 Jun 29, 2021
d580cae
Fixed default value for height in ctor
vimarsh6739 Jun 29, 2021
adaf925
Bumped all datatypes to uint64_t
vimarsh6739 Jul 1, 2021
6f596fe
Inline parsimony computation
vimarsh6739 Jul 1, 2021
61ee16a
Increase file parsing speed in example
vimarsh6739 Jul 1, 2021
a4d965e
Stable values for logistic loss
vimarsh6739 Jul 2, 2021
5dc61cf
Clean code
vimarsh6739 Jul 10, 2021
235b453
Merge branch 'branch-21.08' of https://github.com/rapidsai/cuml into …
venkywonka Jul 15, 2021
5a052de
fix improper merge conflict
venkywonka Jul 15, 2021
64f6bdc
FIX clang format
venkywonka Jul 15, 2021
ce8ef9d
FIX copyright
venkywonka Jul 15, 2021
a3abc57
Update standalone cmake
vimarsh6739 Jul 17, 2021
8410561
Fix standalone cmake
vimarsh6739 Jul 17, 2021
0e13d32
Merge branch 'branch-21.10' of https://github.com/rapidsai/cuml into …
venkywonka Sep 8, 2021
bdf2637
address review changes
venkywonka Sep 9, 2021
133ec59
FIX clang format
venkywonka Sep 9, 2021
41394f4
remove old code from example
venkywonka Sep 13, 2021
aa42ebc
Merge branch 'branch-21.10' of https://github.com/rapidsai/cuml into …
venkywonka Sep 13, 2021
b54fb41
Merge branch 'branch-21.12' into fea-ext-genetic-programming-internals
vimarsh6739 Oct 8, 2021
8f3238d
Merge branch 'branch-21.12' of https://github.com/rapidsai/cuml into …
venkywonka Oct 14, 2021
2862747
Merge branch 'branch-21.12' of https://github.com/rapidsai/cuml into …
venkywonka Oct 18, 2021
de83496
Merge branch 'branch-21.12' of https://github.com/rapidsai/cuml into …
venkywonka Oct 21, 2021
3b55e2f
Merge branch 'branch-21.12' into fea-ext-genetic-programming-internals
venkywonka Oct 26, 2021
8e96272
fix memleak and change all device allocation to rmm
venkywonka Oct 27, 2021
b756c13
fix memory leak of last generation outside in test
venkywonka Oct 28, 2021
a875351
add a doxygen note detailing the memory allocation behaviour
venkywonka Oct 29, 2021
57e88f9
remove unique_ptr
venkywonka Nov 1, 2021
64837aa
Merge branch 'branch-21.12' into fea-ext-genetic-programming-internals
venkywonka Nov 15, 2021
0ca30f1
accounting for raft updates on matrix,stats and random
venkywonka Nov 15, 2021
1 change: 1 addition & 0 deletions cpp/CMakeLists.txt
@@ -263,6 +263,7 @@ if(BUILD_CUML_CPP_LIBRARY)
src/fil/infer.cu
src/glm/glm.cu
src/genetic/genetic.cu
src/genetic/program.cu
src/genetic/node.cu
src/hdbscan/hdbscan.cu
src/hdbscan/condensed_hierarchy.cu
3 changes: 2 additions & 1 deletion cpp/examples/CMakeLists.txt
@@ -1,5 +1,5 @@
#
-# Copyright (c) 2019, NVIDIA CORPORATION.
+# Copyright (c) 2019-2021, NVIDIA CORPORATION.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
@@ -16,3 +16,4 @@

add_subdirectory(kmeans)
add_subdirectory(dbscan)
add_subdirectory(symreg)
19 changes: 19 additions & 0 deletions cpp/examples/symreg/CMakeLists.txt
@@ -0,0 +1,19 @@
#=============================================================================
# Copyright (c) 2021, NVIDIA CORPORATION.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#=============================================================================

add_executable(symreg_example symreg_example.cpp)
target_include_directories(symreg_example PRIVATE ${CUML_INCLUDE_DIRECTORIES})
target_link_libraries(symreg_example cuml++)
33 changes: 33 additions & 0 deletions cpp/examples/symreg/CMakeLists_standalone.txt
@@ -0,0 +1,33 @@
#
# Copyright (c) 2021, NVIDIA CORPORATION.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
cmake_minimum_required(VERSION 3.8 FATAL_ERROR)
include(ExternalProject)

project(symreg_example VERSION 0.1.0 LANGUAGES CXX CUDA )

set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED ON)

find_package(CUDAToolkit)
find_package(cuml)

add_executable(symreg_example symreg_example.cpp)

# Need to set linker language to CUDA to link the CUDA Runtime
set_target_properties(symreg_example PROPERTIES LINKER_LANGUAGE "CUDA")

# Link cuml and cudart
target_link_libraries(symreg_example cuml::cuml++ CUDA::cudart)
87 changes: 87 additions & 0 deletions cpp/examples/symreg/README.md
@@ -0,0 +1,87 @@
# symbolic regression
This subfolder contains an example of how to perform symbolic regression in cuML (from C++).
There are two `CMakeLists.txt` in this folder:
1. `CMakeLists.txt` (default) which is included when building cuML
2. `CMakeLists_standalone.txt` as an example for a standalone project linking to `libcuml.so`

## Build
`symreg_example` is built as a part of cuML. To build it as a standalone executable, do
```bash
$ cmake .. -DCUML_LIBRARY_DIR=/path/to/directory/with/libcuml.so -DCUML_INCLUDE_DIR=/path/to/cuml/headers
```
Then build with `make` or `ninja`:
```
$ make
Scanning dependencies of target raft
[ 10%] Creating directories for 'raft'
[ 20%] Performing download step (git clone) for 'raft'
Cloning into 'raft'...
[ 30%] Performing update step for 'raft'
[ 40%] No patch step for 'raft'
[ 50%] No configure step for 'raft'
[ 60%] No build step for 'raft'
[ 70%] No install step for 'raft'
[ 80%] Completed 'raft'
[ 80%] Built target raft
Scanning dependencies of target symreg_example
[ 90%] Building CXX object CMakeFiles/symreg_example.dir/symreg_example.cpp.o
[100%] Linking CUDA executable symreg_example
[100%] Built target symreg_example
```
`CMakeLists_standalone.txt` also loads a minimal set of header dependencies (namely [raft](https://github.com/rapidsai/raft) and [cub](https://github.com/NVIDIA/cub)) if they are not detected on the system.
## Run

1. Generate a toy training and test dataset
```
$ python prepare_input.py
Training set has n_rows=250 n_cols=2
Test set has n_rows=50 n_cols=2
Wrote 500 values to train_data.txt
Wrote 100 values to test_data.txt
Wrote 250 values to train_labels.txt
Wrote 50 values to test_labels.txt
```

2. Run the symbolic regressor using the 4 files as inputs. An example query is given below:
```bash
$ ./symreg_example -n_cols 2 \
-n_train_rows 250 \
-n_test_rows 50 \
-random_state 21 \
-population_size 4000 \
-generations 20 \
-stopping_criteria 0.01 \
-p_crossover 0.7 \
-p_subtree 0.1 \
-p_hoist 0.05 \
-p_point 0.1 \
-parsimony_coefficient 0.01
```

3. The corresponding output for the above query is given below:

```
Reading input with 250 rows and 2 columns from train_data.txt.
Reading input with 250 rows from train_labels.txt.
Reading input with 50 rows and 2 columns from test_data.txt.
Reading input with 50 rows from test_labels.txt.
***************************************
Allocating device memory...
Allocation time = 0.259072ms
***************************************
Beginning training on given dataset...
Finished training for 4 generations.
Best AST index : 1855
Best AST depth : 3
Best AST length : 13
Best AST equation :( add( sub( mult( X0, X0) , div( X1, X1) ) , sub( X1, mult( X1, X1) ) ) )
Training time = 626.658ms
***************************************
Beginning Inference on Test dataset...
Inference score on test set = 5.29271e-08
Inference time = 0.35248ms
Some Predicted test values:
-1.65061;-1.64081;-0.91711;-2.28976;-0.280688;
Corresponding Actual test values:
-1.65061;-1.64081;-0.91711;-2.28976;-0.280688;
```
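
The reported best AST can be checked by hand against the function that `prepare_input.py` uses to generate the labels, `X0**2 - X1**2 + X1 - 1`. The sketch below is a NumPy re-implementation for illustration only, not cuML code: the helper names `add`, `sub`, `mult` and `div` simply mirror the printed equation, and `div` is assumed to be a protected division that returns 1 for near-zero denominators (the convention used by gplearn; the threshold here is a guess).

```python
import numpy as np

# Hypothetical stand-ins for the operators printed in the AST.
def add(a, b):
    return a + b

def sub(a, b):
    return a - b

def mult(a, b):
    return a * b

def div(a, b):
    # Assumed protected division: result is 1 where |b| is tiny.
    safe = np.abs(b) > 1e-3
    out = np.ones_like(a, dtype=float)
    np.divide(a, b, out=out, where=safe)
    return out

# Sample inputs from the same range as the toy dataset.
rng = np.random.RandomState(seed=0)
X0 = rng.uniform(-1, 1, 1000)
X1 = rng.uniform(-1, 1, 1000)

# ( add( sub( mult( X0, X0) , div( X1, X1) ) , sub( X1, mult( X1, X1) ) ) )
best = add(sub(mult(X0, X0), div(X1, X1)), sub(X1, mult(X1, X1)))

# Generating function from prepare_input.py.
target = X0**2 - X1**2 + X1 - 1

max_abs_diff = np.max(np.abs(best - target))
print("max |best - target| =", max_abs_diff)
```

Since `div(X1, X1)` evaluates to 1 under this assumption, the AST simplifies to `X0*X0 - 1 + X1 - X1*X1`, which is the generating function with its terms reordered. That is consistent with the near-zero inference score in the output above.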
46 changes: 46 additions & 0 deletions cpp/examples/symreg/prepare_input.py
@@ -0,0 +1,46 @@
# Copyright (c) 2021, NVIDIA CORPORATION.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(seed=2021)

# Training samples
X_train = rng.uniform(-1, 1, 500).reshape(250, 2)
y_train = X_train[:, 0]**2 - X_train[:, 1]**2 + X_train[:, 1] - 1

# Testing samples
X_test = rng.uniform(-1, 1, 100).reshape(50, 2)
y_test = X_test[:, 0]**2 - X_test[:, 1]**2 + X_test[:, 1] - 1

print("Training set has n_rows=%d n_cols=%d" %(X_train.shape))
print("Test set has n_rows=%d n_cols=%d" %(X_test.shape))

train_data = "train_data.txt"
test_data = "test_data.txt"
train_labels = "train_labels.txt"
test_labels = "test_labels.txt"

# Save all datasets in col-major format
np.savetxt(train_data, X_train.T,fmt='%.7f')
np.savetxt(test_data, X_test.T,fmt='%.7f')
np.savetxt(train_labels, y_train,fmt='%.7f')
np.savetxt(test_labels, y_test,fmt='%.7f')

print("Wrote %d values to %s"%(X_train.size,train_data))
print("Wrote %d values to %s"%(X_test.size,test_data))
print("Wrote %d values to %s"%(y_train.size,train_labels))
print("Wrote %d values to %s"%(y_test.size,test_labels))
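
A short sketch of what the "col-major" comment above means in practice: writing `X.T` with `np.savetxt` puts all values of column 0 first, then all values of column 1, so a reader that consumes the file as one flat stream sees the matrix in column-major order. The snippet uses an in-memory buffer instead of `train_data.txt`, and the assumption that the consumer reads a flat stream is mine; how `symreg_example` actually parses the files is not shown here.

```python
import io

import numpy as np

rng = np.random.RandomState(seed=2021)
X = rng.uniform(-1, 1, 6).reshape(3, 2)  # 3 rows, 2 columns

# Same call shape as in prepare_input.py, but into an in-memory buffer.
buf = io.StringIO()
np.savetxt(buf, X.T, fmt='%.7f')
buf.seek(0)

# Read the values back in file order (one feature column per text line).
flat = np.loadtxt(buf).ravel()

# Interpreting the flat stream as column-major recovers the original matrix.
X_back = flat.reshape(X.shape, order='F')

print(np.allclose(X_back, X, atol=1e-7))
```

The tolerance accounts for the `%.7f` format, which rounds each value to seven decimal places.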