[SWIG] Add streaming data support + cpp tests #3997

AlbertoEAF · 2021-02-17T19:27:04Z

Implements #3995. See that issue for the full explanation.

Description

Adds support to train LightGBM models through SWIG from an input dataset stream in a single-pass without having to know the dataset size.

This can later be exposed to remaining language bindings as ChunkedArray is a regular C++ class.

After ingesting the data stream to the ChunkedArray, LGBM_DatasetCreateFromMats can be used as usual to create the train dataset from an array of arrays.

The ChunkedArray provides both a high and low-level API to insert elements. The latter supports loading data in parallel but must be managed manually. The former is not thread-safe and is meant for single input stream ingestion. It is very basic to use. Call add(value) repeatedly and finally get the data through data() or even data_as_void(), the latter being mandatory in SWIG.

Changes

Add ChunkedArray
Add tests for ChunkedArray (cannot use gtest yet, so they are still called manually)
Add ChunkedArray_API_extensions.i
Add SWIG class wrappers to wrap ChunkedArray as a proper Java class
Add C++ tests to ChunkedArray using googletest

Refactor changes

Enable cpplint for the swig/ folder and fix warnings with StringArray.

AlbertoEAF · 2021-02-17T19:32:30Z

I'd like to add googletest tests but the MR that already has gtest support is not merged yet (#3555). The test file must be called manually for now.

swig/ChunkedArray_API_extensions.i

swig/test_ChunkedArray_manually.cpp

StrikerRUS · 2021-02-18T00:48:15Z

also ping @imatiach-msft
We will really appreciate your comments here!

swig/ChunkedArray.hpp

imatiach-msft

added some comments

swig/ChunkedArray.hpp

swig/test_ChunkedArray_manually.cpp

AlbertoEAF · 2021-03-03T14:57:41Z

@StrikerRUS @guolinke can we merge this?

The failing R tests are not caused by these changes as they only occur in Java's swig interface.

jameslamb · 2021-03-03T16:08:51Z

@StrikerRUS @guolinke can we merge this?

The failing R tests are not caused by these changes as they only occur in Java's swig interface.

Please update to the latest master. We've fixed several issues in CI over the last few days (#4010, #4042, #4032).

StrikerRUS · 2021-03-03T18:20:33Z

@StrikerRUS @guolinke can we merge this?

What about target release? Should this PR be "the last passenger" in terms of new features for 3.2.0 or we should wait a little bit and include it in the 4.0.0? As for me, I don't have any strong opinion and since this PR doesn't touch any existent code but add only new one, I don't mind to see it in 3.2.0.

AlbertoEAF · 2021-03-06T15:15:21Z

I agree with StrikerRUS. As this adds value and does not touch any existing code, it would be nice to have this in 3.2.0 so it could start being used in MMLSpark sooner rather than later, and so Feedzai's Java provider could get rid of the LightGBM fork which provides several yet unreleased features (model locale fix & streaming).

As for 4.0.0 we could improve the sparse streaming support by adding a method LGBM_DatasetCreateFromCSRs, that just like LGBM_DatasetCreateFromMats allows collecting creating a Dataset from a sparse input without coallescing first. This would further reduce peak memory usage (see discussion #3995 (comment) on memory spikes). As @StrikerRUS suggested we could also consider exposing ChunkedArray in the C interface if other people find it useful in other languages.

AlbertoEAF · 2021-03-12T12:51:07Z

gentle request for merge ping @guolinke :)

StrikerRUS · 2021-03-12T21:54:56Z

@AlbertoEAF Sorry for the delay!
Now that #3555 has been merged, could you please move tests on the appropriate place and make them GTest-compatible? Thank you!

AlbertoEAF · 2021-03-13T20:08:40Z

Ok @StrikerRUS , ported all the tests to googletest :)

I actually found a bug in coalesce_to that occured only in advanced usage with the low-level insertion API with explicit manual calls to new_chunk()+setitem() instead of add()'s so it was already worth it.

Hopefully this will make it easier for other devs to port C++ tests as they now have some examples, including extending gtest assertions.

StrikerRUS

@AlbertoEAF Great job! Thanks a lot!

Seems that SWIG wrapper should be updated with the recently added bool argument to coalesce_to() because right now SWIG wrapper compilation fails

Scanning dependencies of target _lightgbm_swig
[ 95%] Building CXX object CMakeFiles/_lightgbm_swig.dir/java/lightgbmlibJAVA_wrap.cxx.o
/__w/1/s/build/java/lightgbmlibJAVA_wrap.cxx: In function ‘void Java_com_microsoft_ml_lightgbm_lightgbmlibJNI_int64ChunkedArray_1coalesce_1to_1_1SWIG_10(JNIEnv*, jclass, jlong, jobject, jlong, jboolean)’:
/__w/1/s/build/java/lightgbmlibJAVA_wrap.cxx:4701:65: error: invalid conversion from ‘long long int*’ to ‘long int*’ [-fpermissive]
   ((ChunkedArray< int64_t > const *)arg1)->coalesce_to(arg2,arg3);
                                                                 ^
In file included from /__w/1/s/build/java/lightgbmlibJAVA_wrap.cxx:939:0:
/__w/1/s/include/../include/LightGBM/utils/ChunkedArray.hpp:147:10: error:   initializing argument 1 of ‘void ChunkedArray<T>::coalesce_to(T*, bool) const [with T = long int]’ [-fpermissive]
     void coalesce_to(T *other, bool all_valid_addresses = false) const {
          ^
/__w/1/s/build/java/lightgbmlibJAVA_wrap.cxx: In function ‘void Java_com_microsoft_ml_lightgbm_lightgbmlibJNI_int64ChunkedArray_1coalesce_1to_1_1SWIG_11(JNIEnv*, jclass, jlong, jobject, jlong)’:
/__w/1/s/build/java/lightgbmlibJAVA_wrap.cxx:4714:60: error: invalid conversion from ‘long long int*’ to ‘long int*’ [-fpermissive]
   ((ChunkedArray< int64_t > const *)arg1)->coalesce_to(arg2);
                                                            ^
In file included from /__w/1/s/build/java/lightgbmlibJAVA_wrap.cxx:939:0:
/__w/1/s/include/../include/LightGBM/utils/ChunkedArray.hpp:147:10: error:   initializing argument 1 of ‘void ChunkedArray<T>::coalesce_to(T*, bool) const [with T = long int]’ [-fpermissive]
     void coalesce_to(T *other, bool all_valid_addresses = false) const {
          ^
At global scope:
cc1plus: warning: unrecognized command line option "-Wno-ignored-attributes" [enabled by default]
make[2]: *** [CMakeFiles/_lightgbm_swig.dir/java/lightgbmlibJAVA_wrap.cxx.o] Error 1
make[1]: *** [CMakeFiles/_lightgbm_swig.dir/all] Error 2

Also, please consider checking my some very minor comments below.

include/LightGBM/utils/ChunkedArray.hpp

swig/StringArray.hpp

tests/cpp_test/test_ChunkedArray.cpp

jameslamb · 2021-03-19T03:35:55Z

/gha run r-valgrind

Workflow R valgrind tests has been triggered! 🚀
https://github.com/microsoft/LightGBM/actions/runs/666715529

Status: success ✔️.

jameslamb · 2021-03-19T13:34:20Z

Ok @AlbertoEAF , I think we've fixed the R tests. Could you please merge in the latest changes from master, to get the fixes from #4084 ? Sorry for the inconvenience.

* Add ChunkedArray * Add ChunkedArray_API_extensions.i * Add SWIG class wrappers

AlbertoEAF · 2021-03-19T14:50:39Z

Thanks for the R fixes @jameslamb ;)

@guolinke can you review for the v3.2.0 release? @StrikerRUS already approved. Thanks :)

guolinke · 2021-03-21T04:34:41Z

@AlbertoEAF sorry for miss this. I will review it today.

guolinke

Thank you!

* [docs]Add alt text on images * Update docs/GPU-Windows.rst Co-authored-by: James Lamb <[email protected]> * Update docs/GPU-Windows.rst Co-authored-by: James Lamb <[email protected]> * Apply suggestions from code review Co-authored-by: James Lamb <[email protected]> * Apply suggestions from code review Co-authored-by: James Lamb <[email protected]> * Merge main branch commit updates (#1) * [docs] Add alt text to image in Parameters-Tuning.rst (#4035) * [docs] Add alt text to image in Parameters-Tuning.rst Add alt text to Leaf-wise growth image, as part of #4028 * Update docs/Parameters-Tuning.rst Co-authored-by: James Lamb <[email protected]> Co-authored-by: James Lamb <[email protected]> * [ci] [R-package] upgrade to R 4.0.4 in CI (#4042) * [docs] update description of deterministic parameter (#4027) * update description of deterministic parameter to require using with force_row_wise or force_col_wise * Update include/LightGBM/config.h Co-authored-by: Nikita Titov <[email protected]> * update docs Co-authored-by: Nikita Titov <[email protected]> * [dask] Include support for init_score (#3950) * include support for init_score * use dataframe from init_score and test difference with and without init_score in local model * revert refactoring * initial docs. test between distributed models with and without init_score * remove ranker from tests * test value for root node and change docs * comma * re-include parametrize * fix incorrect merge * use single init_score and the booster_ attribute * use np.float64 instead of float * [ci] ignore untitle Jupyter notebooks in .gitignore (#4047) * [ci] prevent getting incompatible dask and distributed versions (#4054) * [ci] prevent getting incompatible dask and distributed versions * Update .ci/test.sh Co-authored-by: Nikita Titov <[email protected]> * empty commit Co-authored-by: Nikita Titov <[email protected]> * [ci] fix R CMD CHECK note about example timings (fixes #4049) (#4055) * [ci] fix R CMD CHECK note about example timings (fixes #4049) * Apply suggestions from code review Co-authored-by: Nikita Titov <[email protected]> * empty commit Co-authored-by: Nikita Titov <[email protected]> * [ci] add CMake + R 3.6 test back (fixes #3469) (#4053) * [ci] add CMake + R 3.6 test back (fixes #3469) * Apply suggestions from code review Co-authored-by: Nikita Titov <[email protected]> * Update .ci/test_r_package_windows.ps1 * -Wait and remove rtools40 * empty commit Co-authored-by: Nikita Titov <[email protected]> * [dask] include multiclass-classification task in tests (#4048) * include multiclass-classification task and task_to_model_factory dicts * define centers coordinates. flatten init_scores within each partition for multiclass-classification * include issue comment and fix linting error * Update index.rst (#4029) Add alt text to logo image Co-authored-by: James Lamb <[email protected]> * [dask] raise more informative error for duplicates in 'machines' (fixes #4057) (#4059) * [dask] raise more informative error for duplicates in 'machines' * uncomment * avoid test failure * Revert "avoid test failure" This reverts commit 9442bdf. * [dask] add tutorial documentation (fixes #3814, fixes #3838) (#4030) * [dask] add tutorial documentation (fixes #3814, fixes #3838) * add notes on saving the model * quick start examples * add examples * fix timeouts in examples * remove notebook * fill out prediction section * table of contents * add line back * linting * isort * Apply suggestions from code review Co-authored-by: Nikita Titov <[email protected]> * Apply suggestions from code review Co-authored-by: Nikita Titov <[email protected]> * move examples under python-guide * remove unused pickle import Co-authored-by: Nikita Titov <[email protected]> * set 'pending' commit status for R Solaris optional workflow (#4061) * [docs] add Yu Shi to repo maintainers (#4060) * Update FAQ.rst * Update CODEOWNERS * set is_linear_ to false when it is absent from the model file (fix #3778) (#4056) * Add CMake option to enable sanitizers and build gtest (#3555) * Add CMake option to enable sanitizer * Set up gtest * Address reviewer's feedback * Address reviewer's feedback * Update CMakeLists.txt Co-authored-by: Nikita Titov <[email protected]> Co-authored-by: Nikita Titov <[email protected]> * added type hint (#4070) * [ci] run Dask examples on CI (#4064) * Update Parallel-Learning-Guide.rst * Update test.sh * fix path * address review comments * [python-package] add type hints on Booster.set_network() (#4068) * [python-package] add type hints on Booster.set_network() * change behavior * [python-package] Some mypy fixes (#3916) * Some mypy fixes * address James' comments * Re-introduce pass in empty classes * Update compat.py Remove extra lines * [dask] [ci] fix flaky network-setup test (#4071) * [tests][dask] simplify code in Dask tests (#4075) * simplify Dask tests code * enable CI * disable CI * Revert "[ci] prevent getting incompatible dask and distributed versions (#4054)" (#4076) This reverts commit 4e9c976. * Fix parsing of non-finite values (#3942) * Fix index out-of-range exception generated by BaggingHelper on small datasets. Prior to this change, the line "score_t threshold = tmp_gradients[top_k - 1];" would generate an exception, since tmp_gradients would be empty when the cnt input value to the function is zero. * Update goss.hpp * Update goss.hpp * Add API method LGBM_BoosterPredictForMats which runs prediction on a data set given as of array of pointers to rows (as opposed to existing method LGBM_BoosterPredictForMat which requires data given as contiguous array) * Fix incorrect upstream merge * Add link to LightGBM.NET * Fix indenting to 2 spaces * Dummy edit to trigger CI * Dummy edit to trigger CI * remove duplicate functions from merge * Fix parsing of non-finite values. Current implementation silently returns zero when input string is "inf", "-inf", or "nan" when compiled with VS2017, so instead just explicitly check for these values and fail if there is no match. No attempt to optimise string allocations in this implementation since it is usually rarely invoked. * Dummy commit to trigger CI * Also handle -nan in double parsing method * Update include/LightGBM/utils/common.h Remove trailing whitespace to pass linting tests Co-authored-by: Nikita Titov <[email protected]> Co-authored-by: matthew-peacock <[email protected]> Co-authored-by: Guolin Ke <[email protected]> Co-authored-by: Nikita Titov <[email protected]> * [dask] remove unused imports from typing (#4079) * Range check for DCG position discount lookup (#4069) * Add check to prevent out of index lookup in the position discount table. Add debug logging to report number of queries found in the data. * Change debug logging location so that we can print the data file name as well. * Revert "Change debug logging location so that we can print the data file name as well." This reverts commit 3981b34. * Add data file name to debug logging. * Move log line to a place where it is output even when query IDs are read from a separate file. * Also add the out-of-range check to rank metrics. * Perform check after number of queries is initialized. * Update * [ci] upgrade R CI scripts to work on Ubuntu 20.04 (#4084) * [ci] install additional LaTeX packages in R CI jobs * update autoconf version * bump upper limit on package size to 100 * [SWIG] Add streaming data support + cpp tests (#3997) * [feature] Add ChunkedArray to SWIG * Add ChunkedArray * Add ChunkedArray_API_extensions.i * Add SWIG class wrappers * Address some review comments * Fix linting issues * Move test to tests/test_ChunkedArray_manually.cpp * Add test note * Move ChunkedArray to include/LightGBM/utils/ * Declare more explicit types of ChunkedArray in the SWIG API. * Port ChunkedArray tests to googletest * Please C++ linter * Address StrikerRUS' review comments * Update SWIG doc & disable ChunkedArray<int64_t> * Use CHECK_EQ instead of assert * Change include order (linting) * Rename ChunkedArray -> chunked_array files * Change header guards * Address last comments from StrikerRUS * store all CMake files in one place (#4087) * v3.2.0 release (#3872) * Update VERSION.txt * update appveyor.yml and configure * fix Appveyor builds Co-authored-by: James Lamb <[email protected]> Co-authored-by: Nikita Titov <[email protected]> Co-authored-by: StrikerRUS <[email protected]> * [ci] Bump version for development (#4094) * Update .appveyor.yml * Update cran-comments.md * Update VERSION.txt * update configure Co-authored-by: James Lamb <[email protected]> * [ci] fix flaky Azure Pipelines jobs (#4095) * Update test.sh * Update setup.sh * Update .vsts-ci.yml * Update test.sh * Update setup.sh * Update .vsts-ci.yml * Update setup.sh * Update setup.sh Co-authored-by: Subham Agrawal <[email protected]> Co-authored-by: James Lamb <[email protected]> Co-authored-by: shiyu1994 <[email protected]> Co-authored-by: Nikita Titov <[email protected]> Co-authored-by: jmoralez <[email protected]> Co-authored-by: marcelonieva7 <[email protected]> Co-authored-by: Philip Hyunsu Cho <[email protected]> Co-authored-by: Deddy Jobson <[email protected]> Co-authored-by: Alberto Ferreira <[email protected]> Co-authored-by: mjmckp <[email protected]> Co-authored-by: matthew-peacock <[email protected]> Co-authored-by: Guolin Ke <[email protected]> Co-authored-by: ashok-ponnuswami-msft <[email protected]> Co-authored-by: StrikerRUS <[email protected]> * Apply suggestions from code review Co-authored-by: Nikita Titov <[email protected]> Co-authored-by: James Lamb <[email protected]> Co-authored-by: Subham Agrawal <[email protected]> Co-authored-by: shiyu1994 <[email protected]> Co-authored-by: Nikita Titov <[email protected]> Co-authored-by: jmoralez <[email protected]> Co-authored-by: marcelonieva7 <[email protected]> Co-authored-by: Philip Hyunsu Cho <[email protected]> Co-authored-by: Deddy Jobson <[email protected]> Co-authored-by: Alberto Ferreira <[email protected]> Co-authored-by: mjmckp <[email protected]> Co-authored-by: matthew-peacock <[email protected]> Co-authored-by: Guolin Ke <[email protected]> Co-authored-by: ashok-ponnuswami-msft <[email protected]> Co-authored-by: StrikerRUS <[email protected]>

github-actions · 2023-08-23T23:58:05Z

This pull request has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.

StrikerRUS added the feature label Feb 18, 2021

StrikerRUS requested review from btrotta, chivee, guolinke and shiyu1994 February 18, 2021 00:43

StrikerRUS reviewed Feb 18, 2021

View reviewed changes

swig/ChunkedArray_API_extensions.i Outdated Show resolved Hide resolved

swig/test_ChunkedArray_manually.cpp Outdated Show resolved Hide resolved

AlbertoEAF commented Feb 18, 2021

View reviewed changes