[ci] regression tests: binary Dataset format #4406

Closed
jameslamb opened this issue Jun 25, 2021 · 2 comments

Comments

@jameslamb
Collaborator

Summary

This project's continuous integration (CI) should include a job that tests whether LightGBM Dataset binary files produced by previous versions can be successfully loaded and used in newer versions.

Specifically, it should test the following claim:

Binary Dataset files produced in LightGBM version (N).x.x should be readable and usable in all versions in the same major version series.

It should also include tests of expected compatibility between other versions. For example, if 4.0.0 does not include breaking changes to saving / loading of Dataset files, then a test should confirm that a Dataset file created in LightGBM 3.2.1 can be loaded in LightGBM 4.0.0.
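
One way to enumerate the cases such a job would cover is to build a list of (writer version, reader version) pairs. The sketch below is only an illustration: the names `parse_version` and `compatibility_pairs` are hypothetical, the version list is an example, and it assumes the job checks older-writer to newer-reader pairs within a major series plus any cross-major pairs that are listed explicitly.

```python
from itertools import combinations
from typing import Iterable, List, Tuple

Pair = Tuple[str, str]


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn '3.2.1' into (3, 2, 1) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def compatibility_pairs(
    versions: Iterable[str], extra_pairs: Iterable[Pair] = ()
) -> List[Pair]:
    """Return (writer, reader) pairs a regression job could test.

    Assumption: only older-writer -> newer-reader pairs inside the same
    major series are implied; cross-major pairs must be passed explicitly.
    """
    ordered = sorted(set(versions), key=parse_version)
    # combinations() on a sorted list yields each (older, newer) pair once.
    pairs = [
        (writer, reader)
        for writer, reader in combinations(ordered, 2)
        if parse_version(writer)[0] == parse_version(reader)[0]
    ]
    for pair in extra_pairs:
        pair = tuple(pair)
        if pair not in pairs:
            pairs.append(pair)
    return pairs


if __name__ == "__main__":
    example = compatibility_pairs(
        ["3.1.0", "3.2.1", "4.0.0"],
        extra_pairs=[("3.2.1", "4.0.0")],  # only valid if 4.0.0 keeps the format
    )
    for writer, reader in example:
        print(f"write with {writer}, read with {reader}")
```

Keeping cross-major pairs as an explicit argument means a pair such as 3.2.1 to 4.0.0 is only tested once maintainers have decided it is expected to work.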

Motivation

LightGBM uses semantic versioning for releases. As a result, users expect that there will not be breaking changes within a major release series. For example, they expect that a Dataset saved to a binary file using LightGBM 3.1.0 will be readable in any other LightGBM 3.x.x release.

Explicit tests of that expectation would provide greater confidence that releases do not introduce such changes.

Description

LightGBM performs several preprocessing steps on training data before beginning the boosting process. Those steps are performed in the construction of a Dataset object, a LightGBM-specific format for training data.

To support use cases like hyperparameter tuning, where users want to train many models using the same Dataset, LightGBM can save a constructed Dataset to a binary file.

int LGBM_DatasetSaveBinary(DatasetHandle handle,
                           const char* filename);

That Dataset can then be loaded back and used for training, without repeating those preprocessing steps.

int LGBM_DatasetCreateFromFile(const char* filename, const char* parameters,
                               const DatasetHandle reference, DatasetHandle* out);
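
The save-then-reload flow described above could be exercised from the Python package roughly as follows. This is a minimal sketch, not a proposed implementation: the function names, the synthetic data, and the two-phase layout are all assumptions made for illustration. The idea is that a CI job would run the "write" phase with an older LightGBM installed and the "read" phase with a newer one; the imports are deferred so each phase picks up whichever version is installed at that point.

```python
N_ROWS = 500  # size of the synthetic dataset (arbitrary choice)


def write_dataset(path: str) -> None:
    """Phase 1: build a Dataset and save it as a binary file."""
    import lightgbm as lgb
    import numpy as np

    rng = np.random.default_rng(42)
    features = rng.normal(size=(N_ROWS, 5))
    target = features[:, 0] + rng.normal(scale=0.1, size=N_ROWS)
    # save_binary() constructs the Dataset (runs preprocessing) before writing.
    lgb.Dataset(features, label=target).save_binary(path)


def read_and_train(path: str) -> None:
    """Phase 2: load the binary file and check it is usable for training."""
    import lightgbm as lgb

    dataset = lgb.Dataset(path)  # a file path makes LightGBM load from disk
    booster = lgb.train(
        {"objective": "regression", "verbose": -1},
        dataset,
        num_boost_round=5,
    )
    assert dataset.num_data() == N_ROWS
    assert booster.num_trees() > 0


PHASES = {"write": write_dataset, "read": read_and_train}


def run_phase(mode: str, path: str) -> None:
    """Dispatch to one phase; raise ValueError for an unknown mode."""
    if mode not in PHASES:
        raise ValueError(f"mode must be one of {sorted(PHASES)}, got {mode!r}")
    PHASES[mode](path)
```

For example, a job might call `run_phase("write", "train.bin")` in an environment with LightGBM 3.1.0 and then `run_phase("read", "train.bin")` after upgrading to a later 3.x.x release; the file name here is hypothetical.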

References

Created based on #4228 (comment).

@jameslamb
Collaborator Author

This issue has been added to #2302 with other feature requests. I'd like to leave it open for a few days in case others want to add comments, since I just locked discussion on #4228.

After a few days, this issue will be closed until someone leaves a comment saying they'd like to work on it.

@jameslamb
Collaborator Author

OK, now that this has been open for a few days, I am going to close it. If you're reading this and would like to work on it, please comment below and it can be re-opened!
