
[feature request] Support int64_t data_size_t #2818

Closed
3 tasks
guolinke opened this issue Feb 24, 2020 · 10 comments
Comments

@guolinke
Collaborator

Since the cnt in the histogram has been removed, the overhead of using int64_t as data_size_t will be smaller.

There are several todo items:

  • remove data_size_t from the code completely and use int64_t directly
  • template the index types in DataPartition, to achieve the same performance as before when the data size fits in int32_t
  • update the related c_api and the Python / R interfaces
@guolinke
Collaborator Author

I have a branch (https://github.com/microsoft/LightGBM/tree/sparse_bin_int64) working on this; I will revisit it when I have time.

@Adam1679

hey, can I help with this one?

@guolinke
Collaborator Author

@Adam1679 yeah, you're very welcome to.

@Adam1679

sorry, this is my first time making a public contribution. Any suggestions on how to begin? Should I fork your branch and get familiar with the code and your modifications?

@guolinke
Collaborator Author

yes, you can fork the code and work in your cloned repo first, then create a PR here when it is almost ready.
If there are any problems/questions, you can ask me directly in this issue.
BTW, it is better to start from master, not sparse_bin_int64, as there are many new changes in the master branch.

@Adam1679

ok, I will try my best.

@Adam1679

Adam1679 commented Mar 18, 2020

hey, I read commit "5a5b26d" on your branch and saw that you changed data_size_t from int32_t to int64_t directly. You also changed some "int" to "data_size_t" and some "data_size_t" to "int". I understand casting some "data_size_t" to "int" now that "data_size_t" is bigger, but why did you change some "int" to "data_size_t"? I don't understand that part.

Also, it would be highly appreciated if you could elaborate on "the overhead of int64_t as data_size_t will be smaller." Is it that "int32_t" is not big enough in some cases, but you had to compromise and use it because the overhead of using int64 in the cnt-related part was unbearable?

@guolinke
Collaborator Author

@Adam1679 some of the ints should have been data_size_t, so I fixed them.

For the overheads:

  1. the cnt in the histogram: this has been removed, so you don't need to care about it any more.
  2. the data_size_t in src/tree_learner/DataPartition.hpp. DataPartition stores the used row indices for each leaf. These indices need to be accessed during ConstructHistogram and Split, so changing them to int64_t will increase the memory cost. I think we should use a template here: for small data, LightGBM would automatically use int32_t (or even int16_t).

@Adam1679

Sorry that I was away for a long time. I'm trying to figure out a solution for this kind of template, but after a lot of Stack Overflow searching I found there is no such thing as a runtime template (templates are compile-time). So I think what we need is polymorphism: for example, construct a base class and subclass it with different index sizes. But this may cause other problems, e.g. the base class cannot hold the index storage as a field, because its type is undetermined. Am I on the right track?

@StrikerRUS
Collaborator

Closed in favor of #2302. We decided to keep all feature requests in one place.

You are welcome to contribute this feature! Please re-open this issue (or post a comment if you are not the topic starter) if you are actively working on implementing it.
