Understanding how Sequence Works #6656

Open
jamespinkerton opened this issue Sep 28, 2024 · 2 comments

@jamespinkerton

jamespinkerton commented Sep 28, 2024

Hi. I looked at #4672, where the advice for a dataset that is too large is to use Sequence. However, I can't find good documentation on Sequence, and I'm having trouble understanding how it works.

My use case: I have multiple files of floating-point numbers on Google Cloud Storage. Each file has all of the features, but a different range of the samples. Because the values are 4-byte floats, the raw dataset does not fit in memory on my machine. However, it does fit once it has been turned into a Dataset.

I was hoping I could write a custom Sequence class that downloads these files on demand, but when I do this I get lots of random-access requests, and I can't download the data that many times.

I was hoping for some advice on how the Sequence API works.
Do I need to provide a list of sequences?
Does the batch size refer to the number of samples returned at each index, or does it refer to the requested total number of samples at a time?
Is there a way to download the data in one stream, or does the dataset have to see the data multiple times to be constructed?
Does Sequence not actually solve my problem?

Thanks so much.

@jameslamb
Collaborator

Thanks for using LightGBM.

A minimal, reproducible example of what you tried would be helpful (here are some docs on how to do that). For example, you didn't tell us what version of lightgbm you're using, what operating system, etc. Please do that in future reports.

The lightgbm.Sequence class is an abstract class, which shows the API that you need to implement. Here are some resources you might find helpful:

I think Sequence is a good way to accomplish what you're trying to accomplish.
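
For reference, a minimal sketch of what implementing that interface can look like. It assumes the documented contract (integer, slice, and list indexing, plus `__len__` and a `batch_size` attribute). `ArraySequence` is a hypothetical name, the in-memory array stands in for real storage, and the fallback base class exists only so the snippet runs without LightGBM installed.

```python
import numbers

import numpy as np

try:
    from lightgbm import Sequence  # real abstract base class
except ImportError:  # assumption: fall back so the sketch runs standalone
    Sequence = object


class ArraySequence(Sequence):
    """Hypothetical minimal Sequence backed by an in-memory 2-D array."""

    def __init__(self, data, batch_size=4096):
        self.data = data
        # Number of rows read per batch for this one Sequence.
        self.batch_size = batch_size

    def __getitem__(self, idx):
        if isinstance(idx, numbers.Integral):
            return self.data[idx]  # a single row, shape (n_features,)
        if isinstance(idx, slice):
            return self.data[idx]  # a contiguous block of rows
        if isinstance(idx, list):
            return self.data[idx]  # arbitrary rows, in the requested order
        raise TypeError(f"unsupported index type: {type(idx).__name__}")

    def __len__(self):
        return len(self.data)  # total number of rows in this Sequence
```

An object like this can be passed as `data` to `lightgbm.Dataset`. Going by the documentation, a list of such objects is accepted as well (for example one per file), and `batch_size` describes how many rows are read at a time from each Sequence rather than a total across all of them.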

But it's been a while since I personally worked with this API, so I can't provide a reproducible example right now. When I find time, if no one else has answered your questions by then, I'll try to create one with the publicly-available data on S3 from https://github.com/ContinuumIO/anaconda-package-data to demonstrate how to do what you're trying to do.

@jamespinkerton
Author

I think you found some documentation I couldn't find. This is very helpful, thank you. Given that the data has to be randomly sampled, it looks like Sequence presents some challenges. My data comes in chunks of 1 million samples each, and I have about 200 chunks. Is there an easy way to do the random-sampling part given this constraint?
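
One possible direction, sketched under assumptions rather than tested against real storage: map each requested row to its chunk, fetch a chunk only when it is not already cached, and group list requests by chunk so each chunk is loaded at most once per request. `ChunkedSequence`, `load_chunk`, `n_features`, and `max_cached` are all hypothetical names; `load_chunk(i)` is assumed to download chunk `i` and return it as a 2-D array. Only the `Sequence` base class comes from LightGBM.

```python
import numbers
from collections import OrderedDict

import numpy as np

try:
    from lightgbm import Sequence  # real abstract base class
except ImportError:  # assumption: fall back so the sketch runs standalone
    Sequence = object


class ChunkedSequence(Sequence):
    """Hypothetical Sequence over fixed-size chunks fetched on demand."""

    def __init__(self, load_chunk, n_rows, n_features, chunk_size,
                 max_cached=2, batch_size=4096):
        self.load_chunk = load_chunk  # assumed: chunk index -> 2-D ndarray
        self.n_rows = n_rows
        self.n_features = n_features
        self.chunk_size = chunk_size
        self.max_cached = max_cached
        self.batch_size = batch_size
        self._cache = OrderedDict()  # chunk index -> array, least recent first

    def _chunk(self, c):
        if c in self._cache:
            self._cache.move_to_end(c)  # mark as most recently used
        else:
            self._cache[c] = self.load_chunk(c)
            if len(self._cache) > self.max_cached:
                self._cache.popitem(last=False)  # evict least recently used
        return self._cache[c]

    def _rows(self, indices):
        indices = np.asarray(indices, dtype=np.int64)
        chunk_ids = indices // self.chunk_size
        # float32 matches the 4-byte floats described in the question.
        out = np.empty((len(indices), self.n_features), dtype=np.float32)
        for c in np.unique(chunk_ids):  # each needed chunk is visited once
            mask = chunk_ids == c
            local = indices[mask] - c * self.chunk_size
            out[mask] = self._chunk(int(c))[local]
        return out

    def __getitem__(self, idx):
        if isinstance(idx, numbers.Integral):
            if idx < 0:
                idx += self.n_rows  # support negative row indices
            return self._rows([idx])[0]
        if isinstance(idx, slice):
            return self._rows(range(*idx.indices(self.n_rows)))
        if isinstance(idx, list):
            return self._rows(idx)
        raise TypeError(f"unsupported index type: {type(idx).__name__}")

    def __len__(self):
        return self.n_rows
```

If rows are requested in roughly ascending order, each chunk is downloaded once and then reused from the cache. If single rows are requested in a truly random order across all 200 chunks, a small cache will still thrash, so this only helps to the extent that the access pattern is ordered or batched.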
