How to Accelerate Data Generation? #1

Open
LilinNK opened this issue Dec 19, 2024 · 1 comment

@LilinNK

LilinNK commented Dec 19, 2024

Hello,

I have a question about generating a large amount of data when only one API is available. In practice the generation rate is very slow, and raising batch_size and code_batch_size to large values does not help much. I also found that once one process is running I cannot start a second one, even after changing the session_output path. Running multiple threads did not speed things up significantly either. I would appreciate a response as soon as possible. Thank you.

@YueYANG1996
Collaborator

Hi, this depends on the rate limit of your API. Although this code base supports running multiple processes, we found that the main bottleneck is still the rate limit when calling proprietary models. In my experience, we can generate up to 10k-15k samples per day using the Anthropic API.
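To make the rate-limit point concrete, here is a minimal Python sketch of why adding threads or processes stops helping once the API limit is reached. It is not code from this repository: `RateLimiter`, `call_with_retry`, `generate_all`, and `samples_per_day` are hypothetical names, `fn` stands in for whatever function calls your model, and the requests-per-minute figure is an assumption you would replace with your own account's limit.

```python
import random
import threading
import time
from concurrent.futures import ThreadPoolExecutor


class RateLimiter:
    """Hypothetical helper: spaces out calls so at most `per_minute` start per minute."""

    def __init__(self, per_minute, clock=time.monotonic, sleep=time.sleep):
        self.interval = 60.0 / per_minute
        self.clock = clock
        self.sleep = sleep
        self.lock = threading.Lock()
        self.next_slot = clock()

    def wait(self):
        # Reserve the next free time slot under the lock, then sleep outside it.
        with self.lock:
            now = self.clock()
            slot = max(now, self.next_slot)
            self.next_slot = slot + self.interval
        delay = slot - now
        if delay > 0:
            self.sleep(delay)


def call_with_retry(fn, arg, limiter, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(arg), retrying with exponential backoff plus a little jitter."""
    for attempt in range(max_retries + 1):
        limiter.wait()
        try:
            return fn(arg)
        except Exception:
            # Assumption: real code would catch only the client library's
            # rate-limit or transient error type, not every exception.
            if attempt == max_retries:
                raise
            sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))


def generate_all(fn, items, per_minute, workers=8):
    """Run fn over items with several threads that share one rate limiter."""
    limiter = RateLimiter(per_minute)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda x: call_with_retry(fn, x, limiter), items))


def samples_per_day(per_minute, calls_per_sample=1):
    """Upper bound on daily samples if every allowed request succeeds."""
    return int(per_minute * 60 * 24 / calls_per_sample)
```

Because every worker waits on the same limiter, raising `workers` beyond what the limit allows only lengthens the queue. As a purely illustrative figure, `samples_per_day(10)` gives 14,400, so a sustained rate of roughly 7 to 10 samples per minute would land in the 10k-15k per day range mentioned above.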
