Spanish performance and comparison of performance between models. #88

Closed
josemlopez opened this issue Mar 20, 2023 · 3 comments

@josemlopez

Hi there,

I recently noticed that the model's performance in Spanish is subpar. To improve it, I want to add more Spanish language examples to the model. I was wondering if anyone else has a similar idea and what tools they are using to accomplish this.

Currently, I have only trained the model on data prepared with some basic cleaning techniques. However, I want to incorporate an "automatic" cleaning method using this PR: #62 and compare the performance. It would be interesting to see how the quality of the data impacts the model's improvement.

I am also wondering if there are any benchmarks that should be run to measure the performance of the model. Any suggestions or insights would be greatly appreciated!

@DanielWe2

> add more Spanish language examples to the model. I was wondering if anyone else has a similar idea and what tools they are using to accomplish this.

Someone else here in the comments did that with Korean and used OpenAI GPT APIs to translate some of the dataset.
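A minimal sketch of that translation approach, assuming the training data is a JSON list of {"instruction", "input", "output"} records (an assumption about the format) and using the openai Python package's ChatCompletion API as it existed at the time; the file names and prompt are illustrative:

```python
# Hypothetical sketch: translate instruction-tuning data to Spanish with the
# OpenAI chat API (openai<1.0 interface). Field and file names are assumptions.
import json
import openai

openai.api_key = "sk-..."  # your API key

def translate(text: str) -> str:
    if not text.strip():
        return text
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": "Translate the user's text to Spanish. Keep code and proper nouns unchanged."},
            {"role": "user", "content": text},
        ],
        temperature=0,
    )
    return response["choices"][0]["message"]["content"]

with open("alpaca_data.json") as f:
    examples = json.load(f)

translated = [
    {k: translate(v) if k in ("instruction", "input", "output") else v
     for k, v in ex.items()}
    for ex in examples
]

with open("alpaca_data_es.json", "w") as f:
    json.dump(translated, f, ensure_ascii=False, indent=2)
```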

> I am also wondering if there are any benchmarks that should be run to measure the performance of the model. Any suggestions or insights would be greatly appreciated!

Take a look at https://github.com/EleutherAI/lm-evaluation-harness

Through that I learned how tests like WinoGrande are actually provided to the model.

What I tried was to build a prompt with instructions around the test data, the way a human using a chatbot would do it. I think that would be more relevant, but it depends heavily on the fine-tuning and on finding the best prompt.

What is normally done is: for a multiple-choice A/B test (like WinoGrande), provide both options as full sentences to the model and let it calculate the probability of each variant; the variant with the higher probability is the one chosen by the model. That should show the theoretical performance of the actual model. This is more objective and totally independent of the prompt, but also not really relevant in terms of what a normal chatbot user would see.
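As a rough illustration of that scoring scheme (a minimal sketch with Hugging Face transformers, not the exact lm-evaluation-harness implementation; the model name is just a small placeholder), score each completed sentence by its summed token log-probability and pick the higher one:

```python
# Sketch: multiple-choice scoring by sentence log-likelihood with a causal LM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small placeholder; substitute the model under evaluation
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def sentence_logprob(sentence: str) -> float:
    """Total log-probability the model assigns to the tokens of `sentence`."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    # out.loss is the mean negative log-likelihood over the predicted tokens,
    # so multiply back by the number of predicted positions to get the sum.
    return -out.loss.item() * (ids.shape[1] - 1)

# A WinoGrande-style item: the same sentence completed with each candidate answer.
option_a = "The trophy doesn't fit in the suitcase because the trophy is too big."
option_b = "The trophy doesn't fit in the suitcase because the suitcase is too big."

choice = "A" if sentence_logprob(option_a) > sentence_logprob(option_b) else "B"
print("Model picks option", choice)
```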

Anyway, I think it would be a good idea to test the LoRA models: the base version compared to base+LoRA. We would see if the fine-tuning somehow degrades the general model performance.
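One way to set that up (a sketch assuming the adapter was trained with the peft library; the model path, adapter path, and run_benchmark are placeholders) is to load the base model once and attach the LoRA weights with PeftModel, running the same benchmark on both:

```python
# Sketch: evaluate base vs. base+LoRA, assuming a peft-trained adapter.
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_name = "decapoda-research/llama-7b-hf"  # placeholder: the base model
adapter_path = "./lora-adapter-es"           # placeholder: the trained LoRA adapter

tokenizer = AutoTokenizer.from_pretrained(base_name)
base_model = AutoModelForCausalLM.from_pretrained(base_name)

# 1) Run the chosen benchmark on the plain base model first...
# results_base = run_benchmark(base_model, tokenizer)   # run_benchmark is hypothetical

# 2) ...then attach the LoRA weights and run the exact same benchmark again.
#    (PeftModel wraps the base model in place, so evaluate the base first.)
lora_model = PeftModel.from_pretrained(base_model, adapter_path)
# results_lora = run_benchmark(lora_model, tokenizer)
```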

I am not aware of any language-specific tests. OpenAI released a chart for GPT-4 with model performance for each language; I am not sure how they measured that.

@josemlopez
Author

Thanks, Daniel!
This is very interesting. I'll follow your leads and share some of my insights here.

@josemlopez
Author

Closing this issue, because I realised that the best place for this is "Discussions".
Here is the thread I just opened there: #108.

Thanks!
