Unleashing the Power of LLMs for Geoscience Data
EAGE Hackathon - Natural Language Processing (NLP)
Corrales M.¹, Luiken N.¹, Alfarhan M.¹
¹ King Abdullah University of Science and Technology (KAUST)
The key concept behind this project is the use of Llama loaders, which bring structured and unstructured data into a common document format. The loaded documents are then used to create Llama indexes, which enable in-context learning without the need for fine-tuning: relevant passages are retrieved from the index and supplied to the model alongside the question. This approach makes information retrieval and analysis over geoscience data efficient and effective.
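The load, index, and retrieve flow described above can be sketched without any external library. This is a minimal illustration only, not the code in the package or notebooks: the bag-of-words overlap score stands in for the embedding-based retrieval a real index would use, and every name here (`build_index`, `retrieve`, `build_prompt`, the sample chunks) is hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(chunks):
    """Pair each text chunk with its bag of words (a stand-in for an embedding)."""
    return [(chunk, Counter(tokenize(chunk))) for chunk in chunks]


def retrieve(index, question, top_k=1):
    """Return the top_k chunks sharing the most tokens with the question."""
    query = Counter(tokenize(question))
    ranked = sorted(
        index,
        key=lambda item: sum((item[1] & query).values()),
        reverse=True,
    )
    return [chunk for chunk, _ in ranked[:top_k]]


def build_prompt(index, question, top_k=1):
    """Place the retrieved chunks in the prompt so the model answers in context."""
    context = "\n".join(retrieve(index, question, top_k))
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


# Hypothetical chunks, as they might come out of a document loader.
chunks = [
    "Porosity is the fraction of rock volume occupied by pore space.",
    "Permeability measures how easily fluid flows through a rock.",
]
index = build_index(chunks)
prompt = build_prompt(index, "What is porosity?")
```

The resulting `prompt` string is what would be sent to the language model; the model weights are never updated, which is why no fine-tuning is needed.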
This repository is organized as follows:
- 📂 package: Python library containing the routines for geodude.
- 📂 notebooks: set of Jupyter notebooks reproducing the experiments for the hackathon.
- 📂 notebooks/data_test: folder containing the geoscience data for the experiments.
- 📂 notebooks/data_petrobowl: folder containing the PetroBowl data for the Alpaca fine-tuning.
The following notebooks are provided:
- 📙 01_Camel_Loading_PDF_Example_1.ipynb: notebook performing in-context learning from a provided PDF. The model used is Camel 5B (Example 1).
- 📙 01_Camel_Loading_PDF_Example_2.ipynb: notebook performing in-context learning from a provided PDF. The model used is Camel 5B (Example 2).
- 📙 01_GPT_Loading_PDF_Example_1.ipynb: notebook performing in-context learning from a provided PDF. The model used is GPT from OpenAI (Example 1).
- 📙 01_GPT_Loading_PDF_Example_2.ipynb: notebook performing in-context learning from a provided PDF. The model used is GPT from OpenAI (Example 2).
- 📙 02_Camel_Loading_Multiple_PDF_Example.ipynb: notebook performing in-context learning from a set of PDFs. The model used is Camel 5B.
- 📙 02_GPT_Loading_Multiple_PDF_Example.ipynb: notebook performing in-context learning from a set of PDFs. The model used is GPT from OpenAI.
- 📙 03_Camel_Loading_Youtube_Example.ipynb: notebook performing in-context learning from a YouTube video. The model used is Camel 5B.
- 📙 03_GPT_Loading_Youtube_Example.ipynb: notebook performing in-context learning from a YouTube video. The model used is GPT from OpenAI.
- 📙 04_Gradio_Interface.ipynb: notebook showing an example of API generation using Gradio.
- 📙 05_Custom_Dataset_for_Alpaca.ipynb: notebook creating the dataset needed for the Alpaca LoRA fine-tuning.
- 📙 06_Alpaca-fine-tuning.ipynb: notebook performing the Alpaca LoRA fine-tuning on the PetroBowl dataset.
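As an illustration of the dataset-creation step, the sketch below converts question/answer pairs into the `instruction` / `input` / `output` record layout used by the Stanford Alpaca dataset. It is an assumption that the notebook follows this exact layout; the function name `to_alpaca_records` and the sample pair are hypothetical and are not taken from the PetroBowl data.

```python
import json


def to_alpaca_records(qa_pairs):
    """Convert (question, answer) pairs into Alpaca-style instruction records."""
    records = []
    for question, answer in qa_pairs:
        records.append(
            {
                "instruction": question.strip(),
                "input": "",  # no extra context is attached to the question
                "output": answer.strip(),
            }
        )
    return records


# Hypothetical sample pair, for illustration only.
qa_pairs = [
    (
        "What does porosity measure?",
        "The fraction of rock volume that is pore space.",
    )
]
records = to_alpaca_records(qa_pairs)

# Serialize to the JSON text that a fine-tuning script would read from disk.
alpaca_json = json.dumps(records, indent=2)
```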
To ensure reproducibility of the results, we suggest using the environment.yml file when creating an environment.
Simply run:
./install_env.sh
This will take some time. If you see the word Done! on your terminal at the end, you are ready to go. After that, you can install the package:
pip install .
or in developer mode:
pip install -e .
Remember to always activate the environment by typing:
conda activate geodude
Note
All experiments were carried out on an Intel(R) Xeon(R) Silver 4316 CPU @ 2.30GHz equipped with a single NVIDIA Tesla A100 GPU. Different environment configurations may be required for different combinations of workstation and GPU.
Warning
To simplify the process, we use a model from OpenAI, which requires an API key. To replicate these notebooks, you will need to generate your own API key and supply it when running the code.
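One common way to supply the key without writing it into a notebook is to read it from the `OPENAI_API_KEY` environment variable, which the OpenAI Python library also looks up by default. The helper below is a minimal sketch; `get_openai_api_key` is a hypothetical name and is not part of the geodude package.

```python
import os


def get_openai_api_key(env=os.environ):
    """Return the API key from the environment, failing early if it is missing."""
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "Set the OPENAI_API_KEY environment variable before running "
            "the GPT notebooks."
        )
    return key
```

Calling `get_openai_api_key()` at the top of a notebook surfaces a missing key immediately, rather than at the first request to the model.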