Retrieval Augmented Generation (RAG) application in energy
The first part is a RAG-LLM pipeline; after that I will experiment with NLP and knowledge graphs. I just want to do something fun with GenAI.
Here is the use case: a user has a few reports in their database and wants to quickly run a query to pull information out of those reports.
I use 3 reports from the public Volve dataset: the discovery report, completion report, and drilling report of well 15/9-F-15 A.
Since OpenAI is costly, I use the open-source ColPali model (by Illuin Technology) and the Qwen2 model (by Alibaba Group) to build a multimodal RAG that can understand reports with charts and tables.
This is a free alternative to OpenAI. The only limitation of ColPali-Qwen2 is that it requires a powerful GPU, at least an Ampere-generation card such as the A100, so you may need Colab Pro to run it.
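A minimal sketch of how the pipeline fits together, assuming the colpali-engine, transformers, and pdf2image libraries; the PDF path, the query, and the checkpoints (`vidore/colpali-v1.2`, `Qwen/Qwen2-VL-7B-Instruct`) are placeholder choices, not fixed parts of this repo. ColPali treats each page as an image (no OCR step), which is what makes charts and tables usable, and the best-scoring page is handed to Qwen2-VL to generate the answer.

```python
import torch
from pdf2image import convert_from_path
from colpali_engine.models import ColPali, ColPaliProcessor
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor

# --- Retrieval: embed report pages as images with ColPali ---
# "report.pdf" and the query are placeholders for the Volve reports.
pages = convert_from_path("report.pdf", dpi=150)   # PIL images, one per page
query = "What is the reservoir pressure in well 15/9-F-15 A?"

retriever = ColPali.from_pretrained(
    "vidore/colpali-v1.2", torch_dtype=torch.bfloat16, device_map="cuda:0"
).eval()
retriever_processor = ColPaliProcessor.from_pretrained("vidore/colpali-v1.2")

# In practice you would embed pages in smaller batches to avoid running out of memory
batch_images = retriever_processor.process_images(pages).to(retriever.device)
batch_query = retriever_processor.process_queries([query]).to(retriever.device)
with torch.no_grad():
    page_embeddings = retriever(**batch_images)
    query_embeddings = retriever(**batch_query)

# Late-interaction (MaxSim) scores: shape (n_queries, n_pages)
scores = retriever_processor.score_multi_vector(query_embeddings, page_embeddings)
best_page = pages[scores[0].argmax().item()]

# --- Generation: answer from the retrieved page image with Qwen2-VL ---
vlm = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", torch_dtype=torch.bfloat16, device_map="auto"
)
vlm_processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": query},
]}]
prompt = vlm_processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = vlm_processor(text=[prompt], images=[best_page], return_tensors="pt").to(vlm.device)
output_ids = vlm.generate(**inputs, max_new_tokens=256)
answer = vlm_processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```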
- ColPali: use the pretrained ColPali model to retrieve the most relevant pages from a collection of reports based on the user query
- Generate Norwegian dataset for fine-tuning: build a synthetic dataset from a Norwegian oil and gas handbook and push it to the Hugging Face Hub (sketch below)
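A rough sketch of how such a dataset could be generated, along the same lines as the guide linked in the resources: render the handbook pages, ask a vision-language model to write one Norwegian question per page, and push the (image, query) pairs to the Hub. The PDF name, the prompt, and the repo id are placeholders, and Qwen2-VL is just one possible query generator.

```python
import os
import torch
from datasets import Dataset, Image
from pdf2image import convert_from_path
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor

vlm = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", torch_dtype=torch.bfloat16, device_map="auto"
)
vlm_processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

os.makedirs("pages", exist_ok=True)
pages = convert_from_path("norsk_handbok.pdf", dpi=150)   # placeholder file name

records = []
for i, page in enumerate(pages):
    path = f"pages/page_{i:04d}.png"
    page.save(path)
    # Ask the VLM for one short Norwegian question that this page can answer
    messages = [{"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Skriv ett kort spørsmål på norsk som denne siden kan besvare."},
    ]}]
    prompt = vlm_processor.apply_chat_template(messages, add_generation_prompt=True)
    inputs = vlm_processor(text=[prompt], images=[page], return_tensors="pt").to(vlm.device)
    out = vlm.generate(**inputs, max_new_tokens=64)
    query = vlm_processor.batch_decode(
        out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )[0].strip()
    records.append({"image": path, "query": query})

# Build an (image, query) dataset and push it to the Hub (requires `huggingface-cli login`)
ds = Dataset.from_list(records).cast_column("image", Image())
ds.push_to_hub("your-username/norwegian-oil-gas-colpali")  # placeholder repo id
```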
There are several challenges that I want to address:
- Most legacy reports are scanned, so each page is essentially a picture. How can we apply OCR effectively?
- Reports contain charts, tables, and even equations. How can we use multimodal LLMs to understand this information?
- Company reports are highly confidential. How can we use open-source alternatives to OpenAI models to keep the RAG pipeline private? How do costs compare between models?
- How can we build a knowledge graph from the reports to help summarize them and find connections between them?
- Embedding vectors are large and need storage, and there are a lot of reports. How can we use different vector databases and scale up to hundreds or thousands of reports? (See the storage sketch after this list.)
- How can we fine-tune the RAG system with domain knowledge, for example teach it Petroleum Engineering concepts, to improve its answers?
- How can we make the RAG multilingual so that it understands reports written in, for example, Norwegian?
- LLMs need huge computational power, such as a top-end GPU. Can we use a smaller model, i.e. a Small Language Model (SLM)?
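To make the storage question concrete, here is an illustrative back-of-the-envelope calculation and the late-interaction (MaxSim) score that a vector database would have to compute for ColPali-style retrieval. The roughly 1,030 vectors per page figure comes from the ColPali paper, and the page counts below are made-up numbers. (Vespa, linked in the resources below, is one database that handles this multi-vector format.)

```python
import torch

# Each page is stored as many patch vectors rather than a single embedding,
# which is why storage grows quickly with the number of reports.

def maxsim_score(query_emb: torch.Tensor, page_emb: torch.Tensor) -> float:
    """query_emb: (n_query_tokens, 128); page_emb: (n_patches, 128)."""
    sim = query_emb @ page_emb.T                # similarity of every query token to every patch
    return sim.max(dim=1).values.sum().item()   # best-matching patch per query token, summed

n_pages, vectors_per_page, dim, bytes_per_value = 1_000, 1_030, 128, 2  # float16
raw_size_gb = n_pages * vectors_per_page * dim * bytes_per_value / 1e9
print(f"~{raw_size_gb:.2f} GB of raw embeddings for {n_pages} pages")   # roughly 0.26 GB
```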
Open to collaboration if anyone has ideas on this :)
- Finetuning ColPali with LoRA - https://github.com/tonywu71/colpali-cookbooks/blob/main/examples/finetune_colpali.ipynb
- Generate data for finetuning - https://danielvanstrien.xyz/posts/post-with-code/colpali/2024-09-23-generate_colpali_dataset.html
- Vespa on Docker - https://pyvespa.readthedocs.io/en/latest/getting-started-pyvespa.html#Deploy-the-Vespa-application
- ColPali on Vespa - https://pyvespa.readthedocs.io/en/latest/examples/visual_pdf_rag_with_vespa_colpali_cloud.html