
How can I use it with PDF instead of txt? #41

Closed
F4k3r22 opened this issue Oct 18, 2024 · 8 comments

Comments


F4k3r22 commented Oct 18, 2024

I'm trying to use this for PDF files but I don't see any PDF examples.

LarFii (Collaborator) commented Oct 18, 2024

Thanks for your attention. Currently, we don't offer support for PDF files. However, you can extract the content using OCR and convert it into a TXT file. We will consider adding OCR support in future updates.
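
For instance, a minimal sketch of that conversion (assuming `pip install pypdf` and a PDF with a selectable text layer; a scanned PDF would need actual OCR, e.g. pytesseract, instead):

```python
# Hypothetical helper: extract the text layer of a PDF into a TXT file
# that LightRAG can ingest. Only works for digital PDFs with a text
# layer; scanned documents need OCR instead.
from pypdf import PdfReader

def pdf_to_txt(pdf_path: str, txt_path: str) -> None:
    reader = PdfReader(pdf_path)
    # Join the extracted text of every page, separating pages with blank lines.
    text = "\n\n".join(page.extract_text() or "" for page in reader.pages)
    with open(txt_path, "w", encoding="utf-8") as f:
        f.write(text)

pdf_to_txt("Ley-de-Bienestar-Animal.pdf", "Ley-de-Bienestar-Animal.txt")
```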

F4k3r22 (Author) commented Oct 18, 2024

Yeah, that's what I did XD. But can it be used with Llama 3.2? I'm trying it and it just stays in a loop. Terminal output:

```
INFO:lightrag:Logger initialized for working directory: ./
DEBUG:lightrag:LightRAG init with param:
working_dir = ./,
chunk_token_size = 1200,
chunk_overlap_token_size = 100,
tiktoken_model_name = gpt-4o-mini,
entity_extract_max_gleaning = 1,
entity_summary_to_max_tokens = 500,
node_embedding_algorithm = node2vec,
node2vec_params = {'dimensions': 1536, 'num_walks': 10, 'walk_length': 40, 'window_size': 2, 'iterations': 3, 'random_seed': 3},
embedding_func = {'embedding_dim': 384, 'max_token_size': 5000, 'func': <function at 0x7936c8cfa9e0>},
embedding_batch_num = 32,
embedding_func_max_async = 16,
llm_model_func = <function hf_model_complete at 0x7936c95ac4c0>,
llm_model_name = meta-llama/Llama-3.2-1B-Instruct,
llm_model_max_token_size = 32768,
llm_model_max_async = 16,
key_string_value_json_storage_cls = <class 'lightrag.storage.JsonKVStorage'>,
vector_db_storage_cls = <class 'lightrag.storage.NanoVectorDBStorage'>,
vector_db_storage_cls_kwargs = {},
graph_storage_cls = <class 'lightrag.storage.NetworkXStorage'>,
enable_llm_cache = True,
addon_params = {},
convert_response_to_json_func = <function convert_response_to_json at 0x7936c95960e0>

INFO:lightrag:Load KV full_docs with 0 data
INFO:lightrag:Load KV text_chunks with 0 data
INFO:lightrag:Load KV llm_response_cache with 0 data
INFO:lightrag:[New Docs] inserting 1 docs
INFO:lightrag:[New Chunks] inserting 24 chunks
INFO:lightrag:Inserting 24 vectors to chunks
/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: clean_up_tokenization_spaces was not set. It will be set to True by default. This behavior will be depracted in transformers v4.45, and will be then set to False by default. For more details check this issue: huggingface/transformers#31884
warnings.warn(
INFO:lightrag:[Entity Extraction]...
/usr/local/lib/python3.10/dist-packages/transformers/generation/configuration_utils.py:615: UserWarning: num_beams is set to 1. However, early_stopping is set to True -- this flag is only used in beam-based generation modes. You should set num_beams>1 or unset early_stopping.
warnings.warn(
Setting pad_token_id to eos_token_id:128001 for open-end generation.
Setting pad_token_id to eos_token_id:128001 for open-end generation.
WARNING:accelerate.big_modeling:Some parameters are on the meta device because they were offloaded to the cpu.
Setting pad_token_id to eos_token_id:128001 for open-end generation.
/usr/local/lib/python3.10/dist-packages/transformers/generation/configuration_utils.py:615: UserWarning: num_beams is set to 1. However, early_stopping is set to True -- this flag is only used in beam-based generation modes. You should set num_beams>1 or unset early_stopping.
warnings.warn(
Setting pad_token_id to eos_token_id:128001 for open-end generation.
WARNING:accelerate.big_modeling:Some parameters are on the meta device because they were offloaded to the cpu.
Setting pad_token_id to eos_token_id:128001 for open-end generation.
Setting pad_token_id to eos_token_id:128001 for open-end generation.
```

I'll leave the code here:

```python
from transformers import AutoModel, AutoTokenizer
from lightrag import LightRAG, QueryParam
from lightrag.llm import hf_model_complete, hf_embedding  # these imports were missing from the original snippet
from lightrag.utils import EmbeddingFunc
import nest_asyncio  # allow nested event loops (needed in Jupyter)

WORKING_DIR = "./"

# Apply nest_asyncio to allow nested event loops
nest_asyncio.apply()  

# Initialize LightRAG with Hugging Face model
rag = LightRAG(
    working_dir=WORKING_DIR,
    llm_model_func=hf_model_complete,  # Use Hugging Face model for text generation
    llm_model_name='meta-llama/Llama-3.2-1B-Instruct',  # Model name from Hugging Face
    # Use Hugging Face embedding function
    embedding_func=EmbeddingFunc(
        embedding_dim=384,
        max_token_size=5000,
        func=lambda texts: hf_embedding(
            texts, 
            tokenizer=AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2"),
            embed_model=AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
        )
    ),
)

with open("./Ley-de-Bienestar-Animal.txt") as f:
    rag.insert(f.read())

# Perform naive search
print(rag.query("Haz un resumen de las leyes", param=QueryParam(mode="naive")))

# Perform local search
#print(rag.query("Haz un resumen de las leyes", param=QueryParam(mode="local")))

# Perform global search
#print(rag.query("Haz un resumen de las leyes", param=QueryParam(mode="global")))

# Perform hybrid search
#print(rag.query("Haz un resumen de las leyes", param=QueryParam(mode="hybrid")))```

F4k3r22 (Author) commented Oct 18, 2024

And if I comment out this line: `# nest_asyncio.apply()`

I get this error:

```
INFO:lightrag:Logger initialized for working directory: ./
DEBUG:lightrag:LightRAG init with param:
working_dir = ./,
chunk_token_size = 1200,
chunk_overlap_token_size = 100,
tiktoken_model_name = gpt-4o-mini,
entity_extract_max_gleaning = 1,
entity_summary_to_max_tokens = 500,
node_embedding_algorithm = node2vec,
node2vec_params = {'dimensions': 1536, 'num_walks': 10, 'walk_length': 40, 'window_size': 2, 'iterations': 3, 'random_seed': 3},
embedding_func = {'embedding_dim': 384, 'max_token_size': 5000, 'func': <function at 0x7f82cc1df7f0>},
embedding_batch_num = 32,
embedding_func_max_async = 16,
llm_model_func = <function hf_model_complete at 0x7f81fb9b91b0>,
llm_model_name = meta-llama/Llama-3.2-1B-Instruct,
llm_model_max_token_size = 32768,
llm_model_max_async = 16,
key_string_value_json_storage_cls = <class 'lightrag.storage.JsonKVStorage'>,
vector_db_storage_cls = <class 'lightrag.storage.NanoVectorDBStorage'>,
vector_db_storage_cls_kwargs = {},
graph_storage_cls = <class 'lightrag.storage.NetworkXStorage'>,
enable_llm_cache = True,
addon_params = {},
convert_response_to_json_func = <function convert_response_to_json at 0x7f81fb99add0>

INFO:lightrag:Load KV full_docs with 0 data
INFO:lightrag:Load KV text_chunks with 0 data
INFO:lightrag:Load KV llm_response_cache with 0 data

RuntimeError                              Traceback (most recent call last)
in <cell line: 31>()
     30
     31 with open("./Ley-de-Bienestar-Animal.txt") as f:
---> 32     rag.insert(f.read())
     33
     34 # Perform naive search

2 frames
/usr/lib/python3.10/asyncio/base_events.py in _check_running(self)
    582     def _check_running(self):
    583         if self.is_running():
--> 584             raise RuntimeError('This event loop is already running')
    585         if events._get_running_loop() is not None:
    586             raise RuntimeError(

RuntimeError: This event loop is already running
```

LimFang commented Oct 18, 2024

> I'm trying to use this for PDF files but I don't see any PDF examples.

Yeah, so am I. I'm working on this while also making sure the LLM can understand the math and code logic inherent in the PDFs in the database.

LarFii (Collaborator) commented Oct 19, 2024

> And if I comment out this line: `# nest_asyncio.apply()`, I get `RuntimeError: This event loop is already running` (full log quoted above).

Could you provide more details about the specific runtime environment?

TianyuFan0504 (Contributor)

> And if I comment out this line: `# nest_asyncio.apply()`, I get `RuntimeError: This event loop is already running` (full log quoted above).

Are you running your code in a Jupyter notebook? If so, do not comment out `nest_asyncio.apply()`.

This function is for: 1. allowing asynchronous code to run in a Jupyter notebook, and 2. solving the problems that arise when starting a new event loop inside an already-running event loop.
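
To illustrate (a minimal sketch; `dummy_task` is a hypothetical stand-in for the internal async work that, I'm assuming, `rag.insert()` drives with `run_until_complete` on the current loop):

```python
import asyncio
import nest_asyncio

async def dummy_task():  # hypothetical stand-in for LightRAG's internal async work
    return "done"

# Inside Jupyter an event loop is already running, so calling
# run_until_complete() on it raises "This event loop is already running"
# -- unless the loop has been patched to allow re-entry:
nest_asyncio.apply()

loop = asyncio.get_event_loop()
print(loop.run_until_complete(dummy_task()))  # prints "done", even in a notebook
```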

F4k3r22 (Author) commented Oct 19, 2024

Yeah, it was in a Jupyter Notebook, but it always stays in the loop and never responds.

Soumil32 (Contributor)

Maybe you can use a library such as Marker to parse the document first, then pass the text output to LightRAG?
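
For example, a rough sketch of that pipeline (the `marker_single` CLI and its output layout are assumptions based on Marker's README and may differ across versions):

```python
# Rough sketch: convert a PDF to Markdown with Marker, then insert the
# result into LightRAG. Assumes `pip install marker-pdf` provides the
# `marker_single` CLI and that it writes .md files into the output
# folder -- check Marker's README for your version.
import subprocess
from pathlib import Path

pdf = "Ley-de-Bienestar-Animal.pdf"
out_dir = Path("marker_out")
subprocess.run(["marker_single", pdf, str(out_dir)], check=True)

# Gather whatever Markdown Marker produced and feed it to the `rag`
# instance configured earlier in this thread.
md_text = "\n\n".join(p.read_text(encoding="utf-8") for p in out_dir.rglob("*.md"))
rag.insert(md_text)
```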

LarFii closed this as completed Oct 21, 2024