
How can I use it with PDF instead of txt? #41

Closed
F4k3r22 opened this issue Oct 18, 2024 · 8 comments

Comments


F4k3r22 commented Oct 18, 2024

I'm trying to use this for PDF files but I don't see any PDF examples.

LarFii (Collaborator) commented Oct 18, 2024

Thanks for your attention. Currently, we don't offer support for PDF files. However, you can extract the content using OCR and convert it into a TXT file. We will consider adding OCR support in future updates.
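
For instance, a minimal sketch of that conversion (assuming `pip install pypdf` and a PDF with a selectable text layer; a scanned PDF would need actual OCR, e.g. pytesseract, instead):

```python
# Hypothetical helper: extract the text layer of a PDF into a TXT file
# that LightRAG can ingest. Only works for digital PDFs with a text
# layer; scanned documents need OCR instead.
from pypdf import PdfReader

def pdf_to_txt(pdf_path: str, txt_path: str) -> None:
    reader = PdfReader(pdf_path)
    # Join the extracted text of every page, separating pages with blank lines.
    text = "\n\n".join(page.extract_text() or "" for page in reader.pages)
    with open(txt_path, "w", encoding="utf-8") as f:
        f.write(text)

pdf_to_txt("Ley-de-Bienestar-Animal.pdf", "Ley-de-Bienestar-Animal.txt")
```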

F4k3r22 (Author) commented Oct 18, 2024

Yeah, that's what I did XD. But can it be used with Llama 3.2? I'm trying it and it just stays in a loop. Terminal output:

```
INFO:lightrag:Logger initialized for working directory: ./
DEBUG:lightrag:LightRAG init with param:
working_dir = ./,
chunk_token_size = 1200,
chunk_overlap_token_size = 100,
tiktoken_model_name = gpt-4o-mini,
entity_extract_max_gleaning = 1,
entity_summary_to_max_tokens = 500,
node_embedding_algorithm = node2vec,
node2vec_params = {'dimensions': 1536, 'num_walks': 10, 'walk_length': 40, 'window_size': 2, 'iterations': 3, 'random_seed': 3},
embedding_func = {'embedding_dim': 384, 'max_token_size': 5000, 'func': <function at 0x7936c8cfa9e0>},
embedding_batch_num = 32,
embedding_func_max_async = 16,
llm_model_func = <function hf_model_complete at 0x7936c95ac4c0>,
llm_model_name = meta-llama/Llama-3.2-1B-Instruct,
llm_model_max_token_size = 32768,
llm_model_max_async = 16,
key_string_value_json_storage_cls = <class 'lightrag.storage.JsonKVStorage'>,
vector_db_storage_cls = <class 'lightrag.storage.NanoVectorDBStorage'>,
vector_db_storage_cls_kwargs = {},
graph_storage_cls = <class 'lightrag.storage.NetworkXStorage'>,
enable_llm_cache = True,
addon_params = {},
convert_response_to_json_func = <function convert_response_to_json at 0x7936c95960e0>

INFO:lightrag:Load KV full_docs with 0 data
INFO:lightrag:Load KV text_chunks with 0 data
INFO:lightrag:Load KV llm_response_cache with 0 data
INFO:lightrag:[New Docs] inserting 1 docs
INFO:lightrag:[New Chunks] inserting 24 chunks
INFO:lightrag:Inserting 24 vectors to chunks
/usr/local/lib/python3.10/dist-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: clean_up_tokenization_spaces was not set. It will be set to True by default. This behavior will be depracted in transformers v4.45, and will be then set to False by default. For more details check this issue: huggingface/transformers#31884
warnings.warn(
INFO:lightrag:[Entity Extraction]...
/usr/local/lib/python3.10/dist-packages/transformers/generation/configuration_utils.py:615: UserWarning: num_beams is set to 1. However, early_stopping is set to True -- this flag is only used in beam-based generation modes. You should set num_beams>1 or unset early_stopping.
warnings.warn(
Setting pad_token_id to eos_token_id:128001 for open-end generation.
Setting pad_token_id to eos_token_id:128001 for open-end generation.
WARNING:accelerate.big_modeling:Some parameters are on the meta device because they were offloaded to the cpu.
Setting pad_token_id to eos_token_id:128001 for open-end generation.
/usr/local/lib/python3.10/dist-packages/transformers/generation/configuration_utils.py:615: UserWarning: num_beams is set to 1. However, early_stopping is set to True -- this flag is only used in beam-based generation modes. You should set num_beams>1 or unset early_stopping.
warnings.warn(
Setting pad_token_id to eos_token_id:128001 for open-end generation.
WARNING:accelerate.big_modeling:Some parameters are on the meta device because they were offloaded to the cpu.
Setting pad_token_id to eos_token_id:128001 for open-end generation.
Setting pad_token_id to eos_token_id:128001 for open-end generation.
```

I'll leave the code here:

```python
from transformers import AutoModel, AutoTokenizer
from lightrag import LightRAG, QueryParam
from lightrag.llm import hf_model_complete, hf_embedding  # these imports were missing from the original snippet
from lightrag.utils import EmbeddingFunc
import nest_asyncio  # allow nested event loops (needed in Jupyter)

WORKING_DIR = "./"

# Apply nest_asyncio to allow nested event loops
nest_asyncio.apply()  

# Initialize LightRAG with Hugging Face model
rag = LightRAG(
    working_dir=WORKING_DIR,
    llm_model_func=hf_model_complete,  # Use Hugging Face model for text generation
    llm_model_name='meta-llama/Llama-3.2-1B-Instruct',  # Model name from Hugging Face
    # Use Hugging Face embedding function
    embedding_func=EmbeddingFunc(
        embedding_dim=384,
        max_token_size=5000,
        func=lambda texts: hf_embedding(
            texts, 
            tokenizer=AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2"),
            embed_model=AutoModel.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
        )
    ),
)

with open("./Ley-de-Bienestar-Animal.txt") as f:
    rag.insert(f.read())

# Perform naive search
print(rag.query("Haz un resumen de las leyes", param=QueryParam(mode="naive")))

# Perform local search
#print(rag.query("Haz un resumen de las leyes", param=QueryParam(mode="local")))

# Perform global search
#print(rag.query("Haz un resumen de las leyes", param=QueryParam(mode="global")))

# Perform hybrid search
#print(rag.query("Haz un resumen de las leyes", param=QueryParam(mode="hybrid")))```

F4k3r22 (Author) commented Oct 18, 2024

And if I comment out this line: `# nest_asyncio.apply()`

I get this error:

```
INFO:lightrag:Logger initialized for working directory: ./
DEBUG:lightrag:LightRAG init with param:
working_dir = ./,
chunk_token_size = 1200,
chunk_overlap_token_size = 100,
tiktoken_model_name = gpt-4o-mini,
entity_extract_max_gleaning = 1,
entity_summary_to_max_tokens = 500,
node_embedding_algorithm = node2vec,
node2vec_params = {'dimensions': 1536, 'num_walks': 10, 'walk_length': 40, 'window_size': 2, 'iterations': 3, 'random_seed': 3},
embedding_func = {'embedding_dim': 384, 'max_token_size': 5000, 'func': <function at 0x7f82cc1df7f0>},
embedding_batch_num = 32,
embedding_func_max_async = 16,
llm_model_func = <function hf_model_complete at 0x7f81fb9b91b0>,
llm_model_name = meta-llama/Llama-3.2-1B-Instruct,
llm_model_max_token_size = 32768,
llm_model_max_async = 16,
key_string_value_json_storage_cls = <class 'lightrag.storage.JsonKVStorage'>,
vector_db_storage_cls = <class 'lightrag.storage.NanoVectorDBStorage'>,
vector_db_storage_cls_kwargs = {},
graph_storage_cls = <class 'lightrag.storage.NetworkXStorage'>,
enable_llm_cache = True,
addon_params = {},
convert_response_to_json_func = <function convert_response_to_json at 0x7f81fb99add0>

INFO:lightrag:Load KV full_docs with 0 data
INFO:lightrag:Load KV text_chunks with 0 data
INFO:lightrag:Load KV llm_response_cache with 0 data

RuntimeError                              Traceback (most recent call last)
in <cell line: 31>()
     30
     31 with open("./Ley-de-Bienestar-Animal.txt") as f:
---> 32     rag.insert(f.read())
     33
     34 # Perform naive search

2 frames
/usr/lib/python3.10/asyncio/base_events.py in _check_running(self)
    582     def _check_running(self):
    583         if self.is_running():
--> 584             raise RuntimeError('This event loop is already running')
    585         if events._get_running_loop() is not None:
    586             raise RuntimeError(

RuntimeError: This event loop is already running
```

LimFang commented Oct 18, 2024

> I'm trying to use this for PDF files but I don't see any PDF examples.

Yeah, so am I. I'm working on this while also making sure the LLM can understand the math and code logic inherent in the PDFs in the database.

LarFii (Collaborator) commented Oct 19, 2024

> And if I comment out this line: `# nest_asyncio.apply()`, I get `RuntimeError: This event loop is already running` (full log quoted above).

Could you provide more details about the specific runtime environment?

TianyuFan0504 (Contributor)

> And if I comment out this line: `# nest_asyncio.apply()`, I get `RuntimeError: This event loop is already running` (full log quoted above).

Are you running your code in a Jupyter notebook? If so, do not comment out `nest_asyncio.apply()`.

This function is for: 1. allowing asynchronous code to run in a Jupyter notebook, and 2. solving the problems that arise when starting a new event loop inside an already-running event loop.
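
To illustrate (a minimal sketch; `dummy_task` is a hypothetical stand-in for the internal async work that, I'm assuming, `rag.insert()` drives with `run_until_complete` on the current loop):

```python
import asyncio
import nest_asyncio

async def dummy_task():  # hypothetical stand-in for LightRAG's internal async work
    return "done"

# Inside Jupyter an event loop is already running, so calling
# run_until_complete() on it raises "This event loop is already running"
# -- unless the loop has been patched to allow re-entry:
nest_asyncio.apply()

loop = asyncio.get_event_loop()
print(loop.run_until_complete(dummy_task()))  # prints "done", even in a notebook
```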

F4k3r22 (Author) commented Oct 19, 2024

Yeah, it was in a Jupyter Notebook, but it always stays in the loop and never responds.

Soumil32 (Contributor)

Maybe you can use a library such as Marker to parse the document first, then pass the text output to LightRAG?
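
For example, a rough sketch of that pipeline (the `marker_single` CLI and its output layout are assumptions based on Marker's README and may differ across versions):

```python
# Rough sketch: convert a PDF to Markdown with Marker, then insert the
# result into LightRAG. Assumes `pip install marker-pdf` provides the
# `marker_single` CLI and that it writes .md files into the output
# folder -- check Marker's README for your version.
import subprocess
from pathlib import Path

pdf = "Ley-de-Bienestar-Animal.pdf"
out_dir = Path("marker_out")
subprocess.run(["marker_single", pdf, str(out_dir)], check=True)

# Gather whatever Markdown Marker produced and feed it to the `rag`
# instance configured earlier in this thread.
md_text = "\n\n".join(p.read_text(encoding="utf-8") for p in out_dir.rglob("*.md"))
rag.insert(md_text)
```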

LarFii closed this as completed Oct 21, 2024