This repository contains scripts for embedding text documents and querying a Pinecone index using OpenAI's embedding model. The scripts facilitate the processing of CSV files, embedding text data, and interacting with a Pinecone index.
Ensure you have the following installed:
- Python 3.7+
- pip

To set up the project:
- Clone the repository and change into it: `git clone <repository-url>`, then `cd <repository-directory>`.
- Install the required packages: `pip install -r requirements.txt`.
- Create a `.env` file in the root directory and add your Pinecone and OpenAI API keys, one variable per line:
  - `PINECONE_API_KEY=<your-pinecone-api-key>`
  - `OPENAI_API_KEY=<your-openai-api-key>`
  - `OPENAI_EMBEDDING_MODEL=text-embedding-3-small`

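
To check that the `.env` file is complete before running either script, something like the following could be used. It is a minimal standard-library sketch, not the repository's own loading code (which may well use a package such as python-dotenv); the helper names `read_env_file` and `missing_keys` are hypothetical, and treating only the two API keys as mandatory is an assumption.

```python
from pathlib import Path
from typing import Dict, List

# Assumption: only the two API keys are treated as mandatory here.
REQUIRED_KEYS = ("PINECONE_API_KEY", "OPENAI_API_KEY")


def read_env_file(path: str = ".env") -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and '#' comments."""
    values: Dict[str, str] = {}
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values: Dict[str, str]) -> List[str]:
    """Return the required keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not values.get(key)]
```

Calling `missing_keys(read_env_file(".env"))` returns an empty list when both keys are set.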
The `process_csv.py` script processes a CSV file, embeds the text data, and upserts it into a Pinecone index.
```bash
python process_csv.py -csv <path_to_csv_file> -folder <path_to_file_folder> -index <index_name> -ns <namespace> [-rebuild] [-workers <num_workers>] [-max_words <max_words_per_file>] [-cs <chunk_size>] [-ovlp <overlap>]
```
- `-csv`, `--csv_file`: Path to the CSV file (required).
- `-folder`, `--file_folder`: Folder containing the files referenced in the CSV (required).
- `-index`, `--index_name`: Name of the Pinecone index (required).
- `-ns`, `--namespace`: Namespace for the Pinecone index (required).
- `-rebuild`, `--rebuild_index`: Rebuild the index if specified.
- `-workers`, `--num_workers`: Number of workers for parallel processing (default: 4).
- `-max_words`, `--max_words_per_file`: Maximum number of words per file (default: None).
- `-cs`, `--chunk_size`: Chunk size for text division (default: 50).
- `-ovlp`, `--overlap`: Overlap size for text division (default: 10).
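
To illustrate how `--chunk_size` and `--overlap` might interact, here is a minimal sliding-window sketch. It assumes both values are counted in words and that consecutive chunks share `overlap` words; the script's actual splitting logic may differ, and the function name `chunk_words` is hypothetical.

```python
from typing import List


def chunk_words(text: str, chunk_size: int = 50, overlap: int = 10) -> List[str]:
    """Split text into word chunks; neighbours share `overlap` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the window already reached the end of the text
    return chunks


if __name__ == "__main__":
    sample = " ".join(f"w{i}" for i in range(100))
    # With the defaults, windows start at words 0, 40 and 80.
    print(len(chunk_words(sample)))  # prints 3
```

Under these assumptions, a larger overlap preserves more context across chunk boundaries at the cost of embedding more chunks per file.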
The `query_pinecone.py` script queries the Pinecone index with a text input and retrieves the top results.
```bash
python query_pinecone.py -query <query_text> -index <index_name> -top_k <top_k_results> -ns <namespace>
```
- `-query`, `--query_text`: Text input for the query (required).
- `-index`, `--index_name`: Name of the Pinecone index (required).
- `-top_k`, `--top_k_results`: Number of top results to retrieve (default: 10).
- `-ns`, `--namespace`: Namespace for the Pinecone index (required).
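
The options above could be declared with `argparse` roughly as follows. This is a sketch inferred from the documented flags and defaults, not the script's actual source; the function name `build_query_parser` is hypothetical.

```python
import argparse


def build_query_parser() -> argparse.ArgumentParser:
    """Declare the documented query options with their short and long forms."""
    parser = argparse.ArgumentParser(description="Query a Pinecone index with text.")
    parser.add_argument("-query", "--query_text", required=True,
                        help="Text input for the query.")
    parser.add_argument("-index", "--index_name", required=True,
                        help="Name of the Pinecone index.")
    parser.add_argument("-top_k", "--top_k_results", type=int, default=10,
                        help="Number of top results to retrieve.")
    parser.add_argument("-ns", "--namespace", required=True,
                        help="Namespace for the Pinecone index.")
    return parser


if __name__ == "__main__":
    # Parse an explicit argument list instead of sys.argv for demonstration.
    args = build_query_parser().parse_args(
        ["-query", "refund policy", "-index", "docs", "-ns", "support"]
    )
    print(args.query_text, args.index_name, args.top_k_results, args.namespace)
```

Because each option has a `--long_name` form, `argparse` stores the parsed value under that name (for example `args.top_k_results`), and `-top_k` falls back to 10 when omitted.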
Both scripts log their operations to log files (`process_pinecone.log` and `query_pinecone.log`) in the root directory. The logs include information about the process flow, warnings, and errors.
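
A file logger of the kind described above can be set up with Python's standard `logging` module. The sketch below is illustrative only: the helper name `make_file_logger`, the `INFO` level, and the message format are assumptions, not the scripts' actual configuration.

```python
import logging


def make_file_logger(name: str, log_file: str) -> logging.Logger:
    """Return a logger that appends INFO-and-above records to log_file."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid attaching duplicate handlers on repeat calls
        handler = logging.FileHandler(log_file, encoding="utf-8")
        # Assumed format: timestamp, severity, message.
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger
```

For example, `make_file_logger("query", "query_pinecone.log").warning("no matches found")` would append a timestamped `WARNING` line to that file.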
This project is licensed under the MIT License.