ChatGPT for my lecture slides
Built with Streamlit, powered by LlamaIndex and LangChain.
Uses the latest ChatGPT API from OpenAI.
Inspired by AthensGPT
Demo video: `demo.mp4`
- Parses PDFs with `pypdf`
- Index construction with LlamaIndex's `GPTSimpleVectorIndex`
  - the `text-embedding-ada-002` model is used to create embeddings (see the vector store index page to learn more)
  - here's a sample index
  - the indexes and files are stored on S3
- Query the index
  - uses the latest ChatGPT model, `gpt-3.5-turbo` (a sketch of the indexing and query flow follows below)
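For concreteness, here's a minimal sketch of that flow, assuming the llama_index 0.4-era API that `GPTSimpleVectorIndex` comes from (method names and keyword arguments shifted in later releases; the `lectures/` folder, file names, and question are placeholders, not paths from this repo):

```python
from langchain.chat_models import ChatOpenAI
from llama_index import GPTSimpleVectorIndex, LLMPredictor, SimpleDirectoryReader

# Load lecture PDFs from a local folder (pypdf handles the .pdf parsing).
documents = SimpleDirectoryReader("lectures/").load_data()

# Build the vector index; embeddings come from text-embedding-ada-002.
index = GPTSimpleVectorIndex(documents)

# Persist the index as JSON (ClassGPT keeps these files on S3).
index.save_to_disk("lecture01.json")

# Later: reload the index and answer a question with gpt-3.5-turbo.
index = GPTSimpleVectorIndex.load_from_disk("lecture01.json")
llm_predictor = LLMPredictor(llm=ChatOpenAI(model_name="gpt-3.5-turbo", temperature=0))
response = index.query("What is covered in lecture 1?", llm_predictor=llm_predictor)
print(response)
```

The save/load round-trip matters for cost: embeddings are only paid for when the index is built, so reloading a stored index answers questions without re-embedding the slides.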
- Configure AWS (quickstart): run `aws configure`
- Create an S3 bucket named `classgpt`
- Rename [.env.local.example] to `.env` and add your OpenAI credentials (a sketch of how the key can be loaded follows these steps)
- Create a Python environment: `conda create -n classgpt python=3.9`, then `conda activate classgpt`
- Install dependencies: `pip install -r requirements.txt`
- Run the Streamlit app: `cd app/` and `streamlit run 01_❓_Ask.py`

Alternatively, you can use Docker: `docker compose up`

Then open a new browser tab and navigate to http://localhost:8501/
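To sanity-check that the credentials in `.env` are picked up, here's a hedged sketch using python-dotenv; whether this repo loads the key this way, and the exact `OPENAI_API_KEY` variable name in `.env.local.example`, are assumptions:

```python
import os

import openai
from dotenv import load_dotenv  # pip install python-dotenv

# Pull variables from the .env file into the process environment.
load_dotenv()

# The openai 0.x client reads this key; the variable name is an assumption
# about what .env.local.example contains.
openai.api_key = os.environ["OPENAI_API_KEY"]
print("key loaded:", bool(openai.api_key))
```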
- Fix `ValueError: Could not parse LLM output` issues
- Add the ability to query multiple files
  - Compose indices of multiple lectures and query across all of them
  - loop through all existing indexes, create any that haven't been built yet, and compose them together (a rough sketch follows this list)
  - references
- Custom prompts and tweakable settings
  - create a settings page for tweaking model parameters and providing example custom prompts
- Offer a choice of local or cloud storage, so users don't have to set up AWS S3 and everything is stored locally
- Deploy the app on AWS
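For the "create the missing indexes" item above, that loop might look roughly like the following with boto3; the `pdfs/`/`indexes/` bucket layout and the `create_index` helper are hypothetical, not taken from this repo:

```python
import boto3

BUCKET = "classgpt"
s3 = boto3.client("s3")

def list_keys(prefix: str) -> list[str]:
    """Return all object keys under a prefix in the bucket."""
    resp = s3.list_objects_v2(Bucket=BUCKET, Prefix=prefix)
    return [obj["Key"] for obj in resp.get("Contents", [])]

# Hypothetical layout: one PDF per lecture under pdfs/, one JSON index under indexes/.
pdfs = {k.removeprefix("pdfs/").removesuffix(".pdf") for k in list_keys("pdfs/")}
built = {k.removeprefix("indexes/").removesuffix(".json") for k in list_keys("indexes/")}

for lecture in sorted(pdfs - built):
    print(f"no index yet for {lecture}, building it...")
    # create_index(lecture)  # hypothetical helper: download the PDF, build a
    #                        # GPTSimpleVectorIndex, upload the JSON back to S3
```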
Tokens can be thought of as pieces of words. Before the API processes the prompts, the input is broken down into tokens. These tokens are not cut up exactly where the words start or end - tokens can include trailing spaces and even sub-words. Here are some helpful rules of thumb for understanding tokens in terms of lengths:
- 1 token ~= 4 chars in English
- 1 token ~= ¾ words
- 100 tokens ~= 75 words
- 1-2 sentences ~= 30 tokens
- 1 paragraph ~= 100 tokens
- 1,500 words ~= 2048 tokens
Try the OpenAI Tokenizer tool
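To check these rules of thumb against your own text, OpenAI's `tiktoken` library exposes the same tokenizer the models use:

```python
import tiktoken  # pip install tiktoken

# Grab the exact tokenizer gpt-3.5-turbo uses.
enc = tiktoken.encoding_for_model("gpt-3.5-turbo")

text = "Tokens can include trailing spaces and even sub-words."
tokens = enc.encode(text)
print(f"{len(text)} chars -> {len(tokens)} tokens")  # ~4 chars per token in English
```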
An embedding is a vector (list) of floating point numbers. The distance between two vectors measures their relatedness. Small distances suggest high relatedness and large distances suggest low relatedness.
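As an illustration (not code from this repo), here's how two texts could be embedded with the legacy `openai` 0.x client that was current in this project's era, then compared with cosine similarity:

```python
import numpy as np
import openai  # the 0.x client, matching this project's era

def embed(text: str) -> np.ndarray:
    """Fetch an embedding vector for one piece of text."""
    resp = openai.Embedding.create(input=[text], model="text-embedding-ada-002")
    return np.array(resp["data"][0]["embedding"])

a = embed("What is a vector index?")
b = embed("How do vector indexes work?")

# Cosine similarity: values near 1.0 mean the texts are closely related.
similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(f"cosine similarity: {similarity:.3f}")
```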
- For the `text-embedding-ada-002` model, cost is $0.0004 / 1K tokens, or roughly 3,000 pages per dollar
- For the `gpt-3.5-turbo` model (the ChatGPT API), cost is $0.002 / 1K tokens
- For the `text-davinci-003` model, cost is $0.02 / 1K tokens
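Putting those numbers together, a back-of-the-envelope estimate for embedding a single lecture deck (the ~800 tokens/page figure is implied by the 3,000-pages-per-dollar rate above; the 50-page deck is just an example):

```python
# Back-of-the-envelope embedding cost for a 50-page lecture deck.
PAGES = 50
TOKENS_PER_PAGE = 800        # implied by ~3,000 pages/dollar at $0.0004 per 1K tokens
COST_PER_1K_TOKENS = 0.0004  # text-embedding-ada-002

total_tokens = PAGES * TOKENS_PER_PAGE
print(f"{total_tokens} tokens -> ${total_tokens / 1000 * COST_PER_1K_TOKENS:.4f}")
# 40000 tokens -> $0.0160
```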
- Increase upload limit of `st.file_uploader`
- `st.cache_resource` - Streamlit Docs
- Session State
- hayabhay/whisper-ui: Streamlit UI for OpenAI's Whisper
- Streamlit Deployment Guide (wiki) - 🚀 Deployment - Streamlit
- How to Deploy a streamlit application to AWS? Part-3
- Loading data
- ChatGPT
- boto3 file_upload does it check if file exists
- Boto 3: Resource vs Client
- Writing json to file in s3 bucket
- amazon web services - What is the best way to pass AWS credentials to a Docker container?
- docker-compose up failing due to: error: can't find Rust compiler · Issue #572 · acheong08/ChatGPT
- linux - When installing Rust toolchain in Docker, Bash `source` command doesn't work
- software installation - How to install a package with apt without the "Do you want to continue [Y/n]?" prompt? - Ask Ubuntu
- How to use sudo inside a docker container?