docs: document_loaders classification #4069
@@ -37,7 +37,7 @@
 "# This is from https://langchain.readthedocs.io/en/latest/modules/document_loaders/examples/csv.html\n",
 "\n",
 "from langchain.document_loaders.csv_loader import CSVLoader\n",
-"loader = CSVLoader(file_path='../../document_loaders/examples/example_data/mlb_teams_2012.csv')\n",
+"loader = CSVLoader(file_path='../../document_loaders/examples/../example_data/mlb_teams_2012.csv')\n",
weird pathing
Oops. My bad.
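For reference, a quick way to see why the changed path looks odd is to normalize it lexically. This is only a sketch using the two path strings from the hunk above; `posixpath.normpath` collapses `..` segments textually and does not touch the file system, so symlinks are not taken into account.

```python
import posixpath

# The two paths from the diff hunk above.
original = "../../document_loaders/examples/example_data/mlb_teams_2012.csv"
changed = "../../document_loaders/examples/../example_data/mlb_teams_2012.csv"

# "examples/.." cancels out, so the changed path points one level higher
# than the original one.
normalized = posixpath.normpath(changed)
print(normalized)
# ../../document_loaders/example_data/mlb_teams_2012.csv

# The original path is already normalized, so the two are not equivalent.
print(normalized == posixpath.normpath(original))
# False
```

So the edited path is not just a longer spelling of the old one: it resolves to `document_loaders/example_data/` rather than `document_loaders/examples/example_data/`.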
@@ -25,7 +25,7 @@
 "metadata": {},
 "outputs": [],
 "source": [
-"loader = NotebookLoader(\"example_data/notebook.ipynb\")"
+"loader = NotebookLoader(\"../../../example_data/notebook.ipynb\")"
this shouldn't really need to change?
No. We don't need to change test data. Thanks.
Hmm, in theory I like this, but the distinction seems a bit blurred. For example, why is Google Drive a formatter and not a knowledge document loader?
I'll add more description to this.
The idea is that a "knowledge loader" works with storage that we do not control: something that can be used as a "tool" (in LangChain terms), can be accessed with queries, and can be considered a source of "external" knowledge. We can let the LLM make queries and get information, or we can download documents and use them in a more controlled way.
"Formatters" can be as simple as transformers for CSV, SQL, etc., but they can also be cloud services or app stores. They may be hosted outside our control, but the information inside them is under our control.
Hmm, when I think formatter I think CSV or Word, but not Google Drive; Google Drive could have CSV files in it. I would be down to split out the ones that relate to a certain file type (e.g. CSV, PDF, PPT). The other ones could then load from various locations (e.g. from Drive or a website) and use formatters under the hood. This may be related to some of the stuff @eyurtsev is working on?
What is the definition of those categories? E.g., why is Microsoft Word (.docx) not a format?
Hello @leo-gan 👋 Thanks for helping with the docs! I am slowly making changes to implement the plan that's outlined here: #2833 (comment). The high level is to decouple the code that loads raw data (bytes) from the code that parses the raw data to generate documents. It'll still be possible to define arbitrary document loaders, but it'll also become easier to re-use existing parsers in a document loader (or even existing blob loaders). Not sure that this would change the documentation much.
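To make the decoupling described above concrete, here is a minimal sketch of the idea. It is not the actual LangChain API or the design from #2833: `Blob`, `Document`, `parse_csv`, `in_memory_blobs`, and `load` are hypothetical names chosen for illustration, and the sample CSV content is made up.

```python
import csv
import io
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, Iterator, List


@dataclass
class Blob:
    """Raw bytes plus where they came from (hypothetical type)."""
    data: bytes
    source: str


@dataclass
class Document:
    """Parsed text plus metadata (hypothetical stand-in, not LangChain's class)."""
    page_content: str
    metadata: Dict[str, str] = field(default_factory=dict)


# A parser turns one blob into zero or more documents.
Parser = Callable[[Blob], Iterator[Document]]


def parse_csv(blob: Blob) -> Iterator[Document]:
    """Format-specific step: yield one document per CSV row."""
    reader = csv.DictReader(io.StringIO(blob.data.decode("utf-8")))
    for row_number, row in enumerate(reader):
        text = "\n".join(f"{key}: {value}" for key, value in row.items())
        yield Document(text, {"source": blob.source, "row": str(row_number)})


def in_memory_blobs(files: Dict[str, bytes]) -> Iterator[Blob]:
    """Location-specific step: here a dict stands in for Drive, a website, etc."""
    for name, data in files.items():
        yield Blob(data=data, source=name)


def load(blobs: Iterable[Blob], parser: Parser) -> List[Document]:
    """Combine any blob source with any parser."""
    return [doc for blob in blobs for doc in parser(blob)]


# Made-up sample data, used only to exercise the sketch.
docs = load(in_memory_blobs({"teams.csv": b"Team,Wins\nNationals,98\nReds,97\n"}), parse_csv)
print(len(docs))
# 2
```

Under this split, the same `parse_csv` could be reused with a different blob source, which is roughly the "formatters under the hood" arrangement suggested earlier in the thread.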
@hwchase17 I've completely reworked the document_loader classification. Please, check it out.
@hwchase17 any comments? If you are busy, maybe @dev2049 can help? TNX
awesome - thanks!
Problem statement: the document_loaders section is too long and hard to comprehend.
Proposal: group document_loaders into 3 classes (see the Files changed tab).
UPDATE: I've completely reworked the document_loader classification.
Now this PR changes only one file!
FYI @eyurtsev @hwchase17