NOTE: if you download the source files, the packages are missing because they made the upload too big. You can get all of them from NuGet:
- Newtonsoft.Json.13.0.3
- Open-XML-SDK.2.9.1
- PdfiumViewer.2.13.0.0
- PdfiumViewer.Native.x86.v8-xfa.2018.4.8.256
- PdfiumViewer.Native.x86_64.v8-xfa.2018.4.8.256
ABOUT: a simple Windows-only .exe that batch-processes PDFs and Word documents and strips them to plain text. It lets you quickly add multiple files and folders to build a "source list", which is then iterated through to build a "file list" of documents to convert to plain text. Sub-folders are searched recursively.
Currently only supports .PDF, .DOCX and .TXT files.
The purpose of this application is to automate the process of preparing documents for vector embeddings.
Written in VB.NET using .NET Framework 4.5 and PDFium (NuGet package).
Main app:
Manage your sources, by individual files or by directories en masse.
Add by directory
Add multiple directories quickly.
All valid files (PDF / .txt so far) in all sub-folders, searched recursively, will be added.
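The recursive search described above can be sketched roughly as follows. This is illustrative Python, not the application's VB.NET code: the function name `build_file_list` and the `SUPPORTED_EXTENSIONS` set are hypothetical, and the extension list is assumed from the formats this README mentions.

```python
from pathlib import Path

# Assumption: the set of valid extensions, taken from the formats named in this README.
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".txt"}


def build_file_list(sources):
    """Expand a source list (files and/or directories) into a sorted file list.

    Directories are walked recursively; anything without a supported
    extension is skipped, and duplicates are removed.
    """
    found = set()
    for source in sources:
        path = Path(source)
        # A directory contributes everything beneath it; a file contributes itself.
        candidates = path.rglob("*") if path.is_dir() else [path]
        for candidate in candidates:
            if candidate.is_file() and candidate.suffix.lower() in SUPPORTED_EXTENSIONS:
                found.add(candidate.resolve())
    return sorted(found)
```

Resolving each path before adding it to the set means a file listed both individually and through its parent directory is only converted once.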
Customize your options
Configure your outputs: you can write the plain text to the same directory as each original file, or collate all of the output into a new custom directory. You can also keep the original file name, or choose a prefix, a suffix, or an entirely custom name. If splitting by pages, each new page file is named automatically so nothing is overwritten.
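As a rough illustration of how these naming options could combine, here is a minimal Python sketch. It is not the application's actual VB.NET logic: the function `output_path`, its parameters, and the `_page0001` numbering scheme are all assumptions made for the example.

```python
from pathlib import Path


def output_path(source_file, output_dir=None, prefix="", suffix="",
                custom_name=None, page=None):
    """Build the .txt output path for one source file (hypothetical helper).

    output_dir=None writes next to the original file; otherwise all
    output is collated into output_dir.
    """
    source = Path(source_file)
    target_dir = Path(output_dir) if output_dir else source.parent
    # A custom name replaces the original name; otherwise wrap it in prefix/suffix.
    stem = custom_name if custom_name else f"{prefix}{source.stem}{suffix}"
    # Assumed scheme: a zero-padded page number keeps per-page files distinct.
    if page is not None:
        stem = f"{stem}_page{page:04d}"
    return target_dir / f"{stem}.txt"


print(output_path("docs/report.pdf", output_dir="out", prefix="txt_"))
print(output_path("docs/report.pdf", page=3))
```

Note that this sketch does not guard against collisions, for example when one custom name is applied to several files collated into the same directory.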