GitHub - coffeecodeconverter/DocStrip: PDF Document Stripper to Plain text - used to prepare documents for Vector Embeddings.

NOTE: if you download the source files, the packages are missing because they made the upload too big. You can get all of them from NuGet:

Newtonsoft.Json.13.0.3
Open-XML-SDK.2.9.1
PdfiumViewer.2.13.0.0
PdfiumViewer.Native.x86.v8-xfa.2018.4.8.256
PdfiumViewer.Native.x86_64.v8-xfa.2018.4.8.256

ABOUT: a simple .exe, Windows only.

Batch-processes PDFs or Word documents and strips them to plain text. Lets you quickly add multiple files and folders to build a "source list", which is then iterated through to build a "file list" of documents to convert to plain text. Sub-folders are searched recursively.
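To illustrate the "source list" to "file list" step, here is a minimal Python sketch. It is not the app's own VB.NET code: the function name `build_file_list` and the constant `SUPPORTED_EXTENSIONS` are made up for this example, and the extension set is simply taken from the supported formats listed in this README.

```python
from pathlib import Path

# Hypothetical constant: extensions taken from this README's supported list.
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".txt"}


def build_file_list(source_list):
    """Expand a source list (files and folders) into a sorted, de-duplicated file list."""
    found = set()
    for source in source_list:
        path = Path(source)
        if path.is_dir():
            # rglob("*") walks the folder and every sub-folder beneath it.
            candidates = path.rglob("*")
        else:
            candidates = [path]
        for candidate in candidates:
            # Keep only existing files with a supported extension (case-insensitive).
            if candidate.is_file() and candidate.suffix.lower() in SUPPORTED_EXTENSIONS:
                found.add(candidate.resolve())
    return sorted(found)
```

Resolving each path before adding it to a set means a file that is listed directly and also sits inside a listed folder only shows up once.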

Currently supports only PDF, .DOCX and .TXT files.

The purpose of this application is to automate the process of preparing documents for vector embeddings.

Written in VB.NET using .NET Framework 4.5 and PDFium (NuGet package).

Main app:

Manage your sources, by individual files or by directories en masse

Add by directory

Add multiple directories quickly

All valid files (PDF / .txt so far) in all sub-folders are added recursively.

Customize your options

Configure your outputs: you can write the plain text to the same directory as each original file, or collate all of it into a new custom directory. You can also keep the original file name, or choose a prefix, a suffix, or an entirely custom name. If splitting by pages, it automatically names each new page file so nothing is overwritten.
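The output options above could be modelled roughly like this. This is a Python sketch for illustration only, not the app's actual VB.NET code: the function `output_path`, its parameters, and the `_page001` naming pattern are all assumptions, since the README does not say how page files are actually named.

```python
from pathlib import Path


def output_path(source_file, output_dir=None, prefix="", suffix="",
                custom_name=None, page=None):
    """Work out where the plain-text version of source_file would be written."""
    source = Path(source_file)
    # A custom name replaces the original name; otherwise wrap it in prefix/suffix.
    base = custom_name if custom_name else f"{prefix}{source.stem}{suffix}"
    if page is not None:
        # Assumed pattern: a zero-padded page number keeps each page file distinct.
        base = f"{base}_page{page:03d}"
    # No output_dir means "same directory as the original file".
    target_dir = Path(output_dir) if output_dir else source.parent
    return target_dir / f"{base}.txt"
```

For example, `output_path("docs/report.pdf", output_dir="out", prefix="clean_")` gives `out/clean_report.txt`. The sketch does not guard against name collisions, such as two same-named files from different folders being collated into one directory.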

Repository files:

Doc Strip Source Files.zip
DocStrip_1_0_0_5.zip
README.md
