
PDF Document Stripper to Plain text - used to prepare documents for Vector Embeddings.


coffeecodeconverter/DocStrip


NOTE: if you download the source files, the packages are missing, because including them made the upload too big. You can get all of them from NuGet:

  • Newtonsoft.Json.13.0.3
  • Open-XML-SDK.2.9.1
  • PdfiumViewer.2.13.0.0
  • PdfiumViewer.Native.x86.v8-xfa.2018.4.8.256
  • PdfiumViewer.Native.x86_64.v8-xfa.2018.4.8.256

ABOUT: a simple .exe, for Windows only, that batch-processes PDFs or Word documents and strips them to plain text. It lets you quickly add multiple files and folders to build a "source list", which the app then iterates through to build a "file list" of documents to convert to plain text. Sub-folders are searched recursively.
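The source-list to file-list step described above can be sketched as follows. This is an illustration in Python, not the application's VB.NET code: the function name `build_file_list`, the `SUPPORTED_EXTENSIONS` set, and the de-duplication of files reachable from more than one source are all assumptions made for the example.

```python
from pathlib import Path

# Assumed set of supported extensions, taken from the formats the README lists.
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".txt"}


def build_file_list(sources):
    """Expand a source list (files and directories) into a flat file list.

    Directories are searched recursively. Unsupported extensions and
    missing paths are skipped. De-duplication is an assumption of this
    sketch, not documented behaviour of the application.
    """
    seen = set()
    files = []
    for source in sources:
        path = Path(source)
        if path.is_dir():
            # rglob("*") walks every sub-folder under the directory.
            candidates = sorted(p for p in path.rglob("*") if p.is_file())
        elif path.is_file():
            candidates = [path]
        else:
            candidates = []  # path does not exist: nothing to add
        for candidate in candidates:
            if candidate.suffix.lower() not in SUPPORTED_EXTENSIONS:
                continue
            resolved = candidate.resolve()
            if resolved not in seen:
                seen.add(resolved)
                files.append(resolved)
    return files
```

Calling `build_file_list(["C:/docs", "C:/notes/readme.txt"])` (hypothetical paths) would return every supported file under `C:/docs`, plus the single text file, as absolute paths.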

Currently only supports PDF, .DOCX and .TXT files.

The purpose of this application is to automate the process of preparing documents for vector embeddings.

Written in VB.NET using .NET Framework 4.5 and PDFium (NuGet package).

Main app:

image

Manage your sources, by individual files or by directories en masse

image

Add by directory

image

Add multiple directories quickly

image

All valid files (PDF / .txt so far) in all sub-folders will be added recursively.

image

Customize your options

image

Configure your outputs. You can write the plain text to the same directory as each original file, or collate all of the output into a new custom directory. You can also keep the original file name, or choose a prefix, a suffix, or an entirely custom name. If splitting by pages, the app automatically names each new page file so nothing is overwritten.
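The output options above can be sketched as a single path-building function. This is a Python illustration only, not the application's code: the function name `output_path`, its parameters, and the `_page0001` naming pattern for split pages are assumptions, since the actual page-naming scheme is not stated here.

```python
from pathlib import Path


def output_path(source_file, output_dir=None, prefix="", suffix="",
                custom_name=None, page=None):
    """Build the path of the plain-text file for one source document.

    output_dir=None  -> write next to the source file
    custom_name      -> replaces the original file name (without extension)
    prefix / suffix  -> wrapped around the chosen name
    page             -> page number when splitting by pages (assumed
                        pattern: zero-padded so each page gets its own file)
    """
    source = Path(source_file)
    stem = custom_name if custom_name else source.stem
    name = f"{prefix}{stem}{suffix}"
    if page is not None:
        # A distinct name per page keeps page files from overwriting each other.
        name += f"_page{page:04d}"
    folder = Path(output_dir) if output_dir else source.parent
    return folder / f"{name}.txt"
```

In this sketch, `output_path("docs/report.pdf", output_dir="out", prefix="clean_", page=3)` resolves to `out/clean_report_page0003.txt`. Note that for a `.txt` source with no prefix, suffix, custom name, or output directory, the sketch returns the source's own path; how the application handles that case is not described here.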

image
