Welcome to the realm of PDF Alchemist, where the secrets of PDFs are transmuted into HTML.
This Python application, lovingly named PDF Alchemist, is a sophisticated, open-source toolkit that combines the arcane arts of PDF parsing, OCR, image processing, and HTML generation. It's designed for those who seek to unlock the knowledge sealed within the enigmatic tomes we call PDFs.
This project brings together a fellowship of powerful components:
- PDFParser: The Document Detective, powered by PyMuPDF
- OCREngine: The Text Archaeologist, empowered by Tesseract
- ImageProcessor: The Digital Alchemist, enhanced by Pillow
- HTMLGenerator: The Web Illusionist, crafted with Dominate
- ProgressTracker: The Expedition Chronicler, utilizing Python's built-in logging module
- Unearth text and images from PDF archives
- Decipher text using advanced OCR incantations
- Transmute images into optimized, base64-encoded artifacts
- Weave extracted elements into responsive HTML tapestries
- Chronicle the expedition with detailed logs and progress tracking
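The base64 transmutation mentioned above can be illustrated with the standard library alone. This is a minimal sketch, not the project's actual ImageProcessor code: the helper names `to_data_uri` and `img_tag` and the sample bytes are hypothetical, and the real pipeline presumably optimizes each image with Pillow before encoding it.

```python
import base64
import html


def to_data_uri(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Encode raw image bytes as a data URI (hypothetical helper)."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def img_tag(image_bytes: bytes, alt: str = "") -> str:
    """Build an <img> element that embeds the image inline."""
    src = to_data_uri(image_bytes)
    return f'<img src="{src}" alt="{html.escape(alt, quote=True)}">'


if __name__ == "__main__":
    # Placeholder bytes: the 8-byte PNG signature, not a complete image.
    sample = b"\x89PNG\r\n\x1a\n"
    print(img_tag(sample, alt="Figure 1"))
```

Embedding images this way keeps each HTML page self-contained, at the cost of roughly a third more bytes than the original binary image.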
To establish your own PDF Alchemist's laboratory:
- Clone this arcane repository:
git clone https://github.com/team-bitfuture/pdf-alchemist.git
- Enter the sacred circle:
cd pdf-alchemist
- Summon the required artifacts:
pip install -r requirements.txt
- Ensure you possess the Tesseract grimoire. If not, acquire it here.
To initiate the PDF transmutation ritual:
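import os

# main() is assumed to be defined earlier in the same script.
if __name__ == "__main__":
    pdf_path = "input.pdf"    # the PDF to transmute
    output_dir = "output"     # folder that receives the generated HTML pages
    os.makedirs(output_dir, exist_ok=True)
    main(pdf_path, output_dir)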
This will transmute your PDF into a series of HTML pages, complete with extracted text, images, and layout information.
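To picture what "a series of HTML pages" could look like on disk, here is a rough sketch that writes one file per page and chronicles progress with Python's built-in logging module. It is not the project's HTMLGenerator or ProgressTracker: the function `write_pages`, the logger name, and the `page_N.html` naming scheme are all assumptions made for illustration, and the page text is taken as already extracted.

```python
import logging
from html import escape
from pathlib import Path

logger = logging.getLogger("pdf_alchemist.sketch")  # hypothetical logger name


def write_pages(page_texts, output_dir):
    """Write one HTML file per page and log progress (hypothetical helper)."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    total = len(page_texts)
    written = []
    for number, text in enumerate(page_texts, start=1):
        # The page_N.html naming scheme is an assumption for illustration.
        target = out / f"page_{number}.html"
        document = (
            '<!DOCTYPE html>\n<html><head><meta charset="utf-8">'
            f"<title>Page {number}</title></head>"
            f"<body><p>{escape(text)}</p></body></html>\n"
        )
        target.write_text(document, encoding="utf-8")
        logger.info("Wrote page %d of %d: %s", number, total, target)
        written.append(target)
    return written


if __name__ == "__main__":
    import tempfile

    logging.basicConfig(level=logging.INFO)
    with tempfile.TemporaryDirectory() as tmp:
        paths = write_pages(["First page text", "Second <page> text"], tmp)
        print([p.name for p in paths])
```

Escaping the extracted text before writing it keeps stray angle brackets in a PDF from being interpreted as markup.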
To verify that your PDF Alchemist works as intended, run the test suite:
pytest tests/
This will execute a series of arcane trials, testing each component of the PDF Alchemist.
We welcome fellow arcane researchers to join our quest. If you wish to contribute:
- Fork the repository
- Create your feature branch (git checkout -b feature/MagicSpell)
- Commit your changes (git commit -m 'Add MagicSpell')
- Push to the branch (git push origin feature/MagicSpell)
- Open a Pull Request
This project is licensed under the GPL-3.0 License. See the LICENSE.md file for details.
- Kevin Ossenbrück - Archmage of PDF Transformation - ossenbrück.de
See also the list of contributors who participated in this arcane project.
- Website: team-bitfuture.de
May your PDFs always yield their secrets, and your HTML render with perfection. 📜🌐