ResumeRevealer: Advanced Resume Parsing
Demo (Streamlit)
demo_output.mp4
The primary challenge of the ResumeRevealer project is to develop a comprehensive resume parser capable of extracting detailed information from resumes in various formats, including PDF, JPG, HTML, DOC, etc. The parser should accurately classify text into distinct sections such as education, work experience, and skills. Additionally, it should sequence these sections based on dates wherever available.
In addition to parsing resumes, the ResumeRevealer project aims to improve standardization by aligning job titles and occupations with the O*NET database. This gives parsed resumes a consistent taxonomy, making it easier to analyze and compare candidate profiles.
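As a rough illustration of the title-alignment idea, the sketch below maps free-text job titles to a canonical list using fuzzy string matching from Python's standard library. The `CANONICAL_TITLES` list and the `normalize_title` helper are hypothetical names introduced here; a real implementation would load occupation titles from the O*NET database rather than a hard-coded sample, and this is not necessarily how ResumeRevealer does it.

```python
from difflib import get_close_matches
from typing import Optional

# Hypothetical sample of canonical titles; real data would come from O*NET.
CANONICAL_TITLES = [
    "Software Developers",
    "Data Scientists",
    "Web Developers",
    "Database Administrators",
]


def normalize_title(raw_title: str, cutoff: float = 0.6) -> Optional[str]:
    """Return the closest canonical title, or None if nothing is similar enough."""
    # Compare in lowercase, but return the title with its original casing.
    lowered = {title.lower(): title for title in CANONICAL_TITLES}
    matches = get_close_matches(
        raw_title.strip().lower(), list(lowered), n=1, cutoff=cutoff
    )
    return lowered[matches[0]] if matches else None


print(normalize_title("data scientist"))
print(normalize_title("Software Developer"))
```

Raising `cutoff` makes the matching stricter, which trades recall for fewer wrong alignments.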
An advanced feature of the ResumeRevealer project involves implementing a skill extraction mechanism. This feature mines detailed skills and competencies from project descriptions and position roles within the resume, highlighting the candidate's specific abilities and expertise. Abstractive skill extraction, if achieved, would be considered a bonus feature.
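A minimal sketch of extractive skill mining, assuming a fixed skill vocabulary: it scans a project or role description for known skill names. The `SKILL_VOCABULARY` list and the `extract_skills` function are hypothetical and only illustrate the idea; abstractive extraction would need a language model and is not covered here.

```python
import re
from typing import List

# Hypothetical skill vocabulary; a real system would use a much larger list.
SKILL_VOCABULARY = ["python", "sql", "docker", "machine learning", "streamlit"]


def extract_skills(text: str) -> List[str]:
    """Return the vocabulary skills mentioned in the text, in vocabulary order."""
    lowered = text.lower()
    found = []
    for skill in SKILL_VOCABULARY:
        # Lookarounds stop "sql" from matching inside a longer token such as "mysql".
        pattern = r"(?<![a-z0-9])" + re.escape(skill) + r"(?![a-z0-9])"
        if re.search(pattern, lowered):
            found.append(skill)
    return found


print(extract_skills("Built a Streamlit dashboard in Python backed by MySQL."))
```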
This repository contains the code for the ResumeRevealer project.
ResumeRevealer utilizes Dedoc for document processing.
Dedoc is an open universal system for converting documents to a unified output format. It extracts a document’s logical structure and content, its tables, text formatting and metadata. The document’s content is represented as a tree storing headings and lists of any level. Dedoc can be integrated in a document contents and structure analysis system as a separate module.
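To make the tree representation concrete, here is a small sketch that walks a nested dictionary depth first and prints each node indented by its level. The key names `text` and `subparagraphs` are assumptions modeled on Dedoc's JSON output, and the example tree is written by hand; check the keys against the output of the Dedoc version you use.

```python
from typing import Any, Dict, Iterator, Tuple


def walk_tree(node: Dict[str, Any], depth: int = 0) -> Iterator[Tuple[int, str]]:
    """Yield (depth, text) pairs for a node and all of its descendants, depth first."""
    yield depth, node.get("text", "")
    for child in node.get("subparagraphs", []):
        yield from walk_tree(child, depth + 1)


# Hand-written example tree; the key names are assumed, not produced by Dedoc here.
example_tree = {
    "text": "Resume",
    "subparagraphs": [
        {
            "text": "Education",
            "subparagraphs": [{"text": "B.Sc. 2020", "subparagraphs": []}],
        },
        {"text": "Skills", "subparagraphs": []},
    ],
}

for depth, text in walk_tree(example_tree):
    print("  " * depth + text)
```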
If you don't want to use Docker to run the application, you can run Dedoc locally. However, a local installation is not supported on every operating system (Ubuntu 20+ is recommended), and your machine may not have enough resources to run it. You need Python (python3.9 or python3.10 is recommended) and pip installed.
- An operating system such as Windows, macOS, or Linux
- A working Python installation
- We recommend using 64-bit Python 3.8
- A shell
- An editor
- The Git CLI
Clone the repo
git clone https://github.com/astro215/ResumeRevealer.git
cd into the project root folder
cd streamlit-Directory
Then you should create a virtual environment named .venv
python -m venv .venv
and activate the environment.
On Linux or macOS it's
source .venv/bin/activate
In a Windows Git Bash terminal it's
source .venv/Scripts/activate
In a Windows terminal it's
.venv\Scripts\activate.bat
Alternatively, with conda, create a virtual environment named awesome-streamlit
conda create -n awesome-streamlit python
and activate the environment.
conda activate awesome-streamlit
Then you should install the local requirements
pip install -r requirements.txt
and run the Streamlit app
streamlit run app.py
To run Dedoc locally, install the system packages it relies on:
sudo apt-get install -y libreoffice djvulibre-bin unzip unrar
The libreoffice and djvulibre-bin packages are used by converters (doc, odt to docx; xls, ods to xlsx; ppt, odp to pptx; djvu to pdf). If you don't need converters, you can skip this step. The unzip and unrar packages are used to extract archives.
To install Tesseract OCR 5, you can follow any tutorial, look at how Tesseract is installed for the dedoc container, or use the following commands to build it from source:
sudo apt-get update
sudo apt-get install -y automake binutils-dev build-essential ca-certificates clang g++ g++-multilib gcc-multilib libcairo2 libffi-dev \
libgdk-pixbuf2.0-0 libglib2.0-dev libjpeg-dev libleptonica-dev libpango-1.0-0 libpango1.0-dev libpangocairo-1.0-0 libpng-dev libsm6 \
libtesseract-dev libtool libxext6 make pkg-config poppler-utils pstotext shared-mime-info software-properties-common swig zlib1g-dev
sudo add-apt-repository -y ppa:alex-p/tesseract-ocr-devel
sudo apt-get update --allow-releaseinfo-change
sudo apt-get install -y tesseract-ocr tesseract-ocr-rus
git clone --depth 1 --branch 5.0.0-beta-20210916 https://github.com/tesseract-ocr/tesseract/
cd tesseract && ./autogen.sh && sudo ./configure && sudo make && sudo make install && sudo ldconfig && cd ..
export TESSDATA_PREFIX=/usr/share/tesseract-ocr/5/tessdata/
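After the build, a quick check like the one below can confirm that the `tesseract` binary is on the PATH and that `TESSDATA_PREFIX` is set. This is only a convenience sketch using the Python standard library; the `check_tesseract` helper is a hypothetical name and is not part of ResumeRevealer or Dedoc.

```python
import os
import shutil
from typing import Dict, Optional


def check_tesseract() -> Dict[str, Optional[str]]:
    """Report where the tesseract binary is and what TESSDATA_PREFIX is set to."""
    return {
        "tesseract_path": shutil.which("tesseract"),  # None if not on PATH
        "tessdata_prefix": os.environ.get("TESSDATA_PREFIX"),  # None if unset
    }


if __name__ == "__main__":
    for name, value in check_tesseract().items():
        print(f"{name}: {value or 'not found'}")
```

Note that `export` only affects the current shell session, so the variable has to be set again (or added to your shell profile) in new terminals.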
To use the ResumeRevealer parser, follow these steps:
- Import the necessary libraries and create a DedocManager instance:
from dedoc import DedocManager
manager = DedocManager()
- Process a single file using the process_file_with_dedoc function:
file_path = "path/to/your/file.extension"
output_data = process_file_with_dedoc(file_path)
if output_data:
    write_output_to_json(output_data)
- Process all files in a directory using the process_folder_with_dedoc function:
folder_path = "path/to/your/folder"
process_folder_with_dedoc(folder_path)
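The folder-processing step can be sketched roughly as follows. This is not the project's actual `process_folder_with_dedoc`; the `process_folder` function, its `parse_file` callback, and the `output_dir` default are hypothetical names chosen for illustration. The parser is passed in as a callable so the loop does not depend on a particular Dedoc version.

```python
import json
from pathlib import Path
from typing import Any, Callable, Dict, Optional


def process_folder(
    folder_path: str,
    parse_file: Callable[[str], Optional[Dict[str, Any]]],
    output_dir: str = "output",
) -> int:
    """Parse every regular file in folder_path and write one JSON file per success."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = 0
    for path in sorted(Path(folder_path).iterdir()):
        if not path.is_file():
            continue  # skip subdirectories
        try:
            data = parse_file(str(path))
        except Exception as error:  # keep going if one resume fails to parse
            print(f"Skipping {path.name}: {error}")
            continue
        if data:
            # Keep the full file name so cv.pdf and cv.docx do not overwrite each other.
            target = out / f"{path.name}.json"
            target.write_text(
                json.dumps(data, ensure_ascii=False, indent=2), encoding="utf-8"
            )
            written += 1
    return written
```

With Dedoc, `parse_file` could wrap a call to `manager.parse(file_path)` and convert the result to a dictionary; how the result is converted depends on the Dedoc version, so that part is left out here.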
The ResumeRevealer parser performs the following tasks:
- Extracts structured data from resumes in various formats (PDF, JPG, HTML, DOC, etc.).
- Classifies text into distinct sections such as education, work experience, and skills.
- Sequences sections based on dates, where available.
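The classification and sequencing tasks above can be sketched with simple rules. This is a minimal, assumption-laden example rather than the project's implementation: the `SECTION_KEYWORDS` table and the `classify_heading`, `latest_year`, and `sequence_entries` helpers are hypothetical, headings are matched by keyword, and entries are ordered by the most recent four-digit year they contain.

```python
import re
from typing import Dict, List, Optional

# Hypothetical heading keywords for each section label.
SECTION_KEYWORDS: Dict[str, List[str]] = {
    "education": ["education", "academic"],
    "work_experience": ["experience", "employment"],
    "skills": ["skills", "technologies"],
}


def classify_heading(heading: str) -> Optional[str]:
    """Map a heading to a section label, or None if no keyword matches."""
    lowered = heading.lower()
    for label, keywords in SECTION_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return label
    return None


def latest_year(entry: str) -> int:
    """Return the largest four-digit year in the entry, or 0 if there is none."""
    years = [int(year) for year in re.findall(r"\b(?:19|20)\d{2}\b", entry)]
    return max(years, default=0)


def sequence_entries(entries: List[str]) -> List[str]:
    """Sort entries from most recent to oldest; undated entries go last."""
    return sorted(entries, key=latest_year, reverse=True)


print(classify_heading("Work Experience"))
print(sequence_entries(["Intern, 2018-2019", "Engineer, 2021-2023", "Volunteer"]))
```

Keyword rules like these break down on unusual headings and date formats, which is one reason accuracy depends on how the input resume is formatted.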
Please note that the current implementation has the following limitations:
- The parser might not work on all operating systems (Ubuntu 20+ is recommended).
- The parser may require significant machine resources for optimal performance.
- The accuracy of text classification and date sequencing depends on the quality and formatting of the input resumes.
Contributions to the ResumeRevealer project are welcome! To contribute, please follow these steps:
- Fork the repository.
- Create a new branch for your feature or bug fix.
- Make the necessary changes and commit them.
- Submit a pull request detailing your changes.
- @Jainil Patel - Contributor
- @Divyam Kumar - Contributor
- @Amitesh Patra - Contributor