ResumeRevealer: Advanced Resume Parsing Challenge

Demo (Streamlit)

demo_output.mp4

Primary Challenge

The primary challenge of the ResumeRevealer project is to develop a comprehensive resume parser capable of extracting detailed information from resumes in various formats, including PDF, JPG, HTML, DOC, etc. The parser should accurately classify text into distinct sections such as education, work experience, and skills. Additionally, it should sequence these sections based on dates wherever available.
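
As a rough illustration of the section classification described above, the following sketch groups plain-text resume lines under recognized headings. It is a minimal keyword-based approach, not the project's actual method; `SECTION_KEYWORDS`, `classify_heading`, and `split_into_sections` are hypothetical names introduced only for this example.

```python
import re

# Hypothetical heading keywords; a real parser would need a richer list.
SECTION_KEYWORDS = {
    "education": ("education", "academic"),
    "work_experience": ("experience", "employment", "work history"),
    "skills": ("skills", "technologies"),
}


def classify_heading(line):
    """Return the section name if the line looks like a section heading, else None."""
    text = re.sub(r"[^a-z ]", "", line.lower()).strip()
    # Treat only short lines as headings to avoid matching body sentences.
    if not text or len(text.split()) > 4:
        return None
    for section, keywords in SECTION_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return section
    return None


def split_into_sections(lines):
    """Group resume lines under the most recent recognized heading."""
    sections = {"other": []}
    current = "other"
    for line in lines:
        section = classify_heading(line)
        if section is not None:
            current = section
            sections.setdefault(current, [])
        elif line.strip():
            sections[current].append(line.strip())
    return sections


resume_lines = [
    "Jane Doe",
    "EDUCATION",
    "B.Tech, Computer Science, 2018-2022",
    "Work Experience",
    "Data Analyst, Acme Corp, 2022-2024",
    "Skills:",
    "Python, SQL",
]
sections = split_into_sections(resume_lines)
print(sections)
```

A short body line that happens to contain a keyword (for example "Python skills") would be mistaken for a heading here, which is one reason a keyword list alone is unlikely to be enough for real resumes.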

Standardization Challenge

In addition to parsing resumes, the ResumeRevealer project aims to enhance standardization by aligning different job titles and occupations against the O*NET database. This process ensures a consistent taxonomy across parsed resumes, making it easier to analyze and compare candidate profiles.
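
One simple way to picture this alignment is fuzzy string matching of a raw job title against a list of standardized titles. The sketch below uses Python's standard `difflib` module; it is an assumption about how such matching could work, not the project's implementation. `STANDARD_TITLES` is a small hand-written sample and `standardize_title` is a hypothetical helper.

```python
import difflib

# Hypothetical sample of standardized titles; a real run would load the
# full occupation list from the O*NET database instead.
STANDARD_TITLES = [
    "Software Developers",
    "Data Scientists",
    "Accountants and Auditors",
    "Registered Nurses",
]


def standardize_title(raw_title, cutoff=0.5):
    """Return the closest standardized title, or None if nothing is similar enough."""
    lowered = {title.lower(): title for title in STANDARD_TITLES}
    matches = difflib.get_close_matches(
        raw_title.lower(), list(lowered), n=1, cutoff=cutoff
    )
    return lowered[matches[0]] if matches else None


print(standardize_title("Data Scientist"))
print(standardize_title("Senior Software Developer"))
```

Character-level similarity misses synonyms (for example "Programmer" versus "Software Developers"), so a production matcher would likely need alternate titles or embeddings on top of this.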

Skill Extraction Challenge

An advanced feature of the ResumeRevealer project involves implementing a skill extraction mechanism. This feature mines detailed skills and competencies from project descriptions and position roles within the resume, highlighting the candidate's specific abilities and expertise. Abstractive skill extraction, if achieved, would be considered a bonus feature.
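
For the extractive case, a minimal sketch is to look up known skill names in a project or role description. The vocabulary and the `extract_skills` helper below are hypothetical and only illustrate the idea; abstractive extraction (inferring skills that are not named in the text) would need a language model and is not shown.

```python
import re

# Hypothetical skill vocabulary; not taken from the project.
SKILL_VOCABULARY = ["Python", "SQL", "Docker", "machine learning", "REST APIs"]


def extract_skills(description):
    """Return vocabulary skills mentioned in the text, in vocabulary order."""
    found = []
    for skill in SKILL_VOCABULARY:
        # Guard both ends so "SQL" does not match inside "MySQL".
        pattern = r"(?<![A-Za-z0-9])" + re.escape(skill) + r"(?![A-Za-z0-9])"
        if re.search(pattern, description, flags=re.IGNORECASE):
            found.append(skill)
    return found


description = (
    "Built REST APIs in Python and deployed a machine learning model "
    "with Docker on MySQL."
)
skills = extract_skills(description)
print(skills)
```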

ResumeRevealer: Advanced Resume Parsing Challenge - Primary Challenge

This repository contains the code for the ResumeRevealer project, which aims to develop a comprehensive resume parser that can extract detailed information from resumes in various formats such as PDF, JPG, HTML, DOC, etc. The parser classifies text into distinct sections such as education, work experience, and skills, and sequences them by date where dates are available.

ResumeRevealer utilizes Dedoc for document processing.

Dedoc is an open universal system for converting documents to a unified output format. It extracts a document’s logical structure and content, its tables, text formatting and metadata. The document’s content is represented as a tree storing headings and lists of any level. Dedoc can be integrated in a document contents and structure analysis system as a separate module.
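
To make the tree representation concrete, the sketch below walks a nested dictionary and lists each node with its depth. The field names (`text`, `metadata.paragraph_type`, `subparagraphs`) are an assumption about the shape of the JSON output and should be checked against the Dedoc version in use; the `tree` value is a hand-written stand-in, not real parser output.

```python
def iter_nodes(node, depth=0):
    """Yield (depth, paragraph_type, text) for a node and all its descendants."""
    paragraph_type = node.get("metadata", {}).get("paragraph_type", "unknown")
    yield depth, paragraph_type, node.get("text", "")
    for child in node.get("subparagraphs", []):
        yield from iter_nodes(child, depth + 1)


# Hand-written stand-in for a parsed document tree (assumed shape).
tree = {
    "text": "",
    "metadata": {"paragraph_type": "root"},
    "subparagraphs": [
        {
            "text": "Education",
            "metadata": {"paragraph_type": "header"},
            "subparagraphs": [
                {
                    "text": "B.Tech, 2022",
                    "metadata": {"paragraph_type": "list_item"},
                    "subparagraphs": [],
                },
            ],
        },
    ],
}

outline = list(iter_nodes(tree))
for depth, paragraph_type, text in outline:
    print("  " * depth + f"[{paragraph_type}] {text}")
```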

Environments

Get started in seconds with our verified environments.

Installation

If you don't want to use Docker to run the application, you can run dedoc locally. However, a local installation is not supported on every operating system (Ubuntu 20+ is recommended), and your machine may not have enough resources to run it. You need Python (3.9 or 3.10 is recommended) and pip installed.

Getting Started with the Awesome Streamlit Repository

Prerequisites

An operating system such as Windows, macOS, or Linux
A working Python installation
- We recommend 64-bit Python 3.8
A shell
- We recommend Git Bash for Windows 8.1
- We recommend WSL for Windows 10 or 11
An editor
- We recommend VS Code (preferred) or PyCharm
The Git CLI

Environment Installation

Clone the repo

git clone https://github.com/astro215/ResumeRevealer.git

cd into the project root folder

cd streamlit-Directory

Create virtual environment

via python

Then you should create a virtual environment named .venv

python -m venv .venv

and activate the environment.

On Linux or macOS it's

source .venv/bin/activate

In a Windows Git Bash terminal it's

source .venv/Scripts/activate

In a Windows terminal it's

.venv\Scripts\activate.bat

or via anaconda

Create a virtual environment named awesome-streamlit

conda create -n awesome-streamlit python

and activate the environment.

conda activate awesome-streamlit

If you are on Windows, you need to install some things required by GeoPandas by following these instructions.

Then you should install the local requirements

pip install -r requirements.txt

Build and run the Application Locally

streamlit run app.py

1. Install necessary packages:

sudo apt-get install -y libreoffice djvulibre-bin unzip unrar

libreoffice and djvulibre-bin packages are used by converters (doc, odt to docx; xls, ods to xlsx; ppt, odp to pptx; djvu to pdf). If you don't need converters, you can skip this step. unzip and unrar packages are used in the process of extracting archives.

2. Install `Tesseract OCR 5` framework:

You can follow any tutorial for this purpose, look here for an example of installing Tesseract in the dedoc container, or use the following commands to build Tesseract OCR 5 from source:

2.1. Install compilers and libraries required by the Tesseract OCR:

sudo apt-get update
sudo apt-get install -y automake binutils-dev build-essential ca-certificates clang g++ g++-multilib gcc-multilib libcairo2 libffi-dev \
libgdk-pixbuf2.0-0 libglib2.0-dev libjpeg-dev libleptonica-dev libpango-1.0-0 libpango1.0-dev libpangocairo-1.0-0 libpng-dev libsm6 \
libtesseract-dev libtool libxext6 make pkg-config poppler-utils pstotext shared-mime-info software-properties-common swig zlib1g-dev

2.2. Build Tesseract from sources:

sudo add-apt-repository -y ppa:alex-p/tesseract-ocr-devel
sudo apt-get update --allow-releaseinfo-change
sudo apt-get install -y tesseract-ocr tesseract-ocr-rus
git clone --depth 1 --branch 5.0.0-beta-20210916 https://github.com/tesseract-ocr/tesseract/
cd tesseract && ./autogen.sh && sudo ./configure && sudo make && sudo make install && sudo ldconfig && cd ..
export TESSDATA_PREFIX=/usr/share/tesseract-ocr/5/tessdata/
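
After the steps above, a quick check that the external tools are on the PATH can save debugging time later. This is a small sketch using only the standard library. The command names are assumptions: `soffice` is the usual LibreOffice launcher and `ddjvu` ships with djvulibre-bin, but they may differ on your system.

```python
import shutil

# Command names are assumptions: "soffice" is the usual LibreOffice launcher
# and "ddjvu" ships with djvulibre-bin.
REQUIRED_TOOLS = ["tesseract", "soffice", "ddjvu", "unzip", "unrar"]


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools from the list that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = find_missing_tools()
print("Missing tools:", ", ".join(missing) if missing else "none")
```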

Usage

To use the ResumeRevealer parser, follow these steps:

Import the necessary libraries and create a DedocManager instance:

from dedoc import DedocManager

manager = DedocManager()

Process a single file using the process_file_with_dedoc function:

file_path = "path/to/your/file.extension"
output_data = process_file_with_dedoc(file_path)
if output_data:
    write_output_to_json(output_data)

Process all files in a directory using the process_folder_with_dedoc function:

folder_path = "path/to/your/folder"
process_folder_with_dedoc(folder_path)
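
The definitions of the helper functions used above are not shown in this README. As a rough sketch of what folder processing and JSON writing could look like, the example below takes the parsing step as a function argument so it runs without Dedoc installed. `write_json`, `process_folder`, `SUPPORTED_EXTENSIONS`, and the `parse_file` callback are all hypothetical names, not the project's API.

```python
import json
from pathlib import Path

# Assumed set of extensions to pick up; adjust to the formats you need.
SUPPORTED_EXTENSIONS = {".pdf", ".jpg", ".html", ".doc", ".docx"}


def write_json(data, output_path):
    """Write parsed data to a UTF-8 JSON file, creating parent folders."""
    output_path = Path(output_path)
    output_path.parent.mkdir(parents=True, exist_ok=True)
    output_path.write_text(
        json.dumps(data, ensure_ascii=False, indent=2), encoding="utf-8"
    )


def process_folder(folder_path, parse_file, output_dir="output"):
    """Parse every supported file in a folder and write one JSON file per input."""
    written = []
    for path in sorted(Path(folder_path).iterdir()):
        if not path.is_file() or path.suffix.lower() not in SUPPORTED_EXTENSIONS:
            continue
        try:
            data = parse_file(path)
        except Exception as error:  # keep going if one file fails
            print(f"Skipping {path.name}: {error}")
            continue
        target = Path(output_dir) / f"{path.stem}.json"
        write_json(data, target)
        written.append(target)
    return written
```

In practice `parse_file` would wrap whatever call turns a document into a dictionary; passing it in keeps the file handling testable on its own.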

Functionality

The ResumeRevealer parser performs the following tasks:

Extracts structured data from resumes in various formats (PDF, JPG, HTML, DOC, etc.).
Classifies text into distinct sections such as education, work experience, and skills.
Sequences sections based on dates, where available.
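
The date-based sequencing in the last item can be sketched as follows: pull four-digit years out of each entry and sort newest first, leaving undated entries at the end. This is a simplified illustration with hypothetical helper names; it ignores months and open-ended ranges such as "2022-Present", which a real parser would need to handle.

```python
import re

YEAR_PATTERN = re.compile(r"\b(?:19|20)\d{2}\b")


def latest_year(entry):
    """Return the largest four-digit year in the text, or None if there is none."""
    years = [int(year) for year in YEAR_PATTERN.findall(entry)]
    return max(years) if years else None


def sequence_entries(entries):
    """Sort entries newest first; entries without a year keep their order at the end."""
    dated = [e for e in entries if latest_year(e) is not None]
    undated = [e for e in entries if latest_year(e) is None]
    return sorted(dated, key=latest_year, reverse=True) + undated


entries = [
    "Intern, Beta Ltd, 2019-2020",
    "Volunteer tutor",
    "Analyst, Acme Corp, 2022-2024",
    "B.Tech, 2018-2022",
]
ordered = sequence_entries(entries)
print(ordered)
```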

Limitations

Please note that the current implementation has the following limitations:

The parser might not work on all operating systems (Ubuntu 20+ is recommended).
The parser may require significant machine resources for optimal performance.
The accuracy of text classification and date sequencing depends on the quality and formatting of the input resumes.

Contributing

Contributions to the ResumeRevealer project are welcome! To contribute, please follow these steps:

Fork the repository.
Create a new branch for your feature or bug fix.
Make the necessary changes and commit them.
Submit a pull request detailing your changes.

✍️ Authors

@Jainil Patel - Contributor
@Divyam Kumar - Contributor
@Amitesh Patra - Contributor
