Processing Textbooks

We place the original PDF files under data/textbooks_pdf.

Run:

python data_processing/basic_data/textbooks/get_orig_pdf_paths.py

Run:

python data_processing/basic_data/textbooks/filter_pdf_and_get_text_fitz.py

This script removes the paths of PDF files that are duplicated or damaged.
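As a rough illustration of this kind of filtering, here is a minimal sketch, not the repository's actual script. The script name suggests it uses PyMuPDF (imported as fitz); the sketch below does not, and relies only on the standard library. It assumes that "duplicated" means byte-identical content and that "damaged" can be approximated by an unreadable file or a missing %PDF- header. The function names are hypothetical.

```python
import hashlib


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def looks_like_pdf(path):
    """Crude damage check (an assumption): the file starts with the PDF magic bytes."""
    with open(path, "rb") as f:
        return f.read(5) == b"%PDF-"


def filter_pdf_paths(paths):
    """Keep the first path for each distinct file content; skip unreadable or non-PDF files."""
    seen_hashes = set()
    kept = []
    for path in paths:
        try:
            if not looks_like_pdf(path):
                continue
            file_hash = file_sha256(path)
        except OSError:
            # Missing or unreadable files are treated as damaged.
            continue
        if file_hash in seen_hashes:
            continue
        seen_hashes.add(file_hash)
        kept.append(path)
    return kept
```

A header check only catches grossly broken files. Opening each document with a PDF library would catch more kinds of damage, at the cost of speed.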

This step prepares for the Nougat conversion. The file paths are divided into 30 .txt files so that the conversion can be run in parallel.

Run:

python data_processing/basic_data/textbooks/write_file_paths_to_txt.py
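The idea of dividing the paths into 30 .txt files can be sketched as follows. This is an illustration under stated assumptions, not the contents of write_file_paths_to_txt.py: the round-robin assignment, the output file names (paths_0.txt, paths_1.txt, ...), and the function name write_shards are all hypothetical.

```python
from pathlib import Path


def write_shards(paths, out_dir, num_shards=30):
    """Split a list of paths into num_shards text files, one path per line."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    shard_files = []
    for idx in range(num_shards):
        # Round-robin assignment keeps shard sizes within one item of each other.
        shard = paths[idx::num_shards]
        shard_file = out_dir / f"paths_{idx}.txt"  # hypothetical file name
        shard_file.write_text("".join(f"{p}\n" for p in shard), encoding="utf-8")
        shard_files.append(shard_file)
    return shard_files
```

Round-robin splitting balances the number of files per shard but not the number of pages, so some shards may still take longer to convert than others.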

Run:

bash data_processing/basic_data/textbooks/convert_multi_files.sh $IDX

Replace $IDX with the index of the parallel process.
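If you prefer to start every process from one place, a small launcher along these lines may help. It is a sketch only: it assumes the indices run from 0 to 29, which the instructions above do not state, and the helper names build_commands and launch_all are hypothetical.

```python
import subprocess

SCRIPT = "data_processing/basic_data/textbooks/convert_multi_files.sh"


def build_commands(num_shards=30, start=0):
    """Build one 'bash convert_multi_files.sh IDX' command per shard index.

    Assumption: indices are consecutive integers beginning at `start`.
    """
    return [["bash", SCRIPT, str(idx)] for idx in range(start, start + num_shards)]


def launch_all(commands):
    """Start every command at once, wait for all of them, and return their exit codes."""
    procs = [subprocess.Popen(cmd) for cmd in commands]
    return [proc.wait() for proc in procs]
```

Calling launch_all(build_commands()) from the repository root would start all 30 processes at the same time. On a single machine that may exceed the available GPU memory, so launching a few indices at a time can be the safer choice.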

Run:

python data_processing/basic_data/textbooks/get_jsonl_from_mmd.py

This generates the final jsonl file at data/textbooks_filtered/textbooks_nougat-converted_final.jsonl.
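To inspect the result, the file can be read line by line. The sketch below assumes only that each non-empty line is one JSON value, which is what the jsonl format implies; the field names inside each record are not documented here, so the code does not rely on any. The helper name iter_jsonl is hypothetical.

```python
import json


def iter_jsonl(path):
    """Yield one parsed JSON value per non-empty line of a .jsonl file."""
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            yield json.loads(line)
```

For example, sum(1 for _ in iter_jsonl("data/textbooks_filtered/textbooks_nougat-converted_final.jsonl")) counts the records without loading the whole file into memory.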
