1st Place Solution for the Conversational Multi-Doc QA Workshop & International Challenge @ WSDM'24 - Xiaohongshu.Inc
This repo contains the source code of our solution for the WSDM Cup 2024 challenge: Conversational Multi-Doc QA.
Please refer to our paper for the details of the solution: The First Place Solution of WSDM Cup 2024: Leveraging Large Language Models for Conversational Multi-Doc QA
- SOLAR-10.7B-Instruct backbone
- Hybrid Training
- Noisy Document Filter
- Model Ensemble
- Follow the Installation guide of modelscope/swift to install swift.
- Install vllm
- Install deepspeed
- Install sklearn
- Install SentenceTransformers
Alternatively, run the following commands (tested on a V100 32G with CUDA 11.8, Ubuntu 20.04.1):
conda create -n swift python=3.10
conda activate swift
pip install ms-swift[all] -U
pip install vllm==0.3.1
pip install deepspeed
pip install scikit-learn
pip install sentence_transformers
Main package versions:
python==3.10.13
ms-swift==1.6.1
scikit-learn==1.4.1.post1
sentence-transformers==2.3.1
torch==2.1.2
transformers==4.37.2
vllm==0.3.1
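To confirm that an environment matches the versions listed above, a small check along these lines can help. This is a minimal sketch and not part of the repo: `EXPECTED` and `check_versions` are hypothetical names, and the distribution names are assumed to match those on PyPI. Local build suffixes such as `+cu118` are ignored when comparing.

```python
from importlib import metadata

# Versions copied from the list above (distribution names assumed to match PyPI).
EXPECTED = {
    "ms-swift": "1.6.1",
    "scikit-learn": "1.4.1.post1",
    "sentence-transformers": "2.3.1",
    "torch": "2.1.2",
    "transformers": "4.37.2",
    "vllm": "0.3.1",
}


def check_versions(expected=EXPECTED):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, wanted in expected.items():
        try:
            # Drop a local build suffix such as "+cu118" before comparing.
            installed = metadata.version(name).split("+")[0]
        except metadata.PackageNotFoundError:
            installed = None  # the package is not installed at all
        if installed != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, installed) in check_versions().items():
        print(f"{name}: expected {wanted}, found {installed}")
```

An empty result means every listed package is installed at the listed version.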
- preprocess/data_format.py: Format the data required for training and evaluation
- preprocess/data_format_Pseudo.py: Build the data for hybrid training
- preprocess/score_train_eval(test).py: Calculate scores for the noisy document filter
- preprocess/score_order.py: Interactive code to delete noisy documents
Training and inference use the LLM framework ms-swift by ModelScope:
- runsh/solar_instruct_sft_template.sh
- runsh/solar_instruct_infer_template.sh
- merge/calculate_score.py: Calculate scores for ensemble learning
- merge/merge_score.py: Ensemble the results
- keyword: Attempts to directly generate keywords or answers with GPT
- multi_stage: Multi-stage LLM attempt (did not work)
You can find all intermediate files in the result folder.
- Download the pretrained models from Huggingface:
  - upstage/SOLAR-10.7B-Instruct-v1.0 (10.7 B)
  - nomic-ai/nomic-embed-text-v1 (0.14 B)
- Download our 8 finetuned LoRA adapters from our Huggingface repository (0.03 B each).
  Our total model size is therefore 10.7B + 0.14B + 0.03B * 8 = 11.08B, well below 14 billion (14B) parameters.
- Put them in the right folders. The layout should look as follows:
├── checkpoints
│   ├── v08-20240205-114459/
│   ├── v10-20240205-114325/
│   ├── v13-20240202-072530/
│   ├── v13-20240206-111010/
│   ├── v16-20240206-224659/
│   ├── v27-20240209-133614/
│   ├── v33-20240210-002918/
│   └── v35-20240210-120550/
└── pretrained
    ├── nomic-ai/nomic-embed-text-v1/
    │   ├── 1_Pooling/
    │   ├── config.json
    │   ├── config_sentence_transformers.json
    │   ├── configuration_hf_nomic_bert.py
    │   ├── .gitattributes
    │   ├── .locks/
    │   ├── modeling_hf_nomic_bert.py
    │   ├── model.safetensors
    │   ├── modules.json
    │   ├── onnx/
    │   ├── pytorch_model.bin
    │   ├── README.md
    │   ├── sentence_bert_config.json
    │   ├── special_tokens_map.json
    │   ├── tokenizer_config.json
    │   ├── tokenizer.json
    │   └── vocab.txt
    └── upstage/SOLAR-10.7B-Instruct-v1.0/
        ├── config.json
        ├── generation_config.json
        ├── .gitattributes
        ├── .locks/
        ├── model-00001-of-00005.safetensors
        ├── model-00002-of-00005.safetensors
        ├── model-00003-of-00005.safetensors
        ├── model-00004-of-00005.safetensors
        ├── model-00005-of-00005.safetensors
        ├── model.safetensors.index.json
        ├── README.md
        ├── solar_logo.png
        ├── tokenizer_config.json
        ├── tokenizer.json
        └── tokenizer.model
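Before running inference, it can be useful to verify that the folders shown above are in place. The following is a minimal sketch rather than a script shipped with the repo: `missing_dirs` is a hypothetical helper, and it assumes that `checkpoints` and `pretrained` both sit directly under the directory passed as `root`.

```python
from pathlib import Path

# Folder names copied from the layout above.
CHECKPOINTS = [
    "v08-20240205-114459",
    "v10-20240205-114325",
    "v13-20240202-072530",
    "v13-20240206-111010",
    "v16-20240206-224659",
    "v27-20240209-133614",
    "v33-20240210-002918",
    "v35-20240210-120550",
]
PRETRAINED = [
    "nomic-ai/nomic-embed-text-v1",
    "upstage/SOLAR-10.7B-Instruct-v1.0",
]


def missing_dirs(root="."):
    """Return the expected model folders that do not exist under root."""
    root = Path(root)
    wanted = [root / "checkpoints" / name for name in CHECKPOINTS]
    wanted += [root / "pretrained" / name for name in PRETRAINED]
    return [str(path) for path in wanted if not path.is_dir()]


if __name__ == "__main__":
    for path in missing_dirs():
        print("missing:", path)
```

The check only looks at folder names; it does not validate the files inside each folder.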
Run python data_format.py to preprocess the original test data.
Then run the shell scripts in the runsh folder:
bash runsh/v08-20240205-114459.sh
bash runsh/v10-20240205-114325.sh
bash runsh/v13-20240202-072530.sh
bash runsh/v13-20240206-111010.sh
bash runsh/v16-20240206-224659.sh
bash runsh/v27-20240209-133614.sh
bash runsh/v33-20240210-002918.sh
bash runsh/v35-20240210-120550.sh
- You can change the CUDA device at the beginning of each shell script (CUDA_VISIBLE_DEVICES=).
- The result files are saved in the merge folder, which should look as follows:
└── merge
    ├── v08-20240205-114459.jsonl
    ├── v10-20240205-114325.jsonl
    ├── v13-20240202-072530.jsonl
    ├── v13-20240206-111010.jsonl
    ├── v16-20240206-224659.jsonl
    ├── v27-20240209-133614.jsonl
    ├── v33-20240210-002918.jsonl
    └── v35-20240210-120550.jsonl
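The eight inference scripts can also be driven from one place. The sketch below is an illustration, not part of the repo: `run_all` is a hypothetical helper that assumes it is started from the repo root, runs the scripts one after another, and stops as soon as a script fails or its result file does not appear in the merge folder. With `dry_run=True` (the default) it only lists the commands.

```python
import subprocess
from pathlib import Path

# Run names copied from the script list above.
RUNS = [
    "v08-20240205-114459",
    "v10-20240205-114325",
    "v13-20240202-072530",
    "v13-20240206-111010",
    "v16-20240206-224659",
    "v27-20240209-133614",
    "v33-20240210-002918",
    "v35-20240210-120550",
]


def run_all(repo_root=".", dry_run=True):
    """Run each inference script in order; return the commands run (or planned)."""
    root = Path(repo_root)
    commands = []
    for name in RUNS:
        cmd = ["bash", f"runsh/{name}.sh"]
        commands.append(cmd)
        if dry_run:
            continue
        # check=True raises CalledProcessError if the script exits with an error.
        subprocess.run(cmd, cwd=root, check=True)
        result = root / "merge" / f"{name}.jsonl"
        if not result.is_file():
            raise FileNotFoundError(f"expected result file {result}")
    return commands


if __name__ == "__main__":
    for cmd in run_all(dry_run=True):
        print(" ".join(cmd))
```

The GPU selection is left to the `CUDA_VISIBLE_DEVICES=` line inside each script, as described above.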
The individual scores of these eight result files are as follows:
File | Word-level ROUGE-L | Character-level ROUGE-L | Keywords Recall |
---|---|---|---|
v08-20240205-114459 | 0.45532953438881013 | 0.6143454883849857 | 0.6824189095928223 |
v10-20240205-114325 | 0.456275615214309 | 0.6149276913541135 | 0.6817805383022769 |
v13-20240202-072530 | 0.4554468517276402 | 0.6141346993379754 | 0.6827095609704305 |
v13-20240206-111010 | 0.456388581088847 | 0.6149210447203279 | 0.6840088655306036 |
v16-20240206-224659 | 0.45375515045837794 | 0.613359666771279 | 0.6879538939321544 |
v27-20240209-133614 | 0.45574561117381773 | 0.6145520850027292 | 0.6826942984551678 |
v33-20240210-002918 | 0.4559195951083145 | 0.6141543510329665 | 0.6865596963423041 |
v35-20240210-120550 | 0.45573339341665703 | 0.614208192382808 | 0.6813332802463232 |
So even without ensembling, each of them is still well ahead of the second-place entry.
First, calculate the embedding scores:
python calculate_score.py
Note that this program is accelerated by torch.multiprocessing; you can change the number of processes near num_group = 16 (this value works well on a V100 32G).
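To illustrate what a setting like num_group controls, here is a minimal sketch of splitting work into groups and scoring each group in its own process. It is not the code of calculate_score.py: `split_into_groups`, `score_group`, and `score_in_parallel` are hypothetical names, the worker is a placeholder, and the standard `multiprocessing` module stands in for `torch.multiprocessing`, which wraps it with a compatible API.

```python
import multiprocessing as mp


def split_into_groups(items, num_group):
    """Split items into at most num_group contiguous, near-equal chunks."""
    num_group = max(1, min(num_group, len(items)))
    size, extra = divmod(len(items), num_group)
    groups, start = [], 0
    for i in range(num_group):
        # The first `extra` groups take one additional item each.
        end = start + size + (1 if i < extra else 0)
        groups.append(items[start:end])
        start = end
    return groups


def score_group(group):
    """Placeholder worker (hypothetical): the real scoring would go here."""
    return [len(text) for text in group]


def score_in_parallel(items, num_group=16):
    """Score every item with one worker process per group, keeping input order."""
    groups = split_into_groups(items, num_group)
    with mp.Pool(processes=len(groups)) as pool:
        results = pool.map(score_group, groups)
    return [score for group in results for score in group]
```

For example, `score_in_parallel(texts, num_group=8)` would use eight processes; lowering the value reduces memory pressure at the cost of speed.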
Then generate the final result:
python merge_score.py
This generates emb_a_s_8_0_1_2_3_4_5_6_7.zip in the root folder, which is our final result.
Word-level ROUGE-L | Character-level ROUGE-L | Keywords Recall |
---|---|---|
0.465360141853671 | 0.6208371209722543 | 0.6953475871954128 |
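The gain from ensembling can be read off by comparing these scores with the best single run per metric. The sketch below does that arithmetic with the numbers copied from the two result tables; `gains_over_best_single` is a hypothetical helper written for this illustration.

```python
# Scores copied from the tables above, in the order:
# (word-level ROUGE-L, character-level ROUGE-L, keywords recall).
SINGLE = {
    "v08-20240205-114459": (0.45532953438881013, 0.6143454883849857, 0.6824189095928223),
    "v10-20240205-114325": (0.456275615214309, 0.6149276913541135, 0.6817805383022769),
    "v13-20240202-072530": (0.4554468517276402, 0.6141346993379754, 0.6827095609704305),
    "v13-20240206-111010": (0.456388581088847, 0.6149210447203279, 0.6840088655306036),
    "v16-20240206-224659": (0.45375515045837794, 0.613359666771279, 0.6879538939321544),
    "v27-20240209-133614": (0.45574561117381773, 0.6145520850027292, 0.6826942984551678),
    "v33-20240210-002918": (0.4559195951083145, 0.6141543510329665, 0.6865596963423041),
    "v35-20240210-120550": (0.45573339341665703, 0.614208192382808, 0.6813332802463232),
}
ENSEMBLE = (0.465360141853671, 0.6208371209722543, 0.6953475871954128)
METRICS = ("Word-level ROUGE-L", "Character-level ROUGE-L", "Keywords Recall")


def gains_over_best_single(single=SINGLE, ensemble=ENSEMBLE):
    """Per metric, return (best single run, its score, ensemble minus that score)."""
    out = []
    for i, ens in enumerate(ensemble):
        best_run = max(single, key=lambda run: single[run][i])
        best = single[best_run][i]
        out.append((best_run, best, ens - best))
    return out


if __name__ == "__main__":
    for metric, (run, best, gain) in zip(METRICS, gains_over_best_single()):
        print(f"{metric}: best single {run} = {best:.4f}, ensemble gain = {gain:+.4f}")
```

On all three metrics the ensemble is above the best single run, by roughly 0.006 to 0.009.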
If you find our work helpful, please consider citing the following paper:
@misc{li2024place,
title={The First Place Solution of WSDM Cup 2024: Leveraging Large Language Models for Conversational Multi-Doc QA},
author={Yiming Li and Zhao Zhang},
year={2024},
eprint={2402.18385},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
Zhao Zhang: [email protected]
Yiming Li: [email protected]