
A question #2

Open
GCTTTTTT opened this issue Jul 7, 2022 · 7 comments

Comments

@GCTTTTTT commented Jul 7, 2022

I want to ask: what is the origin of predicted_label in MAG_candidates.json?

@yuzhimanhua (Owner)

Hi,

Those "predicted labels" come from exact name matching and BM25 retrieval. You can refer to Section 3.2 in our paper (https://arxiv.org/pdf/2202.05932.pdf).

The contribution of BM25 in the retrieval stage is not very significant. That being said, if you want to approximate the "predicted labels", you can implement a very simple exact name matching strategy: if the name of a label appears in a document, it is added to the "predicted labels". The result of this strategy should closely approximate what we show in MAG_candidates.json.
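As a concrete illustration, here is a minimal sketch of that exact name matching strategy. The document and label formats are simplified assumptions, not the repo's actual schema:

```python
# Minimal sketch of the exact name matching strategy described above.
# Assumptions: labels are plain strings; a document is a single text field.
def predict_labels(document_text, label_names):
    """Return every label whose name appears verbatim in the document."""
    text = document_text.lower()
    return [name for name in label_names if name.lower() in text]

# Toy example:
doc = "We study graph neural networks for extreme multi-label text classification."
labels = ["graph neural networks", "text classification", "speech recognition"]
print(predict_labels(doc, labels))
# ['graph neural networks', 'text classification']
```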

@GCTTTTTT (Author)

Hello, I want to ask whether the "venue", "author", "reference", and "citation" fields are required to run this model in {dataset}_test.json and {dataset}_train.json.

@yuzhimanhua (Owner)

Hi,

These fields are NOT required in {dataset}_test.json, but they are required in {dataset}_train.json.

If your own datasets do not have such metadata information, you can use our MAG_train.json or PubMed_train.json for training and your own test set for testing. However, I cannot guarantee our model's performance in such a "transfer learning" setting.
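To make the field requirements concrete, here are two hypothetical minimal records. Every field name other than "venue", "author", "reference", "citation", and "label" is an assumption about the schema, so check against the released MAG_train.json:

```python
# Hypothetical training record: the metadata fields are required here.
train_doc = {
    "paper": "12345",                # assumed ID field name
    "title": "A toy paper title",
    "abstract": "A toy abstract.",
    "venue": "Some Conference",      # required for training
    "author": ["a1", "a2"],          # required for training
    "reference": ["67890"],          # required for training
    "citation": ["13579"],           # required for training
    "label": ["toy_label"],
}

# Hypothetical test record: the same metadata fields can be omitted.
test_doc = {
    "paper": "24680",
    "title": "Another toy title",
    "abstract": "Another toy abstract.",
}
```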

@GCTTTTTT (Author)

Oh, thanks! But if I use MAG_train.json for training and my own test set for testing, should {dataset}_label.json and {dataset}_candidates.json correspond to my own test set?

@yuzhimanhua (Owner)

Yes, those two json files should correspond to your own test set.

If you do not have ground truth labels and just want to do predictions, you can remove the last line in run.sh: https://github.com/yuzhimanhua/MICoL/blob/master/run.sh#L12
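Since both files must line up with the test set, a quick sanity check like the sketch below may help. It assumes one JSON object per line and an ID field named "paper", both of which are guesses about the file layout:

```python
import json

def load_jsonl(path):
    """Load a file with one JSON object per line (assumed layout)."""
    with open(path) as f:
        return [json.loads(line) for line in f]

test = load_jsonl("MAG_test.json")          # your own test set
cands = load_jsonl("MAG_candidates.json")   # candidates for that test set

assert len(test) == len(cands), "one candidate entry per test document"
for t, c in zip(test, cands):
    assert t["paper"] == c["paper"], "entries must refer to the same document"
```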

@GCTTTTTT (Author)

Hello! Thanks for your patient answers! I used my own data in test.json and in those two json files. prepare.sh seems to have run successfully, but run.sh failed with the errors below. What might be the reason for these errors?

```
Namespace(adam_epsilon=1e-08, architecture='cross', bert_model='scibert_scivocab_uncased/', eval=False, eval_batch_size=128, fp16=False, fp16_opt_level='O1', gradient_accumulation_steps=1, learning_rate=5e-05, max_contexts_length=256, max_grad_norm=1.0, max_response_length=256, model_type='bert', num_train_epochs=1.0, output_dir='MAG_output/', poly_m=0, print_freq=500, seed=12345, test_file='MAG_input/test.txt', train_batch_size=4, train_dir='MAG_input/', use_pretrain=True, warmup_steps=100, weight_decay=0.01)
Traceback (most recent call last):
  File "main.py", line 158, in <module>
    tokenizer = TokenizerClass.from_pretrained(os.path.join(args.bert_model, "vocab.txt"), do_lower_case=True, clean_text=False)
  File "/home/hxx/miniconda3/envs/pytorch/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 1653, in from_pretrained
    f"Calling {cls.__name__}.from_pretrained() with the path to a single file or url is not "
ValueError: Calling BertTokenizerFast.from_pretrained() with the path to a single file or url is not supported for this tokenizer. Use a model identifier or the path to a directory instead.
Namespace(adam_epsilon=1e-08, architecture='cross', bert_model='scibert_scivocab_uncased/', eval=True, eval_batch_size=128, fp16=False, fp16_opt_level='O1', gradient_accumulation_steps=1, learning_rate=5e-05, max_contexts_length=256, max_grad_norm=1.0, max_response_length=256, model_type='bert', num_train_epochs=1.0, output_dir='MAG_output/', poly_m=0, print_freq=500, seed=12345, test_file='MAG_input/test.txt', train_batch_size=4, train_dir='MAG_input/', use_pretrain=True, warmup_steps=100, weight_decay=0.01)
Traceback (most recent call last):
  File "main.py", line 158, in <module>
    tokenizer = TokenizerClass.from_pretrained(os.path.join(args.bert_model, "vocab.txt"), do_lower_case=True, clean_text=False)
  File "/home/hxx/miniconda3/envs/pytorch/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 1653, in from_pretrained
    f"Calling {cls.__name__}.from_pretrained() with the path to a single file or url is not "
ValueError: Calling BertTokenizerFast.from_pretrained() with the path to a single file or url is not supported for this tokenizer. Use a model identifier or the path to a directory instead.
```

@yuzhimanhua (Owner) commented Aug 1, 2022

Hello,

Sorry for my late reply. I re-ran the code on my side and it worked well, so I am not quite sure of the reason. I suspect it is still a package version issue. Could you please try switching to Python 3.6 and refer to https://github.com/yuzhimanhua/MICoL/blob/master/requirements.txt for the required versions of torch and transformers?

Thanks!
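For readers who prefer to stay on a newer transformers instead of downgrading, the ValueError above already hints at a workaround: pass the model directory rather than the vocab.txt path. The sketch below is that adjustment, not a change shipped in the repo, and it deviates from the versions pinned in requirements.txt:

```python
from transformers import BertTokenizerFast

# Newer tokenizers reject a single-file path; point at the directory instead.
tokenizer = BertTokenizerFast.from_pretrained(
    "scibert_scivocab_uncased/",  # directory, not .../vocab.txt
    do_lower_case=True,
    clean_text=False,
)
```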
