Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add a test to match author's tokenization #37

Open
wants to merge 15 commits into
base: modeling_markuplm_bis
Choose a base branch
from

Conversation

SaulLu
Copy link

@SaulLu SaulLu commented Apr 13, 2022

What does this PR do?

I propose to add a test that aims to verify that we are able to reproduce the tokenization produced by the authors of MarkUpLM on the downstream task of WebSRC.

To design this test, I therefore ran the authors' code on the WebSRC downstream task and isolated:

  • the xpaths in the get_xpath_and_treeid4tokens method

https://github.com/microsoft/unilm/blob/e4929f812398207b7fefb4dda6e9debcb8ce34b9/markuplm/examples/fine_tuning/run_websrc/utils.py#L256-L284

  • the other targets in the feature creation stage (in the convert_examples_to_features method):

https://github.com/microsoft/unilm/blob/e4929f812398207b7fefb4dda6e9debcb8ce34b9/markuplm/examples/fine_tuning/run_websrc/utils.py#L664-L684

This test already reveals some differences:

  • On the slow tokenizer the template on the phrase pair is not the same. The authors have <s> question </s> content </s> and we have <s> question </s></s> content .
  • The slow tokenizer loses the last tokens and replaces them with pad tokens
  • The fast tokenizer distorts the tokenization of the question

@SaulLu
Copy link
Author

SaulLu commented Apr 13, 2022

@NielsRogge , I can't add you as a reviewer to this PR 🙂 . I would love to have your review on it

@NielsRogge NielsRogge force-pushed the modeling_markuplm_bis branch from fd1db5e to f05b861 Compare June 6, 2022 13:24
@SaulLu SaulLu mentioned this pull request Jun 21, 2022
5 tasks
@NielsRogge NielsRogge force-pushed the modeling_markuplm_bis branch 2 times, most recently from 47c172c to c79adea Compare September 15, 2022 07:37
@NielsRogge NielsRogge force-pushed the modeling_markuplm_bis branch 3 times, most recently from 5387164 to 738a608 Compare September 29, 2022 09:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant