
How to find the position of the original word in the list of words predicted by the GPT-2 model? #3886

Closed
states786 opened this issue Apr 21, 2020 · 4 comments

@states786

Hi,

I would like to find at which position the correct word lies among the top 5 words predicted by the GPT-2 model.

For this purpose, I am using the following code snippet:


    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")

    subseq = "The car moves very"  # sample sequence
    original_word = "fast"

    sequence = tokenizer.encode(subseq, return_tensors="pt")
    # prepend a space so the encoding matches the " fast" BPE token the model would predict
    next_word_id = tokenizer.encode(" " + original_word, return_tensors="pt")
    next_word = tokenizer.decode(next_word_id[0])

    # logits for the token that follows the prompt
    next_word_logits = model(sequence)[0][0, -1].detach()
    probabilities, word_ids = next_word_logits.topk(5)  # top 5 next-word options

    rank = 1
    for word_id in word_ids:
        word = tokenizer.decode([word_id])
        if word == next_word:
            break
        rank += 1

    # rank > 5 means the word was not among the top 5 predictions
    print("Rank of correct option is " + str(rank))

I am not sure whether this is done correctly, since the GPT-2 model uses a BPE tokenizer. Am I doing it the right way? Kindly share your thoughts, and correct me if I am doing something wrong.

@patrickvonplaten
Contributor

It won't be that easy, since some words will be split into multiple tokens, so you would have to make two forward passes.

If you limit your original_word to single-token words (you can check that simply with len(tokenizer.encode(original_word)) == 1), then your idea here should work.

If not, it's going to be trickier. This issue might also be helpful:
#2311
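
Something like this for the check (untested, assuming the gpt2 checkpoint; note that GPT-2's BPE treats " fast" and "fast" as different tokens, so I prepend a space):

    from transformers import GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

    def is_single_token(word):
        # prepend a space: that is how the word appears mid-sentence for GPT-2's BPE
        return len(tokenizer.encode(" " + word)) == 1

    # only words passing this check can be ranked with a single forward pass
    print(is_single_token("fast"))
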

@patrickvonplaten patrickvonplaten self-assigned this Apr 21, 2020
@states786
Author

states786 commented Apr 21, 2020

Thanks @patrickvonplaten for your response.
Yes, the code works when len(tokenizer.encode(original_word)) == 1, but not for an original_word that consists of more than one token.

I looked at the shared issue, but I am confused: which selected word id should I pass to the model again, since next_word_logits.topk(5) gives me 5 token ids?

Can you please share a code snippet that works for the second part?
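
In the meantime, here is my own rough attempt at the multi-token case (feeding the tokens of the original word back one at a time, i.e. teacher forcing, and computing one rank per token). I am not sure this is what you meant:

    import torch
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")

    subseq = "The car moves very"
    original_word = " overclocked"   # hypothetical multi-token example

    input_ids = tokenizer.encode(subseq, return_tensors="pt")
    gold_ids = tokenizer.encode(original_word)

    ranks = []
    with torch.no_grad():
        for gold_id in gold_ids:
            logits = model(input_ids)[0][0, -1]
            # rank of the gold token over the full vocabulary (1 = most probable)
            rank = int((logits > logits[gold_id]).sum().item()) + 1
            ranks.append(rank)
            # teacher forcing: append the gold token, not one of the top-5 ids
            input_ids = torch.cat([input_ids, torch.tensor([[gold_id]])], dim=-1)

    print(ranks)  # one rank per BPE token of the original word
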

@states786
Author

Hi @patrickvonplaten,

Can you please let me know if there is any update?

@stale

stale bot commented Jun 25, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix label Jun 25, 2020
@stale stale bot closed this as completed Jul 2, 2020