This repository was archived by the owner on Dec 16, 2022. It is now read-only.

Vocabulary class from transformer tokenizers #3581

Closed
wants to merge 3 commits

Conversation

@dirkgr (Member) commented on Jan 3, 2020

I think I'll need this for the transformer-based RC models. Those aren't finished yet, so I don't know for sure, but I don't want to make a huge PR, and this change reasonably stands on its own.

@dirkgr requested a review from brendan-ai2 on January 3, 2020 at 00:39
@matt-gardner (Contributor)

We add tokens to the vocabulary inside the indexer; what do you need these extra fields for?

I'm not opposed to adding the methods on the vocabulary, though I'm not sure what the use case is for them just yet. I do think that we should avoid having a separate vocabulary class that you have to specify just to use the transformers. That seems really unfortunate, and harder to configure than it should be.

@dirkgr (Member, Author) commented on Jan 3, 2020

I see. PretrainedTransformerIndexer._add_encoding_to_vocabulary() doesn't do what I thought it does.

Even with that cleared up, do you think that having the indexer modify the vocab, in the "tags" namespace no less, is cleaner than this?

Relying on _add_encoding_to_vocabulary() does not give me a reliable way to find the "[CLS]" and "[SEP]" token ids, which is what I need to do.

Having to specify "pretrained_transformer" a third time is annoying for sure, but we already have to do it twice, once for the indexer, and once for the tokenizer. That whole architecture doesn't lend itself to the way the transformer library tokenizes. I think cleaning that up is out of scope for this PR.

@matt-gardner (Contributor)

I don't really think there's a way around having to specify the transformer for the tokenizer, indexer, and embedder. Those three components all need to act together, and we decided to have them be separate objects, so you have to specify things three times. The original design, back in the day, had one object do all of these jobs, because the jobs really are pretty coupled, but we decided it was better software design to decouple them.
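
For concreteness, a rough sketch of what that looks like in code (class names reflect the current AllenNLP API; exact import paths and constructor signatures may differ by version):

```python
from allennlp.data.token_indexers import PretrainedTransformerIndexer
from allennlp.data.tokenizers import PretrainedTransformerTokenizer
from allennlp.modules.token_embedders import PretrainedTransformerEmbedder

# The same model name shows up three times because tokenization, indexing,
# and embedding all have to agree on the same pretrained model.
model_name = "bert-base-uncased"
tokenizer = PretrainedTransformerTokenizer(model_name)
indexer = PretrainedTransformerIndexer(model_name, namespace="tags")
embedder = PretrainedTransformerEmbedder(model_name)
```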

I don't think we need to add the vocabulary into this, though: yes, it's cleaner to me not to additionally couple the vocabulary into this mess, if it's reasonably avoidable. Almost all of the time, you currently don't need to configure the vocabulary at all.

All of that is to say, I don't think we can clean up the architecture any more than it is (other than changing the namespace, if it makes sense to do that).

So the question is how to solve your problem: you need the indices of the "[CLS]" and "[SEP]" tokens. Why? It's possible there's an easier way to get them. But if there's not, what's wrong with vocab.get_token_index("[CLS]")?

@dirkgr (Member, Author) commented on Jan 3, 2020

Sometimes the token is "<cls>".

Also, it would be vocab.get_token_index("[CLS]", "tags").
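
For illustration, a minimal sketch of that lookup (assuming vocab is the model's Vocabulary and the indexer has already copied the wordpieces into the "tags" namespace):

```python
# The special-token string itself is model-dependent ("[CLS]" for BERT,
# "<cls>" for XLNet, "<s>" for RoBERTa), which is exactly the problem with
# hard-coding it in the model.
cls_index = vocab.get_token_index("[CLS]", "tags")
sep_index = vocab.get_token_index("[SEP]", "tags")
```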

@matt-gardner (Contributor)

Ok, and where do you need this info? Is it in a place where you already have the tokenizer? What do you need it for?

@dirkgr (Member, Author) commented on Jan 3, 2020

Actually, I can transplant the logic for identifying the special tokens into the indexer. Then I have the special tokens without having to change the design.
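
Roughly what I have in mind, as a sketch: the indexer already wraps the transformers tokenizer, which knows its own special tokens, so something like this works regardless of what the model calls them:

```python
from transformers import AutoTokenizer

# Sketch only: the wrapped tokenizer exposes the special tokens and their
# ids directly, whether they are "[CLS]"/"[SEP]", "<cls>"/"<sep>", etc.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
cls_token, sep_token = tokenizer.cls_token, tokenizer.sep_token
cls_id, sep_id = tokenizer.cls_token_id, tokenizer.sep_token_id
```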

@dirkgr (Member, Author) commented on Jan 3, 2020

I need the info in the model, where I either need to find the CLS tokens or construct a sequence containing CLS and SEP to send to the embedder.
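
For example, something along these lines (a sketch with placeholder names, not the actual model code), where the model wraps a batch of wordpiece ids before handing it to the embedder:

```python
import torch

def wrap_with_special_tokens(token_ids: torch.Tensor, cls_id: int, sep_id: int) -> torch.Tensor:
    # token_ids: (batch_size, num_wordpieces), without special tokens.
    # Returns (batch_size, num_wordpieces + 2) with CLS prepended and SEP appended.
    batch_size = token_ids.size(0)
    cls_column = token_ids.new_full((batch_size, 1), cls_id)
    sep_column = token_ids.new_full((batch_size, 1), sep_id)
    return torch.cat([cls_column, token_ids, sep_column], dim=1)
```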

@matt-gardner (Contributor)

I'm pretty confused about why you need to do that for the embedder; maybe we can chat about what exactly you need? I don't want to block you from doing what you need to do, I just think there might be easier ways to do it that don't require the changes you've proposed here.

@dirkgr (Member, Author) commented on Jan 4, 2020

Closing after a phone call about the topic.

@dirkgr closed this on Jan 4, 2020
@dirkgr (Member, Author) commented on Jan 9, 2020

@matt-gardner, I realized later that the only reason we have to pass the name of the pretrained model to the indexer is so that it can copy the vocabulary. There is no other reason for it. So if we do it here instead, the total number of times we have to specify it stays constant.
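
To spell out what I mean by copying the vocabulary, a rough sketch (this is not the actual implementation; it assumes the tokenizer exposes get_vocab() and writes to the Vocabulary's internal dicts so the pretrained ids are preserved):

```python
from allennlp.data import Vocabulary
from transformers import AutoTokenizer

def copy_transformer_vocab(model_name: str, vocab: Vocabulary, namespace: str = "tags") -> None:
    # Mirror the pretrained wordpiece-to-id mapping into a Vocabulary
    # namespace so indices line up with what the transformer expects.
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    for token, index in tokenizer.get_vocab().items():
        # add_token_to_namespace would assign fresh ids, so write the
        # internal dicts directly to keep the pretrained ids.
        vocab._token_to_index[namespace][token] = index
        vocab._index_to_token[namespace][index] = token
```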

@matt-gardner (Contributor)

But this way it makes more sense. We have a chapter in the allennlp course that talks about how we map text to features, and there are three main parts: tokenization, indexing, and embedding. It makes sense that the three places you specify a transformer model are those three places. Having a random one in the vocabulary instead of the indexer is worse design, I think.

@dirkgr mentioned this pull request on Mar 27, 2020