Vocabulary class from transformer tokenizers #3581
Conversation
We add tokens to the vocabulary inside the indexer; what do you need these extra fields for? I'm not opposed to adding the methods on the vocabulary, though I'm not sure what the use case is for them just yet. I do think that we should avoid having a separate vocabulary class that you have to specify just to use the transformers. That seems really unfortunate, and harder to configure than it should be.
I see. Even with that cleared up, do you think that having the indexer modify the vocab, in the "tags" namespace no less, is cleaner than this? Relying on Having to specify
I don't really think there's a way around having to specify the transformer for the tokenizer, indexer, and embedder. Those three components all need to act together, and we decided to have them be separate objects, so you have to specify things three times. The original design, back in the day, had one object do all of these jobs, because the jobs really are pretty coupled, but we decided it was better software design to decouple them.

I don't think we need to add the vocabulary into this, though - yes, it's cleaner to me to not additionally couple the vocabulary into this mess, if it's reasonably avoidable. You currently don't need to configure the vocabulary at all, almost all of the time.

All of that is to say, I don't think we can clean up the architecture any more than it is (other than changing the namespace, if it makes sense to do that). So the question is how to solve your problem. You need to get the indices of the "[CLS]" and "[SEP]" tokens. Why? It's possible there's an easier way to solve your problem. But if there's not, what's wrong with
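As a side note on what "getting the indices" involves: a minimal sketch (not from this PR, with illustrative token names and indices) of looking special-token indices up in a plain token-to-index mapping, rather than storing them on a dedicated vocabulary class:

```python
# Hypothetical sketch: different pretrained models spell their special
# tokens differently (BERT-style vs. RoBERTa-style), so the lookup has
# to try more than one spelling. Names and indices are illustrative.

SPECIAL_TOKEN_CANDIDATES = {
    "cls": ["[CLS]", "<s>"],
    "sep": ["[SEP]", "</s>"],
}

def find_special_token_index(token_to_index, role):
    """Return the index of the first known spelling of the given special token."""
    for token in SPECIAL_TOKEN_CANDIDATES[role]:
        if token in token_to_index:
            return token_to_index[token]
    raise KeyError(f"No known {role} token in this vocabulary")

# Toy mapping standing in for a transformer tokenizer's vocab.
vocab = {"[CLS]": 101, "[SEP]": 102, "hello": 7592}
print(find_special_token_index(vocab, "cls"))  # 101
print(find_special_token_index(vocab, "sep"))  # 102
```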
Sometimes the token is Also, it would be
Ok, and where do you need this info? Is it in a place where you already have the tokenizer? What do you need it for?
Actually, I can transplant the logic for identifying the special tokens into the indexer. Then I have the special tokens without having to change the design. |
I need the info in the model, where I either need to find CLS tokens, or construct a sequence containing CLS and SEP, to send to the embedder.
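To make the second case concrete, here is a minimal sketch (not code from the PR; the helper name and the index values are illustrative) of wrapping already-indexed wordpiece ids with the CLS and SEP indices before handing them to the embedder:

```python
# Hypothetical sketch: inside the model, build a [CLS] ... [SEP] sequence
# from wordpiece ids. cls_index=101 and sep_index=102 are the BERT-base
# conventions, used here purely as an example.

def add_special_token_ids(token_ids, cls_index, sep_index):
    """Prepend the CLS index and append the SEP index to a list of wordpiece ids."""
    return [cls_index] + list(token_ids) + [sep_index]

print(add_special_token_ids([7592, 2088], cls_index=101, sep_index=102))
# [101, 7592, 2088, 102]
```

The point of the discussion is where `cls_index` and `sep_index` should come from: the vocabulary, the indexer, or the tokenizer.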
I'm pretty confused about why you need to do that for the embedder; maybe we can chat about what exactly you need? I don't want to block you from doing what you need to do, I just think there might be easier ways to do it that don't require the changes you've proposed here.
Closing after a phone call about the topic.
@matt-gardner, I realized later that the only reason why we have to pass the name of the pretrained model to the indexer is so that it can copy the vocabulary. There is no other reason for it. So if we do it here instead, the total number of times we have to specify it stays constant. |
But this way it makes more sense. We have a chapter in the allennlp course that talks about how we map text to features, and there are three main parts: tokenization, indexing, and embedding. It makes sense that the three places you specify a transformer model are those three places. Having a random one in the vocabulary instead of the indexer is worse design, I think. |
I think I'll need this for the transformer based RC models. Those are not finished yet, so I don't know for sure, but I don't want to make a huge PR, and this change reasonably stands on its own.