Getting the vocab of a Huggingface Tokenizer

Deep Learning

2022.05.25

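While doing natural language processing with Huggingface, I got curious about how the vocabulary differs from model to model.

Looking at vocab.txt would answer that, but I wanted to get the vocabulary from Python, so I looked into it.

First, I checked the Tokenizer reference.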

Tokenizer

A tokenizer is in charge of preparing the inputs for a model. The library contains tokenizers for all the models. Most o...

I found a method that converts ids to tokens.

convert_ids_to_tokens(ids: List[int], skip_special_tokens: bool = False) → List[str]
Converts a single index or a sequence of indices in a token or a sequence of tokens, using the vocabulary and added tokens.

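If I know how many ids there are, I should be able to enumerate all of them.

The tokenizer exposes its vocabulary size as the vocab_size property, so I'll use that.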

vocab_size
Size of the base vocabulary (without the added tokens).

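from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # model name is an assumed example
ids = list(range(tokenizer.vocab_size))  # every id in the base vocabulary
vocab = tokenizer.convert_ids_to_tokens(ids)  # token string for each id, ordered by id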
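The same idea can be wrapped in a small helper. The following is only a minimal sketch: `list_vocab`, `TokenizerLike`, and `ToyTokenizer` are hypothetical names introduced here for illustration, and the toy class is a stand-in that mimics only the two members quoted above (`vocab_size` and `convert_ids_to_tokens`) so the example runs without downloading a model.

```python
from typing import List, Protocol, Sequence


class TokenizerLike(Protocol):
    """The two members this sketch relies on (hypothetical interface name)."""

    vocab_size: int

    def convert_ids_to_tokens(self, ids: Sequence[int]) -> List[str]: ...


def list_vocab(tokenizer: TokenizerLike) -> List[str]:
    """Return the base-vocabulary tokens, ordered by id."""
    ids = list(range(tokenizer.vocab_size))
    return tokenizer.convert_ids_to_tokens(ids)


class ToyTokenizer:
    """Hypothetical stand-in for a real tokenizer, used only for this demo."""

    def __init__(self, tokens: List[str]) -> None:
        self._tokens = tokens

    @property
    def vocab_size(self) -> int:
        return len(self._tokens)

    def convert_ids_to_tokens(self, ids: Sequence[int]) -> List[str]:
        return [self._tokens[i] for i in ids]


toy = ToyTokenizer(["[PAD]", "[UNK]", "hello", "##ing"])
vocab = list_vocab(toy)
print(len(vocab), vocab[:3])
```

With the real library, a tokenizer loaded through `AutoTokenizer.from_pretrained(...)` should be usable in place of the toy class, since it provides the same two members.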

With this, I can now inspect a tokenizer's vocab (the base vocabulary, since vocab_size does not count added tokens).
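Since the original motivation was seeing how vocabularies differ between models, one possible next step is to compare two token lists with set operations. This is a sketch under assumptions, not something from the reference: `compare_vocabs` is a hypothetical helper, and the two short lists below are made-up stand-ins for the vocab lists of two real models.

```python
from typing import Dict, Iterable, List


def compare_vocabs(vocab_a: Iterable[str], vocab_b: Iterable[str]) -> Dict[str, List[str]]:
    """Split two vocabularies into shared tokens and tokens unique to each side."""
    set_a, set_b = set(vocab_a), set(vocab_b)
    return {
        "shared": sorted(set_a & set_b),
        "only_a": sorted(set_a - set_b),
        "only_b": sorted(set_b - set_a),
    }


# Made-up token lists standing in for the vocab of two different models.
vocab_a = ["[PAD]", "[UNK]", "hello", "##ing"]
vocab_b = ["<pad>", "<unk>", "hello", "▁world"]

result = compare_vocabs(vocab_a, vocab_b)
for name, tokens in result.items():
    print(name, len(tokens), tokens)
```

Sets drop the id ordering, which is fine for a membership comparison; keep the original lists if the ids themselves matter.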