Choosing the Number of Dimensions for Entity Embeddings of Categorical Variables in Keras

I had both image data and categorical data and wanted to feed both into one model, but I didn't know how to handle the categorical variables.

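The Embedding layer used in deep learning for text is apparently effective for categorical variables too, so I tried it. I couldn't settle on the number of dimensions, though, so I looked into it.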

First, the syntax.

tf.keras.layers.Embedding(
    input_dim, output_dim, embeddings_initializer='uniform',
    embeddings_regularizer=None, activity_regularizer=None,
    embeddings_constraint=None, mask_zero=False, input_length=None, **kwargs
)

input_dim  int > 0. Size of the vocabulary, i.e. maximum integer index + 1.

output_dim  int >= 0. Dimension of the dense embedding.

https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding

Setting input_dim to the number of categories seems to be standard practice, but output_dim is a hyperparameter with no fixed recommended value.
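As a minimal sketch of what "input_dim = number of categories" means in practice, the snippet below maps string categories to integer indices and derives input_dim as the maximum index + 1, matching the documentation quoted above. The helper name `build_vocabulary` and the sample column are hypothetical, and it assumes every category is known up front (no slot is reserved for unseen values).

```python
def build_vocabulary(values):
    """Map each distinct category to an integer index, in first-seen order."""
    vocab = {}
    for value in values:
        if value not in vocab:
            vocab[value] = len(vocab)
    return vocab


# Hypothetical categorical column, for illustration only.
colors = ["red", "green", "blue", "green", "red"]

vocab = build_vocabulary(colors)
encoded = [vocab[c] for c in colors]  # integer indices to feed to the layer

# "Size of the vocabulary, i.e. maximum integer index + 1"
input_dim = max(encoded) + 1
print(vocab, encoded, input_dim)
```

The integer indices in `encoded` are what would be passed to the Embedding layer, with `input_dim` as its first argument.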

Looking for a way to choose it, I found hints in a Google Developers blog post and in the original paper.

Why is the embedding vector size 3 in our example? Well, the following “formula” provides a general rule of thumb about the number of embedding dimensions:

embedding_dimensions = number_of_categories**0.25


That is, the embedding vector dimension should be the 4th root of the number of categories. Since our vocabulary size in this example is 81, the recommended number of dimensions is 3:

3 = 81**0.25

Note that this is just a general guideline; you can set the number of embedding dimensions as you please.

https://developers.googleblog.com/2017/11/introducing-tensorflow-feature-columns.html
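The rule of thumb quoted above can be sketched as a small helper. The function name `rule_of_thumb_dim` is hypothetical, and rounding to the nearest integer with a floor of 1 is my own assumption, since the blog post does not say how to handle non-integer results.

```python
def rule_of_thumb_dim(number_of_categories):
    """Fourth root of the number of categories, as an integer dimension.

    Rounding to the nearest integer and never going below 1 are
    assumptions; the quoted guideline only gives the fourth root.
    """
    if number_of_categories < 1:
        raise ValueError("number_of_categories must be at least 1")
    return max(1, round(number_of_categories ** 0.25))


# Print the suggested dimension for a few vocabulary sizes.
for n in (10, 81, 1000, 100000):
    print(n, rule_of_thumb_dim(n))
```

Because the fourth root grows very slowly, even a vocabulary of 100,000 categories gets fewer than 20 dimensions under this guideline.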

The dimensions of the embedding layers D_i are hyperparameters that need to be pre-defined.

The bound of the dimensions of entity embeddings are between 1 and m_i − 1 where m_i is the number of values for the categorical variable x_i.

In practice we chose the dimensions based on experiments. The following empirical guidelines are used during this process:

First, the more complex the more dimensions.

We roughly estimated how many features/aspects one might need to describe the entities and used that as the dimension to start with.

Second, if we had no clue about the first guideline, then we started with m_i − 1.

https://arxiv.org/abs/1604.06737
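Since the paper says the dimensions were chosen by experiment within the bounds 1 to m_i − 1, one way to get a list of values to try is sketched below. The helper name `candidate_dims` is hypothetical, and the doubling grid is my own assumption for keeping the search short; the paper does not prescribe any particular grid.

```python
def candidate_dims(num_categories):
    """Candidate embedding sizes within the paper's bounds [1, m - 1].

    Doubles from 1 and always ends with m - 1, the paper's fallback
    starting point. The doubling scheme itself is an assumption.
    """
    if num_categories < 2:
        raise ValueError("need at least 2 categories for the bounds to exist")
    upper = num_categories - 1
    dims = []
    d = 1
    while d < upper:
        dims.append(d)
        d *= 2
    dims.append(upper)  # the upper bound m - 1 is always included
    return dims


print(candidate_dims(10))
```

Each candidate would then be used as output_dim in a separate training run, keeping whichever gives the best validation score.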

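If you are unsure what to use for output_dim:

1. Use the number of categories raised to the power 0.25.

2. Try values between 1 and the number of categories − 1.

Either of these seems like a reasonable place to start.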

References

tf.keras.layers.Embedding | TensorFlow Core v2.2.0
3 Ways to Encode Categorical Variables for Deep Learning - Machine Learning Mastery
Introducing TensorFlow Feature Columns
Entity Embeddings of Categorical Variables
