
# Choosing the Number of Dimensions for Entity Embeddings of Categorical Variables in Keras

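The Embedding layer used in deep learning for text is apparently effective for categorical variables as well. When I tried it, I could not settle on the number of dimensions, so I looked into it.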

First, the syntax.

```python
tf.keras.layers.Embedding(
    input_dim, output_dim, embeddings_initializer='uniform',
    embeddings_regularizer=None, activity_regularizer=None,
)
```

input_dim: int > 0. Size of the vocabulary, i.e. maximum integer index + 1.

output_dim: int >= 0. Dimension of the dense embedding.

https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding

The usual practice seems to be to set input_dim to the number of categories, while output_dim is a hyperparameter with no fixed value.
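As a minimal sketch of how these two arguments relate to a categorical column: the example below maps a small column to integer indices and feeds it through an Embedding layer. The column values, the variable names, and the choice of `output_dim = 2` are assumptions made for illustration only, not taken from any of the sources quoted here.

```python
import numpy as np
import tensorflow as tf

# Hypothetical categorical column (illustrative values only).
colors = ["red", "green", "blue", "green", "red"]

# Map each distinct category to an integer index 0..n-1.
vocabulary = sorted(set(colors))
index_of = {category: i for i, category in enumerate(vocabulary)}
indices = np.array([index_of[c] for c in colors], dtype="int32")

input_dim = len(vocabulary)  # number of categories = maximum integer index + 1
output_dim = 2               # hyperparameter; 2 is an arbitrary choice for this sketch

embedding = tf.keras.layers.Embedding(input_dim=input_dim, output_dim=output_dim)
vectors = embedding(indices)  # one dense vector per row: shape (len(colors), output_dim)
print(vectors.shape)
```

The embedding weights are randomly initialized here; they only become meaningful once the layer is trained as part of a model.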

Why is the embedding vector size 3 in our example? Well, the following “formula” provides a general rule of thumb about the number of embedding dimensions:

embedding_dimensions = number_of_categories**0.25

That is, the embedding vector dimension should be the 4th root of the number of categories. Since our vocabulary size in this example is 81, the recommended number of dimensions is 3:

3 = 81**0.25

Note that this is just a general guideline; you can set the number of embedding dimensions as you please.
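The rule of thumb above can be sketched as a small helper. The function name and the decision to round to the nearest integer (with a floor of 1) are my own assumptions; the quoted text only gives the 4th-root formula.

```python
def fourth_root_embedding_dim(number_of_categories: int) -> int:
    """Rule-of-thumb embedding size: the 4th root of the number of categories.

    Rounding to the nearest integer and enforcing a minimum of 1 are
    assumptions of this sketch; the rule itself does not say how to round.
    """
    if number_of_categories < 1:
        raise ValueError("number_of_categories must be a positive integer")
    return max(1, round(number_of_categories ** 0.25))


# The example from the quoted text: a vocabulary of 81 categories.
print(fourth_root_embedding_dim(81))
```

Because the rule grows so slowly, even a column with 10,000 categories gets only about 10 dimensions.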

The dimensions of the embedding layers Di are hyperparameters that need to be pre-defined.

The bound of the dimensions of entity embeddings are between 1 and mi − 1 where mi is the number of values for the categorical variable xi .

In practice we chose the dimensions based on experiments. The following empirical guidelines are used during this process:

First, the more complex the more dimensions.

We roughly estimated how many features/aspects one might need to describe the entities and used that as the dimension to start with.

Second, if we had no clue about the first guideline, then we started with mi − 1.

https://arxiv.org/abs/1604.06737
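The guidelines quoted from the paper can be sketched roughly as follows. The function names and the `estimated_aspects` parameter are hypothetical; the paper describes the procedure only in prose, so this is one possible reading of it rather than the authors' implementation.

```python
from typing import Optional, Tuple


def entity_embedding_bounds(n_categories: int) -> Tuple[int, int]:
    """Return the (lower, upper) bound for the embedding dimension: 1 and m_i - 1."""
    if n_categories < 2:
        raise ValueError("need at least 2 categories for the range 1..m_i - 1 to exist")
    return 1, n_categories - 1


def starting_dimension(n_categories: int, estimated_aspects: Optional[int] = None) -> int:
    """Pick a dimension to start experiments with.

    If there is a rough estimate of how many features/aspects describe the
    entities, use it (clipped into the valid range); otherwise start at m_i - 1.
    """
    lower, upper = entity_embedding_bounds(n_categories)
    if estimated_aspects is None:
        return upper
    return min(max(estimated_aspects, lower), upper)


print(starting_dimension(7))      # no estimate: start from m_i - 1
print(starting_dimension(7, 3))   # estimate of 3 aspects is within bounds
```

Either result is only a starting point; the paper says the final dimensions were chosen based on experiments.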

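When in doubt about output_dim, either of these looks like a reasonable approach:

1. Use the number of categories raised to the power 0.25.

2. Try values between 1 and the number of categories − 1.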
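Putting the two options together, one way to build a list of output_dim values to try is sketched below. Filling the range between 1 and the number of categories − 1 with powers of two is purely my assumption to keep the list short; neither source prescribes a particular search grid.

```python
def candidate_output_dims(n_categories: int) -> list:
    """Candidate output_dim values between 1 and n_categories - 1.

    Always includes the 4th-root rule of thumb and the upper bound;
    the powers of two in between are an arbitrary choice for this sketch.
    """
    if n_categories < 2:
        raise ValueError("need at least 2 categories")
    upper = n_categories - 1
    fourth_root = max(1, round(n_categories ** 0.25))
    candidates = {min(fourth_root, upper), upper}
    dim = 1
    while dim <= upper:
        candidates.add(dim)
        dim *= 2
    return sorted(candidates)


print(candidate_output_dims(81))
```

Each candidate would then be evaluated by training the model with that output_dim and comparing validation scores.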

tf.keras.layers.Embedding  |  TensorFlow v2.11.0
Turns positive integers (indexes) into dense vectors of fixed size.
Introducing TensorFlow Feature Columns
Entity Embeddings of Categorical Variables
We map categorical variables in a function approximation problem into Euclidean spaces, which are the entity embeddings of the categorical variables. The mappin...
cocoinit23