Combining/adding Vectors From Different Word2vec Models

I am using gensim to create Word2Vec models trained on large text corpora. I have some models based on StackExchange data dumps. I also have a model trained on a corpus derived fro

Solution 1:

Generally, only word vectors that were trained together are meaningfully comparable. (It's the interleaved tug-of-war during training that moves them to relative orientations that are meaningful, and there's enough randomness in the process that even models trained on the same corpus will vary in where they place individual words.)

Using words shared by both corpora as guideposts, it is possible to learn a transformation from one space, A, to the other, B, that tries to move those shared words to their corresponding positions in the other space. Applying that same transformation to the words in A that aren't in B then gives B coordinates for those words, making them comparable to native-B words.
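
As a rough illustration of that idea, here is a minimal NumPy sketch that fits a linear map by ordinary least squares. Everything in it is an assumption made for the example: the toy vocabularies, the 4-dimensional random vectors, the `hidden_map` that relates the two spaces, and the helper name `learn_linear_map`. With real models you would pull the vectors for the shared words out of each model instead.

```python
import numpy as np

def learn_linear_map(src_vecs, tgt_vecs):
    """Return W minimising ||src_vecs @ W - tgt_vecs||^2 (ordinary least squares)."""
    w, *_ = np.linalg.lstsq(src_vecs, tgt_vecs, rcond=None)
    return w

rng = np.random.default_rng(0)
dim = 4
# Toy assumption: space B is an exact linear image of space A.
hidden_map = rng.normal(size=(dim, dim))

vocab_a = ["java", "python", "linux", "kernel", "server", "regex", "docker", "monad"]
space_a = {word: rng.normal(size=dim) for word in vocab_a}
# "monad" is deliberately missing from space B.
space_b = {word: space_a[word] @ hidden_map for word in vocab_a if word != "monad"}

# Guidepost words present in both spaces, in a fixed order.
shared = sorted(set(space_a) & set(space_b))
src = np.stack([space_a[word] for word in shared])
tgt = np.stack([space_b[word] for word in shared])

w = learn_linear_map(src, tgt)

# Project the A-only word into B's coordinate system.
monad_in_b = space_a["monad"] @ w
```

In this toy setup the fit is exact because B really is a linear image of A. Independently trained models are only approximately related, so the projected vectors should be treated as estimates.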

This technique has been used with some success in word2vec-driven language translation (where the guidepost pairs are known translations), or as a means of growing a limited word-vector set with word-vectors from elsewhere. Whether it'd work well enough for your purposes, I don't know. I imagine it could go astray especially where the two training corpora use shared tokens in wildly different senses.
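
One possible way to spot tokens whose senses differ is to hold a few shared words out of the fit and check how close their mapped vectors land to their actual vectors in the other space. The sketch below is only a toy under stated assumptions: the word list, the random 5-dimensional vectors, the rotation that relates the spaces, and the sign flip that simulates a different sense of "python" are all invented for the example.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(1)
dim = 5
# Toy assumption: space B is a rotated copy of space A.
rotation, _ = np.linalg.qr(rng.normal(size=(dim, dim)))

words = ["file", "server", "error", "memory", "thread", "kernel", "driver", "python"]
space_a = {word: rng.normal(size=dim) for word in words}
space_b = {word: space_a[word] @ rotation for word in words}
# Simulate a token used in a different sense in corpus B by flipping its vector.
space_b["python"] = -space_b["python"]

# Fit on some shared words only; keep the rest for checking.
held_out = ["driver", "python"]
anchors = [word for word in words if word not in held_out]

src = np.stack([space_a[word] for word in anchors])
tgt = np.stack([space_b[word] for word in anchors])
mapping, *_ = np.linalg.lstsq(src, tgt, rcond=None)

# Compare each held-out word's mapped A vector with its native B vector.
scores = {word: cosine(space_a[word] @ mapping, space_b[word]) for word in held_out}
for word, score in scores.items():
    print(f"{word}: {score:+.3f}")
```

By construction, "driver" scores close to 1 here and "python" close to -1. With real models the scores will be far less clear-cut, and any threshold for "too different" would have to be chosen empirically.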

The gensim library has a class, TranslationMatrix, that may be able to do this for you. See:

https://radimrehurek.com/gensim/models/translation_matrix.html

There's a demo notebook of its use at:

https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/translation_matrix.ipynb

(Whenever practical, doing a full training on a mixed-together corpus, with all word examples, is likely to do better.)


Solution 2:

If you want to avoid training a new model on large mixed corpora with translations, I'd recommend checking out my new Python package (transvec) that allows you to convert word embeddings between pre-trained word2vec models. All you need to do is provide a representative set of individual words in the target language along with their translations in the source language as training data, which is much more manageable (I just took a few thousand words and threw them into Google translate for some pretty good results).

It works in a similar way to the TranslationMatrix mentioned in the other answer in that it works on pre-trained word2vec models, but as well as providing you with translations it can also provide you with the translated word vectors, allowing you to do things like nearest neighbour clustering on mixed-language corpora.

It also supports using regularisation in the training phase to help improve translations when your training data is limited.
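
To give a feel for why regularisation helps when word pairs are scarce, here is a generic ridge (L2) regression sketch in NumPy. It is not transvec's implementation, and the function name `fit_ridge_map`, the `alpha` parameter, and the random toy data are assumptions made purely for illustration.

```python
import numpy as np

def fit_ridge_map(src, tgt, alpha=1.0):
    """Closed-form solution of min_W ||src @ W - tgt||^2 + alpha * ||W||^2."""
    dim = src.shape[1]
    gram = src.T @ src + alpha * np.eye(dim)
    return np.linalg.solve(gram, src.T @ tgt)

rng = np.random.default_rng(0)
dim = 6
# Only 4 word pairs for a 6-dimensional space: the unregularised problem
# is under-determined, so src.T @ src alone would be singular.
src = rng.normal(size=(4, dim))
tgt = rng.normal(size=(4, dim))

w_weak = fit_ridge_map(src, tgt, alpha=1e-3)
w_strong = fit_ridge_map(src, tgt, alpha=10.0)

# A larger alpha shrinks the map, trading fit on the few pairs for stability.
print(np.linalg.norm(w_weak), np.linalg.norm(w_strong))
```

The penalty term makes the system solvable even with fewer pairs than dimensions, at the cost of biasing the map toward smaller weights.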

Here's a small example using transvec:

import gensim.downloader
from transvec.transformers import TranslationWordVectorizer

# Pretrained models in two different languages.
ru_model = gensim.downloader.load("word2vec-ruscorpora-300")
en_model = gensim.downloader.load("glove-wiki-gigaword-300")

# Training data: pairs of English words with their Russian translations.
# The more you can provide, the better.
train = [
    ("king", "царь_NOUN"), ("tsar", "царь_NOUN"),
    ("man", "мужчина_NOUN"), ("woman", "женщина_NOUN")
]

bilingual_model = TranslationWordVectorizer(en_model, ru_model).fit(train)

# Find words with similar meanings across both languages.
bilingual_model.similar_by_word("царица_NOUN", 1) # "queen"
# [('king', 0.7763221263885498)]

Installation guidance and more details can be found on PyPI: https://pypi.org/project/transvec/.

