Cannot Load Doc2vec Object Using Gensim

I am trying to load a pre-trained Doc2vec model using gensim and use it to map a paragraph to a vector. I am referring to https://github.com/jhlau/doc2vec and the pre-trained models it provides.

Solution 1:

I would avoid using either the 4-year-old nonstandard gensim fork at https://github.com/jhlau/doc2vec, or any 4-year-old saved models that only load with such code.

The Wikipedia DBOW model there is also suspiciously small at 1.4GB. Wikipedia had well over 4 million articles even 4 years ago, and a 300-dimensional Doc2Vec model trained to have doc-vectors for 4 million articles would be at least 4,000,000 articles * 300 dimensions * 4 bytes/dimension = 4.8GB, not even counting other parts of the model. (So, that download is clearly not the 4.3M-document, 300-dimensional model mentioned in the associated paper, but something that's been truncated in other, unclear ways.)
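The size arithmetic above can be checked in a couple of lines (assuming, as is standard in gensim, that the doc-vectors are stored as 4-byte float32 values; this counts only the doc-vector array, not the vocabulary or other model weights):

```python
# Back-of-the-envelope size of just the doc-vectors array
n_docs = 4_000_000          # roughly Wikipedia's article count 4 years ago
dims = 300                  # dimensionality claimed for the model
bytes_per_value = 4         # float32

size_gb = n_docs * dims * bytes_per_value / 1e9
print(f"{size_gb:.1f} GB")  # 4.8 GB -- already larger than the 1.4 GB download
```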

The current gensim version is 3.8.3, released a few weeks ago.

It'd likely take a bit of tinkering, and an overnight-or-longer runtime, to build your own Doc2Vec model using current code and a current Wikipedia dump - but then you'd be on modern, supported code, with a modern model that better understands words that have come into use in the last 4 years. (And, if you trained a model on a corpus of the exact kind of documents of interest to you - such as academic articles - the vocabulary, the word-senses, and the match to the text-preprocessing you'll apply to later inferred documents will all be better.)

There's a Jupyter notebook example of building a Doc2Vec model from Wikipedia, which is either functional or very close to functional, inside the gensim source tree at:

https://github.com/RaRe-Technologies/gensim/blob/develop/docs/notebooks/doc2vec-wikipedia.ipynb