Converting String Tokens Into Integers
Solution 1:
There's no essential reason to use Word2Vec for this. The point of Word2Vec is to map words to multi-dimensional, "dense" vectors with many floating-point coordinates. Though Word2Vec happens to scan your training corpus for all unique words, and gives each unique word an integer position in its internal data structures, you wouldn't usually make a one-dimensional model (size=1), or ask the model for a word's integer slot (an internal implementation detail).
If you just need a (string word) -> (int id) mapping, the gensim class Dictionary can do that. See:
https://radimrehurek.com/gensim/corpora/dictionary.html
from nltk.tokenize import word_tokenize
from gensim.corpora.dictionary import Dictionary
sometext = "hello how are you doing?"
tokens = word_tokenize(sometext)
my_vocab = Dictionary([tokens])
print(my_vocab.token2id['hello'])
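If pulling in gensim feels like overkill, the same (string word) -> (int id) mapping can be built in a few lines of plain Python. A minimal sketch (the helper name build_vocab is mine, not from any library):

```python
def build_vocab(tokens):
    """Assign each unique token an integer id, in first-seen order."""
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab

tokens = ['hello', 'how', 'are', 'you', 'doing', '?']
vocab = build_vocab(tokens)
print(vocab['hello'])  # → 0, since 'hello' is the first token seen
```

Note one difference: gensim's Dictionary sorts each document's unique tokens before assigning ids (which is why 'hello' gets id 3 below), whereas this sketch assigns ids in order of first appearance.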
Now, if there's actually some valid reason to be using Word2Vec (such as needing the multidimensional vectors for a larger vocabulary, trained on a significant amount of varying text) and your real need is to know its internal integer slots for words, you can access those via the internal wv property's vocab dictionary:
print(model.wv.vocab['hello'].index)
Solution 2:
You can use gensim's corpora.Dictionary to create a dictionary and integer ids for tokens.
from gensim import corpora
tokens = ['hello', 'how', 'are', 'you', 'doing', '?']  # e.g. the word_tokenize output from above
dictionary = corpora.Dictionary([tokens])
print(dictionary)
Dictionary(6 unique tokens: ['?', 'are', 'doing', 'hello', 'how']...)
To see the full token -> id mapping, inspect the token2id attribute:
print(dictionary.token2id)
{'?': 0, 'are': 1, 'doing': 2, 'hello': 3, 'how': 4, 'you': 5}
dictionary.token2id['hello']
3