Hi. I'm using Top2Vec for a project and this is how I have configured the model:
model = Top2Vec(documents=texts_unified,
                min_count=10,
                topic_merge_delta=0.1,
                ngram_vocab=False,
                ngram_vocab_args=None,
                embedding_model='universal-sentence-encoder-large',
                embedding_model_path=None,
                embedding_batch_size=32,
                split_documents=False,
                document_chunker='sequential',
                chunk_length=100,
                max_num_chunks=None,
                chunk_overlap_ratio=0.5,
                chunk_len_coverage_ratio=1.0,
                sentencizer=None,
                speed='learn',
                use_corpus_file=False,
                document_ids=None,
                keep_documents=True,
                workers=None,
                tokenizer=None,
                use_embedding_model_tokenizer=True,
                umap_args=None,
                gpu_umap=False,
                hdbscan_args={'min_cluster_size': 50,
                              'metric': 'euclidean',
                              'cluster_selection_method': 'eom'},
                gpu_hdbscan=False,
                index_topics=False,
                verbose=True)
My issue is that there are certain alphanumeric words in my corpus, like 'm24', 'm4', or '1v1', which are crucial to my domain; they are jargon terms. However, the model is unable to learn embeddings for these alphanumeric words, so they are not in the model vocabulary and are not assigned to any topic. I can't figure out why this is happening.
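For example, none of them show up when I check the fitted model's vocabulary (a quick sketch; I'm assuming the vocabulary is exposed as model.vocab, which may differ between Top2Vec versions):

```python
# Check which jargon terms made it into the learned vocabulary.
# Assumes the fitted model exposes it as `model.vocab`; the attribute
# may differ between Top2Vec versions.
jargon = ['m24', 'm4', '1v1']
vocab = set(model.vocab)
print('missing from vocab:', [w for w in jargon if w not in vocab])
# In my case all three terms are reported missing.
```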
The issue is not word frequency: each of those words occurs more than 200 times in the corpus.
I've also tried changing the embedding_model, but even then those words are not learned. Are they being filtered out internally? I noticed that no token containing a digit is learned at all. How can I change this?
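My current guess is the tokenizer. If Top2Vec's default tokenizer is based on gensim's simple_preprocess, as I believe it is, then its alphabetic token pattern splits tokens at digits, and the leftover single letters fall below the minimum token length:

```python
from gensim.utils import simple_preprocess

# simple_preprocess keeps only alphabetic runs of length 2-15, so tokens
# containing digits are broken apart and the fragments discarded:
print(simple_preprocess("the m24 and 1v1 maps"))
# ['the', 'and', 'maps']  -- 'm24' and '1v1' are gone
```

If that is the cause, would passing a custom tokenizer via the tokenizer parameter be the right fix? Something like the sketch below (alnum_tokenizer is just my own illustration, not part of the library):

```python
import re

def alnum_tokenizer(doc):
    # Keep lowercase alphanumeric runs so domain terms like
    # 'm24' and '1v1' survive tokenization.
    return re.findall(r"[a-z0-9]+", doc.lower())

model = Top2Vec(documents=texts_unified,
                min_count=10,
                embedding_model='universal-sentence-encoder-large',
                tokenizer=alnum_tokenizer,
                verbose=True)
```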