
Issue with learning alphanumeric tokens #363


Description

@yamika-g

Hi. I'm using Top2Vec for a project and this is how I have configured the model:


from top2vec import Top2Vec

model = Top2Vec(documents=texts_unified,
                min_count=10,
                topic_merge_delta=0.1,
                ngram_vocab=False,
                ngram_vocab_args=None,
                embedding_model='universal-sentence-encoder-large',
                embedding_model_path=None,
                embedding_batch_size=32,
                split_documents=False,
                document_chunker='sequential',
                chunk_length=100,
                max_num_chunks=None,
                chunk_overlap_ratio=0.5,
                chunk_len_coverage_ratio=1.0,
                sentencizer=None,
                speed='learn',
                use_corpus_file=False,
                document_ids=None,
                keep_documents=True,
                workers=None,
                tokenizer=None,
                use_embedding_model_tokenizer=True,
                umap_args=None,
                gpu_umap=False,
                hdbscan_args={'min_cluster_size': 50,
                              'metric': 'euclidean',
                              'cluster_selection_method': 'eom'},
                gpu_hdbscan=False,
                index_topics=False,
                verbose=True)

My issue is that certain alphanumeric words in my corpus, such as 'm24', 'm4', or '1v1', are crucial to my domain; they are jargon terms. However, the model does not learn embeddings for these alphanumeric words, so they end up missing from the model vocabulary and are not assigned to any topic. I can't figure out why this is happening.
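For example, a quick check of the learned topics (just a minimal sketch, run on the fitted model above) never turns up any of these terms:

# The jargon terms I expect to see somewhere in the topic word lists.
jargon = {'m24', 'm4', '1v1'}

# get_topics() returns (topic_words, word_scores, topic_nums).
topic_words, word_scores, topic_nums = model.get_topics()

# Flatten all topic word lists and intersect with the jargon terms.
found = jargon & {word for topic in topic_words for word in topic}
print(found)  # empty set -- none of the terms appear in any topic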

The issue is not word frequency: each of those words occurs more than 200 times in the corpus.
I've also tried changing the embedding_model, but those words are still not learned. Are they being filtered out internally? I noticed that no token containing digits is learned at all. How can I change this?
