
Issue with learning alphanumeric tokens #363


Description

@yamika-g

Hi. I'm using Top2Vec for a project and this is how I have configured the model:


from top2vec import Top2Vec

model = Top2Vec(documents=texts_unified,
                min_count=10,
                topic_merge_delta=0.1,
                ngram_vocab=False,
                ngram_vocab_args=None,
                embedding_model='universal-sentence-encoder-large',
                embedding_model_path=None,
                embedding_batch_size=32,
                split_documents=False,
                document_chunker='sequential',
                chunk_length=100,
                max_num_chunks=None,
                chunk_overlap_ratio=0.5,
                chunk_len_coverage_ratio=1.0,
                sentencizer=None,
                speed='learn',
                use_corpus_file=False,
                document_ids=None,
                keep_documents=True,
                workers=None,
                tokenizer=None,
                use_embedding_model_tokenizer=True,
                umap_args=None,
                gpu_umap=False,
                hdbscan_args={'min_cluster_size': 50,
                              'metric': 'euclidean',
                              'cluster_selection_method': 'eom'},
                gpu_hdbscan=False,
                index_topics=False,
                verbose=True)

My issue is that certain alphanumeric words in my corpus, such as 'm24', 'm4', or '1v1', are crucial to my domain; they are jargon terms. However, the model does not learn embeddings for these alphanumeric words, so they end up missing from the model vocabulary and are not assigned to any topic. I can't figure out why this is happening.
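For example, a quick check of the learned topics (just a minimal sketch, run on the fitted model above) never turns up any of these terms:

# The jargon terms I expect to see somewhere in the topic word lists.
jargon = {'m24', 'm4', '1v1'}

# get_topics() returns (topic_words, word_scores, topic_nums).
topic_words, word_scores, topic_nums = model.get_topics()

# Flatten all topic word lists and intersect with the jargon terms.
found = jargon & {word for topic in topic_words for word in topic}
print(found)  # empty set -- none of the terms appear in any topic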

The issue is not word frequency: each of those words occurs more than 200 times in the corpus.
I've also tried changing the embedding_model, but those words are still not learned. Are they being filtered out internally? I noticed that no token containing digits is learned at all. How can I change this?
