GeneticBPE: Motif-Preserving Tokenization for Robust miRNA Modeling

Authors:
Jaskaran Singh, Prabhav Sanga, Arun Kumar Dubey

Abstract:
Tokenization plays a foundational yet underexplored role in biological sequence modeling. In this work, we present GeneticBPE, a biologically informed tokenization framework that encodes prior structural knowledge, such as seed motifs and conserved regions, into the vocabulary construction process. Unlike standard subword methods that optimize purely for frequency or language-model likelihood, GeneticBPE integrates motif-preservation objectives and generalization-aware constraints into a modified merge scoring scheme. We evaluate our method on binary and multiclass miRNA classification tasks using the MirGeneDB v3.0 dataset and show that GeneticBPE outperforms character-level, k-mer, Unigram, and BPE tokenizations in accuracy, cross-species generalization, and motif fidelity. Theoretical results demonstrate that tokenization directly governs the inductive bias and domain robustness of sequence models. Our findings suggest that tokenization should be treated not as a preprocessing utility but as a design-critical component of biological NLP pipelines.
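
The abstract does not give the merge scoring function, so the following is only a minimal sketch of what a motif-aware BPE merge step could look like, not the authors' implementation. It assumes a hypothetical rule: a candidate merge whose resulting token lies entirely inside a protected motif has its pair frequency multiplied by `1 + bonus`, while all other merges are scored by raw frequency as in standard BPE. The names `merge_pair`, `score`, `train`, and `bonus` are illustrative and do not come from the paper.

```python
from collections import Counter


def merge_pair(tokens, pair):
    """Replace every adjacent occurrence of `pair` in `tokens` with one merged token."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def score(pair, count, motifs, bonus):
    """Assumed scoring rule: boost merges whose result is a substring of a motif."""
    merged = pair[0] + pair[1]
    inside_motif = any(merged in motif for motif in motifs)
    return count * (1.0 + bonus) if inside_motif else float(count)


def train(seqs, motifs, num_merges, bonus=1.0):
    """Greedy BPE-style training loop using the motif-aware score above."""
    corpus = [list(s) for s in seqs]  # start from character-level tokens
    merges = []
    for _ in range(num_merges):
        counts = Counter()
        for toks in corpus:
            counts.update(zip(toks, toks[1:]))
        if not counts:
            break
        # sorted() makes tie-breaking deterministic (lexicographic order).
        best = max(sorted(counts), key=lambda p: score(p, counts[p], motifs, bonus))
        merges.append(best)
        corpus = [merge_pair(toks, best) for toks in corpus]
    return merges, corpus


if __name__ == "__main__":
    # Toy input: "GC" stands in for a protected motif (illustrative only).
    merges, corpus = train(["AUAUAUGC"], motifs=["GC"], num_merges=1, bonus=5.0)
    print(merges, corpus)
```

With `bonus=0` the sketch reduces to plain frequency-based BPE, which makes it easy to compare the two merge orders on the same input.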

Video "GeneticBPE: Motif-Preserving Tokenization for Robust miRNA Modeling" from the Tokenization Workshop (TokShop) channel


Video information

July 14, 2025, 22:59:52

00:08:44
