GeneticBPE: Motif-Preserving Tokenization for Robust miRNA Modeling
Authors:
Jaskaran Singh, Prabhav Sanga, Arun Kumar Dubey
Abstract:
Tokenization plays a foundational yet underexplored role in biological sequence modeling. In this work, we present GeneticBPE, a biologically informed tokenization framework that encodes prior structural knowledge, such as seed motifs and conserved regions, into the vocabulary construction process. Unlike standard subword methods that optimize purely for frequency or language-model likelihood, GeneticBPE integrates motif preservation objectives and generalization-aware constraints into a modified merge scoring scheme. We evaluate our method on binary and multiclass miRNA classification tasks using the MirGeneDB v3.0 dataset and show that GeneticBPE outperforms character-level, k-mer, Unigram, and BPE tokenizations in accuracy, cross-species generalization, and motif fidelity. Theoretical results demonstrate that tokenization directly governs the inductive bias and domain robustness of sequence models. Our findings suggest that tokenization should not be treated as a preprocessing utility, but rather as a design-critical component in biological NLP pipelines.
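The abstract does not give the merge scoring formula, so the following is only a rough sketch of what a motif-aware merge score might look like, not the authors' method. It assumes a score of the form (pair frequency) minus `lam` times (number of occurrences where the merged token would straddle a motif boundary). The function names, the `lam` parameter, and the toy sequences and motif are all hypothetical and chosen for illustration.

```python
from collections import Counter


def find_motif_spans(seq, motifs):
    """Return (start, end) spans of every occurrence of each motif in seq."""
    spans = []
    for motif in motifs:
        start = seq.find(motif)
        while start != -1:
            spans.append((start, start + len(motif)))
            start = seq.find(motif, start + 1)
    return spans


def straddles(start, end, spans):
    """True if a motif boundary falls strictly inside the token [start, end)."""
    return any(start < s < end or start < e < end for s, e in spans)


def apply_merge(tokens, pair):
    """Merge every non-overlapping occurrence of pair, scanning left to right."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out


def train_motif_aware_bpe(seqs, motifs, num_merges, lam=2.0):
    """BPE-style training with an ASSUMED penalised merge score.

    score(pair) = frequency(pair) - lam * (occurrences straddling a motif boundary)
    This scoring rule is an illustrative assumption, not taken from the paper.
    """
    corpus = [list(s) for s in seqs]  # start from single nucleotides
    spans = [find_motif_spans(s, motifs) for s in seqs]
    merges = []
    for _ in range(num_merges):
        freq, bad = Counter(), Counter()
        for tokens, seq_spans in zip(corpus, spans):
            pos = 0  # character offset of the left token of the pair
            for a, b in zip(tokens, tokens[1:]):
                freq[(a, b)] += 1
                if straddles(pos, pos + len(a) + len(b), seq_spans):
                    bad[(a, b)] += 1
                pos += len(a)
        if not freq:
            break
        # Highest penalised score wins; ties are broken by the pair itself.
        best = max(freq, key=lambda p: (freq[p] - lam * bad[p], p))
        if freq[best] - lam * bad[best] <= 0:
            break  # no remaining merge is worth making
        merges.append(best)
        corpus = [apply_merge(tokens, best) for tokens in corpus]
    return merges, corpus


if __name__ == "__main__":
    toy_seqs = ["UGAGGUAGUAGG", "CGAGGUAGCCAU"]  # toy strings, not real data
    toy_motifs = ["GAGGUAG"]                      # hypothetical protected motif
    learned, tokenised = train_motif_aware_bpe(toy_seqs, toy_motifs, 10, lam=2.0)
    print(learned)
    print(tokenised)
```

With `lam=0` this reduces to plain frequency-based BPE. With a very large `lam`, any pair that would cut across a motif boundary even once is never selected, so every learned token lies either entirely inside or entirely outside a motif occurrence.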
Video "GeneticBPE: Motif-Preserving Tokenization for Robust miRNA Modeling" from the channel Tokenization Workshop (TokShop)
Video information
July 14, 2025, 22:59:52
00:08:44
