310 - Understanding sub word tokenization used for NLP

Code generated in the video can be downloaded from here:
https://github.com/bnsreenu/python_for_microscopists/tree/master/310-Understanding%20sub-word%20tokenization%20used%20for%20NLP

All other code:
https://github.com/bnsreenu/python_for_microscopists

The philosophy behind subword tokenization algorithms is that:
- frequently used words should not be split into smaller sub-words, and
- rare words should be divided into meaningful sub-words.

Example: DigitalSreeni is not a real word, and it is a rare one (unless I get super famous). It may be divided as:
Digital (common word)
Sr
e
e
ni (common sub-word: Nice, Nickel, Nimble, etc.)

Advantages of sub-word tokenization (illustrated by the short sketch below):
- Keeps vocabulary sizes reasonable while still being able to learn meaningful context-independent representations.
- Handles rare and out-of-vocabulary words by breaking them into known sub-word units.
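
As a quick illustration of this behavior, here is a minimal sketch, assuming the Hugging Face transformers library is installed and the pretrained gpt2 tokenizer can be downloaded; the exact splits depend on the learned vocabulary:

# Minimal sketch, assuming the Hugging Face "transformers" library
# and download access to the pretrained "gpt2" byte-level BPE tokenizer.
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

# A frequent word is typically kept whole ...
print(tokenizer.tokenize("digital"))
# ... while a rare word such as DigitalSreeni is split into known sub-words.
print(tokenizer.tokenize("DigitalSreeni"))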

Byte Pair Encoding (BPE) reference:
https://arxiv.org/abs/1508.07909

BPE starts with a pre-tokenizer that splits the training data into words. Pre-tokenization can be as simple as whitespace tokenization, where words separated by spaces become individual tokens (e.g., GPT-2).

Starting from the pre-tokenized words, BPE builds a base vocabulary of individual symbols and then learns merge rules that combine two existing tokens into a new one.

This process is repeated until the vocabulary reaches the desired size, which is set by the user as a hyperparameter.
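
The learning loop itself is short. The toy sketch below, adapted from the reference implementation in the paper linked above, starts from a pre-tokenized toy corpus (word frequencies with symbols separated by spaces) and repeatedly merges the most frequent symbol pair:

import re
import collections

def get_stats(vocab):
    # Count how often each adjacent pair of symbols occurs in the corpus.
    pairs = collections.defaultdict(int)
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_vocab(pair, v_in):
    # Replace every occurrence of the chosen pair with its merged symbol.
    v_out = {}
    bigram = re.escape(' '.join(pair))
    pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    for word, freq in v_in.items():
        v_out[pattern.sub(''.join(pair), word)] = freq
    return v_out

# Toy pre-tokenized corpus: word -> frequency, symbols separated by spaces.
vocab = {'l o w </w>': 5, 'l o w e r </w>': 2,
         'n e w e s t </w>': 6, 'w i d e s t </w>': 3}

num_merges = 10  # in practice determined by the target vocabulary size
for i in range(num_merges):
    pairs = get_stats(vocab)
    if not pairs:
        break
    best = max(pairs, key=pairs.get)
    vocab = merge_vocab(best, vocab)
    print(f'merge {i + 1}: {best}')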

Both ByteLevelBPETokenizer and SentencePieceBPETokenizer perform subword tokenization with the BPE algorithm, but they differ in how the text is pre-processed before the vocabulary is learned and in how the resulting tokens look.

ByteLevelBPETokenizer is a tokenizer from the Hugging Face tokenizers library that learns byte-level BPE (Byte Pair Encoding) subwords. It starts by splitting each input text into bytes and then learns a vocabulary of byte-level subwords using the BPE algorithm. Because every character can be represented as a sequence of bytes, it never produces unknown tokens, which is particularly useful for languages with non-Latin scripts where a character-level tokenizer may not work well.

SentencePieceBPETokenizer, on the other hand, is the SentencePiece-compatible BPE implementation from the same tokenizers library. Instead of byte-level pre-processing, it replaces whitespace with a special metaspace symbol (▁) before learning BPE merges, so word-boundary information is stored in the tokens themselves and the original text can be recovered exactly after decoding. It can handle a wide range of languages and text types.

In terms of usage, both tokenizers are initialized and trained in a similar way.
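
For example, the following minimal sketch trains both tokenizers with the same settings and compares their output. It assumes the Hugging Face tokenizers library is installed; corpus.txt is a placeholder path for any plain-text training file.

# Minimal sketch, assuming the Hugging Face "tokenizers" library.
# "corpus.txt" is a placeholder path to a plain-text training file.
from tokenizers import ByteLevelBPETokenizer, SentencePieceBPETokenizer

byte_level = ByteLevelBPETokenizer()
byte_level.train(files=["corpus.txt"], vocab_size=5000, min_frequency=2)

sp_bpe = SentencePieceBPETokenizer()
sp_bpe.train(files=["corpus.txt"], vocab_size=5000, min_frequency=2)

# Both expose the same encode() API once trained; the byte-level tokenizer
# marks preceding spaces with "Ġ", the SentencePiece-style one with "▁".
print(byte_level.encode("DigitalSreeni teaches NLP").tokens)
print(sp_bpe.encode("DigitalSreeni teaches NLP").tokens)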
