310 - Understanding sub word tokenization used for NLP
Code generated in the video can be downloaded from here:
https://github.com/bnsreenu/python_for_microscopists/tree/master/310-Understanding%20sub-word%20tokenization%20used%20for%20NLP
All other code:
https://github.com/bnsreenu/python_for_microscopists
The philosophy behind subword tokenization algorithms is:
- frequently used words should not be split into smaller sub-words
- rare words should be divided into meaningful sub-words
Example: DigitalSreeni is not a real word and is a rare word (unless I get super famous). It may be divided as:
Digital (common word)
Sr
ee
ni (common sub-word: Nice, Nickel, Nimble, etc.)
Advantages of sub-word tokenization:
- Keeps the vocabulary size manageable while still providing meaningful context-independent representations.
- Handles rare and out-of-vocabulary words by breaking them into known sub-word units.
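The out-of-vocabulary behavior can be illustrated with a toy greedy longest-match segmenter. This is a simplification for intuition only (real tokenizers apply learned merge rules or a language model); the vocabulary below is made up:

```python
def segment(word, vocab):
    """Greedily split a word into the longest known sub-words, left to right."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest candidate first
            piece = word[i:j]
            if piece in vocab or j == i + 1:  # single characters are always kept
                pieces.append(piece)
                i = j
                break
    return pieces

# Hypothetical learned vocabulary of sub-word units
vocab = {"digital", "sr", "ee", "ni", "token"}
print(segment("digitalsreeni", vocab))  # ['digital', 'sr', 'ee', 'ni']
```

The unseen word "digitalsreeni" never appears in the vocabulary, yet it is still represented entirely by known units instead of a single unknown-token placeholder.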
Byte Pair Encoding (BPE) reference:
https://arxiv.org/abs/1508.07909
BPE starts with a pre-tokenizer that splits the training data into words. Pre-tokenization can be as simple as splitting on spaces, where each space-separated word becomes an individual token (e.g., GPT-2).
From these pre-tokenized words, BPE learns merge rules that form a new token from two tokens of the base vocabulary.
This process is repeated until the vocabulary reaches the desired size, which is set by the user as a hyperparameter.
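The merge-learning loop can be sketched in pure Python, following the algorithm from the BPE paper referenced above. The tiny word-frequency corpus and the number of merges are illustrative choices, not values from the video:

```python
from collections import Counter

def get_pair_stats(vocab):
    """Count how often each adjacent symbol pair occurs across the corpus."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_pair(pair, vocab):
    """Apply one merge rule: fuse every occurrence of `pair` into one symbol."""
    old, new = " ".join(pair), "".join(pair)
    return {word.replace(old, new): freq for word, freq in vocab.items()}

# Pre-tokenized training data: each word starts as a sequence of characters.
vocab = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}

num_merges = 5  # stand-in for the user-set target vocabulary size
merges = []
for _ in range(num_merges):
    stats = get_pair_stats(vocab)
    if not stats:
        break
    best = max(stats, key=stats.get)  # most frequent adjacent pair
    vocab = merge_pair(best, vocab)
    merges.append(best)

print(merges)  # e.g., ('e', 's') is merged first, then ('es', 't'), ...
```

Each iteration adds one new token to the vocabulary, so the number of merges directly controls the final vocabulary size.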
Both ByteLevelBPETokenizer and SentencePieceBPETokenizer perform subword tokenization, but they prepare the text differently before learning the vocabulary.
ByteLevelBPETokenizer, from the Hugging Face tokenizers library, learns byte-level BPE subwords. It first splits each input text into bytes and then learns a vocabulary of byte-level subwords using the BPE algorithm. Because every possible byte is already in the base vocabulary, it is particularly useful for languages with non-Latin scripts, where a character-level tokenizer may not work well.
SentencePieceBPETokenizer, also from the Hugging Face tokenizers library, uses SentencePiece-style pre-tokenization: spaces are replaced with a special marker (▁) so that tokenization is reversible, and BPE is then used to learn the subword vocabulary. (The standalone SentencePiece library additionally offers a unigram language model algorithm for learning subwords.) This approach can handle a wide range of languages and text types.
In terms of usage, both tokenizers are initialized and trained in a similar way.
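A minimal sketch of that shared usage pattern, assuming the Hugging Face tokenizers package is installed (the two-sentence corpus and vocabulary size are made up for illustration):

```python
from tokenizers import ByteLevelBPETokenizer, SentencePieceBPETokenizer

# Tiny in-memory corpus; real training would use files or a large iterator.
corpus = [
    "Subword tokenization splits rare words into known pieces.",
    "Frequent words stay whole while rare words are divided.",
]

for cls in (ByteLevelBPETokenizer, SentencePieceBPETokenizer):
    tok = cls()
    # Same training call for both; vocab_size is the user-set hyperparameter.
    tok.train_from_iterator(corpus, vocab_size=500, min_frequency=1)
    print(cls.__name__, tok.encode("tokenization").tokens)
```

The printed tokens differ (byte-level pieces with Ġ-style space markers versus ▁-prefixed SentencePiece-style pieces), but the train/encode workflow is identical.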
Video 310 - Understanding sub word tokenization used for NLP, from the DigitalSreeni channel