Set-up a custom BERT Tokenizer for any language

BERT (Bidirectional Encoder Representations from Transformers) is a language representation model introduced in a paper by researchers at Google AI Language. BERT makes use of the Transformer, an attention-based architecture that learns contextual relations between words (or sub-words) in a text. In its vanilla form, the Transformer includes two separate mechanisms: an encoder that reads the text input and a decoder that produces a prediction for the task.

In this video our tokenizer is based on BERT's WordPiece, the subword tokenization algorithm used for BERT, DistilBERT, and Electra. WordPiece first initializes the vocabulary with every character present in the training data and then progressively learns a given number of merge rules. In contrast to BPE, WordPiece does not choose the most frequent symbol pair, but the one that maximizes the likelihood of the training data once added to the vocabulary.
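
As a rough, hedged illustration of this training procedure (not the exact script from the video), a WordPiece tokenizer can be configured with the Hugging Face tokenizers library along these lines; the file name comments.txt and the vocabulary size are placeholder assumptions:

from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.trainers import WordPieceTrainer
from tokenizers.pre_tokenizers import Whitespace

# Start from an empty WordPiece model; the vocabulary begins with the characters
# found in the training files and grows through likelihood-maximizing merges.
tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

trainer = WordPieceTrainer(
    vocab_size=30000,  # assumed target size; bounds how many merge rules are learned
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],
)
tokenizer.train(files=["comments.txt"], trainer=trainer)  # comments.txt is a placeholder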

In many NLP (Natural Language Processing) projects we face challenges with non-English text. In this BERT with Python tutorial, I used raw, uncleaned comment text data. The text is entirely in Lithuanian, not in English, and that is where the challenge comes in: we need to create a custom vocabulary file (vocab.txt) for a BERT tokenizer before we can train a pre-trained BERT model on our NLP problem. The process can be split into four separate steps, as follows (a code sketch covering Steps 2 and 3 appears after the timestamps):

0:00 - Intro
0:37 - Step 1. Set-up a Python virtual environment
2:17 - Step 2. Prepare text data for training
6:11 - Step 3. Train a BERT tokenizer
9:25 - Step 4. Use a BERT tokenizer
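
A compact sketch of Steps 2 and 3, assuming the prepared comments are stored one per line in a plain-text file (the path and vocabulary size below are assumptions, not the exact values from the video):

import os
from tokenizers import BertWordPieceTokenizer

paths = ["data/comments.txt"]  # placeholder path to the prepared Lithuanian text

tokenizer = BertWordPieceTokenizer(
    lowercase=False,      # keep original casing
    strip_accents=False,  # keep Lithuanian diacritics such as ė, š, ž
)
tokenizer.train(
    files=paths,
    vocab_size=30000,     # assumed size
    min_frequency=2,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)

os.makedirs("bert-tokenizer", exist_ok=True)  # save_model expects an existing directory
tokenizer.save_model("bert-tokenizer")        # writes bert-tokenizer/vocab.txt

The resulting vocab.txt is the custom vocabulary file mentioned above.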

These steps can easily be adapted to your own NLP cases, or used to develop a new BERT tokenizer by picking up the crucial moments from this tutorial. By following it you should understand the basics of how BERT works, be able to modify the Python scripts into your own code, and become familiar with BERT's special tokens. At the end of the video I return the encoded values as PyTorch tensors; you can play around with this and do it differently if you like. Finally, we save a BERT model locally just to show how it works; a sketch of this last step appears after the links. Links and resources mentioned in the video are listed below:

- BERT Tokenizer official documentation: https://huggingface.co/docs/tokenizers/python/latest/quicktour.html
- Tokenizer train method usage: https://huggingface.co/docs/tokenizers/python/latest/api/reference.html#tokenizers.trainers.Trainer
- BERT special tokens, good article to read more: https://www.analyticsvidhya.com/blog/2021/05/all-you-need-to-know-about-bert/
- BERT base uncased model for download: https://huggingface.co/bert-base-uncased/tree/main
- BERT tokenizers: https://huggingface.co/docs/transformers/main_classes/tokenizer
- BERT from_pretrained method: https://huggingface.co/docs/transformers/v4.15.0/en/main_classes/model#transformers.PreTrainedModel.from_pretrained
- BERT paper: https://arxiv.org/pdf/1810.04805.pdf
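
To illustrate Step 4 and the local save mentioned above, here is a hedged sketch of how the custom vocab.txt might be loaded with BertTokenizerFast from transformers, used to return PyTorch tensors, and stored locally next to a pre-trained model; the paths and the example sentence are assumptions:

from transformers import BertTokenizerFast, BertModel

# Load the custom vocabulary trained earlier (the path is an assumption).
tokenizer = BertTokenizerFast(vocab_file="bert-tokenizer/vocab.txt", do_lower_case=False)

# The special tokens [CLS] and [SEP] are added automatically; [PAD] fills the
# sequence up to max_length.
encoded = tokenizer(
    "Labas rytas, kaip sekasi?",  # placeholder Lithuanian sentence
    padding="max_length",
    truncation=True,
    max_length=32,
    return_tensors="pt",          # return PyTorch tensors, as in the video
)
print(encoded["input_ids"].shape)  # torch.Size([1, 32])

# Save the tokenizer and a pre-trained BERT model locally, just to show how it works.
# Note: if your custom vocabulary size differs from the checkpoint's, the embedding
# layer would need resizing before fine-tuning (model.resize_token_embeddings).
model = BertModel.from_pretrained("bert-base-uncased")
model.save_pretrained("local-bert")
tokenizer.save_pretrained("local-bert")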

By completing this tutorial you will be able to develop a custom BERT tokenizer and adapt it to your NLP task. You can also use your custom tokenizer for text classification (or multi-class classification) with BERT, as sketched below. The complete code is easy to understand even for BERT beginners like me, so I hope the video will be easy to follow for all viewers.
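
As a hint of what that classification setup might look like (a hypothetical binary example, not code from the video; the sentences, labels, and paths are invented for illustration):

import torch
from transformers import BertForSequenceClassification, BertTokenizerFast

tokenizer = BertTokenizerFast(vocab_file="bert-tokenizer/vocab.txt", do_lower_case=False)
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

batch = tokenizer(
    ["Puikus produktas!", "Labai nusivyliau pirkiniu."],  # placeholder comments
    padding=True,
    truncation=True,
    max_length=64,
    return_tensors="pt",
)
labels = torch.tensor([1, 0])  # invented labels: 1 = positive, 0 = negative

outputs = model(**batch, labels=labels)
print(outputs.loss)  # the loss you would back-propagate during fine-tuning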

#BERT #NLP #Python

Video "Set-up a custom BERT Tokenizer for any language" from the Data Science Garage channel
Video information
Published: January 5, 2022, 20:27:11
Duration: 00:13:30