!nvidia-smi
Sat May  2 06:43:02 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00    Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   31C    P8     7W /  75W |      0MiB /  7611MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
import os
import gc
import glob
import torch
import pickle
import joblib
from tqdm.auto import tqdm

Hugging Face recently updated their scripts, and the pip release is not out yet, so we'll build from source.

!pip install tokenizers
#!pip install transformers
!git clone https://github.com/huggingface/transformers
!pip install transformers/.

Collecting Corpus

I have used a Sanskrit corpus from a Kaggle dataset. Feel free to skip this and use your own dataset. The training data needs to be in a .txt file, and I have used the same dataset for evaluation as well.

I need the Kaggle API to download the dataset. You can load your text corpus from anywhere.

You can download a corpus for your language from https://traces1.inria.fr/oscar.

I have used data from there too and appended the data to a corpus from Kaggle.

Loading from Kaggle

!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/ 
!chmod 600 ~/.kaggle/kaggle.json 

Kaggle dataset link

Thanks to iNLTK for the Wikipedia dumps and to CLTK, from which I am currently collecting Sanskrit text from open sources.

!mkdir corpus 
#directory for saving all the corpus files in one place. You can save it anywhere

#From Kaggle
!kaggle datasets download -d disisbig/sanskrit-wikipedia-articles
!unzip /content/sanskrit-wikipedia-articles.zip -d /content/corpus

#From OSCAR corpus
!wget https://traces1.inria.fr/oscar/files/compressed-orig/sa.txt.gz
!gunzip /content/sa.txt.gz
#Reading sample
with open("/content/sa.txt", "r") as fp:
    print(fp.read(1000))
import glob
train_list = glob.glob("/content/corpus/train/train/*.txt")
valid_list = glob.glob("/content/corpus/valid/valid/*.txt")
#reading and appending all the small files into single train and valid files
with open("/content/corpus/train/full.txt", "wb") as outfile:
    for f in train_list:
        with open(f, "rb") as infile:
            outfile.write(infile.read())
            outfile.write(b"\n\n")
    with open("/content/sa.txt", "rb") as infile:
            outfile.write(infile.read())

with open("/content/corpus/valid/full_val.txt", "wb") as outfile:
    for f in valid_list:
        with open(f, "rb") as infile:
            outfile.write(infile.read())
            outfile.write(b"\n\n")

Tokenizer Training

Create a directory to save the trained tokenizer and configuration files.

!mkdir data_dir
import sentencepiece as spm
from tokenizers import SentencePieceBPETokenizer, BertWordPieceTokenizer
%%time

#The ALBERT tokenizer uses SentencePiece tokenization, so I have used sentencepiece to train the tokenizer.
#This will take a while
spm.SentencePieceTrainer.Train('--input=/content/corpus/train/full.txt \
                                --model_prefix=m \
                                --vocab_size=32000 \
                                --control_symbols=[CLS],[SEP],[MASK]')

with open("m.vocab") as v:
    print(v.read(2000))
    v.close()
!cp /content/m.model /content/data_dir/spiece.model
!cp /content/m.vocab /content/data_dir/spiece.vocab

Testing Tokenizer

Make sure to check out the fast tokenizers from Hugging Face; they are really fast! You can compare them with sentencepiece.

%time
tokenizer = SentencePieceBPETokenizer()
tokenizer.train("/content/corpus/train/full.txt")
CPU times: user 4 µs, sys: 0 ns, total: 4 µs
Wall time: 10 µs
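
For a rough comparison (a minimal sketch, not part of the training pipeline), you can also time the sentencepiece model trained above on a slice of the corpus; the paths assume the earlier cells have been run.

import time
import sentencepiece as spm

#Load the SentencePiece model trained earlier (m.model)
sp = spm.SentencePieceProcessor()
sp.Load("/content/m.model")

#Encode a small slice of the training corpus and time it
sample = open("/content/corpus/train/full.txt", "r").read(100000)
start = time.time()
pieces = sp.EncodeAsPieces(sample)
print(len(pieces), "pieces in", round(time.time() - start, 3), "seconds")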

This is a very beautiful shloka ❤️, let's just pray with it 🙏. Do look up the quotes used in this notebook; I am sure you will love them!

txt = "ॐ सर्वेत्र सुखिनः सन्तु| सर्वे सन्तु निरामयाः| सर्वे भद्राणि पश्यन्तु| माँ कश्चिद् दुःख माप्नुयात॥ ॐ शांतिः शांतिः शांतिः ॥"
enc = tokenizer.encode(txt)
print(tokenizer.decode(enc.ids))
ॐ सर्वेत्र सुखिनः सन्तु| सर्वे सन्तु निरामयाः| सर्वे भद्राणि पश्यन्तु| माँ कश्चिद् दुःख माप्नुयात॥ ॐ शांतिः शांतिः शांतिः ॥

The tokenizer seems to work, but since the training script is configured to use the ALBERT tokenizer, we need to use spiece.model and spiece.vocab for the training script.

The Hugging Face tokenizer creates ['/content/hft/vocab.json', '/content/hft/merges.txt'] files, while AlbertTokenizer requires a spiece.model file, so we'll use the vocab and tokenizer model saved by sentencepiece.

!mkdir hft
tokenizer.save("/content/hft")
#we won't be using this
['/content/hft/vocab.json', '/content/hft/merges.txt']
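
As a quick sanity check (just a directory listing, using the paths from above), you can see which files each approach produced:

import os
#The Hugging Face tokenizer output (vocab.json, merges.txt) vs. the sentencepiece files we'll actually use
print("hft:", os.listdir("/content/hft"))
print("data_dir:", os.listdir("/content/data_dir"))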

Hugging Face Training

from transformers import *
#Keep in mind, this is a tokenizer for ALBERT, unlike the previous one, which was generic.
#We'll load it as an AlbertTokenizer.
tokenizer = AlbertTokenizer.from_pretrained("/content/data_dir")
op = tokenizer.encode("नैनं छिन्दन्ति शस्त्राणि नैनं दहति पावकः। न चैनं क्लेदयन्त्यापो न शोषयति मारुतः॥")
tokenizer.decode(op)
'[CLS] नैनं छिन्दन्ति शस्त्राणि नैनं दहति पावकः। न चैनं क्लेदयन्त्यापो न शोषयति मारुतः॥[SEP]'

Looks like the tokenizer is working.

Model-Tokenizer Configuration

This is important. The training script needs a configuration for the model.

Architecture refers to what the model is going to be used for, e.g., AlbertForMaskedLM for language modelling or AlbertForSequenceClassification for sequence classification. Just take a look at the left panel for the model architectures.

#Checking vocabulary size
vocab_size=tokenizer.vocab_size ; vocab_size
32000
import json

config = {
    "architectures": [
        "AlbertModel"
    ],
	"attention_probs_dropout_prob": 0.1,
	"hidden_act": "gelu",
	"hidden_dropout_prob": 0.1,
	"hidden_size": 768,
	"initializer_range": 0.02,
	"intermediate_size": 3072,
	"layer_norm_eps": 1e-05,
	"max_position_embeddings": 514,
	"model_type": "albert",
	"num_attention_heads": 12,
	"num_hidden_layers": 6,
	"type_vocab_size": 1,
	"vocab_size": vocab_size
}
with open("/content/data_dir/config.json", 'w') as fp:
    json.dump(config, fp)
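
To make sure the config we just wrote is readable by the library (a quick check, assuming the paths above), load it back with AlbertConfig:

from transformers import AlbertConfig

#Load the config we just wrote and confirm the main fields made it through
cfg = AlbertConfig.from_pretrained("/content/data_dir")
print(cfg.model_type, cfg.vocab_size, cfg.num_hidden_layers, cfg.num_attention_heads)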


#Configuration for tokenizer.
#Note: I set do_lower_case: False, and keep_accents:True

tokenizer_config = {
	"max_len": 512,
	"model_type": "albert",
	"do_lower_case":False, 
	"keep_accents":True
}
with open("/content/data_dir/tokenizer_config.json", 'w') as fp:
    json.dump(tokenizer_config, fp)

Note: While experimenting with tokenizer training, I found that encoding was done correctly, but when decoding with {do_lower_case: True, keep_accents: False}, the decoded sentence came out slightly changed.

So, by using the above settings, I got the sentences decoded perfectly. A reason may be that Sanskrit does not have casing, and words carry suffixes in the form of accents (combining marks).

You should try the settings which suit your language best.
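
To see the effect yourself, here is a small sketch (do_lower_case and keep_accents are standard AlbertTokenizer arguments) that loads the same tokenizer with both settings and compares the decoded text:

from transformers import AlbertTokenizer

sample = "ॐ सर्वेत्र सुखिनः सन्तु।"

#Lowercasing + accent stripping can alter the combining vowel signs (matras)
tok_stripped = AlbertTokenizer.from_pretrained("/content/data_dir", do_lower_case=True, keep_accents=False)
#The settings used in this notebook
tok_kept = AlbertTokenizer.from_pretrained("/content/data_dir", do_lower_case=False, keep_accents=True)

print(tok_stripped.decode(tok_stripped.encode(sample)))
print(tok_kept.decode(tok_kept.encode(sample)))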

torch.cuda.empty_cache()
gc.collect()
157

Creating a small corpus for testing; you can skip this.

with open("/content/corpus/train/tmp.txt", "w") as fp:
    fp.write(open("/content/corpus/train/full.txt", "r").read(100000))      #250KB
with open("/content/corpus/valid/val_val.txt", "w") as fp:
    fp.write(open("/content/corpus/valid/full_val.txt", "r").read(10000000)) #

Checkpointing is very important. --output_dir is the directory where the intermediate model checkpoints will be saved.

Note: You should checkpoint somewhere else, maybe to your Drive, and set --save_total_limit 2.
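
For example, in Colab you could mount Google Drive and point --output_dir there (a sketch; /content/drive is the standard Colab mount point, and the output path below is just an example):

from google.colab import drive

#Mount Drive so checkpoints survive runtime resets
drive.mount('/content/drive')
#Then pass e.g. --output_dir "/content/drive/My Drive/sanskrit_albert_ckpt" to the training script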

This is the training script; you should experiment with the arguments.

!python /content/transformers/examples/run_language_modeling.py --help

%load_ext tensorboard
%tensorboard --logdir logs

You see the magic here.

This script can be used to train most models for language modelling.

Another thing: observe that you directly specify --train_data_file as a .txt file. No need to generate any pretraining data, all thanks to the fast tokenizers used for loading the text.

Features are created dynamically when the training script starts. However, this is limited to GPUs only; I would love to see a TPU version too.

Make sure to change the batch sizes according to the GPU you have. I set them to 16 because of the 8 GB P4.
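
As a rough sanity check on training time, here is a back-of-the-envelope sketch: with --line_by_line every non-empty line becomes one example, so the optimizer steps per epoch are roughly the line count divided by the batch size (single GPU and no gradient accumulation assumed).

#Estimate steps per epoch for the --line_by_line setting
per_gpu_train_batch_size = 16
with open("/content/corpus/train/full.txt") as f:
    num_examples = sum(1 for line in f if line.strip())
print(num_examples, "examples ->", num_examples // per_gpu_train_batch_size, "steps per epoch")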

#To train from scratch
!python /content/transformers/examples/run_language_modeling.py \
        --model_type albert-base-v2 \
        --config_name /content/data_dir/ \
        --tokenizer_name /content/data_dir/ \
        --train_data_file /content/corpus/train/full.txt \
        --eval_data_file /content/corpus/valid/full_val.txt \
        --output_dir /content/data_dir \
        --do_train \
        --do_eval \
        --mlm \
        --line_by_line \
        --save_steps 500 \
        --logging_steps 500 \
        --save_total_limit 2 \
        --evaluate_during_training \
        --num_train_epochs 5 \
        --per_gpu_eval_batch_size 16 \
        --per_gpu_train_batch_size 16 \
        --block_size 256 \
        --seed 108 \
        --logging_dir logs
torch.cuda.empty_cache()
gc.collect()

Continuing Training

--model_name_or_path      #Refers to the checkpoint directory
--overwrite_output_dir    #This is used to continue from the last checkpoint

After a checkpoint, you just need that directory, the corpus files, and the tokenizer. All configs, model weights, and optimizer states are saved in --output_dir except the tokenizer.
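
You can confirm what a checkpoint directory holds before resuming (just a listing; checkpoint-500 matches the run above):

import os
#config and model weights plus optimizer/scheduler state live here; the tokenizer files stay in data_dir
print(sorted(os.listdir("/content/data_dir/checkpoint-500")))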

#To continue from checkpoint
#I have continued from 500 steps here, but you should use the latest saved checkpoint
!python /content/transformers/examples/run_language_modeling.py \
        --model_name_or_path /content/data_dir/checkpoint-500 \
        --model_type albert-base-v2 \
        --config_name /content/data_dir/ \
        --tokenizer_name /content/data_dir/ \
        --train_data_file /content/corpus/train/full.txt \
        --eval_data_file /content/corpus/valid/full_val.txt \
        --output_dir /content/data_dir \
        --do_train \
        --do_eval \
        --mlm \
        --line_by_line \
        --save_steps 500 \
        --logging_steps 500 \
        --save_total_limit 2 \
        --num_train_epochs 5 \
        --evaluate_during_training \
        --per_gpu_eval_batch_size 64 \
        --per_gpu_train_batch_size 64 \
        --block_size 256 \
        --seed 108 \
        --should_continue \
        --overwrite_output_dir

Saving for Uploading

Since training is complete, we can now upload the model to Hugging Face's model hub.

!mkdir sanskrit_albert
atokenizer = AlbertTokenizer.from_pretrained("/content/data_dir")
atokenizer.save_pretrained("/content/sanskrit_albert")
('/content/sanskrit_albert/spiece.model',
 '/content/sanskrit_albert/special_tokens_map.json',
 '/content/sanskrit_albert/added_tokens.json')
op = atokenizer.encode("ॐ असतो मा सद्गमय । तमसो मा ज्योतिर्गमय । मृत्योर्मा अमृतं गमय । ॐ शान्तिः शान्तिः शान्तिः ॥")
print(atokenizer.decode(op))
[CLS] ॐ असतो मा सद्गमय । तमसो मा ज्योतिर्गमय । मृत्योर्मा अमृतं गमय । ॐ शान्तिः शान्तिः शान्तिः ॥[SEP]
#I am using the checkpoint because there hasn't been much training
model = AlbertModel.from_pretrained("/content/data_dir/checkpoint-500")
model.save_pretrained("/content/sanskrit_albert")

Now all the files we want are in a separate folder, which is all we need to upload.

Tests

tokenizer = AlbertTokenizer.from_pretrained("/content/sanskrit_albert")
txt = "चरन्मार्गान्विजानाति ।"
op = tokenizer.encode(txt)
op
#See how it's tokenized!
[3, 15, 4280, 1345, 82, 177, 13866, 6, 4]
tokenizer.decode(op[:5]), tokenizer.decode(op[5:])
('[CLS] चरन्मार्गान्', 'विजानाति ।[SEP]')

This is the reason I set do_lower_case:False, and keep_accents:True
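
You can also inspect the raw subword pieces (a quick check with the same tokenizer and sentence as above); with keep_accents=True the vowel signs stay attached to their pieces:

#Look at the SentencePiece pieces instead of ids
print(tokenizer.tokenize(txt))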

ps = model(torch.tensor(op).unsqueeze(1))
print(ps[0].shape)
torch.Size([30, 1, 768])

This way you can get the embeddings for a sentence. Check ReSanskrit for some beautiful shloka quotes.
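
If you want a single vector per sentence rather than per-token outputs, one common option (a sketch, not something the training script does for you) is to mean-pool the last hidden states:

#Mean-pool token embeddings into one sentence vector
ids = torch.tensor(op).unsqueeze(0)            #shape: (1, seq_len)
with torch.no_grad():
    hidden = model(ids)[0]                     #last hidden states, shape: (1, seq_len, hidden_size)
sentence_vec = hidden.mean(dim=1).squeeze(0)   #shape: (hidden_size,)
print(sentence_vec.shape)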

Uploading to Models

!transformers-cli login

Make sure your model name is the name of the folder you upload.

Thus, my model would be surajp/sanskrit_albert, but I won't upload this as I have already uploaded one.

!transformers-cli upload /content/sanskrit_albert

And it's done! Since I have already uploaded a model, you can load it using surajp/albert-base-sanskrit

#this way
tokenizer = AutoTokenizer.from_pretrained("surajp/albert-base-sanskrit")
model = AutoModel.from_pretrained("surajp/albert-base-sanskrit")
enc=tokenizer.encode("अपि स्वर्णमयी लङ्का न मे लक्ष्मण रोचते । जननी जन्मभूमिश्च स्वर्गादपि गरीयसी ॥")
print(tokenizer.decode(enc))
[CLS] अपि स्वर्णमयी लङ्का न मे लक्ष्मण रोचते । जननी जन्मभूमिश्च स्वर्गादपि गरीयसी ॥[SEP]
ps = model(torch.tensor(enc).unsqueeze(1))
ps[0].shape
torch.Size([19, 1, 768])

I hope this notebook was helpful. 🤗

#StaySafe

This training covered only a small portion of Sanskrit literature. There is a huge amount of literature out there that I am collecting. This was only a checkpoint in training; I will train more once I collect more data.

I am also training different models for other Indian languages (Gujarati and Hindi for now).

If you know of any resources, please write to me.

parmarsuraj99@gmail.com

I am trying to find out whether the structure of a language has any effect on training (does a more structured language mean faster training?) and whether this can be useful for cross-lingual learning.

What are your thoughts about this?