Sanskrit Albert
Training a Language model from scratch on Sanskrit using the HuggingFace library, and how to train your own model too!
!nvidia-smi
import os
import gc
import glob
import torch
import pickle
import joblib
from tqdm.auto import tqdm
HuggingFace recently updated their scripts, and the updated version is not yet available on pip, so we'll build from source.
!pip install tokenizers
#!pip install transformers
!git clone https://github.com/huggingface/transformers
!pip install transformers/.
I have used a Sanskrit corpus from a Kaggle dataset. Feel free to skip this and use your own dataset. The training data needs to be in a .txt file. I have also used the validation split of the same dataset for evaluation.
I need the Kaggle API to download the dataset. You can load your text corpus from anywhere.
You can download a corpus for your language from https://traces1.inria.fr/oscar.
I have used data from there too and appended it to the corpus from Kaggle.
!mkdir -p ~/.kaggle
!cp kaggle.json ~/.kaggle/ 
!chmod 600 ~/.kaggle/kaggle.json 
!mkdir corpus 
#directory for saving all corpus files in one place. You can save them anywhere
#From Kaggle
!kaggle datasets download -d disisbig/sanskrit-wikipedia-articles
!unzip /content/sanskrit-wikipedia-articles.zip -d /content/corpus
#From OSCAR corpus
!wget https://traces1.inria.fr/oscar/files/compressed-orig/sa.txt.gz
!gunzip /content/sa.txt.gz
#Reading sample
with open("/content/sa.txt", "r") as fp:
    print(fp.read(1000))
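If disk space is tight, the compressed OSCAR file can also be sampled without running gunzip first. This is a minimal sketch using Python's standard gzip module; the helper name `peek_gzip_text` is my own and is not part of the notebook.

```python
import gzip


def peek_gzip_text(path, n_chars=1000, encoding="utf-8"):
    """Return the first n_chars characters of a gzip-compressed text file."""
    # "rt" opens the archive in text mode, so decoding happens while streaming
    # and only the requested prefix is decompressed.
    with gzip.open(path, "rt", encoding=encoding) as fp:
        return fp.read(n_chars)
```

For example, `print(peek_gzip_text("/content/sa.txt.gz"))` would show the same kind of sample as the cell above, assuming the archive is still at that path.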
import glob
train_list = glob.glob("/content/corpus/train/train/*.txt")
valid_list = glob.glob("/content/corpus/valid/valid/*.txt")
#reading and appending all small files to single Train and Valid files
with open("/content/corpus/train/full.txt", "wb") as outfile:
    for f in train_list:
        with open(f, "rb") as infile:
            outfile.write(infile.read())
            outfile.write(b"\n\n")
    with open("/content/sa.txt", "rb") as infile:
        outfile.write(infile.read())
with open("/content/corpus/valid/full_val.txt", "wb") as outfile:
    for f in valid_list:
        with open(f, "rb") as infile:
            outfile.write(infile.read())
            outfile.write(b"\n\n")
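The merging step above can also be wrapped in a small reusable helper. This is only a sketch: the function name `concat_text_files` is hypothetical, and it streams each file with `shutil.copyfileobj` instead of reading it fully into memory, which may help with a large corpus.

```python
import shutil


def concat_text_files(sources, destination, separator=b"\n\n"):
    """Write every file in `sources` into `destination`, separated by blank lines."""
    with open(destination, "wb") as outfile:
        # glob returns files in arbitrary order; sorting makes the result reproducible.
        for src in sorted(sources):
            with open(src, "rb") as infile:
                shutil.copyfileobj(infile, outfile)  # copies in chunks
            outfile.write(separator)
```

With the variables from the cell above, a call would look like `concat_text_files(valid_list, "/content/corpus/valid/full_val.txt")`.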
Directory to save the trained tokenizer and configuration files
!mkdir data_dir
import sentencepiece as spm
from tokenizers import SentencePieceBPETokenizer, BertWordPieceTokenizer
%%time
#Albert Tokenizer uses SentencePiece tokenization, so I have used sentencepiece to train the tokenizer.
#This will take a while
spm.SentencePieceTrainer.Train('--input=/content/corpus/train/full.txt \
                                --model_prefix=m \
                                --vocab_size=32000 \
                                --control_symbols=[CLS],[SEP],[MASK]')
with open("m.vocab") as v:
    print(v.read(2000))
!mkdir -p /content/data_dir/
!cp /content/m.model -d /content/data_dir/spiece.model
!cp /content/m.vocab -d /content/data_dir/spiece.vocab
Make sure to check out the fast tokenizers from HuggingFace; they are really fast! You can compare them with sentencepiece.
%%time
tokenizer = SentencePieceBPETokenizer()
tokenizer.train("/content/corpus/train/full.txt")
This is a very beautiful shlok ❤️. Let's just pray for this 🙏. Do search for the quotes used in this notebook; I am sure you will love them!
txt = "ॐ सर्वेत्र सुखिनः सन्तु| सर्वे सन्तु निरामयाः| सर्वे भद्राणि पश्यन्तु| माँ कश्चिद् दुःख माप्नुयात॥ ॐ शांतिः शांतिः शांतिः ॥"
enc = tokenizer.encode(txt)
print(tokenizer.decode(enc.ids))
The tokenizer seems to work. But since the training script is configured to use the Albert tokenizer, we need to use spiece.model and spiece.vocab for the training script.
The HuggingFace tokenizer creates the files ['/content/hft/vocab.json', '/content/hft/merges.txt'], while AlbertTokenizer requires a spiece.model file. So we'll use the vocab and tokenizer model saved by sentencepiece.
!mkdir hft
tokenizer.save("/content/hft")
#we won't be using this
from transformers import *
#Keep in mind, This is a tokenizer for Albert, unlike the previous one, which is a generic one.
#We'll load it in the form of Albert Tokenizer.
tokenizer = AlbertTokenizer.from_pretrained("/content/data_dir")
op = tokenizer.encode("नैनं छिन्दन्ति शस्त्राणि नैनं दहति पावकः। न चैनं क्लेदयन्त्यापो न शोषयति मारुतः॥")
tokenizer.decode(op)
Looks like the tokenizer is working.
This is important. The training script needs a configuration for the model.
Architecture refers to what the model is going to be used for, e.g., language modelling (AlbertForMaskedLM) or sequence classification. Just take a look at the left panel for model architectures.
#Checking vocabulary size
vocab_size=tokenizer.vocab_size ; vocab_size
import json
config = {
    "architectures": [
        "AlbertModel"
    ],
	"attention_probs_dropout_prob": 0.1,
	"hidden_act": "gelu",
	"hidden_dropout_prob": 0.1,
	"hidden_size": 768,
	"initializer_range": 0.02,
	"intermediate_size": 3072,
	"layer_norm_eps": 1e-05,
	"max_position_embeddings": 514,
	"model_type": "albert",
	"num_attention_heads": 12,
	"num_hidden_layers": 6,
	"type_vocab_size": 1,
	"vocab_size": vocab_size
}
with open("/content/data_dir/config.json", 'w') as fp:
    json.dump(config, fp)
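Before launching a long run, it can be worth sanity-checking the configuration values against each other. The sketch below is my own helper (`check_albert_config` is a hypothetical name, not part of transformers). It checks two constraints I am confident about: the hidden size has to split evenly across the attention heads, and the training block size cannot exceed the number of position embeddings.

```python
def check_albert_config(config, block_size):
    """Return a list of human-readable problems found in a config dict."""
    problems = []
    # Each attention head gets hidden_size / num_attention_heads dimensions.
    if config["hidden_size"] % config["num_attention_heads"] != 0:
        problems.append("hidden_size must be divisible by num_attention_heads")
    # Sequences longer than the position table cannot be embedded.
    if block_size > config["max_position_embeddings"]:
        problems.append("block_size exceeds max_position_embeddings")
    if config["vocab_size"] <= 0:
        problems.append("vocab_size must be positive")
    return problems


# Illustrative values mirroring the config above (vocab size assumed to be 32000).
example_config = {
    "hidden_size": 768,
    "num_attention_heads": 12,
    "max_position_embeddings": 514,
    "vocab_size": 32000,
}
print(check_albert_config(example_config, block_size=256))
```

An empty list means none of these checks failed; it does not guarantee the model will fit in GPU memory.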
#Configuration for tokenizer.
#Note: I set do_lower_case: False, and keep_accents:True
tokenizer_config = {
	"max_len": 512,
	"model_type": "albert",
	"do_lower_case":False, 
	"keep_accents":True
}
with open("/content/data_dir/tokenizer_config.json", 'w') as fp:
    json.dump(tokenizer_config, fp)
Note: While experimenting with tokenizer training, I found that encoding was done correctly, but when decoding with {do_lower_case: True, keep_accents: False}, the decoded sentence was slightly changed.
With the settings above, the sentences were decoded perfectly. One reason may be that Sanskrit has no letter case, and words carry suffixes in the form of accents.
You should try the settings which suit your language best.
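To see why accent stripping can alter Devanagari text, here is a small sketch using only Python's unicodedata module. It assumes that keep_accents=False amounts to NFKD normalization followed by dropping combining characters, which is my understanding of the Albert tokenizer's preprocessing; the function `strip_accents` below is my own illustration, not the library's code. In Devanagari, the virama (U+094D) that joins consonants has a non-zero combining class, so it is removed and the conjunct falls apart.

```python
import unicodedata


def strip_accents(text):
    """Drop characters with a non-zero combining class after NFKD normalization."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


word = "सन्तु"
stripped = strip_accents(word)
print(word, "->", stripped)
print("virama kept:", "\u094d" in stripped)
print("lowercasing changes the word:", word.lower() != word)
```

Lowercasing leaves the word untouched because Devanagari has no case, while accent stripping removes the virama, which would explain the slightly different decoded sentences.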
torch.cuda.empty_cache()
gc.collect()
Creating a small corpus for testing. You can skip this.
with open("/content/corpus/train/full.txt", "r") as src, open("/content/corpus/train/tmp.txt", "w") as fp:
    fp.write(src.read(100000))      #first 100,000 characters, roughly 250KB
with open("/content/corpus/valid/full_val.txt", "r") as src, open("/content/corpus/valid/val_val.txt", "w") as fp:
    fp.write(src.read(10000000))    #first 10,000,000 characters
Checkpointing is very important. This is a directory where the intermediate model and tokenizer will be saved.
Note: You should checkpoint somewhere else, maybe to your Drive, and set
--save_total_limit 2
This is the training script. You should experiment with the arguments.
!python /content/transformers/examples/run_language_modeling.py --help
%load_ext tensorboard
%tensorboard --logdir logs
You see the magic here.
This script can be used to train most models for language modelling.
Another thing: observe that you directly specify --train_data_file in .txt format. No need to generate any pretraining data! All thanks to the fast tokenizers used for loading the text.
Features are created dynamically when the training script starts. However, this is limited to GPUs only. I would love to see a TPU version too.
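For intuition about what the --mlm flag asks the model to learn, here is a pure-Python sketch of the standard BERT-style masking recipe: pick about 15% of the tokens; of those, replace 80% with the mask token, 10% with a random token, and leave 10% unchanged. This is an illustration under my own assumptions, not the script's actual data collator; the function name and the token ids used are hypothetical. The label -100 is the value PyTorch's cross-entropy loss ignores by default.

```python
import random


def mask_tokens(token_ids, mask_id, vocab_size, special_ids=(),
                mlm_probability=0.15, rng=None):
    """Return (inputs, labels) for masked language modelling."""
    rng = rng or random.Random()
    inputs = list(token_ids)
    labels = [-100] * len(inputs)  # -100 = position not used in the loss
    for i, tok in enumerate(token_ids):
        # Never mask special tokens such as [CLS] or [SEP].
        if tok in special_ids or rng.random() >= mlm_probability:
            continue
        labels[i] = tok  # the model must predict the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id                    # 80%: mask token
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: random token
        # remaining 10%: keep the original token
    return inputs, labels
```

Passing a seeded `random.Random` as `rng` makes the masking reproducible, which helps when debugging.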
Make sure to change the batch sizes according to the GPU you have. I set them to 16 because of the 8 GB P4.
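When picking a batch size, it helps to estimate how many optimizer steps and checkpoints a run will produce. The arithmetic below is a rough sketch that assumes a single GPU, no gradient accumulation, and one training example per line (as with --line_by_line). The line count of 100,000 is a made-up figure for illustration, not the size of this corpus.

```python
import math


def estimate_training(num_lines, batch_size, epochs, save_steps):
    """Return (steps_per_epoch, total_steps, checkpoints_written)."""
    steps_per_epoch = math.ceil(num_lines / batch_size)  # last batch may be partial
    total_steps = steps_per_epoch * epochs
    checkpoints = total_steps // save_steps  # one checkpoint every save_steps steps
    return steps_per_epoch, total_steps, checkpoints


# Hypothetical corpus of 100,000 lines with the settings used in this notebook.
print(estimate_training(num_lines=100_000, batch_size=16, epochs=5, save_steps=500))
```

Doubling the batch size roughly halves the number of steps, so --save_steps and --logging_steps may need adjusting along with it.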
#To train from scratch
!python /content/transformers/examples/run_language_modeling.py \
        --model_type albert-base-v2 \
        --config_name /content/data_dir/ \
        --tokenizer_name /content/data_dir/ \
        --train_data_file /content/corpus/train/full.txt \
        --eval_data_file /content/corpus/valid/full_val.txt \
        --output_dir /content/data_dir \
        --do_train \
        --do_eval \
        --mlm \
        --line_by_line \
        --save_steps 500 \
        --logging_steps 500 \
        --save_total_limit 2 \
        --evaluate_during_training \
        --num_train_epochs 5 \
        --per_gpu_eval_batch_size 16 \
        --per_gpu_train_batch_size 16 \
        --block_size 256 \
        --seed 108 \
        --should_continue \
        --logging_dir logs
torch.cuda.empty_cache()
gc.collect()
Continuing Training
--model_name_or_path      #Refers to the checkpoint directory
--overwrite_output_dir    #This is used to continue from the last checkpoint
After a checkpoint, you just need that directory, the corpus files, and the tokenizer. All configs, models, and optimizers are saved in --output_dir, except the tokenizer.
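To resume from the most recent checkpoint, you first need to find it. The sketch below is my own helper (`latest_checkpoint` is a hypothetical name), assuming checkpoints are saved as folders named checkpoint-<step>, as in the paths used in this notebook. It compares the step numbers as integers, because a plain string sort would rank checkpoint-500 after checkpoint-1500.

```python
import re
from pathlib import Path


def latest_checkpoint(output_dir):
    """Return the checkpoint-<step> folder with the highest step, or None."""
    pattern = re.compile(r"^checkpoint-(\d+)$")
    best_step, best_path = -1, None
    for path in Path(output_dir).iterdir():
        match = pattern.match(path.name)
        if match and path.is_dir():
            step = int(match.group(1))  # numeric comparison, not alphabetical
            if step > best_step:
                best_step, best_path = step, path
    return best_path
```

The returned path can then be passed to --model_name_or_path, for example after calling `latest_checkpoint("/content/data_dir")`.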
#To continue from checkpoint
#I have continued from 500 steps here, but you should use the latest saved model
!python /content/transformers/examples/run_language_modeling.py \
        --model_name_or_path /content/data_dir/checkpoint-500 \
        --model_type albert-base-v2 \
        --config_name /content/data_dir/ \
        --tokenizer_name /content/data_dir/ \
        --train_data_file /content/corpus/train/full.txt \
        --eval_data_file /content/corpus/valid/full_val.txt \
        --output_dir /content/data_dir \
        --do_train \
        --do_eval \
        --mlm \
        --line_by_line \
        --save_steps 500 \
        --logging_steps 500 \
        --save_total_limit 2 \
        --num_train_epochs 5 \
        --evaluate_during_training \
        --per_gpu_eval_batch_size 64 \
        --per_gpu_train_batch_size 64 \
        --block_size 256 \
        --seed 108 \
        --should_continue \
        --overwrite_output_dir
Since training is complete, we can now upload the model to HuggingFace's Models.
!mkdir sanskrit_albert
atokenizer = AlbertTokenizer.from_pretrained("/content/data_dir")
atokenizer.save_pretrained("/content/sanskrit_albert")
op = atokenizer.encode("ॐ असतो मा सद्गमय । तमसो मा ज्योतिर्गमय । मृत्योर्मा अमृतं गमय । ॐ शान्तिः शान्तिः शान्तिः ॥")
print(atokenizer.decode(op))
#I am using a checkpoint because there has not been much training
model = AlbertModel.from_pretrained("/content/data_dir/checkpoint-500")
model.save_pretrained("/content/sanskrit_albert")
Now all the files we want are in a separate folder, which is all we need to upload.
tokenizer = AlbertTokenizer.from_pretrained("/content/sanskrit_albert")
txt = "चरन्मार्गान्विजानाति ।"
op = tokenizer.encode(txt)
op
#See how it's tokenized!
tokenizer.decode(op[:5]), tokenizer.decode(op[5:])
This is the reason I set do_lower_case:False, and keep_accents:True
ps = model(torch.tensor(op).unsqueeze(0))  #add a batch dimension: input shape (1, seq_len)
print(ps[0].shape)
This way you can get the embeddings for a sentence. Check ReSanskrit for some beautiful shlok quotes.
!transformers-cli login
The folder name becomes the model name under which this will be uploaded.
Thus, my model would be surajp/sanskrit_albert,
but I won't upload this as I have already uploaded one.
!transformers-cli upload /content/sanskrit_albert
And it's done! Since I have already uploaded a model, you can load it using surajp/albert-base-sanskrit
#this way
tokenizer = AutoTokenizer.from_pretrained("surajp/albert-base-sanskrit")
model = AutoModel.from_pretrained("surajp/albert-base-sanskrit")
enc=tokenizer.encode("अपि स्वर्णमयी लङ्का न मे लक्ष्मण रोचते । जननी जन्मभूमिश्च स्वर्गादपि गरीयसी ॥")
print(tokenizer.decode(enc))
ps = model(torch.tensor(enc).unsqueeze(0))  #add a batch dimension: input shape (1, seq_len)
ps[0].shape
I hope this notebook was helpful. 🤗
#StaySafe
This training covered only a small portion of Sanskrit literature. There is a huge amount of literature out there, which I am collecting. This was only a checkpoint for training; I will train more once I collect more data.
I am also training different models for other Indian languages (Gujarati and Hindi for now).
If you know of any resources, please write to me.
parmarsuraj99@gmail.com
I am trying to find out whether the structure of a language can have any effect on training (more structured language => faster training), and whether this can be useful for cross-lingual learning.
What are your thoughts about this?