Nepali Tokenizers: A Python Package for Nepali NLP
nepalitokenizers provides access to pre-trained WordPiece and SentencePiece (Unigram) tokenizers for the Nepali language, trained using HuggingFace's tokenizers library. It is a small Python package tailored specifically to Nepali, with a default set of configurations for the normalizer, pre-tokenizer, post-processor, and decoder.
It delegates further customization by providing an interface to HuggingFace's Tokenizer pipeline, allowing users to adapt the tokenizers to their requirements.
GitHub Link: https://github.com/basnetsoyuj/nepali-tokenizers
Installation
The package is available on PyPI. You can install nepalitokenizers using pip:
pip install nepalitokenizers
Usage
After installing the package, you can use the tokenizers in your Python code:
WordPiece Tokenizer
from nepalitokenizers import WordPiece
text = "हाम्रा सबै क्रियाकलापहरु भोलिवादी छन् । मेरो पानीजहाज वाम माछाले भरिपूर्ण छ । इन्जिनियरहरुले गएको हप्ता राजधानीमा त्यस्तै बहस गरे ।"
tokenizer_wp = WordPiece()
tokens = tokenizer_wp.encode(text)
print(tokens.ids)
print(tokens.tokens)
print(tokenizer_wp.decode(tokens.ids))
Output
[1, 11366, 8625, 14157, 8423, 13344, 9143, 8425, 1496, 9505, 22406, 11693, 12679, 8340, 27445, 1430, 1496, 13890, 9008, 9605, 13591, 14547, 9957, 12507, 8700, 1496, 2]
['[CLS]', 'हाम्रा', 'सबै', 'क्रियाकलाप', '##हरु', 'भोलि', '##वादी', 'छन्', '।', 'मेरो', 'पानीजहाज', 'वाम', 'माछा', '##ले', 'भरिपूर्ण', 'छ', '।', 'इन्जिनियर', '##हरुले', 'गएको', 'हप्ता', 'राजधानीमा', 'त्यस्तै', 'बहस', 'गरे', '।', '[SEP]']
हाम्रा सबै क्रियाकलापहरु भोलिवादी छन् । मेरो पानीजहाज वाम माछाले भरिपूर्ण छ । इन्जिनियरहरुले गएको हप्ता राजधानीमा त्यस्तै बहस गरे ।
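In the token list above, pieces prefixed with ## continue the previous word rather than starting a new one. As a rough illustration of that convention only (this is not the package's decoder, which is configured through HuggingFace's pipeline), the sketch below joins such pieces back into words. The helper name join_wordpieces and the hard-coded special tokens are assumptions made for this example.

```python
def join_wordpieces(tokens, special=("[CLS]", "[SEP]")):
    """Join WordPiece tokens into a string (illustrative helper, not part of the package)."""
    words = []
    for tok in tokens:
        if tok in special:
            # drop the special tokens added by the post-processor
            continue
        if tok.startswith("##") and words:
            # continuation piece: glue it onto the previous word
            words[-1] += tok[2:]
        else:
            # a new word starts here
            words.append(tok)
    return " ".join(words)


# a prefix of the token list printed above
tokens = ['[CLS]', 'हाम्रा', 'सबै', 'क्रियाकलाप', '##हरु', 'भोलि', '##वादी', 'छन्', '।', '[SEP]']
print(join_wordpieces(tokens))
# हाम्रा सबै क्रियाकलापहरु भोलिवादी छन् ।
```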
SentencePiece (Unigram) Tokenizer
from nepalitokenizers import SentencePiece
text = "कोभिड महामारीको पिडाबाट मुक्त नहुँदै मानव समाजलाई यतिबेला युद्धको विध्वंसकारी क्षतिको चिन्ताले चिन्तित बनाएको छ ।"
tokenizer_sp = SentencePiece()
tokens = tokenizer_sp.encode(text)
print(tokens.ids)
print(tokens.tokens)
print(tokenizer_sp.decode(tokens.ids))
Output
[7, 9, 3241, 483, 12081, 9, 11079, 23, 2567, 11254, 1002, 789, 20, 3334, 2161, 9, 23517, 2711, 1115, 9, 1718, 12, 5941, 781, 19, 8, 1, 0]
['▁', 'को', 'भि', 'ड', '▁महामारी', 'को', '▁पिडा', 'बाट', '▁मुक्त', '▁नहुँदै', '▁मानव', '▁समाज', 'लाई', '▁यतिबेला', '▁युद्ध', 'को', '▁विध्वंस', 'कारी', '▁क्षति', 'को', '▁चिन्ता', 'ले', '▁चिन्तित', '▁बनाएको', '▁छ', '▁।', '<sep>', '<cls>']
कोभिड महामारीको पिडाबाट मुक्त नहुँदै मानव समाजलाई यतिबेला युद्धको विध्वंसकारी क्षतिको चिन्ताले चिन्तित बनाएको छ ।
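In the pieces above, the ▁ symbol (U+2581) marks the start of a new word, and pieces without it attach to the preceding piece. The sketch below illustrates that convention by reassembling text from the pieces. It is an assumption-laden toy, not the package's decoder: the helper name join_sentencepieces is hypothetical, and the special tokens are simply the ones visible in the output above.

```python
WORD_START = "\u2581"  # the '▁' marker seen in the pieces above


def join_sentencepieces(pieces, special=("<sep>", "<cls>")):
    """Rebuild text from SentencePiece-style pieces (illustrative helper only)."""
    # drop the special tokens added by the post-processor
    kept = [p for p in pieces if p not in special]
    # concatenate everything, then turn each word-start marker into a space
    return "".join(kept).replace(WORD_START, " ").strip()


# a prefix of the piece list printed above, plus the trailing special tokens
pieces = ['▁', 'को', 'भि', 'ड', '▁महामारी', 'को', '▁पिडा', 'बाट', '<sep>', '<cls>']
print(join_sentencepieces(pieces))
```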
Configuration & Customization
Each tokenizer class has a default set of configurations for the normalizer, pre-tokenizer, post-processor, and decoder. For more information, see the training files available in the train/ directory.
The package delegates further customization by providing direct access to HuggingFace's tokenizer pipeline. Therefore, you can treat nepalitokenizers's tokenizer instances as HuggingFace's Tokenizer objects. For example:
from nepalitokenizers import WordPiece
# importing from the HuggingFace tokenizers package
from tokenizers.processors import TemplateProcessing
text = "हाम्रो मातृभूमि नेपाल हो"
tokenizer_wp = WordPiece()
# using default post processor
tokens = tokenizer_wp.encode(text)
print(tokens.tokens)
# change the post processor to not add any special tokens
# treat tokenizer_wp as HuggingFace's Tokenizer object
tokenizer_wp.post_processor = TemplateProcessing()
tokens = tokenizer_wp.encode(text)
print(tokens.tokens)
Output
['[CLS]', 'हाम्रो', 'मातृ', '##भूमि', 'नेपाल', 'हो', '[SEP]']
['हाम्रो', 'मातृ', '##भूमि', 'नेपाल', 'हो']
To learn more about further customizations that can be performed, visit HuggingFace's Tokenizer Documentation.
Note: The delegation to HuggingFace's Tokenizer pipeline was done with the following generic wrapper class because tokenizers.Tokenizer is not an acceptable base type for inheritance. It is a useful trick I use for solving similar issues:
class Delegate:
    """
    A generic wrapper class that delegates attributes and method calls
    to the specified self.delegate instance.
    """

    @property
    def _items(self):
        return dir(self.delegate)

    def __getattr__(self, name):
        if name in self._items:
            return getattr(self.delegate, name)
        raise AttributeError(
            f"'{self.__class__.__name__}' object has no attribute '{name}'")

    def __setattr__(self, name, value):
        if name == "delegate" or name not in self._items:
            super().__setattr__(name, value)
        else:
            setattr(self.delegate, name, value)

    def __dir__(self):
        return dir(type(self)) + list(self.__dict__.keys()) + self._items
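To make the behaviour of the wrapper concrete, here is a small self-contained sketch that applies the same Delegate class to a toy object instead of a tokenizer. The Engine and Car classes are hypothetical and exist only for this illustration. Note that self.delegate has to be assigned before any other attribute, since every other lookup consults it.

```python
class Delegate:
    """Generic wrapper that forwards attribute access to self.delegate."""

    @property
    def _items(self):
        return dir(self.delegate)

    def __getattr__(self, name):
        # only called when normal lookup on the wrapper fails
        if name in self._items:
            return getattr(self.delegate, name)
        raise AttributeError(
            f"'{self.__class__.__name__}' object has no attribute '{name}'")

    def __setattr__(self, name, value):
        if name == "delegate" or name not in self._items:
            super().__setattr__(name, value)
        else:
            # the wrapped object already has this attribute: set it there
            setattr(self.delegate, name, value)

    def __dir__(self):
        return dir(type(self)) + list(self.__dict__.keys()) + self._items


class Engine:  # hypothetical wrapped class
    def __init__(self):
        self.power = 100

    def start(self):
        return "started"


class Car(Delegate):  # hypothetical wrapper
    def __init__(self):
        self.delegate = Engine()  # must be assigned first
        self.color = "red"        # unknown to Engine, so stored on the Car


car = Car()
print(car.start())          # forwarded to Engine.start -> started
car.power = 250             # Engine has 'power', so it is set on the Engine
print(car.delegate.power)   # 250
print("power" in vars(car)) # False
```

The same mechanism is what lets an assignment such as setting post_processor on a wrapper instance reach the underlying Tokenizer object.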
Training
The Python files used to train the tokenizers are available in the train/ directory. You can also use these files to train your own tokenizers on a custom text corpus.
These tokenizers were trained on two datasets:
1. The Nepali Subset of the OSCAR dataset
You can download it using the following code:
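import os

import datasets
from tqdm.auto import tqdm

dataset = datasets.load_dataset(
    'oscar', 'unshuffled_deduplicated_ne',
    split='train'
)

# create the output directory if it does not already exist
os.makedirs('data', exist_ok=True)

batch = []
counter = 0
for sample in tqdm(dataset):
    # keep each document on a single line
    sample = sample['text'].replace('\n', ' ')
    batch.append(sample)
    # write one file per 10,000 documents
    if len(batch) == 10_000:
        with open(f'data/ne_{counter}.txt', 'w', encoding='utf-8') as f:
            f.write('\n'.join(batch))
        batch = []
        counter += 1

# write the remaining documents that did not fill a full batch
if batch:
    with open(f'data/ne_{counter}.txt', 'w', encoding='utf-8') as f:
        f.write('\n'.join(batch))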
2. A Large Scale Nepali Text Corpus by Rabindra Lamsal (2020)
To download the dataset, follow the instructions provided in this link: A Large Scale Nepali Text Corpus.
License
This package is licensed under the Apache 2.0 License, which is consistent with the license used by HuggingFace's tokenizers library. Please see the LICENSE file for more details.