Welcome to Urduhack Documentation!¶
What is Urduhack?¶
Urduhack is an open-source NLP library for the Urdu language. It comes with many batteries-included features to help you process Urdu data in the easiest way possible.
Our Goal¶
- Academic users: easier experimentation to prove their hypotheses without coding from scratch.
- NLP beginners: learn how to build an NLP project with production-level code quality.
- NLP developers: build a production-level application within minutes.
Urduhack is maintained by Ikram Ali and Contributors.
Installation¶
Note
Urduhack is supported on the following Python versions
Python | 3.8 | 3.7 | 3.6 | 2.7
Urduhack | Yes | Yes | Yes | No
Install Urduhack via pip¶
Note
Urduhack is developed using Tensorflow. It needs the Tensorflow CPU build for prediction; for development and training of the models it uses Tensorflow-gpu. The following instructions will install Tensorflow along with Urduhack.
The easiest way to install urduhack is via pip.
- Installing with the Tensorflow CPU version:
$ pip install urduhack[tf]
- Installing with the Tensorflow GPU version:
$ pip install urduhack[tf-gpu]
Package Dependencies¶
Given its many features, urduhack depends on a number of other packages. To avoid conflicts, it is preferable to create a virtual environment and install urduhack inside it.
- Tensorflow ~= 2.4: used for training, evaluating and testing deep neural network models.
- Transformers: used for the BERT implementation for training and evaluation.
- Tensorflow-datasets: used to download and prepare datasets and read them into a model using the tf.data.Dataset API.
- Click: the library the Urduhack command-line application is built with.
Downloading Models¶
Pythonic Way¶
- You can download the models using Urduhack code:
>>> import urduhack
>>> urduhack.download()
Command line¶
- To download the models all you have to do is run this simple command in the command line.
$ urduhack download
This command will download the models which will be used by urduhack.
Quickstart¶
Every Python package needs an import statement, so let's do that first:
>>> import urduhack
Overview¶
Urdu Characters¶
The Urdu alphabet is the right-to-left alphabet used for the Urdu language. It is a modification of the Persian alphabet, known as Perso-Arabic, which is itself a derivative of the Arabic alphabet. The Urdu alphabet has up to 58 letters, with 39 basic letters and no distinct letter cases, and is typically written in the calligraphic Nastaʿlīq script.
46 alphabet letters, 10 digits, 6 punctuation marks, 6 diacritics.
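Urdu text lives inside the Arabic block of Unicode. A rough sketch (not part of the urduhack API) of the range check this implies:

```python
# Checking whether characters fall inside the Arabic Unicode block
# (U+0600-U+06FF) that Urdu script uses.

def is_urdu_char(ch: str) -> bool:
    """Return True if the character lies in the U+0600-U+06FF block."""
    return "\u0600" <= ch <= "\u06FF"

word = "اردو"  # "Urdu" written in Urdu script
print(all(is_urdu_char(ch) for ch in word))  # every letter is in range: True
print(is_urdu_char("a"))                     # Latin letters are not: False
```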
Normalization¶
The normalization of Urdu text is necessary to make it useful for machine learning tasks. The normalization module handles the most basic problems faced when working with Urdu data with ease and efficiency. All the problems, and how the normalization module handles them, are listed below.
This module fixes the encoding of Urdu characters and replaces Arabic characters with the correct Urdu ones, bringing all characters into the Unicode range specified for the Urdu language (0600-06FF). It also fixes the word-joining problem: when the space between two Urdu words is removed, they must not merge into a new word; their rendering must not change, and they should look the same even after the space is removed.
>>> from urduhack import normalize
>>> text = normalize(text)
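The Arabic-to-Urdu replacement described above can be sketched with `str.translate` over a code-point mapping. This is an illustration only; urduhack's own mapping tables are much larger than the two entries shown here:

```python
# Toy mapping of two common Arabic code points to their Urdu equivalents.
ARABIC_TO_URDU = {
    0x064A: 0x06CC,  # Arabic Yeh 'ي' -> Urdu Yeh 'ی'
    0x0643: 0x06A9,  # Arabic Kaf 'ك' -> Urdu Kaf 'ک'
}

def fix_arabic_letters(text: str) -> str:
    # str.translate maps code point -> code point in one pass over the text
    return text.translate(ARABIC_TO_URDU)

print(fix_arabic_letters("كيا"))  # Arabic letters replaced: 'کیا'
```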
Urdu Stopwords¶
Stop words are natural language words that carry very little meaning, such as "and", "the", "a", "an", and similar words. These words are highly redundant in texts and do not contribute much, so removing them during pre-processing of the data is sometimes a viable approach.
>>> from urduhack.stop_words import STOP_WORDS, remove_stopwords
>>> print(STOP_WORDS)
>>> text = remove_stopwords(text)
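The idea behind `remove_stopwords` can be sketched with a tiny toy set; the real `STOP_WORDS` set shipped with urduhack is far larger than the one assumed here:

```python
# A toy stop-word set for illustration only; not urduhack's actual list.
TOY_STOP_WORDS = {"کا", "کی", "کے", "ہے", "اور"}

def drop_stopwords(text: str) -> str:
    # Keep only words that are not in the stop-word set.
    return " ".join(w for w in text.split() if w not in TOY_STOP_WORDS)

print(drop_stopwords("پاکستان کا دارالحکومت اسلام آباد ہے"))
# the words 'کا' and 'ہے' are dropped
```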
Tokenization¶
This module is another crucial part of Urduhack. It performs sentence tokenization: it separates sentences from each other and converts each string into a complete sentence token. Do not confuse sentence tokens with word tokens; they are two completely different things. The library also provides a state-of-the-art word tokenizer for the Urdu language, which takes care of spaces and of where two Urdu characters should and should not be connected.
The tokenization of Urdu text is necessary to make it useful for the NLP tasks. This module provides the following functionality:
- Sentence Tokenization
- Word Tokenization
The tokenization of Urdu text is necessary to make it useful for machine learning tasks. The tokenization module solves the problems related to sentence and word tokenization.
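As a rough sketch of what rule-based sentence tokenization involves, one can split on the Urdu full stop '۔' (U+06D4). urduhack's actual tokenizer is model-based and handles far more cases than this toy version:

```python
# Naive sentence splitter for illustration only: split on the Urdu full
# stop '۔' and re-attach it to each sentence.
def naive_sentence_split(text: str) -> list:
    parts = [s.strip() for s in text.split("۔")]
    return [s + "۔" for s in parts if s]

text = "عراق اور شام نے اعلان کیا ہے۔ دونوں ممالک سفیر واپس بھیجیں گے۔"
for sent in naive_sentence_split(text):
    print(sent)  # prints two sentences, one per line
```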
Tutorial¶
CoNLL-U Format¶
We aspire to maintain data for all tasks in the CoNLL-U format, which holds sentence-level and token-level data along with their attributes. Below we show how to use urduhack's CoNLL module.
>>> from urduhack import CoNLL
To iterate over sentences in CoNLL-U format we use the iter_string() function.
>>> from urduhack.conll.tests.test_parser import CONLL_SENTENCE
It will yield a sentence in proper CoNLL-U format from which we can extract sentence level and token level attributes.
>>> for sentence in CoNLL.iter_string(CONLL_SENTENCE):
sent_meta, tokens = sentence
print(f"Sentence ID: {sent_meta['sent_id']}")
print(f"Sentence Text: {sent_meta['text']}")
for token in tokens:
print(token)
{'id': '1', 'text': 'والدین', 'lemma': 'والدین', 'upos': 'NOUN', 'xpos': 'NN', 'feats': 'Case=Acc|Gender=Masc|Number=Sing|Person=3', 'head': '4', 'deprel': 'nsubj', 'deps': '_', 'misc': 'Vib=0|Tam=0|ChunkId=NP|ChunkType=head'}
{'id': '2', 'text': 'معمولی', 'lemma': 'معمولی', 'upos': 'ADJ', 'xpos': 'JJ', 'feats': 'Case=Nom', 'head': '3', 'deprel': 'advmod', 'deps': '_', 'misc': 'ChunkId=JJP|ChunkType=head'}
{'id': '3', 'text': 'زخمی', 'lemma': 'زخمی', 'upos': 'ADJ', 'xpos': 'JJ', 'feats': 'Case=Nom|Gender=Masc|Number=Sing|Person=3', 'head': '4', 'deprel': 'compound', 'deps': '_', 'misc': 'Vib=0|Tam=0|ChunkId=JJP2|ChunkType=head'}
{'id': '4', 'text': 'ہوئے', 'lemma': 'ہو', 'upos': 'VERB', 'xpos': 'VM', 'feats': 'Aspect=Perf|Number=Plur|Person=2|Polite=Form|VerbForm=Part|Voice=Act', 'head': '0', 'deprel': 'root', 'deps': '_', 'misc': 'Vib=یا|Tam=yA|ChunkId=VGF|ChunkType=head|Stype=declarative'}
{'id': '5', 'text': 'ہےں', 'lemma': 'ہے', 'upos': 'AUX', 'xpos': 'VAUX', 'feats': 'Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin', 'head': '4', 'deprel': 'aux', 'deps': '_', 'misc': 'SpaceAfter=No|Vib=ہے|Tam=hE|ChunkId=VGF|ChunkType=child'}
{'id': '6', 'text': '۔', 'lemma': '۔', 'upos': 'PUNCT', 'xpos': 'SYM', 'feats': '_', 'head': '4', 'deprel': 'punct', 'deps': '_', 'misc': 'ChunkId=VGF|ChunkType=child'}
To load a file in CoNLL-U format, we use the urduhack.CoNLL.load_file() function.
>>> sentences = CoNLL.load_file("urdu_text.conll")
>>> for sentence in sentences:
sent_meta, tokens = sentence
print(f"Sentence ID: {sent_meta['sent_id']}")
print(f"Sentence Text: {sent_meta['text']}")
for token in tokens:
print(token)
{'id': '1', 'text': 'والدین', 'lemma': 'والدین', 'upos': 'NOUN', 'xpos': 'NN', 'feats': 'Case=Acc|Gender=Masc|Number=Sing|Person=3', 'head': '4', 'deprel': 'nsubj', 'deps': '_', 'misc': 'Vib=0|Tam=0|ChunkId=NP|ChunkType=head'}
{'id': '2', 'text': 'معمولی', 'lemma': 'معمولی', 'upos': 'ADJ', 'xpos': 'JJ', 'feats': 'Case=Nom', 'head': '3', 'deprel': 'advmod', 'deps': '_', 'misc': 'ChunkId=JJP|ChunkType=head'}
{'id': '3', 'text': 'زخمی', 'lemma': 'زخمی', 'upos': 'ADJ', 'xpos': 'JJ', 'feats': 'Case=Nom|Gender=Masc|Number=Sing|Person=3', 'head': '4', 'deprel': 'compound', 'deps': '_', 'misc': 'Vib=0|Tam=0|ChunkId=JJP2|ChunkType=head'}
{'id': '4', 'text': 'ہوئے', 'lemma': 'ہو', 'upos': 'VERB', 'xpos': 'VM', 'feats': 'Aspect=Perf|Number=Plur|Person=2|Polite=Form|VerbForm=Part|Voice=Act', 'head': '0', 'deprel': 'root', 'deps': '_', 'misc': 'Vib=یا|Tam=yA|ChunkId=VGF|ChunkType=head|Stype=declarative'}
{'id': '5', 'text': 'ہےں', 'lemma': 'ہے', 'upos': 'AUX', 'xpos': 'VAUX', 'feats': 'Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin', 'head': '4', 'deprel': 'aux', 'deps': '_', 'misc': 'SpaceAfter=No|Vib=ہے|Tam=hE|ChunkId=VGF|ChunkType=child'}
{'id': '6', 'text': '۔', 'lemma': '۔', 'upos': 'PUNCT', 'xpos': 'SYM', 'feats': '_', 'head': '4', 'deprel': 'punct', 'deps': '_', 'misc': 'ChunkId=VGF|ChunkType=child'}
Pipeline Module¶
Pipeline is a special module in urduhack. Its importance comes from the fact that it operates at the document, sentence and token levels: using the pipeline module we can convert a document into sentences and a sentence into tokens in one go, and then run models or any other operation at each of those levels. We will now go through these steps one by one.
Document¶
We can build the document using the pipeline module.
>>> from urduhack import Pipeline
>>> nlp = Pipeline()
>>> text = """
گزشتہ ایک روز کے دوران کورونا کے سبب 118 اموات ہوئیں جس کے بعد اموات کا مجموعہ 3 ہزار 93 ہوگیا ہے۔
سب سے زیادہ اموات بھی پنجاب میں ہوئی ہیں جہاں ایک ہزار 202 افراد جان کی بازی ہار چکے ہیں۔
سندھ میں 916، خیبر پختونخوا میں 755، اسلام آباد میں 94، گلگت بلتستان میں 18، بلوچستان میں 93 اور آزاد کشمیر میں 15 افراد کورونا وائرس سے جاں بحق ہو چکے ہیں۔
"""
>>> doc = nlp(text)
>>> print(doc.text)
Sentence¶
Now, to get the sentences from the document:
>>> for sentence in doc.sentences:
print(sentence.text)
Reference¶
CoNLL-U Format¶
This module reads and parses data in the standard CoNLL-U format as provided in Universal Dependencies. CoNLL-U is a standard format for annotating data at the sentence level and at the word/token level. Annotations in CoNLL-U format follow these rules:
- Word lines contain the annotations of a word/token in 10 fields separated by single tab characters
- Blank lines mark sentence boundaries
- Comment lines start with a hash (#)
Each word/token has 10 fields defined in the CoNLL-U format. Each field represents a different attribute of the token, detailed below:
Fields¶
1. ID:
- ID represents the word/token index in the sentence; indexing starts from 1 in UD sentences.
2. FORM:
- The word/token form or punctuation symbol as used in the sentence. For example, organize, organizer and organization are all inflectional forms related to the word organize.
3. LEMMA:
- The root/stem of the word. The lemma is the vocabulary form of the word, which helps in understanding the roots of different words. Lemmatization refers to the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.
4. UPOS:
- UPOS marks the core part-of-speech categories, which help in analysing word usage in the sentence. The UPOS tags available in UD are: ADJ, ADV, INTJ, NOUN, PROPN, VERB, ADP, AUX, CCONJ, DET, NUM, PART, PRON, SCONJ, PUNCT, SYM, X.
5. XPOS:
- Language-specific part-of-speech tag. Some languages have different grammar rules, which is why language-specific POS tags are used.
6. FEATS:
- Features are additional pieces of information about the word, its part of speech and morphosyntactic properties. Every feature has the form Name=Value, and every word can have any number of features separated by the vertical bar, as in Gender=Masc|Number=Sing. Users can extend this set of universal features and add language-specific features when necessary.
7. HEAD:
- The head of the current word, which is either a value of ID or zero (0). All words of a sentence depend on other words of the sentence; HEAD identifies the word to which the current word stands in the DEPREL relation.
8. DEPREL:
- The universal dependency relation to the HEAD (root iff HEAD = 0), or a defined language-specific subtype of one.
9. DEPS:
- Enhanced dependency graph in the form of a list of head-deprel pairs. A dependency can be labeled as dep when it is impossible to determine a more precise relation, for example because of an unusual grammatical construction or a limitation in conversion or parsing software; the use of dep should be avoided as much as possible.
10. MISC:
- Any other annotation, such as commentary, not covered by the fields above.
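The ten fields above can be sketched as a simple tab-split parser. This is an illustration of the format, not urduhack's implementation (urduhack's CoNLL class additionally handles comment lines and sentence boundaries, and its token dicts use slightly different key names):

```python
# The 10 CoNLL-U fields, in order, lower-cased for use as dict keys.
CONLL_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                "feats", "head", "deprel", "deps", "misc"]

def parse_token_line(line: str) -> dict:
    """Map one CoNLL-U word line onto the 10 fields."""
    values = line.rstrip("\n").split("\t")
    assert len(values) == 10, "a CoNLL-U word line has exactly 10 tab-separated fields"
    return dict(zip(CONLL_FIELDS, values))

line = "1\tوالدین\tوالدین\tNOUN\tNN\tCase=Acc\t4\tnsubj\t_\t_"
token = parse_token_line(line)
print(token["form"], token["upos"])  # والدین NOUN
```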
UPOS TAGS DETAILS
ADJ
Adjectives are words that typically modify nouns and specify their properties or attributes:
اسمبلی انتخابات کو 'صاف و شفاف' بنانے
انتخابات مےں 'غیرسماجی' عناصر کی جانب سے بدامنی پھیلائے جانے کا خدشہ ہے
Examples: ‘واضح’ ,’یقینی’ ,’ہر’
ADV
Adverbs are words that typically modify verbs for such categories as time, place, direction or manner. They may also modify adjectives and other adverbs, as in ‘very briefly’ or ‘arguably wrong’.
سوہال نے 24 گیندوں مےں جاریہ سیزن کی 'تیزترین' نصف سنچری بنائی
Examples: ‘ہرگز’ ,’قبل_ازیں’ ,’بعد’
INTJ
An interjection is a word that is used most often as an exclamation or part of an exclamation. It typically expresses an emotional reaction, is not syntactically related to other accompanying expressions, and may include a combination of sounds not otherwise found in the language.
چلو اب 'ذرا' دنیا کی سیر کر لیں
Examples: ‘آہ’ ,’ذرا’ ,’بس’
NOUN
Nouns are a part of speech typically denoting a person, place, thing, animal or idea.
بس اس 'تنہائی' کے 'عالم' مےں اےک 'یاد' 'آواز' بن کر 'سماعت' سے ٹکرائی۔
Examples: ‘جسٹس’ ,’پاکستان’ ,’حصہ’ ,’خالہ’ ,’لوگ’
PROPN
A proper noun is a noun (or nominal content word) that is the name (or part of the name) of a specific individual, place, or object.
تاہم 'بھارت' 'کرشنا' کو اپنی منگیتر کے چال و چلن پر شبہ ہوا
Examples: ‘ڈی’ ,’سی’ ,’پی’ ,’ساؤتھ’ ,’احمد’ ,’نثار’
VERB
A verb is a member of the syntactic class of words that typically signal events and actions, can constitute a minimal predicate in a clause, and govern the number and types of other constituents which may occur in the clause. Verbs are often associated with grammatical categories like tense, mood, aspect and voice, which can be expressed either inflectionally or using auxiliary verbs or particles.
آزادانہ و منصفانہ انتخابات کو یقینی 'بنایا' جا سکے
Examples: ‘کہا’ ,’رہے’
ADP
Adposition is a cover term for prepositions and postpositions. Adpositions belong to a closed set of items that occur before (preposition) or after (postposition) a complement composed of a noun phrase, noun, pronoun, or clause that functions as a noun phrase, and that form a single structure with the complement to express its grammatical and semantic relation to another unit within a clause.
اےک شخص 'نے' مبینہ طور 'پر' اپنی منگیتر اور اس 'کے' والدین 'پر' چاقو 'سے' حملہ کر کے زخمی کر دیا۔
Examples: ‘پر’ ,’نے’ ,’مےں’
AUX
An auxiliary is a function word that accompanies the lexical verb of a verb phrase and expresses grammatical distinctions not carried by the lexical verb, such as person, number, tense, mood, aspect, voice or evidentiality. It is often a verb (which may have non-auxiliary uses as well), but many languages have nonverbal TAME markers, and these should also be tagged AUX. The class AUX also includes copulas (in the narrow sense of pure linking words for nonverbal predication).
انتخابات کو یقینی بنایا 'جا' سکے۔
Examples: ‘جا’ ,’ہے’ ,’رہے’
CCONJ
A coordinating conjunction is a word that links words or larger constituents without syntactically subordinating one to the other and expresses a semantic relationship between them.
انتخابات کی راست نگرانی 'اور' غنڈہ عناصر پر کنٹرول کے لئے سخت_ترین انتظامات کئے جائیں۔
Examples: ‘لیکن’ ,’و’ (as in ‘بدعنوانیوں و بےقاعدگیوں’)
DET
Determiners are words that modify nouns or noun phrases and express the reference of the noun phrase in context. That is, a determiner may indicate whether the noun refers to a definite or indefinite element of a class, to a closer or more distant element, to an element belonging to a specified person or thing, to a particular number or quantity, etc.
ریاستی حج کمیٹی 'اس' طرح کی کوئی تجویز رکھتی ہے
Examples: ‘تمام’ ,’ہر’ ,’جو’
NUM
A numeral is a word, functioning most typically as a determiner, adjective or pronoun, that expresses a number and a relation to the number, such as quantity, sequence, frequency or fraction.
'اےک' شخص نے مبینہ طور پر اپنی منگیتر اور اس کے والدین پر چاقو سے حملہ کر کے زخمی کر دیا۔
Examples: ‘۲’ ,’۱’ ,’۰’ ,’چار’ ,’اےک’
PART
Particles are function words that must be associated with another word or phrase to impart meaning and that do not satisfy definitions of other universal parts of speech (e.g. adpositions, coordinating conjunctions, subordinating conjunctions or auxiliary verbs). Particles may encode grammatical categories such as negation, mood, tense, etc. Particles are normally not inflected, although exceptions may occur.
اس اجلاس مےں اطفال کے حق تعلیم کے قانون کا جائزہ 'بھی' لیا جائے_گا۔
Examples: ‘ہی’ ,’مسٹر’ ,’نہیں’
PRON
Pronouns are words that substitute for nouns or noun phrases, whose meaning is recoverable from the linguistic or extralinguistic context.
احمد کے بموجب اگر دونوں ہی ٹیمیں 'اپنے' شیڈول مےں معمولی تبدیلی کرتے ہےں
Examples: ‘اپنی’ ,’ازیں’ ,’یہاں’
SCONJ
A subordinating conjunction is a conjunction that links constructions by making one of them a constituent of the other. The subordinating conjunction typically marks the incorporated constituent, which has the status of a (subordinate) clause.
وہ ابھی پڑھ ہی رہے تھے 'کہ' بیٹے نے دروازہ کھٹکھٹایا۔
Examples: ‘اگر’ ,’تو’
PUNCT
Punctuation marks are non-alphabetical characters and character groups used in many languages to delimit linguistic units in printed text.
ایسے دہشتگردوں کو اسلام سے خارج کر دیا جانا چاہیے'۔'
Examples: ‘!’ ,’.’ ,’۔’
SYM
A symbol is a word-like entity that differs from ordinary words by form, function, or both.
ایسے'$' دہشتگردوں کو اسلام سے خارج کر دیا جانا چاہیے
Examples: ‘@’, ‘%’
X
The tag X is used for words that for some reason cannot be assigned a real part-of-speech category. It should be used very restrictively.
class urduhack.conll.CoNLL¶
A CoNLL class to easily load CoNLL-U formats. It can also load resources by iterating over strings, and is the main entrance to the conll module's functionality.
static get_fields() → List[str]¶
Get the list of CoNLL fields.
Returns: list of CoNLL fields
Return type: List[str]
static iter_file(file_name: str) → Iterator[Tuple]¶
Iterate over a CoNLL-U file's sentences.
Parameters: file_name (str) – The name of the file whose sentences should be iterated over.
Yields: Iterator[Tuple] – The sentences that make up the CoNLL-U file.
Raises: IOError – If there is an error opening the file. ParseError – If there is an error parsing the input into a Conll object.
static iter_string(text: str) → Iterator[Tuple]¶
Iterate over a CoNLL-U string's sentences.
Use this method if you only need to iterate over the CoNLL-U file once and do not need to create or store the Conll object.
Parameters: text (str) – The CoNLL-U string.
Yields: Iterator[Tuple] – The sentences that make up the CoNLL-U file.
Raises: ParseError – If there is an error parsing the input into a Conll object.
static load_file(file_name: str) → List[Tuple]¶
Load a CoNLL-U file given its location.
Parameters: file_name (str) – The location of the file.
Returns: A Conll object equivalent to the provided file.
Return type: List[Tuple]
Raises: IOError – If there is an error opening the given filename. ValueError – If there is an error parsing the input into a Conll object.
Normalization¶
The normalization of Urdu text is necessary to make it useful for machine learning tasks. The normalize module handles the most basic problems faced when working with Urdu data with ease and efficiency. All the problems, and how the normalize module handles them, are listed below.
This module fixes the encoding of Urdu characters and replaces Arabic characters with the correct Urdu ones, bringing all characters into the Unicode range specified for the Urdu language (0600-06FF). It also fixes the word-joining problem: when the space between two Urdu words is removed, they must not merge into a new word; their rendering must not change, and they should look the same even after the space is removed.
You can use the library to normalize Urdu text to the correct Unicode characters. By normalization we mean resolving the confusion between Urdu and Arabic characters and replacing variant characters with a single one, keeping in mind the context they are used in. For example, the characters ‘ﺁ’ and ‘ﺂ’ are both replaced by ‘آ’. All this is done using regular expressions.
The normalization of Urdu text is necessary to make it useful for the machine learning tasks. This module provides the following functionality:
- Normalizing Single Characters
- Normalizing Combine Characters
- Removal of Diacritics from Urdu Text
- Replacing English digits with Urdu digits and vice versa
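The digit replacement in the last bullet can be sketched with `str.translate`, mapping Western digits 0-9 onto the Urdu (Extended Arabic-Indic) digits U+06F0-U+06F9 and back. This is a sketch of the idea, not urduhack's implementation:

```python
# Build both translation tables from the digit ranges.
TO_URDU = {ord(str(d)): 0x06F0 + d for d in range(10)}
TO_ENGLISH = {v: k for k, v in TO_URDU.items()}

def english_to_urdu_digits(text: str) -> str:
    return text.translate(TO_URDU)

def urdu_to_english_digits(text: str) -> str:
    return text.translate(TO_ENGLISH)

print(english_to_urdu_digits("20"))    # ۲۰
print(urdu_to_english_digits("۲۰۲۴"))  # 2024
```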
urduhack.normalization.normalize(text: str) → str¶
To normalize some text, all you need to do is pass Urdu text. It returns a str with normalized characters (both single and combined), proper spaces after digits and punctuation, and diacritics removed.
Parameters: text (str) – Urdu text
Returns: Normalized Urdu text
Return type: str
Raises: TypeError – If the text parameter is not of str type.
Examples
>>> from urduhack import normalize
>>> _text = "اَباُوگل پاکستان ﻤﯿﮟ 20 سال ﺳﮯ ، وسائل کی کوئی کمی نہیں ﮨﮯ۔"
>>> normalized_text = normalize(_text)
>>> # The text now contains proper spaces after digits and punctuations,
>>> # normalized characters and no diacritics!
>>> normalized_text
اباوگل پاکستان ﻤﯿﮟ 20 سال ﺳﮯ ، وسائل کی کوئی کمی نہیں ﮨﮯ۔
urduhack.normalization.normalize_characters(text: str) → str¶
One of the most important modules in Urduhack is the character module, defined in the module with the same name. You can use it separately to normalize a piece of text to the proper specified Urdu range (0600-06FF). To understand how this module works, one needs to understand Unicode: every character has a code point, no two characters share one, and you can look up the code point of any character from any language. This module works with reference to those code points. As the Urdu language has its roots in Arabic, Persian and Turkish, we have to deal with all of those characters and convert them to the normal Urdu characters. For example:
>>> all_fes = ['ﻑ', 'ﻒ', 'ﻓ', 'ﻔ']
>>> urdu_fe = 'ف'
All the characters in all_fes look the same but come from different languages and have different code points. As computers deal with numbers, the same character appearing in more than one place in a different language has a different code point, which creates confusion and problems in understanding the context of the data. The character module eliminates this problem by replacing all the characters in all_fes with urdu_fe. It replaces wrong Arabic characters with the correct Urdu characters and fixes the combined/joined characters issue.
Parameters: text (str) – Urdu text
Returns: Returns a str object containing normalized text.
Return type: str
Examples
>>> from urduhack.normalization import normalize_characters
>>> # Text containing characters from the Arabic Unicode block
>>> _text = "مجھ کو جو توڑا ﮔیا تھا"
>>> normalized_text = normalize_characters(_text)
>>> # Normalized text - Arabic characters are now replaced with Urdu characters
>>> normalized_text
مجھ کو جو توڑا گیا تھا
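The underlying principle can also be seen with the standard library: Arabic "presentation form" code points decompose to the ordinary letters under Unicode NFKC normalization. urduhack uses its own mapping tables rather than NFKC, but this sketch shows the same idea:

```python
import unicodedata

isolated_alef_madda = "\uFE81"  # ﺁ - Arabic presentation form
normal_alef_madda = "\u0622"    # آ - ordinary letter

# NFKC folds the presentation form back to the ordinary letter.
print(unicodedata.normalize("NFKC", isolated_alef_madda) == normal_alef_madda)  # True
```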
urduhack.normalization.normalize_combine_characters(text: str) → str¶
To normalize combined characters into single-character Unicode text, use the normalize_combine_characters() function in the character module. It replaces combined/joined Urdu characters with the single equivalent Unicode character.
Parameters: text (str) – Urdu text
Returns: Returns a str object containing normalized text.
Return type: str
Examples
>>> from urduhack.normalization import normalize_combine_characters
>>> # In the following string, Alif ('ا') and Hamza ('ٔ ') are separate characters
>>> _text = "جرأت"
>>> normalized_text = normalize_combine_characters(_text)
>>> # Now Alif and Hamza are replaced by a single Urdu Unicode character!
>>> normalized_text
جرأت
urduhack.normalization.remove_diacritics(text: str) → str¶
Remove Urdu diacritics from text. This is an important step in pre-processing Urdu data. The function returns a str object which contains the original text minus the Urdu diacritics.
Parameters: text (str) – Urdu text
Returns: Returns a str object containing normalized text.
Return type: str
Examples
>>> from urduhack.normalization import remove_diacritics
>>> _text = "شیرِ پنجاب"
>>> normalized_text = remove_diacritics(_text)
>>> normalized_text
شیر پنجاب
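The idea behind diacritic removal can be sketched with the Unicode combining class: Urdu diacritics (zer, zabar, pesh, and so on) are combining marks, so dropping every combining character strips them while leaving the base letters intact. This is a sketch, not urduhack's implementation:

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    # unicodedata.combining(ch) is non-zero for combining marks.
    return "".join(ch for ch in text if not unicodedata.combining(ch))

print(strip_diacritics("شیرِ پنجاب"))  # the zer under 'ر' is removed: شیر پنجاب
```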
Tokenization¶
This module is another crucial part of Urduhack. It performs sentence tokenization: it separates sentences from each other and converts each string into a complete sentence token. Do not confuse sentence tokens with word tokens; they are two completely different things. The library also provides a state-of-the-art word tokenizer for the Urdu language, which takes care of spaces and of where two Urdu characters should and should not be connected.
The tokenization of Urdu text is necessary to make it useful for the NLP tasks. This module provides the following functionality:
- Sentence Tokenization
- Word Tokenization
The tokenization of Urdu text is necessary to make it useful for machine learning tasks. The tokenization module solves the problems related to sentence and word tokenization.
urduhack.tokenization.sentence_tokenizer(text: str) → List[str]¶
Convert Urdu text into possible sentences. If successful, this function returns a list object containing multiple Urdu str sentences.
Parameters: text (str) – Urdu text
Returns: Returns a list object containing multiple Urdu sentences of type str.
Return type: list
Raises: TypeError – If text is not of str type.
Examples
>>> from urduhack.tokenization import sentence_tokenizer
>>> text = "عراق اور شام نے اعلان کیا ہے دونوں ممالک جلد اپنے اپنے سفیروں کو واپس بغداد اور دمشق بھیج دیں گے؟"
>>> sentences = sentence_tokenizer(text)
>>> sentences
["دونوں ممالک جلد اپنے اپنے سفیروں کو واپس بغداد اور دمشق بھیج دیں گے؟" ,"عراق اور شام نے اعلان کیا ہے۔"]
urduhack.tokenization.word_tokenizer(sentence: str, max_len: int = 256) → List[str]¶
To convert raw Urdu text into tokens, we use the word_tokenizer() function. Before doing this, we need to normalize the sentence as well; for normalizing the Urdu sentence, use the urduhack.normalization.normalize() function. If the word_tokenizer runs successfully, it returns a list object containing Urdu str word tokens.
Parameters: sentence (str) – Urdu sentence; max_len (int) – defaults to 256
Returns: Returns a List[str] containing Urdu tokens.
Return type: List[str]
Examples
>>> from urduhack.tokenization import word_tokenizer
>>> sent = 'عراق اور شام نے اعلان کیا ہے دونوں ممالک جلد اپنے اپنے سفیروں کو واپس بغداد اور دمشق بھیج دیں گے؟'
>>> word_tokenizer(sent)
Tokens: ['عراق', 'اور', 'شام', 'نے', 'اعلان', 'کیا', 'ہے', 'دونوں', 'ممالک' , 'جلد', 'اپنے', 'اپنے', 'سفیروں', 'کو', 'واپس', 'بغداد', 'اور', 'دمشق', 'بھیج', 'دیں', 'گے؟']
Text PreProcessing¶
The pre-processing of Urdu text is necessary to make it useful for the machine learning tasks. This module provides the following functionality:
- Normalize whitespace
- Put Spaces Before & After Digits
- Put Spaces Before & After English Words
- Put Spaces Before & After Urdu Punctuations
- Replace URLs
- Replace emails
- Replace numbers
- Replace phone numbers
- Replace currency symbols
You can look up all the functions that come with the preprocessing module in the reference below.
urduhack.preprocessing.digits_space(text: str) → str¶
Add spaces before and after numeric and Urdu digits.
Parameters: text (str) – Urdu text
Returns: Returns a str object containing normalized text.
Return type: str
Examples
>>> from urduhack.preprocessing import digits_space
>>> text = "20فیصد"
>>> normalized_text = digits_space(text)
>>> normalized_text
20 فیصد
urduhack.preprocessing.english_characters_space(text: str) → str¶
Add spaces before and after English words in the given Urdu text. This is an important step in the normalization of Urdu data. The function returns a str object which contains the original text with spaces before and after English words.
Parameters: text (str) – Urdu text
Returns: Returns a str object containing normalized text.
Return type: str
Examples
>>> from urduhack.preprocessing import english_characters_space
>>> text = "خاتون Aliyaنے بچوںUzma and Aliyaکے قتل کا اعترافConfession کیا ہے۔"
>>> normalized_text = english_characters_space(text)
>>> normalized_text
خاتون Aliya نے بچوں Uzma and Aliya کے قتل کا اعتراف Confession کیا ہے۔
urduhack.preprocessing.all_punctuations_space(text: str) → str¶
Add spaces after the punctuation marks used in Urdu writing.
Parameters: text (str) – Urdu text
Returns: Returns a str object containing normalized text.
Return type: str
urduhack.preprocessing.preprocess(text: str) → str¶
To preprocess some text, all you need to do is pass Unicode text. It returns a str with proper spaces after digits and punctuation.
Parameters: text (str) – Urdu text
Returns: Urdu text
Return type: str
Raises: TypeError – If the text parameter is not of str type.
Examples
>>> from urduhack.preprocessing import preprocess
>>> text = "اَباُوگل پاکستان ﻤﯿﮟ 20 سال ﺳﮯ ، وسائل کی کوئی کمی نہیں ﮨﮯ۔"
>>> normalized_text = preprocess(text)
>>> # The text now contains proper spaces after digits and punctuations
>>> normalized_text
اباوگل پاکستان ﻤﯿﮟ 20 سال ﺳﮯ ، وسائل کی کوئی کمی نہیں ﮨﮯ ۔
urduhack.preprocessing.normalize_whitespace(text: str)¶
Given a text str, replace one or more spacings with a single space, and one or more line breaks with a single newline. Also strip leading/trailing whitespace.
Parameters: text (str) – Urdu text
Returns: Returns a str object containing normalized text.
Return type: str
Examples
>>> from urduhack.preprocessing import normalize_whitespace
>>> text = "عراق اور شام اعلان کیا ہے دونوں جلد اپنے گے؟"
>>> normalized_text = normalize_whitespace(text)
>>> normalized_text
عراق اور شام اعلان کیا ہے دونوں جلد اپنے گے؟
urduhack.preprocessing.remove_punctuation(text: str, marks=None) → str¶
Remove punctuation from text by removing all instances of marks.
Parameters: text (str) – Urdu text; marks – the punctuation marks to remove (when None, all punctuation is removed)
Returns: Returns a str object containing normalized text.
Return type: str
Note
When marks=None, Python's built-in str.translate() is used to remove punctuation; otherwise, a regular expression is used instead. The former's performance is about 5-10x faster.
Examples
>>> from urduhack.preprocessing import remove_punctuation
>>> output = remove_punctuation("کر ؟ سکتی ہے۔")
کر سکتی ہے
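The fast `str.translate()` path mentioned in the note can be sketched as building a deletion table once and applying it in a single pass. The punctuation set below is an assumption for illustration; the library's actual set may differ:

```python
# A small assumed punctuation set; map each mark's code point to None
# so str.translate deletes it.
URDU_PUNCTUATION = "۔؟،؛!.?"
DELETE_TABLE = {ord(ch): None for ch in URDU_PUNCTUATION}

def strip_punct(text: str) -> str:
    return text.translate(DELETE_TABLE)

print(strip_punct("کر ؟ سکتی ہے۔"))  # '؟' and '۔' are deleted
```

Mapping to `None` is the documented `str.translate` idiom for deleting characters, which is why it outperforms a regular-expression pass.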
urduhack.preprocessing.remove_accents(text: str) → str¶
Remove accents from any accented Unicode characters in the text str, either by transforming them into ASCII equivalents or removing them entirely.
Parameters: text (str) – Urdu text
Returns: str
Examples
>>> from urduhack.preprocessing import remove_accents
>>> text = "دالتِ عظمیٰ درخواست"
>>> remove_accents(text)
'دالت عظمی درخواست'
urduhack.preprocessing.replace_urls(text: str, replace_with='')¶
Replace all URLs in the text str with the replace_with str.
Parameters: text (str) – Urdu text; replace_with (str) – replacement text
Returns: Returns a str object with URLs replaced by the replace_with text.
Return type: str
Examples
>>> from urduhack.preprocessing import replace_urls
>>> text = "20 www.gmail.com فیصد"
>>> replace_urls(text)
'20 فیصد'
urduhack.preprocessing.replace_emails(text: str, replace_with='')¶
Replace all emails in the text str with the replace_with str.
Parameters: text (str) – Urdu text; replace_with (str) – replacement text
Returns: Returns a str object with emails replaced by the replace_with text.
Return type: str
Examples
>>> from urduhack.preprocessing import replace_emails
>>> text = "20 gunner@gmail.com فیصد"
>>> replace_emails(text)
-
urduhack.preprocessing.replace_numbers(text: str, replace_with='')[source]¶
Replace all numbers in text str with replace_with str.
Parameters:
- text (str) – Urdu text
- replace_with (str) – Replacement text
Returns: Returns a str object with numbers replaced by the replace_with text.
Return type: str
Examples
>>> from urduhack.preprocessing import replace_numbers
>>> text = "20 فیصد"
>>> replace_numbers(text)
' فیصد'
-
urduhack.preprocessing.replace_phone_numbers(text: str, replace_with='')[source]¶
Replace all phone numbers in text str with replace_with str.
Parameters:
- text (str) – Urdu text
- replace_with (str) – Replacement text
Returns: Returns a str object with phone numbers replaced by the replace_with text.
Return type: str
Examples
>>> from urduhack.preprocessing import replace_phone_numbers
>>> text = "یعنی لائن آف کنٹرول پر فائربندی کا معاہدہ 555-123-4567 میں ہوا تھا"
>>> replace_phone_numbers(text)
'یعنی لائن آف کنٹرول پر فائربندی کا معاہدہ میں ہوا تھا'
-
urduhack.preprocessing.replace_currency_symbols(text: str, replace_with=None)[source]¶
Replace all currency symbols in text str with the string specified by replace_with; if replace_with is None, each symbol is replaced with its currency code (e.g. '$' becomes 'USD').
Parameters:
- text (str) – Urdu text
- replace_with (str or None) – Replacement text
Returns: Returns a str object containing normalized text.
Return type: str
Examples
>>> from urduhack.preprocessing import replace_currency_symbols
>>> text = "یعنی لائن آف کنٹرول پر فائربندی کا معاہدہ 2003 میں ہوا 33$ تھا۔"
>>> replace_currency_symbols(text)
'یعنی لائن آف کنٹرول پر فائربندی کا معاہدہ 2003 میں ہوا 33USD تھا۔'
Utils¶
Utils module¶
Collection of helper functions.
-
urduhack.utils.pickle_load(file_name: str) → Any[source]¶
Load a Python object from a pickle file.
Parameters: file_name (str) – file name
Returns: The loaded Python object
Return type: Any
-
urduhack.utils.pickle_dump(file_name: str, data: Any)[source]¶
Save a Python object to a file in pickle format.
Parameters:
- file_name (str) – file name
- data (Any) – Any data type
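These two helpers behave like thin wrappers over the standard pickle module; a minimal equivalent sketch (illustrative only, not the library's code):

```python
import pickle
from typing import Any

def pickle_dump_sketch(file_name: str, data: Any) -> None:
    # Serialize any Python object to a file in pickle format.
    with open(file_name, "wb") as fh:
        pickle.dump(data, fh)

def pickle_load_sketch(file_name: str) -> Any:
    # Load a previously pickled Python object back from disk.
    with open(file_name, "rb") as fh:
        return pickle.load(fh)
```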
-
urduhack.utils.download_from_url(file_name: str, url: str, download_dir: str, cache_dir: Optional[str] = None)[source]¶
Download a file from an HTTP URL.
Parameters:
- file_name (str) – Name to save the downloaded file as
- url (str) – URL to download from
- download_dir (str) – Directory to download into
- cache_dir (Optional[str]) – Cache directory
Raises: TypeError – If any of url, download_dir and file_name is not of str type.
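The behaviour can be approximated with urllib from the standard library; this sketch, including the TypeError checks described above, is illustrative and not the library's actual code:

```python
import os
import urllib.request

def download_from_url_sketch(file_name: str, url: str, download_dir: str) -> str:
    # Validate argument types, mirroring the TypeError described above.
    for arg in (file_name, url, download_dir):
        if not isinstance(arg, str):
            raise TypeError(f"Expected str, got {type(arg).__name__}")
    os.makedirs(download_dir, exist_ok=True)
    destination = os.path.join(download_dir, file_name)
    # Stream the resource at `url` into the destination file.
    urllib.request.urlretrieve(url, destination)
    return destination
```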
-
urduhack.utils.remove_file(file_name: str)[source]¶
Delete the local file.
Parameters: file_name (str) – File to be deleted
Raises:
- TypeError – If file_name is not of str type.
- FileNotFoundError – If file_name does not exist.
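A minimal sketch of a helper with this contract, raising the two documented errors (illustrative, not the library's implementation):

```python
import os

def remove_file_sketch(file_name: str) -> None:
    # Mirror the documented errors: TypeError for non-str input,
    # FileNotFoundError when the file does not exist.
    if not isinstance(file_name, str):
        raise TypeError("file_name must be a str")
    if not os.path.isfile(file_name):
        raise FileNotFoundError(file_name)
    os.remove(file_name)
```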
About¶
Authors¶
Ikram Ali (Core contributor)¶
A machine learning practitioner and an avid learner, with professional experience managing Python, PHP, and JavaScript projects and strong machine learning and deep learning skills.
- Personal Web: https://akkefa.com/
- Github: https://github.com/akkefa
- LinkedIn: https://www.linkedin.com/in/akkefa/
Drop me a line at mrikram1989@gmail.com or call me at 92 3320 453648.
Goals¶
The author’s goal is to foster and support active development of the urduhack library through:
- Continuous integration testing via Travis CI
- Publicized development activity on GitHub
- Regular releases to the Python Package Index
License¶
Urduhack is licensed under MIT License.
If you want to support the urduhack library, please report any issues here.
Release Notes¶
Note
Contributors please include release notes as needed or appropriate with your bug fixes, feature additions and tests.
0.2.2¶
Changes:
- Word tokenizer
- Urdu word tokenization functionality added. To convert a normalized Urdu sentence into word tokens, use the urduhack.tokenization.word_tokenizer function.
0.1.0¶
Changes:
- Normalize function
- A single function added to perform all normalization steps. To normalize text, simply import the urduhack.normalize function; it returns a string with normalized characters (both single and combined), proper spaces after digits and punctuation, and with diacritics removed.
- Sentence Tokenizer
- Urdu sentence tokenization functionality added. To convert raw Urdu text into sentences, use the urduhack.tokenization.sentence_tokenizer function.
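For intuition only, sentence splitting can be sketched with a naive rule that breaks on the Urdu full stop '۔' and question mark '؟'; the library's actual tokenizer is more sophisticated than this:

```python
import re

def sentence_tokenizer_sketch(text: str) -> list:
    # Split after the Urdu full stop (۔) or question mark (؟),
    # keeping the terminator attached to its sentence.
    parts = re.split(r"(?<=[۔؟])\s*", text)
    return [p.strip() for p in parts if p.strip()]
```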
Bug fixes:
- Fixed bugs in
remove_diacritics()
0.0.2¶
Changes:
- Character Level Normalization
- The urduhack.normalization.character module provides the functionality to replace incorrect Arabic characters with the correct Urdu characters.
- Space Normalization
- The urduhack.normalization.space.util module provides functionality to put proper spaces before and after numeric digits, Urdu digits, and punctuation in Urdu text.
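The digit spacing described here can be sketched with regular expressions; an illustrative approximation only, covering Western digits plus the Arabic-Indic and Extended Arabic-Indic digit ranges:

```python
import re

# Western digits, Arabic-Indic digits (U+0660-U+0669) and Extended
# Arabic-Indic digits (U+06F0-U+06F9), the latter used for Urdu.
_DIGITS = r"0-9\u0660-\u0669\u06f0-\u06f9"

def add_digit_spaces_sketch(text: str) -> str:
    # Insert a space between a digit and an adjacent non-space, non-digit
    # character, in both directions.
    text = re.sub(rf"([{_DIGITS}])([^\s{_DIGITS}])", r"\1 \2", text)
    text = re.sub(rf"([^\s{_DIGITS}])([{_DIGITS}])", r"\1 \2", text)
    return text
```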
- Diacritics Removal
- The urduhack.utils.text.remove_diacritics module provides the functionality to remove Urdu diacritics from text, an important step in pre-processing Urdu data.
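Urdu diacritics (aerab) live among the Arabic combining marks; a minimal regex sketch of such removal, where the range U+064B-U+0652 is an assumption about which marks are stripped, not the library's exact set:

```python
import re

# Combining marks U+064B (fathatan) through U+0652 (sukun) cover common
# Urdu diacritics such as zabar, zer, pesh and tashdid (assumed range).
DIACRITICS_RE = re.compile(r"[\u064B-\u0652]")

def remove_diacritics_sketch(text: str) -> str:
    # Delete every combining mark in the assumed diacritics range.
    return DIACRITICS_RE.sub("", text)
```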
0.0.1¶
Changes:
- Urdu character normalization API added.
- Urdu space normalization utilities added.
- Correct Unicode ranges for Urdu characters added.
Deprecations and removals¶
This page lists urduhack features that are deprecated, or have been removed in past major releases, and gives the alternatives to use instead.