Welcome to Urduhack Documentation!

What is Urduhack?

Urduhack is an open-source NLP library for the Urdu language. It comes with many batteries-included features to help you process Urdu data as easily as possible.

Our Goal

  • Academic users: experiment and test hypotheses without coding from scratch.
  • NLP beginners: learn how to build an NLP project with production-level code quality.
  • NLP developers: build a production-level application within minutes.

Urduhack is maintained by Ikram Ali and Contributors.

Installation

Note

Urduhack is supported on the following Python versions

Python    3.8   3.7   3.6   2.7
Urduhack  Yes   -     -     -

Install Urduhack via pip

Note

Urduhack is developed using Tensorflow. It needs the Tensorflow CPU build for prediction; for development and training of the models it uses Tensorflow GPU. The following instructions will install Tensorflow along with Urduhack.

The easiest way to install urduhack is by pip install.

Installing with Tensorflow cpu version.
$ pip install urduhack[tf]
Installing with Tensorflow gpu version.
$ pip install urduhack[tf-gpu]

Package Dependencies

With so much functionality, urduhack depends on a number of other packages. To avoid conflicts, it is preferred that you create a virtual environment and install urduhack in that environment.

  • Tensorflow ~= 2.4 Used for training, evaluating and testing deep neural network models.
  • Transformers Used for the BERT implementation for training and evaluation.
  • Tensorflow-datasets Used to download and prepare datasets and read them into a model using the tf.data.Dataset API.
  • Click Used to build the Urduhack command-line application.
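The virtual environment recommended above can be created with Python's built-in venv module; a minimal sketch (the environment name is arbitrary):

```shell
# Create an isolated environment for urduhack so its pinned dependencies
# cannot conflict with system-wide packages.
python3 -m venv urduhack-env
. urduhack-env/bin/activate

# Inside the environment, install the CPU build (or the tf-gpu extra
# for training, as described above):
#   pip install "urduhack[tf]"
echo "environment ready: $VIRTUAL_ENV"
```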

Downloading Models

Pythonic Way
You can download the models directly from Python:
import urduhack
urduhack.download()
Command line
To download the models, all you have to do is run this simple command on the command line.
$ urduhack download

This command will download the models which will be used by urduhack.

Quickstart

Every Python package needs an import statement, so let's do that first:

>>> import urduhack

Overview

Urdu Characters

The Urdu alphabet is the right-to-left alphabet used for the Urdu language. It is a modification of the Persian alphabet, itself a derivative of the Arabic alphabet, and is known as Perso-Arabic. The Urdu alphabet has up to 58 letters, including 39 basic letters, and no distinct letter cases. It is typically written in the calligraphic Nastaʿlīq script.

It covers 46 letters, 10 digits, 6 punctuation marks and 6 diacritics.
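All of these characters live in the Arabic Unicode block (U+0600 to U+06FF) that later sections refer to; a quick standard-library check (the sample letters here are illustrative):

```python
import unicodedata

# A handful of Urdu letters; each code point falls inside the
# U+0600-U+06FF range that the normalization module targets.
for ch in "اپٹچڈ":
    code_point = ord(ch)
    assert 0x0600 <= code_point <= 0x06FF
    print(f"{ch}  U+{code_point:04X}  {unicodedata.name(ch)}")
```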

Normalization

The normalization of Urdu text is necessary to make it useful for machine learning tasks. The normalization module handles the most common problems faced when working with Urdu data with ease and efficiency. These problems, and how the module handles them, are listed below.

This module fixes the encoding of Urdu characters and replaces Arabic characters with the correct Urdu ones, bringing all characters into the Unicode range specified for the Urdu language (0600-06FF).

It also fixes the joining problem between Urdu words: when the space between two Urdu words is removed, they must not merge into a new word. Their rendering must not change; even after the removal of the space they should look the same.
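One standard Unicode tool for this kind of fix is the zero-width non-joiner (U+200C), which keeps two words rendering separately even with no visible space between them. A sketch of the idea only, not urduhack's internal implementation:

```python
ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER: breaks cursive joining, shows no space

word1, word2 = "خوش", "حال"
merged = word1 + word2             # letters may join and change shape
protected = word1 + ZWNJ + word2   # letter shapes preserved, no visible space

# The protected form carries one extra (invisible) code point.
assert len(protected) == len(merged) + 1
print(merged, protected)
```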

>>> from urduhack import normalize
>>> text = "اَباُوگل پاکستان ﻤﯿﮟ 20 سال ﺳﮯ ، وسائل کی کوئی کمی نہیں ﮨﮯ۔"
>>> text = normalize(text)
Urdu Stopwords

Stop words are natural language words which have very little meaning, such as “and”, “the”, “a”, “an”, and similar words. These words are highly redundant in texts and do not contribute much so it is sometimes a viable approach to remove the stop words in pre-processing of the data.

>>> from urduhack.stop_words import STOP_WORDS, remove_stopwords
>>> print(STOP_WORDS)
>>> text = remove_stopwords(text)
Tokenization

This module is another crucial part of Urduhack. It performs tokenization on raw text: it separates sentences from each other and converts each string into a complete sentence token. Note that a sentence token must not be confused with a word token; they are two completely different things.

This library provides a state-of-the-art word tokenizer for the Urdu language. It takes care of spaces and of where to connect two Urdu characters and where not to.

The tokenization of Urdu text is necessary to make it useful for the NLP tasks. This module provides the following functionality:

  • Sentence Tokenization
  • Word Tokenization

The tokenization of Urdu text is necessary to make it useful for machine learning tasks. In the tokenization module, we solved the problems related to sentence and word tokenization.
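As a purely illustrative sketch, a naive splitter on the Urdu full stop (۔) captures the basic idea of sentence tokenization; urduhack's model-based tokenizer handles many more cases than this:

```python
URDU_FULL_STOP = "۔"

def naive_sentence_split(text: str) -> list:
    # Split on the Urdu full stop and re-attach it to each sentence.
    parts = [p.strip() for p in text.split(URDU_FULL_STOP) if p.strip()]
    return [p + URDU_FULL_STOP for p in parts]

text = "میں نے کتاب پڑھی۔ وہ گھر گیا۔"
print(naive_sentence_split(text))
# ['میں نے کتاب پڑھی۔', 'وہ گھر گیا۔']
```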

Tutorial

CoNLL-U Format

We aspire to maintain data for all tasks in the CoNLL-U format. CoNLL-U holds sentence-level and token-level data along with their attributes. Below we show how to use urduhack's CoNLL module.

>>> from urduhack import CoNLL

To iterate over sentences in CoNLL-U format we will use iter_string() function.

>>> from urduhack.conll.tests.test_parser import CONLL_SENTENCE

It will yield a sentence in proper CoNLL-U format from which we can extract sentence level and token level attributes.

>>> for sentence in CoNLL.iter_string(CONLL_SENTENCE):
        sent_meta, tokens = sentence
        print(f"Sentence ID: {sent_meta['sent_id']}")
        print(f"Sentence Text: {sent_meta['text']}")
        for token in tokens:
            print(token)
        {'id': '1', 'text': 'والدین', 'lemma': 'والدین', 'upos': 'NOUN', 'xpos': 'NN', 'feats': 'Case=Acc|Gender=Masc|Number=Sing|Person=3', 'head': '4', 'deprel': 'nsubj', 'deps': '_', 'misc': 'Vib=0|Tam=0|ChunkId=NP|ChunkType=head'}
        {'id': '2', 'text': 'معمولی', 'lemma': 'معمولی', 'upos': 'ADJ', 'xpos': 'JJ', 'feats': 'Case=Nom', 'head': '3', 'deprel': 'advmod', 'deps': '_', 'misc': 'ChunkId=JJP|ChunkType=head'}
        {'id': '3', 'text': 'زخمی', 'lemma': 'زخمی', 'upos': 'ADJ', 'xpos': 'JJ', 'feats': 'Case=Nom|Gender=Masc|Number=Sing|Person=3', 'head': '4', 'deprel': 'compound', 'deps': '_', 'misc': 'Vib=0|Tam=0|ChunkId=JJP2|ChunkType=head'}
        {'id': '4', 'text': 'ہوئے', 'lemma': 'ہو', 'upos': 'VERB', 'xpos': 'VM', 'feats': 'Aspect=Perf|Number=Plur|Person=2|Polite=Form|VerbForm=Part|Voice=Act', 'head': '0', 'deprel': 'root', 'deps': '_', 'misc': 'Vib=یا|Tam=yA|ChunkId=VGF|ChunkType=head|Stype=declarative'}
        {'id': '5', 'text': 'ہےں', 'lemma': 'ہے', 'upos': 'AUX', 'xpos': 'VAUX', 'feats': 'Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin', 'head': '4', 'deprel': 'aux', 'deps': '_', 'misc': 'SpaceAfter=No|Vib=ہے|Tam=hE|ChunkId=VGF|ChunkType=child'}
        {'id': '6', 'text': '۔', 'lemma': '۔', 'upos': 'PUNCT', 'xpos': 'SYM', 'feats': '_', 'head': '4', 'deprel': 'punct', 'deps': '_', 'misc': 'ChunkId=VGF|ChunkType=child'}

To load a file in CoNLL-U format, we will use the urduhack.CoNLL.load_file() function.

>>> sentences = CoNLL.load_file("urdu_text.conll")
>>> for sentence in sentences:
        sent_meta, tokens = sentence
        print(f"Sentence ID: {sent_meta['sent_id']}")
        print(f"Sentence Text: {sent_meta['text']}")
        for token in tokens:
            print(token)
        {'id': '1', 'text': 'والدین', 'lemma': 'والدین', 'upos': 'NOUN', 'xpos': 'NN', 'feats': 'Case=Acc|Gender=Masc|Number=Sing|Person=3', 'head': '4', 'deprel': 'nsubj', 'deps': '_', 'misc': 'Vib=0|Tam=0|ChunkId=NP|ChunkType=head'}
        {'id': '2', 'text': 'معمولی', 'lemma': 'معمولی', 'upos': 'ADJ', 'xpos': 'JJ', 'feats': 'Case=Nom', 'head': '3', 'deprel': 'advmod', 'deps': '_', 'misc': 'ChunkId=JJP|ChunkType=head'}
        {'id': '3', 'text': 'زخمی', 'lemma': 'زخمی', 'upos': 'ADJ', 'xpos': 'JJ', 'feats': 'Case=Nom|Gender=Masc|Number=Sing|Person=3', 'head': '4', 'deprel': 'compound', 'deps': '_', 'misc': 'Vib=0|Tam=0|ChunkId=JJP2|ChunkType=head'}
        {'id': '4', 'text': 'ہوئے', 'lemma': 'ہو', 'upos': 'VERB', 'xpos': 'VM', 'feats': 'Aspect=Perf|Number=Plur|Person=2|Polite=Form|VerbForm=Part|Voice=Act', 'head': '0', 'deprel': 'root', 'deps': '_', 'misc': 'Vib=یا|Tam=yA|ChunkId=VGF|ChunkType=head|Stype=declarative'}
        {'id': '5', 'text': 'ہےں', 'lemma': 'ہے', 'upos': 'AUX', 'xpos': 'VAUX', 'feats': 'Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin', 'head': '4', 'deprel': 'aux', 'deps': '_', 'misc': 'SpaceAfter=No|Vib=ہے|Tam=hE|ChunkId=VGF|ChunkType=child'}
        {'id': '6', 'text': '۔', 'lemma': '۔', 'upos': 'PUNCT', 'xpos': 'SYM', 'feats': '_', 'head': '4', 'deprel': 'punct', 'deps': '_', 'misc': 'ChunkId=VGF|ChunkType=child'}
Pipeline Module

Pipeline is a special module in urduhack. Its importance comes from the fact that it performs operations at the document, sentence and token levels. We can convert a document into sentences and a sentence into tokens in one go using the pipeline module, and then run models or any other operation at each of these levels. We will now go through these steps one by one.

Document

We can create a document using the pipeline module.

>>> from urduhack import Pipeline
>>> nlp = Pipeline()
>>> text = """
گزشتہ ایک روز کے دوران کورونا کے سبب 118 اموات ہوئیں جس کے بعد اموات کا مجموعہ 3 ہزار 93 ہوگیا ہے۔
سب سے زیادہ اموات بھی پنجاب میں ہوئی ہیں جہاں ایک ہزار 202 افراد جان کی بازی ہار چکے ہیں۔
سندھ میں 916، خیبر پختونخوا میں 755، اسلام آباد میں 94، گلگت بلتستان میں 18، بلوچستان میں 93 اور آزاد کشمیر میں 15 افراد کورونا وائرس سے جاں بحق ہو چکے ہیں۔
"""
>>> doc = nlp(text)
>>> print(doc.text)
Sentence

Now, to get the sentences from the document:

>>> for sentence in doc.sentences:
        print(sentence.text)
Word

To get the words from a sentence:

>>> for word in sentence.words:
        print(word.text)
POS tagger

The Word class holds POS tags.

>>> for word in sentence.words:
        print(word.pos)
Lemmatizer

The Word class holds lemmas.

>>> for word in sentence.words:
        print(word.lemma)

Reference

CoNLL-U Format

This module reads and parses data in the standard CoNLL-U format as provided in Universal Dependencies. CoNLL-U is a standard format for annotating data at the sentence level and at the word/token level. Annotations in CoNLL-U format follow these rules:

  1. Word lines contain the annotations of a word/token in 10 fields separated by single tab characters
  2. Blank lines mark sentence boundaries
  3. Comment lines start with hash (#)

Each word/token has 10 fields defined in the CONLL-U format. Each field represents different attributes of the token whose details are given below:

Fields
1. ID:
The word/token index in the sentence; indexing starts from 1 in UD sentences.
2. FORM:
The word/token form or punctuation symbol as used in the sentence. For example, organize, organizer and organization are all different surface forms built on the word organize.
3. LEMMA:
The root/stem of the word. The lemma is the vocabulary form of a word, which helps in understanding the roots of different words. Lemmatization refers to using a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, known as the lemma.
4. UPOS:
UPOS marks the core part-of-speech category. It helps in analysing word usage in the sentence. The following UPOS tags are available in UD: (ADJ, ADV, INTJ, NOUN, PROPN, VERB, ADP, AUX, CCONJ, DET, NUM, PART, PRON, SCONJ, PUNCT, SYM, X)
5. XPOS:
A language-specific part-of-speech tag. Grammar rules differ between languages, which is why language-specific POS tags are used.
6. FEATS:
Features are additional pieces of information about the word, its part of speech and morphosyntactic properties. Every feature has the form Name=Value and every word can have any number of features, separated by the vertical bar, as in Gender=Masc|Number=Sing. Users can extend this set of universal features and add language-specific features when necessary.
7. HEAD:
The head of the current word, which is either a value of ID or zero (0). All words of a sentence depend on other words of the sentence; HEAD identifies the word to which the current word's DEPREL relation points.
8. DEPREL:
Universal dependency relation to the HEAD (root iff HEAD = 0) or a defined language-specific subtype of one
9. DEPS:
The enhanced dependency graph, in the form of a list of head-deprel pairs. A dependency can be labeled as dep when it is impossible to determine a more precise relation, for example because of an unusual grammatical construction or a limitation in conversion or parsing software; the use of dep should be avoided as much as possible.
10. MISC:
Any other annotation apart from the above-mentioned fields, such as commentary.
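The ten fields above map directly onto a tab-separated word line; a standard-library-only sketch using a token line adapted from the tutorial output (the field values here are illustrative):

```python
# CoNLL-U field names, in the order defined above.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]

# One word line: 10 fields separated by single tab characters.
line = "1\tوالدین\tوالدین\tNOUN\tNN\t_\t4\tnsubj\t_\t_"
token = dict(zip(FIELDS, line.split("\t")))

assert len(token) == 10
print(token["form"], token["upos"], token["head"])
```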
UPOS TAGS DETAILS

ADJ Adjectives are words that typically modify nouns and specify their properties or attributes: اسمبلی انتخابات کو 'صاف و شفاف' بنانے انتخابات مےں 'غیرسماجی' عناصر کی جانب سے بدامنی پھیلائے جانے کا خدشہ ہے Examples ‘واضح’ ,’یقینی’ ,’ہر’

ADV Adverbs are words that typically modify verbs for such categories as time, place, direction or manner. They may also modify adjectives and other adverbs, as in ‘very briefly’ or ‘arguably’ wrong. سوہال نے 24 گیندوں مےں جاریہ سیزن کی 'تیزترین' نصف سنچری بنائی Examples ‘ہرگز’ ,’قبل_ازیں’ ,’بعد’

INTJ An interjection is a word that is used most often as an exclamation or part of an exclamation. It typically expresses an emotional reaction, is not syntactically related to other accompanying expressions, and may include a combination of sounds not otherwise found in the language. چلو اب 'ذرا' دنیا کی سیر کر لیں Examples ‘آہ’ ,’ذرا’ ,’بس’

NOUN Nouns are a part of speech typically denoting a person, place, thing, animal or idea. بس اس 'تنہائی' کے 'عالم' مےں اےک 'یاد' 'آواز' بن کر 'سماعت' سے ٹکرائی۔ Examples ‘جسٹس’ ,’پاکستان’ ,’حصہ’ ,’خالہ’ ,’لوگ’

PROPN A proper noun is a noun (or nominal content word) that is the name (or part of the name) of a specific individual, place, or object. تاہم 'بھارت' 'کرشنا' کو اپنی منگیتر کے چال و چلن پر شبہ ہوا Examples ‘ڈی’ ,’سی’ ,’پی’ ,’ساؤتھ’ ,’احمد’ ,’نثار’

VERB A verb is a member of the syntactic class of words that typically signal events and actions, can constitute a minimal predicate in a clause, and govern the number and types of other constituents which may occur in the clause. Verbs are often associated with grammatical categories like tense, mood, aspect and voice, which can either be expressed inflectionally or using auxiliary verbs or particles. آزادانہ و منصفانہ انتخابات کو یقینی 'بنایا' جا سکے Examples ‘کہا’ ,’رہے’

ADP Adposition is a cover term for prepositions and postpositions. Adpositions belong to a closed set of items that occur before (preposition) or after (postposition) a complement composed of a noun phrase, noun, pronoun, or clause that functions as a noun phrase, and that form a single structure with the complement to express its grammatical and semantic relation to another unit within a clause. اےک شخص 'نے' مبینہ طور 'پر' اپنی منگیتر اور اس 'کے' والدین 'پر' چاقو 'سے' حملہ کر کے زخمی کر دیا۔ Examples ‘پر’ ,’نے’ ,’مےں’

AUX An auxiliary is a function word that accompanies the lexical verb of a verb phrase and expresses grammatical distinctions not carried by the lexical verb, such as person, number, tense, mood, aspect, voice or evidentiality. It is often a verb (which may have non-auxiliary uses as well) but many languages have nonverbal TAME markers and these should also be tagged AUX. The class AUX also include copulas (in the narrow sense of pure linking words for nonverbal predication). انتخابات کو یقینی بنایا 'جا' سکے۔ Examples ‘جا’ ,’ہے’ ,’رہے’

CCONJ A coordinating conjunction is a word that links words or larger constituents without syntactically subordinating one to the other and expresses a semantic relationship between them. انتخابات کی راست نگرانی 'اور' غنڈہ عناصر پر کنٹرول کے لئے سخت_ترین انتظامات کئے جائیں۔ Examples ‘لیکن’, ‘و’ (as in بدعنوانیوں و بےقاعدگیوں)

DET Determiners are words that modify nouns or noun phrases and express the reference of the noun phrase in context. That is, a determiner may indicate whether the noun is referring to a definite or indefinite element of a class, to a closer or more distant element, to an element belonging to a specified person or thing, to a particular number or quantity, etc. ریاستی حج کمیٹی 'اس' طرح کی کوئی تجویز رکھتی ہے Examples ‘تمام’ ,’ہر’ ,’جو’

NUM A numeral is a word, functioning most typically as a determiner, adjective or pronoun, that expresses a number and a relation to the number, such as quantity, sequence, frequency or fraction. 'اےک' شخص نے مبینہ طور پر اپنی منگیتر اور اس کے والدین پر چاقو سے حملہ کر کے زخمی کر دیا۔ Examples ‘۲’ ,’۱’ ,’۰’ ,’چار’ ,’اےک’

PART Particles are function words that must be associated with another word or phrase to impart meaning and that do not satisfy definitions of other universal parts of speech (e.g. adpositions, coordinating conjunctions, subordinating conjunctions or auxiliary verbs). Particles may encode grammatical categories such as negation, mood, tense etc. Particles are normally not inflected, although exceptions may occur. اس اجلاس مےں اطفال کے حق تعلیم کے قانون کا جائزہ 'بھی' لیا جائے_گا۔ Examples ‘ہی’ ,’مسٹر’ ,’نہیں’

PRON Pronouns are words that substitute for nouns or noun phrases, whose meaning is recoverable from the linguistic or extralinguistic context. احمد کے بموجب اگر دونوں ہی ٹیمیں 'اپنے' شیڈول مےں معمولی تبدیلی کرتے ہےں Examples ‘اپنی’ ,’ازیں’ ,’یہاں’

SCONJ A subordinating conjunction is a conjunction that links constructions by making one of them a constituent of the other. The subordinating conjunction typically marks the incorporated constituent which has the status of a (subordinate) clause. وہ ابھی پڑھ ہی رہے تھے 'کہ' بیٹے نے دروازہ کھٹکھٹایا۔ Examples ‘اگر’ ,’تو’

PUNCT Punctuation marks are non-alphabetical characters and character groups used in many languages to delimit linguistic units in printed text. ایسے دہشتگردوں کو اسلام سے خارج کر دیا جانا چاہیے'۔' Examples ‘!’ ,’.’ ,’۔’

SYM A symbol is a word-like entity that differs from ordinary words by form, function, or both. ایسے'$' دہشتگردوں کو اسلام سے خارج کر دیا جانا چاہیے Examples ‘@’, ‘%’

X The tag X is used for words that for some reason cannot be assigned a real part-of-speech category. It should be used very restrictively.

class urduhack.conll.CoNLL[source]

A CoNLL class to easily load the CoNLL-U format. This class can also load resources by iterating over strings. It is the main entrance to the conll module's functionality.

static get_fields() → List[str][source]

Get the list of conll fields

Returns:Return list of conll fields
Return type:List[str]
static iter_file(file_name: str) → Iterator[Tuple][source]

Iterate over a CoNLL-U file’s sentences.

Parameters:

file_name (str) – The name of the file whose sentences should be iterated over.

Yields:

Iterator[Tuple] – The sentences that make up the CoNLL-U file.

Raises:
  • IOError – If there is an error opening the file.
  • ParseError – If there is an error parsing the input into a Conll object.
static iter_string(text: str) → Iterator[Tuple][source]

Iterate over a CoNLL-U string’s sentences.

Use this method if you only need to iterate over the CoNLL-U file once and do not need to create or store the Conll object.

Parameters:text (str) – The CoNLL-U string.
Yields:Iterator[Tuple] – The sentences that make up the CoNLL-U file.
Raises:ParseError – If there is an error parsing the input into a Conll object.
static load_file(file_name: str) → List[Tuple][source]

Load a CoNLL-U file given its location.

Parameters:

file_name (str) – The location of the file.

Returns:

A Conll object equivalent to the provided file.

Return type:

List[Tuple]

Raises:
  • IOError – If there is an error opening the given filename.
  • ValueError – If there is an error parsing the input into a Conll object.

Normalization

The normalization of Urdu text is necessary to make it useful for machine learning tasks. The normalize module handles the most common problems faced when working with Urdu data with ease and efficiency. These problems, and how the module handles them, are listed below.

This module fixes the encoding of Urdu characters and replaces Arabic characters with the correct Urdu ones, bringing all characters into the Unicode range specified for the Urdu language (0600-06FF).

It also fixes the joining problem between Urdu words: when the space between two Urdu words is removed, they must not merge into a new word. Their rendering must not change; even after the removal of the space they should look the same.

You can use the library to normalize Urdu text to the correct Unicode characters. By normalization we mean removing the confusion between Urdu and Arabic characters, and replacing multiple characters with a single one while keeping in mind the context they are used in. For example, the characters ‘ﺁ’ and ‘ﺂ’ are both replaced by ‘آ’. All of this is done using regular expressions.

The normalization of Urdu text is necessary to make it useful for the machine learning tasks. This module provides the following functionality:

  • Normalizing Single Characters
  • Normalizing Combined Characters
  • Removal of Diacritics from Urdu Text
  • Replacing English Digits with Urdu Digits and Vice Versa
urduhack.normalization.normalize(text: str) → str[source]

To normalize some text, all you need to do is pass the Urdu text. It will return a str with normalized characters (both single and combined), proper spaces after digits and punctuation, and diacritics removed.

Parameters:text (str) – Urdu text
Returns:Normalized Urdu text
Return type:str
Raises:TypeError – If text param is not str Type.

Examples

>>> from urduhack import normalize
>>> _text = "اَباُوگل پاکستان ﻤﯿﮟ 20 سال ﺳﮯ ، وسائل کی کوئی کمی نہیں ﮨﮯ۔"
>>> normalized_text = normalize(_text)
>>> # The text now contains proper spaces after digits and punctuations,
>>> # normalized characters and no diacritics!
>>> normalized_text
اباوگل پاکستان ﻤﯿﮟ 20 سال ﺳﮯ ، وسائل کی کوئی کمی نہیں ﮨﮯ۔
urduhack.normalization.normalize_characters(text: str) → str[source]

The most important module in Urduhack is the character module, defined in the module of the same name. You can use it separately to normalize a piece of text to the proper Urdu Unicode range (0600-06FF). To understand how this module works, one needs to understand Unicode: every character has a unique code point, in any language, and no two characters share the same one. This module works with reference to these code points. As the Urdu language has its roots in Arabic, Persian and Turkish, we have to deal with all those characters and convert them to the normal Urdu character. To make the above explanation more concrete:

>>> all_fes = ['ﻑ', 'ﻒ', 'ﻓ', 'ﻔ', ]
>>> urdu_fe = 'ف'

All the characters in all_fes are the same letter, but they come from different languages and have different code points. As computers deal with numbers, the same character appearing in a different language will have a different code point, creating confusion that makes it harder to understand the context of the data. The character module eliminates this problem by replacing all the characters in all_fes with urdu_fe.
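The same collapsing of presentation forms can be observed with the standard library alone: Unicode's NFKC normalization maps each of the all_fes variants to the single base letter. This is shown only to illustrate the idea; urduhack's character module applies its own Urdu-specific mappings:

```python
import unicodedata

all_fes = ['ﻑ', 'ﻒ', 'ﻓ', 'ﻔ']  # Arabic presentation forms of FEH
urdu_fe = 'ف'                     # the single canonical code point U+0641

for ch in all_fes:
    # Each variant has its own code point, yet NFKC folds it to the base letter.
    assert ord(ch) != ord(urdu_fe)
    assert unicodedata.normalize("NFKC", ch) == urdu_fe
print("all variants normalize to", urdu_fe)
```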

This provides the functionality to replace incorrect Arabic characters with correct Urdu characters and fixes the combined/joined character issues.

Replace urdu text characters with correct unicode characters.

Parameters:text (str) – Urdu text
Returns:Returns a str object containing normalized text.
Return type:str

Examples

>>> from urduhack.normalization import normalize_characters
>>> # Text containing characters from Arabic Unicode block
>>> _text = "مجھ کو جو توڑا ﮔیا تھا"
>>> normalized_text = normalize_characters(_text)
>>> # Normalized text - Arabic characters are now replaced with Urdu characters
>>> normalized_text
مجھ کو جو توڑا گیا تھا
urduhack.normalization.normalize_combine_characters(text: str) → str[source]

To normalize combined characters into single-character Unicode text, use the normalize_combine_characters() function in the character module.

Replaces combined/joined Urdu characters with a single Unicode character.

Parameters:text (str) – Urdu text
Returns:Returns a str object containing normalized text.
Return type:str

Examples

>>> from urduhack.normalization import normalize_combine_characters
>>> # In the following string, Alif ('ا') and Hamza ('ٔ ') are separate characters
>>> _text = "جرأت"
>>> normalized_text = normalize_combine_characters(_text)
>>> # Now Alif and Hamza are replaced by a Single Urdu Unicode Character!
>>> normalized_text
جرأت
urduhack.normalization.remove_diacritics(text: str) → str[source]

Remove Urdu diacritics from text. This is an important step in pre-processing Urdu data. The function returns a str object which contains the original text minus Urdu diacritics.

Parameters:text (str) – Urdu text
Returns:Returns a str object containing normalized text.
Return type:str

Examples

>>> from urduhack.normalization import remove_diacritics
>>> _text = "شیرِ پنجاب"
>>> normalized_text = remove_diacritics(_text)
>>> normalized_text
شیر پنجاب
urduhack.normalization.replace_digits(text: str, with_english: bool = True) → str[source]

Replace urdu digits with English digits and vice versa

Parameters:
  • text (str) – Urdu text string
  • with_english (bool) – If True, replace Urdu digits with English digits; if False, convert in the other direction
Returns:

Text string with replaced digits
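The digit mapping can be sketched with str.translate over the Urdu digit range (U+06F0 to U+06F9); this illustrates the idea only and is not urduhack's exact implementation:

```python
# Map Extended Arabic-Indic (Urdu) digits to ASCII digits and back.
URDU_DIGITS = "۰۱۲۳۴۵۶۷۸۹"
ENGLISH_DIGITS = "0123456789"

TO_ENGLISH = str.maketrans(URDU_DIGITS, ENGLISH_DIGITS)
TO_URDU = str.maketrans(ENGLISH_DIGITS, URDU_DIGITS)

print("سال ۲۰۲۰".translate(TO_ENGLISH))  # سال 2020
print("سال 2020".translate(TO_URDU))     # سال ۲۰۲۰
```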

Tokenization

This module is another crucial part of Urduhack. It performs tokenization on raw text: it separates sentences from each other and converts each string into a complete sentence token. Note that a sentence token must not be confused with a word token; they are two completely different things.

This library provides a state-of-the-art word tokenizer for the Urdu language. It takes care of spaces and of where to connect two Urdu characters and where not to.

The tokenization of Urdu text is necessary to make it useful for the NLP tasks. This module provides the following functionality:

  • Sentence Tokenization
  • Word Tokenization

The tokenization of Urdu text is necessary to make it useful for machine learning tasks. In the tokenization module, we solved the problems related to sentence and word tokenization.

urduhack.tokenization.sentence_tokenizer(text: str) → List[str][source]

Convert Urdu text into possible sentences. If successful, this function returns a List object containing multiple Urdu sentence strings.

Parameters:text (str) – Urdu text
Returns:Returns a list object containing multiple urdu sentences type str.
Return type:list
Raises:TypeError – If text is not a str Type

Examples

>>> from urduhack.tokenization import sentence_tokenizer
>>> text = "عراق اور شام نے اعلان کیا ہے دونوں ممالک جلد اپنے اپنے سفیروں کو واپس بغداد اور دمشق بھیج دیں گے؟"
>>> sentences = sentence_tokenizer(text)
>>> sentences
["دونوں ممالک جلد اپنے اپنے سفیروں کو واپس بغداد اور دمشق بھیج دیں گے؟" ,"عراق اور شام نے اعلان کیا ہے۔"]
urduhack.tokenization.word_tokenizer(sentence: str, max_len: int = 256) → List[str][source]

To convert raw Urdu text into tokens, use the word_tokenizer() function. Before doing this you should normalize your sentence as well; for normalizing the Urdu sentence use the urduhack.normalization.normalize() function. If word_tokenizer runs successfully, it returns a List object containing Urdu word tokens as strings.

Parameters:
  • sentence (str) – urdu text or list of text
  • max_len (int) – Maximum text length supported by model
Returns:

Returns a List[str] containing urdu tokens

Return type:

list

Examples

>>> sent = 'عراق اور شام نے اعلان کیا ہے دونوں ممالک جلد اپنے اپنے سفیروں کو واپس بغداد اور دمشق بھیج دیں گے؟'
>>> from urduhack.tokenization import word_tokenizer
>>> word_tokenizer(sent)
Tokens:  ['عراق', 'اور', 'شام', 'نے', 'اعلان', 'کیا', 'ہے', 'دونوں', 'ممالک', 'جلد', 'اپنے', 'اپنے', 'سفیروں', 'کو', 'واپس', 'بغداد', 'اور', 'دمشق', 'بھیج', 'دیں', 'گے؟']

Text PreProcessing

The pre-processing of Urdu text is necessary to make it useful for the machine learning tasks. This module provides the following functionality:

  • Normalize whitespace
  • Put Spaces Before & After Digits
  • Put Spaces Before & After English Words
  • Put Spaces Before & After Urdu Punctuations
  • Replace URLs
  • Replace emails
  • Replace numbers
  • Replace phone numbers
  • Replace currency symbols

You can look up all the functions that come with the pre-process module in the preprocess reference.

urduhack.preprocessing.digits_space(text: str) → str[source]

Add spaces before|after numeric and urdu digits

Parameters:text (str) – Urdu text
Returns:Returns a str object containing normalized text.
Return type:str

Examples

>>> from urduhack.preprocessing import digits_space
>>> text = "20فیصد"
>>> normalized_text = digits_space(text)
>>> normalized_text
20 فیصد
urduhack.preprocessing.english_characters_space(text: str) → str[source]

Functionality to add spaces before and after English words in the given Urdu text. It is an important step in normalization of Urdu data.

This function returns a str object which contains the original text with spaces before and after English words.

Parameters:text (str) – Urdu text
Returns:Returns a str object containing normalized text.
Return type:str

Examples

>>> from urduhack.preprocessing import english_characters_space
>>> text = "خاتون Aliyaنے بچوںUzma and Aliyaکے قتل کا اعترافConfession کیا ہے۔"
>>> normalized_text = english_characters_space(text)
>>> normalized_text
خاتون Aliya نے بچوں Uzma and Aliya کے قتل کا اعتراف Confession کیا ہے۔
urduhack.preprocessing.all_punctuations_space(text: str) → str[source]

Add spaces after punctuations used in urdu writing

Parameters:text (str) – Urdu text
Returns:Returns a str object containing normalized text.
Return type:str
urduhack.preprocessing.preprocess(text: str) → str[source]

To preprocess some text, all you need to do is pass Unicode text. It will return a str with proper spaces after digits and punctuation.

Parameters:text (str) – Urdu text
Returns:urdu text
Return type:str
Raises:TypeError – If text param is not str Type.

Examples

>>> from urduhack.preprocessing import preprocess
>>> text = "اَباُوگل پاکستان ﻤﯿﮟ 20 سال ﺳﮯ ، وسائل کی کوئی کمی نہیں ﮨﮯ۔"
>>> normalized_text = preprocess(text)
>>> # The text now contains proper spaces after digits and punctuations,
>>> # normalized characters and no diacritics!
>>> normalized_text
اباوگل پاکستان ﻤﯿﮟ 20 سال ﺳﮯ ، وسائل کی کوئی کمی نہیں ﮨﮯ ۔
urduhack.preprocessing.normalize_whitespace(text: str)[source]

Given text str, replace one or more spacings with a single space, and one or more linebreaks with a single newline. Also strip leading/trailing whitespace.

Parameters:text (str) – Urdu text
Returns:Returns a str object containing normalized text.
Return type:str

Examples

>>> from urduhack.preprocessing import normalize_whitespace
>>> text = "عراق اور شام     اعلان کیا ہے دونوں         جلد اپنے     گے؟"
>>> normalized_text = normalize_whitespace(text)
>>> normalized_text
عراق اور شام اعلان کیا ہے دونوں جلد اپنے گے؟
urduhack.preprocessing.remove_punctuation(text: str, marks=None) → str[source]

Remove punctuation from text by removing all instances of marks.

Parameters:
  • text (str) – Urdu text
  • marks (str) – If specified, remove only the characters in this string, e.g. marks=',;:' removes commas, semi-colons, and colons. Otherwise, all punctuation marks are removed.
Returns:

returns a str object containing normalized text.

Return type:

str

Note

When marks=None, Python’s built-in str.translate() is used to remove punctuation; otherwise, a regular expression is used instead. The former’s performance is about 5-10x faster.

Examples

>>> from urduhack.preprocessing import remove_punctuation
>>> output = remove_punctuation("کر ؟ سکتی ہے۔")
>>> output
'کر سکتی ہے'
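The note above (a translation table for the default case, a regex when marks is given) can be illustrated with a simplified sketch; the function name, and the small set of Urdu marks used here, are illustrative assumptions rather than the library's actual code:

```python
import re
import string

# Assumption: a few common Urdu marks; the library's actual table may differ.
URDU_MARKS = "؟۔،؛"

def remove_punctuation_sketch(text: str, marks=None) -> str:
    if marks is None:
        # Fast path: strip all punctuation with a single translation table.
        return text.translate(str.maketrans("", "", string.punctuation + URDU_MARKS))
    # Targeted path: remove only the requested marks via a regular expression.
    return re.sub(f"[{re.escape(marks)}]+", "", text)
```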
urduhack.preprocessing.remove_accents(text: str) → str[source]

Remove accents from any accented unicode characters in text str, either by transforming them into ascii equivalents or removing them entirely.

Parameters:text (str) – Urdu text
Returns:Returns a str object with accents removed.
Return type:str

Examples

>>> from urduhack.preprocessing import remove_accents
>>> text = "دالتِ عظمیٰ درخواست"
>>> remove_accents(text)
'دالت عظمی درخواست'

urduhack.preprocessing.replace_urls(text: str, replace_with='')[source]

Replace all URLs in text str with replace_with str.

Parameters:
  • text (str) – Urdu text
  • replace_with (str) – Replace string
Returns:

Returns a str object with all URLs replaced by the replace_with text.

Return type:

str

Examples

>>> from urduhack.preprocessing import replace_urls
>>> text = "20 www.gmail.com  فیصد"
>>> replace_urls(text)
'20  فیصد'
urduhack.preprocessing.replace_emails(text: str, replace_with='')[source]

Replace all emails in text str with replace_with str.

Parameters:
  • text (str) – Urdu text
  • replace_with (str) – Replace string
Returns:

Returns a str object with all emails replaced by the replace_with text.

Return type:

str

Examples

>>> from urduhack.preprocessing import replace_emails
>>> text = "20 gunner@gmail.com  فیصد"
>>> replace_emails(text)
urduhack.preprocessing.replace_numbers(text: str, replace_with='')[source]

Replace all numbers in text str with replace_with str.

Parameters:
  • text (str) – Urdu text
  • replace_with (str) – Replace string
Returns:

Returns a str object with all numbers replaced by the replace_with text.

Return type:

str

Examples

>>> from urduhack.preprocessing import replace_numbers
>>> text = "20  فیصد"
>>> replace_numbers(text)
' فیصد'
urduhack.preprocessing.replace_phone_numbers(text: str, replace_with='')[source]

Replace all phone numbers in text str with replace_with str.

Parameters:
  • text (str) – Urdu text
  • replace_with (str) – Replace string
Returns:

Returns a str object with all phone numbers replaced by the replace_with text.

Return type:

str

Examples

>>> from urduhack.preprocessing import replace_phone_numbers
>>> text = "یعنی لائن آف کنٹرول پر فائربندی کا معاہدہ 555-123-4567 میں ہوا تھا"
>>> replace_phone_numbers(text)
'یعنی لائن آف کنٹرول پر فائربندی کا معاہدہ میں ہوا تھا'
urduhack.preprocessing.replace_currency_symbols(text: str, replace_with=None)[source]

Replace all currency symbols in text str with string specified by replace_with str.

Parameters:
  • text (str) – Raw text
  • replace_with (str) – if None (default), replace symbols with their standard 3-letter abbreviations (e.g. ‘$’ with ‘USD’, ‘£’ with ‘GBP’); otherwise, pass in a string with which to replace all symbols (e.g. “CURRENCY”)
Returns:

Returns a str object containing normalized text.

Return type:

str

Examples

>>> from urduhack.preprocessing import replace_currency_symbols
>>> text = "یعنی لائن آف کنٹرول پر فائربندی کا معاہدہ 2003 میں ہوا 33$ تھا۔"
>>> replace_currency_symbols(text)

'یعنی لائن آف کنٹرول پر فائربندی کا معاہدہ 2003 میں ہوا 33USD تھا۔'
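A minimal sketch of this symbol-to-code mapping. The dictionary below is an illustrative assumption (the library's actual table is larger and may differ), and the function name is hypothetical:

```python
# Hypothetical symbol table for illustration only.
CURRENCY_CODES = {"$": "USD", "£": "GBP", "€": "EUR", "¥": "JPY", "₨": "PKR"}

def replace_currency_symbols_sketch(text: str, replace_with=None) -> str:
    """Swap each known currency symbol for its 3-letter code,
    or for `replace_with` when one is supplied."""
    for symbol, code in CURRENCY_CODES.items():
        text = text.replace(symbol, code if replace_with is None else replace_with)
    return text

print(replace_currency_symbols_sketch("33$"))              # 33USD
print(replace_currency_symbols_sketch("33$", "CURRENCY"))  # 33CURRENCY
```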

urduhack.preprocessing.remove_english_alphabets(text: str)[source]

Removes English alphabet characters and digits from a text.

Parameters:text (str) – Urdu text
Returns:str object with English alphabets and digits removed
Return type:str
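Since no example is given for this function, here is a simplified re-implementation showing the intended behaviour (the function name is hypothetical, and the library may use different character ranges):

```python
import re

def remove_english_alphabets_sketch(text: str) -> str:
    """Strip ASCII letters and digits; Urdu script is left untouched."""
    return re.sub(r"[A-Za-z0-9]+", "", text)

print(remove_english_alphabets_sketch("abc 123 سال"))  # '  سال' (stray spaces remain)
```

Removal can leave stray spaces behind, so following up with normalize_whitespace is a sensible next step.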

Utils

Utils module

Collection of helper functions.

urduhack.utils.pickle_load(file_name: str) → Any[source]

Load the pickle file

Parameters:file_name (str) – file name
Returns:The deserialized Python object
Return type:Any
urduhack.utils.pickle_dump(file_name: str, data: Any)[source]

Save a Python object in pickle format

Parameters:
  • file_name (str) – file name
  • data (Any) – Any data type
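These two helpers presumably wrap the standard library's pickle module; a minimal round-trip sketch under that assumption (the function names are hypothetical):

```python
import os
import pickle
import tempfile
from typing import Any

def pickle_dump_sketch(file_name: str, data: Any) -> None:
    """Serialize `data` to `file_name` in pickle format."""
    with open(file_name, "wb") as fh:
        pickle.dump(data, fh)

def pickle_load_sketch(file_name: str) -> Any:
    """Load and return the Python object stored in `file_name`."""
    with open(file_name, "rb") as fh:
        return pickle.load(fh)

# Round-trip demo.
path = os.path.join(tempfile.gettempdir(), "urduhack_demo.pkl")
pickle_dump_sketch(path, {"lang": "ur", "tokens": ["سال"]})
restored = pickle_load_sketch(path)
```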
urduhack.utils.download_from_url(file_name: str, url: str, download_dir: str, cache_dir: Optional[str] = None)[source]

Download a file from an HTTP url

Parameters:
  • file_name (str) – Save file as provided file name
  • url (str) – HTTP url
  • download_dir (str) – location to store file
  • cache_dir (str) – Main download dir
Raises:

TypeError – If any of the url, download_dir or file_name arguments is not str type.
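A minimal sketch of what this helper does, using urllib from the standard library (the function name is hypothetical and cache_dir handling is omitted). The demo uses a local file:// URL so it runs without network access:

```python
import os
import pathlib
import tempfile
import urllib.request

def download_from_url_sketch(file_name: str, url: str, download_dir: str) -> str:
    """Fetch `url` and save it as download_dir/file_name, returning the path."""
    if not all(isinstance(arg, str) for arg in (file_name, url, download_dir)):
        raise TypeError("file_name, url and download_dir must be str")
    os.makedirs(download_dir, exist_ok=True)
    target = os.path.join(download_dir, file_name)
    urllib.request.urlretrieve(url, target)
    return target

# Demo: "download" a local file via a file:// URL.
src = pathlib.Path(tempfile.mkdtemp()) / "model.txt"
src.write_text("weights")
saved = download_from_url_sketch("model.txt", src.as_uri(), tempfile.mkdtemp())
```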

urduhack.utils.remove_file(file_name: str)[source]

Delete the local file

Parameters:

file_name (str) – File to be deleted

About

Authors

Ikram Ali (Core contributor)

A machine learning practitioner and avid learner with professional experience managing Python, PHP and JavaScript projects, and strong machine learning / deep learning skills.

Drop me a line at mrikram1989@gmail.com or call me at 92 3320 453648.

Goals

The author’s goal is to foster and support active development of the urduhack library.

License

Urduhack is licensed under MIT License.

If you want to support the urduhack library, please report issues here.

Release Notes

Note

Contributors please include release notes as needed or appropriate with your bug fixes, feature additions and tests.

0.2.2

Changes:

  • Word tokenizer
    Urdu word tokenization functionality added. To convert a normalized Urdu sentence into word tokens, use the urduhack.tokenization.word_tokenizer function.

0.1.0

Changes:

  • Normalize function
    Single function added to do all the normalization steps. To normalize some text, import the urduhack.normalize function; it returns a string with normalized characters (both single and combined), proper spaces after digits and punctuation, and all diacritics removed.
  • Sentence Tokenizer
    Urdu sentence tokenization functionality added. To convert raw Urdu text into sentences, use the urduhack.tokenization.sentence_tokenizer function.

Bug fixes:

  • Fixed bugs in remove_diacritics()

0.0.2

Changes:

  • Character Level Normalization
    The urduhack.normalization.character module provides the functionality to replace incorrect Arabic characters with the correct Urdu characters.
  • Space Normalization
    The urduhack.normalization.space.util module provides functionality to put proper spaces before and after numeric digits, Urdu digits and punctuation in Urdu text.
  • Diacritics Removal
    The urduhack.utils.text.remove_diacritics module provides the functionality to remove Urdu diacritics from text. It is an important pre-processing step for Urdu data.

0.0.1

Changes:

  • Urdu character normalization API added.
  • Urdu space normalization utilities added.
  • Correct Unicode ranges for Urdu characters added.

Deprecations and removals

This page lists urduhack features that are deprecated, or have been removed in past major releases, and gives the alternatives to use instead.
