Release Notes


Contributors, please include release notes, as needed or appropriate, with your bug fixes, feature additions and tests.



  • Word tokenizer
    Urdu word tokenization functionality added. To convert a normalized Urdu sentence into possible word tokens, use the urduhack.tokenization.word_tokenizer function.



  • Normalize function
    A single function now performs all normalization steps. To normalize text, import urduhack.normalize and call it on the text. It returns a string with normalized single and combined characters, proper spaces after digits and punctuation, and diacritics removed.
  • Sentence Tokenizer
    Urdu sentence tokenization functionality added. To convert raw Urdu text into possible sentences, use the urduhack.tokenization.sentence_tokenizer function.
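
To illustrate what sentence tokenization means for Urdu text, here is a minimal standard-library sketch. It is not the urduhack implementation: the helper name split_sentences and the choice of sentence-ending marks (Urdu full stop U+06D4, Arabic question mark U+061F, and "!") are assumptions made for this example only.

```python
import re
from typing import List

# Assumed sentence-ending marks: Urdu full stop (U+06D4),
# Arabic question mark (U+061F) and the exclamation mark.
SENTENCE_END = re.compile(r"(?<=[\u06D4\u061F!])\s+")


def split_sentences(text: str) -> List[str]:
    """Hypothetical helper: split after a sentence-ending mark followed by whitespace."""
    parts = SENTENCE_END.split(text.strip())
    return [part.strip() for part in parts if part.strip()]


if __name__ == "__main__":
    sample = "یہ پہلا جملہ ہے\u06D4 کیا یہ دوسرا جملہ ہے\u061F"
    for sentence in split_sentences(sample):
        print(sentence)
```

A rule-based split like this misses cases such as abbreviations or missing whitespace, so the library function may behave differently on real text.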

Bug fixes:

  • Fixed bugs in remove_diacritics()



  • Character Level Normalization
    The urduhack.normalization.character module provides the functionality to replace incorrect Arabic characters with the correct Urdu characters.
  • Space Normalization
    The module provides functionality to put proper spaces before and after numeric digits, Urdu digits and punctuation in Urdu text.
  • Diacritics Removal
    urduhack.utils.text.remove_diacritics in UrduHack provides the functionality to remove Urdu diacritics from text. This is an important step in pre-processing Urdu data.
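
The three normalization steps above can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not urduhack's code: the function names, the two-entry replacement table, and the use of the Unicode "Mn" category to detect diacritics are all choices made for this example. The spacing step is deliberately simple and handles only digits (it would also split decimals such as 3.5).

```python
import re
import unicodedata

# Illustrative subset of Arabic-to-Urdu replacements (assumption, not urduhack's table).
ARABIC_TO_URDU = str.maketrans({
    "\u064A": "\u06CC",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
    "\u0643": "\u06A9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
})

# ASCII digits plus Extended Arabic-Indic digits (U+06F0..U+06F9), used in Urdu.
DIGITS = "0-9\u06F0-\u06F9"


def normalize_characters(text):
    """Replace the mapped Arabic code points with their Urdu counterparts."""
    return text.translate(ARABIC_TO_URDU)


def space_around_digits(text):
    """Insert a space wherever a digit touches a non-space, non-digit character."""
    text = re.sub(rf"(?<=[^\s{DIGITS}])(?=[{DIGITS}])", " ", text)
    text = re.sub(rf"(?<=[{DIGITS}])(?=[^\s{DIGITS}])", " ", text)
    return text


def strip_diacritics(text):
    """Drop every non-spacing mark (category Mn); broader than an Urdu-only list."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


if __name__ == "__main__":
    raw = "\u0643\u0650\u062A\u0627\u0628\u06F1\u06F2"  # kaf + zer + te + alef + be + digits
    cleaned = strip_diacritics(space_around_digits(normalize_characters(raw)))
    print(cleaned)
```

Applying character replacement first keeps the later steps working on a single, consistent set of code points.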



  • Urdu character normalization API added.
  • Urdu space normalization utilities added.
  • Correct Unicode ranges for Urdu characters added.