A comparative analysis of PoS tagging tools for Hindi and Marathi

Pratik Narayanrao Kalamkar, Prasadu Peddi, Yogesh Kumar Sharma

Abstract


Many tools exist for part-of-speech (PoS) tagging in Hindi and Marathi, yet no standard benchmark or performance evaluation data exists to help researchers choose the tool that best fits their needs. This paper presents a performance comparison of PoS taggers and widely available trained models for these two languages. We use data sets of different granularity to compare the performance and precision of these tools with the Stanford PoS tagger. Because the tag sets used by these taggers differ, we propose a mapping between PoS tag sets to address this inherent challenge in tagger comparison. We test the proposed mappings on newly created Hindi and Marathi movie script and subtitle data sets, since movie scripts are distinctive in how they are formatted and structured. We survey and compare five PoS taggers: the IMLT Hindi rule-based PoS tagger, the LTRC IIIT Hindi PoS tagger, the CDAC Hindi PoS tagger, the LTRC Marathi PoS tagger, and the CDAC Marathi PoS tagger. The study also helps evaluate how the Bureau of Indian Standards (BIS) tag set for Indian languages compares to the Universal Dependencies (UD) PoS tag set, an aspect that no previous study has evaluated.
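The tag set mapping idea described in the abstract can be illustrated with a minimal sketch. The mapping table below is an assumption made for illustration only: it pairs a handful of BIS-style tags with UD universal PoS tags and is not the mapping proposed in the paper. The helper names `to_ud` and `accuracy` are likewise hypothetical. The sketch projects a tagger's BIS output onto UD tags and then measures token-level agreement with a UD-tagged reference.

```python
from typing import Dict, List

# Illustrative (hypothetical) BIS -> UD mapping; NOT the mapping from the paper.
BIS_TO_UD: Dict[str, str] = {
    "N_NN": "NOUN",      # common noun
    "N_NNP": "PROPN",    # proper noun
    "PR_PRP": "PRON",    # personal pronoun
    "V_VM": "VERB",      # main verb
    "V_VAUX": "AUX",     # auxiliary verb
    "JJ": "ADJ",         # adjective
    "RB": "ADV",         # adverb
    "PSP": "ADP",        # postposition
    "CC_CCD": "CCONJ",   # coordinating conjunction
    "CC_CCS": "SCONJ",   # subordinating conjunction
    "QT_QTC": "NUM",     # cardinal number
    "RD_PUNC": "PUNCT",  # punctuation
}


def to_ud(bis_tag: str) -> str:
    """Project a BIS tag onto a UD tag; unmapped tags fall back to 'X'."""
    return BIS_TO_UD.get(bis_tag, "X")


def accuracy(predicted_bis: List[str], gold_ud: List[str]) -> float:
    """Token-level accuracy of BIS predictions against a UD-tagged reference."""
    if len(predicted_bis) != len(gold_ud):
        raise ValueError("predicted and gold sequences must have equal length")
    if not gold_ud:
        return 0.0
    matches = sum(to_ud(p) == g for p, g in zip(predicted_bis, gold_ud))
    return matches / len(gold_ud)


# Made-up tag sequences for a five-token sentence (not data from the paper).
predicted = ["N_NNP", "N_NN", "V_VM", "V_VAUX", "RD_PUNC"]
gold = ["PROPN", "NOUN", "VERB", "AUX", "PUNCT"]
score = accuracy(predicted, gold)
```

Projecting every tagger's output onto one shared tag set before scoring is what makes taggers with different native tag sets comparable; the cost is that distinctions present only in the finer-grained tag set are lost in the projection.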


Keywords


Computational linguistics; Machine learning; Natural language processing; Part-of-speech tagging; Text analytics; Tokenization


DOI: http://doi.org/10.11591/ijict.v15i1.pp120-137


Copyright (c) 2026 Pratik Narayanrao Kalamkar, Prasadu Peddi, Yogesh Kumar Sharma

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

The International Journal of Informatics and Communication Technology (IJ-ICT)
p-ISSN 2252-8776, e-ISSN 2722-2616
This journal is published by the Institute of Advanced Engineering and Science (IAES) in collaboration with Intelektual Pustaka Media Utama (IPMU).
