MACHINE LEARNING MODELS FOR HAUSA-BASED LANGUAGE (WORDS) LEMMATIZATION

Authors

  • Muhammad Adamu
  • Rasheed A. Rasheed, PhD

DOI:

https://doi.org/10.33003/fjs-2025-0912-4396

Keywords:

Hausa NLP, Lemmatization, Low-Resource Languages, Machine Learning, Support Vector Machine, Random Forest, Morphological Analysis, African Languages, Dataset Curation

Abstract

Lemmatization, the process of reducing inflected word forms to their canonical dictionary form (lemma), is a foundational task in Natural Language Processing (NLP) that significantly enhances the performance of downstream applications like information retrieval, machine translation, and sentiment analysis. For morphologically rich, low-resource languages like Hausa—spoken by over 50 million people—this task is particularly challenging due to the absence of annotated datasets and the limitations of traditional rule-based systems, which are labor-intensive and fail to generalize. This paper presents the first comparative study of classical supervised machine learning models, Support Vector Machine (SVM) and Random Forest (RF), for Hausa word lemmatization. To address the critical data scarcity, we manually curated and linguistically validated a novel dataset of 4,530 Hausa word-lemma pairs sourced from diverse corpora including BBC Hausa, VOA Hausa, and BUK FM radio transcripts. Our methodology involved comprehensive text preprocessing (normalization of diacritics, clitic handling, stopword removal) and sophisticated feature engineering, extracting character trigrams, word length, prefix/suffix flags, and reduplication indicators. Models were trained on an 80:20 split and optimized using GridSearchCV. Our results demonstrate that the Random Forest model outperformed SVM significantly, achieving an accuracy of 63.25% and an F1-score of 60.26%, compared to SVM’s 56.73% accuracy and 54.75% F1-score. Crucially, RF was also six times faster to train (180 seconds vs. 1,094 seconds), making it far more practical for deployment. Feature importance analysis revealed that character trigrams and word length were the most predictive features, highlighting the efficacy of subword morphological cues.
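To make the described pipeline concrete, the sketch below shows one way the abstract's feature set and model comparison could be assembled with scikit-learn. It is an illustrative reconstruction, not the authors' released code: the dataset path, the column names (word, lemma), the affix lists, and the hyperparameter grids are assumptions, while the character-trigram, word-length, prefix/suffix-flag, and reduplication features, the 80:20 split, GridSearchCV tuning, and the RF-vs-SVM comparison follow the abstract.

```python
# Illustrative sketch of the pipeline described in the abstract (not the authors' code).
# Assumptions: a CSV of word-lemma pairs with columns "word" and "lemma", and the affix
# lists below; the paper's diacritic/clitic preprocessing is simplified to lowercasing.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.svm import SVC

HAUSA_PREFIXES = ("ma", "mai", "masu")        # illustrative, not the paper's exact list
HAUSA_SUFFIXES = ("wa", "ce", "iya", "anci")  # illustrative, not the paper's exact list

def handcrafted_features(words):
    """Word length, prefix/suffix flags, and a crude full-reduplication indicator."""
    rows = []
    for w in words:
        half = len(w) // 2
        rows.append([
            len(w),
            int(w.startswith(HAUSA_PREFIXES)),
            int(w.endswith(HAUSA_SUFFIXES)),
            int(half > 1 and w[:half] == w[-half:]),  # e.g. "kala-kala"-style reduplication
        ])
    return np.array(rows)

df = pd.read_csv("hausa_word_lemma_pairs.csv")  # hypothetical path to the 4,530-pair dataset
X, y = df["word"].str.lower(), df["lemma"]

# Character trigrams plus the handcrafted morphological cues named in the abstract.
features = FeatureUnion([
    ("char_trigrams", CountVectorizer(analyzer="char_wb", ngram_range=(3, 3))),
    ("handcrafted", FunctionTransformer(handcrafted_features)),
])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Hyperparameter grids are placeholders; the paper does not state the grids it searched.
models = {
    "RandomForest": (RandomForestClassifier(random_state=42),
                     {"clf__n_estimators": [100, 300], "clf__max_depth": [None, 20]}),
    "SVM": (SVC(), {"clf__C": [1, 10], "clf__kernel": ["linear", "rbf"]}),
}

for name, (clf, grid) in models.items():
    pipe = Pipeline([("features", features), ("clf", clf)])
    search = GridSearchCV(pipe, grid, cv=3, scoring="f1_weighted", n_jobs=-1)
    search.fit(X_train, y_train)
    pred = search.predict(X_test)
    print(name,
          "accuracy:", round(accuracy_score(y_test, pred), 4),
          "weighted F1:", round(f1_score(y_test, pred, average="weighted"), 4))
```

This sketch frames lemmatization as direct lemma classification, which is one possible reading of the abstract; if the paper instead predicts transformation or edit classes rather than full lemmas, only the target encoding would change.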

Published

29-12-2025

How to Cite

Adamu, M., & Rasheed A., R. P. (2025). MACHINE LEARNING MODELS FOR HAUSA-BASED LANGUAGE (WORDS) LEMMATIZATION. FUDMA JOURNAL OF SCIENCES, 9(12), 352 – 357. https://doi.org/10.33003/fjs-2025-0912-4396