The Impact of Arabic Diacritization on Word Embeddings
Journal article
Authors | Abbache, M., Abbache, A., Xu, J.W., Meziane, F. and Wen, X.B. |
---|---|
Abstract | Word embedding is used to represent words for text analysis. It plays an essential role in many Natural Language Processing (NLP) studies and has contributed greatly to the extraordinary developments in the field in the last few years. In Arabic, diacritic marks are a vital feature for the readability and understandability of the language. Current Arabic word embeddings are non-diacritized. In this paper, we aim to develop and compare word embedding models based on diacritized and non-diacritized corpora to study the impact of Arabic diacritization on word embeddings. We propose evaluating the models in four different ways: clustering of the nearest words; morphological semantic analysis; part-of-speech (POS) tagging; and semantic analysis. For a better evaluation, we created three new datasets from scratch for the three downstream tasks. We conducted the downstream tasks with eight machine learning algorithms and two deep learning algorithms. Experimental results show that the diacritized model is better at capturing syntactic and semantic relations and at clustering words of similar categories. Overall, the diacritized model outperforms the non-diacritized model. We also report some further findings. For example, in the morphological semantic analysis, the advantage of the diacritized model becomes more pronounced as the number of target words increases, and diacritic marks matter more in POS tagging than in the other tasks. |
Keywords | Arabic NLP; Word Embeddings; Diacritization; Morphological Semantics; Semantic Analysis |
Year | 2023 |
Journal | ACM Transactions on Asian and Low-Resource Language Information Processing |
Journal citation | pp. 1-32 |
Publisher | ACM Press |
ISSN | 2375-4699 |
Digital Object Identifier (DOI) | https://doi.org/10.1145/3592603 |
Web address (URL) | https://dl.acm.org/doi/10.1145/3592603 |
Accepted author manuscript | License: CC BY-NC 4.0; File access level: Open |
Output status | Published |
Publication dates | |
Online | 19 Apr 2023 |
Publication process dates | |
Accepted | 30 Mar 2023 |
Deposited | 21 Apr 2023 |
Supplemental file | File access level: Open |
https://repository.derby.ac.uk/item/9y0xw/the-impact-of-arabic-diacritization-on-word-embeddings
Download files
Accepted author manuscript
The Impact of Arabic Diacritization on Word Embeddings_Final.docx
License: CC BY-NC 4.0
File access level: Open