The Impact of Arabic Diacritization on Word Embeddings
Journal article
Authors | Abbache, M., Abbache, A., Xu, J.W., Meziane, F. and Wen, X.B. |
---|---|
Abstract | Word embeddings are used to represent words for text analysis. They play an essential role in many Natural Language Processing (NLP) studies and have contributed substantially to the field's rapid development in the last few years. In Arabic, diacritic marks are vital to the readability and comprehension of the language, yet current Arabic word embeddings are non-diacritized. In this paper, we develop and compare word embedding models based on diacritized and non-diacritized corpora to study the impact of Arabic diacritization on word embeddings. We evaluate the models in four ways: clustering of the nearest words; morphological semantic analysis; part-of-speech (POS) tagging; and semantic analysis. To support the evaluation, we created three new datasets from scratch for the three downstream tasks, which we conducted with eight machine learning algorithms and two deep learning algorithms. Experimental results show that the diacritized model is better at capturing syntactic and semantic relations and at clustering words of similar categories. Overall, the diacritized model outperforms the non-diacritized model. Two further findings stand out: in the morphological semantic analysis, the advantage of the diacritized model grows as the number of target words increases, and diacritic marks matter more in POS tagging than in the other tasks. |
Keywords | Arabic NLP; Word Embeddings; Diacritization; Morphological Semantics; Semantic Analysis |
Year | 2023 |
Journal | ACM Transactions on Asian and Low-Resource Language Information Processing |
Journal citation | pp. 1-32 |
Publisher | ACM Press |
ISSN | 2375-4699 |
Digital Object Identifier (DOI) | https://doi.org/10.1145/3592603 |
Web address (URL) | https://dl.acm.org/doi/10.1145/3592603 |
Accepted author manuscript | License: CC BY-NC 4.0; File access level: Open |
Output status | Published |
Publication dates | |
Online | 19 Apr 2023 |
Publication process dates | |
Accepted | 30 Mar 2023 |
Deposited | 21 Apr 2023 |
Supplemental file | File access level: Open |
https://repository.derby.ac.uk/item/9y0xw/the-impact-of-arabic-diacritization-on-word-embeddings
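The abstract contrasts diacritized and non-diacritized corpora. As a minimal sketch of what that distinction means at the token level, the Python snippet below strips Arabic diacritic marks from two words and shows that distinct diacritized forms can collapse into one non-diacritized form. This is an illustration only, not the authors' preprocessing: the function names, the two example words, and the choice of the Unicode range U+064B to U+0652 are assumptions made for this example.

```python
import re

# Assumption: the marks U+064B (fathatan) through U+0652 (sukun) are treated
# as the diacritics to remove. Real corpora may contain further marks.
TASHKEEL = re.compile("[\u064B-\u0652]")


def strip_diacritics(text: str) -> str:
    """Return text with the Arabic diacritic marks in TASHKEEL removed."""
    return TASHKEEL.sub("", text)


def vocabulary(tokens):
    """Return the sorted set of distinct tokens (hypothetical helper)."""
    return sorted(set(tokens))


# Two diacritized words that share the same consonant skeleton (kaf, ta, ba).
KATABA = "\u0643\u064E\u062A\u064E\u0628\u064E"  # "he wrote"
KUTUB = "\u0643\u064F\u062A\u064F\u0628"         # "books"

tokens = [KATABA, KUTUB]

# A diacritized vocabulary keeps the two words apart ...
diacritized_vocab = vocabulary(tokens)

# ... while a non-diacritized vocabulary merges them into one entry.
plain_vocab = vocabulary(strip_diacritics(t) for t in tokens)

print(len(diacritized_vocab), len(plain_vocab))
```

An embedding model trained on the stripped text would assign a single vector to both words, whereas one trained on the diacritized text could assign each its own vector.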
Download files
Accepted author manuscript
The Impact of Arabic Diacritization on Word Embeddings_Final.docx
License: CC BY-NC 4.0
File access level: Open