Abstract
Part-of-speech tagging is the process of assigning each word in a given text to its appropriate part of speech in order to reduce the ambiguity that may arise from the contextual usage of words. In this paper, the problem of word sense disambiguation in the Azerbaijani language is addressed by applying part-of-speech tagging to two different data corpora, misspelled (noisy) text and edited (clean) text, using three machine learning algorithms: Hidden Markov Model, Long Short-Term Memory, and Conditional Random Fields. The paper presents a comparative analysis of the outcomes of these algorithms and their accuracy scores. The misspelled dataset for the experiments was provided by Unibank from its chatbot dialogues, while the clean textual data was retrieved from books and newspapers in Azerbaijani. The experiments showed that the Bidirectional LSTM achieves the highest accuracy scores on both the edited (98.2%) and noisy (96.2%) data corpora. The suggested models can be used in applications of algorithms that focus on part-of-speech tags and the syntactic structure of Azerbaijani, an agglutinative language belonging to the Turkic language family, thus enabling the research to be extended to other agglutinative languages with similar grammatical structure.
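As a rough illustration of how one of the three model families named above assigns tags, the sketch below decodes a first-order Hidden Markov Model with the Viterbi algorithm. The tag set, the three-word example sentence, the probability tables, and the `floor` smoothing value are all made-up toy assumptions; they are not the paper's corpora, tag inventory, or trained parameters.

```python
import math


def viterbi(words, tags, start_p, trans_p, emit_p, floor=1e-6):
    """Return the most probable tag sequence for `words` under a first-order HMM.

    Missing probabilities fall back to `floor` (a crude smoothing assumption).
    """
    if not words:
        return []

    def logp(table, key):
        return math.log(table.get(key, floor))

    # best[i][t] = (log-probability of the best path ending in tag t at i, previous tag)
    best = [{t: (logp(start_p, t) + logp(emit_p[t], words[0]), None) for t in tags}]
    for i in range(1, len(words)):
        column = {}
        for t in tags:
            # Pick the predecessor tag that maximises path score plus transition.
            score, prev = max(
                (best[i - 1][p][0] + logp(trans_p[p], t), p) for p in tags
            )
            column[t] = (score + logp(emit_p[t], words[i]), prev)
        best.append(column)

    # Backtrack from the best final tag.
    last = max(tags, key=lambda t: best[-1][t][0])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        last = best[i][last][1]
        path.append(last)
    return path[::-1]


# Toy model: hypothetical values chosen only for illustration.
TAGS = ["PRON", "NOUN", "VERB"]
START_P = {"PRON": 0.6, "NOUN": 0.3, "VERB": 0.1}
TRANS_P = {
    "PRON": {"PRON": 0.05, "NOUN": 0.60, "VERB": 0.35},
    "NOUN": {"PRON": 0.05, "NOUN": 0.25, "VERB": 0.70},
    "VERB": {"PRON": 0.30, "NOUN": 0.50, "VERB": 0.20},
}
EMIT_P = {
    "PRON": {"mən": 0.9},
    "NOUN": {"kitab": 0.8},
    "VERB": {"oxuyuram": 0.7},
}

if __name__ == "__main__":
    sentence = ["mən", "kitab", "oxuyuram"]
    print(list(zip(sentence, viterbi(sentence, TAGS, START_P, TRANS_P, EMIT_P))))
```

In practice the tables would be estimated from a tagged corpus rather than written by hand, and the LSTM and CRF taggers compared in the paper replace these fixed tables with learned scoring functions.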
Original language | English |
---|---|
Title of host publication | Proceedings of the 5th Artificial Intelligence and Cloud Computing Conference, AICCC 2022 |
Publisher | Association for Computing Machinery |
Publication date | 2022 |
Pages | 21-28 |
ISBN (Electronic) | 978-1-4503-9874-9 |
Publication status | Published - 2022 |
Event | 5th Artificial Intelligence and Cloud Computing Conference - Osaka, Japan. Duration: 17 Dec 2022 → 19 Dec 2022 |
Conference
Conference | 5th Artificial Intelligence and Cloud Computing Conference |
---|---|
Country/Territory | Japan |
City | Osaka |
Period | 17/12/2022 → 19/12/2022 |
Keywords
- CRFs
- HMMs
- LSTM
- Part of Speech Tagging
- PoS Tagging for agglutinative languages