|Title||Building an Arabic Word Stemmer for Textual Document Classification|
|Title in Arabic||بناء مجذر للكلمات العربية لتصنيف الملفات النصية|
This thesis proposes a new stemming algorithm that addresses the problems of ambiguity, irregular words, and broken plurals in current stemming algorithms, which fall into two approaches: root stemming and light stemming. The proposed algorithm introduces new pattern rules that make word identification more efficient, which in turn should improve the efficiency and speed of information retrieval systems and search engines. Using these rules, the algorithm can determine whether a sequence of affixes is part of the real word or not, which resolves the ambiguity problem. A new Arabic IR tool with many options has been developed in the Java programming language with JDK 1.6. It allows the user to load any data set, choose among the included stemmers, choose among eight normalization steps, define sets of constants such as prefixes, suffixes, and stopwords, perform text classification, compare stemmers, and extract charts that show these comparisons. The tool was used to test the proposed stemmer. Results obtained on the CNN, BBC, and OSAC corpora show that the proposed stemmer raises text classification accuracy to an average of 91.7%, compared with average accuracies of 90.2% for Light 10 and 89.17% for Khoja.
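The abstract does not list the thesis's pattern rules, so the following Java sketch only illustrates the general idea of light stemming with a guard that decides whether an affix-like sequence is really an affix or part of the word itself. Everything in it is an assumption made for illustration: the class name `LightStemSketch`, the short prefix and suffix lists, and the minimum-stem-length rule are hypothetical and are not the rules of the proposed stemmer, Light 10, or Khoja. Arabic letters are written as Unicode escapes so the source compiles under any file encoding; comments give transliterations.

```java
// Illustrative light-stemming sketch; NOT the thesis's algorithm.
class LightStemSketch {
    // Hypothetical prefix list, longest first: wal-, bal-, kal-, fal-, al-
    static final String[] PREFIXES = {
        "\u0648\u0627\u0644", "\u0628\u0627\u0644", "\u0643\u0627\u0644",
        "\u0641\u0627\u0644", "\u0627\u0644"
    };
    // Hypothetical suffix list: -at, -un, -in, -ha, ta marbuta
    static final String[] SUFFIXES = {
        "\u0627\u062A", "\u0648\u0646", "\u064A\u0646", "\u0647\u0627", "\u0629"
    };
    // Assumed guard: never strip an affix if fewer than 3 letters would remain.
    static final int MIN_STEM_LENGTH = 3;

    // Remove the first matching prefix, but only if the guard allows it.
    static String stripPrefix(String word) {
        for (String p : PREFIXES) {
            if (word.startsWith(p) && word.length() - p.length() >= MIN_STEM_LENGTH) {
                return word.substring(p.length());
            }
        }
        return word;
    }

    // Remove the first matching suffix, but only if the guard allows it.
    static String stripSuffix(String word) {
        for (String s : SUFFIXES) {
            if (word.endsWith(s) && word.length() - s.length() >= MIN_STEM_LENGTH) {
                return word.substring(0, word.length() - s.length());
            }
        }
        return word;
    }

    // Light stem: strip one prefix, then one suffix.
    static String stem(String word) {
        return stripSuffix(stripPrefix(word));
    }

    public static void main(String[] args) {
        // al-mu'allimun ("the teachers"): both affixes are stripped.
        System.out.println(stem("\u0627\u0644\u0645\u0639\u0644\u0645\u0648\u0646"));
        // din ("religion"): the final "-in" belongs to the word and is kept.
        System.out.println(stem("\u062F\u064A\u0646"));
    }
}
```

With these assumed lists, المعلمون is reduced to معلم, while دين is left unchanged because removing the final ين would leave a single letter. A length guard is only the crudest way to make that distinction; the pattern rules described in the abstract are presumably more discriminating.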
|Publisher||Islamic University of Gaza (الجامعة الإسلامية - غزة)|