JIMMA UNIVERSITY
JIMMA INSTITUTE OF TECHNOLOGY
FACULTY OF COMPUTING
Statistical Topic Modeling for Afaan Oromoo News Articles: Using Latent Dirichlet Allocation (LDA) Algorithm
A THESIS SUBMITTED TO THE FACULTY OF COMPUTING IN PARTIAL FULFILLMENT FOR THE DEGREE OF MASTER OF SCIENCE IN
INFORMATION TECHNOLOGY
January 17, 2018
Jimma, Ethiopia
Declaration

The work contained in this thesis has not been previously submitted to meet requirements for an award at this or any other higher education institution. To the best of my knowledge and belief, this thesis is my original work and contains no material previously published or written by another person except where due reference is made.

SIGNATURE DATE


RESEARCH THESIS SUBMITTED BY:
SIGNATURE DATE
ADVISOR: Dr. Million Meshesha_ __ February 21, 2018
SIGNATURE DATE
CO-ADVISOR: Mr. Kibret Zewde (MSc) ________________________
SIGNATURE DATE
Approved by the Faculty of Computing research examination committee members:
__________________________________________________________
SIGNATURE DATE
__________________________________________________________
SIGNATURE DATE
___________________________________________________________
SIGNATURE DATE

List of figures
Figure 1: Methodology of Statistical Topic Modeling
Figure 2: User Interface for Topic Modeling
Figure 3: Research Methodology and Thesis Structure
Figure 4: The diagram of topic modeling
Figure 5: Graphical model of the parameters of a Dirichlet distribution

Contents
Declaration
1. Introduction
2. Motivation
3. Statement of the Problem
4. Objective of the Study
5. Methodology
   5.1. Introduction
   5.2. Study Design
   5.3. Data Source and Methods of Data Collection
   5.4. Implementation Tools
   5.5. Design Procedure
   5.6. Evaluation Procedures
6. Related Works
7. Scope of the Study
8. Significance and Application of Results
9. Thesis Structure
Chapter Two
2. Literature Review
   2.1. Introduction
   2.2. Supervised Learning
   2.3. Topic Modelling
   2.4. The Methods of Topic Modeling
   2.5. Afaan Oromoo
Chapter Three
3. The Proposed Solution
   3.1. Introduction
   3.2. The Proposed Framework
   3.3. Design
      3.3.1. Data Collection
      3.3.2. Text Pre-Processing
         3.3.2.1. Tokenization
         3.3.2.2. Finding Meaningful Words
         3.3.2.3. Parameter Selection for Three Models
      3.3.3. The LDA Algorithm
      3.3.4. Topic Model Labelling
References
Introduction

We live in a world where large amounts of data are collected every day. As more unstructured information becomes available, obtaining the relevant and desired information becomes difficult, tedious and time consuming. These large datasets of digitized content offer significant opportunities for researchers, but they also require an understanding of the methods for analyzing digital data.
This problem raises the need for automated content analysis of text, which draws on techniques from natural language processing, machine learning, data mining and information retrieval to analyze text data at large scale [1].

So we need tools and techniques that automatically organize, search and understand vast quantities of textual information.
The most significant area for analyzing large datasets of digitized content is the field of data mining. Data mining is the collective term for exploring large datasets using various techniques to find patterns in the data. It incorporates many fields, including machine learning, information retrieval and database systems. The aim of data mining is to analyze large datasets consisting of thousands to millions of attributes and data points [2].

Text mining, or text analysis, is one specific area of data mining. Any type of text file can be used in text mining [3]. There are several techniques within the area of text mining for analyzing text. One of the more recent developments in this area is topic modelling, a new area of research specifically designed for the analysis of large datasets of digitized content.

The main idea behind statistical topic models is the assumption that documents are mixtures of topics, where a topic is a probability distribution over words. The discovery of topics is driven by the word co-occurrence patterns in a text collection. The majority of topic models are statistical generative models in which documents arise from a generative process. A primary goal of topic models is to invert the generative process through various standard statistical techniques and to infer the latent topics from which a collection of text documents was generated. Once the latent topics are discovered, it becomes much easier to understand these massive text collections, and they can be used as a concise representation of documents for various tasks.
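The generative view described above can be illustrated with a short Python sketch. The two topics, their word distributions and the document's topic mixture below are purely hypothetical toy values, not taken from any corpus; the sketch only shows the mechanics of sampling a topic per word and then a word from that topic.

```python
import random

# Hypothetical toy topics: each topic is a probability distribution over words.
# The Afaan Oromoo words and the probabilities are illustrative only.
topics = {
    "sport":  {"taphaa": 0.5, "kubbaa": 0.3, "injifannoo": 0.2},
    "health": {"fayyaa": 0.6, "dhukkuba": 0.25, "yaala": 0.15},
}

def generate_document(topic_mixture, n_words, rng):
    """Generate a document: for each word position, first draw a topic from
    the document's topic mixture, then draw a word from that topic."""
    words = []
    names = list(topic_mixture)
    probs = [topic_mixture[t] for t in names]
    for _ in range(n_words):
        t = rng.choices(names, weights=probs, k=1)[0]
        vocab = list(topics[t])
        wp = [topics[t][w] for w in vocab]
        words.append(rng.choices(vocab, weights=wp, k=1)[0])
    return words

rng = random.Random(42)
# A document that is 70% about sport and 30% about health.
doc = generate_document({"sport": 0.7, "health": 0.3}, 10, rng)
```

Topic model inference runs this process in reverse: given only the generated words, it recovers the topic distributions and each document's mixture.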

A topic model is a kind of probabilistic generative model that has been widely used in the fields of text mining, machine learning, NLP and information retrieval in recent years [4].
Topic models allow for discovering topics in a corpus, providing an abstract view of a set of subjects composed of similar words [5, 6]. They help in discovering hidden topical patterns present across the collection, annotating documents according to these topics, and using these annotations to organize, search and summarize texts.

Topic modeling is a form of text mining that uses unsupervised statistical machine learning techniques to identify patterns in a corpus [5, 6]. It takes a huge collection of documents and groups or clusters words across the corpus into 'topics' by a process of similarity. In topic modeling, a topic is described as "a repeated pattern of co-occurring words that best represents the information in the collection" [7].

One way to understand how topic modeling works is to imagine working through an article with a set of highlighters. Suppose we are reading a newspaper with a set of colored highlighters in hand. As we read through an article, we use a different color to highlight the key words of each theme as we come across them. When we are done, we copy out the words grouped by the color we assigned them. Each such list of words is a topic, and each color represents a different topic. This is the notion of topic modeling. Now suppose we want to learn something about a document collection that is too big to read: browsing the whole content takes far too much time for a human. Why not throw all these documents at the computer and see what interesting patterns (topics) it finds? Automated topic models facilitate understanding, organizing and summarizing huge text datasets. As noted in [5, 6], they help in:
Discovering hidden topical patterns that are present across the collection
Annotating documents according to these topics
Using these annotations to organize, search and summarize texts
A topic is defined as a collection of words that frequently appear together in the same context [5, 8, 7]. Words can have different meanings depending on the context.

Most documents are about more than one subject, but the majority of natural language processing algorithms and information retrieval techniques implicitly assume that every document has just one topic [6, 9, 4]. Various methods are used to find topics in a collection of documents. Topic modeling algorithms are mainly used to develop models for searching, browsing and summarizing large corpora of text [6].
Generative models such as the Hidden Markov Model (HMM), Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA), as well as traditional clustering techniques such as K-Means, are used to extract topics from large corpora. In this research, LDA is used to extract latent topics for several reasons, including accuracy, scalability and comprehensibility [10, 11, 12, 13].
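To make LDA inference concrete, the following is a minimal collapsed Gibbs sampler for LDA in pure Python. It is a toy sketch for illustration only, not the implementation this work will use; the tiny corpus and the hyperparameter values are arbitrary assumptions.

```python
import random
from collections import defaultdict

def lda_gibbs(docs, n_topics, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Toy collapsed Gibbs sampler for LDA over tokenized documents.
    Returns per-document topic counts and per-topic word counts."""
    rng = random.Random(seed)
    V = len({w for d in docs for w in d})           # vocabulary size
    # z[d][i]: topic currently assigned to word i of document d
    z = [[rng.randrange(n_topics) for _ in d] for d in docs]
    ndk = [[0] * n_topics for _ in docs]            # document-topic counts
    nkw = [defaultdict(int) for _ in range(n_topics)]  # topic-word counts
    nk = [0] * n_topics                             # words per topic
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]                         # remove current assignment
                ndk[d][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
                # full conditional: p(z = t | everything else)
                weights = [(ndk[d][t] + alpha) * (nkw[t][w] + beta)
                           / (nk[t] + V * beta) for t in range(n_topics)]
                k = rng.choices(range(n_topics), weights=weights, k=1)[0]
                z[d][i] = k                         # record new assignment
                ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    return ndk, nkw

# Tiny illustrative corpus (sport-like vs. health-like tokens).
docs = [["kubbaa", "taphaa", "kubbaa"], ["fayyaa", "dhukkuba", "fayyaa"],
        ["kubbaa", "taphaa"], ["fayyaa", "yaala"]]
ndk, nkw = lda_gibbs(docs, n_topics=2)
```

After sampling, each document's row of `ndk` approximates its topic mixture, and each `nkw[t]` gives the words that characterize topic `t`.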

Ralf Krestel et al. [14] developed an approach for recommending tags using Latent Dirichlet Allocation (LDA) to improve searching. Resources annotated by many users, and thus equipped with a fairly stable and complete tag set, are used to produce latent topics onto which new resources with only a few tags are mapped. Based on this, other tags belonging to a topic can be recommended for the new resource.

The authors' evaluation showed that the approach achieves better precision and recall than the use of association rules, and also recommends more specific tags. Daniel Ramage et al. [8] developed a method based on Labeled LDA for multi-labeled corpora. Labeled LDA improves traditional LDA with visualization of a corpus of tagged webpages. The model improves upon LDA for labeled corpora by incorporating user supervision in the form of a one-to-one mapping between topics and labels. The topics found by the unsupervised variant were matched to the Labeled LDA topic with the highest cosine similarity; in total, 20 topics were learned. They concluded that Labeled LDA outperforms SVM when extracting tag-specific documents. Girish Maskeri et al. [15] proposed an approach based on LDA for topic extraction from source code.

Manning et al. [16] proposed partially labeled topic models (PLDA) for interpretable text mining. These models use the unsupervised learning machinery of topic models to discover the hidden topics within each label, as well as unlabeled, corpus-wide latent topics. The many tags present in the dataset were used to quantitatively establish the new models' higher correlation with human understanding scores over several strong baselines. PLDA outputs a rich topic space with identified latent dimensions. Comparing PLDA, LDA and L-LDA on the same dataset, they observed consistently high performance.

D. Ramage and E. Rosen [17, 18] proposed a semi-supervised hierarchical topic model (SSHLDA) which aims to explore new topics automatically. SSHLDA is a probabilistic graphical model that describes a process for generating a hierarchical labeled document collection. They compared hierarchical latent Dirichlet allocation (HLDA) with hierarchical labeled latent Dirichlet allocation (HLLDA), and proved that HLDA and HLLDA are special cases of SSHLDA.
S. Vijaya [19] extracted topics from news articles using LDA and its variants LLDA and PLDA. As far as topic modeling is concerned, LDA is intended to predict the topic of an article, whereas LLDA is intended to determine the labels of a given topic. PLDA is used to find hidden subtopics across the collection of all categories. LDA and LLDA help to find the topics and labels of the documents in a large collection and can generate keywords for the document collection; PLDA enables extracting keywords from a particular domain of subtopics.
The authors evaluated the approach and achieved better precision and recall compared to the other work in their discussion. However, they did not include topic network analysis for more accuracy. So in this work we will try to include the relationships between topics, combining interrelated topics semantically, which should improve the efficiency and effectiveness of the work.

From the above literature survey it is understood that various LDA techniques have been applied to find the topics in a dataset.

The work proposed herein concerns how topic modeling is used in text analysis to find significant topics for Afaan Oromoo news articles, collected from various categories such as Sport, Health and Technology, using a Latent Dirichlet Allocation (LDA) based model. LDA is a statistical approach (Blei et al., 2003, 2010, 2012) effective for building topic models for huge document collections such as news articles and research paper abstracts. Given sufficiently long texts, these approaches are capable of identifying significant topics based on the co-occurrence relationships among words.
News articles are a stream of stories that tell us what happened in the past; they are contemporary witnesses for a society that wants to study the past, and they can tell us a lot. Which news outlets cover climate change most? What were the topics of discussion surrounding the Ethiopian election of 2012?
But the number of digitized news articles produced has become overwhelming: impossibly large for manual human processing.

Motivation

This work is about techniques for improving access to information and improving the underlying language technology, using topic modeling to discover the themes of documents: an abstract view of the set of topics addressed in each document. With such themes, documents can be classified, arranged and searched according to their subjects. Topic modeling has been an area of interest for researchers from the fields of text mining, natural language processing, machine learning and information retrieval.
Most of the suggested topic models have been designed for languages such as English. For the Afaan Oromoo language, topic modeling has not been attempted so far. Although some text summarization systems have been designed for Afaan Oromoo, shortening a text to the major points of the document, the technique of topic modeling is quite different from this.

Finding the desired information in a particular document can be difficult when the content must be grasped quickly. Instead of reading an entire article or document to find out whether it is related to the topic of interest, all documents should be clustered under their corresponding theme. Topic models help tremendously to organize, summarize and search large text collections. One of the central steps in analyzing the content of any text is to uncover what topics are talked about.

Challenges discussed by scholars in topic modeling for various languages

There are challenges and considerations in topic modeling that need further improvement in future work. In this section, we discuss important issues and identify those that have not been sufficiently solved. These are the gaps in the reviewed work that prove to be directions for future work. Accordingly, several challenges have been stated by scholars [20]. One of the major challenges related to this work is the visualization of news topics and user interfaces. Topic models provide a new exploratory structure for big data: topics are displayed as their most frequent words. By contrast, topics can be assigned labels to make the results easier for a user to understand. Therefore, how to display a topic with a specific meaning is a key task of this work. Overall, a visualization of the discovered topics in a user interface that users can easily understand is essential for topic modeling.
Statement of the Problem

In recent years, a huge number of text articles are generated every day from different blogs, media outlets and other sources. This leads to major tasks in natural language processing (NLP): effectively managing, searching and categorizing articles depending on their themes or subjects. Typically, these text mining tasks include text clustering, document similarity and text categorization. More comprehensively, we have to find ways to extract the theme of a text; in text analytics, this is known as "topic modeling". As mentioned above, topic models have emerged as an effective method for discovering useful structure in collections. In this study, we find that topic models act as more than a classification or clustering approach: they extract hidden "topics" that reflect the underlying meaning of news articles more comprehensively.
The amount of information available in digital form keeps doubling, leading to information overload in almost all languages. We can find a lot of documents related to a subject, but in a limited span of time the human mind is not able to search all of them for the required information. Texts in any domain are written in detail, and readers must read the whole content to understand what it talks about. If readers do not understand what the content is about, they need to read it again and again to extract the theme (topic) of the text. This is time consuming and very tedious. So, a tool is required that can process and analyze the whole content in a short period and extract as much information as possible from it, rather than relying on keyword search alone.

Keyword-based search is a popular information retrieval scheme for discovering relevant documents in a collection, but it loses semantics: for example, some relevant documents may not contain the exact keywords specified by the user. So it is better to find the theme of each document and then search only over documents of the related theme, rather than finding documents through keyword search alone. This reduces the number of documents, because they are filtered on the basis of topic, and thus search will be more effective. Concept search is an alternative to the keyword-based search scheme that can address this problem through topic modeling. One way to perform concept search is to apply topic models, which make inferences about the underlying thematic or topic structure of a document collection and categorize documents based on their underlying topics. Topic models can be adapted to many kinds of data, such as collections of text documents (news articles, biomedical texts, etc.), images and social networks [5, 21].

Topic modeling provides a way to group the vocabulary of a corpus into latent 'topics', with a collection of topics for each document. Learning a meaningful topic model from massive document collections containing millions of documents and billions of tokens is challenging, because it must deal with a large number of topics and needs a scalable, efficient way of distributing the computation. The problem of automatic topic extraction and theme-based document categorization is a big issue, and much research has been done for languages such as English, Portuguese and French.

A popular approach to this question is text categorization, which classifies the content of a text into one or more separate topical categories. Numerous methods for text categorization have been introduced in the literature, each with its own costs and benefits.

Afaan Oromoo text readers are no exception in suffering from this problem. Many domains produce large volumes of textual content, such as legal bodies, news media agencies and government offices, and need topic modeling to save readers' time. Afaan Oromoo textual information in digital form is increasing rapidly over time, since it is an official language of the Oromia regional state. Almost all media agencies, such as Oromia Broadcasting Network (OBN) and Fana Broadcasting Corporate (FBC), publish their news items in digital form. These news items should be clustered under their themes in order to make searching and managing the news easy. Extracting the theme of a news item from its text is useful for categorizing documents under their corresponding topics.

Automatic topic modeling services are therefore required, so that users can get the theme of a text and browse its content with less reading time.

Statistical Afaan Oromoo topic modeling for news released online by news agencies is desirable, and it requires a powerful computational approach. To the best of my knowledge, there has been no attempt at topic modeling for Afaan Oromoo.

Therefore, the purpose of this study will be to explore appropriate statistical approaches for developing and implementing a statistical topic model for Afaan Oromoo news articles that extracts topics from their content.

To this end, the following are the research problems that need to be explored and answered in this work.

How can the challenges of searching a news article corpus be effectively addressed, and how can topics be detected by applying semantically enhanced topic representations in topic models?

How can a new semantic multi-topic model be successfully implemented in text mining applications? If we use different aggregation strategies to train topic models, do we obtain similar topics, or are the learned topics substantially different?
How efficient is topic modeling in organizing news item collections, and how much does it reduce searching time?
Objective of the Study

The general objective of the study is to construct a probabilistic topic model automatically for collections of Afaan Oromoo news texts.

The following specific objectives are formulated to achieve the general objective of the study.

To review related research works in the area of topic modeling
To review algorithms and techniques that have been used in topic modeling
To investigate the structure of the Afaan Oromoo language in order to apply topic modeling
To develop a framework for topic modeling that will serve as a model for Afaan Oromoo news article topic modeling, representing appropriate topics in a corpus

To test and evaluate the model
To draw conclusions based on experimental result and recommend further research works
In order to achieve these goals, surveys on topic modelling and text mining will be investigated, and the problems will then be solved step by step. The objective of the research in this thesis is to develop better topic models to understand the structure and content of document collections.

Methodology

Introduction

This section describes the actions that will be taken to investigate the research problem, and the justification for the procedures and techniques used to identify, select, process and analyze the information applied to answer the research questions and to evaluate the study's results and validity. It also describes how the data will be collected or generated and how it will be analyzed.

Study Design

Analysis of data using standardized statistical methods will be used to determine the validity of the work and to form valid, logical conclusions [4]. In this work, we will conduct extensive qualitative and quantitative experiments on the proposed schemes based on standard LDA. The work thus follows the empirical approach of research science: the statement of the problem and the research questions are clearly defined and measurable, the process used to achieve the objective will be clearly stated, data will be presented regarding the findings, and tests will be conducted for analysis and discussion of the results. We will compare a number of aspects of these schemes and models, including how the topics learned by the models differ from each other and their quality. In addition, we will show how topic models can help in applications such as classification problems.
Data Source and Methods of Data Collection

The data (corpus) will be prepared as a sample, covering several categories, to evaluate the proposed model. In this phase, the corpus will be organized by considering the structure of the language, such as its stop words. To accomplish the proposed model, the main requirement will be to construct a topic list containing the various topics that occur frequently in news articles and the words corresponding to those topics. These topic lists will be prepared manually, or a learning algorithm can be employed to generate such a collection for the required model.

Implementation Tools

The software package that will be used to train the LDA topic model is Gensim. This package was chosen because it is very powerful and generates excellent output. It provides the ability to calculate the perplexity of the model with respect to the model parameters α, β and K, all of which will be discussed in detail. The topic list corpus used for the implementation will be a collection of limited words only; for each topic list, the corresponding lists of Afaan Oromoo words in each topic will be identified.

Design Procedure

The process of the proposed topic model will be performed in four major stages, shown in Figure 1 below.

Figure 1: Methodology of Statistical Topic Modeling

The first stage will be news (data) pre-processing. This includes:
Tokenization: this will be done with regular expressions in Python
Stop word removal: the NLTK package will be used
Stemming: an Afaan Oromoo stemmer will be used
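The first two pre-processing steps can be sketched in a few lines of Python. The stop words listed below are illustrative examples only (NLTK does not ship an Afaan Oromoo stop-word list, so the actual list would be supplied by this work), and stemming is left as a placeholder for the Afaan Oromoo stemmer.

```python
import re

# Illustrative Afaan Oromoo stop words; the real list would be much longer
# and compiled as part of this work.
STOP_WORDS = {"fi", "kan", "akka", "keessa", "irra", "waan"}

def preprocess(text):
    """Tokenize with a regular expression, lowercase, and drop stop words.
    Stemming (the third pre-processing step) would follow this."""
    tokens = re.findall(r"[a-zA-Z']+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

tokens = preprocess("Taphni kubbaa miilaa fi atileetiksii keessa jira")
```

The apostrophe is kept in the token pattern because the hudhaa (') appears inside Afaan Oromoo words.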

The second stage is clustering. We will cluster the news by category using clustering tools. In this stage, a text clustering technique is applied to the news collection to create text document clusters. The purpose of this stage is to group similar text documents, making them ready for topic modeling and ensuring that each set of similar news items participates as a group in the modeling process.

In the third stage, the Latent Dirichlet Allocation (LDA) topic modeling technique is applied to each individual news cluster to generate the cluster topics and the terms belonging to each cluster topic, by considering globally frequent terms with a semantic approach.

There are many approaches for extracting topics from a text, such as term frequency-inverse document frequency (TF-IDF), Non-negative Matrix Factorization (NMF) and Latent Dirichlet Allocation (LDA). Of these, LDA is the most popular topic modeling technique, and it is the method that will be used in this work. Generally, to achieve the goal, the dataset and software that will be used must be clearly described.
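For comparison with LDA, the TF-IDF baseline mentioned above can be computed with the standard library alone. This is a minimal sketch using raw term frequency and log inverse document frequency; the tiny two-document corpus is purely illustrative.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute TF-IDF weights for a list of tokenized documents.
    TF is the raw term count; IDF is log(N / document frequency)."""
    n = len(docs)
    df = Counter()                      # in how many documents each word occurs
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({w: tf[w] * math.log(n / df[w]) for w in tf})
    return weights

# Illustrative tokens: "oduu" (news) occurs in both documents, so its
# IDF, and hence its TF-IDF weight, is zero.
docs = [["oduu", "ispoortii", "ispoortii"], ["oduu", "fayyaa"]]
w = tf_idf(docs)
```

Note how TF-IDF only scores individual words per document, whereas LDA additionally groups words into shared latent topics.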

The fourth stage is the prediction of grouped topics with topic labels that indicate the exact category of the news. For visualization of the topics extracted from given documents, the following interface will be used. We set the number of topics to be generated and the directory to which all output files are written. Clicking the Generate Topics button runs the topic modeling algorithm. Once topic extraction completes, all generated topic lists are kept in a single folder of the output directory and the set of discovered topics is shown. If necessary, output results are also available as CSV files under the output_csv folder of the output directory.

Figure 2: User Interface for Topic Modeling

Evaluation Procedures

The output of the proposed model will be tested on news texts, and the evaluation will be done by comparing the theme obtained from the topic model with the corresponding news topics as well as with the content of the news text. Therefore, the performance measures used in the study will be precision and recall, and based on these metrics we will report the overall performance of the topic model using the F-measure.
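The evaluation metrics above reduce to simple set arithmetic over the topics the model predicts versus those judged relevant (for instance, the published news category labels). A minimal sketch, with illustrative topic names:

```python
def precision_recall_f1(predicted, relevant):
    """Precision, recall and F-measure for a set of predicted topics
    against the set of relevant (gold) topics."""
    predicted, relevant = set(predicted), set(relevant)
    tp = len(predicted & relevant)      # true positives
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# One topic correct out of two predicted, one of two relevant topics found.
p, r, f = precision_recall_f1({"ispoortii", "fayyaa"}, {"ispoortii", "barnoota"})
```

Here precision and recall are both 0.5, so the F-measure is also 0.5.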
Related Works

The notion of statistical topic modeling is not new, and it is gaining increasing consideration in different text mining communities. Many schemes have been applied to extract hidden topics from text. Latent Dirichlet Allocation (LDA) [5, 22, 23] is becoming a standard tool in topic modeling; it assumes that both the word-topic and topic-document distributions have a Dirichlet prior. LDA is a generative probabilistic model for collections of discrete data such as text corpora. It is a three-level hierarchical Bayesian model in which each item of a collection is modeled as a finite mixture over an underlying set of topics. Each topic is, in turn, modeled as an infinite mixture over an underlying set of topic probabilities [5].

There are many types of LDA-based topic models such as:
• Document-Topic Model.
• Author-Topic Model (ATM).
• Relational-Topic Models (RTM).
• Labeled LDA (LaLDA).
As a result, LDA has been extended in a variety of ways; in particular, for social networks and social media, a number of extensions to LDA have been proposed and applied to topic identification [24]. According to most researchers, a topic is a cluster of words closely related to each other. Clusters depend on the stemming process, which specifies the type of words used (root, stem, etc.).

For the Afaan Oromoo language, there is a notable lack of research in the field of topic modeling. In fact, the few related works on topic identification deal with languages other than Afaan Oromoo, such as the works of Abbas et al. [25] and Zrigui et al. [25] for the Arabic language. Abbas et al. [25, 26] proved that SVM outperforms TF-IDF by having the best values of precision and F-measure.

Later, Abbas et al. [27] proposed their own technique for topic identification, named TR-Classifier. It is based on triggers, which are identified using average mutual information. In this approach, topics and documents are represented by triggers: sets of words that have the highest degree of correlation. Then, based on the TR-distance, the similarity between triggers is calculated to identify a document's topic.

Zrigui et al. [28] proposed a new hybrid algorithm for Arabic topic identification named LDA-SVM, based on the combination of LDA and SVM. The LDA method is used to classify documents; the SVM method is then employed to attach class labels. The idea of this combination is to reduce the feature dimension with LDA before applying SVM.

Kelaiaia and Merouani [12] proposed another way of using LDA in topic identification: they employed topic models more directly, using the documents' distributions over topics for Arabic topic identification.

Alsaad and Abbod [27] also used TF-IDF for Arabic topic identification. In this work, more attention was given to the pre-processing step. They proposed their own root-based stemmer, the Alsaad Stemmer, and compared it to the Light Stemmer, a lemmatization algorithm. As a result, they proved that the Alsaad Stemmer outperforms the Light Stemmer: it leads to a smaller index size and, more importantly, to better topic representations, by avoiding repetition of similar words or words that share the same root.

In addition, in [7] another work attempted to extract topics from the entire content of news items.

The major limitation of these methods is that a training step is necessary to identify the topics and to construct a vocabulary for each topic. We therefore select the unsupervised LDA method, which needs no training step because topics are identified during the process of topic extraction.
We situate the new Document-Topic Model and LaLDA models within the general domain of topic modeling. As these models are evaluated using a document labeling task, in which document labels are generated from the topic labels, we also review the literature on automatic topic labeling.

Scope of the study
To find out how the subject of a given text is identified, some form of manual or automatic content analysis is used [5] [6]. This is done in one of two ways. One way is simply to read the contents and produce interpretations based on understanding, by defining a set of themes or codes. The other is to use supervised machine learning and keywords to determine the theme of the entire document. Approaches based on human understanding of text struggle to produce results, especially for large corpora. Keyword-based and supervised approaches are limited in their scope by the necessity of knowing a priori what is worth looking for in texts. Unsupervised methods are therefore preferable for inductive analysis [25] [27]. One of the simplest inductive approaches to discovering topics in a corpus is word frequency and co-occurrence analysis. For large corpora such as news items, topic detection is required, and topic modeling is one of the best methods for topic detection in text analysis. To develop a topic model, a source (topic lists) containing words occurring in a language is needed, from which the theme of a document can be generated. The scope of the Afaan Oromoo language is vast, as it is rich in words, and Afaan Oromoo news covers a wide range of the vocabulary occurring in the language. The proposed topic model will therefore illustrate our approach by addressing the task of topic modeling in news item analysis to extract the themes of digital news texts. The first step in topic modeling is data collection: the data will be collected manually from different online news websites, with a limited number of news articles related to education, sports, health, and weather.
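As a concrete illustration of the word frequency and co-occurrence analysis mentioned above, the following minimal sketch counts word frequencies and document-level co-occurrences over a toy corpus; the Afaan Oromoo-like strings are invented placeholders, not data from this study.

```python
from collections import Counter
from itertools import combinations

# Toy corpus: three short "documents" with invented, illustrative word strings.
docs = [
    "barnoonni ijoollee biyyaaf murteessaa dha",
    "tapha kubbaa miilaa garee biyyaa",
    "barnoonni fi taphni ijoollee fayyadu",
]

# Word frequency across the whole corpus.
tokens = [doc.split() for doc in docs]
freq = Counter(w for doc in tokens for w in doc)

# Document-level co-occurrence: how many documents contain both words.
cooc = Counter()
for doc in tokens:
    for a, b in combinations(sorted(set(doc)), 2):
        cooc[(a, b)] += 1

print(freq.most_common(3))
print(cooc[("barnoonni", "ijoollee")])
```

Words that frequently co-occur across documents (here, "barnoonni" and "ijoollee" appear together in two documents) are the raw signal that topic models later organize into probability distributions.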

The study utilizes LDA with data represented as a bag-of-words model. The model assumes each document is a mixture of topics, and topics are represented by words that often occur together in documents [5] [20]. For this study we will select a limited number of topics based on interpretability and analytic convenience. Words with similar meaning that fall into the same category are placed in the same list, so that each word from the input text can be assigned a category to which it belongs (per-word topic assignment), leaving out helping verbs and the like. This is called the generative process, as it describes how the input text would have been generated [5] [6] [25]. After this process, the proportion of each topic's involvement in the input text is computed; this second process is termed the statistical inference process [5] [7] [28]. After the data have been collected, pre-processing follows: tokenization, identifying the terms that represent each document, and parameter selection. The absence of a well-organized corpus for the Afaan Oromoo language is a significant limitation, and the corpus prepared for this study will be small for evaluating the work.
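The pre-processing steps named above (tokenization and filtering out non-content words) can be sketched as follows; the stop word list here is a hypothetical stand-in, not the curated Afaan Oromoo list used in this study.

```python
import re

# Hypothetical stop word list for illustration only; a real system
# would use a curated Afaan Oromoo stop word list.
STOP_WORDS = {"fi", "kan", "akka", "irra"}

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def preprocess(text):
    """Tokenize, then drop stop words and single-character tokens."""
    return [t for t in tokenize(text) if t not in STOP_WORDS and len(t) > 1]

print(preprocess("Barnoonni fi fayyaan kan uummataa ti"))
```

The output of this stage, a list of content-bearing tokens per document, is what the bag-of-words construction in the next chapter consumes.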

Significance and Application of Results
A topic modeler for Afaan Oromoo texts will be an input to the development of language resources and has significance for initiating further research in the areas of document similarity and recommendation systems for the Afaan Oromoo language. Moreover, it can help initiate topic modeling for other local languages. This research will make contributions to the domains of topic modelling, information filtering and information retrieval. Specifically, it will propose a novel approach that incorporates data mining and topic modelling techniques for generating more accurate topic models. In order to interpret topic representations semantically and individually, we will develop a new approach that uses patterns (i.e., itemsets) to represent topics, instead of the individual words used in traditional topic models, by integrating data mining techniques with statistical topic modelling to generate itemset topic models for representing news collections.

Thesis Structure
This thesis is organized into six chapters, which follow the structure in Figure 3:

Figure 3 Research Methodology and Thesis Structure

Chapter 2 presents the knowledge necessary to address the problems defined; the literature review covers useful techniques in the area of topic modelling. In Chapter 3, our proposed model to address the problems and limitations mentioned in Chapters 1 and 2 is discussed. In Chapter 4, experiments with scientific datasets are conducted to verify the effectiveness of the proposed model in discovering topics and expressing topics with meaningful patterns.

Chapter 5 summarizes the key findings and highlights the significant contributions of this thesis. Chapter 6 points out limitations and consequently highlights the need for further research in the future.
Chapter Two
Literature Review
Introduction
This section covers a critical review of the evolution of topic models and of the methods that have been used for topic extraction, to address the research gap introduced in Chapter 1. Many scholars have recently proposed new topic models that learn topics from unstructured text along many dimensions to optimize search. The main goal of this work is to provide an overview of the methods of topic modeling based on LDA, so this review presents and analyses current scholarly articles related to topic modeling based on LDA. In addition, applications of topic modeling in various sciences and challenges in topic modeling are analyzed. Here we review various methods for text categorization, each with its own costs and benefits. Following [29], we discuss the three main classes of text categorization methods (human coding, supervised learning, and topic modeling), differentiated by three types of costs in three stages of the analysis.

Pre-analysis cost is the cost incurred before the actual categorization happens, where conceptualization and operationalization are dealt with. Methods with high pre-analysis cost are those that require humans with fundamental domain knowledge to prepare the data before text categorization can happen.

Analysis cost is the cost incurred while the text of interest is being categorized. Methods with high analysis cost are those that require humans to spend more time per text unit to categorize.

Post-analysis cost is the cost incurred after the categorization, where the results are assessed and interpreted. Methods with high post-analysis cost are those that require humans to spend more time analyzing the results; results that are incoherent and uninterpretable also increase this cost.

When content analysis concerns what topics are talked about, identifying and categorizing them at an optimal analysis cost is a much more challenging task.

Human Coding
Human coding is a standard manual content analysis method applied to the problem of identifying issues or topics in text. It usually consists of (1) defining a coding system by domain experts, (2) training human coders, and (3) coding the documents of interest manually. In practice, it involves an iterative process in which coding issues are conceptualized and repeatedly refined through several studies until a final coding system is achieved. Since the codebook needs to be defined and human coders need to be trained before the actual coding can happen, the costs in both the pre-analysis and analysis phases of this approach are high.

Supervised Learning
With the increasing availability of big data, the cost of manually coding documents has become unrealistic. Many recent efforts focus on automated content analysis approaches to reduce the analysis cost. One way to automate this type of problem is to use supervised learning techniques from machine learning, data mining and related fields. For example, [29] [30] describe the automated classification system used in the Congressional Bills Project, in which Support Vector Machines (SVMs) and other machine learning techniques are used to classify legislative text into one of the 226 subtopics of the Policy Agendas Topics codebook. A similar approach has also been used to classify German online political news [31] [32].

Supervised learning techniques still require labeled data for training. Thus, similar to human coding, the supervised learning method still has a high pre-analysis cost. However, after the training phase, the learned classifier can be used to label new data automatically, which reduces the analysis cost significantly. A promising approach to reducing the pre-analysis cost of labeling training data is active learning, which requires humans to label only a subset of the data while still achieving comparable classification accuracy [30] [29]. To alleviate this type of limitation, the unsupervised learning approach is an option in many fields; one unsupervised learning technique is topic modeling.

Topic Modelling
Topic models are generative models which can learn, in an unsupervised way, the underlying mixture of (unobserved) topics that compose a text document, or any kind of observation that can be represented as a bag of words. They have gained exponential popularity for analyzing news text in particular, and for discovering the thematic structure of large-scale text corpora in general [6]. In machine learning and natural language processing, topic models are generative models that provide a probabilistic framework [33].
As introduced by [5], latent Dirichlet allocation (LDA), the original unsupervised topic model, extends previous latent variable models including LSA [34], LSI [35], and PLSA [36].

Recent work following this approach includes applying topic models to examine news articles from 1997 to 2004 [37]. LDA and other unsupervised extensions require no training data and thus have a low pre-analysis cost. However, each topic is just a multinomial distribution over a fixed vocabulary, often represented by a list of the words with the highest probabilities.

To better understand how to use a topic model on a given corpus, we first describe the basic ideas behind topic modeling by illustrating its key steps, including the bag of words (BoW), model training, and model output. We assume that there are N documents, V words, and K topics in a corpus, and then discuss each component of this process in detail.

The BoW
In natural language processing, a document is frequently represented by a BoW, which is in fact a word-document matrix. As shown in Table 1, there are four words (weather, science, school, and league) and four documents (d1–d4) in this corpus. The value w_ij in the matrix represents the frequency of word i in document j; for example, w_3,1 = 1 means that the frequency of the word "school" in document d1 is 1. The number of words in a corpus is fixed, and the collection of these words constitutes a vocabulary represented by the BoW. A BoW is thus a simplified representation of a corpus, and after its construction it serves as the input of the next step in topic modeling. Suppose there are N documents and V words in a corpus; then the BoW of this corpus is an N × V matrix.

Moreover, the documents in a corpus are treated as independent: there is no relation among them. The exchangeability of words and documents can be called the basic assumption of a topic model, and this assumption holds in LDA.

Table 1 An example of a BoW

Terms     D1  D2  D3  D4
Weather    2   0   3   0
Science    0   5   0   0
School     1   2   0   0
League     0   1   4   7
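The construction of the N × V matrix in Table 1 can be sketched directly; the toy documents below are written so that their counts reproduce the table.

```python
from collections import Counter

# Four toy documents whose word counts mirror Table 1 (d1-d4).
docs = [
    "weather weather school",
    "science science science science science school school league",
    "weather weather weather league league league league",
    "league league league league league league league",
]

# Fixed vocabulary, sorted for a stable column order.
vocab = sorted({w for d in docs for w in d.split()})

# BoW matrix: bow[i][j] = frequency of vocab[j] in document i.
bow = [[Counter(d.split())[w] for w in vocab] for d in docs]

for row in bow:
    print(row)
```

With `vocab = ['league', 'school', 'science', 'weather']`, row 1 is `[0, 1, 0, 2]`, matching column D1 of the table (school 1, weather 2).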
Figure 4 The diagram of topic modeling

Model training
In a BoW, the dimensionality of the word space may be enormous, and the BoW reflects only the words of the original texts. In contrast, what people most want to know about a document is its themes rather than its words. The aim of topic modeling is to discover the themes that run through a corpus by analyzing the words of the original texts. We call these themes "topics." The classic topic models are unsupervised algorithms (they require no prior annotation or labeling of the documents), and the topics are discovered during model training.

Organizing today's explosion of digital document archives requires new techniques for automatically organizing, searching, indexing, and browsing large collections [20].

The central purpose of topic modeling is to discover patterns of word use and to connect documents that share similar patterns; documents are mixtures of topics, where a topic is a probability distribution over words. In other words, a topic model is a generative model for documents: it creates a new document by choosing a distribution over topics, then, for each word in that document, choosing a topic at random according to that distribution and drawing a word from that topic [38].

In this thesis, following the automated content analysis approach, we introduce novel topic models, which are guided by additional information associated with the text and designed to discover and analyze contents of Afaan Oromoo text at lower cost.
The Methods of Topic Modeling
The study of topic modelling started from the need to compress large data into more useful and manageable knowledge. There are a variety of methods for topic modelling, using different sampling algorithms for word selection and topic creation. Latent semantic analysis (LSA) [34] was the first method used in topic modeling; it uses a singular value decomposition of the matrix of a collection, forming a reduced linear subspace that captures the most significant features of the collection. This method is the most basic: it looks at the frequency of words within a document and creates topics based on the frequencies of words occurring in each document.

Another remarkable step is the probabilistic LSA (PLSA) model [36], a generative data model with a solid statistical foundation. In this statistical mixture model, each word in a document is treated as a sample from a mixture model, where the mixture components are multinomial random variables that can be viewed as representations of topics. Thus each independent document is represented by a list of mixing proportions of latent topics, and each topic is represented by a mixture of words drawn from multinomial random variables.
Thus, the joint probability of observing a document and its terms is generated by the mixture process as follows:

P(d, w) = P(d) \prod_{i=1}^{m} P(w_i \mid d), \quad \text{where} \quad P(w_i \mid d) = \sum_{z \in Z} P(w_i \mid z) \, P(z \mid d)

Latent Dirichlet allocation (LDA) is another basic topic model; it extends PLSA by adding a Dirichlet prior to optimize the topic distribution over documents and the word distribution over topics. It groups words together based on how likely they are to appear in a document together [5].
A recent example of the use of topic modelling in science is the work on topic modelling for cluster analysis of large biological and medical datasets [39]. In that work, the authors assessed whether topic modelling is useful for biology and medicine. They analyzed three datasets, for Salmonella pulsed-field gel electrophoresis, lung cancer, and breast cancer, and compared other data mining techniques to topic modelling with the LDA algorithm. Their goal was to assess whether topic modelling gave a better answer to the particular problem they were trying to solve for each dataset. The analysis found that topic modelling gave better results than the other data mining techniques, and they concluded that topic modelling is beneficial for sorting through large sets of medical data, with slightly better precision than other data mining methods [39].
In the next section, the LDA (latent Dirichlet allocation) method used in this work, which deals with words, documents and topics, is discussed.

Latent Dirichlet Allocation (LDA)
In this section, we review the most popular recent topic model, latent Dirichlet allocation (LDA), in detail. LDA is a simple and powerful topic model. Topic models provide an interpretable low-dimensional representation of documents (i.e., with a limited and manageable number of topics) [4] [40]. As stated in [5] [6] [14], LDA is a typical statistical topic modelling technique and the most popular one currently in use; it is an unsupervised learning algorithm that infers semantic word sets called "topics". It associates each document with a probability distribution over topics, where topics are probability distributions over words.

Latent Dirichlet allocation (LDA) takes as input a set of D documents, in which the word tokens {w_d}_{d=1}^{D} are drawn from a vocabulary of V unique word types. LDA imagines that there are K shared topics, each of which is a multinomial distribution over the vocabulary, drawn from a Dirichlet prior.

A probability distribution is an equation that links every possible outcome of a random variable with the probability that the outcome will occur. For example, in a fair coin flip there are two possible outcomes: heads or tails. Heads is represented by 1 and tails by 0. With a random variable X over these two outcomes, the probability distribution of X, P(X = x), is:
P(X=0) = 0.5
P(X=1) = 0.5
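The fair-coin distribution above can be checked empirically by sampling; the sample size and seed here are arbitrary choices for illustration.

```python
import random

random.seed(0)  # fixed seed so the run is reproducible

# P(X=0) = P(X=1) = 0.5 for a fair coin; estimate the distribution by sampling.
flips = [random.randint(0, 1) for _ in range(10_000)]
p1 = sum(flips) / len(flips)
print(f"P(X=1) ~ {p1:.2f}, P(X=0) ~ {1 - p1:.2f}")
```

With 10,000 flips the empirical frequencies settle close to the theoretical value of 0.5.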
There are two probability distributions used in topic modelling, the first being the distribution over topics: the topics most likely to be used in a specific document. For example:
θ_topic1 = 0.25
θ_topic2 = 0.25
θ_topic3 = 0.25
θ_topic4 = 0.25
The second is the probability distribution over words: the words most likely to be found in a specific topic. For example, looking at astronomy papers, the word distribution of one topic could be:
φ_elliptical = 0.35
φ_orbit = 0.25
φ_satellite = 0.3
The Dirichlet probability density over a distribution x = (x_1, …, x_K) with parameter α = (α_1, …, α_K) is

p(x \mid \alpha) = \frac{\Gamma\!\left(\sum_{k=1}^{K} \alpha_k\right)}{\prod_{k=1}^{K} \Gamma(\alpha_k)} \prod_{k=1}^{K} x_k^{\alpha_k - 1}   (2.1)

where x_i ∈ (0, 1) and \sum_{i=1}^{K} x_i = 1. A Dirichlet distribution can also be parameterized by a concentration parameter s > 0 and a mean probability distribution p. This two-parameter Dirichlet distribution, denoted Dirichlet(s, p), is equivalent to Dirichlet(α) if we define s = \sum_{k=1}^{K} \alpha_k and p = α/s. When the mean distribution p is uniform, the Dirichlet distribution is called symmetric and is often denoted Dirichlet(α). Here a symmetric Dirichlet distribution, in which all of the elements of the parameter vector α have the same value, is used.
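To build intuition for the symmetric Dirichlet, one can draw samples from it using the standard construction that normalizes independent Gamma draws; the dimension and α values below are arbitrary illustrative choices.

```python
import random

random.seed(42)  # fixed seed for a reproducible run

def sample_dirichlet(alpha, k):
    """Draw one sample from a symmetric Dirichlet(alpha) over k outcomes.

    Standard construction: normalize k independent Gamma(alpha, 1) draws;
    the result is a positive vector that sums to one (a point on the simplex).
    """
    gammas = [random.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(gammas)
    return [g / total for g in gammas]

x = sample_dirichlet(0.1, 4)   # small alpha tends to give sparse, peaked vectors
y = sample_dirichlet(10.0, 4)  # large alpha tends to give near-uniform vectors
print([round(v, 3) for v in x])
print([round(v, 3) for v in y])
```

Each draw is itself a probability distribution, which is exactly how LDA uses the Dirichlet: every draw is a candidate topic proportion vector θ or word distribution φ.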

More specifically, the multinomial φ_k of topic k is a distribution over words:

p(\varphi_k \mid \beta) = \frac{\Gamma\!\left(\sum_{w=1}^{V} \beta_w\right)}{\prod_{w=1}^{V} \Gamma(\beta_w)} \prod_{w=1}^{V} \varphi_{k,w}^{\beta_w - 1}

The parameters of a multinomial distribution are the number of trials n and the event probabilities p_1, …, p_k, where \sum_i p_i = 1.

A Bayesian inference model is a method used to calculate the probability of an event occurring given the observed data. It combines prior assumptions with the outcomes of previous related events. The model runs through several iterations of assigning words to topics to improve itself; the more iterations done in this way, the more accurately the model reflects the topics present in a corpus [41]. The data are treated as observations arising from a generative probabilistic process that includes hidden variables. A generative probabilistic process is a process for randomly generating observable data. Hidden, or latent, variables are not directly observed but are inferred using posterior inference, in which the hidden variables are estimated based on the relevant observed evidence. In the case of latent Dirichlet allocation, the topics are hidden variables and are inferred from the words in the documents [6]. An important point of LDA is that documents consist of multiple topics. LDA is used to decide which topics are being discussed in a specific document, based on the analysis of a set of documents already observed [21].

LDA can determine the hidden topics in collections of documents from the words that appear in the documents. Let D = {d_1, d_2, …, d_M} be a collection of documents; the total number of documents in the collection is M. The idea behind LDA is that every document involves multiple topics, and each topic can be defined as a distribution over a fixed vocabulary of the words that appear in the documents. Specifically, LDA models a document as a probabilistic mixture of topics and treats each topic as a probability distribution over words. For the ith word in document d, denoted w_{d,i}, the probability P(w_{d,i}) is defined as:
P(w_{d,i}) = \sum_{j=1}^{V} P(w_{d,i} \mid Z_{d,i} = Z_j) \cdot P(Z_{d,i} = Z_j)   (2.2)
Here Z_{d,i} is the topic assignment for w_{d,i}; Z_{d,i} = Z_j means that the word w_{d,i} is assigned to topic j, and V represents the total number of topics. Let φ_j be the multinomial distribution over words for Z_j, φ_j = (φ_{j,1}, φ_{j,2}, …, φ_{j,n}), with \sum_{k=1}^{n} φ_{j,k} = 1. θ_d refers to the multinomial distribution over topics in document d, θ_d = (θ_{d,1}, θ_{d,2}, …, θ_{d,V}), with \sum_{j=1}^{V} θ_{d,j} = 1, where θ_{d,j} indicates the proportion of topic j in document d. LDA is a generative model in which the only observed variables are the w_{d,i}, while the others are all latent variables that need to be estimated. [5] introduce Dirichlet priors on the distributions φ_j and θ_d, which helps optimize them.
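Equation (2.2) is a simple mixture sum, which the following sketch evaluates for toy values of θ_d and the φ_j; all numbers are arbitrary illustrative choices.

```python
# Toy topic proportions theta_d for one document over 3 topics.
theta_d = [0.6, 0.3, 0.1]

# Toy per-topic word distributions phi_j over a 4-word vocabulary.
phi = [
    [0.50, 0.20, 0.20, 0.10],   # topic 1
    [0.10, 0.60, 0.20, 0.10],   # topic 2
    [0.25, 0.25, 0.25, 0.25],   # topic 3
]

def p_word(w, theta, phi):
    """Equation (2.2): P(w) = sum_j P(w | z = j) * P(z = j)."""
    return sum(theta[j] * phi[j][w] for j in range(len(theta)))

print(round(p_word(0, theta_d, phi), 4))  # 0.6*0.5 + 0.3*0.1 + 0.1*0.25 = 0.355
```

Since each φ_j and θ_d sums to one, summing P(w) over the whole vocabulary also yields exactly one, as a valid word distribution must.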

Among the many available algorithms for estimating the hidden variables, the Gibbs sampling method is a very effective strategy for parameter estimation [21], and it is used in this work.

Given a corpus of documents, LDA attempts to discover the following:
It identifies a set of topics
It associates a set of words with a topic
It defines a specific mixture of these topics for each document in the corpus.
The process of generating a corpus is as follows:
1. Randomly choose a distribution over topics.
2. For each word in the document:
a. randomly choose a topic from the distribution over topics;
b. randomly choose a word from the corresponding topic.
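The two-step process above can be sketched directly; the topic names, word strings and probability values below are invented placeholders, not distributions learned in this study.

```python
import random

random.seed(7)  # fixed seed for a reproducible run

# Toy topics: each is a distribution over (invented) words.
topics = {
    "sports":    {"tapha": 0.5, "kubbaa": 0.4, "barnoota": 0.1},
    "education": {"barnoota": 0.6, "mana": 0.3, "tapha": 0.1},
}

def draw(dist):
    """Draw one key from a {outcome: probability} distribution."""
    outcomes, weights = zip(*dist.items())
    return random.choices(outcomes, weights=weights)[0]

def generate_document(theta, n_words):
    """LDA's generative story: for each word, choose a topic from theta,
    then choose a word from that topic's word distribution."""
    doc = []
    for _ in range(n_words):
        z = draw(theta)              # step 2a: choose a topic assignment
        doc.append(draw(topics[z]))  # step 2b: choose a word from that topic
    return doc

theta = {"sports": 0.7, "education": 0.3}  # step 1: this document's topic mixture
print(generate_document(theta, 8))
```

Inference runs this story in reverse: given only the generated words, it recovers plausible values for θ and the topic-word distributions.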
The output of a topic model is then obtained in the next two steps.
The generative process
We want to express LDA as a generative probabilistic process. To do this, an assumption must be made that there are a number of topics related to a collection of documents. Topic modeling mimics the generative process of documents: each document is assumed to be generated as follows: for each word in the document, choose a topic assignment and then choose the word from the corresponding topic. We use LDA as the example to describe this generative process in this thesis.

In LDA, the two probability distributions, p(z|d) and p(w|z), are assumed to be multinomial distributions. The topic distributions of all documents share a common Dirichlet prior α, and the word distributions of the topics share a common Dirichlet prior β. Given the parameters α and β, for each document d the parameter θ_d of a multinomial distribution over K topics is drawn from the Dirichlet distribution Dir(θ_d|α). These parameters encode an initial prior belief about the distributions and are called hyperparameters.
Similarly, for each topic k, the parameter φ_k of a multinomial distribution over V words is drawn from the Dirichlet distribution Dir(φ_k|β). As a conjugate prior for the multinomial, the Dirichlet distribution is a convenient choice and simplifies statistical inference in LDA.
So, the above statement can be put as follows
For each topic k ∈ {1, …, K}:
Generate φ_k = {φ_{k,w}}_{w=1}^{V} ~ Dir(·|β)
For each document d ∈ {1, …, N}:
Generate θ_d ~ Dir(·|α)
So, in LDA, both topic distributions, over documents and over words, have corresponding priors, usually denoted α and β.

Machine learning researchers usually convert this kind of generative process into a graphical model representation, to convey the idea more succinctly.

The corresponding graphical model representation is shown in Figure 5 below. The repeated choices of topics and words can be conveniently illustrated using plate notation, in which plates represent replicates, with the number in the lower right corner referring to the number of samples.

The Posterior Distribution
The generative model provides a general idea of how topic modelling works. Two things are needed to infer the underlying topic structure. First, the topics that generated the documents must be found. Second, for each document, the distribution over topics associated with that document must be found. The posterior distribution is the conditional distribution of all the hidden variables given the observations, which in this case are the words in the documents. The next step is to find an algorithm that will compute this posterior distribution.

David Blei [5] created a graphical model to represent how each variable relates to the other variables.

Figure 5 Graphical model of the parameters of a Dirichlet distribution

Each piece of the diagram is a random variable. The circles are nodes and the rectangles are plates. The red node, α, is a Dirichlet parameter and will be explained in more detail later. The largest rectangle, shaded in pink, is the document plate, D, and it represents the corpus. Within the corpus is θ_d, in orange, which represents the topic proportions for each document. The arrows in the diagram show that Z_{d,n} depends on θ_d because it is drawn from a distribution with parameter θ_d. If θ_d spreads probability among topic 1, topic 2 and topic 3, then Z_{d,n} could be topic 3, drawn from that particular θ_d. There is a Z value for every word in every document of the corpus. The green node, W_{d,n}, is the observed word; it is the only observed random variable in the entire model. The observations are a collection of words arranged by document. W_{d,n} depends on both Z_{d,n} and all the β's, and refers to the nth word in the dth document. β_k, the blue node, is a topic, where each β_k is a distribution over terms and there are K of them. It is assumed that β_k comes from a Dirichlet distribution. The purple node, η, is the topic hyperparameter, which will be explained in more detail later [21].
Arrows indicate conditional dependencies between variables that provide great convenience for inferring the latent variables.

The Dirichlet distribution is an exponential-family distribution over the simplex, the space of positive vectors that sum to one. An exponential-family distribution is one of a set of probability distributions with a specific mathematical form [42].
When fitting an LDA model, the goal is to find the best setting of the latent variables to explain the observed words in the documents, assuming that the model actually generated the text collection. This involves inferring the probability distribution over words φ_k associated with each topic, the distribution over topics θ_d for each document, and the topic responsible for generating each word.
As a conjugate prior to the multinomial distribution, LDA uses Dirichlet priors to simplify posterior inference. Typically, these priors and the related hyperparameters are set to be symmetric, assuming that a priori all topics have equal probability of being assigned to a document and all words have an equal chance of being assigned to a topic. The reasons for choosing symmetric priors over asymmetric priors are not usually stated explicitly and are often implicitly assumed to have little or no practical effect [43]. However, hyperparameters can have a significant effect on the accuracy achieved by various inference techniques, such as Gibbs sampling, variational Bayes, or collapsed variational Bayes [44]. In fact, inference methods have relatively similar predictive performance when the hyperparameters are optimized, which explains away most of the differences between them.

The hyperparameters α and β are used as priors to smooth the distribution over topics θ and the distribution over words φ, respectively. These hyperparameters can be inferred from the observed data.

Posterior inference can be conducted via standard statistical techniques such as Gibbs sampling [45], variational methods [46] and expectation propagation [47].

Throughout this thesis, we focus on Gibbs sampling, since it is easy to understand and to implement. Note that direct Gibbs sampling would sample all latent random variables, including θ and φ, and, not surprisingly, the chain would converge (mix) very slowly. Instead, we use collapsed Gibbs sampling, integrating out θ and φ mathematically, which converges much faster thanks to a simpler and smaller sample space.
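As a concrete illustration, collapsed Gibbs sampling for LDA can be sketched in a few dozen lines; the corpus, hyperparameter values and iteration count below are arbitrary toy choices, not the settings used in this thesis.

```python
import random

random.seed(1)  # fixed seed for a reproducible run

def gibbs_lda(docs, K, alpha=0.1, beta=0.01, iters=200):
    """Minimal collapsed Gibbs sampler for LDA (theta and phi integrated out).

    docs: list of documents, each a list of word ids in 0..V-1.
    Returns point estimates of theta (doc-topic) and phi (topic-word).
    """
    V = max(w for d in docs for w in d) + 1
    ndk = [[0] * K for _ in docs]          # document-topic counts
    nkw = [[0] * V for _ in range(K)]      # topic-word counts
    nk = [0] * K                           # topic totals
    # Random initial topic assignment for every token.
    z = [[random.randrange(K) for _ in d] for d in docs]
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                # Remove this token's current assignment from the counts.
                ndk[d][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
                # Full conditional: p(z=k | rest) ∝ (ndk+alpha)*(nkw+beta)/(nk+V*beta)
                weights = [(ndk[d][j] + alpha) * (nkw[j][w] + beta) / (nk[j] + V * beta)
                           for j in range(K)]
                k = random.choices(range(K), weights=weights)[0]
                z[d][i] = k
                ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    # Smoothed point estimates of theta and phi from the final counts.
    theta = [[(ndk[d][j] + alpha) / (len(docs[d]) + K * alpha) for j in range(K)]
             for d in range(len(docs))]
    phi = [[(nkw[j][w] + beta) / (nk[j] + V * beta) for w in range(V)]
           for j in range(K)]
    return theta, phi

# Toy corpus with two clearly separated "topics": words {0,1} vs words {2,3}.
docs = [[0, 1, 0, 1, 0], [2, 3, 2, 3, 2], [0, 0, 1, 1], [3, 3, 2, 2]]
theta, phi = gibbs_lda(docs, K=2)
print([[round(t, 2) for t in row] for row in theta])
```

Because θ and φ are integrated out, the sampler only tracks count tables, which is exactly why the collapsed chain mixes faster than naive Gibbs over all variables.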

The posterior distribution of φ helps reveal the underlying semantic topics discussed in the text collection, by inspecting the most probable words of each topic, and the posterior distribution of θ provides a semantically meaningful low-dimensional representation of documents, which can subsequently be used for tasks such as document classification and information retrieval.

Afaan Oromoo
Afaan Oromoo is a Cushitic language in the family of Afro-Asiatic languages. It has more than 40 million speakers, most of them native speakers living in Ethiopia, Kenya, Somalia and Egypt. It is the third largest language in Africa, following Kiswahili and Hausa (the fourth largest if Arabic is counted as an African language) [11] [19].
The exact time when the Latin alphabet started being used for writing Afaan Oromoo is not well known, but it was adopted as the official alphabet of Afaan Oromoo on November 3, 1991. It is now the language of public media, education, social issues, religion, political affairs, and technology. Like other natural languages, it is rich in structure and has a very complex grammar, so we need to review the overall structure of the language before designing the framework for statistical topic modeling of news articles.
Basic Sentence Structure
Afaan Oromoo follows the Subject-Object-Verb (SOV) format. However, because it is a declined language (nouns change form based on their role in the sentence), word order can be flexible, though verbs always come after their subjects and objects. Typically, indirect objects follow direct objects. Afaan Oromoo has both prepositions and postpositions; postpositions are more common.

Prepositions and postpositions
Like other languages, Afaan Oromoo is rich in prepositions and postpositions. A preposition links a noun to an action (e.g., go from there) or to another noun (e.g., the pen on the table). A preposition comes before the noun, and a postposition comes after it. Some common prepositions and postpositions:
Postpositions Meaning in English prepositions Meaning in English
Ala Out,outside gara towards
Bira Beside,with,around Erga,eega Since,from,after
Booda after Hanga,hanga Until
Cinaa Beside, near, next to Hamma Up to, as much as
Dur, dura Before Akka Like, as

Related Works
There has been a significant amount of research on topic modeling using different techniques.
Text topic modeling has received wide attention in recent years and has been applied extensively to clustering texts under their respective themes, since it reduces dimensionality efficiently and yields interpretable results. It is a statistical approach that can be implemented through different methods, of which LDA is currently the most widely used because of its efficiency.
LDA introduces Dirichlet priors at the word layer and the hidden topic layer, which solves the overfitting problem caused by the linear growth of topic parameters with the number of training documents in the PLSI and LSI models, making it more suitable for large-scale corpus processing.

Shi Jian-hong et al. [48] applied the LDA topic model to Chinese micro-blogs and achieved better topic discovery.

Li Wen-bo et al. [49] used a labeled LDA topic model that adds text class information to the LDA topic model; it calculates the distribution of hidden topics in each class and raises the classification ability of the traditional LDA model.

Zhang Zhi-fei et al. [50] proposed a text classification method based on the LDA topic model with an overall consideration of context.

Chapter Three
The Proposed Solution
Introduction
This section explains our methodology and the system architecture, or framework, we propose to solve the defined problem. It defines the structure and behavior of each system component. In this work, we propose a probabilistic model using LDA for news articles written in Afaan Oromoo.

The Proposed Framework
The process of our proposed prototype is represented graphically in Figure 6 and consists of the following main modules. The following subsections explain the individual system components in detail.

Basically, existing statistical topic modeling approaches generate multinomial distributions over words to represent topics in a given text collection. The word distributions are derived from word frequencies in the collection; therefore, popular words are very often chosen to represent topics.

The topic model for the Afaan Oromoo language aims to extract the theme of a given news text written in this language using the LDA algorithm. If a collection of words from Afaan Oromoo news articles is available, and these words are arranged into lists such that words with similar meaning, falling into the same category, are placed in the same list, then each word of the input text (leaving aside helping verbs and the like) can be assigned the category to which it belongs (per-word topic assignment). This is called the generative process, as it describes how the input text would have been generated. After this, the proportion of each topic's involvement in the input text is computed. This second step is termed statistical inference; it yields the topic proportions of the document, telling us which topics the text involves (for example, 10% about education, 30% about health, and so on). This information helps greatly in finding out what theme the text contains.

In this work we develop a topic modeling system for news articles using LDA and Gibbs sampling.

The proposed framework includes different phases: data collection, pre-processing, training and topic prediction. In the data collection phase, news articles are collected and stored in a TXT file for processing.

In the pre-processing stage, the news articles are processed to remove stop words, case-fold words, filter words and numbers, and to select the model parameters. During training, the words and their corresponding weights are extracted.

The final phase predicts the topic of a news article based on probability. The overall framework of this work is depicted in Section 3.3.

Design
This section explains our methodology and the system architecture. To accomplish the proposed model, the main requirement is to design topic lists containing the various topics that occur frequently in news articles and the words corresponding to those topics. These topic lists can be prepared manually, or a learning algorithm can be employed to generate such a collection for the model.

The topic model for the Afaan Oromoo language proposed in this study needs to perform the following two major tasks in order to find the theme of the input text:
Step 1: Matching the words of the Afaan Oromoo news text to the topic list corpus.
In this step of our algorithm, the main topics of the document collection are extracted using LDA. The extracted topics represent the connection between documents and query.
Step 2: For each topic list, counting the words matching that particular topic list.

The proposed method is based on forming keyword sets from news articles and finding closed, frequently co-occurring keywords to form the topics. These keywords represent the document, and we propose a topic modeling approach that searches for similar documents using LDA to induce probabilistic topics.

Figure 6 illustrates the proposed framework of feature representations for the model. The framework includes different phases: data collection, pre-processing, training and topic prediction. Each phase is described in detail as follows.

Figure 6: The proposed framework
Data Collection
In the data collection phase, news articles are collected from different sources and stored in a TXT file for processing. During training, the words and their corresponding weights are extracted. The final phase predicts the topic of a news article based on probability.

Data are collected manually from news websites: 400 instances from 4 different news websites related to education, sports, health and weather, with 100 news articles per category.

Education news articles are collected from the Oromia Education Bureau website. Sports news articles are collected from the Oromia Broadcasting Corporation Network (OBN) website. Health news is collected from the Fana Broadcasting Corporation (FBC). Weather news is collected from the OBN website. We illustrate our approach by addressing the task of topic modeling in news item analysis.
Text Pre-Processing
In the pre-processing stage, the news articles are processed to remove stop words (frequently occurring words with no significant meaning), punctuation and numbers, to case-fold words, to select the model parameters, and to normalize the corpus.

These steps are performed to acquire the most representative words. Like other natural languages, Afaan Oromoo is rich in structure, and this stage is needed to keep only the words that represent the corpus. The news documents are stored as TXT files, given as input, and the pre-processing steps are carried out before learning models with LDA. The cleaned corpus is then converted to a tf-idf matrix to represent the documents for LDA. The pre-processing tasks of tokenization, case folding, stop-word removal and parameter selection are discussed below.
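The tf-idf conversion mentioned above can be sketched as follows. This is an illustrative Python version (the thesis implementation is in Java); the toy Afaan Oromoo tokens are our own example.

```python
import math
from collections import Counter

def tfidf_matrix(docs):
    """Build a tf-idf representation (one dict of term weights per
    document) from a list of tokenized documents."""
    n = len(docs)
    # document frequency: in how many documents each term occurs
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    matrix = []
    for doc in docs:
        tf = Counter(doc)
        # tf-idf weight: term frequency x inverse document frequency
        matrix.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return matrix

docs = [["barnoota", "mana", "barnoota"], ["ispoortii", "mana"]]
m = tfidf_matrix(docs)
# "mana" appears in every document, so its idf (and weight) is 0
```

Terms that occur in every document receive weight zero, which is exactly why ubiquitous words contribute nothing to distinguishing documents.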

Tokenization
First, all news documents given as input are tokenized into individual tokens and represented as bag-of-words features. Punctuation is removed from the ends of words, and the list of tokens becomes the input for further text mining. A case folder maps lower- and upper-case variants to one word, i.e. "The = tHE = ThE = thE = tHe" all become "the". Apart from grouping lower- and upper-case words, no normalization method (e.g. stemming or lemmatization) is applied to reduce inflectional and derivational word forms to a common base form: stemming algorithms can produce unrecognizable words that reduce interpretability when labeling the topics, and since human topic ranking is part of our topic quality evaluation, we do not apply this step.
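A minimal sketch of this tokenization and case-folding step (illustrative Python, not the thesis's Java code; the apostrophe is kept because Afaan Oromoo orthography uses it for the glottal stop):

```python
import re

def tokenize(text):
    """Split text into lowercase word tokens, stripping punctuation.
    No stemming or lemmatization is applied, as described above."""
    return re.findall(r"[a-z']+", text.lower())

tokens = tokenize("Barnoonni Oromiyaa keessatti, THE the ThE.")
# 'THE', 'the' and 'ThE' all fold to the single token 'the'
```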

Finding Meaningful Words
Very common words such as a, of and the do not indicate the kind of similarity between documents in which one is interested, and single letters or other very short sequences are rarely useful for understanding content. We therefore remove terms that appear very few times in the corpus, because very rare words tell little about the similarity of documents, as well as the most common words, because ubiquitous words tell equally little. Like other languages, Afaan Oromoo has its own list of stop words to be removed, for example akka, malee, fi, f.
One of the principal problems with LDA is that, for useful results, stop words must be removed in a pre-processing step. Without this filtering, very common words such as the, of, to, and, a, etc. pervade the learned topics, hiding the statistical semantic word patterns that are of interest. While stop-word removal does a good job of solving this problem, it is an ad hoc measure that leaves the model resting on a non-coherent theoretical basis. Moreover, stop-word removal is not without problems: stop-word lists must often be domain-dependent, and there are inevitably cases where filtering results in under-coverage or over-coverage, causing the model to remain plagued by noise or to miss patterns of interest.
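The filtering step can be sketched as follows; the stop list here contains only the few Afaan Oromoo examples given above and is far smaller than a real, domain-dependent list would be.

```python
# Tiny illustrative stop list (examples from the text above only);
# a real Afaan Oromoo stop list would be much larger.
STOP_WORDS = {"akka", "malee", "fi", "f"}

def remove_stopwords(tokens):
    """Drop stop words and single-letter tokens, keeping only
    potentially meaningful words."""
    return [t for t in tokens if t not in STOP_WORDS and len(t) > 1]

print(remove_stopwords(["barnoota", "fi", "fayyaa", "akka", "k"]))
# -> ['barnoota', 'fayyaa']
```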

One approach to keeping stop words out of the topic distributions is to imagine all stop words being generated by a separate background distribution shared across all documents, rather than by the content topics themselves.
Parameter Selection
The number of topics is specified through the parameter selection process, and the Dirichlet parameters for term and topic smoothing used by the LDA model are also provided; Gibbs sampling is run under the same fixed hyperparameters.
The hyperparameter α controls the shape of the document-topic distribution, whereas β controls the shape of the topic-word distribution. A large α leads to documents containing many topics, and a large β leads to topics with many words. In contrast, small values of α and β result in sparse distributions: documents containing a small number of topics and topics with a small number of words. In essence, the hyperparameters α and β have a smoothing effect on the multinomial variables θ and φ, respectively.
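The sparsity effect of the hyperparameters can be seen by drawing from a symmetric Dirichlet directly. A minimal sketch (the Gamma-normalization construction is a standard way to sample a Dirichlet; values here are arbitrary illustrations, not the thesis settings):

```python
import random

def dirichlet(alpha, k, rng=random.Random(0)):
    """Draw one sample from a symmetric Dirichlet(alpha) over k
    categories by normalizing independent Gamma(alpha, 1) draws."""
    g = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    s = sum(g)
    return [x / s for x in g]

dense = dirichlet(10.0, 5)   # large alpha: near-uniform mixture (many topics)
sparse = dirichlet(0.1, 5)   # small alpha: mass piles onto few topics
print(dense, sparse)
```

Repeating the draws shows that small concentration values put almost all probability on one or two components, which is exactly the "documents with few topics / topics with few words" behaviour described above.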

The LDA Algorithm
In this work, we use the LDA model described with plate notation in Section 2.4, Fig. 5. The model relies on two Dirichlet priors, α and β. The α parameter controls the topic distributions θ for each document d in D; the β parameter controls the topic-word distributions φ. For more details about the model, we refer the reader to [6][36][51].

The core idea of the proposed model is to discover the inner relations among all associated terms within each topic. The model tries to construct new topical datasets and generate new representations, in order to extract and highlight the semantics of topic representations.

The topic modeling system for news articles is built using LDA with collapsed Gibbs sampling. LDA learns the unobserved groups from similar groups of data, assigning the words of documents to particular topics.

A topic representation is usually defined as a set of related terms or words. In LDA, the idea of topic representations starts from the knowledge of frequent pattern mining [21], which plays an essential role in many data mining tasks directed toward finding interesting patterns in datasets [48][49]. We believe that related-term representations are more meaningful and represent topics more accurately than word-based representations. After generating the patterns, we have to find the meaningful terms that represent each category. In order to discover semantically meaningful patterns to represent topics and documents, two steps are proposed: first, construct a new transactional dataset from the LDA results of the document collection D; second, generate pattern-based representations from the transactional dataset to represent the user needs of the collection D.

LDA takes the documents and a number of topics K as input and decomposes the corpus into two low-rank matrices: a document-topic probability matrix and a topic-word probability matrix.

This means it can automatically discover the topics that documents contain. To understand how LDA does this, consider an example.

Assume we have the following sentences and we want to extract two topics (Topic 1 and Topic 2):
Sentence 1: I eat fish and vegetables
Sentence 2: Fish are pets
Sentence 3: my kitten eats fish
First, we can infer the content of each sentence from word counts: sentence 1 is 100% about topic 1, sentence 2 is 100% about topic 2, and sentence 3 is 33% about topic 1 and 67% about topic 2.
Conversely, we can derive the proportion that each word constitutes within a given topic; for example, Topic 1 might comprise words in the following proportions:

40% eat, 40% fish and 20% vegetables, from which we can deduce that this topic is about food.
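The count-based intuition above can be sketched directly: given per-word topic assignments for one document, the document's topic proportions are just normalized counts (illustrative Python; the assignments are hypothetical).

```python
from collections import Counter

def topic_proportions(assignments):
    """Given per-word topic assignments for one document, return the
    proportion of the document devoted to each topic."""
    c = Counter(assignments)
    n = len(assignments)
    return {t: c[t] / n for t in c}

# Sentence 3 ("my kitten eats fish") with one word assigned to
# topic 1 and two words to topic 2, as in the running example:
print(topic_proportions([1, 2, 2]))
# -> {1: 0.3333333333333333, 2: 0.6666666666666666}
```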

The general algorithm of LDA is as follows:
Start
Import the corpus
Train the model, including preprocessing:
- select the number of topics
- select the number of iterations
Generally, LDA assumes that new documents are created in the following way:

Determine the number of words in the document
Choose a topic mixture for the document over a fixed set of topics, e.g. 20% topic 1, 30% topic 2 and 50% topic 3
Generate the words in the document by:
first picking a topic based on the document's multinomial distribution above,
then picking a word based on that topic's multinomial distribution
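The generative steps above can be sketched as follows. This is an illustrative Python simulation with hypothetical θ and φ distributions (the topic names, word lists and probabilities are our own toy values).

```python
import random

rng = random.Random(42)

def sample_discrete(probs, items):
    """Pick one item according to the given probabilities."""
    r, acc = rng.random(), 0.0
    for p, it in zip(probs, items):
        acc += p
        if r < acc:
            return it
    return items[-1]

# Hypothetical topic-word distributions (phi) and a document's
# topic mixture (theta), mirroring the 20/30/50 example above.
phi = {
    "t1": ([0.5, 0.5], ["barnoota", "mana"]),
    "t2": ([1.0], ["ispoortii"]),
    "t3": ([0.5, 0.5], ["fayyaa", "haala"]),
}
theta = ([0.2, 0.3, 0.5], ["t1", "t2", "t3"])

def generate_document(n_words):
    """LDA's generative story: for each word slot, draw a topic from
    theta, then draw a word from that topic's distribution."""
    words = []
    for _ in range(n_words):
        z = sample_discrete(*theta)             # pick a topic
        words.append(sample_discrete(*phi[z]))  # pick a word from it
    return words

doc = generate_document(6)
```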
The Pseudocode for LDA
Procedure:
Input: number of documents M
       number of topics K
       φ vocabulary matrix
Output: topic probability distribution for each word in a document
Steps:
Choose a topic distribution θ
Assign each word w in a document d to one of the topics
For each word w in a document d:
    for each topic t, calculate P(topic t | document d)
    calculate P(word w | topic t)
The selection of word w for topic t depends on the distribution of the φ vocabulary matrix
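A minimal runnable sketch of the collapsed Gibbs procedure behind this pseudocode (illustrative Python; the thesis implementation is in Java, and the toy corpus reuses the fish/pets example from above):

```python
import random
from collections import defaultdict

def lda_gibbs(docs, K, alpha=0.5, beta=0.1, iters=200, seed=0):
    """Minimal collapsed Gibbs sampler for LDA.
    docs: list of token lists; K: number of topics.
    Returns document-topic counts and topic-word counts."""
    rng = random.Random(seed)
    V = len({w for d in docs for w in d})
    ndk = [[0] * K for _ in docs]               # doc-topic counts
    nkw = [defaultdict(int) for _ in range(K)]  # topic-word counts
    nk = [0] * K                                # topic totals
    z = []                                      # assignment per token
    for d, doc in enumerate(docs):              # random initialization
        zd = []
        for w in doc:
            k = rng.randrange(K)
            zd.append(k)
            ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
        z.append(zd)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                # remove the current assignment from the counts
                ndk[d][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
                # collapsed conditional, proportional to
                # (n_dk + alpha) * (n_kw + beta) / (n_k + V*beta)
                ps = [(ndk[d][t] + alpha) * (nkw[t][w] + beta)
                      / (nk[t] + V * beta) for t in range(K)]
                r, acc = rng.random() * sum(ps), 0.0
                for t, p in enumerate(ps):
                    acc += p
                    if r < acc:
                        k = t
                        break
                z[d][i] = k
                ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    return ndk, nkw

docs = [["fish", "eat", "vegetables"],
        ["fish", "pets"],
        ["kitten", "eats", "fish"]]
ndk, nkw = lda_gibbs(docs, K=2)
```

Normalizing `ndk` rows with α gives the per-document topic proportions θ, and normalizing `nkw` with β gives the topic-word distributions φ.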

Topic Model Labelling
In statistical topic models, a word-based multinomial distribution is used to represent each topic, but this works less well for explicitly interpreting the semantics of the topics.

Normally, the words with high probability in a topic suggest its meaning, but single words suffer from polysemy and synonymy. Thus, people tend to label topics with semantic phrases [16][8].

First, a set of candidate phrases is generated, either by parsing the text collection or by using statistical measures. Second, these candidate phrases are ranked with a probabilistic measure indicating how well each phrase characterizes the topic model. Finally, a few top-ranked phrases are chosen as labels for the topic; the selected labels can be diversified by eliminating redundancy.

Two methods are commonly used. In the first, phrases are simply ranked by their likelihood under the topic model [27]; this automatically gives high scores to phrases that are probable according to the word distribution of the topic model to be labelled. In the second, phrases are ranked by the expected mutual information between a word and the phrase under the word distribution of the topic model. The second method is shown to be better than the first because it favours phrases that have an overall similarity to the high-probability words of the topic model. Furthermore, a topic can also be labelled with respect to an arbitrary reference/context collection, enabling its interpretation in different contexts. When LDA generates a topic model for a group of documents, labelling gives the most natural label of the topic to the word with the highest probability. For example, consider the topic children, women, people, child, years, families, work, parents, says, family, welfare, men, percent, care, life: these words are grouped under one topic, and it is labelled "Children" because that top word has the highest probability. This can be problematic, since choosing the top word alone may make no sense. So, in our work, we use Labeled LDA to prepare the topic lists that represent the categories.
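The first ranking method can be sketched as scoring each candidate phrase by its log-likelihood under the topic's word distribution. Illustrative Python only; the topic distribution and candidate labels below are hypothetical.

```python
import math

def score_label(phrase, topic_dist, eps=1e-9):
    """Score a candidate label by its log-likelihood under the topic's
    word distribution; out-of-vocabulary words get a tiny floor."""
    return sum(math.log(topic_dist.get(w, eps)) for w in phrase.split())

# Hypothetical word distribution for the "Children" topic above
topic = {"children": 0.12, "women": 0.10, "family": 0.08, "care": 0.05}
candidates = ["children family", "family care", "sports team"]
best = max(candidates, key=lambda c: score_label(c, topic))
# "sports" and "team" are out of vocabulary, so that label scores lowest
```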

Chapter Four
Experiment and Evaluation
This section illustrates the experiments designed to perform topic modeling on the digitized news article data, to explore the themes of the articles and create new knowledge from the text.

We have conducted experiments to evaluate the performance of the proposed topic modelling methods. The experiments have been carried out using LDA to identify the topics for various articles. In this section, we present the results of the evaluation.

Specifically, we used the GUI developed in Java to supply the input parameters of the experiments, which were carried out using LDA to identify the topics of various Afaan Oromoo articles.
Datasets
Afaan Oromoo text articles from domains such as education, sports, health and weather are collected from multiple sources, and a text corpus is created. The various pre-processing tasks described earlier are carried out to facilitate training. The model has been tested on 400 news texts, by comparing the theme obtained from the topic model with the corresponding generated pattern as well as the content of the news text.

LDA and choosing the parameters for our problem
In this work, we developed a Java implementation of LDA to learn the topic and topic-word distributions and generate the topic profiles for our categories. The implementation relies on Gibbs sampling to learn the distributions, which requires the following parameters: K, the number of topics; α and β, the Dirichlet priors; and N, the number of iterations, as discussed in Section 3.3.3.

Implementation procedure
The model has been implemented in the Java programming language. The topic modeling operation is carried out by the methods of three classes, namely "InputNewText", "TopicModeling" and "OutputNewsText".
The whole procedure used in the experiments is depicted in Fig. 7. The first step is dataset preparation and preprocessing. In the topic generation step, we use sampling to generate LDA topic models: the number of topics is V = 7, the number of Gibbs sampling iterations is 200, and the hyperparameters of LDA are α = 50/V and β = 0.01, as used in Steyvers and Griffiths (2007). In the last step we construct the topical representations and generate the frequent-term-based topic representations using the proposed method.

Figure 7: Steps for generating topic representations

References
1 J. Grimmer, “We are all social scientists now: How big data, machine learning,and causal inference work together,” PS: Political Science & Politics, pp. 80-85, 2015.
2 M. J. &. M. W. Zaki, “Data Mining and Analysis: Fundamental Concepts and Algorithms,” Cambridge University Press, 2014.
3 J. Dean, “Big Data, Data Mining, and Machine Learning : Value Creation for Business Leaders and Practitioners,” 2014.
4 L. T. W. D. S. Y. a. W. Z. Lin Liu, “An overview of topic modeling and its current applications in bioinformatics,” SpringerPlus, 2016.

5 D. M. Blei, A. Y. Ng and M. I. Jordan, “Latent Dirichlet Allocation,” Journal of Machine Learning Research, vol. 3, pp. 993-1022, 2003.
6 C. L. D. D. Blei D, “Probabilistic topic models,” IEEE Signal Process Mag, 2010.
7 L. Q. Minglai Shao, “Text Similarity Computing Based on LDA Topic Model and Word Co-occurrence,” in 2nd International Conference on Software Engineering, Knowledge Engineering and Information Engineering, China, 2014.
8 D. H. R. N. a. C. D. M. Denial Ramage, “Labeled LDA: A supervised topic model for credit attribution in multilabeled corpora,” proceeding of the 2009 Conference on Empirical Methods in Natural Language Processing, 2009.
9 L. H. F. a. O. B. OunasAsfari, “Ontological Topic Modeling to Extract Twitter users’ Topics of Interest,” International Conference on Information Technology and Applications, vol. 8, 2013.
10 I. Biro, “Document classification with latent dirichlet allocation,” Ph.D.dissertation,” 2009.

11 A. W. T. S. a. D. D. P. Crossno, “Topicview: Visually comparing topic models of text collections,” in Tools with Artificial Intelligence (ICTAI) 3rd IEEE International Conference , p. 936–943, 2011.
12 A. K. a. H. Merouani, “Clustering with Probabilistic Topic Models on Arabic Texts,” In Modeling Approaches and Algorithms for AdvancedComputer Applications, pp. 65-74, 2013.
13 J. C. a. W. Z. W. Zhao, “Best Practices in Building Topic Models withLDAfor Mining Regulatory Textual Documents,” CDER, 9th November, 2015.
14 P. F. a. W. N. Ralf Krestel, “Latent Dirichlet Allocation for Tag Recommendation,” ACM, 2009.
15 S. K. H. GirishMaskeri, “Mining Business Topics in Source Code using Latent Dirichlet Allocation,” ACM, 2008.
16 C. D. M. a. S. D. Daniel Ramage, “Partially Labeled Topic Models for Interpretable Text Mining,” 2011.
17 M. M. J. S. J. D. a. Y. M. Sarah ElShal, “Topic modeling of biomedical text,” IEEE International Conference on Bioinformatics and Biomedicine (BIBM), 2016 .
18 M. Z.-Y. C. T.-S. L. S. Y. H. e. a. Mao X-L, “SSHLDA: a semi-supervised hierarchical topic model,” In Proceedings of the 2012 joint conference on empirical, 2012.
19 V. M. S. Suganya C, “Statistical topic Modeling for News Articles,” International Journal of Engineering Trends and Technology (IJETT), vol. Volume 31, Number 5- January 2016.
20 K. A. Rubayyi Alghamdi, “A Survey of Topic Modeling in Text Mining,” International Journal of Advanced Computer Science and Applications, Vols. Vol. 6, No. 1, 2015.
21 S. P. R.-Z. M. G. T. Steyvers M, “Probabilistic author-topic models for information discovery,” Proceedings of the tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, p. 306–15, 2004.
22 Q. M. e. al, “Automatic labeling of multinomial topic models,” in Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, vol. 1, no. San Jose, California, USA, 2015.
23 D. A. a. G. M. K. Christidis, “Exploring Customer Preferences with Probabilistic Topics Models.,” 2014.
24 M. T. N. B. M. V. M. P. R. a. S. C. Nitin Sukhija, “Topic Modeling and Visualization for Big Data in Social Sciences,” Intl IEEE Conferences on Ubiquitous Intelligence ; Computing, Advanced and Trusted Computing, Scalable Computing and Communications, Cloud and Big Data Computing, Internet of People, and Smart World Congress, 2016.
25 M. A. a. D. Berkani, “A Topic Identification Task for Modern Standard Arabic,” In Proceedings of The 10th Wseas InternationalConference On Computers, pp. 1145-1149, 2006.
26 M. W. Beck, “Average dissertation and thesis length,” https://github. com/fawda123/diss, 2014.
27 A. A. a. M. Abbod, “Enhanced Topic Identification Algorithm forArabic Corpora,” UKSIM-AMSS International Conference on Modellingand Simulation, pp. 90-94, 2015.
28 R. A. M. M. a. M. M. M. Zrigui, “Arabic text classificationframework based on latent dirichlet allocation,” Journal of Computing and Information Technology, vol. 3, pp. 125-140, 2012.
29 S. a. H. D. Purpura, “Automated classification of congressional legislation.,” In Proceedings of the 2006 international conference on Digital government research, no. Digital Government Society of North America, pp. 219-225, 2006.
30 D. P. S. a. W. J. Hillard, “Computer-assisted topic classification for mixed-methods social science research,” Journal of Information Technology and Politics, vol. 4, pp. 31-46, 2008.
31 M. Scharkow, “Thematic content analysis using supervised machine learning: An empirical evaluation using German online news,” Quality &amp; Quantity, vol. 2, pp. 761-773, 2013.
32 S. D. E. v. d. B. A. a. M. M. Verberne, “Automatic thematic classification of election manifestos,” Information Processing &amp; Management, vol. 4, pp. 554-567, 2014.
33 k. H. Bettina Grun, “Topicmodels: An R Package for Fitting Topic Model,” Journal of Statistical Software, vol. Vol. 40 No. 13., 2011.
34 S. C. D. S. T. L. T. K. F. G. W. a. H. Deerwester, “Indexing by latent semantic analysis,” JASIS, vol. 6, pp. 381-407, 1990.
35 C. H. T. H. R. P. a. V. S. Papadimitriou, “Latent semantic indexing: A probabilistic analysis,” In Proceedings of the Seventeenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pp. 159-168, 1998.
36 T. Hofmann, “Probabilistic latent semantic analysis.,” In Proceedings of Uncertainity in Artificial Intelligence, pp. 289-296, 1999.
37 K. M. M. B. L. C. M. C. M. H. a. R. D. R. Quinn, “How to analyze political attention with minimal assumptions and costs,” American Journal of Political Science, vol. 1, pp. 209-228, 2010.
38 K. A. Rubayyi Alghamdi, “A Survey of Topic Modeling in Text Mining,” International Journal of Advanced Computer Science and Applications, Vols. Vol. 6, No. 1, 2015.
39 W. Z. W. ;. C. J. J. Zhao, “Topic modeling for cluster analysis of large biological and medical datasets,” BMC bioinformatics, 2014.
40 S. Y. Y. R. Z. Z. Z. X. Yang, “Topic Modeling on Short Texts withCrowdsourcing”.
41 J. Dickman, “Topic Modeling Explained: LDA to Bayesian Inference,” Tech Talk, 2014.
42 E. B. Andersen, “Sufficiency and exponential families for discrete sample spaces,” Journal of the american statistical association, pp. 1248-1255, 1970.
43 D. M. a. A. M. H. M. Wallach, “Rethinking LDA : Why Priors Matter,” in Advances in Neural Information Processing Systems, vol. vol. 22, p. 1973–1981, 2009.
44 M. W. P. S. a. Y. W. T. A. Asuncion, “On Smoothing and Inference for Topic Models,” Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, p. 27–34, may 2012.
45 T. Hofmann, “Probabilistic latent semantic indexing,” in Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, p. 50–57, 1999.
46 S. S. a. M. Spruit, “Full-Text or Abstract? Examining Topic Coherence Scores Using Latent Dirichlet Allocation,” in The 4th IEEE International Conference on Data Science and Advanced Analytics, p. 165–174, 2017.
47 H. M. M. I. S. R. a. M. D. Wallach, “Evaluation Methods for Topic Models,” in ICML 09 Proceedings of the 26th Annual International Conference on Machine Learning, p. 1105–1112, 2009.
48 C. C. P. S. a. M. S. David Newman, “Analyzing Entities and Topics in News Articles using Statistical Topic Models,” 2011.
49 D. Z. W. a. J. N. W. L. M. E. P. G. a. A. S. Clint P. George, “A Machine Learning based Topic Exploration and Categorization on Surveys,” International Conference on Machine Learning and Applications, 2012.
50 J. J. D. S. L. L. Jianguang Duy, “Topic Modeling with Document Relative Similarities,” Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, 2015.
51 . A. K. a. H. Merouani, “Clustering with probabilistic topic models on arabic texts,” in Modeling Approaches and Algorithms for Advanced Computer Applications, ser. Studies in Computational Intelligence,A. Amine, A. M. Otmane, and L. Bellatreche, Eds. Springer International Publishing, vol. vol. 488, p. pp. 65–74, 2013.