Information Retrieval Models
Information retrieval is an emerging field of computer science concerned with
storing documents and retrieving them at the user’s request. Its most essential
task is retrieving the documents that are relevant to a given query. For this
task, efficient and effective retrieval models have been built and proposed.
This survey paper sheds light on some of these information retrieval models,
which have been built for different datasets and purposes. A comparison among
these models is also presented.
Keywords: information retrieval, retrieval models.
A large amount of information is available in electronic form, and its size is
continuously increasing. Handling this information without an information
retrieval system would be impossible. As the size of the data increased,
researchers started paying attention to how to obtain or extract relevant
information from it. Initially, much of information retrieval technology was
based on experimentation and trial and error.
Handling the increasing amount of textual information available in electronic
form efficiently and effectively is critical. Different retrieval models were
formed, based on different concepts, to manage and extract information.
Information is mostly stored in the form of documents. The main purpose of a
retrieval system is to find the information needed. An information retrieval
system is a software program that stores and manages information on documents,
often textual documents but possibly multimedia. The system assists users in
finding the information they require. A perfect retrieval system would
retrieve only the relevant documents, but in practice this is not possible
because relevance depends on the subjective opinion of the user.
1.1 Basic model
Every retrieval model includes the following basic steps: document
representation (indexing), query formulation, and query and collection
comparison.
Figure 1: Information retrieval process (Hiemstra, November 2009)
Retrieval models represent documents in indexed form, as this is an efficient
approach. Different algorithms have been developed specifically for indexing,
because the better the data is stored, the more accurately and efficiently it
can be retrieved.
Query formulation is the next important step. The user searches for data using
keywords or phrases. In order to search for these phrases in the indexed
collection, the query must be expressed in the same form as the index. Indexing
can be done in different ways, depending on how the content of both the
documents in the collection and the user query is represented (Cerulo, 2004;
Hiemstra, November 2009).
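The indexing and query formulation steps can be illustrated with a minimal
sketch. It assumes a simple inverted index (term to document ids) and a
lowercase alphanumeric tokenizer; the function names `tokenize` and
`build_index` and the toy collection are hypothetical and not taken from the
cited works. The point shown is only that the query passes through the same
representation as the documents.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


# Hypothetical toy collection, invented for illustration.
docs = {
    1: "Information retrieval models",
    2: "Boolean retrieval of documents",
    3: "Vector space model",
}
index = build_index(docs)

# The query is tokenized exactly like the documents, so that its terms
# can be looked up directly in the index.
query_terms = tokenize("Retrieval MODELS")
postings = {term: sorted(index.get(term, set())) for term in query_terms}
print(postings)
```

Note that this sketch does no stemming, so "model" and "models" are treated as
different terms; a real indexer would usually normalize further.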
The results of any retrieval system depend on its comparison algorithm, which
therefore determines the accuracy of the system: the better the comparison, the
better the results. The outcome of this comparison is a list of documents that
may be relevant or irrelevant. The main objective of a
retrieval model is to measure the degree of relevance of a document with
respect to the given query. (Paik, August 13,
Relevant documents are ranked higher than irrelevant documents and are shown at
the top of the list, to minimize the time and effort the user spends searching
for documents.
The paper is divided into sections, each explaining a different model and its
results, together with its advantages and limitations.
2 Retrieval Models
2.1 Exact match models
The exact match model labels documents as either relevant or irrelevant. It is
also known as the Boolean model, the earliest and the easiest model for
retrieving documents. It uses logical functions in the query to retrieve the
required data. George Boole’s mathematical logic operators are combined with
query terms and their respective document sets to form new sets of documents.
There are three basic operators: AND (logical product), OR (logical sum), and
NOT (logical difference) (Ricardo Baeza-Yates, 2009). The result of the AND
operator is a set of documents smaller than or equal to the document set of any
of the individual terms. The OR operator results in a document set that is
bigger than or equal to the document set of any single term.
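The three operators map directly onto set operations. The following is a
minimal sketch under the assumption that each term already has a set of
matching document ids; the postings shown and the helper names `op_and`,
`op_or` and `op_not` are hypothetical.

```python
# Hypothetical postings: term -> set of ids of documents containing the term.
postings = {
    "retrieval": {1, 2, 4},
    "boolean": {2, 3},
    "vector": {4, 5},
}


def op_and(a, b):
    """Logical product: documents present in both sets."""
    return a & b


def op_or(a, b):
    """Logical sum: documents present in at least one set."""
    return a | b


def op_not(a, b):
    """Logical difference: documents in a that are not in b."""
    return a - b


both = op_and(postings["retrieval"], postings["boolean"])
either = op_or(postings["retrieval"], postings["boolean"])
without = op_not(postings["retrieval"], postings["vector"])

print(both, either, without)
```

In this example the AND result is no larger than either operand and the OR
result is no smaller than either operand, matching the size properties stated
above.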
The Boolean model gives users a sense of control over the system. It
distinguishes clearly between relevant and irrelevant documents if the query is
accurate. The model does not rank documents, since the degree of relevance is
ignored entirely. It either retrieves a document or does not, which might cause
frustration for end users.
Region models are an extension of the Boolean model that reasons about
arbitrary parts of textual data, called segments, extents or regions. A region
might be a word, a phrase, a text element such as a title, or a complete
document. Regions are identified by a start position and an end position.
Region systems are not restricted to retrieving documents.
Region models did not have a big impact on the information retrieval research
community, nor on the development of new retrieval systems. The reason for this
is quite obvious: region models do not explain in any way how search results
should be ranked. In fact, most region models are not concerned with ranking at
all; one might say that they, like the relational model, are actually data
models rather than information retrieval models (Mihajlović).
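A region identified by a start and an end position can be sketched as a small
data structure. The containment operator below is an assumption chosen for
illustration, since the text does not name specific region operators; the
`Region` type, the `contained_in` function and the token positions are all
hypothetical.

```python
from typing import NamedTuple


class Region(NamedTuple):
    start: int  # position of the first token in the region
    end: int    # position of the last token in the region (inclusive)


def contained_in(inner, outer):
    """Return the regions of `inner` that lie entirely inside a region of `outer`."""
    return [
        r for r in inner
        if any(o.start <= r.start and r.end <= o.end for o in outer)
    ]


# Hypothetical token positions within one text, invented for illustration.
titles = [Region(0, 3), Region(20, 24)]
word_hits = [Region(2, 2), Region(10, 10), Region(22, 22)]

# Occurrences of the query word that fall inside a title region.
print(contained_in(word_hits, titles))
```

As in the Boolean model, the result is an unordered set of matching regions:
nothing in the operator says which match should come first.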
Exact match models may miss important data because they do not support a
ranking mechanism. Therefore, there was a need to introduce ranking algorithms
into retrieval systems. The results are ranked on the basis of the occurrence
of the query terms in the documents. Some ranking algorithms depend only on the
link structure of the documents, while others use a combination of both, that
is, they use the document content as well as the link structure to assign a
rank value to a given document (Gupta, 2013).
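Content-based ranking by term occurrence can be sketched in a few lines. This
is a deliberately simple illustration that assumes the score of a document is
the raw count of query-term occurrences; it is not the scoring function of any
cited work, link structure is not modelled, and the names `score` and `rank`
and the toy documents are hypothetical.

```python
from collections import Counter


def score(query_terms, doc_terms):
    """Sum how often each query term occurs in the document."""
    counts = Counter(doc_terms)
    return sum(counts[term] for term in query_terms)


def rank(query, docs):
    """Return (doc_id, score) pairs sorted from highest to lowest score."""
    query_terms = query.lower().split()
    scored = [
        (doc_id, score(query_terms, text.lower().split()))
        for doc_id, text in docs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy collection, invented for illustration.
docs = {
    "d1": "retrieval models rank documents",
    "d2": "boolean retrieval retrieval systems",
    "d3": "image compression",
}
ranking = rank("retrieval systems", docs)
print(ranking)
```

Unlike the exact match models, every document receives a score, so partially
matching documents such as d1 still appear in the list, below better matches.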
Vector based model
The Vector Space Model (VSM) is a conventional information
retrieval model that represents documents and queries as vectors in a
multidimensional space. The basic idea is that, once indexing terms are
extracted from a document collection, each document or query is represented as
a vector of weighted term frequencies. Similarity comparisons among documents
and/or between documents and queries are made via the similarity between two
vectors (e.g. cosine similarity).
The document set and the query are compared using a similarity measure, and the
documents most similar to the query are returned to the user. Several methods
are used for this, such as cosine similarity over tf-idf weighted vectors.
Cosine similarity computes the cosine of the angle between two vectors in
n-dimensional space. The cosine similarity between two documents d and d' is
given by

$$\cos(d, d') = \frac{d \cdot d'}{|d| \, |d'|}$$
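The cosine formula above can be sketched as follows. For simplicity the sketch
assumes raw term counts as vector weights rather than tf-idf weights, and the
helper names `to_vector` and `cosine` are hypothetical.

```python
import math
from collections import Counter


def to_vector(text):
    """Represent a text as a sparse vector of raw term counts."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine of the angle between two sparse term vectors."""
    # Only terms shared by both vectors contribute to the dot product.
    dot = sum(u[term] * v[term] for term in u.keys() & v.keys())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_u * norm_v)


doc = to_vector("vector space model")
query = to_vector("vector model")
similarity = cosine(doc, query)  # 2 / (sqrt(3) * sqrt(2)), roughly 0.816
print(similarity)
```

Because the measure depends only on the angle between the vectors, a long
document and a short query can still be highly similar when they use the same
terms in similar proportions.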
The performance of a vector-based retrieval model can be improved by
utilizing user-supplied information about which documents are relevant to the
query in question (Kita, Oct 1, 2000).
Vaibhav Kant Singh and Vinay Kumar Singh (Vaibhav Kant Singh, 2015) describe
the vector space model for information retrieval. The VSM guides the user
toward the documents that are more similar and more significant by calculating
the angle between the query and the terms or the documents. Here documents are
represented as term-vectors
d = (t1, t2,