Biological Model for Information Retrieval
Girgis, Moheb R., Aly, Abd El-Mgeid A., Abdel Latef, Bahgat A., El-Gamil, Boutros R., Journal of Digital Information Management
ABSTRACT: Heuristic methods based on biological aspects have been successfully used in many areas of computer science research, including information retrieval (IR). This leads us to ask why there is no biological environment for the problem of information retrieval. In this paper, we introduce a new biological environment and model for information retrieval that broadens the modeling concept from the mathematical formulas that simulate the basic elements of the IR problem to their dual schemas in biology. It is a new way of thinking about, and dealing with, the problem of information retrieval.
Categories and Subject Descriptors
H.3.3 [Information Search and Retrieval]; I.6 [Modeling and Simulation] Multimedia Databases: J.3 [Life and Medical Sciences]
General Terms: Information Retrieval, Biological modeling, Genetics application, Mathematical model, Simulation
Keywords: Information retrieval, Biological environment, Biological model, Genetic algorithms
Probabilistic methods are relatively recent in computer science, but their range of applications has grown rapidly in many research areas, including information retrieval. Genetic algorithms (GAs) in particular have been used by many researchers to solve IR problems. In this vein, the main directions concern modifying the document indexing (Gordon ; Blair ), the clustering problem (Raghavan and Agarval, ; Gordon ) and improving query formulation (Yang et al. ; Perty et al. ; Chen ; Horng and Yeh ). There have also been experiments on improving the GA operators applied to IR (Vrajitoru ; Vrajitoru ).
[FIGURE 1 OMITTED]
However, every time we adapt a heuristic method to an IR problem, we force the method to suit the task at hand. This paper introduces a new approach to the problem of information retrieval: it builds a new IR environment based on biological aspects. The model uses the similarity between textual material and the biological chromosome. From this starting point, we move, step by step, through document representation, term indexing, term classification, and search strategy. Finally, we test two retrieval models on our new biological IR schema.
The paper is organized as follows: Section 2 shows the similarity between the textual document and the biological chromosome. Section 3 describes the biological environment phases we adopt in our schema. Section 4 describes query manipulation and section 5 describes the search strategy of our system. Our biological model for IR is introduced in section 6. Experiments are given in section 7. In section 8, we compare our model with two probabilistic IR models: the vector space model (VSM) with new weightings, and the Okapi model. In section 9, we introduce some modifications to our model to improve its performance.
2. Similarity between Biological Chromosome and Textual Document
Comparing the structure and function of the biological chromosome and the textual document reveals a strong similarity between them. Just as the chromosome consists of a series of nucleotides, the document consists of a series of tokens (words). Likewise, a subseries of nucleotides within the chromosome that has some known function, called a gene, corresponds to the group of tokens that we extract from the document to represent it within the IR system, called the indexing language. Figure 1 shows the similarity between the two components.
Looking more closely at the structure of both, we find that just as the chromosome consists of four types of nucleotides (thymine, cytosine, adenine, guanine), the document consists of four basic types of tokens (lower-case, capitals, abbreviations, and digits). Figure 2 shows this duality between chromosome and document.
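The four-way token typing can be illustrated with a minimal Python sketch. The text names the four types but does not define how a token is assigned to one, so the rules below are assumptions made only for illustration: a token made entirely of digits is a digit token, an all-uppercase token of two or more letters is treated as an abbreviation, a token starting with an uppercase letter is a capital, and everything else is lower-case. The function names `tokenize`, `classify_token`, and `type_sequence` are hypothetical and do not come from the paper.

```python
import re

def tokenize(text):
    """Split text into alphabetic and numeric tokens (assumed tokenisation)."""
    return re.findall(r"[A-Za-z]+|\d+", text)

def classify_token(token):
    """Assign one of the four token types named in the text.

    The decision rules are illustrative assumptions, not the authors'
    definitions.
    """
    if token.isdigit():
        return "digit"
    if len(token) > 1 and token.isupper():
        # Assumption: all-uppercase words such as "IR" or "VSM" are abbreviations.
        return "abbreviation"
    if token[0].isupper():
        return "capital"
    return "lower-case"

def type_sequence(text):
    """Return the document as a sequence of (token, type) pairs,
    analogous to reading a chromosome as a sequence of nucleotides."""
    return [(tok, classify_token(tok)) for tok in tokenize(text)]

if __name__ == "__main__":
    for tok, kind in type_sequence("The VSM model ranked 25 documents"):
        print(f"{tok}\t{kind}")
```

Under these assumed rules a document becomes a string over a four-letter alphabet, which is the sense in which it mirrors the nucleotide sequence of a chromosome.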
3. Biological Environment Phases
3.1 Text Analysis and Gene Extraction
In order to extract the indexing language (gene) of the document, we follow the steps given below: