Academic journal article Educational Technology & Society

Designing a Syntax-Based Retrieval System for Supporting Language Learning

Article excerpt

Introduction

The need for well-designed search engines is increasing dramatically because of the growing amount of data in the real world, such as that on web pages, in standard text corpora, and in movie databases. In addition to searching massive amounts of data effectively, recent search engines feature flexible query languages that support a wider variety of targeted search items. Compared with traditional text retrieval systems, which use keywords as their basic query symbols, a system that offers regular expressions as one of its query languages seems better suited to meeting this requirement.

The main purpose of the syntax-based text retrieval system is to support grammatical querying of tagged corpora for language learners and teachers. While syntax-based queries bring improvements to general-purpose querying of massive amounts of data, the power of regular expressions provides even further advantages when the users are language learners or teachers whose purpose is to find examples of specific types of language usage. Consider the following example.

Example 1: In ESL teaching, it is important to teach learners the particular forms that certain verbs require of their complements. For example, keep requires a following verb in its -ing form (keep trying, but not keep to try). If we want to find this form in a text corpus, traditional text retrieval systems offer no good solution; they can only scan all documents in the collection that contain the keyword keep. The results are therefore mixed, including cases such as keep in touch and noun phrases followed by keep. This is because traditional text retrieval systems build their indices, usually inverted indices, by first extracting useful keywords and then using those keywords as index terms. A regular expression, by contrast, can easily express such a pattern. The regex (regular expression) below shows one possible way to describe it.

keep\s+\w+ing
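As an illustration, the pattern from Example 1 can be tried on a few sentences. This is a minimal sketch using Python's standard re module; the sample sentences, the function name find_keep_ing, and the added \b word-boundary anchors are assumptions made here for illustration and are not part of the original article.

```python
import re

# Pattern from Example 1: "keep", whitespace, then a word ending in "ing".
# The \b anchors are an assumption added here so that the match starts and
# ends on word boundaries.
KEEP_ING = re.compile(r"\bkeep\s+\w+ing\b")

# Hypothetical sample sentences, not taken from any corpus.
sentences = [
    "You should keep trying until it works.",
    "Please keep in touch after the course.",
    "They keep to try new methods.",
    "We keep   practicing every day.",
]

def find_keep_ing(texts):
    """Return (sentence, matched phrase) pairs for sentences containing the pattern."""
    hits = []
    for text in texts:
        match = KEEP_ING.search(text)
        if match:
            hits.append((text, match.group(0)))
    return hits

for sentence, phrase in find_keep_ing(sentences):
    print(f"{phrase!r} in: {sentence}")
```

Only the first and last sentences match; keep in touch and keep to try are rejected. Because the pattern works purely on character strings, it would also accept a phrase such as keep something, in which the word after keep is not a verb.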

Even though regular expressions allow more flexible querying, they create a serious problem in terms of search response time. For example, for a text collection of 1 million documents averaging 1,000 words each, matching the sample query pattern above against the raw text would take an unacceptably long time, say, a couple of hours. Without further processing, the only way to find the pattern is to scan every document in the collection one by one.
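The sequential scan described above can be sketched as follows. This is an illustrative sketch only: the function name sequential_scan and the toy documents are hypothetical, and no timing is claimed. The final lines simply work out the size of the collection mentioned in the text (1 million documents of 1,000 words each).

```python
import re

def sequential_scan(pattern, documents):
    """Match `pattern` against every document; return (matching ids, documents examined)."""
    regex = re.compile(pattern)
    matching_ids = []
    examined = 0
    for doc_id, text in enumerate(documents):
        examined += 1  # every document is read, whether or not it can match
        if regex.search(text):
            matching_ids.append(doc_id)
    return matching_ids, examined

# Toy collection (hypothetical); the collection in the text has 1,000,000 documents.
documents = [
    "keep trying and you will improve",
    "the weather was fine yesterday",
    "stay in touch with your teacher",
    "students keep asking good questions",
]

ids, examined = sequential_scan(r"keep\s+\w+ing", documents)
print(ids, examined)

# Size of the collection described in the text: documents x average words each.
total_words = 1_000_000 * 1_000
print(total_words)
```

Without an index, the number of documents examined always equals the size of the collection, which for the collection in the text means reading on the order of a billion words per query.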

Several proposals have been made to solve this search time problem. Most of them use k-gram indexing and build efficient index structures for fast lookup of index terms. The index terms extracted from a text collection are all character sequences of length k in each data unit. Once the index terms are extracted, a system can identify the data unit positions of each index term by constructing either suffix trees, using the technique described by Baeza-Yates and Gonnet (1996), or inverted indices (Baeza-Yates and Navarro, 2004), the most commonly used index structure for k-gram indexing. With the index, the search engine needs to scan only the documents or other data units that contain the specific target strings in the regular expression query. For example, if a system extracts k-gram index terms for k from 3 to 10, then the query in Example 1 needs to be matched only at the data unit positions of the index terms "keep" and "ing" rather than against the whole text collection. This simple idea can substantially reduce search time. Each system chooses its own range of k according to its purposes or constraints, such as limited secondary memory. In this paper, the main approaches to building the regex search engine are based on Cho and Rajagopalan (2002). These approaches keep index storage space to a minimum and provide sufficiently short search response times for most regex queries.
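The k-gram filtering idea can be sketched with a small in-memory inverted index. This is a simplified illustration under stated assumptions, not the method of Cho and Rajagopalan (2002): the function names build_kgram_index and indexed_search are hypothetical, whole documents serve as data units, the index records document ids rather than positions, and the required literal strings ("keep" and "ing") are supplied by hand instead of being derived from the regex automatically.

```python
import re
from collections import defaultdict

def build_kgram_index(documents, k_min=3, k_max=10):
    """Map each character k-gram (k_min <= k <= k_max) to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(documents):
        for k in range(k_min, k_max + 1):
            for start in range(len(text) - k + 1):
                index[text[start:start + k]].add(doc_id)
    return index

def indexed_search(pattern, required_literals, documents, index):
    """Run the regex only on documents that contain every required literal string."""
    candidates = set(range(len(documents)))
    for literal in required_literals:
        # Assumption: each literal's length lies within the indexed range of k.
        candidates &= index.get(literal, set())
    regex = re.compile(pattern)
    matches = sorted(d for d in candidates if regex.search(documents[d]))
    return matches, len(candidates)

# Toy collection (hypothetical).
documents = [
    "keep trying and you will improve",
    "the weather was fine yesterday",
    "please keep in touch with your teacher",
    "students keep asking good questions",
    "she was singing while they keep quiet",
]

index = build_kgram_index(documents)
matches, candidate_count = indexed_search(
    r"keep\s+\w+ing", ["keep", "ing"], documents, index
)
print(matches, candidate_count)
```

In this toy collection the index narrows the five documents to three candidates, and the regex then confirms two of them. The last document contains both "keep" and "ing" but not in the required arrangement, which shows why the regex must still be run on each candidate. Storing every k-gram for k from 3 to 10 makes the index large, which is the storage concern the text mentions when discussing the choice of k.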

Example 1 can be implemented in any regex-enabled search engine. What if, however, we want to find a syntactic pattern that does not contain any specific string such as ing? …
