Academic journal article Journal of Digital Information Management

Framework for Mixed Entity Resolving System Using Unsupervised Clustering

Article excerpt

1. Introduction

According to recent U.S. Census Bureau reports, about 30% of queries include person names. However, since 100 million persons share only about 90,000 person names, a search result is often a mixture of web pages about different people whose names are spelled the same. In general, this problem is known as the Mixed Entity Resolution Problem for named entity search tasks on the Web (D. Lee, 2005). To demonstrate the need for a solution to mixed entities, we present a real case drawn from Google, shown in Figure 1. The search result contains a mixture of web pages about a professor at CMU, an actor, a hockey player, a historian, a jazz guitarist, and others, all of whom share the name Tom Mitchell. There are 37 different Tom Mitchells among the top 100 ranked web pages, as illustrated in Table 1. Furthermore, mixed entities commonly occur on the Web when we search for information about a product by name. For instance, a user who searches for a product name such as Oracle also finds web pages about Oracle Database, Oracle Audio, Oracle Academy, and so forth.

Unlike traditional search engines, we therefore focus on developing an effective system that identifies mixed entities, such as person or product names, in a Web query and then displays the query result as ranked groups. Each group contains URL links and corresponds to one of the different entities that share the same description.

However, resolving mixed entities is non-trivial because of the following four challenges. First, since the number of clusters within the top-k ranked web pages is not given a priori, we cannot take advantage of clustering schemes that need the number of clusters as input (referred to here as supervised clustering), such as K-means and K-spectral clustering. Second, skewed cluster sizes make it difficult to group web pages correctly. Third, clustering must run fast enough that users are not kept waiting a long time for search results. Finally, the resulting set of clusters needs to be re-ranked. For instance, consider the name data set of Tom Mitchell in Table 1, where we observed that 92 top ranked web pages are grouped into 37 clusters: CMU professor, hockey player, historian, and so on. In the next step, the 37 clusters should be ranked in a certain order. In other words, the CMU professor cluster should be ranked first, the historian cluster next, and so on. To cope with these challenges, we develop an effective framework for resolving mixed entities on the Web. In this paper, we propose a web-service-based interface, an unsupervised clustering technique, and several cluster ranking schemes. The system outline is shown in Figure 2. In particular, we devise an unsupervised clustering technique using similarity propagation.
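
To make the first and last challenges concrete, the following is a minimal sketch of grouping pages without knowing the number of clusters in advance and then ranking the resulting groups. It is not the similarity propagation technique proposed in this paper; it swaps in plain single-link threshold clustering over Jaccard similarity for illustration. The token-set page representation, the function names (jaccard, cluster_pages, rank_clusters), the threshold of 0.3, and the toy data are all assumptions.

```python
from typing import Dict, List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity between two token sets (0.0 if both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster_pages(pages: List[Set[str]], threshold: float = 0.3) -> List[List[int]]:
    """Group page indices by single-link threshold clustering.

    The number of clusters is not an input; it emerges from the
    (assumed) similarity threshold.
    """
    parent = list(range(len(pages)))

    def find(i: int) -> int:
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge every pair of pages whose similarity reaches the threshold.
    for i in range(len(pages)):
        for j in range(i + 1, len(pages)):
            if jaccard(pages[i], pages[j]) >= threshold:
                parent[find(j)] = find(i)

    groups: Dict[int, List[int]] = {}
    for i in range(len(pages)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())


def rank_clusters(clusters: List[List[int]]) -> List[List[int]]:
    """Order clusters by the best (lowest) original search rank they contain."""
    return sorted(clusters, key=min)


# Toy pages in search-rank order (index 0 is the top result); hypothetical data.
pages = [
    {"cmu", "professor", "machine", "learning"},
    {"hockey", "player", "nhl"},
    {"professor", "machine", "learning", "carnegie"},
    {"historian", "book", "history"},
    {"hockey", "nhl", "defenseman"},
]

ranked = rank_clusters(cluster_pages(pages))
print(ranked)
```

Ranking each cluster by its best-ranked member is only one possible scheme; it keeps the group containing the search engine's top result first, which matches the expectation that the most prominent entity should lead the list.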

The rest of this paper is organized as follows. We first formally define the mixed entity resolution problem. Then, we give an overview of our framework, followed by a discussion of our main ideas. Preliminary experimental results with name data sets are described next. Finally, we close with some discussion and a conclusion.

[FIGURE 1 OMITTED]

2. Mixed Entity Resolution

As a proof of concept, consider two common motivating examples:

Example 1. An applicant A applies for a job opening at a company. The search committee members would like to learn more about A than his resume reveals, so they search for him on Google. When they query A's name, Google retrieves web pages related to A. Unfortunately, there are also many web pages about different people who share A's name. In addition, since A is not a celebrity, his web pages are ranked far below the top of the search results. If we suppose that each Google result page contains 10 links and the links to A's actual web pages are ranked between the 90th and 100th positions, the committee members need to visit ten result pages to find A's actual web pages. …
