Academic journal article Journal of Social Structure

Choosing a Clustering: An A Posteriori Method for Social Networks

Academic journal article Journal of Social Structure

Choosing a Clustering: An A Posteriori Method for Social Networks

Article excerpt

Introduction

In his comment on Handcock, Raftery and Tantrum's 2007 paper, sociologist David Krackhardt characterized cluster analysis of social networks as follows:

Cleanly identifying clusters of actors in a social system on the basis of their social ties is an age old pursuit of generations ... UCINET, the most commonly used package for analysis of network data, has 20 distinct methods for finding clusters or groups, each with a plethora of suboptions and choices of parameter which, depending on the data, may yield wildly differing results. This dizzying array of 'solutions' begs the central question: given the observed data, what is the right number of clusters and what is their composition?1

Krackhardt's comment identifies the key problem that network researchers face in choosing an appropriate clustering method for a particular data set: determining what type of structure each method will pick out in a particular data set. The fruitful literature of cluster analysis provides plenty of methods from which to choose but describes and distinguishes them primarily by their under-the-hood operation rather than by their qualitative results. One might hope that understanding the fine details of a clustering algorithm would lead to an understanding of what kind of cluster structures it will recognize. However, clustering algorithms are often based on heuristics and justified by practical performance considerations rather than by formally proven guarantees due to the difficulty of clustering problems (Ovelgönne and Geyer-Schulz, 2012). While an algorithm of this type may have been shown to do well in certain situations, the full range of cases where it performs best and worst may not be known definitively even by its developers. Furthermore, constraints on time and energy make it difficult even to develop a practical working knowledge of each of the many algorithms and its particular parameters and idiosyncrasies.

Reviews of clustering from the network literature do provide a helpful general taxonomy for clustering algorithms, which the decision tree shown in Figure l attempts to summarize. The series of questions represented highlights key attributes of clustering results that are well-known a priori and allows a researcher who plans to use clustering to eliminate wide class of inappropriate methods immediately. First, a researcher should distinguish between local clustering methods that classify only a few nodes or actors of interest (usually within a graph so large that clustering every point is computationally infeasible) from global clusterings that decompose an entire network into subsections (Schaeffer, 2007). Next, the researcher should define his or her concept of a cluster more clearly by asking several further questions (SAS Institute, 1990): Can a node belong to several clusters or only one? Should cluster membership be viewed as a stochastic process or as a fixed labeling? Can clusters have subclusters or not? Finally, the researcher should make some preliminary decisions about the type of similarity or homogeneity he or she wishes to see in members of an individual cluster. Two (among many) possible clustering rationales are mentioned in the diagram: positional analysis, in which clusters correspond to distinct social positions and co-membership in a cluster indicates a form of structural equivalence (Scott, 2000) or at least a similar tie pattern (Gest, Moody and Rulison, 2007), and community detection, in which clusters represent subgroups of actors who tend to have more ties to one another than to other actors in the network (Fortunato, 2010).

Note that the diagram in Figure 1 could reasonably be expanded at most leaves. For example, input network size and available computational resources could be considered since many clustering algorithms become computationally intractable with sufficiently large networks (Schaeffer, 2007). Some methods are also specific to undirected graphs (Newman, 2006), while others are more appropriate for graphs with valued and/or directed ties, or for networks with attribute data for their actors (Schaeffer, 2007). …

Search by... Author
Show... All Results Primary Sources Peer-reviewed

Oops!

An unknown error has occurred. Please click the button below to reload the page. If the problem persists, please try again in a little while.