Web-based social media systems such as blogs, wikis, media-sharing sites, and message forums have become an important new way to transmit information, engage in discussions, and form communities on the Internet. Their reach and impact are significant, with tens of millions of people around the world providing content on a regular basis. Recent estimates suggest that social media systems are responsible for as much as one third of new web content. Corporations, traditional media companies, governments, and nongovernmental organizations (NGOs) are working to understand how to adapt to them and use them effectively. Citizens, both young and old, are also discovering how social media technology can improve their lives and give them more voice in the world. We must better understand the information ecology of these new publication methods in order to make them and the information they provide more useful, trustworthy, and reliable.
The blogosphere is part of the web and therefore shares most of its general characteristics. It differs, however, in ways that affect how it should be modeled, analyzed, and exploited. The common model for the general web is a directed graph of web pages with undifferentiated links between pages. The blogosphere has a much richer network structure, with more types of nodes and more kinds of relations among them (figure 1). For example, the people who contribute to blogs and author blog posts form a social network with their peers, which can be induced from the links between blogs. The blogs themselves form a graph, with direct links to other blogs through blog rolls and indirect links through their posts. Blog posts are linked to their host blogs and typically to other blog posts and web resources as part of their content. A typical blog post has a set of comments that link back to the people and blogs associated with them. Finally, the blogosphere trackback protocol generates implicit links between blog posts. Still more detail can be added by taking into account post tags and categories, syndication feeds, and semistructured metadata in the form of Extensible Markup Language (XML) and Resource Description Framework (RDF) content.
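One way to make this richer structure concrete is a graph whose nodes and edges both carry a type. The following is a minimal sketch in Python, not a description of any particular system: the class names (`Node`, `BlogGraph`), the relation labels ("authors", "hosted_by", "blogroll", "links_to"), and the example identifiers are all assumptions chosen for illustration. It shows how indirect blog-to-blog links can be induced from links between posts.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """A typed node; `kind` is a hypothetical label such as 'person', 'blog', or 'post'."""
    kind: str
    ident: str


class BlogGraph:
    """Directed graph with typed nodes and labeled edges (illustrative only)."""

    def __init__(self):
        # Maps a source node to a list of (relation, destination) pairs.
        self._out = defaultdict(list)

    def add_edge(self, src, relation, dst):
        self._out[src].append((relation, dst))

    def neighbors(self, src, relation=None):
        """Destinations reachable from `src`, optionally filtered by relation."""
        return [dst for rel, dst in self._out.get(src, [])
                if relation is None or rel == relation]

    def induced_blog_links(self):
        """Blog-to-blog pairs implied by post-to-post 'links_to' edges."""
        # First pass: record which blog hosts each post.
        host = {}
        for src, edges in self._out.items():
            for rel, dst in edges:
                if rel == "hosted_by":
                    host[src] = dst
        # Second pass: lift post links to the blogs that host the posts.
        pairs = set()
        for src, edges in self._out.items():
            for rel, dst in edges:
                if (rel == "links_to" and src in host and dst in host
                        and host[src] != host[dst]):
                    pairs.add((host[src].ident, host[dst].ident))
        return pairs


# A tiny hypothetical blogosphere: two authors, two blogs, one post each.
alice, bob = Node("person", "alice"), Node("person", "bob")
blog_a, blog_b = Node("blog", "a.example"), Node("blog", "b.example")
post_a1, post_b1 = Node("post", "a.example/1"), Node("post", "b.example/1")

g = BlogGraph()
g.add_edge(alice, "authors", post_a1)
g.add_edge(bob, "authors", post_b1)
g.add_edge(post_a1, "hosted_by", blog_a)
g.add_edge(post_b1, "hosted_by", blog_b)
g.add_edge(blog_a, "blogroll", blog_b)    # direct link through a blog roll
g.add_edge(post_a1, "links_to", post_b1)  # indirect link through a post

print(g.induced_blog_links())
```

Keeping the relation label on every edge lets the same structure answer both web-style questions (which pages link to which) and blogosphere-specific ones (which blogs are connected through their posts, comments, or blog rolls).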
In this article, we discuss our ongoing research in modeling the blogosphere and extracting useful information from it. We begin by describing an overarching task of discovering which blogs and bloggers are most influential within a community or about a topic. Pursuing this task uncovers a number of problems that must be addressed, three of which we describe in more detail. The first is recognizing spam in the form of spam blogs (splogs) and spam comments. The second is developing more effective techniques to recognize the social structure of blog communities. The third involves devising a better abstract model for the underlying blog network structure and how it evolves.
[FIGURE 1 OMITTED]
Modeling Influence in the Blogosphere
The blogosphere provides an interesting opportunity to study online social interactions, including the spread of information, opinion formation, and influence. Through original content and sometimes through commentary on topics of current interest, bloggers influence each other and their audience. We are working to study and characterize these social interactions by modeling the blogosphere and providing novel algorithms for analyzing social media content. Figure 2 shows a hypothetical blog graph and its corresponding flow of information in the influence graph.
Studies on influence in social networks and collaboration graphs have typically focused on the task of identifying key individuals who play an important role in propagating information. This is similar to finding authoritative pages on the web. Epidemic-based models like linear threshold and cascade models (Kempe, Kleinberg, and Tardos 2003 and 2005; Leskovec et al. 2007) have been used to find a small set of individuals who are most influential in a social network. …
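To illustrate the linear threshold model mentioned above, the following is a minimal sketch in Python under stated assumptions. In the usual formulation each node draws its threshold at random; here the thresholds are fixed by hand so that the run is reproducible. The function names (`linear_threshold`, `best_single_seed`), the toy graph, and its weights are hypothetical and are not taken from the cited work.

```python
def linear_threshold(in_weights, thresholds, seeds):
    """Return the set of nodes that are active once the spread stops.

    in_weights[v] maps each in-neighbor u of v to the weight of edge u -> v.
    A node becomes active when the total weight of its active in-neighbors
    reaches its threshold. Activation is monotone, so repeating passes until
    nothing changes reaches the same final set regardless of update order.
    """
    active = set(seeds)
    changed = True
    while changed:
        changed = False
        for v, incoming in in_weights.items():
            if v in active:
                continue
            total = sum(w for u, w in incoming.items() if u in active)
            if total >= thresholds[v]:
                active.add(v)
                changed = True
    return active


def best_single_seed(in_weights, thresholds):
    """Pick the single seed whose spread is largest (brute force, toy scale only)."""
    nodes = set(in_weights)
    for incoming in in_weights.values():
        nodes.update(incoming)
    return max(sorted(nodes),
               key=lambda n: len(linear_threshold(in_weights, thresholds, {n})))


# Hypothetical influence graph: a -> b, a -> c, b -> c, c -> d.
in_weights = {
    "b": {"a": 0.6},
    "c": {"a": 0.3, "b": 0.3},
    "d": {"c": 0.4},
}
thresholds = {"b": 0.5, "c": 0.5, "d": 0.5}  # fixed here; usually drawn at random

spread = linear_threshold(in_weights, thresholds, {"a"})
print(sorted(spread))
print(best_single_seed(in_weights, thresholds))
```

Seeding the process at "a" activates "b" directly and then "c" through the combined weight of "a" and "b", while "d" stays inactive because its only in-neighbor contributes less than its threshold. Choosing a larger seed set by brute force quickly becomes infeasible as the graph grows.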