Much progress has been made in recent years in several areas within natural language processing. However, so far there has been less work related to microtext (for example, instant messaging, transcribed speech, and microblogs such as Twitter and Facebook). Microtext is made up of semistructured pieces of text that are distinguished by their brevity, informality, idiosyncratic lexicon, nonstandard grammar, misspelling, use of emoticons, and sometimes simultaneous interwoven conversation. These characteristics make microtext challenging to analyze. Most existing tools are trained on properly spelled and well-punctuated corpora, and therefore have problems correctly tagging and parsing microtext.
The 15 presentations focused on a broad range of microtext data sources: chat from online games, microblogs from Twitter, Facebook posts, and SMS communications. Themes included creating a part-of-speech tagger for Twitter; sentiment extraction from tweets; gender and author detection in short noisy text; personality trait identification based on language used in social media; clustering of microtext by topic; and detection of hedging and its relationship to gender, among many others. In addition to the contributed presentations and posters, the symposium included two invited talks from leading researchers in microtext and social network analysis. Noah Smith (Carnegie Mellon University) spoke about the challenges and novelties of tagging and parsing microtext. His talk highlighted the need to reconsider what we call "noise" in data. For example, numerous abbreviations such as "SMH" (shake my head) and "OMG" that would be considered noise in some types of text are important "parts of speech" in Twitter and even warrant their own tag! Sofus Macskassy (Facebook) spoke about discovering Twitter users' topics of interest by examining the entities they mention in their tweets as well as various types of tweeting behavior (social banter versus event-based tweets). He also discussed an approach that leverages Wikipedia to disambiguate and categorize the entities in the tweets.
The symposium also included an invited panel of prominent researchers in the area of microtext that was augmented by lively audience participation. The topic of the panel was the future of microtext. The panelists included Susan Herring (Indiana University), Bernardo Huberman (HP Labs), Rachel Greenstadt (Drexel University), and Alek Kolcz (Twitter). The panel included representatives from both academe and industry to give a fuller, more rounded perspective on the topic. Among the topics discussed were questions regarding the importance of analyzing microtext not just for research, but also for business. The discussion touched upon how improving tools for dealing with microtext can help inform the business intelligence technologies of tomorrow. From a research perspective, we asked questions such as: What defines microtext? Is social interactivity required (for example, the ability to comment or retweet), or can any news headline be considered microtext just because it is short? When is microtext too long to be considered microtext? These questions were also echoed in the brilliant and engaging plenary talk by Doug Oard (University of Maryland).
An important question that emerged regarding the future of microtext research is whether microtext should be merged with other domains. This discussion led to the observation that microtext research is presently fragmented across several research communities. Fostering interaction among this fragmented community is a challenge. There was support for the idea of associating the microtext symposium with various conferences, as opposed to aligning it with any specific conference. This would allow for maximal cross-pollination of ideas and ensure that the research in this domain is informed by various relevant disciplines.
David C. …