In the early days of web search engines, the basic idea of building a database was fairly simple. Send a spider out to crawl the web, index the words of each webpage, and then follow all the links to find more pages. The harder part was figuring out a useful ranking for results--and the early relevance ranking often missed its mark.
In those days, many librarians wished for better indexing data beyond just a straight full-text index. We were used to well-structured bibliographic databases, with standard citation information about the document--such as author, title, source, and date. The only fields in early web searching were the URL and the HTML title.
Relevance has improved substantially since the search engines' early days. In an effort to make yet more improvements, Google is now (finally) moving into reading more structured data from webpages. With an authorship initiative and a growing use of rich snippets, Google is taking small steps toward relying on a more semantic web.
THE METATAG DEBACLE
One early failure that illustrates the problems with structured data on the web occurred back in the mid-1990s. The early search engine AltaVista introduced the idea of using metatags in the header of a webpage to describe its content more accurately. In particular, the meta keywords and meta description tags were designed to be used like the author-supplied keywords and the abstract sections of scholarly articles. Such an approach worked well in the scholarly literature for years, so the reasoning was that if webpage creators would describe their own webpages with descriptions and keywords, it could help search engines.
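A minimal sketch of what such author-supplied metadata looked like in a page's header (the title, keywords, and description here are invented for illustration):

```html
<head>
  <title>Example Article</title>
  <!-- Hypothetical author-supplied metatags of the kind described above -->
  <meta name="keywords" content="search engines, indexing, relevance ranking">
  <meta name="description" content="A short, author-written abstract of the page's content.">
</head>
```

In principle, a search engine could index the keywords field and display the description as the result snippet, just as abstracting services did with scholarly articles.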
Unfortunately, the web turned out to be a very different and much more commercial environment than the scholarly world. Because metatags reside in the header of a webpage, their content is not visible to most human viewers (unless you choose to view the source of the webpage). With companies making money on the web based on how well their webpages ranked in search results, the incentive to misuse the tags proved overwhelming.
Within a few years, most search engines no longer used the meta keywords tag for indexing. The vast majority of webpages using the metatag were simply stuffing the field with popular keywords in the hopes of attracting more traffic rather than using the keywords to accurately represent the content. The failure of the meta keywords tag became a lesson in the problem of relying on webpage builders to create honest and accurate descriptions of their own content.
INCREASED META INTEREST
While metatags continue to be used for other functions, the idea of having creators tag their own content was abandoned by search engines for many years. Meanwhile, librarians and others were using metatags successfully with metadata standards, such as Dublin Core, to accurately represent the content, but this use was so small compared to the vast number of webpages misusing metatags that it did not move the major web search engines to index them or make them searchable. [On Sept. 19, 2012, after Greg wrote this column, Rudy Galfi, product manager of Google News, announced the "newly hatched way" for publishers to add metatags to news stories. The news_keywords metatag encourages writers to add descriptive terms that might not actually appear in the story. Searchers will not see these tags, as they are part of the page's HTML code (http://googlenewsblog.blogspot.com/2012/09/a-newly-hatched-way-to-tag-your-news.html).--Ed.]
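Based on the announcement, the news_keywords metatag follows the same header convention as the earlier metatags (the keyword values here are invented for illustration):

```html
<!-- Hypothetical use of Google News' news_keywords metatag in a story's header -->
<meta name="news_keywords" content="election results, voter turnout, exit polls">
```

As with meta keywords and description, the terms are invisible to readers of the rendered page; the difference is that this tag is aimed specifically at Google News rather than general web ranking.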
In 2001, Tim Berners-Lee co-authored an article in Scientific American describing his hopes for the development of the semantic web that could bring "structure to the meaningful content of Webpages" (Tim Berners-Lee, James Hendler, and Ora Lassila, "The Semantic Web," May 2001, pp. 29-37). The idea was to create more structured documents on the web where the structured elements can be read by software to eventually create a "web of data. …