Academic journal article Washington Law Review

How Copyright Law Can Fix Artificial Intelligence's Implicit Bias Problem

Article excerpt


In 2013, Google announced the release of word2vec, a toolkit capable of representing how words are used in relation to one another so as to better understand their meanings.1 Word2vec can recognize that Beijing is to China in the same way as Warsaw is to Poland, as capital and country, but not in the same way as Paris relates to Germany.2 This technique, called "word embedding,"3 plays a role in many downstream uses of artificial intelligence (AI) tasks; Google uses it to improve its search engine, image recognition, and email auto-response tools.4 Since its launch, word2vec has become one of the most popular embedding models.5
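The analogy property described above can be sketched with vector arithmetic. The toy three-dimensional vectors below are hypothetical stand-ins chosen by hand for illustration; real word2vec embeddings have hundreds of dimensions and are learned from a large corpus.

```python
import numpy as np

# Hand-made toy "embeddings" illustrating the analogy property;
# the values are hypothetical, not actual word2vec output.
vectors = {
    "beijing": np.array([0.9, 0.1, 0.8]),
    "china":   np.array([0.1, 0.1, 0.8]),
    "warsaw":  np.array([0.9, 0.7, 0.2]),
    "poland":  np.array([0.1, 0.7, 0.2]),
    "paris":   np.array([0.9, 0.4, 0.5]),
    "germany": np.array([0.1, 0.9, 0.6]),
}

def cosine(a, b):
    # Cosine similarity: 1.0 means the vectors point the same way.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# "Beijing is to China as Warsaw is to ___?" -- add the capital-to-country
# offset (china - beijing) to warsaw, then find the nearest other word.
query = vectors["warsaw"] + (vectors["china"] - vectors["beijing"])
best = max((w for w in vectors if w != "warsaw"),
           key=lambda w: cosine(query, vectors[w]))
print(best)  # -> poland
```

With these toy values the offset lands exactly on "poland," which is the point: embeddings encode relationships (capital-of, in this case) as consistent directions in the vector space.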

There is a significant problem with word2vec: it is sexist. More specifically, word2vec reflects the gendered bias embedded in the Google News corpus used to train it.6 In 2016, researchers from Boston University and Microsoft Research New England uncovered that word2vec was riddled with gender bias, exemplified by a particularly noteworthy word embedding projecting that man is to computer programmer in the same way that woman is to homemaker.7 Word embeddings are used in many downstream AI tasks, including improving web search. Thus, if an underlying dataset reflects gendered bias, those biases would be reinforced and amplified by sexist search results that, for example, rank results for computer programmers with male-sounding names more highly than those with female-sounding names.8 "Due to their wide-spread usage as basic features," the researchers warned, "word embeddings not only reflect such stereotypes but can amplify them."9
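The bias-detection idea at work here can be sketched as a projection test: define a "gender direction" from paired words and measure how far other words lean along it. The vectors below are hypothetical toy values for illustration, not the researchers' actual data or code.

```python
import numpy as np

# Toy embeddings illustrating a gender-projection test; the values are
# hypothetical stand-ins, not actual word2vec output.
vectors = {
    "he":         np.array([ 1.0, 0.2, 0.1]),
    "she":        np.array([-1.0, 0.2, 0.1]),
    "programmer": np.array([ 0.6, 0.9, 0.3]),   # skewed toward "he"
    "homemaker":  np.array([-0.7, 0.4, 0.5]),   # skewed toward "she"
    "pencil":     np.array([ 0.0, 0.5, 0.8]),   # roughly neutral
}

# The "gender direction" is the normalized difference between paired words.
g = vectors["he"] - vectors["she"]
g = g / np.linalg.norm(g)

def gender_score(word):
    # Positive -> leans toward "he"; negative -> toward "she"; ~0 -> neutral.
    v = vectors[word]
    return float(np.dot(v / np.linalg.norm(v), g))

for w in ("programmer", "homemaker", "pencil"):
    print(w, round(gender_score(w), 2))
```

In a corpus free of gendered associations, occupation words would score near zero on this axis; the finding described above is that, in embeddings trained on the Google News corpus, words like "programmer" and "homemaker" instead sit far apart along it.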

AI systems are commonly "taught" by reading, viewing, and listening to copies of works created by humans. Many of those works are protectable by copyright law.10 Google, for example, negotiated with multiple global news agencies to license articles for Google News after the company was sued for copyright infringement.11 For Google, the articles used to create the Google News corpus, which were ultimately used to create word2vec, were easily available and legally low-risk.12

Although Google released the word2vec toolkit as open source, the underlying Google News corpus was not released at all.13 It is all but unimaginable that a researcher could hope to strike comparable licensing deals, even in a bid to create a less biased corpus. And without access to the underlying corpus, downstream researchers cannot examine whether a news outlet or journalist exhibits gender bias across multiple articles, nor could researchers supplement the corpus with data derived from additional, less biased works. Indeed, as the researchers who identified the biases embedded in the Google News corpus noted, locking up the dataset makes it "impracticable and even impossible . . . to reduce the [biased] stereotypes during the training of the word vectors."14

Even as our banks and our bosses,15 our cars and our courts16 increasingly adopt AI, bias remains a significant and complex problem.17 One source of bias in AI systems is, as exemplified by word2vec, data that reflect implicit bias. Indeed, as the Obama White House aptly identified in its whitepaper on AI, "AI needs good data. If the data is incomplete or biased, AI can exacerbate problems of bias."18 AI's largely homogenous community of creators, which skews toward white men, is another source of bias.19 Flawed algorithms can also contribute to bias, evident in Google search algorithms that featured Barbie as the lone woman in top image results for CEO,20 or served up ads implying the existence of criminal records when users searched for black-sounding names.21 Incomplete datasets are another common source of bias, particularly datasets that fail to reflect a diversity of facial features and skin tones.22 Commercial facial detection AI systems, for example, have been plagued with racial bias.23 In 2017, a mobile app called FaceApp introduced a "hot" photo editing feature that conflated attractiveness with whiteness by automatically lightening users' skin tones in photos, which the CEO attributed to "an unfortunate side-effect of the underlying neural network caused by training set bias. …
