
Common Crawl and AI Ethics

One of the datasets commonly used to train language AI models such as ChatGPT is Common Crawl, an open repository of web crawl data in more than 40 languages.

Datasets like this often carry massive embedded problems that reflect the systemic oppression of our world.

Oppressions such as imperialism, ableism, racism, and homophobia can all show up in these AI training datasets.

For example, most of the websites crawled by Common Crawl are hosted in the U.S. So even though it crawls content in more than 40 languages, the perspective is US-dominant.

There are also filters in place, such as banned-word lists, intended to prevent the models from learning those words.

However, these attempts to filter out words and expressions are often the very vehicles that exclude community-specific vernacular.
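
To make this concrete, here is a minimal sketch of the kind of naive blocklist filtering some web-text curation pipelines apply. It is not Common Crawl's actual pipeline, and the word list and documents are placeholders: the point is that a single flagged term drops the entire document, regardless of who wrote it or how the word is used.

```python
# Minimal sketch (not Common Crawl's actual filter): a naive blocklist
# filter of the kind some web-text curation pipelines use. Any document
# containing a "banned" word is discarded wholesale.

BANNED_WORDS = {"exampleword"}  # placeholder blocklist, not a real list


def keep_document(text: str) -> bool:
    """Return True if the document survives the blocklist filter."""
    tokens = {token.strip(".,!?\"'").lower() for token in text.split()}
    return tokens.isdisjoint(BANNED_WORDS)


documents = [
    "A news article with no flagged terms.",
    "A community blog post that reclaims exampleword as in-group vernacular.",
]

kept = [doc for doc in documents if keep_document(doc)]
# The second document is dropped even though the term is used
# self-referentially, which is how in-community language gets erased.
print(kept)
```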

For instance, those banned-word lists include words and expressions that BIPOC often use amongst ourselves, to describe ourselves, in our own vernacular.

Similarly, there are words that LGBTQIA people use amongst ourselves, to describe ourselves, for ourselves.

These models then end up missing the context, references, and language surrounding BIPOC culture, LGBTQIA culture, and the cultures of many other historically marginalized and excluded groups.

These are just a few of the many things to keep in mind when you decide whether and how to interact with these chatbots and conversational algorithms.

Published in AI Ethics, Systems of Oppression
