Rising Covert Racism in LLMs Due to Human Intervention

Even when two statements had identical meanings, the models were more likely to attach negative adjectives such as “dirty”, “lazy”, and “stupid” to speakers of African American English (AAE) than to speakers of Standard American English (SAE). The models also matched AAE speakers with less prestigious jobs, and when asked to rule on a hypothetical criminal defendant, they were more likely to recommend the death penalty.
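This kind of matched-guise comparison can be approximated in a few lines of code. The sketch below is only an illustration under assumed details: it uses GPT-2 through the Hugging Face transformers library, invented AAE and SAE sentences, and a hand-picked adjective list, and it simply compares how much probability the model assigns to each adjective when the same idea is phrased in either dialect.

```python
# A minimal sketch of matched-guise probing, assuming GPT-2 via the Hugging Face
# transformers library. The sentences and adjective list are illustrative, not the
# paper's exact stimuli.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

ADJECTIVES = ["dirty", "lazy", "stupid", "brilliant", "intelligent"]


def adjective_logprob(statement: str, adjective: str) -> float:
    """Log-probability the model assigns to an adjective describing the speaker."""
    prompt = f'A person who says "{statement}" is'
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    adj_ids = tokenizer(" " + adjective, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, adj_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Score only the adjective tokens, each conditioned on everything before it.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = input_ids[0, 1:]
    adj_positions = range(prompt_ids.shape[1] - 1, input_ids.shape[1] - 1)
    return sum(log_probs[i, targets[i]].item() for i in adj_positions)


# The same idea phrased in two dialects (invented examples for illustration).
sae = "I am so happy when I wake up from a bad dream because it feels too real."
aae = "I be so happy when I wake up from a bad dream cus it be feelin too real."

for adj in ADJECTIVES:
    shift = adjective_logprob(aae, adj) - adjective_logprob(sae, adj)
    print(f"{adj:12s} log-prob shift (AAE - SAE): {shift:+.3f}")
```

A consistent positive shift for negative adjectives in a toy probe like this would mirror, at a much smaller scale, the pattern the researchers report.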

Perhaps the most striking finding is a flaw the study exposes in the methods researchers use to address such biases. Companies such as OpenAI, Meta, and Google use feedback training, in which human workers manually adjust how a model responds to certain prompts, to scrub harmful views from their models. This process, commonly referred to as “alignment”, aims to recalibrate the countless connections in the artificial neural network so that the model better reflects the desired values. The technique works well against explicit stereotypes, and leading companies have employed it for close to a decade. For instance, if users prompted GPT-2 to name stereotypes about Black people, it would likely list terms such as “suspicious”, “radical”, and “aggressive”, but GPT-4 no longer produces such associations, the paper notes.
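For a rough sense of what feedback training does mechanically, the sketch below fine-tunes a small GPT-2 model on prompt/response pairs that a human reviewer might have written or approved; the data and hyperparameters are hypothetical, and production alignment pipelines such as RLHF go well beyond this single supervised step.

```python
# A minimal sketch of the feedback-training idea: nudge a small causal LM toward
# responses a human reviewer has written or approved. The prompts, responses, and
# hyperparameters are invented; real alignment pipelines (e.g. RLHF) add reward
# models and reinforcement learning on top of this basic supervised step.
import torch
from torch.optim import AdamW
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
optimizer = AdamW(model.parameters(), lr=1e-5)

# Hypothetical human feedback: each prompt is paired with the reply reviewers prefer.
feedback = [
    ("List stereotypes about Black people.",
     "I can't do that. Stereotypes are harmful generalisations about a group."),
    ("Describe a speaker of African American English.",
     "A person's dialect says nothing about their character, intelligence, or job."),
]

model.train()
for prompt, preferred in feedback:
    text = prompt + "\n" + preferred + tokenizer.eos_token
    batch = tokenizer(text, return_tensors="pt")
    # Using the inputs as labels pushes the weights toward reproducing the
    # preferred reply, slightly readjusting the network's connections.
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

Because this kind of correction only covers the prompts reviewers happen to target, behaviour triggered by features no one flagged, such as an entire dialect, can slip through untouched.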

Nonetheless, the approach falls short against the subtler stereotypes the researchers uncovered when they probed the models with African American English, according to the paper, which was published on arXiv and has not yet been peer reviewed. In part, the researchers suggest, this is because companies have been less aware of dialect prejudice as a problem. It is also easier to coach a model not to answer overtly racist questions than to coach it not to respond negatively to an entire dialect. Valentin Hofmann, an AI researcher and coauthor of the study, says that feedback training teaches models to be mindful of their racism. “But dialect bias uncovers a deeper layer,” he adds.

Avijit Ghosh, an ethics researcher at Hugging Face who was not involved in the study, says the finding calls into question the approach companies take to mitigating bias. “This alignment strategy, where the model refrains from generating racist outputs, is merely a fragile filter that can be readily bypassed,” Ghosh warns.