Why AI Systems Should Be Designed with Indigenous Language Communities, Not Just for Them
The Problem No One Wants to Name
Most large language models are overwhelmingly trained on English-language data. The next tier covers other major European languages, Mandarin, Japanese, and Arabic. Then comes a long tail of several thousand languages, many of them spoken by millions of people, that remain essentially invisible to the AI systems currently being deployed globally.
In a 2022 paper co-authored with Paul Akinmayowa Akin-Otiko, I examined this problem through the lens of Yoruba — one of the most widely spoken African languages, with roughly 47 million native speakers. Our argument: existing digital humanities tools built for African language studies were failing because they were designed for African language communities without being designed with them.
Communication Is Not Universal
One of the core assumptions embedded in most NLP pipelines is that communication works roughly the same way across languages. That assumption maps reasonably well onto Indo-European languages; Yoruba challenges it at multiple levels. Tone is lexically and grammatically significant: the same sequence of consonants and vowels can mean entirely different things depending on pitch contour. Yoruba also has a rich system of oral formulaic expression (proverbs, praise poetry, indirect speech) in which meaning is contextual and relational in ways that sentence-level tokenisation cannot capture.
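A concrete way to see the tokenisation problem: a common text-normalisation step in NLP pipelines is to strip combining diacritics before tokenising. Applied to Yoruba, that step erases tone marks (and under-dots), collapsing distinct words into one string. The sketch below uses the standard example of ọkọ (husband), ọkọ̀ (vehicle), and ọkọ́ (hoe); the function is illustrative, not drawn from any particular pipeline.

```python
import unicodedata

# Three distinct Yoruba words that differ only in diacritics:
# ọkọ (husband), ọkọ̀ (vehicle), ọkọ́ (hoe)
words = ["ọkọ", "ọkọ̀", "ọkọ́"]

def strip_diacritics(text: str) -> str:
    """A common but destructive normalisation step: decompose to NFD,
    drop all combining marks (Unicode category Mn), recompose."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    return unicodedata.normalize("NFC", stripped)

# All three words collapse to the same surface form: "oko".
normalised = {strip_diacritics(w) for w in words}
print(normalised)  # → {'oko'}
```

Three lexically unrelated words become indistinguishable before the tokeniser ever sees them, which is exactly the kind of loss that treating tone as noise, rather than signal, builds into a system.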
We proposed a Yoruba Indigenous Model of Communication for software development: a framework that takes the communicative norms and structures of Yoruba speakers as design requirements, not edge cases to be handled after the fact.
What Participatory Design for AI Looks Like
- Community experts as co-designers, not just data labellers
- Oral-first data collection: many African languages have richer oral than written traditions
- Evaluation by speakers, not just benchmark scores from held-out text samples
- Epistemic humility in deployment: systems should communicate their limitations when operating outside their training contexts
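The last point, epistemic humility in deployment, can be made mechanical. A minimal sketch, assuming a deployment wrapper around an existing model: every name here (SUPPORTED_LANGS, respond, the caveat wording) is a hypothetical illustration, not part of any real system described in the text.

```python
# Hypothetical sketch: make a model's language coverage explicit at the
# point of use, instead of failing silently outside its training context.

SUPPORTED_LANGS = {"en", "fr"}  # languages the system was actually evaluated on

def respond(model_answer: str, detected_lang: str) -> str:
    """Prepend an explicit caveat when the detected input language falls
    outside the set the system was evaluated on."""
    if detected_lang not in SUPPORTED_LANGS:
        caveat = (
            f"[Note: this system was not evaluated on '{detected_lang}'; "
            "treat this answer with caution.]\n"
        )
        return caveat + model_answer
    return model_answer
```

The design choice is the important part: the limitation is surfaced to the speaker rather than hidden behind a confident-looking answer, which is what "communicating limitations" means in practice.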
This Is an AI Ethics Issue
When AI systems cannot process African languages adequately, the communities who speak those languages are systematically excluded from the benefits those systems offer — healthcare information retrieval, legal document processing, educational tools, public service interfaces. Exclusion at this scale, built into the infrastructure of AI, is a form of epistemic injustice. The solution begins not with better algorithms but with different questions: Who is invited to define the problem? Whose language community's norms shape the design?