Considerations in Building Philippine Language Models

#language-models #filipino #nlp #crowd-sourcing #community

The Gap

Large language models are trained overwhelmingly on English text. The next tier includes Chinese, Spanish, French, German, and Japanese. Filipino — the national language of 115 million people — barely registers. Cebuano, Ilocano, Hiligaynon, Waray, Kapampangan, Pangasinan, Bikol, Tausug, Maranao, Maguindanaon — languages spoken by millions — are effectively invisible to the current generation of AI.

This is not a technical limitation. It is a data problem, a funding problem, and an infrastructure problem. And solving it would require rethinking how language resources are collected, curated, and governed.

The Scale of the Challenge

The Philippines has over 180 living languages (Ethnologue, 27th edition). They span at least five major subgroups of the Austronesian family. Some have millions of speakers and robust written traditions. Others are spoken by a few hundred people in a single mountain valley or on a single island.

Building language models that serve this diversity means confronting several realities:

1. The Data Desert

Current open datasets for Philippine languages are thin. WikiAnn, CC-100, and mC4 contain Filipino and Cebuano text, but coverage drops off sharply after that. For languages like Isnag, Kalinga, Tboli, or Hanunuo Mangyan, usable training data is measured in thousands of sentences — not the billions that modern language models expect.

The data that does exist is often:

  • Biased toward formal registers — news text and Wikipedia, not conversation or oral tradition
  • Manila-centric — over-representing Tagalog-region usage patterns
  • Poorly annotated — lacking the morphological and syntactic markup needed for fine-tuning
  • Not consented — scraped from the web without the knowledge or agreement of the communities who produced it

2. The Code-Switching Problem

Filipinos do not speak one language at a time. A single sentence might contain Tagalog grammar, English nouns, a Visayan interjection, and a Spanish loanword. The linguist Andrew Gonzalez described this as “the most multilingual society in Asia.”

Code-switching is not noise. It is the actual linguistic reality of the Philippines. A language model that cannot handle Taglish — or Bislish, or Ilocano-English — is a language model that cannot understand how Filipinos actually communicate.

Standard tokenizers (BPE, SentencePiece) fragment code-switched text badly, splitting common Filipino affixed forms into meaningless subwords. The verb nakapagpapabagabag — a perfectly regular Tagalog word — gets shredded into six or seven tokens by tokenizers trained on English-dominant corpora. This is not an edge case. Filipino morphology is agglutinative; long affixed forms are the norm, not the exception.
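The failure mode is easy to demonstrate. Below is a toy greedy longest-match subword tokenizer; the vocabulary is a hypothetical English-dominant one, invented for illustration, with no merges for Filipino affixed forms:

```python
# Greedy longest-match subword tokenization over a fixed vocabulary.
# The vocabulary is a hypothetical English-dominant one: it lacks merges
# for Filipino affix sequences, so long affixed forms fragment badly.

def tokenize(word, vocab, max_piece=12):
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest vocabulary piece that matches at position i.
        for j in range(min(len(word), i + max_piece), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            pieces.append(word[i])  # fall back to a single character
            i += 1
    return pieces

english_vocab = {"na", "ka", "pag", "pa", "bag", "ab", "ag",
                 "the", "ing", "tion"}

print(tokenize("nakapagpapabagabag", english_vocab))
# The single word shatters into eight pieces.
```

A tokenizer whose vocabulary was learned from Philippine text would instead contain the frequent affix sequences as single pieces, keeping forms like this to a handful of tokens.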

3. The Script Question

Most Philippine text today is written in the Latin alphabet. But Baybayin revival text is increasingly common in cultural and artistic contexts. Arabic script (Jawi) is used in Moro communities. The Hanunuo Mangyan of Mindoro still use their indigenous script daily. A truly Philippine language model would need to handle multiple scripts — or at minimum, transliterate gracefully between them.
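As a sketch of what "transliterate gracefully" could mean, here is a minimal Latin-to-Baybayin converter over the Unicode Tagalog block (U+1700–U+1714). It assumes modern virama-style spelling and the traditional three-vowel system (e→i, o→u); real orthographic practice is more varied, and this handles only simple CV syllables:

```python
# Minimal Latin → Baybayin transliterator (modern virama-style orthography).
# A simplified sketch: CV syllables, the "ng" digraph, bare vowels, and
# syllable-final consonants written with the virama (U+1714).

BASE = {  # consonant → Baybayin letter (inherent 'a' vowel)
    "k": "\u1703", "g": "\u1704", "ng": "\u1705", "t": "\u1706",
    "d": "\u1707", "n": "\u1708", "p": "\u1709", "b": "\u170A",
    "m": "\u170B", "y": "\u170C", "l": "\u170E", "w": "\u170F",
    "s": "\u1710", "h": "\u1711",
}
VOWEL = {"a": "\u1700", "i": "\u1701", "e": "\u1701", "u": "\u1702", "o": "\u1702"}
SIGN = {"a": "", "i": "\u1712", "e": "\u1712", "u": "\u1713", "o": "\u1713"}
VIRAMA = "\u1714"

def to_baybayin(word):
    out, i = [], 0
    w = word.lower()
    while i < len(w):
        # Consonant (preferring the "ng" digraph), possibly followed by a vowel.
        c = w[i:i + 2] if w[i:i + 2] == "ng" else w[i]
        if c in BASE:
            i += len(c)
            if i < len(w) and w[i] in SIGN:
                out.append(BASE[c] + SIGN[w[i]])
                i += 1
            else:
                out.append(BASE[c] + VIRAMA)  # syllable-final consonant
        elif c in VOWEL:
            out.append(VOWEL[c])
            i += 1
        else:
            out.append(c)  # pass through anything unhandled
            i += 1
    return "".join(out)

print(to_baybayin("bata"))  # ᜊᜆ
print(to_baybayin("wika"))  # ᜏᜒᜃ
```

Going the other direction (Baybayin to Latin) is the easier half; the hard part is round-tripping text that mixes scripts mid-sentence.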

What a Crowd-Sourced Approach Could Look Like

The most promising path is not a single, well-funded lab building a Filipino GPT. It is a distributed, community-driven effort — modeled on projects like Mozilla Common Voice, Masakhane (for African languages), and AI4Bharat (for Indian languages) — that treats language communities as partners, not data sources.

A Philippine Language Commons

Imagine an open infrastructure with these components:

Text Collection Platform. A web and mobile tool where speakers contribute text in their language — sentences, paragraphs, translations, transcriptions of oral material. Contributors would tag text by language, register (formal/informal/oral), region, and domain (legal, agricultural, maritime, culinary, religious). Crucially, contributors would retain rights over their contributions under a Creative Commons or community-specific license.
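A contribution record for such a platform might look like the sketch below; the field names, codes, and default license are assumptions, not a fixed schema:

```python
# A sketch of one contribution record for the hypothetical collection
# platform. Fields and values are illustrative, not a real schema.
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class Contribution:
    text: str
    language: str            # ISO 639-3 code, e.g. "tgl", "ceb", "ilo"
    register: str            # "formal" | "informal" | "oral"
    region: str
    domain: str              # e.g. "legal", "agricultural", "culinary"
    license: str = "CC-BY-SA-4.0"        # contributor-chosen license
    contributor_id: Optional[str] = None  # pseudonymous; supports withdrawal

record = Contribution(
    text="Nagluluto si Nanay ng adobo.",
    language="tgl",
    register="informal",
    region="NCR",
    domain="culinary",
)
print(asdict(record))
```

Keeping the license and a pseudonymous contributor ID on every record is what makes the withdrawal rights discussed under Governance technically enforceable rather than aspirational.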

Speech Corpus. Following the Common Voice model: speakers record prompted sentences and validate each other’s recordings. Priority would go to the languages with the largest gap between speaker population and available data — Ilocano, Hiligaynon, Waray, Bikol, and Kapampangan all have millions of speakers but minimal speech data.

Parallel Corpus. Aligned translations between Philippine languages and between Philippine languages and English/Spanish. This is the foundation for machine translation that actually works. Community translators — teachers, writers, linguists, bilingual speakers — would contribute translations at sentence and paragraph level.

Morphological Lexicon. A structured database of word forms, affixes, and derivations for each language. Filipino’s complex morphology means that a brute-force word-list approach fails; the system needs to understand how luto (cook) becomes nagluluto (is cooking), niluto (was cooked), pagluluto (the act of cooking), lutuan (cooking place), and dozens of other forms.
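One lexicon entry might be structured like the sketch below. The affix analyses are hand-encoded for illustration; real coverage needs productive rules for reduplication, infixation, and sound changes (luto → lutuan), not an enumerated table:

```python
# A sketch of one morphological lexicon entry: a root, its gloss, and
# analyzed derived forms. Hand-encoded for illustration only.

LEXICON = {
    "luto": {
        "gloss": "cook",
        "forms": {
            "nagluluto": {"affix": "nag- + reduplication", "gloss": "is cooking"},
            "niluto":    {"affix": "-in- (realis)",        "gloss": "was cooked"},
            "pagluluto": {"affix": "pag- + reduplication", "gloss": "act of cooking"},
            "lutuan":    {"affix": "-an",                  "gloss": "cooking place"},
        },
    },
}

def lookup(form):
    """Return (root, analysis) for a surface form, or None if unknown."""
    for root, entry in LEXICON.items():
        if form == root:
            return root, {"affix": None, "gloss": entry["gloss"]}
        if form in entry["forms"]:
            return root, entry["forms"][form]
    return None

print(lookup("nagluluto"))
```

The point of the structure is that every derived form links back to its root and analysis, which is exactly what a tokenizer or morphological analyzer needs for training and evaluation.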

Historical Text Archive. Digitized and OCR’d primary sources — Blair and Robertson’s 55 volumes, the Boxer Codex, Doctrina Christiana, colonial-era grammars and dictionaries written by Spanish friars (artes and vocabularios), and the rich periodical literature of the Propaganda Movement. These are invaluable for understanding the evolution of Philippine languages and for building models that can process historical text.

Governance

The critical question is not technical. It is political: who owns the data, who decides how it is used, and who benefits?

The CARE Principles for Indigenous Data Governance — Collective benefit, Authority to control, Responsibility, Ethics — offer a framework. Applied to a Philippine Language Commons, this would mean:

  • Community consent before any language data is collected from indigenous or minority language speakers
  • Community review of models trained on their language data before public release
  • Benefit sharing — if commercial applications emerge, value flows back to contributing communities
  • Withdrawal rights — communities can remove their data from the commons at any time
  • Tiered access — some linguistic material (sacred narratives, restricted cultural knowledge) may be contributed for preservation but not for open training

This is harder than scraping the web. It is also the only approach that is ethically defensible.

Technical Considerations

Tokenization

Any Philippine language model needs a tokenizer trained on Philippine text — not an afterthought bolted onto an English-first tokenizer. The key requirements:

  • Handle agglutinative morphology without excessive fragmentation
  • Recognize code-switched sequences as coherent rather than anomalous
  • Support multiple scripts (Latin, Baybayin, Arabic) at minimum through graceful transliteration
  • Include vocabulary from all major Philippine languages, not just Tagalog

Kudo and Richardson’s SentencePiece framework allows training custom tokenizers on domain-specific corpora. A Philippine-specific SentencePiece model, trained on a balanced corpus of Philippine languages, would be a foundational contribution — usable by any downstream model.
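The core mechanism can be seen in a from-scratch byte-pair-merge sketch (not SentencePiece itself). The toy corpus and merge count below are invented for illustration, but the behavior is the point: trained on Philippine text, frequent affix material (nag-, pag-, reduplicated syllables) wins the earliest merges, which is what keeps affixed forms from fragmenting:

```python
# From-scratch BPE merge learning (Sennrich-style), as a stand-in for
# SentencePiece training. On Filipino input, affix sequences become
# early merges because they recur across many word forms.
from collections import Counter

def learn_bpe(words, num_merges):
    # Represent each word as a tuple of symbols; start from characters.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent adjacent pair
        merges.append(best)
        merged = {}
        for word, freq in vocab.items():   # apply the merge everywhere
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged[tuple(out)] = merged.get(tuple(out), 0) + freq
        vocab = merged
    return merges

corpus = ("nagluluto nagbabasa nagsusulat naglalakad "
          "pagluluto pagbabasa pagsusulat").split()
print(learn_bpe(corpus, 10))  # "ag", then "nag"/"pag", emerge as merges
```

With the real library, the equivalent step would be roughly `spm.SentencePieceTrainer.train(input='ph_corpus.txt', model_prefix='ph', vocab_size=32000)` — the corpus path and parameters here are placeholders.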

Model Architecture

The economics of Philippine NLP favor smaller, efficient models over massive ones:

  • Quantized models (4-bit precision) can run on consumer hardware, reducing the infrastructure barrier
  • Adapter-based fine-tuning (QLoRA) allows a single base model to be specialized for different languages or domains without retraining from scratch
  • On-device inference via WebGPU could enable privacy-preserving applications — important for indigenous communities and human rights contexts where data should not traverse foreign servers

A 7-billion parameter model, quantized to 4-bit, fits in approximately 4 GB — within reach of most modern laptops and smartphones. This matters in a country where cloud computing costs are prohibitive for many potential users.
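The arithmetic behind that figure, with the overhead term an assumed round number for quantization scales and runtime buffers:

```python
# Back-of-the-envelope memory footprint for a 7B-parameter model at 4 bits.
params = 7e9
bits_per_param = 4
weights_gb = params * bits_per_param / 8 / 1e9  # bits → bytes → GB
overhead_gb = 0.5  # assumed: quantization scales, KV cache, runtime buffers

print(f"{weights_gb:.1f} GB weights + ~{overhead_gb} GB overhead")
# 3.5 GB of weights, roughly 4 GB in practice
```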

Evaluation

Standard NLP benchmarks (GLUE, SuperGLUE, MMLU) are English-centric and culturally specific. Evaluating a Philippine language model requires Philippine benchmarks:

  • Reading comprehension in Filipino and major regional languages
  • Code-switching detection and generation
  • Legal text understanding (Philippine jurisprudence)
  • Historical text processing (Spanish-era and American-era documents)
  • Translation quality across Philippine language pairs

Building these benchmarks is itself a significant research contribution — and another opportunity for crowd-sourced community involvement.
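One of the benchmark tasks above, code-switching detection, can be sketched as a word-level wordlist baseline. The wordlists here are tiny and illustrative; a real benchmark would score trained taggers against annotated Taglish data:

```python
# Toy word-level code-switching detector: tag each token by wordlist
# membership. Illustrative baseline only — not a real tagger.
TAGALOG = {"nag", "ako", "ng", "sa", "na", "may", "tayo", "kumain", "mamaya"}
ENGLISH = {"meeting", "later", "have", "deadline", "the", "report"}

def tag(sentence):
    tags = []
    for tok in sentence.lower().split():
        word = tok.strip(".,!?")
        if word in TAGALOG:
            tags.append((word, "tl"))
        elif word in ENGLISH:
            tags.append((word, "en"))
        else:
            tags.append((word, "unk"))
    return tags

print(tag("Kumain na tayo, may meeting ako mamaya."))
```

Even this crude baseline exposes the evaluation question: is "meeting" in a Tagalog-matrix sentence an English token, or a borrowed Filipino one? Annotation guidelines have to decide before any model can be scored.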

Who Would Build This?

The realistic answer is: a coalition. No single institution has the linguistic coverage, the community relationships, and the technical capacity to do this alone. A viable consortium might include:

  • University of the Philippines (Diliman, Visayas, Mindanao) — linguistic expertise and fieldwork infrastructure
  • Komisyon sa Wikang Filipino — the national language commission, with a mandate to develop Philippine languages
  • DOST-ASTI — the Department of Science and Technology’s Advanced Science and Technology Institute, with computing resources
  • SIL Philippines — extensive language documentation across minority languages
  • Community organizations — indigenous peoples’ organizations, cultural preservation groups, the Mangyan Heritage Center, the Tausug Heritage Foundation
  • Open-source contributors — Filipino developers and linguists in the diaspora and at home

The model is not a product built by a company. It is a commons built by a nation.

The Stakes

When a language has no representation in AI, its speakers are excluded from an increasing share of the world’s information infrastructure. They cannot search, cannot dictate, cannot translate, cannot interact with automated systems in their own language. They are forced into someone else’s tongue.

For the 28 Philippine languages classified as “in trouble” and the 11 classified as “dying,” the stakes are existential. A language model trained on a language helps keep that language alive — it creates tools, it generates text, it makes the language useful in digital contexts. It cannot replace living speakers, but it can extend the reach and utility of a language in ways that pure archival cannot.

The Philippines does not need another English language model. It needs infrastructure that treats 180 languages as an asset, not an inconvenience — and that trusts communities to lead the effort.


References

  • Ethnologue: Languages of the World, 27th ed., ed. David M. Eberhard, Gary F. Simons, and Charles D. Fennig (SIL International, 2024)
  • Andrew Gonzalez, “The Language Planning Situation in the Philippines,” Journal of Multilingual and Multicultural Development 19:5 (1998)
  • Komisyon sa Wikang Filipino, “Atlas of Philippine Languages” (2020)
  • Taku Kudo and John Richardson, “SentencePiece: A Simple and Language Independent Subword Tokenizer and Detokenizer for Neural Text Processing,” EMNLP 2018
  • Tim Dettmers et al., “QLoRA: Efficient Finetuning of Quantized LLMs,” NeurIPS 2023
  • W3C, “WebGPU,” W3C Working Draft (2024)
  • Pratik Joshi et al., “The State and Fate of Linguistic Diversity and Inclusion in the NLP World,” ACL 2020
  • Masakhane, “Participatory Research for Low-resourced Machine Translation: A Case Study in African Languages,” Findings of EMNLP 2020
  • AI4Bharat, “IndicNLPSuite: Monolingual Corpora, Evaluation Benchmarks and Pre-trained Multilingual Language Models for Indian Languages,” Findings of EMNLP 2020
  • Mozilla Common Voice
  • CARE Principles for Indigenous Data Governance (Global Indigenous Data Alliance, 2019)
  • Republic Act No. 10173, Data Privacy Act of 2012
  • United Nations Declaration on the Rights of Indigenous Peoples (2007), Article 13