Published:2026/1/8 10:04:49

The ultimate gyaru AI is born! 😎✨

Irish-language LLM "Qomhrá" is born! 🎉

Super summary: "Qomhrá", an AI built for low-resource (minority) languages, is seriously impressive! It handles translation and a whole lot more 💖

✨ Gyaru-style sparkle points ✨

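● We've reached an era where AI can shine even for minority languages! ✨
● A business opportunity aimed at Irish speakers (about 1 million!) has arrived! 💰
● It's open source (anyone can use it!), so everyone can help boost the language together! 🙌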


Qomhrá: A Bilingual Irish and English Large Language Model

Joseph McInerney / Khanh-Tung Tran / Liam Lonergan / Ailbhe Ní Chasaide / Neasa Ní Chiaráin / Barry Devereux

Large language model (LLM) research and development has overwhelmingly focused on the world's major languages, leading to under-representation of low-resource languages such as Irish. This paper introduces Qomhrá, a bilingual Irish and English LLM, developed under extremely low-resource constraints. A complete pipeline is outlined spanning bilingual continued pre-training, instruction tuning, and the synthesis of human preference data for future alignment training. We focus on the lack of scalable methods to create human preference data by proposing a novel method to synthesise such data by prompting an LLM to generate "accepted" and "rejected" responses, which we validate as aligning with L1 Irish speakers. To select an LLM for synthesis, we evaluate the top closed-weight LLMs for Irish language generation performance. Gemini-2.5-Pro is ranked highest by L1 and L2 Irish-speakers, diverging from LLM-as-a-judge ratings, indicating a misalignment between current LLMs and the Irish-language community. Subsequently, we leverage Gemini-2.5-Pro to translate a large-scale English-language instruction tuning dataset to Irish and to synthesise a first-of-its-kind Irish-language human preference dataset. We comprehensively evaluate Qomhrá across several benchmarks, testing translation, gender understanding, topic identification, and world knowledge; these evaluations show gains of up to 29% in Irish and 44% in English compared to the existing open-source Irish LLM baseline, UCCIX. The results of our framework provide insight and guidance to developing LLMs for both Irish and other low-resource languages.
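
The preference-data step described in the abstract (prompting an LLM for an "accepted" and a "rejected" response to the same instruction) could look roughly like the sketch below. This is only an illustration under stated assumptions, not the authors' pipeline: the prompt wording, the JSON output format, the record field names (`prompt`, `chosen`, `rejected`), and the helpers `synthesise_pair`, `build_dataset`, and `fake_generate` are all hypothetical, and `fake_generate` stands in for a real model call.

```python
import json
from typing import Callable, Dict, List

# Hypothetical prompt template; the abstract does not give the actual wording.
PROMPT_TEMPLATE = (
    "Instruction (Irish): {instruction}\n"
    "Return JSON with two keys: 'accepted' (a fluent, helpful Irish response) "
    "and 'rejected' (a plausible but worse response)."
)


def synthesise_pair(instruction: str, generate: Callable[[str], str]) -> Dict[str, str]:
    """Ask the generator for an accepted/rejected pair and return one preference record."""
    raw = generate(PROMPT_TEMPLATE.format(instruction=instruction))
    parsed = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) on bad output
    if not isinstance(parsed, dict) or not {"accepted", "rejected"} <= parsed.keys():
        raise ValueError("generator output is missing a required key")
    if parsed["accepted"] == parsed["rejected"]:
        raise ValueError("accepted and rejected responses are identical")
    # Field names are an assumption; they mirror a common preference-data layout.
    return {
        "prompt": instruction,
        "chosen": parsed["accepted"],
        "rejected": parsed["rejected"],
    }


def build_dataset(instructions: List[str], generate: Callable[[str], str]) -> List[Dict[str, str]]:
    """Build preference records, skipping instructions whose output is malformed."""
    records = []
    for instruction in instructions:
        try:
            records.append(synthesise_pair(instruction, generate))
        except ValueError:
            continue  # drop malformed generations instead of storing them
    return records


def fake_generate(prompt: str) -> str:
    """Stand-in for a real LLM call, so the sketch runs offline."""
    return json.dumps(
        {"accepted": "Dia duit! Conas atá tú?", "rejected": "Hello."},
        ensure_ascii=False,
    )


if __name__ == "__main__":
    for record in build_dataset(["Beannaigh dom as Gaeilge."], fake_generate):
        print(json.dumps(record, ensure_ascii=False))
```

Passing the generator in as a function keeps the sketch independent of any particular model API. Synthetic pairs would still need checking against human judgements, which is what the abstract reports doing with L1 Irish speakers.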

cs / cs.CL