Ultra-quick summary: The secret behind how an LLM that has mastered languages from all over the world can shine even in low-resource languages (languages that aren't used much) is finally revealed!
🌟 Gal-style sparkle points ✨
● Supports 113 languages worldwide! Basically an international gal 😎
● Built from TED Talks data! Brains totally covered too 💖
● Even for low-resource (minor) languages, translation accuracy goes way up ⤴️
Detailed explanation / Background: LLMs (large language models) are amazing, but the languages they can actually handle are limited, and that was a real problem 😢 Especially for languages that aren't used much (they're called low-resource languages!), performance just doesn't improve easily 💦 But in today's globalized society, LLMs that can handle all kinds of languages are exactly what everyone wants, right? 🤔
Continued pretraining and instruction tuning on large-scale multilingual data have proven to be effective in scaling large language models (LLMs) to low-resource languages. However, the unaligned nature of such data limits its ability to effectively capture cross-lingual semantics. In contrast, multi-way parallel data, where identical content is aligned across multiple languages, provides stronger cross-lingual consistency and offers greater potential for improving multilingual performance. In this paper, we introduce a large-scale, high-quality multi-way parallel corpus, TED2025, based on TED Talks. The corpus spans 113 languages, with up to 50 languages aligned in parallel, ensuring extensive multilingual coverage. Using this dataset, we investigate best practices for leveraging multi-way parallel data to enhance LLMs, including strategies for continued pretraining, instruction tuning, and the analysis of key influencing factors. Experiments on six multilingual benchmarks show that models trained on multi-way parallel data consistently outperform those trained on unaligned multilingual data.
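To make the idea of multi-way parallel data concrete, here is a minimal sketch of how a single aligned record (the same sentence in several languages) could be expanded into translation-style instruction-tuning pairs. The record format, field names, language codes, and prompt template are illustrative assumptions, not the actual TED2025 schema or the training recipe used in the paper.

```python
# Hypothetical sketch: turning one multi-way parallel record into
# instruction-tuning pairs. All field names and the prompt wording are
# assumptions for illustration, not the TED2025 format.
import itertools
import json

# One multi-way parallel record: the same sentence aligned across languages.
record = {
    "en": "Ideas are worth spreading.",
    "fr": "Les idées méritent d'être partagées.",
    "sw": "Mawazo yanastahili kusambazwa.",
}

def make_instruction_pairs(aligned):
    """Build (prompt, response) pairs for every ordered language pair."""
    pairs = []
    for src, tgt in itertools.permutations(aligned.keys(), 2):
        prompt = (
            f"Translate the following {src} sentence into {tgt}:\n"
            f"{aligned[src]}"
        )
        pairs.append({"prompt": prompt, "response": aligned[tgt]})
    return pairs

# With N aligned languages, one record yields N * (N - 1) training pairs,
# which is where the extra cross-lingual signal comes from.
for example in make_instruction_pairs(record):
    print(json.dumps(example, ensure_ascii=False))
```

The design point this sketch illustrates is that aligned content lets every language pair share the exact same semantics, whereas unaligned multilingual data gives the model no such explicit cross-lingual anchor.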