Bengali Dialect Translation Skyrockets with RAG 💖

Published: 2025/12/16 8:18:18

Super summary: Let's use AI (RAG) to send Bengali dialect translation accuracy through the roof! It works even with very little data! ✨

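✨ Sparkly Gal Highlights ✨
● Dialect translation gets seriously good with no fine-tuning! 😎
● It handles lots of different dialects, which is totally awesome!
● It can even contribute to the IT industry!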

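Detailed breakdown

● Background: Bengali has tons of dialects, and apparently that makes translation really hard… 😭 There isn't much data, either. Existing AI models just can't quite capture the nuances of each dialect.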

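● Method: The paper uses an AI technique called RAG (retrieval-augmented generation)! It tries two approaches:

Translate using dialect sentences from audio transcripts as context
Translate using standard-language/dialect sentence pairs as context

Then it tests which one works better!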
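To make the sentence-pair approach concrete, here is a minimal sketch of how retrieved pairs could be turned into a prompt. It is only an illustration under stated assumptions: the character-trigram retriever, the prompt layout, and the names `SentencePair`, `retrieve`, and `build_prompt` are hypothetical and not taken from the paper, and the stored sentences are ASCII placeholders rather than real Bengali.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SentencePair:
    dialect: str   # sentence in the local dialect
    standard: str  # the same sentence in standard Bengali


def char_ngrams(text: str, n: int = 3) -> set[str]:
    """Character n-grams of a whitespace-normalized string."""
    text = " ".join(text.split())
    if len(text) < n:
        return {text} if text else set()
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def similarity(a: str, b: str) -> float:
    """Jaccard overlap of character trigrams (assumed stand-in for a real retriever)."""
    ga, gb = char_ngrams(a), char_ngrams(b)
    if not ga or not gb:
        return 0.0
    return len(ga & gb) / len(ga | gb)


def retrieve(query: str, store: list[SentencePair], k: int = 2) -> list[SentencePair]:
    """Return the k pairs whose standard-language side is closest to the query."""
    ranked = sorted(store, key=lambda p: similarity(query, p.standard), reverse=True)
    return ranked[:k]


def build_prompt(query: str, examples: list[SentencePair], dialect_name: str) -> str:
    """Format retrieved pairs as in-context examples, then append the query."""
    lines = [f"Translate standard Bengali into the {dialect_name} dialect."]
    for ex in examples:
        lines.append(f"standard_bengali: {ex.standard}")
        lines.append(f"local_dialect: {ex.dialect}")
    lines.append(f"standard_bengali: {query}")
    lines.append("local_dialect:")
    return "\n".join(lines)


# Placeholder strings only; a real store would hold actual dialect/standard pairs.
store = [
    SentencePair(dialect="D: where are you going", standard="S: where are you going"),
    SentencePair(dialect="D: i ate rice today", standard="S: i ate rice today"),
    SentencePair(dialect="D: the market is closed", standard="S: the market is closed"),
]
query = "S: where are they going"
examples = retrieve(query, store, k=2)
prompt = build_prompt(query, examples, dialect_name="Chittagong")
print(prompt)  # this string would be sent to whichever LLM is being evaluated
```

The model itself is never updated here: all dialect knowledge arrives through the retrieved examples in the prompt, which is what makes the approach fine-tuning-free.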

A Comparative Analysis of Retrieval-Augmented Generation Techniques for Bengali Standard-to-Dialect Machine Translation Using LLMs

K. M. Jubair Sami / Dipto Sumit / Ariyan Hossain / Farig Sadeque

Translating from a standard language to its regional dialects is a significant NLP challenge due to scarce data and linguistic variation, a problem prominent in the Bengali language. This paper proposes and compares two novel RAG pipelines for standard-to-dialectal Bengali translation. The first, a Transcript-Based Pipeline, uses large dialect sentence contexts from audio transcripts. The second, a more effective Standardized Sentence-Pairs Pipeline, utilizes structured local_dialect:standard_bengali sentence pairs. We evaluated both pipelines across six Bengali dialects and multiple LLMs using BLEU, ChrF, WER, and BERTScore. Our findings show that the sentence-pair pipeline consistently outperforms the transcript-based one, reducing Word Error Rate (WER) from 76% to 55% for the Chittagong dialect. Critically, this RAG approach enables smaller models (e.g., Llama-3.1-8B) to outperform much larger models (e.g., GPT-OSS-120B), demonstrating that a well-designed retrieval strategy can be more crucial than model size. This work contributes an effective, fine-tuning-free solution for low-resource dialect translation, offering a practical blueprint for preserving linguistic diversity.
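For readers unfamiliar with WER, the following is a minimal sketch of the standard definition: word-level edit distance divided by the number of reference words. The function name `word_error_rate` and the toy strings are illustrative assumptions, and the paper's exact tokenization and scoring tools are not stated in the abstract. The last lines work through the reported Chittagong figures: a drop from 76% to 55% is 21 percentage points, or roughly a 28% relative reduction.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of an extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Toy placeholder strings: one substitution plus one deletion over 4 reference words.
toy_wer = word_error_rate("a b c d", "a x c")

# Figures reported in the abstract for the Chittagong dialect.
wer_transcript, wer_pairs = 0.76, 0.55
absolute_drop = wer_transcript - wer_pairs       # in WER points
relative_drop = absolute_drop / wer_transcript   # fraction of the original error
print(f"toy WER = {toy_wer:.2f}, relative reduction = {relative_drop:.1%}")
```

Because WER is normalized by reference length only, it can exceed 100% when the hypothesis contains many extra words.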

cs / cs.CL / cs.AI / cs.IR