The ultimate gal AI has arrived~! 😘
Super summary: a BPE that achieves parity across languages 💖
✨ Gal-style sparkle points ✨
● Even low-resource languages (languages with little data) can get high-quality service 💖
● AI costs might go down! ✨
● Global expansion (going worldwide) gets way easier~! 🚀
Tokenization is the first -- and often least scrutinized -- step of most NLP pipelines. Standard algorithms for learning tokenizers rely on frequency-based objectives, which favor languages dominant in the training data and consequently leave lower-resource languages with tokenizations that are disproportionately longer, morphologically implausible, or even riddled with <UNK> placeholders. This phenomenon ultimately amplifies computational and financial inequalities between users from different language backgrounds. To remedy this, we introduce Parity-aware Byte Pair Encoding (BPE), a variant of the widely-used BPE algorithm. At every merge step, Parity-aware BPE maximizes the compression gain of the currently worst-compressed language, trading a small amount of global compression for cross-lingual parity. We find empirically that Parity-aware BPE leads to more equitable token counts across languages, with negligible impact on global compression rate and no substantial effect on language-model performance in downstream tasks.
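The merge rule described in the abstract lends itself to a compact sketch. Below is a minimal, illustrative Python implementation written under several assumptions that are not from the paper itself: corpora are pre-split into per-language character sequences, compression is measured as tokens per character, and the names `parity_aware_bpe`, `pair_counts`, `apply_merge`, and `compression_rate` are hypothetical helpers, not the authors' code. At each merge step it picks the currently worst-compressed language and adopts that language's most frequent adjacent symbol pair, then applies the merge to every language so the vocabulary stays shared.

```python
from collections import Counter

def pair_counts(corpus):
    """Count adjacent symbol pairs in a corpus (a list of symbol sequences)."""
    counts = Counter()
    for seq in corpus:
        for a, b in zip(seq, seq[1:]):
            counts[(a, b)] += 1
    return counts

def apply_merge(seq, pair, new_sym):
    """Replace every non-overlapping occurrence of `pair` in `seq` with `new_sym`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_sym)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def compression_rate(corpus, raw_chars):
    """Tokens per character: higher means worse compression."""
    return sum(len(seq) for seq in corpus) / raw_chars

def parity_aware_bpe(corpora, num_merges):
    """
    corpora: dict mapping language -> list of character sequences (mutated in place).
    Returns the ordered list of learned merges.
    """
    # Character counts per language, fixed throughout training.
    raw_chars = {lang: sum(len(s) for s in seqs) for lang, seqs in corpora.items()}
    merges = []
    for _ in range(num_merges):
        # 1. Find the currently worst-compressed language.
        worst = max(corpora, key=lambda l: compression_rate(corpora[l], raw_chars[l]))
        # 2. Choose the merge with the largest compression gain for that language,
        #    i.e. its most frequent adjacent pair.
        counts = pair_counts(corpora[worst])
        if not counts:
            break
        pair, _ = counts.most_common(1)[0]
        new_sym = pair[0] + pair[1]
        merges.append(pair)
        # 3. Apply the merge to all languages so the vocabulary stays shared.
        for lang in corpora:
            corpora[lang] = [apply_merge(seq, pair, new_sym) for seq in corpora[lang]]
    return merges
```

Compared with standard BPE, the only change is the selection criterion in steps 1 and 2: instead of the globally most frequent pair, the pair that most benefits the lagging language is merged, which is how a small amount of global compression is traded for cross-lingual parity. Called on per-language character corpora, e.g. `parity_aware_bpe({"en": [...], "sw": [...]}, num_merges=32000)` (the merge budget here is just an illustrative number), it returns an ordered merge list in the same form a standard BPE trainer would.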