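Super-short summary: This study uses AI (LLMs) in child welfare to spot risks earlier and make the work way easier!
✨ Gal-style sparkle points ✨ ● Small models rule! 💻✨ No worries about cost or security! ● Extended reasoning boosts the smarts ⤴️ Even the hard stuff gets understood! ● Social impact, check 👍 It's all about protecting kids' futures!
Here comes the detailed breakdown~💖
Background: Tech has gotten amazing lately, right? ✨ But in child welfare (the work of helping kids), AI adoption is still lagging 😢 There's the difficult terminology, the privacy issues, and so on… 💦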
Objective: This study develops a systematic benchmarking framework for testing whether language models can accurately identify constructs of interest in child welfare records. The objective is to assess how different model sizes and architectures perform on four validated benchmarks for classifying critical risk factors among child welfare-involved families: domestic violence, firearms, substance-related problems generally, and opioids specifically.
Method: We constructed four benchmarks for identifying risk factors in child welfare investigation summaries: domestic violence, substance-related problems, firearms, and opioids (n=500 each). We evaluated seven model sizes (0.6B-32B parameters) in standard and extended reasoning modes, plus a mixture-of-experts variant. Cohen's kappa measured agreement with gold standard classifications established by human experts.
Results: The benchmarking revealed a critical finding: bigger models are not better. A small 4B parameter model with extended reasoning proved most effective, outperforming models up to eight times larger. It consistently achieved "substantial" to "almost perfect" agreement across all four benchmark categories. This model achieved "almost perfect" agreement (κ = 0.93-0.96) on three benchmarks (substance-related problems, firearms, and opioids) and "substantial" agreement (κ = 0.74) on the most complex task (domestic violence). Small models with extended reasoning rivaled the largest models while being more resource-efficient.
Conclusions: Small reasoning-enabled models achieve accuracy levels historically requiring larger architectures, enabling significant time and computational efficiencies. The benchmarking framework provides a method for evidence-based model selection to balance accuracy with practical resource constraints before operational deployment in social work research.
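To make the agreement metric concrete, here is a minimal sketch of how Cohen's kappa could be computed between expert gold-standard labels and a model's labels, and how a kappa value maps onto verbal bands such as "substantial" and "almost perfect". It is not the authors' code. The function names `cohens_kappa` and `agreement_label` are hypothetical, the toy labels are made up for illustration, and the cut-points assume the widely used Landis and Koch bands, which the abstract's wording suggests but does not state.

```python
from collections import Counter


def cohens_kappa(gold, pred):
    """Cohen's kappa: chance-corrected agreement between two label sequences."""
    if len(gold) != len(pred) or len(gold) == 0:
        raise ValueError("gold and pred must be non-empty and the same length")
    n = len(gold)
    # Observed agreement: fraction of records where the two labels match.
    observed = sum(g == p for g, p in zip(gold, pred)) / n
    # Expected agreement by chance, from each rater's marginal label frequencies.
    gold_counts = Counter(gold)
    pred_counts = Counter(pred)
    labels = set(gold_counts) | set(pred_counts)
    expected = sum(gold_counts[c] * pred_counts[c] for c in labels) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined.
        return float("nan")
    return (observed - expected) / (1.0 - expected)


def agreement_label(kappa):
    """Verbal band for a kappa value (assumed Landis and Koch cut-points)."""
    if kappa > 0.80:
        return "almost perfect"
    if kappa > 0.60:
        return "substantial"
    if kappa > 0.40:
        return "moderate"
    if kappa > 0.20:
        return "fair"
    if kappa >= 0.0:
        return "slight"
    return "poor"


# Toy example (made-up labels, 1 = risk factor present, 0 = absent).
gold = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
kappa = cohens_kappa(gold, pred)
print(f"kappa = {kappa:.3f} ({agreement_label(kappa)})")
```

In the toy example the two label sets agree on 8 of 10 records, but because half of that agreement would be expected by chance, kappa is noticeably lower than the raw 80% match rate. That correction is why kappa, rather than plain accuracy, is a natural choice when a risk factor is rare in the records.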