A super-speedy recommendation (recsys) system just dropped 💖
✨ Gal-Style Sparkle Points ✨
● Updating the model with LoRA on the fly, right while it's serving inference? Major time-saver, right?
● Keeping model accuracy while putting idle CPUs to work is eco-friendly genius 👏
● Fresher recommendations mean your fan life gets a boost too 🫶
Here comes the detailed breakdown~!
Deep Learning Recommendation Models (DLRMs) underpin personalized services but face a critical freshness-accuracy tradeoff due to massive parameter-synchronization overheads. Production DLRMs deploy decoupled training/inference clusters, where synchronizing petabyte-scale embedding tables (EMTs) causes multi-minute staleness, degrading recommendation quality and revenue. We observe that (1) inference nodes exhibit sustained CPU underutilization (peak <= 20%), and (2) EMT gradients possess intrinsic low-rank structure, enabling a compact update representation. We present LiveUpdate, a system that eliminates inter-cluster synchronization by colocating Low-Rank Adaptation (LoRA) trainers within inference nodes. LiveUpdate addresses two core challenges: (1) dynamic rank adaptation via singular-value monitoring to constrain memory overhead (<2% of EMTs), and (2) NUMA-aware resource scheduling with hardware-enforced QoS to eliminate contention between updates and inference (P99 latency impact <20ms). Evaluations show LiveUpdate reduces update costs by 2x versus delta-update baselines while achieving higher accuracy within 1-hour windows. By transforming idle inference resources into freshness engines, LiveUpdate delivers online model updates while outperforming state-of-the-art delta-update methods by 0.04% to 0.24% in accuracy.
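To make the two key ideas concrete, here is a minimal NumPy sketch of (a) exploiting the low-rank structure of an EMT gradient via an SVD-based factorization, and (b) dynamic rank adaptation by monitoring singular values. The 99%-energy threshold, the `max_rank` cap, and all variable names are illustrative assumptions, not details from the paper, and a toy matrix stands in for a petabyte-scale EMT.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy embedding table (a real EMT is petabyte-scale; this is a stand-in).
n_rows, dim = 1000, 64
emt = rng.standard_normal((n_rows, dim)).astype(np.float32)

# Synthetic accumulated gradient with intrinsic low-rank structure,
# mimicking the paper's observation about EMT gradients.
true_rank = 4
grad = (rng.standard_normal((n_rows, true_rank))
        @ rng.standard_normal((true_rank, dim))).astype(np.float32)

def adapt_rank(update, energy=0.99, max_rank=16):
    """Pick the smallest rank whose singular values capture `energy`
    of the update's spectral energy (threshold and cap are assumed
    here for illustration), then return LoRA-style factors A, B with
    update ~= A @ B."""
    u, s, vt = np.linalg.svd(update, full_matrices=False)
    cum = np.cumsum(s**2) / np.sum(s**2)
    r = min(int(np.searchsorted(cum, energy)) + 1, max_rank)
    A = u[:, :r] * s[:r]   # (n_rows, r)
    B = vt[:r, :]          # (r, dim)
    return A, B, r

A, B, r = adapt_rank(grad)
emt_updated = emt + A @ B  # apply the compact update to the table

# Memory for the factors vs. a dense delta of the same table.
dense_params = grad.size
lora_params = A.size + B.size
```

Storing only `A` and `B` instead of the dense delta is what bounds the memory overhead: for a rank-r update the factor cost is r*(n_rows + dim) parameters versus n_rows*dim for the full delta, which is how a <2% overhead budget becomes feasible when r is small.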