Published: 2026/1/5 1:34:06

RelayGR is the strongest! Recommendation from long action sequences has arrived 🎉 (ultra-short summary: a fast & accurate next-gen recommender)

🌟 Gal-Style Sparkle Points ✨

● It can make use of long histories (the user's behavior), so it gives seriously personal recommendations 💖
● Processing is blazing fast! No waiting around, so you can enjoy the service stress-free 👯‍♀️
● It works with all kinds of web services, so you might run into "recommendations just for me" at all kinds of shops ⁉️

Now for the detailed breakdown~!

Background

Everyone loves recommender systems, right? 🥺 But there was a problem: taking long behavior histories into account made processing slow. RelayGR solves that problem!

Read the rest in the 「らくらく論文」 app

RelayGR: Scaling Long-Sequence Generative Recommendation via Cross-Stage Relay-Race Inference

Jiarui Wang / Huichao Chai / Yuanhang Zhang / Zongjin Zhou / Wei Guo / Xingkun Yang / Qiang Tang / Bo Pan / Jiawei Zhu / Ke Cheng / Yuting Yan / Shulan Wang / Yingjie Zhu / Zhengfan Yuan / Jiaqi Huang / Yuhan Zhang / Xiaosong Sun / Zhinan Zhang / Hong Zhu / Yongsheng Zhang / Tiantian Dong / Zhong Xiao / Deliang Liu / Chengzhou Lu / Yuan Sun / Zhiyuan Chen / Xinming Han / Zaizhu Liu / Yaoyuan Wang / Ziyang Zhang / Yong Liu / Jinxin Xu / Yajing Sun / Zhoujun Yu / Wenting Zhou / Qidong Zhang / Zhengyong Zhang / Zhonghai Gu / Yibo Jin / Yongxiang Feng / Pengfei Zuo

Real-time recommender systems execute multi-stage cascades (retrieval, pre-processing, fine-grained ranking) under strict tail-latency SLOs, leaving only tens of milliseconds for ranking. Generative recommendation (GR) models can improve quality by consuming long user-behavior sequences, but in production their online sequence length is tightly capped by the ranking-stage P99 budget. We observe that the majority of GR tokens encode user behaviors that are independent of the item candidates, suggesting an opportunity to pre-infer a user-behavior prefix once and reuse it during ranking rather than recomputing it on the critical path. Realizing this idea at industrial scale is non-trivial: the prefix cache must survive across multiple pipeline stages before the final ranking instance is determined, the user population implies cache footprints far beyond a single device, and indiscriminate pre-inference would overload shared resources under high QPS. We present RelayGR, a production system that enables in-HBM relay-race inference for GR. RelayGR selectively pre-infers long-term user prefixes, keeps their KV caches resident in HBM over the request lifecycle, and ensures the subsequent ranking can consume them without remote fetches. RelayGR combines three techniques: 1) a sequence-aware trigger that admits only at-risk requests under a bounded cache footprint and pre-inference load, 2) an affinity-aware router that co-locates cache production and consumption by routing both the auxiliary pre-infer signal and the ranking request to the same instance, and 3) a memory-aware expander that uses server-local DRAM to capture short-term cross-request reuse while avoiding redundant reloads. We implement RelayGR on Huawei Ascend NPUs and evaluate it with real queries. Under a fixed P99 SLO, RelayGR supports up to 1.5× longer sequences and improves SLO-compliant throughput by up to 3.6×.
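The relay-race idea in the abstract (pre-infer the candidate-independent user prefix off the critical path, then route the ranking request to the instance already holding that cache) can be sketched as a toy in Python. Everything here is illustrative: the class names, the hash-based affinity scheme, and the stand-in "prefix state" are assumptions for explanation, not the paper's actual implementation on Ascend NPUs.

```python
import hashlib

class RankingInstance:
    """One serving instance; holds prefix KV caches in (simulated) HBM."""
    def __init__(self, name):
        self.name = name
        self.hbm_cache = {}  # user_id -> precomputed prefix state

    def pre_infer_prefix(self, user_id, behavior_seq):
        # Expensive step done off the critical path: encode the
        # candidate-independent user-behavior prefix once and keep it
        # resident until the ranking request for this user arrives.
        self.hbm_cache[user_id] = ("prefix_kv", tuple(behavior_seq))

    def rank(self, user_id, candidates):
        # Critical path: reuse the cached prefix if present; a miss
        # stands in for full recomputation under the latency budget.
        hit = user_id in self.hbm_cache
        # Toy candidate-dependent scoring for the suffix step.
        scores = {c: len(str(c)) for c in candidates}
        return scores, hit

class AffinityRouter:
    """Routes the pre-infer signal and the later ranking request for
    the same user to the same instance, so the prefix cache is consumed
    where it was produced (no remote fetch)."""
    def __init__(self, instances):
        self.instances = instances

    def pick(self, user_id):
        h = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
        return self.instances[h % len(self.instances)]

# Request lifecycle: pre-infer early, rank later on the same instance.
instances = [RankingInstance("npu-0"), RankingInstance("npu-1")]
router = AffinityRouter(instances)
inst = router.pick("user-42")
inst.pre_infer_prefix("user-42", ["click", "view", "buy"])
scores, cache_hit = router.pick("user-42").rank("user-42", ["itemA", "itemB"])
```

The sketch omits the paper's sequence-aware trigger (which admits only at-risk requests under a bounded cache footprint) and the DRAM-backed expander; it only shows why co-locating cache production and consumption removes the prefix recomputation from the ranking-stage critical path.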

cs / cs.DC / cs.AI / cs.LG