Generative Recommendation, Part 1: Before Generation

Collaborative filtering. Matrix factorization. Deep learning. Three generations, architecturally one. They all scored a fixed candidate set. The assumption is breaking.

May 25, 2026

Recommendation is in the middle of a paradigm rupture as deep as the one LLMs caused in language. Three independent forces are converging at the same moment. Scaling laws have arrived at user-behavior data. The cascade architecture that powered the modern attention economy has reached diminishing engineering returns. Item representation is going language-like. None of these alone would matter. Their convergence is what makes this a turn rather than another round of incremental improvement.

The thesis: the first three eras of recommender systems, the ones built between 1992 and roughly 2020, look radically different from one another and are architecturally a single object. Collaborative filtering, matrix factorization, deep-learning recommendation. Each one scores a fixed candidate set, and the field’s three decades of progress refined that scoring without questioning it. Call this the discriminative paradigm.

Recommendation is also the largest production AI deployment in the economy. TikTok, YouTube, Amazon, Netflix, Spotify, Instagram, every major advertising system. The discriminative paradigm did not produce a research toy. It built the infrastructure of modern attention. Understanding what that paradigm actually was, and what it could not see, is the prerequisite for judging what replaces it.

The candidate-set assumption is the silent constraint that organized recommendation for thirty years. Naming it is the first step in seeing what generation actually changes.

1. First Era: Collaborative Filtering

In the early 1990s, Usenet was beginning to fail under its own success. The network was carrying more than 100 megabytes of news traffic per day, and by early 1994 over 140,000 people had posted articles in the preceding two weeks, alongside a much larger lurking readership. There were too many articles for any reader to follow, and the existing categorical newsgroups were too coarse to help. The problem was less about access than about filtering. Of the day’s flood, which articles would a given reader actually want to see?

The first technical response came in 1992. Goldberg, Nichols, Oki, and Terry at Xerox PARC published Tapestry in Communications of the ACM, coining the term “collaborative filtering” and proposing a system that let users query each other’s annotations. Tapestry was monolithic, designed for a single site, and required users to write filter queries by hand. It established the conceptual move that other readers’ judgments could substitute for content classification. It did not scale.

Two years later, GroupLens did. Paul Resnick and Mitesh Suchak at MIT’s Center for Coordination Science teamed up with Neophytos Iacovou, Peter Bergstrom, and John Riedl at the University of Minnesota’s Department of Computer Science. Their paper, “GroupLens: An Open Architecture for Collaborative Filtering of Netnews,” appeared at CSCW ’94 in Chapel Hill in late October. The architecture was open in two senses. Different news clients could plug into the rating servers, and different rating servers could be added without restructuring the system. More importantly, the prediction step happened automatically, without users having to write queries about which other readers to trust.

The mechanic was a user-item matrix of ratings. For a target user and an unseen article, compute the similarity between the target user and other users who had rated the article. Use the similarity as a weight, average their ratings, return the predicted score. Rank the unseen articles by predicted score. Show the top K.
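The mechanic fits in a few lines. What follows is a minimal sketch, not the GroupLens implementation: the toy ratings and the helper names (`pearson`, `predict`, `top_k`) are hypothetical, and the sketch assumes Pearson correlation as the similarity and predicts a mean-centered weighted average, one common way to write neighborhood methods.

```python
from math import sqrt

# Hypothetical toy ratings: user -> {article: rating on a 1-5 scale}.
ratings = {
    "ann": {"a1": 5, "a2": 1, "a3": 4},
    "bob": {"a1": 4, "a2": 2, "a3": 5, "a4": 5},
    "cat": {"a1": 1, "a2": 5, "a3": 2, "a4": 1},
}

def mean(user):
    """Average of everything this user has rated."""
    vals = ratings[user].values()
    return sum(vals) / len(vals)

def pearson(u, v):
    """Pearson correlation over the articles both users rated."""
    common = sorted(set(ratings[u]) & set(ratings[v]))
    if len(common) < 2:
        return 0.0
    mu = sum(ratings[u][i] for i in common) / len(common)
    mv = sum(ratings[v][i] for i in common) / len(common)
    num = sum((ratings[u][i] - mu) * (ratings[v][i] - mv) for i in common)
    den = sqrt(sum((ratings[u][i] - mu) ** 2 for i in common)
               * sum((ratings[v][i] - mv) ** 2 for i in common))
    return num / den if den else 0.0

def predict(user, item):
    """User's mean plus the similarity-weighted deviation of the neighbors."""
    neighbors = [v for v in ratings if v != user and item in ratings[v]]
    weights = {v: pearson(user, v) for v in neighbors}
    norm = sum(abs(w) for w in weights.values())
    if norm == 0:
        return mean(user)
    offset = sum(w * (ratings[v][item] - mean(v)) for v, w in weights.items())
    return mean(user) + offset / norm

def top_k(user, k):
    """Rank every article the user has not rated by predicted score."""
    all_items = {i for r in ratings.values() for i in r}
    unseen = all_items - set(ratings[user])
    return sorted(unseen, key=lambda i: predict(user, i), reverse=True)[:k]
```

In this toy data, `ann` agrees with `bob` and disagrees with `cat`, so `bob` liking `a4` and `cat` disliking it both push the prediction for `ann` upward. Note that the candidate set never appears as an object of its own: it is just the `unseen` set computed inside `top_k`.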

GroupLens was a real breakthrough. It removed the need for hand-curated content categories, and it generalized. A system that knew nothing about hockey or quantum field theory could still tell a reader whose past ratings correlated with other quantum-field-theory readers that they would probably like the article currently circulating. The opinions were the categorization.

But the architectural shape of the system was what mattered most for what came later. Take an existing set of items. Compute a per-pair score between users and items. Rank. Return the top of the ranking. The set was implicit, never built as a separate stage and never named in the code. It was simply every article the target user had not yet seen, fixed at any given moment by what the user had already read.

2. Second Era: Matrix Factorization

In October 2006, Netflix announced a contest. The company offered one million dollars to anyone who could improve its existing recommendation algorithm, Cinematch, by ten percent on root mean squared error against a held-out test set. The Netflix Prize ran for nearly three years and became the most public, most documented, most fought-over recommendation problem of its era.

Cinematch was already a deployed system, processing millions of ratings per day. The contest’s structure made improvement legible. An exact dataset, an exact metric, a public leaderboard, a single prize. Researchers could iterate against the same test, compare techniques honestly, and watch each other’s work in real time.

On December 11, 2006, two months into the contest, a developer using the name Simon Funk posted to his personal blog under the title “Netflix Update: Try This at Home”. Funk was, at the time, tied for third place on the leaderboard, kept there by a fifty-fifty blend with another competitor’s submission. He was not winning. He had no particular institutional affiliation. He wrote, in his usual unpretentious tone, that he was going to share the math behind his approach.

The math was a stripped-down singular value decomposition adapted to the recommendation problem. Each user got a low-dimensional latent vector. Each movie got a low-dimensional latent vector. The predicted rating was the dot product of the two. The vectors were learned by minimizing squared error against observed ratings, one rating at a time, with regularization to prevent overfitting on the sparse data.
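A rough sketch of that recipe, under stated assumptions: the toy ratings, the hyperparameters, and the names `train_mf` and `rmse` are all hypothetical, and the loop updates every latent dimension together, a common simplification. It is not a reconstruction of Funk’s code.

```python
import random

def train_mf(ratings, n_users, n_items, k=2, lr=0.02, reg=0.02,
             epochs=500, seed=0):
    """Learn user and item vectors by SGD on squared error, one rating at a time."""
    rng = random.Random(seed)
    P = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_users)]
    Q = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            # Prediction is the dot product of the two latent vectors.
            err = r - sum(pu * qi for pu, qi in zip(P[u], Q[i]))
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                # Gradient step on squared error with L2 regularization.
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q

def rmse(ratings, P, Q):
    """Root mean squared error of dot-product predictions on the given ratings."""
    se = sum((r - sum(a * b for a, b in zip(P[u], Q[i]))) ** 2
             for u, i, r in ratings)
    return (se / len(ratings)) ** 0.5

# Hypothetical (user, movie, rating) triples with two taste clusters.
toy = [
    (0, 0, 5), (0, 1, 4), (0, 2, 1),
    (1, 0, 4), (1, 1, 5), (1, 3, 1),
    (2, 2, 5), (2, 3, 4), (2, 0, 1),
    (3, 2, 4), (3, 3, 5), (3, 1, 2),
]
P, Q = train_mf(toy, n_users=4, n_items=4)
```

Calling `rmse(toy, P, Q)` after training and comparing it with the same call on untrained vectors (`epochs=0`) shows the fit improving. Scoring an unseen movie is then one dot product per user-movie pair, which is exactly the per-pair scoring shape the earlier era had.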

Funk’s post did not win the prize. What it did was give the rest of the field a template. Over the next two and a half years, the contest’s leading teams refined and extended his approach, adding implicit feedback, temporal dynamics, neighborhood effects, and ensemble blending. Yehuda Koren, by then at Yahoo Research, and Robert Bell and Chris Volinsky of AT&T Labs codified the resulting techniques in a 2009 paper in IEEE Computer titled “Matrix Factorization Techniques for Recommender Systems,” which became the canonical reference for the era. All three were also on the team that ultimately won.

The endgame, in July 2009, was sharper than the common telling. On the public quiz set, a team called The Ensemble made a last-minute submission scoring 10.10 percent improvement against Cinematch, four minutes before the contest closed. BellKor’s Pragmatic Chaos, a merger of three earlier teams (BellKor, with Koren, Bell, and Volinsky; the Austrian team BigChaos; and the Canadian team Pragmatic Theory), was sitting at 10.09 percent. By the visible numbers, The Ensemble appeared to win.

Netflix awarded the prize on a different set. The hidden test set was what counted, and there both teams achieved an identical 10.06 percent improvement against Cinematch’s baseline. The tiebreaker was submission time. BellKor’s verified submission landed at 18:18:28 UTC on July 26, 2009, twenty minutes before The Ensemble’s.

Matrix factorization was a different scoring function from collaborative filtering’s similarity-weighted average. The latent vectors were dense, learned, and generalized in ways the raw user-item matrix could not. But the architectural shape did not change. Embed users and items as vectors. Score the pair with a function. Rank a fixed set. In the Prize formulation, the set was still implicit, still every unseen movie. The scoring was the only thing that had become different.

3. Third Era: Deep-Learning Recommendation

By the mid-2010s, the trajectory that had reshaped vision and language was arriving at recommendation. In June 2016, sixteen authors at Google posted “Wide & Deep Learning for Recommender Systems,” which was presented that September at the first Deep Learning for Recommender Systems workshop, co-located with that year’s RecSys conference. The first author was Heng-Tze Cheng. The application was Google Play, the company’s mobile app store, where the team had been working on improving app recommendations for over a billion active users.

Also at RecSys 2016, Paul Covington, Jay Adams, and Emre Sargin published “Deep Neural Networks for YouTube Recommendations”. Their paper was unusual for the era in being explicit about the deployed architecture. The paper, they wrote, was “split according to the classic two-stage information retrieval dichotomy: first, we detail a deep candidate generation model and then describe a separate deep ranking model.” The two-stage cascade, retrieval followed by ranking, was already canonical enough in industrial recommendation that the YouTube team described it as classic by 2016.

The pattern across the early deep-recommendation work was consistent. Replace the dot product or the similarity weighting with a neural network. Add embeddings for context, time, sequence, and side information. Add more side information. Add attention.

In August 2018, ten authors at Alibaba published “Deep Interest Network for Click-Through Rate Prediction” at KDD. The first author was Guorui Zhou. The lineup included Xiaoqiang Zhu, Han Li, and Kun Gai of the company’s display advertising group. The application was Taobao’s display advertising system, a close relative of general recommendation rather than the same problem. DIN’s contribution was an attention mechanism that let the model decide, given a candidate advertisement, which parts of a user’s past behavior were relevant. Different ads triggered different weightings over the same history. The architecture was deployed to Alibaba’s main traffic and became a reference point that nearly every subsequent industrial CTR and recommendation paper built on.
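The idea that different candidates reweight the same history can be illustrated with a toy example. This is a sketch under loud assumptions: relevance here is a plain dot product followed by a softmax, standing in for the learned relevance function a production model would use, and the embeddings and the name `attention_pool` are hypothetical. It should not be read as DIN’s actual activation unit.

```python
from math import exp

def attention_pool(history, candidate):
    """Weight each past-behavior vector by its relevance to the candidate.

    Relevance is a dot product followed by a softmax, a stand-in
    (assumed for illustration) for a learned relevance function.
    """
    scores = [sum(h * c for h, c in zip(vec, candidate)) for vec in history]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(candidate)
    # The pooled vector is the user's history as seen from this candidate.
    pooled = [sum(w * vec[d] for w, vec in zip(weights, history))
              for d in range(dim)]
    return weights, pooled

# Hypothetical 2-d behavior embeddings: two "sports" clicks, one "books" click.
history = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
sports_ad = [2.0, 0.0]
books_ad = [0.0, 2.0]

w_sports, user_vec_for_sports = attention_pool(history, sports_ad)
w_books, user_vec_for_books = attention_pool(history, books_ad)
```

The same three-click history produces two different user representations: the sports ad puts most of its weight on the first two clicks, the books ad on the third. The candidate is still an input to a scoring function, which is the point the rest of this section turns on.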

One year later, in August 2019, the same team published a follow-up at KDD. Qi Pi and Weijie Bian were co-first authors. Guorui Zhou was corresponding. The title was “Practice on Long Sequential User Behavior Modeling for Click-Through Rate Prediction,” and the system was named MIMN. The paper’s stated contribution was infrastructure. MIMN was “one of the first industrial solutions that are capable of handling long sequential user behavior data with length scaling up to thousands.” It bought the ability to model histories of a thousand interactions in production. It did not buy a corresponding order-of-magnitude improvement in standard CTR metrics on the same sequence lengths.

This is where the plateau started to become visible to practitioners. By the late 2010s, architectural innovations in deep recommendation were buying either longer sequences, like MIMN’s thousand-length input, or better serving efficiency, like MIMN’s decoupled user-interest server. They were not buying dramatic AUC lifts of the kind the first deep-learning wave had produced in 2016. The plateau was felt years before it was characterized. When Meta’s HSTU team formally documented in 2024 that most production deep-learning recommendation models “fail to scale with compute” and demonstrated that the generative reformulation did obey power-law scaling, they were quantifying something the industry had been navigating in practice since roughly 2019.

The three architectures, Wide & Deep, YouTube DNN, and DIN with its successor MIMN, were elaborate. They handled billions of users and billions of items. They moved revenue at a scale no earlier AI system had reached. But the architectural shape did not change. Embed users and items into representations that were now learned end-to-end with side information. Score user-item pairs with a deep network that could be arbitrarily complex. Rank a fixed candidate set, which had itself become explicit as a separate retrieval stage in the cascade. The set was no longer implicit. The scoring was no longer simple. The framework was unchanged.

4. The Architectural Identity

Place the three eras side by side and the identity becomes visible.

Collaborative filtering, in GroupLens 1994, mapped users and items to positions in an interaction matrix. Similarity between users became the scoring weight. The output was a ranked list of unseen items.

Matrix factorization, in the Netflix Prize era, mapped users and items to dense latent vectors. The dot product became the scoring function. The output was a ranked list of unseen items.

Deep-learning recommendation, from Wide & Deep onward, mapped users and items to embeddings learned end-to-end with side information. A deep network became the scoring function. The output was a ranked list of items drawn from a retrieved candidate set.

The shared pattern has three components.

The first is embedding. Users and items become vectors. The vectors get more sophisticated across eras. They start as rows of an interaction matrix, become dense learned latent factors, then become end-to-end-trained embeddings carrying side information about context, time, and sequence. The sophistication varies. The fact that they are vectors does not.

The second is scoring. A function maps a user-item pair to a real number. The function gets more sophisticated across eras. Weighted average of neighbor ratings, dot product of latent vectors, attention-mediated deep network. The sophistication varies. The fact that the function takes a pair to a number does not.

The third is ranking over a set. The output is a sorted list drawn from a set of items the user has not yet interacted with. The set gets larger and more dynamic across eras: every article on Usenet, every movie in the Netflix catalogue, every product retrieved by the candidate-generation stage of a cascade pipeline. It also gets more explicit. In collaborative filtering it lived only as “the items the user has not seen,” with no separate code path. By deep-learning recommendation, it had become a distinct retrieval stage with its own architecture, models, and latency budget. The size varies. The fact that the output is a ranking over a set does not.

The three components compose into a single architectural object. Embed, score, rank a fixed set. This object is the architecture of all three eras.

The deepest constraint sits inside the third component. Recommendation, across thirty years, has been what happens after a candidate set has been built. The choice of items that go into the set has been treated, by every era of the field, as someone else’s problem. That treatment has been silent. It has not been argued. It has been assumed.

5. Why It Worked

The discriminative paradigm was not a mistake. It was the rational architecture given the constraints of its era.

The cascade structure that became canonical, retrieval to coarse ranking to fine ranking to reranking, emerged because exhaustive scoring at production scale was not an option. With a billion users and a billion items, scoring every pair would have required a quintillion (10^18) forward passes per refresh cycle. So engineers retrieved a small candidate set with a cheap method, scored it with a more expensive one, and ranked the top survivors with the most expensive one. The candidate-set assumption was not theoretical. It was a latency budget and a compute budget made architectural.
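A two-stage version of the cascade can be sketched as follows. Everything here is a hypothetical stand-in: the catalog, the user vector, and the two scorers (`cheap_score`, `expensive_score`) are toy assumptions chosen only to show the budget logic, not any production system’s stages.

```python
def cheap_score(user, item):
    """Stage 1 stand-in: a single-feature match, cheap enough for every item."""
    return user[0] * item[0]

def expensive_score(user, item):
    """Stage 2 stand-in: a full dot product, run only on the survivors."""
    return sum(u * i for u, i in zip(user, item))

def cascade(user, catalog, n_retrieve, k):
    """Retrieve n_retrieve candidates cheaply, then rank them with the costly scorer."""
    candidates = sorted(catalog,
                        key=lambda name: cheap_score(user, catalog[name]),
                        reverse=True)[:n_retrieve]
    ranked = sorted(candidates,
                    key=lambda name: expensive_score(user, catalog[name]),
                    reverse=True)
    # Return the top k and how many items the expensive scorer had to touch.
    return ranked[:k], len(candidates)

# Hypothetical 3-feature item vectors.
catalog = {
    "i1": [0.9, 0.1, 0.0],
    "i2": [0.8, 0.9, 0.7],
    "i3": [0.1, 0.9, 0.9],
    "i4": [0.7, 0.0, 0.1],
    "i5": [0.0, 0.2, 0.3],
}
user = [1.0, 0.5, 0.5]
top, n_expensive_calls = cascade(user, catalog, n_retrieve=3, k=2)
```

The expensive scorer runs on three items instead of five, which is the whole economic argument at toy scale. The sketch also shows the cost: `i3` would rank second under the expensive scorer, but the cheap stage drops it, so the ranker never sees it. Whatever the candidate set excludes is invisible to everything downstream.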

The architecture worked. By the early 2020s, recommendation was the largest deployment of applied machine learning anywhere in the economy. Every video on TikTok, every video on YouTube, every product on Amazon, every track on Spotify, every post on Instagram, every ad placement across the global advertising industry was being served by some version of the embed-score-rank-fixed-set pattern. The discriminative paradigm built the attention economy.

What changes the calculus is that the constraints are changing. When the deep-learning era hit the plateau in the late 2010s, the rational response was still to refine the cascade. Better retrieval, better candidate sets, longer sequences, more side information. When the HSTU team and the OneRec team showed in the mid-2020s that the generative reformulation broke the plateau and obeyed scaling laws, refining the cascade stopped being the obvious answer. The candidate-set assumption was not wrong under the old constraints. It is starting to look constraining under the new ones.

The discriminative paradigm is not being displaced because anyone proved it false. It is being displaced because the constraints that made it rational have shifted, and three independent forces have shifted them at the same time. Any account of generative recommendation that treats the discriminative paradigm as a mistake is missing the actual story.

6. The Definitional Pivot

Generative recommendation reframes retrieval. Instead of scoring a candidate set, the model directly generates the identifier of the next item the user is likely to want.

The model treats item identifiers, or compressed representations of them called Semantic IDs, as tokens. It learns, from sequences of user-item interactions, to predict the next token autoregressively. The candidate-set step is not a separate stage feeding into a scoring stage. The candidate is the thing being generated.

This is what distinguishes generative recommendation from what is sometimes called LLM-based recommendation but is in fact LLM-augmented discriminative recommendation. A pipeline that uses an LLM to rewrite user queries or to generate richer item descriptions, while still scoring and ranking a retrieved candidate set, is not generative recommendation in the sense this series uses the term. It is the discriminative architecture with better text processing. The test of whether a system is generative is whether the retrieval process is modeled as sequence generation.

Two consequences follow.

The first is representational. Items can no longer remain arbitrary database rows. To be tokens, they have to be embedded in a vocabulary the model can autoregressively traverse. This is what Semantic ID, introduced by the TIGER paper in 2023 and developed by the LETTER, TokenRec, and ActionPiece lines of work since, makes concrete. An item becomes a short sequence of discrete codes from a learned codebook, where codes carry semantic structure that lets the model generalize across items in a way it cannot generalize across arbitrary row IDs.
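One way to picture an item becoming a short sequence of codes is residual quantization: pick the nearest entry in a first codebook, subtract it, and quantize what is left with the next codebook. The sketch below assumes that scheme for illustration. The codebooks are hand-written rather than learned, and the item vectors and the names `nearest` and `semantic_id` are hypothetical.

```python
def nearest(codebook, vec):
    """Index of the codebook entry closest to vec in squared Euclidean distance."""
    def dist(entry):
        return sum((a - b) ** 2 for a, b in zip(entry, vec))
    return min(range(len(codebook)), key=lambda idx: dist(codebook[idx]))

def semantic_id(vec, codebooks):
    """Quantize vec level by level; each level encodes the residual left by the last."""
    codes, residual = [], list(vec)
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return tuple(codes)

# Hypothetical hand-written codebooks; in practice these would be learned.
codebooks = [
    [[1.0, 0.0], [0.0, 1.0]],               # level 0: coarse direction
    [[0.1, 0.0], [-0.1, 0.0], [0.0, 0.1]],  # level 1: fine correction
]

# Hypothetical 2-d content embeddings for three items.
items = {
    "running_shoes": [1.1, 0.0],
    "trail_shoes": [0.9, 0.0],
    "cookbook": [0.0, 1.1],
}
ids = {name: semantic_id(vec, codebooks) for name, vec in items.items()}
# ids == {"running_shoes": (0, 0), "trail_shoes": (0, 1), "cookbook": (1, 2)}
```

The two shoe items share their first code and differ only in the second, while the cookbook starts with a different code entirely. That shared prefix is the kind of structure an arbitrary row ID cannot carry: what the model learns about code 0 at the first position applies to both shoes at once.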

The second consequence is architectural. The three-component pattern that organized the discriminative paradigm collapses. There is no separate embed stage that hands off to a separate score stage that hands off to a separate rank stage. Generation does all three at once. The token sequence the model produces is the recommendation, and the act of decoding it, one code at a time, has folded the embedding, the scoring, and the ranking into a single model and a single decoding process.
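To make the collapse concrete, here is a toy decoding loop. It is a sketch, not any published system’s decoder: the probability table `NEXT_CODE_PROBS` is a hypothetical stand-in for a trained sequence model conditioned on one user’s history, and beam search is assumed as the decoding strategy.

```python
from math import log

# Hypothetical next-code probabilities for one user, keyed by the prefix so far.
# A trained sequence model conditioned on the user's history would supply these.
NEXT_CODE_PROBS = {
    (): {0: 0.7, 1: 0.3},
    (0,): {0: 0.6, 1: 0.4},
    (1,): {0: 0.1, 1: 0.9},
}

def beam_search(next_probs, length, beam_width):
    """Generate the most likely code sequences of a given length, best first."""
    beams = [((), 0.0)]  # (prefix, log-probability)
    for _ in range(length):
        # Extend every surviving prefix by every possible next code.
        expanded = [
            (prefix + (code,), logp + log(p))
            for prefix, logp in beams
            for code, p in next_probs[prefix].items()
        ]
        expanded.sort(key=lambda pair: pair[1], reverse=True)
        beams = expanded[:beam_width]
    return beams

ranked = beam_search(NEXT_CODE_PROBS, length=2, beam_width=3)
recommended_ids = [prefix for prefix, _ in ranked]
# recommended_ids == [(0, 0), (0, 1), (1, 1)]
```

Nothing in this loop takes a candidate list as input. The ranked list of item codes is a by-product of decoding: sequence probabilities play the role of scores, and the beam ordering plays the role of the ranking.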

The cascade that organized recommendation engineering for thirty years was the implementation strategy that made the embed-score-rank-set pattern tractable at scale. If the pattern itself collapses, the cascade has nothing to organize.

7. The Silent Constraint

The architectural identity across three eras is the real story of pre-generative recommendation.

Each era refined the scoring of a candidate set. Each one looked, from outside, like a revolution. Collaborative filtering broke the categorical-classification assumption. Matrix factorization broke the high-dimensional-representation assumption. Deep learning broke the linearity assumption. None of the three broke the candidate-set assumption. That one stayed silent through three breakthroughs, three industrial deployments, and thirty years of progress.

It stayed silent because the engineering constraints of its era made it rational. At billion-user, billion-item scale, the cost of touching every pair forced the field into the cascade architecture. The cascade made the candidate set into the system’s central organizational fact. Each era refined the structure built on top of that fact. None of them questioned the fact itself.

That is what is now changing. The candidate-set assumption was the silent constraint that organized recommendation for thirty years. Three eras refined the scoring of a set they never questioned. The next paradigm starts by questioning the set itself.

Generative Recommendation. Every feed, every ad, every recommendation engine on the internet is being rebuilt. This is the field’s first paradigm rupture in thirty years.

Robonaissance
