Shopify - E-Commerce
Driving 28% of Revenue Through AI Recommendations
How we built a recommendation engine for Shopify's merchant ecosystem that drives over a quarter of total revenue through personalized product discovery.
Duration
10 weeks
Team
3 engineers, 1 ML engineer, 1 PM
Tech Stack
Python, FastAPI, AWS (EMR, SageMaker), Spark, Kafka Streams, Redis, Milvus, CLIP, sentence-transformers
The Challenge
Shopify's merchant ecosystem spans millions of active products across home goods, electronics, and apparel. Their existing recommendation system was a basic collaborative filtering model built three years prior - it powered a single "Customers Also Bought" widget on the product detail page. Conversion rates across their merchant network had stagnated at 2.1%, average order value was flat at $67, and customer research showed that 43% of shoppers reported difficulty finding products relevant to their taste. The catalog was simply too large for users to browse effectively, and the recommendation system wasn't helping.
Shopify's data team had attempted to improve the system twice. The first attempt replaced the collaborative filter with a matrix factorization model (ALS), which improved accuracy on offline metrics but had no measurable impact on revenue because it still only powered the single product-page widget - most users never reached that page for relevant products. The second attempt integrated a third-party recommendation API that promised real-time personalization, but the vendor's model treated Shopify's catalog as a black box, couldn't incorporate proprietary signals (loyalty tier, return history, seasonal buying patterns), and charged per API call at a rate that made it cost-prohibitive at scale. After three months, the integration was abandoned.
Our Approach
We began with a four-week data exploration phase, analyzing 14 months of clickstream data (3.2 billion events), purchase history, product catalog metadata, and return data. Two findings shaped our architecture. First, Shopify's merchant catalog had a severe long-tail problem: 80% of revenue came from 6% of products, while the remaining 94% of products had sparse interaction data, making collaborative filtering unreliable for the majority of the catalog. Second, user intent varied dramatically by entry point - a user arriving via a Google Shopping ad had very different browsing patterns than a returning loyalty customer.
We designed a hybrid architecture with three model components:
1. A two-tower neural collaborative filtering model (user tower + item tower) trained on implicit feedback signals: views, add-to-cart events, purchases, and returns as a negative signal.
2. A content-based model using product image embeddings (CLIP) and text embeddings (sentence-transformers) from product titles, descriptions, and attributes, handling the cold-start problem for new and long-tail products.
3. A contextual bandit layer that selects the optimal recommendation strategy based on real-time session context: entry source, device type, time of day, and browsing depth.
The three components' outputs are blended by the bandit, which continuously optimizes for the business metric (revenue per session) rather than click-through rate.
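To make the bandit layer concrete, here is a minimal sketch of how a contextual bandit could pick among strategies and learn from observed session revenue. The strategy names, epsilon value, and coarse context buckets are illustrative assumptions, not Shopify's actual implementation (the production system reportedly blends the three components rather than picking just one).

```python
import random
from collections import defaultdict

# Hypothetical strategy labels standing in for the three model components.
STRATEGIES = ["two_tower", "content_based", "blended"]

class EpsilonGreedyBandit:
    """Per-context epsilon-greedy bandit optimizing revenue per session.

    A context is a coarse bucket (entry source, device type); the reward
    is the session revenue observed after serving a strategy's results.
    """

    def __init__(self, epsilon=0.1):
        self.epsilon = epsilon
        self.counts = defaultdict(lambda: {s: 0 for s in STRATEGIES})
        self.revenue = defaultdict(lambda: {s: 0.0 for s in STRATEGIES})

    def select(self, context):
        if random.random() < self.epsilon:
            return random.choice(STRATEGIES)  # explore
        counts = self.counts[context]
        # Exploit: pick the highest mean revenue per session observed so far.
        def mean_revenue(s):
            return self.revenue[context][s] / counts[s] if counts[s] else 0.0
        return max(STRATEGIES, key=mean_revenue)

    def update(self, context, strategy, session_revenue):
        self.counts[context][strategy] += 1
        self.revenue[context][strategy] += session_revenue

bandit = EpsilonGreedyBandit(epsilon=0.1)
ctx = ("google_shopping", "mobile")  # entry source + device bucket
arm = bandit.select(ctx)
bandit.update(ctx, arm, session_revenue=42.50)
```

Optimizing the bandit on revenue per session rather than clicks, as described above, is what keeps the exploration from drifting toward click-bait recommendations.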
We also expanded the recommendation surface area from a single widget to six touchpoints: homepage personalization, category page re-ranking, product detail page ("You May Also Like" and "Complete the Look"), cart page cross-sells, post-purchase email recommendations, and search result boosting.
The Solution
The recommendation system runs on AWS with a real-time serving architecture. User and item embeddings are pre-computed nightly via a Spark job on EMR and stored in a Milvus vector database for fast nearest-neighbor retrieval. Real-time session features are computed via Kafka Streams and stored in Redis. The recommendation API (Python, FastAPI) assembles candidates from the vector store, applies the contextual bandit for ranking, and returns personalized results in under 50ms at p99.
Model training runs weekly on SageMaker using the latest interaction data, with automated A/B testing gates: a new model only graduates to production if it outperforms the incumbent on revenue per session by a statistically significant margin (p < 0.05) over a 72-hour holdout test. The product catalog pipeline automatically generates CLIP and text embeddings for new products within 15 minutes of catalog ingestion, eliminating cold-start delays.
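The graduation gate described above can be sketched as a one-sided two-sample test on revenue-per-session samples from the holdout. This is an illustrative assumption about the mechanics, not the production code: it uses a normal approximation to Welch's t statistic, which is reasonable when a 72-hour holdout yields thousands of sessions per arm.

```python
import math
import random
from statistics import mean, variance

def graduation_gate(incumbent, challenger, alpha=0.05):
    """Return True only if the challenger's mean revenue per session
    exceeds the incumbent's at the given significance level.

    One-sided Welch-style test; with large sample sizes the statistic
    is well approximated by a standard normal.
    """
    n1, n2 = len(incumbent), len(challenger)
    m1, m2 = mean(incumbent), mean(challenger)
    v1, v2 = variance(incumbent), variance(challenger)
    se = math.sqrt(v1 / n1 + v2 / n2)
    if se == 0:
        return False  # no variation observed; cannot establish a lift
    z = (m2 - m1) / se
    # One-sided p-value under the normal approximation.
    p = 0.5 * math.erfc(z / math.sqrt(2))
    return p < alpha

# Hypothetical holdout: the challenger averages higher revenue per session.
random.seed(7)
incumbent_sessions = [random.gauss(3.10, 1.0) for _ in range(5000)]
challenger_sessions = [random.gauss(3.25, 1.0) for _ in range(5000)]
print(graduation_gate(incumbent_sessions, challenger_sessions))
```

Gating on revenue per session over a fixed holdout window, rather than on offline accuracy, is what prevents a repeat of the earlier ALS experience, where offline gains never translated into revenue.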
Results
- 28% of total revenue now driven by recommendations (up from 11% baseline), measured over 6 months post-launch across all six recommendation touchpoints
- 22% increase in average order value ($67 to $81.70), driven primarily by the "Complete the Look" and cart cross-sell placements
- 35% improvement in product discovery - unique products viewed per session increased from 4.2 to 5.7, with long-tail product visibility improving by 3.1x
- Click-through rate on recommendations increased 4.2x (2.8% to 11.8%), with the homepage personalization module showing the highest lift
Key Insight
Expanding from one recommendation widget to six touchpoints drove more incremental revenue than improving model accuracy - the best model in the world doesn't matter if users never see its output.
“The ‘Complete the Look’ feature alone generates $2.3M in quarterly revenue that didn't exist before. But what really impressed our team was the A/B testing framework - every model update has to prove itself against the incumbent before it goes live. We've never had that discipline internally.”
Mikhail Parakhin
CTO at Shopify