Did you ever get the feeling that Amazon understands your desires better than your spouse? Did you ever search Google for a vacation in Galicia and then notice more ads trying to sell you seafood and raincoats? This is the work of recommender systems, brought to you by the phenomenon of Big Data.
In today’s world, businesses collect an amazing amount of data on their customers and use it to anticipate what they can sell them next. In days gone by, an econometrician would have called this estimating your utility curve, utility being how much you value a certain product; but they had nowhere near as much data to work with as today’s data scientists, nor the plethora of algorithms now available.
This type of personal experience is essential to today’s internet companies. Jeff Bezos of Amazon explains:
If I have 4.5 million customers on the web, I should have 4.5 million stores on the web.
This is what is called long-tail marketing: catering not just to the most popular tastes, but also selling to many clients with very particular ones. Yesterday’s companies reached their customers through mass marketing, whereas today’s companies reach them through mass customization.
A recommender system doesn’t have to be as complex as Amazon’s to be personalized and effective. At BBVA D&A we realized that by looking at relatively few variables we could make more appropriate recommendations for pension plans to BBVA clients. The old way of doing mass mailings doesn’t make sense when a bank has the means to understand which clients have different abilities to save. BBVA decided to target and group clients who were actually in a good position to save, and recommend an amount that made sense to deposit in a tax-advantaged plan given recent trends in each client’s behavior.
What was important to analyze was not just a snapshot of the clients’ savings at the end of the year, but the trend from month to month and how volatile that trend was. For volatility, we measured the ratio of the change in the minimum balance to the change in the average balance: a higher ratio indicated that the client’s savings were more uncertain.
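As a rough sketch of these two measures (the exact metric definitions are internal, so the function names and formulas below are assumptions, not BBVA’s production code), the trend can be taken as the least-squares slope of the monthly balances, and the volatility as one plausible reading of the ratio just described:

```python
def savings_trend(balances):
    """Least-squares slope of monthly balances over time: positive means saving."""
    n = len(balances)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(balances) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, balances))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var

def volatility_ratio(min_balances, avg_balances):
    """Spread of the minimum balance relative to the spread of the average
    balance -- one plausible reading of the ratio described above.
    A higher ratio suggests a less certain savings pattern."""
    d_min = max(min_balances) - min(min_balances)
    d_avg = max(avg_balances) - min(avg_balances)
    return d_min / d_avg if d_avg else float("inf")
```

A client with steadily rising balances and a low ratio would be a natural candidate for the trial group.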
We had to examine a time period long enough to establish a trend but short enough to be relevant to the client’s current situation, so we chose seven months’ worth of data. Along with a positive savings trend and low uncertainty, a few other conditions, such as still having an income and having an email account, were required before a client was added to the trial. Roughly 115,000 clients were chosen for the trial group, which would be compared against a control group.
Personalization through clustering
To personalize the recommended saving amount, we grouped the clients by saving capacity, volatility and cyclical behavior. We clustered the clients using a Gaussian Mixture Model with the R package Mclust, and used Hive (MapReduce) for the time-series preprocessing. The algorithm categorized clients into nine clusters. The functions we used from the Mclust package do highly refined clustering and are therefore demanding in terms of computing power. With 115,000 clients to classify, using Mclust was perfectly feasible, but the algorithm would have been impractical on a project with millions of users and tens of millions of items without throwing more resources at the problem.
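Mclust is an R package, but the mechanics it relies on, expectation-maximization over a Gaussian mixture, can be illustrated with a deliberately tiny one-dimensional, two-component sketch in Python. Mclust fits far richer models and selects the number of components automatically; this only shows the E- and M-steps:

```python
import math

def gmm_em_1d(data, iters=60):
    """Minimal EM for a two-component 1-D Gaussian mixture.
    Returns (weights, means, stds) after `iters` EM iterations."""
    means = [min(data), max(data)]   # crude deterministic initialisation
    stds = [1.0, 1.0]
    weights = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each component for each point
        resp = []
        for x in data:
            dens = [w / (s * math.sqrt(2 * math.pi))
                    * math.exp(-((x - m) ** 2) / (2 * s * s))
                    for w, m, s in zip(weights, means, stds)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means and standard deviations
        for j in (0, 1):
            rj = sum(r[j] for r in resp)
            weights[j] = rj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / rj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / rj
            stds[j] = max(math.sqrt(var), 1e-6)   # floor to avoid collapse
    return weights, means, stds
```

On two well-separated groups of balances, the two fitted means settle on the group centres, which is exactly the “segment” a client is then assigned to.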
Since many pension plans come with tax advantages, there is a specific time of year when you can catch a saver’s attention. The team had one month to carry out the analysis and make recommendations. The targeted advertisement was sent throughout Spain early in 2016. Initial analysis indicated 60% more contracts issued in the target group than in the control group, although a review is still ongoing to determine how much of the difference can be attributed to the targeted offerings.
Types, advantages, disadvantages and hybrid models
With so many businesses rushing to adopt recommender systems, they are evolving quickly, and not everyone uses the same taxonomy to describe them. But a rough consensus has begun to form. At the broadest level, recommender systems fall into two groups:
- Content-based systems, which evaluate the items a user has bought and try to suggest similar items.
- Collaborative filtering systems, which recommend items rated highly by other users with similar tastes.
With a content-based system, a movie streaming service might notice that you liked “Men in Black”, directed by Barry Sonnenfeld, and suggest “Get Shorty” because he directed that too; or it might notice that you rented a film with Will Smith and recommend another of his films.
You start by building a matrix of clients and the items they have purchased or liked, and then use a profile of each item’s attributes. These attributes may be binary, such as whether or not the movie is an action film, or scalar, such as a rating from 1 to 5 stars. Many quantitative techniques can be used to interpret this data, including Bayesian classifiers, clustering, decision trees, neural networks and, when recommending news articles, text analytics.
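To make the idea concrete, here is a toy content-based recommender over the movie example above; the catalogue, attribute names and binary values are all invented for illustration:

```python
import math

# Invented binary attribute vectors: director, lead actor, two genres.
FEATURES = ["dir_sonnenfeld", "actor_will_smith", "scifi", "comedy"]
ITEMS = {
    "Men in Black": [1, 1, 1, 1],
    "Get Shorty":   [1, 0, 0, 1],
    "Blade Runner": [0, 0, 1, 0],
}

def cosine(a, b):
    """Cosine similarity between two attribute vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

def recommend(liked, candidates):
    """Average the liked items' attributes into a user profile, then
    return the candidate item most similar to that profile."""
    profile = [sum(ITEMS[i][j] for i in liked) / len(liked)
               for j in range(len(FEATURES))]
    return max(candidates, key=lambda c: cosine(profile, ITEMS[c]))
```

A user who liked “Men in Black” shares both the director and the comedy genre with “Get Shorty”, so that wins over “Blade Runner”, which matches only on sci-fi.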
A disadvantage of this approach is that it can be limited in scope: you can only recommend items similar to those the user has already seen, so you won’t help the user broaden their tastes. In the extreme case of a brand-new user with no history, this is called the “cold start” or “new user” problem. Since you rely on historical data, it takes time to get a system like this working.
An advantage is that the user-item matrix can contain relatively little data and still give reliable recommendations, so once you get over the initial hump, it works well.
Collaborative filtering predicts preferences another way: for each user, build a vector of the items they have rated positively or negatively, and when one user’s vector is similar to another’s, recommend the items the similar user has rated highly but the first user has not bought. Ratings can be explicit, such as an Amazon review, or inferred, such as liking a news article or frequently viewing articles of a certain type.
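A minimal sketch of that vector comparison, with an invented ratings table (+1 liked, −1 disliked), might look like:

```python
# Invented ratings: +1 means liked, -1 means disliked; missing = not rated.
ratings = {
    "ana":   {"fund_a": 1, "fund_b": 1, "fund_c": -1},
    "ben":   {"fund_a": 1, "fund_b": 1, "fund_d": 1},
    "carla": {"fund_c": 1, "fund_d": -1},
}

def similarity(u, v):
    """Agreement on co-rated items, scaled to [-1, 1]."""
    shared = set(ratings[u]) & set(ratings[v])
    if not shared:
        return 0.0
    return sum(ratings[u][i] * ratings[v][i] for i in shared) / len(shared)

def recommend(user):
    """Items the most similar other user liked but `user` has not rated."""
    others = [v for v in ratings if v != user]
    nearest = max(others, key=lambda v: similarity(user, v))
    return [i for i, r in ratings[nearest].items()
            if r > 0 and i not in ratings[user]]
```

Here “ana” and “ben” agree on everything they both rated, so ana is recommended the one item ben liked that she hasn’t bought.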
The advantage of this approach is you need very little understanding of the item being recommended. You might even introduce the user to something new.
The disadvantages are that you need a lot of data to make this accurate, and as your data grows you need massive amounts of computing power: the calculations grow on the order of the number of clients times the number of items. If you are like Spotify, a streaming music site with millions of users and millions of items, this quickly becomes impractical.
Real-time vs. Resources and best-of-breed solutions
With either method, a service that needs to work in real time can resort to clustering techniques to speed up the search: by reducing the users or items to similar segments, you turn the problem into one of classification. Clustering algorithms have become popular because they deliver better online performance. The trade-off is that the more refined the segments, the more computing power is needed, while the less refined the segments, the less accurate the recommendations.
Given the advantages and disadvantages of each method, many companies choose neither one purely, but build a hybrid model that is the best-of-breed solution for their particular business. Amazon has built a hybrid it calls “Item-to-Item Collaborative Filtering”. Rather than matching similar customers, it matches each of a client’s purchases against other items clustered together by rating, then combines the lists. Building the rated similar-item table takes far more computing time than the other methods, so it is done offline; in real time, only the most highly correlated items are returned.
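The offline/online split behind the item-to-item idea can be sketched as follows; the purchase baskets are invented, and a real table would keep the top-N neighbours per item (with a proper similarity score) rather than the single most co-purchased one:

```python
from itertools import combinations
from collections import defaultdict

# Invented purchase baskets used to build the offline similar-item table.
baskets = [
    {"camera", "tripod", "sd_card"},
    {"camera", "sd_card"},
    {"tripod", "lens"},
]

def build_similar_items(baskets):
    """Offline step: count co-purchases, keep the best neighbour per item."""
    co = defaultdict(lambda: defaultdict(int))
    for basket in baskets:
        for a, b in combinations(sorted(basket), 2):
            co[a][b] += 1
            co[b][a] += 1
    return {item: max(nbrs, key=nbrs.get) for item, nbrs in co.items()}

# Online step: a recommendation is just a table lookup, which is fast.
table = build_similar_items(baskets)
```

All the expensive counting happens in `build_similar_items`; serving a recommendation is a constant-time dictionary lookup, which is why the heavy step can run offline.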
Building a hybrid model with BBVA data
In the future, most recommender systems that BBVA implements will need to handle millions of items rather than the 115,000 of the PPI recommender, and as we have mentioned, purely content-based and purely collaborative systems both fall short of essential requirements.
Currently we build recommender systems on the best practices of traditional methods, but through a process of continuous improvement we are researching machine learning techniques for future versions.
Some of BBVA’s recommender systems depend on human-defined filters built with Java and SQL and executed on an Amazon Redshift cloud data warehouse. The next generation of these applications could be quite different.
Indeed, we want to reach customers via BBVA’s mobile app Wallet, SMS or email. We have the raw data needed to identify these individuals through their purchase history with BBVA bank cards. The challenge is converting this raw data into usable and accurate information, and to that end a hybrid model is under development. Some content-based evaluation is needed, but collaborative filtering is decisive in our approach. The process is shaping up to have the following high-level steps:
- Develop the user-item profile
- Create feature vectors and similarity matrices of merchants
- Generate meaningful clusters of merchants
- Apply the collaborative-filtering model to generate a list of interested customers for a business’s offering
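The first step, developing the user-item profile, could be sketched like this, using invented card transactions and simple visit counts as implicit feedback (the real pipeline runs over Spark, and the field names are assumptions):

```python
from collections import defaultdict

# Hypothetical card transactions: (client, merchant, amount).
transactions = [
    ("c1", "grocer", 30.0), ("c1", "grocer", 12.5), ("c1", "cinema", 9.0),
    ("c2", "cinema", 9.0),  ("c2", "bookshop", 20.0),
]

def user_item_profile(transactions):
    """Aggregate transactions into a client x merchant visit-count matrix;
    the counts serve as implicit feedback about preferences."""
    profile = defaultdict(lambda: defaultdict(int))
    for client, merchant, _amount in transactions:
        profile[client][merchant] += 1
    return {c: dict(m) for c, m in profile.items()}
```

This sparse dictionary-of-dictionaries is the matrix that the later clustering and collaborative-filtering steps consume.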
Our dataset is a history of each client’s purchases, which gives us implicit feedback about the client’s preferences. Although the preferences are implicit, the way people spend money usually reveals more about how they really feel than an explicit “like” button or a three-star rating. People tend to be more honest with themselves when money is involved than in any other activity.
The project started evaluating Python Pandas but soon turned to Spark as the appropriate tool for the job. PySpark is used for prototyping but Scala and Spark are used when it is necessary to improve performance for production.
Graphs and Clusters from BBVA data
Using Spark’s machine learning library (MLlib), we have experimented with various algorithms to see which are appropriate for analyzing the dataset. For content-based evaluation, the FP-growth (Frequent Pattern growth) algorithm is frequently used in market-basket analysis, so it seemed a promising place to start for our clustering model. FP-growth treats each transaction as a set of items (merchants, in our case) and looks for common patterns. It scales to large datasets by restricting itself to frequent itemsets rather than generating all candidate sets. Once the merchant sets are generated with Spark, we use Python’s igraph library to produce a graph structure suitable for cluster analysis. The clusters are made up of related nodes on the graph, and these sets of nodes are commonly referred to as communities. To find them, the igraph library provides various implementations that we have tried, such as edge-betweenness, walktrap, spinglass, fastgreedy and multilevel. Encouragingly, these algorithms all produce similar clusters.
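FP-growth itself runs in Spark MLlib, so the sketch below only brute-forces the kind of output it produces, frequent merchant pairs, and then groups them into communities. Plain connected components stand in here for igraph’s far more sophisticated community-detection algorithms, and all merchant names are invented:

```python
from itertools import combinations
from collections import Counter

# Invented transaction baskets of merchants visited together.
baskets = [
    {"grocer", "bakery"}, {"grocer", "bakery"}, {"grocer", "bakery", "cafe"},
    {"gym", "pharmacy"}, {"gym", "pharmacy"},
]

def frequent_pairs(baskets, min_support=2):
    """Brute-force stand-in for FP-growth: merchant pairs that co-occur
    in at least `min_support` baskets."""
    counts = Counter(frozenset(p) for b in baskets
                     for p in combinations(sorted(b), 2))
    return [set(p) for p, c in counts.items() if c >= min_support]

def communities(pairs):
    """Connected components of the merchant graph built from frequent
    pairs (a crude stand-in for igraph community detection)."""
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x
    for a, b in (tuple(p) for p in pairs):
        parent[find(a)] = find(b)
    groups = {}
    for node in parent:
        groups.setdefault(find(node), set()).add(node)
    return list(groups.values())
```

With this data, food merchants and health merchants separate into two communities, the kind of grouping the real pipeline feeds into collaborative filtering.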
Here is an example of how the cluster is visualized.
Learning ALS-WR from Netflix
Once the clusters are formed, they constitute the input for collaborative filtering, for which the project is investigating various ways of doing matrix factorization. In recommender systems, matrix factorization is used to discover the latent features influencing the interactions between two kinds of entities, such as businesses and potential customers. The project considered several implementations, such as MLlib’s singular value decomposition (SVD++). The most promising algorithm we are currently investigating is alternating least squares with weighted-λ-regularization (ALS-WR). As mentioned earlier, collaborative filtering needs a lot of data, yet many datasets are sparsely populated, and it demands considerable computing power as the data grows. The authors of the algorithm, in their paper “Large-Scale Parallel Collaborative Filtering for the Netflix Prize”, say these problems are greatly ameliorated because “the performance of ALS-WR (in terms of root mean squared error (RMSE)) monotonically improves with both the number of features and the number of ALS iterations.”
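For intuition, here is a compact ALS-WR loop on a toy rating matrix in plain NumPy, where 0 marks an unobserved rating; the production candidate would be Spark MLlib’s ALS, and the rank, λ and matrix below are arbitrary illustrative choices:

```python
import numpy as np

# Toy user x item ratings; 0.0 means "not observed".
R = np.array([[5.0, 3.0, 0.0],
              [4.0, 0.0, 1.0],
              [0.0, 1.0, 5.0]])
mask = R > 0
k, lam, iters = 2, 0.1, 30          # rank, regularization, ALS sweeps
rng = np.random.default_rng(0)
U = rng.random((R.shape[0], k))     # user latent factors
V = rng.random((R.shape[1], k))     # item latent factors

for _ in range(iters):
    # Fix V and solve a small ridge regression per user, then the converse.
    for u in range(R.shape[0]):
        Vu = V[mask[u]]
        n_u = len(Vu)               # weighted-λ: scale by the rating count
        U[u] = np.linalg.solve(Vu.T @ Vu + lam * n_u * np.eye(k),
                               Vu.T @ R[u, mask[u]])
    for i in range(R.shape[1]):
        Ui = U[mask[:, i]]
        n_i = len(Ui)
        V[i] = np.linalg.solve(Ui.T @ Ui + lam * n_i * np.eye(k),
                               Ui.T @ R[mask[:, i], i])

pred = U @ V.T                      # pred[u, i] estimates the missing ratings
```

Each half-step is a closed-form least-squares solve, which is what makes ALS easy to parallelize across users and items, the property the Netflix-prize paper exploits.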
Feedback on the current system is positive, but when the replacement project reaches the testing phase, we expect to see performance improve further. There’s an arms race in this industry, hence nothing is ever “good enough.”
The million dollar algorithm
So how much is this worth to long-tail retailers? Well over a million dollars for a mere 10% better algorithm, if you ask Netflix. From 2006 to 2009, Netflix ran a contest to see if any team could improve on its proprietary algorithm, “Cinematch”, by 10%. Netflix released a training set with 100 million ratings and a test set with 3 million ratings. When building a recommender model you must build the model with one set of data and test it with another; otherwise you might only be reverse-engineering a model to fit one static dataset. Only against a fresh test set can you see whether the model handles new situations. Netflix awarded progress prizes of $50,000 for each 1% improvement, or for the best progress in a year. At first, progress came swiftly: a 1% improvement was submitted in less than two weeks, and by 2007 a team named BellKor had submitted an 8.43% improvement. But diminishing marginal returns set in, and every incremental improvement took more effort. Finally, on 26 June 2009, the 10% barrier was broken, and all teams had 30 days to submit their best and final effort. At the deadline, BellKor had submitted a 10.09% improvement, less than 0.01% ahead of the second-place team. Netflix now reports that 75% of what its clients watch comes from recommendations.
Today there is a website dedicated to such data science contests, Kaggle.com, where companies and organizations post prizes for the best algorithms. But if you want to win, you had better know your profession and work hard. Like everything in internet commerce, the competition is fierce.