A month ago, Spark Summit Europe, the world’s largest event for the Apache Spark™ community, was held in Dublin, bringing together more than 1,200 developers, engineers, data scientists, researchers and business professionals.
José A. Rodríguez, one of our data scientists, participated as a speaker with a talk on a use case of Spark in the banking industry. The talk, joint work with Luis Peinado carried out with the support of the Advisory and Predictive Models team, highlights some existing BBVA products built on Spark and focuses on our experiments with a Spark-based text classifier for bank transfers.
Thanks to this classifier, every money transfer a BBVA customer makes goes through an engine that infers a category from its textual description. The engine runs on Spark, combines MLlib with our own implementations, and is currently in production, serving more than 5 million customers daily. Throughout the presentation, José shared the team’s experiences and lessons learned from a data science standpoint, including the challenges the problem posed, some sketches of the current solution, and an experimental test using the well-known word2vec embeddings together with a technique called VLAD (vector of locally aggregated descriptors, inspired by computer vision).
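To give an intuition of the VLAD step, here is a minimal sketch (toy vectors and names are our own, not the production code): each word vector is assigned to its nearest codebook centroid, the residuals are accumulated per centroid, and the concatenated result is L2-normalized into a fixed-length descriptor that a classifier can consume.

```python
import numpy as np

def vlad(word_vectors, centroids):
    """Encode a bag of word vectors as a VLAD descriptor.

    Each vector is assigned to its nearest centroid; the residuals
    (vector - centroid) are summed per centroid, then the k * dim
    matrix is flattened and L2-normalized.
    """
    k, dim = centroids.shape
    desc = np.zeros((k, dim))
    for v in word_vectors:
        nearest = np.argmin(np.linalg.norm(centroids - v, axis=1))
        desc[nearest] += v - centroids[nearest]
    desc = desc.ravel()
    norm = np.linalg.norm(desc)
    return desc / norm if norm > 0 else desc

# Toy example: 2 centroids in 2-D, three "word" embeddings.
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])
words = np.array([[1.0, 0.0], [0.0, 1.0], [9.0, 10.0]])
descriptor = vlad(words, centroids)  # length k * dim = 4
```

In practice the embeddings would come from word2vec and the codebook from a clustering step such as k-means; the point of the encoding is that descriptions of any length map to vectors of the same size.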
The event had more than 70 sessions, with content for all levels and roles. Here we highlight the trends we found most relevant to us:
Libraries for the Spark ecosystem
As Spark adoption increases and the Spark core APIs stabilize, the community’s next task is to fill out the ecosystem with the libraries that are still missing. Here are some talks proposing and sharing libraries for missing features:
- VEGAS: The missing matplotlib for Spark/Scala (Netflix)
- Building Custom Machine Learning PipelineStages for Feature Selection (BMW)
- SparkNLP: Natural Language Understanding at Scale with Spark-Native NLP, Spark ML, and TensorFlow (Indeed)
- MMLSpark (Microsoft): a machine learning library for Spark
Machine Learning in production
Putting machine learning engines into production involves challenges that go beyond the design of the algorithm or its implementation: how to ensure reproducibility, how to schedule experiments optimally, how to automate model selection, and even how to organize teams of data scientists and engineers. Several presentations touched on these problems. Here is our selection:
- Productionizing Behavioural Features for Machine Learning with Apache Spark Streaming (Booking.com)
- Spline: Apache Spark Lineage, Not Only for the Banking Industry (Barclays) [github]
- No More Cumbersomeness: Automatic Predictive Modeling on Apache Spark (NEC / 9Lives)
Good practices for Machine Learning
Because Spark is still young and MLlib offers a limited set of algorithms, it is not unusual these days to implement your own ML algorithms. If you do, the following presentations offer some good insights:
- Building Machine Learning Algorithms on Apache Spark (Red Hat)
- Lessons Learned while Implementing a Sparse Logistic Regression Algorithm in Apache Spark (Zalando) Code
- MatFast: In-Memory Distributed Matrix Computation Processing and Optimization Based on Spark SQL (Hortonworks) – Library
In the first of these presentations, William Benton, data scientist at Red Hat, shared tips for building machine learning algorithms in Apache Spark. A common challenge, for instance, is how to distribute the training algorithm, especially when it needs to perform several iterations over the dataset. For those cases, a general recipe to start with is: keep the iteration loop, vectorize the code within each iteration, and use the aggregate and treeAggregate functions where possible.
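The recipe can be illustrated with a plain-Python sketch of iterative logistic-regression training (all names and data here are hypothetical): the iteration loop stays on the driver, the gradient over each partition is fully vectorized, and the per-partition gradients are summed once per pass, which is exactly the step Spark’s aggregate or treeAggregate would distribute across the cluster.

```python
import numpy as np

def logistic_gradient(partition, w):
    """Vectorized logistic-loss gradient over one data partition."""
    X, y = partition
    preds = 1.0 / (1.0 + np.exp(-X @ w))
    return X.T @ (preds - y)

def train(partitions, dim, n, iters=50, lr=0.5):
    w = np.zeros(dim)
    for _ in range(iters):                 # keep the iteration loop on the driver
        grad = sum(logistic_gradient(p, w)  # combine per-partition gradients:
                   for p in partitions) / n  # treeAggregate's job in Spark
        w -= lr * grad                     # one model update per pass
    return w

# Toy dataset split into two "partitions": label is 1 iff the features sum > 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X.sum(axis=1) > 0).astype(float)
partitions = [(X[:100], y[:100]), (X[100:], y[100:])]
w = train(partitions, dim=3, n=200)
```

The design point is that only a small aggregated value (the gradient) travels between workers and driver each iteration, while the heavy, vectorized work happens inside each partition.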
As we highlighted in our post on the global Spark Summit edition, Spark is starting to embrace Deep Learning, with Deep Learning Pipelines as well as tools from players such as Intel, Yahoo or Microsoft. The following talk compares different ways to combine Deep Learning and Spark.