What we saw at Spark Summit Europe 2017: More Spark Libraries, Real Machine Learning Systems in Production, and the Reality of Deep Learning in Spark

Beatriz Alonso


A month ago, the latest Spark Summit Europe, the world’s largest event for the Apache Spark™ community, was held in Dublin with the participation of more than 1,200 developers, engineers, data scientists, researchers and business professionals.

José A. Rodríguez, one of our data scientists, had the opportunity to participate as a speaker with a talk explaining a use case of Spark in the banking industry. The talk, joint work with Luis Peinado carried out with the continued support of the Advisory and Predictive Models team, highlights some existing BBVA products built on Spark and focuses on our experiments with a Spark-based text classifier for bank transfers.

Thanks to this classifier, every money transfer a BBVA customer makes now goes through an engine that infers a category from its textual description. The engine runs on Spark, combines MLlib with our own implementations, and is currently in production serving more than 5 million customers daily. Throughout the presentation, José shared the team's experiences and lessons learned from the data science standpoint, including the challenges the project posed, some sketches of the current solution, and an experimental test using the well-known word2vec embeddings together with VLAD (vector of locally aggregated descriptors), a technique borrowed from computer vision.
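The talk does not go into implementation details, but the VLAD idea itself is compact enough to sketch. Below is a minimal, hypothetical Scala encoding using Breeze; the VladEncoder object, its method names and the Euclidean nearest-centroid assignment are our illustration, not BBVA's production code. Each word vector in a transfer description is assigned to its nearest centroid (learned, for example, with k-means over the word2vec vocabulary), the residuals are accumulated per centroid, and the concatenation is L2-normalized into a single fixed-length vector:

```scala
import breeze.linalg.{DenseVector => BDV, norm}

// Hypothetical VLAD encoder over word2vec vectors (illustration only).
object VladEncoder {

  // Index of the centroid nearest to v (Euclidean distance).
  def nearest(v: BDV[Double], centroids: Array[BDV[Double]]): Int =
    centroids.indices.minBy(k => norm(v - centroids(k)))

  // VLAD: accumulate per-centroid residuals, concatenate the K
  // accumulators (output length K * d) and L2-normalize the result.
  def encode(wordVecs: Seq[BDV[Double]],
             centroids: Array[BDV[Double]]): BDV[Double] = {
    val d   = centroids.head.length
    val acc = Array.fill(centroids.length)(BDV.zeros[Double](d))
    wordVecs.foreach { v =>
      val k = nearest(v, centroids)
      acc(k) += (v - centroids(k))     // residual, not the raw vector
    }
    val vlad = BDV.vertcat(acc: _*)    // flatten to a single K * d vector
    val n = norm(vlad)
    if (n > 0.0) vlad / n else vlad    // L2 normalization
  }
}
```

Whatever the number of words in a description, the output has a fixed length of K times the embedding dimension, which is what lets a standard classifier consume descriptions of arbitrary length.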

The event had more than 70 sessions, with content for all levels and roles. Here we highlight the trends we found most relevant:

Libraries

As Spark adoption increases and the Spark core APIs stabilize, the community's task is to populate the ecosystem with the libraries that are still missing. Here are some examples of talks proposing and sharing libraries that cover missing features.

Productionizing

Putting machine learning engines into production involves challenges that go beyond the design of the algorithm or its implementation: how to ensure reproducibility, how to schedule experiments optimally, how to automate model selection, and even how to organize teams of data scientists and engineers. Several presentations sketched answers to these questions. Here is our selection:

Good practices for Machine Learning

As Spark is young and MLlib still offers a limited set of algorithms, it is not unusual these days to implement your own ML algorithms. If you do, the following presentations offer some good insights.

The first presentation, by William Benton, Data Scientist at Red Hat, gave us some tips for building machine learning algorithms in Apache Spark. For instance, a common challenge is how to distribute the training algorithm, especially when it needs to perform several iterations over the dataset. For those cases, a general recipe to start with (sketched below) is: keep the iteration loop on the driver, vectorize the code within each iteration, and use the aggregate and treeAggregate functions where possible.
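To make the recipe concrete, here is a minimal sketch, assuming a least-squares linear regression objective; the fit function, its parameters and the fixed step size are our own simplification, but the pattern (a sequential loop on the driver, broadcast weights, and a tree-shaped gradient aggregation) mirrors how MLlib structures its own iterative optimizers:

```scala
import org.apache.spark.rdd.RDD
import breeze.linalg.{DenseVector => BDV}

// Sketch: batch gradient descent for least squares on an RDD of
// (label, features) pairs. The loop stays on the driver; each
// iteration distributes one pass over the data.
def fit(data: RDD[(Double, BDV[Double])],
        dim: Int,
        iterations: Int = 100,
        stepSize: Double = 0.1): BDV[Double] = {
  val n = data.count().toDouble
  var weights = BDV.zeros[Double](dim)

  for (_ <- 1 to iterations) {
    val bcW = data.sparkContext.broadcast(weights) // ship current weights once
    val grad = data.treeAggregate(BDV.zeros[Double](dim))(
      // per-record contribution x * (w.x - y), accumulated in place
      seqOp = { case (acc, (y, x)) => acc += x * ((bcW.value dot x) - y) },
      // partial gradients merged pairwise in a tree, not all at the driver
      combOp = (a, b) => a += b
    )
    weights -= grad * (stepSize / n)
    bcW.unpersist()                                // release the old broadcast
  }
  weights
}
```

treeAggregate behaves like aggregate, but combines the per-partition results in a multi-level tree (depth 2 by default) instead of sending every partition's result straight to the driver, which matters when the gradient vector is large.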

Deep Learning

As we highlighted in our post on the global Spark Summit edition, Spark is starting to embrace deep learning, with the Deep Learning Pipelines library as well as tools from players such as Intel, Yahoo or Microsoft. The following talk compares different ways to combine deep learning with Spark.

We invite you to check out our presentation and to read our post about the international edition of this conference in San Francisco and the highlights we saw there.