Spark Summit 2017, which took place in San Francisco in June, brought together more than 3,000 data scientists, developers and industry experts, who took part in more than 170 sessions to discuss recent developments and the trends that will shape the future of the data industry.
This is the biggest worldwide event of the Apache Spark community. Spark has become the de facto standard for big data processing: it has been widely adopted in industry and academia, and it has a huge and active community of developers that has enabled the fast evolution of the technology and the ecosystem around it. For this reason, professionals from the largest technology companies around the world attend this event to share their experiences and to learn about the new trends that will guide the building of data-based products in the near future. This post describes the main aspects of the event from our perspective as attendees.
Spark 2.2: Hi Deep Learning, Bye micro batching
Databricks (the company founded by the creators of Apache Spark) opened the event as host. They showed the functionality planned for the next versions of Spark, specifically the upcoming 2.2 release, which will include two main features:
- Deep Learning Pipelines: This new capability integrates Spark with the Deep Learning world. It allows a neural network to be used as a natural element of a Machine Learning Pipeline by interacting with well-known Deep Learning frameworks such as TensorFlow, Keras and BigDL (we will talk about this last one later). For example, this abstraction allows us to use neural networks as User Defined Functions (UDFs).
- Structured/Continuous Streaming: Structured Streaming makes it possible to use Spark SQL to analyze a stream, which simplifies the development of streaming applications. Continuous Streaming, on the other hand, is the capability to process a stream in real time, without micro-batching. This is something the community has been waiting for since the first versions of streaming on Spark, and this latest development closes the gap with Apache Flink.
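To make the difference between micro-batching and continuous processing more concrete, here is a toy sketch in plain Python. It does not use Spark's API at all: the functions `micro_batch` and `continuous`, the sample events and the one-second interval are all hypothetical names and values chosen for illustration. It only shows that with micro-batching a record is not visible until its window closes, while with continuous processing each record is handled as it arrives.

```python
from typing import Iterable, Iterator, List, Tuple

# An event is (arrival time in seconds, value). Illustrative data only.
Event = Tuple[float, int]


def micro_batch(events: Iterable[Event], interval: float) -> Iterator[Tuple[float, List[int]]]:
    """Group events into fixed windows; a batch is emitted when its window closes."""
    batch: List[int] = []
    window_end = interval
    for arrival, value in events:
        # Close every window that ended before this event arrived.
        while arrival >= window_end:
            if batch:
                yield window_end, batch
                batch = []
            window_end += interval
        batch.append(value)
    # Flush the last, still open, window.
    if batch:
        yield window_end, batch


def continuous(events: Iterable[Event]) -> Iterator[Event]:
    """Emit each event at its own arrival time, with no batching delay."""
    for arrival, value in events:
        yield arrival, value


events = [(0.1, 1), (0.4, 2), (1.2, 3), (2.7, 4)]
batches = list(micro_batch(events, interval=1.0))
records = list(continuous(events))

for emitted_at, values in batches:
    print(f"micro-batch emitted at t={emitted_at}: {values}")
for emitted_at, value in records:
    print(f"continuous record emitted at t={emitted_at}: {value}")
```

In this sketch the record that arrives at t=0.1 is only emitted at t=1.0 in the micro-batch version, which is the kind of latency that continuous processing is meant to remove.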
All of these new features were shown in a very nice James Bond-themed demo, which can be seen in the video of the presentation.
In addition, there were deep dive sessions that explained other aspects of this new version. One example is Catalyst, Spark SQL's query optimizer, which defines an execution plan for a query based on a set of statistics managed by the optimizer (Cost Based Optimization, or CBO). This allows us to reduce the time required to execute a query.
As mentioned before, Deep Learning was a key topic of the conference. Intel extensively showcased its Deep Learning library, BigDL, which can be integrated with Spark and MKL (Intel Math Kernel Library) and allows us to create different neural network architectures (perceptrons, convolutional networks, RNNs). It seems that Intel is trying to become a key player in the Deep Learning ecosystem, and its challenge will be to position itself in an area dominated by TensorFlow (software) and Nvidia (hardware).
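The idea behind cost-based optimization can be illustrated with a small sketch in plain Python. This is not Catalyst's actual cost model or API: the table names, the row counts, the fixed `SELECTIVITY` and the `plan_cost` function are assumptions made up for this example. The sketch simply uses table statistics to estimate the size of the intermediate results of each possible join order and picks the cheapest one.

```python
from itertools import permutations

# Hypothetical table statistics (row counts), as an optimizer might collect them.
row_counts = {"orders": 1_000_000, "customers": 50_000, "countries": 200}

# Assumed fraction of row pairs that survive each join (illustrative only).
SELECTIVITY = 0.001


def plan_cost(order):
    """Estimate a left-deep join plan's cost as the sum of its intermediate result sizes."""
    rows = row_counts[order[0]]
    cost = 0.0
    for table in order[1:]:
        rows = rows * row_counts[table] * SELECTIVITY
        cost += rows
    return cost


# Enumerate every join order and keep the one with the lowest estimated cost.
best = min(permutations(row_counts), key=plan_cost)
naive = ("orders", "customers", "countries")

print("cheapest join order:", best)
print("estimated cost:", plan_cost(best), "vs naive order:", plan_cost(naive))
```

With these made-up statistics, the cheapest plans join the two small tables first and leave the large `orders` table for last, which keeps the intermediate results small.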
Other talks related to Deep Learning:
- TensorFlowOnSpark, a Yahoo! proposal to use TensorFlow in a Spark environment.
- Nvidia showed its specialized hardware for training and serving deep learning models, working on different use cases and frameworks (DL4J / H2O).
- Natural language processing using CNTK, Microsoft's Deep Learning framework.
- A benchmark of hardware/software configurations (CPU, GPU, multi-GPU, TensorFlow on Spark) for training deep learning models.
This edition of Spark Summit was full of interesting use cases from a diverse set of domains. This is a key indicator of the relevance of this tool in both industry and research. Here are some of them:
- Detection of toxic players in League of Legends by analyzing the in-game chat through natural language processing.
- The analytics infrastructure at Hotels.com and how it is used for large-scale image analytics.
- How LinkedIn uses GraphX to find insights in its social network.
- Project Fortis, a collaboration between the UN and Microsoft to analyze humanitarian crises and guide the decision-making process. This involves streaming analytics and GraphQL.
- SETI and IBM talk about the technical challenges of radio signal analysis using Spark.
- The road to productizing a Botnet detection model.
The Spark community has gained experience over the years, which has allowed it to develop a set of good practices for common problems such as productizing models, monitoring applications and solving performance issues. Here are some examples:
- Databricks talks about how Data Pipelines improve the development of the ETL process, providing a set of tools to solve common problems.
- Facebook describes their experience optimizing Spark processes.
- Netflix describes how it improved its feature extraction process using DataFrames.
- Data science infrastructure at PayPal
- Monitoring of Spark processes using Dr. Elephant
- How to productionize MLlib applications
- Clipper: A tool to serve machine learning models
- Some guides for GraphX application development
- Development of R-based applications using Sparkling Water
These were only some of the 179 talks of the conference. Each one revealed different perspectives and aspirations of a community that has grown exponentially, while the Spark ecosystem has grown at the same time and in the right direction. This year, we have seen a clear tendency towards the productization of machine learning models and dealing with the challenges involved. A diverse set of environments is also emerging that attempts to facilitate the interaction between data scientists and engineers in the development of data-based apps. All this is occurring while the ecosystem embraces the Deep Learning trend and its technical implications in terms of both software and hardware.
To sum up, this year's edition of the Summit was fairly productive and marks the beginning of a new phase for Spark: it is no longer an emerging tool but is becoming an established market product with a mission to bring analytical capabilities (as described in academia) into the industrial world. It remains to be seen what future editions of the Summit will bring.
The next appointment for this community will be Spark Summit Europe 2017, which will take place in Dublin on October 24 to 26. There we will talk about our experiences using Spark to categorize money transfers. See you there 😉