Cleansing and Exploratory Data Analysis with Apache Spark and Optimus

Favio Andre Vazquez Prieto

d&a blog

Outdated, inaccurate, or duplicated data won’t drive optimal data-driven solutions. When data is inaccurate, leads are harder to track and nurture, and insights may be flawed. The data on which you base your big data strategy must be accurate, up to date, as complete as possible, and free of duplicate entries. Clean data results in better decisions.

Cleaning data is the most time-consuming and least enjoyable data science task (until Optimus), but also one of the most important. No one can start a data science, machine learning or data-driven solution without being sure that the data they’ll be consuming is in optimal shape. Although several data cleansing solutions exist, they either cannot keep up with the emergence of Big Data or are really hard to use.

Right now, more and more companies are entering (or at least trying to enter) the Big Data and Machine Learning revolution. Every data-driven approach needs to clean, wrangle, normalize and fix the data that will feed its models. With Optimus we are launching an easy-to-use, easy-to-deploy, open source framework to clean and analyze data in parallel using state-of-the-art technologies. It can be used by small, medium and large companies, and even by startups that want to build data science solutions but cannot afford to hire many data scientists or to set up their own cluster to clean the data they are going to use.

Optimus is the missing framework for cleansing (cleaning and much more), pre-processing and exploratory data analysis in a distributed fashion. It uses all the power of Apache Spark to do so, and it implements several handy tools for data wrangling and munging that will make your life much easier. Its first obvious advantage over other public data cleaning libraries is that it works on your laptop or on your big cluster; the second is that it is amazingly easy to install, use and understand.

The BBVA Data & Analytics team in Mexico has been using Optimus in recent months, and we have improved our performance in cleansing, exploring and analyzing our data by a factor of 10.

Requirements

  • Apache Spark 2.2.0
  • Python 3.5
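Before installing, it can help to confirm that your environment matches the list above. The following is a minimal sketch that uses only the Python standard library; the function name `check_environment` and its parameters are hypothetical and are not part of Optimus. It checks the interpreter version against the 3.5 requirement and whether a `pyspark` package is importable; it does not verify the Spark version itself.

```python
import importlib.util
import sys

# Minimum Python version taken from the requirements list above.
MIN_PYTHON = (3, 5)


def check_environment(python_version=None, min_python=MIN_PYTHON):
    """Return a list of human-readable problems; an empty list means none were found.

    `python_version` defaults to the running interpreter and can be
    overridden (for example with (2, 7)) to see what a failure looks like.
    """
    problems = []

    current = tuple(python_version or sys.version_info[:2])
    if current < tuple(min_python):
        problems.append(
            "Python {}.{} found, but {}.{} or newer is required".format(
                current[0], current[1], min_python[0], min_python[1]
            )
        )

    # find_spec returns None when a top-level package cannot be located.
    if importlib.util.find_spec("pyspark") is None:
        problems.append("pyspark is not importable in this environment")

    return problems


if __name__ == "__main__":
    found = check_environment()
    if found:
        for problem in found:
            print("WARNING:", problem)
    else:
        print("Environment looks ready.")
```

Returning a list of problems instead of raising on the first one lets you see everything that needs fixing in a single run.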

Installation (Windows, Mac & Linux)

In your terminal just type:

    pip install optimuspyspark

For complete documentation on how to use it, please visit our GitHub repository:

https://github.com/ironmussa/Optimus

If you want a peek at what Optimus can do for you, check out this demo:

https://nbviewer.jupyter.org/github/ironmussa/Optimus/blob/master/examples/Optimus_Example.ipynb

License

Apache 2.0 © Iron.