As data scientists, we love participating in initiatives outside the scope of our daily jobs. This lets us learn new things that are not directly related to our field of expertise and take a fresh look at complex analytical problems. At the same time, these experiences allow us to collaborate with colleagues who normally work on separate projects and to enrich our network of contacts. Last year, we participated for the first time in the FEIII 2018 challenge. It was such a great experience that this year we have decided to collaborate with the University of Maryland in organizing the challenge as well.
What does FEIII stand for?
The Financial Entity Identification and Information Integration (FEIII) challenge is hosted by the workshop “Data Science for Macro-Modeling with Financial and Economic Datasets” (DSMM). This workshop is held in conjunction with the SIGMOD Conference, one of the best-known conferences in the field of database management.
In recent years, SIGMOD has broadened its scope to include the application of machine learning to database management problems and end-to-end machine learning.
The DSMM workshop has two goals. The first is to extract useful insights from financial data. On the one hand, there are multiple open data sources ready to be used for this purpose. On the other hand, many key industry players and government organizations are interested in these insights. There is room to make important contributions to this field.
The second is to find the most appropriate methods for this task and to build a benchmark of different approaches, so that the same methods can later be applied to other data. This is very useful for BBVA Data & Analytics, where we deal with very different data in terms of privacy, language, and features, but face similar challenges when it comes to cleaning and integrating separate data sources or building a financial knowledge graph.
DSMM pursues these goals by gathering a community of people from both academia and industry and getting them to collaborate. This is where the FEIII challenge comes into play: it runs for over a month, so that more than just preliminary approaches can be explored. But organizing a challenge is not straightforward.
What are the difficulties in organizing the FEIII challenge?
One of the main difficulties in getting different people to work toward the same goal is data. Because of privacy policies, data can’t be easily shared. Therefore, the FEIII challenge focuses on public data as its starting point.
This year we are lucky to count on Enigma to provide a terrific dataset, full of economic signals and analytic challenges. As recently described by Forbes, Enigma is a company providing free curated public data. Its ability to rapidly make sense of this data and link it to private data has attracted some of the world’s leading companies, from BlackRock to PayPal.
The Dataset: U.S. Customs and Border Protection’s ‘Automated Manifest System’ (AMS)
This year the challenge is based on a comprehensive dataset of bill of lading header information from the U.S. Customs and Border Protection agency’s Automated Manifest System (AMS), covering incoming U.S. shipments in 2018.
This dataset provides a wonderful look into U.S. commercial trade, and therefore into a huge part of world trade. It provides information on goods that arrive at U.S. ports by containerized shipping from all over the world. It’s also a test of your data processing skills, with more than 16 million records for the first half of 2018.
You may have a look at some of the insights that Ben Matheson has provided in this visualization.
Did you know that Alaska’s Bering Sea is the marine highway for thousands of transits between the two largest economies, China and the USA?
The Challenge: Mapping trade
The AMS dataset is rich in both macroeconomic signals and microeconomic information on exporting companies. To please every data scientist in FEIII 2019, we have designed two tasks:
- A SCORED task that focuses on finding exporters for a given product and country. Such reference datasets have significant commercial value; for example, exporters are often target customers for a financial services company.
- An OPEN task that appeals to the creativity of the participants and may address interesting questions such as the following:
  - Summary of trends; visualization of flows; outliers.
  - Given an industry sector, characterize the most significant products, sources, and ports.
  - Given a product, identify potential bottlenecks, including sources and ports of entry.
Important dates
- March 10: Release of datasets.
- April 15: Abstract submission to the DSMM Workshop.
- May 1: Early registration deadline for SIGMOD 2019 and DSMM.
- May 15: Scoring of participant solutions.
- May 31: Camera-ready short paper submission to the DSMM Workshop.
- June 30: DSMM Workshop.
Have a look at the data! Try to find your favorite food or wine in the dataset browser provided by Enigma.
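To give a flavor of the scored task (finding exporters for a given product and country), here is a minimal sketch in Python. It is only an illustration under stated assumptions: the field names `shipper`, `country`, and `description` are hypothetical and not the actual AMS column names, the sample records are made up, and the keyword match and name normalization are deliberately naive stand-ins for real text classification and record linkage.

```python
import re
from collections import Counter

# Made-up sample records. The field names are illustrative assumptions,
# not the real AMS schema.
SAMPLE_RECORDS = [
    {"shipper": "Bodegas Rioja S.A.", "country": "Spain",
     "description": "RED WINE 750ML BOTTLES"},
    {"shipper": "BODEGAS RIOJA SA", "country": "Spain",
     "description": "Wine, cases"},
    {"shipper": "Olivar del Sur S.L.", "country": "Spain",
     "description": "EXTRA VIRGIN OLIVE OIL"},
    {"shipper": "Chateau Exemple", "country": "France",
     "description": "WINE AND SPIRITS"},
]


def normalize_name(name):
    """Crude normalization so spelling variants of one company collapse together."""
    cleaned = re.sub(r"[^a-z0-9 ]", "", name.lower())
    return re.sub(r"\s+", " ", cleaned).strip()


def find_exporters(records, product, country):
    """Count shipments per normalized shipper for a product keyword and country."""
    pattern = re.compile(r"\b" + re.escape(product.lower()) + r"\b")
    counts = Counter()
    for rec in records:
        if rec["country"].lower() != country.lower():
            continue
        if pattern.search(rec["description"].lower()):
            counts[normalize_name(rec["shipper"])] += 1
    # Most frequent exporters first.
    return counts.most_common()


exporters = find_exporters(SAMPLE_RECORDS, "wine", "Spain")
print(exporters)  # [('bodegas rioja sa', 2)]
```

Even in this toy example, the two spellings of the same shipper only merge because of the normalization step. On millions of free-text records, a keyword match and a regular expression will not be enough, which is exactly where the challenge begins.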
You will discover how real datasets challenge real data scientists. Interested in unsupervised data cleaning, graph analytics, record linkage or collective text classification? How do they scale to millions of records?
We expect your submission by May, and welcome you to SIGMOD in Amsterdam!