Lessons learned while building a state-of-the-art Analytics Platform

Over the last several years, I was blessed with the chance to be an intimate part of an ambitious program that could be summarized as the design, development and operationalization of a state-of-the-art Data and Analytics Platform at my former employer, PayPal.

It’s a well-known fact that the key to PayPal’s success has historically been its effective risk management function, which in turn relies heavily on predictive models. Back in the day, it used to take up to 9 months to build and release such a model. Moreover, each ‘variable’ (feature used in the model) was in itself a separate query hitting the same databases that served the core business. Such queries proved to be extremely expensive, since those databases were designed for transactional workloads rather than the account-centric (or, for that matter, any other entity-centric) data queries Risk management needed. As a result, the introduction of new variables was heavily restricted, severely limiting the capabilities of risk models. Needless to say, especially in the case of fast-moving fraud patterns, a model built on 12- to 18-month-old data is largely obsolete by the time it’s rolled out to production.

At some point it was pretty clear that tweaks and tricks had exhausted themselves and that we could not move forward without a fundamentally new approach. In a nutshell, PayPal needed to dramatically cut the lifecycle of predictive models and radically lower the cost of leveraging more (and more sophisticated) data to keep up with both fraud trends and the endless slew of new products PayPal has been releasing. And since the efficiency of risk models directly hits the company’s bottom line, the decision to build a brand-new Analytics Platform had a very clear ROI appeal.

Fast forward to today, and the new Analytics Platform is an invaluable asset to PayPal’s Risk Management function, enabling dozens of predictive models that run in hundreds of milliseconds and evaluate tens of millions of events a day with best-in-industry efficiency rates. The lifecycle of a new model is down to a couple of weeks, and models have since evolved from basic linear regression to significantly more complex types, such as neural networks relying on many hundreds of variables.

So… what key lessons did I learn along the way? Without disclosing any significant details, let me lay out some select ones in hopes that you’ll find them helpful, if not illuminating.

It. Just. Takes. Time.

Building any major infrastructure project takes time and resources. If you think you can have it up and running in 6-12 months, you’re kidding yourself (and your stakeholders). Of course, a lot depends on the starting point (how bad your legacy stack is), the capabilities of the team (we all know the difference), and the particular expectations and requirements (where the devil typically resides). Still, if we are talking about building and deploying new infrastructure in a company with the size and scalability requirements of PayPal – with all the moving parts, organizational complexity and so on – we had better set ourselves up for a multi-year investment and prepare upper management for ‘strategic patience’.

Simulation, simulation, simulation

In predictive analytics, your efficiency is only as good as your ability to simulate what you are going to end up evaluating. No matter what spectacular results you got in your ‘data science environment’, they aren’t worth much if there is a significant gap between that environment and production (also known as ‘reality’).

I was amazed at how often data simulation is overlooked and demoted to the role of a ‘nice to have’ feature by some ‘experts’. In fact, reliable and efficient data simulation is absolutely at the heart of any platform that supports predictive analytics. It means the ability to simulate any feature that could potentially be used in the output model at the point in time of the actual event (e.g. a transaction) for which we are trying to predict the outcome (e.g. the probability of turning fraudulent). By ‘reliable’, I mean that the simulated value is the same as it would have appeared in production during the actual evaluation. By ‘efficient’, I mean the ability to support simulation of a large number (at least thousands) of features for a large number (tens of millions) of historic events reasonably fast (hours, not days).
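
To make the ‘reliable’ part concrete, here is a toy Python sketch of point-in-time simulation. Everything in it – the Event record, the txn_count_24h feature, the simulate helper – is an illustrative assumption rather than anything from the actual platform; the only thing it demonstrates is that a simulated feature may only see data that existed strictly before the evaluated event.

```python
# A toy sketch of point-in-time feature simulation -- illustrative only.
# Event, txn_count_24h and simulate are made-up names, not the platform's API.
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, Iterable, List

@dataclass
class Event:
    account_id: str
    timestamp: datetime
    amount: float

def txn_count_24h(history: Iterable[Event], event: Event) -> int:
    """Count the account's transactions in the 24 hours *before* the event.

    Only events strictly earlier than the evaluated event are visible,
    which is what makes the simulated value match what production would
    have seen at evaluation time (no look-ahead leakage).
    """
    window_start = event.timestamp - timedelta(hours=24)
    return sum(
        1
        for e in history
        if e.account_id == event.account_id
        and window_start <= e.timestamp < event.timestamp
    )

def simulate(history: List[Event],
             events: List[Event],
             feature: Callable[[Iterable[Event], Event], float]) -> List[float]:
    """Replay historic events and compute the feature as of each event's time."""
    return [feature(history, e) for e in events]

if __name__ == "__main__":
    now = datetime(2020, 1, 2, 12, 0)
    history = [
        Event("acct-1", now - timedelta(hours=30), 10.0),  # outside the 24h window
        Event("acct-1", now - timedelta(hours=3), 25.0),   # inside the window
        Event("acct-2", now - timedelta(hours=1), 99.0),   # different account
    ]
    evaluated = Event("acct-1", now, 42.0)
    print(simulate(history, [evaluated], txn_count_24h))   # -> [1]
```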

Strong Engineering and Data Science teams. Even stronger Product Management function.

Even best-in-class Engineering and Data Science teams (such as PayPal’s) have somewhat opposing perspectives and mindsets. Engineers tend to be conservative and hate working on something without clearly defined boundaries. Their KPIs typically revolve around availability and performance. For analysts, it’s never as black and white. The more data an analyst is given, the better they can do their job of predicting the target behavior. Their KPIs are all about the quality of their predictions. If engineers were the ones to decide, the platform would have strict limitations on what data to use in the models, and all changes would have to go through rigorous control and oversight. If it were up to modelers, they’d take all the data in the world (in real time, please!) and would like the freedom to make frequent adjustments and changes to which data to rely on as late in the game as possible. These differences need to be recognized as natural and perfectly legitimate.

This is where a strong Product Management function with a good grip on both perspectives plays a crucial role. PMs are the ones to define the right framework within which we can both assess the business value of new requirements and evaluate their potential impact on the system.

I personally liked to apply what I called the ‘ROI model’. In that approach, the engineers should work towards minimizing the ‘I’ part – i.e. strategically reducing the cost of creating and maintaining new variables. The ‘R’ part is the responsibility of the Data Scientists – it reflects how much incremental value over time we get from an individual variable. Naturally, it’s easier said than done – it’s extremely tricky to evaluate the ‘R’ part as trends move, for example – but with a transparent, well-defined methodology it’s absolutely possible.
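
Purely as an illustration of that framing – with made-up names and numbers – the ‘ROI model’ can be sketched in a few lines of Python. The Variable record and its cost fields are hypothetical; the real methodology has to deal with far messier estimates.

```python
# A back-of-the-envelope sketch of the 'ROI model' -- every name and number
# below is hypothetical and only illustrates the R-vs-I framing.
from dataclasses import dataclass

@dataclass
class Variable:
    name: str
    incremental_value: float   # 'R': estimated yearly value attributable to the variable
    build_cost: float          # one-off engineering cost to create it
    yearly_maintenance: float  # recurring cost (compute, storage, upkeep)

def roi(v: Variable, horizon_years: float = 1.0) -> float:
    """Return over investment for a given horizon: Data Scientists grow the
    numerator, engineers shrink the denominator."""
    investment = v.build_cost + v.yearly_maintenance * horizon_years
    return (v.incremental_value * horizon_years) / investment

candidates = [
    Variable("txn_velocity_1h", incremental_value=120_000, build_cost=20_000, yearly_maintenance=5_000),
    Variable("device_graph_degree", incremental_value=300_000, build_cost=150_000, yearly_maintenance=60_000),
]

# Rank candidate variables by their estimated ROI over one year.
for v in sorted(candidates, key=roi, reverse=True):
    print(f"{v.name}: ROI ~ {roi(v):.2f}")
```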

No single class of variables ‘wins the war’

Models are built on top of variables, and variables are built on top of data. A well-designed Analytics Platform needs to provide maximum flexibility both for introducing new data into the system and for providing tools to process that data (to produce variables). Moreover, the data can be processed in a variety of ways – real-time, near-real-time and offline. The Platform should give its users (analysts in this case) the ability to pick and choose which tools to use in each individual case – together with well-articulated costs and implications for each option. There is no single approach that satisfies all use cases. For example, some variables analyze trends and interconnections and need to crunch a lot of historic data, while others are all about the activity in the last several hours – or minutes – preceding the event being evaluated. A good Platform needs to support calculation and simulation for many classes of variables – even those we are not aware of yet – with well-defined processes, SLAs and costs for each of those classes.
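
To illustrate the idea of variable classes with explicit SLAs and costs, here is a rough sketch of what such a catalog could look like. The class names, freshness figures and cost scores are assumptions made for the example, not the actual platform’s taxonomy.

```python
# An illustrative catalog of variable classes with declared SLAs and costs.
# The class names, freshness figures and cost scores are assumptions for the
# sake of the example, not a description of the actual platform.
from dataclasses import dataclass
from enum import Enum

class ProcessingClass(Enum):
    REAL_TIME = "real-time"            # computed inline during event evaluation
    NEAR_REAL_TIME = "near-real-time"  # streaming aggregation, seconds-to-minutes lag
    OFFLINE = "offline"                # batch over deep history, refreshed daily

@dataclass
class VariableSpec:
    name: str
    processing: ProcessingClass
    freshness_sla: str   # how stale the value may be at evaluation time
    relative_cost: int   # rough cost score, feeding the ROI discussion above

CATALOG = [
    VariableSpec("amount_vs_account_avg", ProcessingClass.REAL_TIME, "0 ms", 3),
    VariableSpec("txn_count_last_10m", ProcessingClass.NEAR_REAL_TIME, "< 1 min", 2),
    VariableSpec("chargeback_rate_90d", ProcessingClass.OFFLINE, "< 24 h", 1),
]

for spec in CATALOG:
    print(f"{spec.name:26s} {spec.processing.value:16s} "
          f"freshness={spec.freshness_sla:8s} cost={spec.relative_cost}")
```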

I often used metaphors borrowed from the military (it started a long time ago, when I was bringing up to speed a new hire who happened to have an extensive military background). In the case of fraud detection the parallels are not hard to find. Indeed, we are waging a war on fraud, fighting a highly sophisticated and mobile adversary. No war is won with a single type of weaponry, hence we need to invest in both strategic (slower, but more powerful) and tactical (more versatile and targeted) types of ‘weapons’. A good Analytics Platform is our defense industry, and the more flexible it is, the easier it is for us to adapt to the ever-evolving fraud ‘attacks’.

 

All in all, building a scalable Analytics Platform that fully enables its users, is flexible enough to adapt and ingest ever more data, and satisfies stringent SLA and availability requirements is a subject one could write volumes about. But the lessons above are the ones that deserve to be mentioned in a short blog post – since they are both important and often overlooked. IMHO.