Applying the “Google spellchecker” principle to detecting online fraud

One of the ways bad guys manage to penetrate or influence a web site’s functionality is “poking around”: hitting different pages, often on different country-specific versions of the site (e.g. XYZ.de or XYZ.ca instead of XYZ.com), while “playing” with input parameters, looking for input validation holes or other site inconsistencies. If successful, they can do a lot of harm, including manipulating data (e.g. changing a user’s state by following some odd page sequence), stealing information and so on.

Such breaches can be detected at an early stage with a technique I call the “Google spellchecker” approach. Anybody who has used Google to check the spelling of a word, or the right collocation or phrase, knows the underlying principle. It’s (paraphrasing eBay’s motto) “people are basically educated”. That is, if we have 5 million hits for one spelling and 5 thousand for the “competitor” spelling, then the former is almost certainly the correct one. (BTW, that is one of the basic principles of linguistics: if enough people say ‘nucelar’, it automatically becomes a legitimate word.)

The same principle can be applied to detecting bad behavior:

  1. assign each page a unique ID (normal practice)
  2. define boundaries of individual user sessions
  3. record the sequence of pages hit during each session, e.g. 23 (login), 887 (account setting landing page), 368 (account setting confirmation), 99 (logout); in other words, create a “page trail” for each session
  4. at the end of each session, increment the number of times that particular trail has appeared on the radar, e.g. 23,887,368,99 -> 1035 times
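
The four steps above could be sketched roughly as follows. This is only an illustration, not a production design: the class name TrailRecorder, its methods and the session identifier "session-a" are all made up for the example, and it assumes something else (e.g. the web framework) tells us when a session starts and ends.

```python
from collections import Counter


class TrailRecorder:
    """Hypothetical helper: counts how often each page trail is seen."""

    def __init__(self):
        self.open_sessions = {}        # session id -> page IDs hit so far
        self.trail_counts = Counter()  # tuple of page IDs -> number of sessions

    def record_hit(self, session_id, page_id):
        # Step 3: append the page ID to the trail of the current session.
        self.open_sessions.setdefault(session_id, []).append(page_id)

    def end_session(self, session_id):
        # Step 4: when the session ends, bump the counter for its trail.
        trail = tuple(self.open_sessions.pop(session_id, ()))
        if trail:
            self.trail_counts[trail] += 1
        return trail


recorder = TrailRecorder()
for page_id in (23, 887, 368, 99):  # login, settings, confirmation, logout
    recorder.record_hit("session-a", page_id)
recorder.end_session("session-a")
print(recorder.trail_counts[(23, 887, 368, 99)])  # prints 1
```

Keying the counter on the whole trail (as a tuple) keeps the order of pages, which is exactly what matters here: the same pages hit in a different order are a different trail.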

Leave the system to bake for some time. Assuming that most people use the site for legitimate purposes, the numbers will eventually reflect the “normal” usage of the site. Maintaining that information helps detect abnormal usage (e.g. jumping to 368 “account setting confirmation” without hitting 887 “account setting landing page”) very soon after the “probe” is made. It is important to detect this early: if the hole becomes widely abused, its sequence may approach the “normality” level. We should also have some safeguards to avoid false positives, e.g. if a new page is added to the site, we want to know about it (e.g. by keeping page age information) and treat it as an exception.
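
A check along these lines might look like the sketch below. The function name is_suspicious, the 0.1% rarity threshold and the 14-day “new page” window are my assumptions for illustration, not tuned values, and the counts for the second trail are invented sample data. Pages with no recorded age are treated as brand new, which is also an assumption.

```python
from collections import Counter


def is_suspicious(trail, trail_counts, page_age_days,
                  min_share=0.001, new_page_days=14):
    """Flag a trail that is rare compared with all recorded sessions.

    min_share and new_page_days are illustrative thresholds.
    """
    # Safeguard against false positives: trails touching a recently added
    # (or unknown) page have not had time to accumulate counts.
    if any(page_age_days.get(page, 0) < new_page_days for page in trail):
        return False
    total = sum(trail_counts.values())
    if total == 0:
        return False  # nothing "baked" yet, so nothing to compare against
    return trail_counts.get(tuple(trail), 0) / total < min_share


# Sample data: the first count comes from the example above, the second is made up.
counts = Counter({(23, 887, 368, 99): 1035, (23, 99): 4000})
ages = {23: 900, 887: 400, 368: 400, 99: 900, 500: 2}

print(is_suspicious((23, 368, 99), counts, ages))       # True: skips page 887
print(is_suspicious((23, 887, 368, 99), counts, ages))  # False: common trail
print(is_suspicious((23, 500, 99), counts, ages))       # False: page 500 is new
```

Comparing against the share of all sessions, rather than a fixed count, keeps the threshold meaningful as traffic grows; whether 0.1% is the right cut-off would depend on the site.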

Naturally, the approach is not bulletproof (hardly any is). Indeed, if fraudsters are sophisticated enough, they could mask their behavior by mimicking legitimate sequences, or try to make session tracking more difficult. Nevertheless, that would seriously complicate their lives and put another “bump” in their way, so the goal of slowing them down would still be achieved.