Kaggle Human or Robot Challenge

This is a brief write-up of my approach to a recent, interesting Kaggle competition hosted by Facebook. In a nutshell, we are given data describing user-level bidding behaviours on an online penny auction website (I didn’t know such a thing existed!). The goal is to come up with a predictive model that can distinguish between real human bidders and automated robot bidders.

Data Exploration

A quick look at the data by plotting the sorted timestamps of the bidding activities shows that there are three distinct “periods” in the dataset.

A quick value_counts() on each column very roughly shows how much information the dataset contains:

bid_id         7656334
bidder_id         6614
auction          15051
merchandise         10
device            7351
time            776529
country            199
ip             2303991
url            1786351
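
(For reference, a minimal way to reproduce such a summary with pandas; the file name below is an assumption on my part.)

import pandas as pd

# Load the raw bid records (file name assumed)
bids = pd.read_csv('bids.csv')

# Number of unique values per column, i.e. len(column.value_counts())
print(bids.nunique())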

Another interesting observation concerns the address and payment_account fields given as meta-data for each bidder. Although the 37-character hashes do look like MD5, their distributions are clearly not completely random. For example, 52.3% of the bidders have their address hashes prefixed by a3d2de7675556553a5f08e4c88d2c228, and 18.6% of the bidders have their payment_account hashes prefixed by the same magic hash. This artifact contributed slightly to the final performance of my model, as described in the next section.
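
A rough sketch of how one might check this, assuming a train.csv with the bidder_id, address and payment_account columns (the file name is an assumption):

import pandas as pd

train = pd.read_csv('train.csv')  # bidder meta-data (file name assumed)
magic = 'a3d2de7675556553a5f08e4c88d2c228'

# Fraction of bidders whose address / payment_account starts with the magic prefix
print((train['address'].str[:32] == magic).mean())
print((train['payment_account'].str[:32] == magic).mean())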

Feature Engineering

The following lists the features I found useful and included in my model. Obviously, each of these features is extracted for every bidder.

The first set of features makes use of the categorical information contained in the data, such as country, url, ip and device (a rough code sketch follows this list):

  • Basic counts: Counts of unique auction, device, country, ip and url

  • Basic counts by auction: Same as above, but do the counting within each auction and extract the resulting statistics (mean, median, std)

  • Shared device: Some devices are shared by multiple bidders. This feature counts the number of bids that use such shared devices. Repeat the counting for each auction and extract the statistics.

  • Country based: One-hot encode the most frequently occurring (top 10%) countries in the dataset. Extract the number of bids associated with these countries.

  • URL based: Number of bids having the URL vasstdc27m7nks3 or other URLs (i.e. similar to one-hot encoding with only two dimensions).

  • IP based: One-hot encode the most frequently occurring IP addresses (top 20) in the dataset. Extract the number of bids associated with these IPs.

  • Device based: One-hot encode the most frequently occurring devices (top 40) in the dataset. Extract the number of bids associated with these devices.
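
As a simplified illustration (not the exact feature code), the first two bullet points can be computed with pandas along these lines:

import pandas as pd

def basic_count_features(bids):
    """Per-bidder unique counts, plus per-auction statistics of those counts."""
    cat_cols = ['auction', 'device', 'country', 'ip', 'url']

    # Basic counts: number of unique values per bidder
    feats = bids.groupby('bidder_id')[cat_cols].nunique()
    feats.columns = ['n_' + c for c in cat_cols]

    # Basic counts by auction: the same counts within each auction,
    # summarized per bidder by mean / median / std
    per_auction = bids.groupby(['bidder_id', 'auction'])[
        ['device', 'country', 'ip', 'url']].nunique()
    stats = per_auction.groupby(level='bidder_id').agg(['mean', 'median', 'std'])
    stats.columns = ['_'.join(c) + '_per_auction' for c in stats.columns]

    return feats.join(stats)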

The second set of features is derived mainly from time information (again, a sketch follows this list):

  • Period based: Number of bids in each of the three “periods”

  • Bidding (inverse) frequency: Statistics of the time differences between bids (mean, std)

  • Reaction time: Statistics of the time elapsed since the previous bid within each auction

  • Auction switch: Number of switches between different auctions, when bidding activities are sorted chronologically

  • Bid timing: Define “bid timing” as the chronological order of a bid in an auction or a period, normalized to [0, 1]. For example, the very first bid in each auction has a bid timing of 0. The resulting features are obtained by computing the statistics of bid timing for each auction, as well as for each “period”.

  • First bid: Statistics of the bid timing of the first bid within each auction. Interestingly, its counterpart “last bid” feature is not useful, probably because for many auctions we don’t even know whether they have already ended.
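
Again as a simplified sketch, treating the raw time column as an ordinal value, the bid-gap and bid-timing features could look something like this:

import pandas as pd

def time_features(bids):
    """Per-bidder time features: inter-bid gaps and bid timing within each auction."""
    bids = bids.sort_values('time')

    # Bidding (inverse) frequency: gaps between a bidder's consecutive bids
    gaps = bids.groupby('bidder_id')['time'].diff()
    gap_stats = gaps.groupby(bids['bidder_id']).agg(['mean', 'std'])
    gap_stats.columns = ['gap_mean', 'gap_std']

    # Bid timing: chronological rank of each bid within its auction, scaled to [0, 1]
    order = bids.groupby('auction').cumcount()
    n_bids = bids.groupby('auction')['bid_id'].transform('count')
    bids = bids.assign(bid_timing=order / (n_bids - 1).clip(lower=1))
    timing_stats = bids.groupby('bidder_id')['bid_timing'].agg(['mean', 'std'])
    timing_stats.columns = ['timing_mean', 'timing_std']

    return gap_stats.join(timing_stats)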

And finally, the mysterious meta-data information:

  • Address hash prefix: One-hot encode the address hash prefix (of length 32) into two categories: a3d2de7675556553a5f08e4c88d2c228 and others.

Combining the above feature sets yields a feature dimension of about 180.

Modeling

I use the extra-trees classifier (ExtraTreesClassifier from the scikit-learn library) as the modeling algorithm. It turns out to yield the best local validation score among the usual go-to methods, such as random forests and gradient boosted trees. Due to the way the labels are assigned in this dataset, they are believed to be rather noisy, which perhaps explains why the more powerful gradient boosted trees do not generalize as well as the extra-trees classifier. I have seen extra-trees give good results on another dataset that also contains noisy labels.

The final model is a simple average of multiple models trained using different random seeds.
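
In sketch form, with placeholder hyper-parameters rather than the tuned values:

import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

def seed_averaged_predict(X_train, y_train, X_test, seeds=range(10)):
    """Average the predicted probabilities of extra-trees models trained with different seeds."""
    preds = []
    for seed in seeds:
        clf = ExtraTreesClassifier(n_estimators=500, random_state=seed, n_jobs=-1)
        clf.fit(X_train, y_train)
        preds.append(clf.predict_proba(X_test)[:, 1])
    return np.mean(preds, axis=0)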

Evaluation Pipeline

Normally, I use 5-fold cross validation to evaluate changes in the model pipeline. For this dataset, I felt the need to level it up by repeating it 20 times, yielding a total of 100 different folds. Luckily, thanks to the small size of the dataset, the processing time is still very much tolerable. I look at both the mean and the standard deviation of the cross-validated AUC scores to decide whether each change in features or parameters is helpful, optimizing for the most stable result when the improvement is marginal.
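
This is straightforward to set up in scikit-learn; a sketch, assuming X and y hold the feature matrix and labels built above:

from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# 5-fold CV repeated 20 times = 100 AUC scores in total
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=20, random_state=0)
clf = ExtraTreesClassifier(n_estimators=500, n_jobs=-1, random_state=0)

scores = cross_val_score(clf, X, y, scoring='roc_auc', cv=cv)
print(scores.mean(), scores.std())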

The public leaderboard score is indeed not very helpful, due to its small sample size.

Things that Did Not Help

As usual, many things don’t work. Some that I still remember are:

  • Interaction between categorical features (e.g. the use of two particular devices, occurrences of two particular countries)

  • Fancy blending by optimizing weights between different models (for this I also trained other models, such as neural networks, k-NN and linear models, which did not work as well individually)

  • The merchandise field. I tried multiple ways to derive features from this information, but none turned out to be useful.

Ending Notes

There are some interesting ideas that other people have used, which I have not thought of:

  • Further decoding of the time unit to extract time-window-based features that make sense

  • Treating bidding activities as text and borrowing ideas from text classification methods

  • SMOTE-based re-sampling technique to address the class imbalance issue

My final ranking jumped up a bit when the private leaderboard was revealed, to 11th place. Not too bad, since near the end of the competition I was just hoping to stay within the top 10%. Hats off to the top participants, especially the winner, who dominated everyone with just 3 submissions. Very impressive!