Yi Tang Data Scientist with Emacs

Kaggle Avito Demand Prediction Challenge - 22nd Place Solution

The Avito Demand Prediction Challenge asks Kagglers to predict the "demand" likelihood of an advertisement. If a listed second-hand iPhone 6 is priced at £20,000, then the demand is likely to be very low. This is my first competition building a model from tabular data, text, and images.

I teamed up with Rashmi, Abhimanyu, Yiang, and Samrat, and we finished 22nd among 1,917 teams.

This was an interesting competition for me. I was about to quit the competition, and Kaggle, because of other commitments in life and work. Just one day before the team merge deadline, Rashmi asked me to join. At that time my position was 880th, around the top 50%, while Rashmi's team was around 82nd. So I decided to join and finish the competition, on which I had already spent many hours.

Final Ensemble Models

As part of this team, I worked on the final ensemble models. Immediately after joining, I completed five tasks:

  1. Make sure everyone uses the same agreed cross-validation schema. This is essential for building an ensemble model.
  2. Provide a model_zoo.md document to keep track of all level 1 models: their train/valid/LB scores, the features used, and the file paths to their OOF/test predictions.
  3. Write merge_oof.py to combine all OOF/test predictions.
  4. Write R scripts for the glmnet ensemble.
  5. Write Python scripts for the LightGBM ensemble.

Once a new model was built, the team member updated model_zoo.md and uploaded the data to a private GitHub repo. I then updated merge_oof.py to include the new model's results, and ran the glmnet and LightGBM ensembles. We automated this workflow, so it took little effort to see a new ensemble's performance. A minimal sketch of the glmnet step is given below.
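For illustration, here is a rough sketch of that glmnet step. It is not our exact script: the file names, the target column deal_probability, and the ridge penalty are assumptions, and the real ensemble was tuned against the shared CV folds.

library(data.table)
library(glmnet)

## hypothetical output of merge_oof.py: one column of OOF predictions
## per level 1 model, plus the competition target
oof  <- fread("oof_predictions.csv")
test <- fread("test_predictions.csv")

x <- as.matrix(oof[, !"deal_probability"])
y <- oof$deal_probability

## ridge regression on top of the L1 predictions; in practice, pass
## foldid = <the agreed CV folds> so every model is stacked consistently
fit <- cv.glmnet(x, y, alpha = 0)

coef(fit, s = "lambda.min")   # the L1 model weights we analysed
pred <- predict(fit, newx = as.matrix(test), s = "lambda.min")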

I spent some time analysing the coefficients/weights of the L1 models and tried excluding the models with negative or low weights. To my surprise, it didn't help at all. The final submission was a glmnet ensemble of 41 models (LightGBM + XGBoost + NN).

Also, the LightGBM ensemble had a much better CV score, but its LB score was worse. I suspect there was leakage in some of the L1 models, and that glmnet, being a linear model, is more robust to it. Unfortunately, there was not enough time to identify which models leaked.

Collaboration

This was my second time working in a team. Compared with a professional data science team there is plenty of room to improve our collaboration, but for a nights-and-weekends project we did a really good job as a team.

The setup for collaboration:

  1. Slack for discussion. We had channels for general, final_ensemble, and random (for cat photos, etc.).
  2. Slack also for sharing features, which I personally don't like.
  3. A private GitHub repo for sharing code and OOF/test predictions.
  4. Monday.com for managing tasks. It gives a nice overview of what everyone is up to.

We tried very hard to get a gold medal, but the other teams worked even harder. At one point we were 17th; we finished 22nd.

Some Kagglers to Avoid

Finally, while we waited out the last hour before the final deadline, we had a lively discussion about our past disqualification experiences. We were all shocked to find that in the Toxic competition, while on different teams, we had each teamed up with the same person. We shared that person's multiple Kaggle accounts and added them to our personal block lists.

etags - Build a TAGS File for Multiple R Packages

Here is what I tried in order to build a TAGS file for multiple R packages. It enables me to jump to the location where a function/variable is defined, and to modify it if I want to.

Useful variables and functions

ess-r-package-library-path
the default paths in which to look for packages; should be a list.
ess-r-package-root-file
if a folder has a DESCRIPTION file, then it is an R package.
(ess-build-tags-for-directory DIR TAGFILE)
build tags for DIR and write them to TAGFILE.
tags-table-list
list of file names of tags tables to search.
(visit-tags-table FILE &optional LOCAL)
tell tags commands to use the tags table file FILE.
;; new variable 
(defvar ess-r-package-library-tags nil
  "A TAG file for multiple R packages.")

(setq ess-r-package-library-path '("~/tmp/feather/R" "~/tmp/RPostgres/"))
(setq ess-r-package-library-tags "~/tmp/all_tags")

(dolist (pkg-path ess-r-package-library-path)
  (let ((pkg-name (ess-r-package--find-package-name pkg-path)))
    (unless (and pkg-name pkg-path
                 (file-exists-p (expand-file-name ess-r-package-root-file pkg-path)))
      (error "Not a valid package. No '%s' found in `%s'." ess-r-package-root-file pkg-path))
    (ess-build-tags-for-directory pkg-path ess-r-package-library-tags)))

Note the workhorse is ess-build-tags-for-directory, which does what its name suggests. At its core this function uses the find and etags programs: find locates files with extensions .cpp, .R, .nw, etc., and feeds them (via a pipe) to etags, which generates the TAGS table. These two steps are demonstrated in the following snippet, grabbed from the source code of ess-build-tags-for-directory.

(setq find-cmd (format "find %s -type f -size 1M \\( -regex \".*\\.\\(cpp\\|jl\\|[RsrSch]\\(nw\\)?\\)$\" \\)" (car ess-r-package-library-path)))

(setq regs (delq nil (mapcar (lambda (l)
                               (if (string-match "'" (cadr l))
                                   nil ;; remove for time being
                                 (format "/%s/\\%d/"
                                         (replace-regexp-in-string "/" "\\/" (nth 1 l) t)
                                         (nth 2 l))))
                             imenu-generic-expression)))
(setq tags-cmd (format "etags -o %s --regex='%s' -" "~/lala"
                       (mapconcat 'identity regs "' --regex='")))

(setq sh-cmd (format "%s | %s" find-cmd tags-cmd))
(shell-command sh-cmd)

Note that once these are used in Emacs, the path to the new TAGS table is appended to the tags-table-list variable, so the user can call xref-find-definitions (M-.) to jump to the definition of the word at point, or select which function/variable to jump to. After checking the definition, or modifying it if necessary, calling xref-pop-marker-stack (M-,) jumps back.

Compare the RPostgres and RPostgreSQL Packages

R is a great language for R&D: it is fast for writing prototypes and has great visualisation tools. One constraint of R is that it stores data in system memory. When the data becomes too big to fit in memory, the user has to manually split the dataset and aggregate the outputs later. This process is inefficient and error-prone for a non-technical user.

I started an R development project to automate this split-aggregate process. A viable solution is to store the whole dataset in PostgreSQL and let R fetch one small chunk of the data at a time, do the calculation, and then save the output back to PostgreSQL. This solution requires frequent data transfers between the two systems, which could become a performance bottleneck, so I compared the two R packages that interface R with PostgreSQL.
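As a minimal sketch of that chunked workflow (the table names, the column y, and the chunk size are hypothetical; a second connection is used for writing because a connection with a pending result set cannot issue new queries):

library(DBI)

con.read  <- dbConnect(RPostgres::Postgres(), dbname = "mydb")
con.write <- dbConnect(RPostgres::Postgres(), dbname = "mydb")

## stream the input table and process it 10,000 rows at a time
res <- dbSendQuery(con.read, "SELECT * FROM input_table")
while (!dbHasCompleted(res)) {
    chunk <- dbFetch(res, n = 10000)      # one small chunk
    chunk$score <- chunk$y * 2            # placeholder calculation
    dbWriteTable(con.write, "output_table", chunk, append = TRUE)
}
dbClearResult(res)
dbDisconnect(con.read); dbDisconnect(con.write)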

RPostgreSQL
was developed as part of the Google Summer of Code 2008 program. It is no longer under active development; the last release was in 2013.
RPostgres
is a newer package that provides similar functionality to RPostgreSQL, but is rewritten in C++ using Rcpp. Its development is led by Kirill Müller.

Based on my testing, the RPostgres package is about 30% faster than RPostgreSQL.

The test set-up is quite simple: an R script sends data to, and gets data out of, a remote PostgreSQL database, and logs how long each task takes to complete in R. To reduce the impact of other factors on the timings, it repeats the process 20 times and uses the minimal run-time as the final score. The dataset transferred between R and PostgreSQL is a flat table with three columns, and the number of rows varies from ten thousand to one million.

The run-times in seconds are plotted against the number of rows for each package and operation.

[Figure: run-time in seconds against number of rows, by package (RPostgres, RPostgreSQL) and operation (read, write)]

Here is a summary of what I observed:

  1. RPostgreSQL is slower than RPostgres. For getting data out, it is 75% slower, which is massive! For writing, the difference is closer, about 20%. Combining both scores, it is about 33% slower.
  2. In particular, RPostgreSQL is slower to read than to write, with a ratio of about 1.5, whereas RPostgres is quicker to read than to write, with a ratio of about 0.8. This is an interesting observation.
  3. Both packages have a nice property: the reading/writing time depends linearly on the number of rows. This makes time estimation reliable; I would be confident to say that for 2 million rows it would take RPostgres about 6 seconds to read (see the sketch after this list).
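That estimate is just a linear extrapolation. As a minimal sketch, using the plot.df data table produced by the script at the end of this post:

## fit run-time against number of rows for RPostgres reads,
## then extrapolate to 2 million rows
read.times <- plot.df[package == "RPostgres" & operation == "read"]
fit <- lm(V1 ~ num_row, data = read.times)
predict(fit, newdata = data.frame(num_row = 2e6))  # about 6 seconds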

I don't know which part of the implementation makes RPostgres faster; I guess it is the use of C++ and the magical Rcpp package.

Here is the script, in case you want to run your own tests.

library(data.table)                     
library(ggplot2)
library(microbenchmark)
library(RPostgreSQL)
library(DBI)   
                                        # config for PostgreSQL database
host.name <- NULL
database.name <- NULL
postgres.user <- NULL
postgres.passwd <- NULL
postgres.port <- NULL
temporary.table.name <- NULL

                                        # config for testing
nrows <- seq(10 * 1e3, 1 * 1e6, length = 10)
repeats <- 20


                                        # open PostgreSQL connection
pg.RPostgreSQL <- dbConnect(dbDriver("PostgreSQL"),
                           host = host.name,
                           dbname = database.name,
                           user = postgres.user,
                           password = postgres.passwd,
                           port = postgres.port)
pg.RPostgres <- dbConnect(RPostgres::Postgres(),
                         host = host.name,
                         dbname = database.name,
                         user = postgres.user,
                         password = postgres.passwd,
                         port = postgres.port)

ReadWriteWrapper <- function(pg.connection) {
                                        # helper function 
    write <- function() dbWriteTable(pg.connection, temporary.table.name, dt, overwrite = TRUE)
    read <- function() dbReadTable(pg.connection, temporary.table.name)

    var <- list()
    for (n in nrows) {
                                        # create a dataset
        dt <- data.table(x = sample(LETTERS, n, TRUE),       # character
                         y = rnorm(n),                       # double
                         z = sample.int(n, replace = TRUE))  # integer

                                        # read and write once first.
        write()
        read()

                                        # run and log run-time
        res <- microbenchmark(write(),
                             read(),
                             times = repeats)

                                        # parse 
        var[[as.character(n)]] <- data.table(num_row = n,
                                            operation = res$expr,
                                            time = res$time)
    }

                                        # aggregate and return
    rbindlist(var)
}

                                        # run
df0 <- ReadWriteWrapper(pg.RPostgres); df1 <- ReadWriteWrapper(pg.RPostgreSQL)
df0$package <- "RPostgres"; df1$package <- "RPostgreSQL"
df <- rbind(df0, df1)
plot.df <- df[, min(time) / 1e9, .(num_row, operation, package)]

## generate plot
plot.df[, operation := gsub("\\(|\\)", "", operation)]
ggplot(plot.df, aes(x=num_row, y=V1, col = package)) +
    geom_path() +
    geom_point() +
    facet_wrap(~operation) +
    theme_bw() +
    labs(x="Number of rows",
         y="Run time (sec)"
         )

Notes on Factorisation Machine

To start with, consider the model equation of a linear regression model with a two-way interaction effect:

\begin{equation} f(x) = \beta_0 + \sum_{i=1}^{p} \beta_i x_i + \sum_{i=1}^{p} \sum_{j=i+1}^{p} \beta_{i, j} x_i x_j \end{equation}

where

\(\beta_0\)
is the intercept term,
\(\beta_i\)
models the strength of the i-th variable,
\(\beta_{i,j}\)
models the interaction between the i-th and j-th variables.

To estimate the parameters, we can first create the interaction variables1 \(x_{i,j} = x_i \cdot x_j\) and add them to the design matrix \(X\). This converts the problem into a linear regression, which can easily be solved by least squares2.
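In R, for instance, this is a one-liner (df is a hypothetical data frame with response y and numeric predictors):

## y ~ .^2 expands to all main effects plus all pairwise products
fit <- lm(y ~ .^2, data = df)
coef(fit)   # one estimate per beta_i and per beta_{i,j}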

Factorisation Machines have the same model equation but use a different way of estimating the interaction parameters \(\beta_{i,j}\): factorising them. Provided a sufficiently large \(k\), the \((p \times p)\) matrix \(\beta\) can be approximated (factorised) as \(V \bullet V^T\) using a lower-dimensional matrix \(V\) \((p \times k)\).

In this way, the interaction parameter becomes

\begin{equation} \hat{\beta}_{i,j} := \langle v_i, v_j \rangle = \sum_{f=1}^{k} v_{i, f} v_{j, f} \end{equation}

where \(k\) is a hyper-parameter that defines the dimensionality of the factorisation.

The model equation can be solved using stochastic gradient descent with various losses (square loss, logit loss, or hinge loss). For square loss3, the gradient of the FM model is

\begin{equation} \frac{\partial{f}}{\partial{\theta}} = \begin{cases} 1, & \textrm{ if } \theta \textrm{ is } \beta_0 \\ x_i, & \textrm{ if } \theta \textrm{ is } \beta_i \\ x_i(\sum_{j=1}^{p} v_{j, f} x_j) - v_{i, f} x_i^2, & \textrm{ if } \theta \textrm{ is } v_{i, f} \end{cases} \end{equation}

Due to this factorisation of the interactions, no model parameter directly depends on a pair of variables, and the computation can be reduced from quadratic \(\mathcal{O}(kn^2)\) to linear \(\mathcal{O}(kn)\); see Lemma 3.1 in Rendle's paper4.
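The key identity behind this (Lemma 3.1) rewrites the pairwise sum so that each factor dimension needs only one pass over the variables:

\begin{equation} \sum_{i=1}^{p} \sum_{j=i+1}^{p} \langle v_i, v_j \rangle x_i x_j = \frac{1}{2} \sum_{f=1}^{k} \left( \Big( \sum_{i=1}^{p} v_{i,f} x_i \Big)^2 - \sum_{i=1}^{p} v_{i,f}^2 x_i^2 \right) \end{equation}

Each inner sum runs over the variables once, so evaluating all \(k\) factor dimensions is linear rather than quadratic in the number of variables.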

Another advantage of this is that we can estimate the interaction parameter \(\beta_{i,j}\) even when there are not enough observations of the pair \((x_i, x_j)\). This is a desirable feature when working with sparse datasets. Rendle showed that FM outperforms Support Vector Machines on hugely sparse datasets.

Factorisation Machines look promising: they are fast to train and perform well on sparse datasets. I haven't fully understood them yet, but I am keen to apply them to a Kaggle competition and gain more insights.

How to Create a Screencast GIF in Emacs

[GIF: each step of the expression evaluation program running in Emacs]

I've always wanted to create a GIF using Emacs to demonstrate some features; it just looks so cool. I finally got a chance after attending the Leeds Code Dojo. The final exercise was a bit unusual: we had to write a basic expression evaluation program without using the eval function, in whatever language we chose. The first problem we had was to figure out the order in which to evaluate the sub-expressions. For example, in the expression (5 * (2 + 1)), we know we first add 2 to 1 to get 3, and then multiply 3 by 5. It sounds trivial, but it is actually hard to write a program to do it.

I used a regular expression1 to locate the innermost expression to evaluate, replaced that expression with its evaluated result, and repeated these two steps until there was no expression left2.
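The Dojo program was written in Emacs Lisp (that is what the GIF shows), but the same innermost-first strategy fits in a few lines of R. This is a minimal sketch that only handles fully parenthesised binary arithmetic on numbers:

EvalExpression <- function(expr) {
    ops <- list("+" = `+`, "-" = `-`, "*" = `*`, "/" = `/`)
    ## a parenthesised "number op number" with no nested parentheses
    pattern <- "\\(\\s*(-?[0-9.]+)\\s*([-+*/])\\s*(-?[0-9.]+)\\s*\\)"
    while (grepl(pattern, expr)) {
        m <- regmatches(expr, regexec(pattern, expr))[[1]]
        value <- ops[[m[3]]](as.numeric(m[2]), as.numeric(m[4]))
        expr <- sub(pattern, as.character(value), expr)  # substitute back
    }
    as.numeric(expr)
}

EvalExpression("(5 * (2 + 1))")   # 15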

The above GIF shows each step of an expression evaluation program written in Emacs Lisp.

This post shows how to make a GIF in Emacs on an Ubuntu system.

Dependencies

There are three packages to install first. We need recordmydesktop to capture the motion on the screen, mplayer to view the video, and imagemagick to convert the recorded video into a GIF file. They can be installed easily with the apt-get command, as in the following shell command:

sudo apt-get install recordmydesktop mplayer imagemagick

On the Emacs side, I use the camcorder package to control the workflow. It is hosted in the MELPA repository and can be installed by

(package-install 'camcorder)

Then everything should work nicely together.

Workflow

After these packages are installed, creating a GIF is simple, requiring only three steps.

1. Initiate the recording

In Emacs,

  • Switch to the buffer we want to record; let's call this buffer the recording buffer,
  • Initiate the recording with the M-x camcorder-record command,
  • Choose where to save the video file.

A new frame containing the recording buffer will pop up, wrapped inside a white rectangular box. Everything inside the box will be recorded and saved to the video file. Note that if we move the frame or overlap it with other windows, we will probably get undesired results.

2. Record

  • Choose the recording buffer/frame,

  • Press F11 to pause/resume,
  • Show some cool things,
  • Press F12 to stop.

Note that the demonstration must have an effect on the recording buffer; we can use the with-current-buffer function to direct output to a particular buffer. For example,

(with-current-buffer "Demo_Buffer"
  (insert "Start to demo: "))

will insert "Start to demo: " into the Demo_Buffer.

I found it useful to wrap the demonstration in a function and bind it to a key, because I will probably run it many times.

(defun yt/camcorder-show-off ()
  (interactive)
  (goto-char (point-min))
  (insert "going to show you something cool, don't blink your eyes.")
  (sleep-for 2)
  ;;;; apply some functions
  (insert "\nExciting isn't?"))

(define-key camcorder-mode-map [f5] 'yt/camcorder-show-off)

Two functions are helpful for controlling the flow: sleep-for makes the program wait for a few seconds, and y-or-n-p lets us choose whether to proceed or switch the flow.

3. Make the GIF

After the demo is finished,

  • Type M-x camcorder-convert to convert the video file to a GIF file,
  • Choose a file name for the GIF file,
  • Select the conversion method; I choose the one that uses mplayer with imagemagick.

We will probably repeat steps 1-3 a few times until we are happy with the GIF.

Footnotes:

1
A regular expression might not be suitable for this task, but it works.
2
Everything is actually an expression.