Yi Tang Data Scientist with Emacs

Notes on Factorisation Machine

To start with, consider the model equation of a linear regression model with two-way interaction effects:

\begin{equation} f(x) = \beta_0 + \sum_{i=1}^{p} \beta_i x_i + \sum_{i=1}^{p} \sum_{j=i+1}^{p} \beta_{i, j} x_i x_j \end{equation}

where

\(\beta_0\)
is the intercept term,
\(\beta_i\)
models the strength of the i-th variable,
\(\beta_{i,j}\)
models the interaction between the i-th and j-th variables.

To estimate the parameters, we can first create the interaction variables \(x_{i,j} = x_i \cdot x_j\) and add them to the design matrix \(X\). This converts the problem into a linear regression, which can be solved by least squares.
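As a quick sketch of this approach in NumPy (the data and coefficients below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.normal(size=(n, p))

# Build the interaction columns x_i * x_j for i < j and append them to X.
pairs = [(i, j) for i in range(p) for j in range(i + 1, p)]
X_int = np.column_stack([X[:, i] * X[:, j] for i, j in pairs])
design = np.column_stack([np.ones(n), X, X_int])  # intercept, main effects, interactions

# Simulate a response from known coefficients, then recover them by least squares.
beta_true = np.array([1.0, 0.5, -2.0, 3.0, 0.7, -1.2, 0.4])
y = design @ beta_true
beta_hat, *_ = np.linalg.lstsq(design, y, rcond=None)
```

With noise-free data and a full-rank design, `beta_hat` recovers `beta_true` exactly.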

The Factorisation Machine has the same model equation but estimates the interaction parameters \(\beta_{i,j}\) in a different way, by factorising them: provided \(k\) is sufficiently large, the \((p \times p)\) matrix \(\beta\) can be approximated (factorised) as \(V V^T\) using a lower-dimensional matrix \(V\) \((p \times k)\).

In this way, the interaction parameter becomes

\begin{equation} \hat{\beta}_{i,j} := \langle v_i, v_j \rangle = \sum_{f=1}^{k} v_{i, f} v_{j, f} \end{equation}

where \(k\) is a hyper-parameter that defines the dimensionality of the factorisation.
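A small NumPy sketch of the factorisation (toy sizes, random \(V\)): the implied interaction matrix is \(V V^T\), and each entry \(\hat{\beta}_{i,j}\) is the inner product of rows \(v_i\) and \(v_j\):

```python
import numpy as np

rng = np.random.default_rng(1)
p, k = 5, 3
V = rng.normal(size=(p, k))  # one k-dimensional latent vector v_i per variable

beta_approx = V @ V.T  # (p x p) matrix of all pairwise interaction strengths

# Entry (i, j) equals the inner product <v_i, v_j>, so instead of
# p*(p-1)/2 free interaction parameters we only learn p*k entries of V.
interaction_14 = beta_approx[1, 4]
```

Note that \(V V^T\) is symmetric by construction, which is fine here since \(\beta_{i,j}\) and \(\beta_{j,i}\) describe the same interaction.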

The model equation can be fitted using stochastic gradient descent with various loss functions (squared loss, logistic loss, or hinge loss). For squared loss, the gradient of the FM model is

\begin{equation} \frac{\partial{f}}{\partial{\theta}} = \begin{cases} 1, & \textrm{ if } \theta \textrm{ is } \beta_0 \\ x_i, & \textrm{ if } \theta \textrm{ is } \beta_i \\ x_i(\sum_{j=1}^{p} v_{j, f} x_j) - v_{i, f} x_i^2, & \textrm{ if } \theta \textrm{ is } v_{i, f} \end{cases} \end{equation}
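A minimal sketch of one SGD step under squared loss, using the gradients above (the learning rate, initialisation, and function names are my own, not from Rendle's paper):

```python
import numpy as np

def fm_predict(x, b0, b, V):
    """FM model equation, using the naive pairwise sum over i < j."""
    pair = sum((V[i] @ V[j]) * x[i] * x[j]
               for i in range(len(x)) for j in range(i + 1, len(x)))
    return b0 + b @ x + pair

def sgd_step(x, y, b0, b, V, lr=0.01):
    """One SGD step for squared loss L = (f(x) - y)^2, via the chain rule."""
    err = 2.0 * (fm_predict(x, b0, b, V) - y)   # dL/df
    b0_new = b0 - lr * err * 1.0                # df/d(b0)    = 1
    b_new = b - lr * err * x                    # df/d(b_i)   = x_i
    # df/d(v_{i,f}) = x_i * (sum_j v_{j,f} x_j) - v_{i,f} * x_i^2
    s = x @ V                                   # s_f = sum_j v_{j,f} x_j
    grad_V = np.outer(x, s) - V * (x ** 2)[:, None]
    V_new = V - lr * err * grad_V
    return b0_new, b_new, V_new
```

Repeating `sgd_step` on a sample drives the squared loss down, which is an easy sanity check of the gradient formulas.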

Due to this factorisation of the interactions, no model parameter directly depends on a pair of variables. The computation can be reduced from quadratic, \(\mathcal{O}(kp^2)\), to linear, \(\mathcal{O}(kp)\); see Lemma 3.1 in Rendle's paper.
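Lemma 3.1 rewrites the pairwise sum as \(\frac{1}{2} \sum_{f=1}^{k} \left[ (\sum_{i} v_{i,f} x_i)^2 - \sum_{i} v_{i,f}^2 x_i^2 \right]\), which a quick numerical check confirms (toy sizes, random data):

```python
import numpy as np

rng = np.random.default_rng(2)
p, k = 6, 4
x = rng.normal(size=p)
V = rng.normal(size=(p, k))

# Naive quadratic-time pairwise sum: sum over i < j of <v_i, v_j> x_i x_j.
naive = sum((V[i] @ V[j]) * x[i] * x[j]
            for i in range(p) for j in range(i + 1, p))

# Linear-time reformulation from Lemma 3.1 in Rendle's paper.
s = x @ V  # s_f = sum_i v_{i,f} x_i, computed in O(kp)
linear = 0.5 * (s @ s - ((x ** 2) @ (V ** 2)).sum())
```

The trick is that summing over all ordered pairs \((i, j)\) gives \((\sum_i v_{i,f} x_i)^2\), which double-counts each unordered pair and also includes the diagonal terms \(v_{i,f}^2 x_i^2\); subtracting the diagonal and halving recovers the sum over \(i < j\).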

Another advantage is that we can estimate the interaction parameter \(\beta_{i,j}\) even when there are not enough observations of the pair \((x_i, x_j)\). This is a desirable feature when working with sparse datasets. Rendle showed that FM outperforms Support Vector Machines on highly sparse datasets.

The Factorisation Machine looks promising: it's fast to train and performs well on sparse datasets. I haven't fully understood it yet, but I'm keen to apply it to a Kaggle competition and gain more insights.

If you have any questions or comments, please post them below. If you liked this post, you can share it with your followers or follow me on Twitter!