# Notes on Factorisation Machine

20 Dec 2015

To start with, consider the model equation of a linear regression model with a two-way interaction effect:

\begin{equation} f(x) = \beta_0 + \sum_{i=1}^{p} \beta_i x_i + \sum_{i=1}^{p} \sum_{j=i+1}^{p} \beta_{i, j} x_i x_j \end{equation}

where

- \(\beta_0\) is the intercept term,
- \(\beta_i\) models the strength of the i-th variable,
- \(\beta_{i,j}\) models the interaction between the i-th and j-th variables.

To estimate the parameters, we can first create the interaction
variables ^{1} \(x_{i,j} = x_i \cdot x_j\) and add them to the design
matrix \(X\). This converts the problem into a linear regression, which
can be solved easily by least squares ^{2}.
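As a quick illustration of this baseline, here is a minimal numpy sketch (with synthetic data; the variable names are my own) that expands the design matrix with the pairwise products and solves by least squares:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
n, p = 200, 5
X = rng.normal(size=(n, p))   # synthetic predictors
y = rng.normal(size=n)        # synthetic response

# Create the interaction variables x_{i,j} = x_i * x_j and append them to X.
pairs = list(combinations(range(p), 2))
X_inter = np.column_stack([X[:, i] * X[:, j] for i, j in pairs])
X_design = np.column_stack([np.ones(n), X, X_inter])

# Solve the resulting linear regression by least squares.
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)
```

Note that this adds \(p(p-1)/2\) columns, one per pair, which is exactly the quadratic blow-up that Factorisation Machines avoid.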

Factorisation Machines have the same model equation but estimate the interaction parameters \(\beta_{i,j}\) in a different way, by factorising them: provided \(k\) is sufficiently large, the \(p \times p\) matrix \(\beta\) can be approximated (factorised) as \(V V^T\) using a lower-dimensional matrix \(V\) of size \(p \times k\).

In this way, the interaction parameter becomes

\begin{equation} \hat{\beta}_{i,j} := \langle v_i, v_j \rangle = \sum_{f=1}^{k} v_{i, f} v_{j, f} \end{equation}

where \(k\) is a hyper-parameter that defines the dimensionality of the factorisation.
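A small numpy sketch of this factorisation (the entries of \(V\) are random here, just to show the shapes involved):

```python
import numpy as np

rng = np.random.default_rng(0)
p, k = 5, 2                  # k: dimensionality of the factorisation
V = rng.normal(size=(p, k))

# The full (p x p) interaction matrix is approximated by V @ V.T,
# so each beta_{i,j} is the inner product of the factor rows v_i and v_j.
beta_hat = V @ V.T
i, j = 1, 3
assert np.isclose(beta_hat[i, j], V[i] @ V[j])
```

Instead of \(p(p-1)/2\) free interaction parameters, the model only has to learn the \(p \times k\) entries of \(V\).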

The model equation can be solved using stochastic gradient descent
with various loss functions (square loss, logit loss, or hinge loss). For square
loss ^{3}, the SGD update needs the gradient of the FM model, which is

\begin{equation} \frac{\partial}{\partial \theta} f(x) = \begin{cases} 1, & \text{if } \theta \text{ is } \beta_0 \\ x_i, & \text{if } \theta \text{ is } \beta_i \\ x_i \sum_{j=1}^{p} v_{j, f} x_j - v_{i, f} x_i^2, & \text{if } \theta \text{ is } v_{i, f} \end{cases} \end{equation}
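Putting the model equation and these gradients together, a minimal sketch of one SGD update under square loss \(\tfrac{1}{2}(f(x) - y)^2\) might look as follows. This is my own toy implementation (not Rendle's libFM), and the learning rate `lr` is an arbitrary choice:

```python
import numpy as np

def fm_predict(x, b0, b, V):
    """FM model equation, evaluated via the linear-time form (see below)."""
    s = V.T @ x                              # s_f = sum_i v_{i,f} x_i
    return b0 + b @ x + 0.5 * (s @ s - np.sum((V**2).T @ (x**2)))

def sgd_step(x, y, b0, b, V, lr=0.01):
    """One stochastic gradient step under square loss (1/2)(f(x) - y)^2."""
    err = fm_predict(x, b0, b, V) - y        # d loss / d f(x)
    s = V.T @ x                              # reused in the factor gradients
    b0_new = b0 - lr * err                   # d f / d beta_0 = 1
    b_new = b - lr * err * x                 # d f / d beta_i = x_i
    # d f / d v_{i,f} = x_i * s_f - v_{i,f} * x_i^2
    V_new = V - lr * err * (np.outer(x, s) - V * (x**2)[:, None])
    return b0_new, b_new, V_new

# Toy usage on a single synthetic observation.
rng = np.random.default_rng(0)
p, k = 10, 4
b0, b, V = 0.0, np.zeros(p), 0.01 * rng.normal(size=(p, k))
x, y = rng.normal(size=p), 1.0
b0, b, V = sgd_step(x, y, b0, b, V)
```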

Due to this factorisation of the interactions, no model parameter
directly depends on a pair of variables, and the pairwise term can be
reformulated so that the computation is reduced from quadratic
\(\mathcal{O}(kn^2)\) to linear \(\mathcal{O}(kn)\); see Lemma 3.1 in Rendle's paper ^{4}.

Another advantage of this is that we can estimate the interaction parameter \(\beta_{i,j}\) even when there are not enough observations of the pair \((x_i, x_j)\). This is a desirable feature when working with sparse datasets. Rendle showed that FM outperforms the Support Vector Machine on highly sparse datasets.

The Factorisation Machine looks promising: it's fast to train and performs well on sparse datasets. I haven't fully understood it yet but am keen to apply it to a Kaggle competition and gain more insights.