Non-Negative Matrix Factorization
By Bas Machielsen
October 5, 2022
Introduction
For a course I am teaching, the syllabus includes an alternative to Principal Component Analysis called non-negative matrix factorization. The method was introduced in an article by Lee and Seung (1999), with a subsequent elaboration of the algorithm that implements it in 2001. The method is described in The Elements of Statistical Learning (p. 553), but there is very little elaboration on the meaning and logic behind the updating rules. I also searched the internet, and apart from the original 2001 article, there is very little information available. The article itself takes a very formal view and proves that the algorithm converges to a (local) maximum, but does not focus on the intuition for why it works. Hence this blog post.
Non-Negative Matrix Factorization
Non-negative matrix factorization is meant to be applied to any non-negative matrix \(\mathbf{X}\) of size \(N \times p\), which is approximated as:
$$ \mathbf{X} \approx \mathbf{W} \mathbf{H} $$
with \(\mathbf{W}\) being \(N \times r\), \(\mathbf{H}\) being \(r \times p\), and all \(x_{ij}\), \(w_{ij}\) and \(h_{ij}\) greater than or equal to zero. Similar to PCA, \(\mathbf{W}\) can be interpreted as a lower-dimensional basis (where \(r\) is picked by the user), and \(\mathbf{H}\) as the corresponding weights that map this basis back to \(\mathbf{X}\). The objective of non-negative matrix factorization is to maximize the likelihood of a model in which each \(x_{ij}\) has a Poisson distribution with mean \(\mathbf{WH}_{ij}\).
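Concretely, maximizing this Poisson likelihood amounts, up to terms that do not depend on \(\mathbf{W}\) and \(\mathbf{H}\), to maximizing
$$ \mathcal{L}(\mathbf{W}, \mathbf{H}) = \sum_{i=1}^{N} \sum_{j=1}^{p} \left( x_{ij} \log \mathbf{WH}_{ij} - \mathbf{WH}_{ij} \right) $$
which is, up to a sign flip and constants, the generalized Kullback-Leibler divergence between \(\mathbf{X}\) and \(\mathbf{WH}\) that Lee and Seung (2001) show their update rules improve at every step.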
Interpretation of the Algorithm
Lee and Seung (2001) propose to initialize the \(W\) and \(H\) matrices with random positive numbers, and then use the following update rules:
$$ w^{n+1}_{ik} \leftarrow w^{n}_{ik} \cdot \frac{\sum_{j=1}^{p}h_{kj} \cdot x_{ij}/(\mathbf{WH}_{ij})}{\sum_{j=1}^{p}h_{kj}} $$
and
$$ h^{n+1}_{kj} \leftarrow h^{n}_{kj} \cdot \frac{\sum_{i=1}^{N}w_{ik} \cdot x_{ij}/(\mathbf{WH}_{ij})}{\sum_{i=1}^{N}w_{ik}} $$
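Before interpreting these rules, it may help to see them in code. Below is a minimal NumPy sketch of the algorithm as described above; the function name `nmf_kl`, the fixed number of iterations, the initialization range, and the small constant `eps` added to avoid division by zero are choices of my own, not part of the original description.

```python
import numpy as np

def nmf_kl(X, r, n_iter=500, eps=1e-12, seed=0):
    """Multiplicative updates of Lee and Seung (2001) for X ~ W H
    under the Poisson/KL objective. A sketch, not an optimized version."""
    rng = np.random.default_rng(seed)
    N, p = X.shape
    # initialize W and H with random positive numbers
    W = rng.uniform(0.1, 1.0, size=(N, r))
    H = rng.uniform(0.1, 1.0, size=(r, p))
    for _ in range(n_iter):
        # update W: rescale each w_ik by a normalized, h-weighted
        # average of the ratios x_ij / (WH)_ij over row i
        WH = W @ H + eps
        W *= (X / WH) @ H.T / H.sum(axis=1)
        # update H: the same idea, with the ratios weighted by the w_ik
        WH = W @ H + eps
        H *= W.T @ (X / WH) / W.sum(axis=0)[:, None]
    return W, H

# tiny usage example on a random non-negative matrix
X = np.abs(np.random.default_rng(1).normal(size=(20, 10)))
W, H = nmf_kl(X, r=3)
print(np.abs(X - W @ H).mean())  # mean absolute reconstruction error
```

In practice one would also monitor the objective and stop once it no longer improves; a fixed number of iterations just keeps the sketch short.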
First, we note that each updating rule consists of two parts: the previous value and a factor. This factor can be interpreted as an updating factor, meaning that the new values are scaled versions of \(w^{n}_{ik}\) and \(h^{n}_{kj}\) respectively.
Second, we should notice that both update rules contain terms of the form \(x_{ij}/\mathbf{WH}_{ij}\). Here, \(\mathbf{WH}_{ij}\) is the approximation of that particular \(x_{ij}\) in the data matrix \(\mathbf{X}\). If the approximation is too high, \(\mathbf{WH}_{ij}\) is greater than \(x_{ij}\), and the value at iteration \(n+1\) is scaled down compared to its value at iteration \(n\), everything else equal. Vice versa, if the approximation is too low, then the previous value is scaled up.
The rest of the updating rule essentially consists of a normalized, weighted averaging of all the discrepancies between \(x_{ij}\) and \(\mathbf{WH}_{ij}\) that involve a particular element of \(W\) or \(H\). To see this, let's take the updating rule for \(w_{11}\); the same logic extends to the other cells of \(W\), and to all cells of \(H\). Now, when is \(w_{11}\) used? \(w_{11}\) is used to approximate all entries in the first row of \(X\). Those entries are generated by multiplying the first row of \(W\) with each column of \(H\), where the weight \(h_{1j}\) is always multiplied with \(w_{11}\). Knowing this, it is easy to see that the algorithm is essentially computing a weighted average of all those discrepancy ratios in the first row, and then normalizing them by dividing by the sum of all of those weights. See the illustration below:
$$ \begin{bmatrix} \mathbf{x_{11}} & \dots & \mathbf{x_{1p}} \newline \dots & \ddots & \dots \newline x_{N1} & \dots & x_{Np} \end{bmatrix} = \begin{bmatrix} \mathbf{w_{11}} & \dots & w_{1r} \newline \dots & \ddots & \dots \newline w_{N1} & \dots & w_{Nr} \end{bmatrix} \begin{bmatrix} \mathbf{h_{11}} & \mathbf{h_{12}} & \dots & \mathbf{h_{1p}} \newline \dots & \ddots & \ddots & \dots \newline h_{r1} & \dots & \dots & h_{rp} \end{bmatrix} $$
(The cells in bold face are the cells that involve \(w_{11}\), as in the example.)
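Writing the update rule out for \(w_{11}\) makes this weighted-average structure explicit:
$$ w^{n+1}_{11} \leftarrow w^{n}_{11} \cdot \frac{h_{11} \cdot x_{11}/\mathbf{WH}_{11} + h_{12} \cdot x_{12}/\mathbf{WH}_{12} + \dots + h_{1p} \cdot x_{1p}/\mathbf{WH}_{1p}}{h_{11} + h_{12} + \dots + h_{1p}} $$
Each ratio \(x_{1j}/\mathbf{WH}_{1j}\) is weighted by \(h_{1j}\), which is exactly the coefficient with which \(w_{11}\) enters that particular approximation.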
Conclusion
This short post attempted to intuitively explain the logic behind the updating rules used in non-negative matrix factorization. I personally think this intuitive understanding also makes clear why the algorithm is sensible and ultimately converges, even without reading through the proof. I hope this was useful, and I thank you for reading!