The Performance of Probabilistic Latent Semantic Analysis
By Bas Machielsen
October 5, 2023
Introduction
In this blog post, I want to investigate the performance of probabilistic latent semantic analysis: a subject I have been teaching (but also studying) for a course. Probabilistic latent semantic analysis proceeds from a document-term matrix, a standard data matrix in the field of text mining. It should look something like this, where the rows of the matrix represent documents and the columns terms (words):

$$X = \begin{pmatrix} x_{11} & \cdots & x_{1p} \\ \vdots & \ddots & \vdots \\ x_{n1} & \cdots & x_{np} \end{pmatrix}$$

Usually, an entry $x_{ij}$ is simply the number of times term $j$ occurs in document $i$.
The standard maximum likelihood estimator for $p_{ij}$ is $\hat{p}_{ij} = x_{ij}/n$, where $n = \sum_{i,j} x_{ij}$ is the total word count in all documents. This has a simple interpretation: the count of word $j$ in document $i$ divided by the total word count in all documents.
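For concreteness, here is a minimal sketch with a toy count matrix (the numbers are made up):

X_toy <- matrix(c(2, 0, 1,
                  1, 3, 0), nrow = 2, byrow = TRUE)  # 2 documents x 3 terms
n     <- sum(X_toy)   # total word count over all documents
P_hat <- X_toy / n    # maximum likelihood estimates of the p_ij
sum(P_hat)            # sums to 1, as a probability matrix should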
PLSA
Probabilistic Latent Semantic Analysis (PLSA) is an attempt to decompose this matrix using something similar to a singular value decomposition. In particular, given the probability matrix $P$ with elements $p_{ij}$, we can construct an approximation with $r$ latent classes:

$$P \approx U \Sigma V^{\top},$$

where $U$ ($n \times r$) has elements $p(d_i \mid z_k)$, $\Sigma$ ($r \times r$) is diagonal and contains the class probabilities $p(z_k)$, and $V^{\top}$ ($r \times p$) has elements $p(w_j \mid z_k)$. Naturally, the object of interest is usually $U$: this represents the probabilities connecting each document to the classes $1$ to $r$.
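To make the decomposition concrete, here is a small sketch with made-up numbers for $r = 2$ classes; the matrices are purely illustrative and not the output of any fitted model:

# U[i, k] = p(d_i | z_k): columns sum to 1 (3 documents x 2 classes)
U <- matrix(c(0.6, 0.3, 0.1,
              0.1, 0.2, 0.7), ncol = 2)
# V[j, k] = p(w_j | z_k): columns sum to 1 (3 terms x 2 classes)
V <- matrix(c(0.5, 0.4, 0.1,
              0.2, 0.2, 0.6), ncol = 2)
p_z   <- c(0.5, 0.5)               # class probabilities p(z_k)
P_hat <- U %*% diag(p_z) %*% t(V)  # the approximation of the probability matrix
sum(P_hat)                         # 1: a proper joint probability matrix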
In R, the svs package can be used to carry out probabilistic latent semantic analysis:
library(svs)
In what follows, I'll demonstrate the capacity of PLSA to distinguish two types of documents on its own: I'll scrape several Wikipedia pages about football and several about tennis, convert them into a document-term matrix, set $r$ (the number of classes) equal to 2, and investigate the output.
Example
Here, I first web-scrape the text of several Wikipedia pages:
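The scraping code itself isn't reproduced here, so below is a minimal sketch of how the texts object could be obtained, assuming the rvest and purrr packages and a hypothetical selection of seven football-related and seven tennis-related pages (the pages I actually used may differ):

library(rvest)
library(purrr)

# Hypothetical page selection: 7 about football, 7 about tennis
pages <- c("Association_football", "FIFA_World_Cup", "Lionel_Messi", "Premier_League",
           "UEFA_Champions_League", "Cristiano_Ronaldo", "FC_Barcelona",
           "Tennis", "Wimbledon_Championships", "Roger_Federer", "French_Open",
           "Rafael_Nadal", "US_Open_(tennis)", "Australian_Open")
urls <- paste0("https://en.wikipedia.org/wiki/", pages)

# Each element of texts holds the paragraph texts of one page
texts <- map(urls, function(url) {
  read_html(url) |>
    html_elements("p") |>
    html_text2()
})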
Now, I use the tidytext package to put these into a document-term matrix:
library(tidytext)
library(dplyr)
# Collect the texts into a data.frame, one row (document) per page
text_df <- tibble(Text = texts) |>
  rowwise() |>
  mutate(Text = paste(Text, collapse = " ")) |>  # glue the paragraphs of a page together
  ungroup()
# Tokenize each document into words, keeping the document id
text_data <- text_df |>
  mutate(document = row_number()) |>
  unnest_tokens(word, Text)
# Convert to a document-term matrix,
# filtering out stop words and numbers
stop_words <- bind_rows(stop_words, data.frame(word = as.character(0:10000), lexicon = "Custom"))

dtm <- text_data |>
  count(document, word) |>
  filter(!is.element(word, stop_words$word)) |>
  cast_dtm(document, word, n)
The document-term matrix (dtm) has the following dimensions:
as.data.frame(as.matrix(dtm)) |> dim()
## [1] 14 6197
Now, let’s extract the frequency matrix and apply PLSA:
library(svs)

X <- as.matrix(dtm)                            # raw frequency (count) matrix
out <- fast_plsa(X, k = 2, symmetric = TRUE)   # PLSA with two latent classes
Now, I want to find out which class each of the documents has been assigned to:
apply(out$prob1, 1, which.max)
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14
## 1 2 1 1 1 2 2 2 2 2 2 2 2 2
... which means that most documents (11 out of 14) end up in the cluster corresponding to their topic; documents 2, 6, and 7 are misclassified.
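Assuming the scraping order put the seven football pages first and the seven tennis pages last (an assumption about the scraping step above), the cluster assignments can be tabulated against these labels:

truth    <- rep(c("football", "tennis"), each = 7)  # assumed topic labels
assigned <- apply(out$prob1, 1, which.max)
table(truth, assigned)                              # cross-tabulate clusters against the assumed topics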
Comparison
We can compare the results with a so-called latent semantic analysis (LSA), which is essentially a singular value decomposition of the same matrix (a rough sketch illustrating this follows at the end of this section).
out_lsa <- fast_lsa(X)
out_lsa$pos1[, 'Dim1']
## 1 2 3 4 5 6
## -0.803081398 -0.338888792 -0.191561592 -0.298909337 -0.271174528 -0.133052909
## 7 8 9 10 11 12
## -0.146478395 -0.026522482 -0.011199147 -0.008402141 -0.019986480 -0.013202093
## 13 14
## -0.001734649 -0.001084412
In this case, we can see that splitting at the median of the first dimension already separates the documents perfectly into the two classes: the first 7 observations have strongly negative values, while the last 7 are close to zero. So we can take this as an indicator of which class the documents belong to:
data.frame(doc_no = 1:14) |>
  mutate(class = if_else(out_lsa$pos1[, 'Dim1'][doc_no] > median(out_lsa$pos1[, 'Dim1']), 1, 2))
## doc_no class
## 1 1 2
## 2 2 2
## 3 3 2
## 4 4 2
## 5 5 2
## 6 6 2
## 7 7 2
## 8 8 1
## 9 9 1
## 10 10 1
## 11 11 1
## 12 12 1
## 13 13 1
## 14 14 1
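As an aside, the claim that LSA boils down to a singular value decomposition can be illustrated with base R's svd(). This is only a rough sketch: fast_lsa may apply its own weighting or scaling, so it will not reproduce the numbers above exactly.

P   <- X / sum(X)   # plain SVD of the relative-frequency matrix, without any weighting
dec <- svd(P)
dec$u[, 1]          # first left singular vector: one score per document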
Conclusion
In this post, I have demonstrated a simple example of probabilistic latent semantic analysis and of latent semantic analysis. Since the simpler method (LSA) separated the two groups of documents at least as well as PLSA did, I would prefer the simpler method over the potentially more complicated one. Thank you for reading!