Contents
News Personalization System
Google's News Personalization System Clone Project.
Initial Contributors
Edward Yoon (R&D center, NHN corp.)
Download
Algorithm Overview
- Obtain a list of candidate stories
- For each story:
- Obtain 3 scores (y1, y2, y3)
- Final score = ∑ wi*yi
- User clustering algorithms (Minhash (y1), PLSI (y2))
- Score ~ number of times this story was clicked by other users in your cluster
- Story-story covisitation (y3)
- Score ~ number of times this story was co-clicked with other stories in your recent click history
- User clustering done offline as batch process
- Can be run every day or 2-3 times a week
- Cluster-story counts maintained in real time
- New users
- May not be clustered
- Rely on co-visitation to generate recommendations
NPS Architecture
+----------------------------+
| Mail Frontend Web server |
+-----+---------------+------+
| ↑ |
ACP communication
↓ | ↓
+--------------------------------+--------------+
| intermediation DB Servers +---+
+----------------------------------------+------+ |
↑ ↑ | |
Reads user profile Read contents Update contents |
| | ↓ |
+-------------------------+-----+ +---------+----------------------+ |
| Hbase:UserTable | | Hbase:MIMETable | |
| (user profile, logs) | | (MIME data) | |
+-------------------------------+ +--------------------------------+ |
↑ ↑ |
+-----------------------------+--------------------+----------------------+ |
| Hadoop: MapReduce Jobs (e.g. Log Analysis, Network Analysis, .., etc) |←--+
+-------------------------------------------------------------------------+
Clustering Algorithms
User clustering - MinHash
- Input: User and his clicked stories
Su = {su1 , su2 , ... , sum}
User similarity = | Su1 I Su2 | / | Su1 Y Su2 |
- Output: User clusters.
- Similar users belong to same cluster
MinHash
- Randomly permute the universe of clicked stories
{su1 , su2 , ... , sum} = {s'1 , s'2 , ... , s'm}
MH(u) = min(suj) min defined by permutation
P{MH(u1) = MH(u2)} = | Su1 I Su2 | / | Su1 Y Su2 |
- Pseudo-random permutation
- Compute hash for each story and treat hash-value as permutation index
Treat MinHash value as ClusterId
- Probabilistic clustering
Clustering - PLSI Algorithm
- Learning (done offline)
- ML estimation
- Learn model parameters that maximize the likelihood of the sample data
- Ouput = P[zj|u]’s P[s|zj]’s
- P[zj|u]’s lead to a soft clustering of users
- ML estimation
- Runtime: we only use P[zj|u]’s and ignore P[s|zj]’s
Covisitation count
- For each story si store the covisitation counts with other stories c(si, sj)
- Candidate story: sk
- User history: s1,…, sn
- score (si, sj) = c(si, sj)/∑m c(si, sm)
- total_score(sk) = ∑n score(sk, sn)
References
- Google News Personalization: Scalable Online Collaborative Filtering
- Bigtable paper: OSDI 2006