News Personalization System
Google's News Personalization System Clone Project.
Initial Contributors
Edward Yoon (R&D center, NHN corp.)
Download
Algorithm Overview
Obtain a list of candidate stories
For each story:
Obtain 3 scores (y1, y2, y3)
Final score = ∑ wi*yi
User clustering algorithms (Minhash (y1), PLSI (y2))
Score ~ number of times this story was clicked by other users in your cluster
Story-story covisitation (y3)
Score ~ number of times this story was co-clicked with other stories in your recent click history
User clustering done offline as batch process
Can be run every day or 2-3 times a week
Cluster-story counts maintained in real time
New users
May not be clustered
Rely on co-visitation to generate recommendations
NPS Architecture
+----------------------------+
| Mail Frontend Web server |
+-----+---------------+------+
| ↑ |
ACP communication
↓ | ↓
+--------------------------------+--------------+
| intermediation DB Servers +---+
+----------------------------------------+------+ |
↑ ↑ | |
Reads user profile Read contents Update contents |
| | ↓ |
+-------------------------+-----+ +---------+----------------------+ |
| Hbase:UserTable | | Hbase:MIMETable | |
| (user profile, logs) | | (MIME data) | |
+-------------------------------+ +--------------------------------+ |
↑ ↑ |
+-----------------------------+--------------------+----------------------+ |
| Hadoop: MapReduce Jobs (e.g. Log Analysis, Network Analysis, .., etc) |←--+
+-------------------------------------------------------------------------+
Clustering Algorithms
User clustering - MinHash
Input: User and his clicked stories
Su = {su1 , su2 , ... , sum}
User similarity = | Su1 I Su2 | / | Su1 Y Su2 |
Output: User clusters.
Similar users belong to same cluster
MinHash
Randomly permute the universe of clicked stories
{su1 , su2 , ... , sum} = {s'1 , s'2 , ... , s'm}
MH(u) = min(suj) min defined by permutation
P{MH(u1) = MH(u2)} = | Su1 I Su2 | / | Su1 Y Su2 |
Pseudo-random permutation
Compute hash for each story and treat hash-value as permutation index
Treat MinHash value as ClusterId
Probabilistic clustering
Clustering - PLSI Algorithm
Learning (done offline)
ML estimation
Learn model parameters that maximize the likelihood of the sample data
Ouput = P[zj|u]’s P[s|zj]’s
P[zj|u]’s lead to a soft clustering of users
Runtime: we only use P[zj|u]’s and ignore P[s|zj]’s
Covisitation count
For each story si store the covisitation counts with other stories c(si, sj)
Candidate story: sk
User history: s1,…, sn
score (si, sj) = c(si, sj)/∑m c(si, sm)
total_score(sk) = ∑n score(sk, sn)
References
Google News Personalization: Scalable Online Collaborative Filtering
Bigtable paper: OSDI 2006