Title/Summary:
Implementation of Expectation Maximization algorithm with Mixture Model for Mahout
Student:
Yifan Wang
Student e-mail:
Student Major:
Computer Science
Student Degree:
Master
Student Graduation:
2010
Organization:
The Apache Software Foundation
Abstract:
Implementation of Expectation Maximization algorithm with Mixture Model for Mahout
Detailed Description:
Development Goal
Implementation of Expectation Maximization algorithm with Mixture Model for Mahout
When submit the program, the followings should also be submitted.
1.Unit test cases and regression test cases.
2.Help documents and example code for using the algorithm.
Development Plan
According to the Google website, the coding begins on May 23th, and ends on August 10th.
So it will give me about 12 weeks to design, development and test.
1) Pre-phase
This phase will begin as soon as my application is approved, and will last until May 23th.
Firstly, during this phase, I will get myself familiar with the Mahout project and its framework and read the existing codes in your library.
Secondly, I have basic knowledges of the mixture models of EM algorithm, but I suppose that this is not enough for a MapReduce implementation of the algorithm in the Mahout library. So I plan to study one or two open source implementations of the single-thread version of the algorithm. During the time, i can get familiar with the algorithm and know essential steps to implement a single-thread version of EM.
2) Development-phase
This phase will begin on May 23th, and end on August 10th.
Part I:
Week 1-3:
Development of single-thread version of EM.
Test the correctness of the implementation of the algorithm.
Part II:
Week 4-10:
Development of MapReduce version of EM.
Test the correctness and performance of the MapReduce version of the algorithm.
Part III:
Week 11-12:
Bug fix and Example code for user
The 3-week Part I can be further divided into several steps:
1. Read the specification of the algorithm and design the implementation (0.5 week)
2. Implementation of the algorithm.(1.5 -2 weeks)
3. Test of the correctness of the implementation and Bug fix (0.5-1 week)
The 7-week Part II can be further divided into several steps:
1. Design of the Algorithm (1.5 - 2 weeks)
1.1 Find the part of the algorithm that can be parallelized. Design the MapReduce version of the Implementation of the algorithm. 1.2 Design of the API that will be called by the user of the library.
2. Implementation of the algorithm. Along with unit test cases and documentation of the API
- 2.1 Find a good seed(initial parameters) algorithm for the EM algorithm (0.5 week)
2.2 Implementation of the MapReduce version of the algorithm
- 2.2.1 Parallelization of codes in the Maximization Step(1-1.5 week) 2.2.2 Parallelization of codes in the Expectation Step(1-1.5 week) 2.2.3 Parallelization of codes between different EM iterations (1 week)
3. Test of the correctness and performance of the implementation and Bug fix. (1.5-2 week)
Biography
My name is Yifan Wang, I am a master student in the CS department of Technical University of Munich.
Before come to Germany, I received a bachelor degree of Software Engineering from Tongji University, China.
I am interested in the field of distributed systems, parallel computing and machine learning technology.
During the study, I did summer internships in Microsoft (Windows Server 2008 Team, 2006) and SAP (LinuxLab, 2008)
Motivation
I am interested in the application of very large information systems. Machine learning technology is one of the most important applications for such system.
We are in an era where information explodes but our brain's ability to handle the information remains. We need the technology that can help us get the Information as soon as possible and as related as possible. Machine learning algorithms can greatly help us on this.
Present Skills
I have experiences with a few languages and techniques.
c/c++/OpenMP/MPI: I am now working on a distributed performance analysis program running on a supercomputer installed in LRZ in Germany. The program named Periscope, it is design to analysis the
performance of program running on peta-scale supercomputer.
c++/c#: During my internship in Microsoft, I wrote the UDDI SDK. It can be used to create program that publish, search and manage web service on a Microsoft UDDI server.
Java/Python/KDE: During my internship in SAP LinuxLab, I wrote the single sign-on program for their Linux platform.
The program is used to provide a unified solution to the access of resources in the company's internal network.
PHP/JavaScript: www.allesfuer5.com. It is a online-shopping site in Germany.
End
This document is a draft and can be modified in the future according to feedbacks from mentors.