# GSoC2009-A.SEHILI-MahoutProjectProposal

**Title/Summary:**

[Apache Mahout] Implementation of two machine learning algorithms for the Mahout Machine Learning project

**Student:**

Mohamed El Amine SEHILI

**Student e-mail:**

amine.sehili AT gmail DOT com

**Student Degree:**

Computer Science engineer

**Organization:**

Apache Lucene Mahout-Machine-Learning

**Assigned Mentor:**

**Abstract:**

The goal of this project is to implement, test and document two of the best known machine learning algorithms for the Apache Mahout Machine Learning project. The algorithms in question are K-means and Hidden Markov Models (HMMs). The first one, K-means, is used to create clusters from a set of objects while HMMs is known as a simple and powerful tool used to analyze and recognize sequences.

**Detailed Description:**

The two proposed algorithms are described as follows:

**K-means:**

Despite its problems (e.g. needs to know the number of classes), k-means still a well-known algorithm for clustering. One of its applications is, for instance, to create a mixture of multi Gaussian from a set of vectors. There are several variants of k-means that attempt to improve the standard algorithm by allowing it to estimate the good number of clusters and including new distance measures. The article http://www.cs.stanford.edu/people/ang/papers/nips06-mapreducemulticore.pdf describes a Map/Reduce form of the algorithm. the key idea is to split the data into individual subgroups and clustering vectors in each subgroup separately. To determine new centroid vectors the set of input vectors is divided into subgroups such as the sum of each subgroup is computed in parallel. The reducer will then use the partial sums to compute the new centroids.

**Hidden Markov Models:**

Hidden Markov Models are a statistical model widely used in artificial intelligence and espicially in speech recognition, face recognition, part-of-speech tagging, bioinformatics... etc. Simply, one can say that most successful commercial speech recognition systems are using HMMs. Furthermore they are used in other domains like signal processing. Given their wide range of applications, it is a pretty good thing to implement them as a part of this project. The implemantation will cover the three canonical problems related to HMMs:

- Estimating of the probability of a given sequence of observations (using forward-backward procedure or viterbi algorithm).
- Estimating the most likely sequence of states given a sequence of observations (viterbi algorithm).
- re-estimating the parameters of a HMM to improve the probability of a given sequence of observations (baum-welch algorithm, a special case of EM). When dealing with HMMs that use GMMs as Probability Distribution Function (PDF) we can use the Map/Reduce form of EM as described in the previous article.

The re-estimation of the transition matrix (matrix A according to the Lawrence Rabiner's notation) and the initial state distribution (PI vector) is the same whether the HMM uses discrete or continuous PDF (although for some applications PI is not re-estimated to maintain the first state the only possible start state). Things are different when we consider the re-estimation of PDFs because the re-estimation of a discrete distribution (represented by a matrix B for instance) differs from the re-estimation of a continuous distribution (e.g. GMMs).

For sequences with very little number of observations, the default version of forward-backward and viterbi algorithms will work fine, but as the observations are longer there will be a need to use techniques such as scaling and using logarithms to avoid the underflow problem (this problem arises because the probability is computed by multiplying a great number of values which are < 1).

For discrete HMMs, When a symbol doesn't appear in the training data it is very important to give it a probability (although very small) greater than zero so that if it appears in the test data (it is quite possible that it does) the probability of the whole sequence will not be null.

As it can be seen, the HMMs project is so important that the k-means algorithm itself can be seen as a part of it (i.e. used to create GMMs). Of course all of the issues discussed above will be taken into account during implementation.

*Note:*

I have chosen these algorithms because I know theme (I have good knowledge on HMMs and I already used them in a project) but I'll have no problem to implment another algorithm instead of them if my mentor recommend it (in such case I would prefer PCA or ICA).

**Deliverables:**

I hope at the end of this project most of the algorithms will effectively be implemented by the participants. From my part, I'll give a particular importance to the algorithms I proposed and I will primarily focus on them.

For k-means:

- One variant of the algorithm must be implemented.

For HMMs, The following must be imlemented:

- Discrete HMMs with a finite alphabet of symbols.
- Continuous HMMs with : normal distribution, multivariate Gaussian distribution, Gaussian mixture models.
- Simple version of viterbi algorithm and a version which uses logarithms.
- Simple version of forward-backward procedure and a scaled one.
- Simple version of baum-welch algorithm (uses the first version of forward-backward) and a scaled version (uses the second).

It would be so good to add support for embedded HMMs and implement more than one variant of k-means.

**Project planning:**

First, I would like to mention that I have finished with my courses and exams this year and I will have no courses till october. I'll soon begin a project on statistical machine translation at the LIMSI Laboratory in my university, but oubviously, this will not prevent me to provide the necessary time and effort working on the current project. Given the project timelines, my plan will be as follows (from May 23rd to August 10th):

- k-means : 4 weeks
- Hidden Markov Models: 6 weeks

Although I will be ready to begin working from the first week of Mai, I would like to spent a few time getting to know my mentor, the team and the project available code and learn how work will be organized during the project as I am new to open source projects.

**Biography:**

My name is Mohamed El Amine SEHILI, I am 24. I am student at Paris-south 11 (Paris-sud in french) university in Master 2 Research in Computer Science . This year, I had mainly courses on Natural Language Processing (NLP) but I have also studied Semantic Web, Multi-agent Systems and High-performance computing. I will join the Spoken Language Processing Group (TLP) at the LIMSI Laboratory to work on a machine translation project (Consensus Networks for statistical machine translation).

In 2008, I worked on a face recognition project to get the degree of engineer in computer science (from Algeria). This project was extremely good and important for my career as it allowed me to discover artificial intelligence and learn several machine learning algorithms. At the end of the project, I have implemented 5 face recognition methods including:

- PCA (Principal Components Analysis).
- DCT (Discrete Cosine Transform) with features vector.
- Combination of PCA and DCT.
- Discrete HMMs with DCT.
- Discrete HMMs with Haar-like features.

I have also implemented a method for face detection in colored images.

In the beginning, I would like to write my project using first Matlab and then C++ with OpenCV but I finally decided to write it entirely in Java as it is my main programming language.

**Motivations:**

My main motivations for the project are:

1. While it is quiet easy to find source code for many kinds of projects, including web browsers, video streaming applications and even whole operating systems, it is not so easy to find a good written and tested scientific code. So, I will be very happy and proud to share my knowledge by participating in a project that should make life easier for students and reseachers and would be adopted (I hope) by several universities in the future.

2. At the end of my project in June 2008, I had some amount of code on face recognition and detection that I would like to make available on the Internet through an open source project but I hesitated a lot because I have no experience on open source projects and I was afraid of being overwhelmed by the complexity associated to its management. I hope the current project will allow me to get to know how things work in the open source world and helps me to prevent my hard disk to become a grave of my codes.