Title/Summary: [Apache Mahout] Implement parallel Random/Regression Forests
Student: AbdelHakim Deneche
Student e-mail: firstname.lastname@example.org
Student Major: PhD in Computer Science
Student Degree: Master in Computer Science
Student Graduation: Spring 2011
Organization: Apache Lucene Mahout-Machine-Learning
My goal is to add the power of random/regression forests to Mahout. By the end of this summer, one should be able to build random/regression forests for large, possibly distributed, datasets, store the forest, and reuse it to classify new data. In addition, a demo on EC2 is planned.
This project is all about random/regression forests. The core component is the tree-building algorithm, which grows a decision tree from a random bootstrap of the whole dataset. I have already written a detailed description on the Mahout wiki [Random Forests]. Given the size of the dataset, two distributed implementations are possible:
- The most straightforward one deals with relatively small datasets. By small, I mean a dataset that can be replicated on every node of the cluster. Each mapper has access to the whole dataset, so if the forest contains N trees and we have M mappers, each mapper runs the core building algorithm N/M times. This implementation is relatively easy because each mapper runs the basic building algorithm as-is. It is also of great interest if the user wants to try different parameters when building the forest. An out-of-core variant is also possible, to deal with datasets that do not fit in a node's memory.
- The second implementation, which is the more difficult, is concerned with very large datasets that cannot fit on every machine of the cluster. In this case the mappers work differently: each mapper has access to only a subset of the dataset, so all the mappers collaborate to build each tree of the forest. The core building algorithm must therefore be rewritten in a map-reduce form (a rough sketch follows). This implementation can deal with datasets of any size, as long as they fit on the cluster.
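The map-reduce rewrite could, for instance, turn the split-selection step into a pass where each mapper emits partial class counts for the candidate splits of its own partition, and a reducer merges them so the best split can be scored (e.g. by information gain). The sketch below is only an illustration under assumed names and an assumed CSV input format; DistributedSplitStats and its inner classes do not exist in Mahout.

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    /**
     * Hypothetical sketch of one round of implementation B (all names are
     * placeholders). Each mapper only sees its partition of the dataset, so
     * split selection becomes: map = emit partial class counts for every
     * candidate split, reduce = merge the counts so the driver can score
     * the splits (e.g. by information gain).
     */
    public class DistributedSplitStats {

      /** Emits ("attrIndex:value:classLabel", 1) for each local instance. */
      public static class StatsMapper
          extends Mapper<LongWritable, Text, Text, LongWritable> {

        private static final LongWritable ONE = new LongWritable(1);

        @Override
        protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
          // parse one CSV instance: attributes..., label (format is assumed)
          String[] fields = line.toString().split(",");
          String label = fields[fields.length - 1];
          for (int a = 0; a < fields.length - 1; a++) {
            // a real pass would restrict this to the random attribute
            // subset chosen for the tree node currently being split
            context.write(new Text(a + ":" + fields[a] + ":" + label), ONE);
          }
        }
      }

      /** Merges the per-partition counts for each candidate split. */
      public static class StatsReducer
          extends Reducer<Text, LongWritable, Text, LongWritable> {

        @Override
        protected void reduce(Text split, Iterable<LongWritable> counts,
                              Context context)
            throws IOException, InterruptedException {
          long sum = 0;
          for (LongWritable c : counts) {
            sum += c.get(); // combine statistics from all partitions
          }
          context.write(split, new LongWritable(sum));
        }
      }
    }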
Although the first implementation is easier, the CPU and IO overhead of its out-of-core variant is still unknown. A reference, non-parallel, implementation should thus be built to better understand the effects of the out-of-core approach, especially for large datasets. This reference implementation is also useful to assess the correctness of the distributed implementations.
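To make the reference implementation concrete, here is a minimal sketch of its outer loop, assuming dense double[][] data; SequentialForestBuilder, DecisionTree, and TreeBuilder are hypothetical names used for illustration, not existing Mahout classes.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;

    /**
     * Sketch of the sequential reference builder. Each tree is grown from a
     * bootstrap sample (instances drawn with replacement) of the training
     * set; the bootstrap is what makes each tree of the forest different.
     */
    public class SequentialForestBuilder {

      public interface DecisionTree { /* classify(), serialization, ... */ }

      public interface TreeBuilder {
        DecisionTree build(double[][] sample, Random rng);
      }

      public List<DecisionTree> build(double[][] data, TreeBuilder builder,
                                      int nbTrees, long seed) {
        Random rng = new Random(seed);
        List<DecisionTree> forest = new ArrayList<DecisionTree>(nbTrees);
        for (int t = 0; t < nbTrees; t++) {
          // each tree sees its own bootstrap of the training set
          forest.add(builder.build(bootstrap(data, rng), rng));
        }
        return forest;
      }

      /** Draw data.length instances with replacement. */
      private static double[][] bootstrap(double[][] data, Random rng) {
        double[][] sample = new double[data.length][];
        for (int i = 0; i < sample.length; i++) {
          sample[i] = data[rng.nextInt(data.length)];
        }
        return sample;
      }
    }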
Working Plan and List of Deliverables
1. Reference implementation of the Random/Regression Forest building algorithm:
- Build a forest of trees. The basic algorithm (described in the wiki) takes a subset of the dataset as a training set and builds a decision tree; this is repeated for each tree of the forest.
- The forest is stored in a file so that it can be reused, at any time, to classify new cases (a rough sketch of this storage step follows this list).
- At this step, the necessary changes are made to Mahout's Classifier interface to extend its use beyond Text datasets.
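As a rough idea of the storage step mentioned above, the forest could be persisted with plain Java serialization; on a Hadoop cluster the Writable mechanism would likely be the more idiomatic choice, so this hypothetical ForestStore helper is an assumption for illustration, not a committed file format.

    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.ObjectInputStream;
    import java.io.ObjectOutputStream;
    import java.io.Serializable;
    import java.util.List;

    /** Hypothetical helper: assumes the tree objects implement Serializable. */
    public class ForestStore {

      /** Write the whole forest into a single file so it can be reused later. */
      public static void save(List<? extends Serializable> forest, String path)
          throws IOException {
        ObjectOutputStream out =
            new ObjectOutputStream(new FileOutputStream(path));
        try {
          // the List implementation itself (e.g. ArrayList) must be serializable
          out.writeObject(forest);
        } finally {
          out.close();
        }
      }

      /** Read a previously saved forest back, e.g. to classify new cases. */
      @SuppressWarnings("unchecked")
      public static <T> List<T> load(String path)
          throws IOException, ClassNotFoundException {
        ObjectInputStream in = new ObjectInputStream(new FileInputStream(path));
        try {
          return (List<T>) in.readObject();
        } finally {
          in.close();
        }
      }
    }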
2. Study the effects of large datasets on the reference implementation:
- This step should guide our choice of the proper parallel implementation
3. Parallel implementation, choose one of the following:
3a. Parallel implementation A
- When the dataset can be replicated to all computing nodes.
- Each mapper has access to the whole dataset; if the forest contains N trees and we have M mappers, each mapper runs the basic building algorithm N/M times. The mapper is also responsible for computing the out-of-bag (oob) error estimation (see the mapper sketch after this list).
- The reducer stores the trees in the forest file and merges the oob error estimations.
3b. Parallel implementation B
- When the dataset is so big that it can no longer fit on every computing node, it must be distributed over the cluster.
- Each mapper has access to only a subset of the dataset; thus all the mappers collaborate to build each tree of the forest.
- In this case, the basic algorithm must be rewritten to fit in the map-reduce paradigm.
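To make the division of work in implementation A concrete, here is a rough Hadoop sketch, assuming each map input record carries the index of one tree the mapper must grow and that the replicated dataset is loaded once per mapper; InMemBuildMapper and its helper methods are hypothetical names, not existing Mahout code.

    import java.io.IOException;
    import java.util.Random;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    /**
     * Hypothetical mapper for implementation A. Each input record carries
     * the index of one tree to grow, so M mappers end up growing the N
     * forest trees N/M at a time. The full training set is replicated to
     * every node and loaded once in setup().
     */
    public class InMemBuildMapper
        extends Mapper<IntWritable, Text, IntWritable, Text> {

      private double[][] data; // the whole (replicated) training set
      private Random rng;

      @Override
      protected void setup(Context context) {
        // load the replicated dataset, e.g. from the DistributedCache
        data = loadDataset(context.getConfiguration());
        // seed per task so the mappers draw different bootstraps
        rng = new Random(context.getTaskAttemptID().getTaskID().getId());
      }

      @Override
      protected void map(IntWritable treeId, Text ignored, Context context)
          throws IOException, InterruptedException {
        // grow one tree from a bootstrap of the full dataset; the instances
        // left out of the bootstrap (the "out-of-bag" set) would be used to
        // compute the oob error estimation that the reducer merges
        String encodedTree = buildTree(data, rng);
        context.write(treeId, new Text(encodedTree));
      }

      private double[][] loadDataset(Configuration conf) {
        return new double[0][]; // placeholder: read the replicated file here
      }

      private String buildTree(double[][] data, Random rng) {
        return ""; // placeholder for the core building algorithm
      }
    }

The reducer side would be symmetric: write each received tree into the forest file and merge the partial oob estimations.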
4. Run the Random Forest with a real dataset on EC2:
- This step is important because running the RF on a local dual-core machine is different from running it on a real cluster with a real dataset.
- This could make a good demo for Mahout.
Amazon has made some interesting datasets available to play with [Public Datasets]. The US Census dataset comes in various sizes, ranging from 2 GB to 200 GB, and should make a very good example.
At this stage it may be useful to implement MAHOUT-71 (Dataset to Matrix Reader).
5. If there is still time, implement one or two other important features of RFs, such as variable importance and proximity estimation.
Additional Information: I am a PhD student at the University Mentouri of Constantine. My primary research goal is a framework to help build Intelligent Adaptive Systems. For my Master's, I worked on Artificial Immune Systems, applying them to handwritten digit recognition [Pattern Recognition] and multiple sequence alignment (bioinformatics) [Bio Informatics]. Last year I participated in GSoC as an Apache student: I integrated an evolutionary computing framework into Mahout (MAHOUT-56).
[Bio Informatics] A. Layeb, A. Deneche, "Multiple Sequence Alignment by Immune Artificial System", ACS/IEEE International Conference on Computer Systems and Applications AICCSA’07, Jordan 2007.
[Pattern Recognition] S. Meshoul, A. Deneche, M. Batouche, "Combining an Artificial Immune System with a Clustering Method for Effective Pattern Recognition", International Conference on Machine Intelligence ICMI’05, pp. 782-787, Tunis 2005.
[Public Datasets] http://aws.amazon.com/publicdatasets/
[Random Forests] http://cwiki.apache.org/MAHOUT/random-forests.html