Title/Summary:

Implement one machine learning algorithm (including demo and docs) using Hadoop for the Mahout machine learning project.

Student:

Philip Ramsey

Student e-mail:

goal.oriented.design AT gmail DOT com

Student Major:

Computer Science; Computational Linguistics

Student Degree:

1st Bachelor's Degree

Student Graduation:

Spring 2012

Organization:

Apache Lucene Mahout-Machine-Learning

Abstract:

I am currently engaged in research that explores word sense disambiguation via co-reference resolution. Much of the work thus far has involved computing the syntactic similarity of words using weighted distance measures between the probabilities of their occurrence in a given bi-gram. I have been using Hadoop to implement these similarity measures, with the goal of generating sets of syntactically similar words from which phrasal and part-of-speech categories can be abstracted. This research has primarily been based on the results of (Lee, 2001); the next step is to abstract over these similar sets for the purpose of grammar induction, as explored by (De Pauw, 2004). My interest in Mahout as a platform for this abstraction began earlier this year, although I have not yet begun working with it. Building on the work of (De Pauw, 2004) and (Lankhorst, 1994), I propose to implement an evolutionary algorithm framework specifically geared towards inferring grammars over a large dataset, using the Watchmaker implementations currently available in Mahout. The proposal is intended to serve as a use-case test, providing patches and fixes as needed for the near-release-ready GA package currently available in Mahout. The benefit of this test case is that, because NLP requires a robust representation, it allows variable optimization strategies to be explored on a more precise example, potentially leading to fixes for previously unobserved bugs and to a more diverse implementation library.
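To make the similarity step concrete, here is a minimal single-machine sketch of comparing two words by the distributions of the words that follow them in bi-grams. The Jensen-Shannon divergence is chosen only for illustration and is not necessarily the weighted measure used in the research; the function names and the toy corpus are hypothetical, and the Hadoop version would compute the same counts in a distributed fashion.

```python
from collections import Counter, defaultdict
from math import log2


def context_distributions(tokens):
    """Map each word to P(next word | word), estimated from bi-gram counts."""
    counts = defaultdict(Counter)
    for left, right in zip(tokens, tokens[1:]):
        counts[left][right] += 1
    return {
        word: {nxt: c / sum(following.values()) for nxt, c in following.items()}
        for word, following in counts.items()
    }


def jensen_shannon(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint distributions."""
    def kl_to_mean(a, b):
        # KL(a || m), where m is the average of a and b; m > 0 wherever a > 0.
        return sum(pa * log2(pa / (0.5 * pa + 0.5 * b.get(w, 0.0))) for w, pa in a.items())
    return 0.5 * kl_to_mean(p, q) + 0.5 * kl_to_mean(q, p)


# Toy corpus (hypothetical): "cat" and "dog" are followed by the same words.
tokens = "the cat sat . the dog sat . a cat ran . a dog ran .".split()
dists = context_distributions(tokens)
cat_dog = jensen_shannon(dists["cat"], dists["dog"])  # same following words
cat_the = jensen_shannon(dists["cat"], dists["the"])  # no following words in common
```

In this toy corpus "cat" and "dog" come out as maximally similar and "cat" and "the" as maximally dissimilar, which is the kind of signal from which sets of syntactically similar words could be grouped.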

Detailed Description:

Using Hadoop, we have been able to generate sets of similar words, and have begun work on linking these sets via the occurrence of a bi-gram, given that a subset of similar words occupies a specific location within it. In this scenario, the data will be split into training and test parts so that a fitness measure can be computed based on the degree to which the results of a given generation mirror the relative frequencies observable in the data. The goal will therefore be to generate, from the training data, word co-occurrences that do not occur in the training data but that do have a nonzero probability in the testing data. Taken as sets of similar words, the generated co-occurrences will work as an inference engine for defining the rules of grammar over the dataset.

Prior to May 23rd, I intend to familiarize myself with the currently available GA packages in Mahout, using them as-is to develop an implementation for this specific test case. As the coding quarter begins, I intend to develop new classes that provide for a more robust utilization of Watchmaker within the Mahout framework.

As is most likely clear, the details of this outline are very open to change and revision. I am not currently familiar with the needs of the community with regard to the GA package. Thus, I intend to work specifically where I may be most useful to the project with respect to the Watchmaker/GA implementation.
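One possible reading of the fitness measure described above can be sketched as follows. It assumes a candidate is a pair of word sets (left position, right position) and simplifies the fitness to the fraction of generated bi-grams that are unseen in training but attested in the test data; weighting by relative frequency is omitted. The function names and data are hypothetical, and nothing here uses the Mahout or Watchmaker APIs.

```python
from collections import Counter
from itertools import product


def bigram_counts(tokens):
    """Count adjacent word pairs in a token sequence."""
    return Counter(zip(tokens, tokens[1:]))


def fitness(left_set, right_set, train_counts, test_counts):
    """Fraction of novel generated bi-grams (unseen in training) that occur in the test data."""
    generated = set(product(left_set, right_set))
    novel = {bg for bg in generated if bg not in train_counts}
    if not novel:
        return 0.0  # the candidate only restates the training data
    confirmed = sum(1 for bg in novel if test_counts[bg] > 0)
    return confirmed / len(novel)


# Hypothetical toy split of the data.
train = bigram_counts("the cat sat . a dog ran .".split())
test = bigram_counts("the dog sat . a cat ran .".split())

# Determiners before nouns: predicts "the dog" and "a cat", both attested in test.
good = fitness({"the", "a"}, {"cat", "dog"}, train, test)
# Nouns before determiners: predicts only bi-grams that never occur.
bad = fitness({"cat", "dog"}, {"the", "a"}, train, test)
```

A genetic algorithm would evolve the two word sets and use a score of this kind to rank the candidates in each generation.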

Draft Timeline:

Weeks 1-3: successfully implement the GA test case using the current Mahout tools as-is;

Weeks 4-6: generate tests using the various evolution engine classes available in Watchmaker, finding an optimal approach using their tools for multi-threading, splitting, etc.;

Weeks 7-8: based on the results of the previous tests, rework or extend the org.apache.mahout.ga.watchmaker package;

Weeks 9-10: debug and make modifications as needed to successfully complete the GSoC commitments.

Additional Information:

My experience in this field is not vast. However, I am very comfortable with Hadoop, and working with Mahout is the direction that I am moving in. The research that I have been doing, and will continue to do, relies on my ability to optimally integrate machine learning algorithms with a distributed framework. To have the opportunity to work hands-on with the Mahout platform would be absolutely amazing. I'm 21 years old, living in Olympia, Washington, attending The Evergreen State College.