Google Summer of Code 2009 – Project Proposal
apache-SoC AT tumblecow DOT net
- CEST or UTC+2
- Eastern Time + 6
Implementing A feed-forward neural network back-propagation training algorithm on Map/Reduce as part of the Mahout project.
Goal of this project is to implement a trainer with back propagation for feed-forward neural networks on the Apache 'Hadoop' Map/Reduce library as a part of the Mahout / Lucene project. Introductionary work was been done in the article Map-Reduce for Machine Learning on Multicore' by CT Chu et. al. (2007), in which 10 common machine-learning algorithms and their implementation on Map / Reduce have been detailed.
Before starting work on the specific algorithms, it will be important to have a complete framework for the (class of) algorithms. I'll see what's already there in the mahout project, can all algorithms that I want to implement be fitted in the paradigm that the framework poses? I'll confer with my mentor about optimal ways to implement the specific algorithms in the framework.
Neural Networks with back propagation
Neural networks with back propagation are powerful, relatively simple, and easy to parallelize. I'll use the description of feed-forward neural networks and the back-propagation training algorithm as described in Russel & Norvig - 'Artificial Intelligence - A modern Approach' (page 736-748), and will use the implementation in the popular weka machine learning as a reference implementation.
For training, the mentioned article recommend running a subsets of the data through the backpropagation algorithm on each of the cores, and using a gradient descent to combine the results.
Implementation of the classification-phase on Map/Reduce is neigh trivial: Each node runs a copy of the completely trained neural net, and classifies a subset of the data. Classification results can be returned as a list, or aggregated into performance overviews.
If there's enough time left, I could focus on more complex form of neural nets. Could I implement a training algorithm for a recurrent neural net on Map/Reduce? Could we implement a form of Hebbian Learning for Mahout? Neural networks are a fascinating subject with many more possibilities to explore.
Some other wild ideas that would be cool to implement, but for which probably time is too short, this isn't the right project, or other disadvantages.
Lucene on GPU
Many modern pc's contain a fast graphics card or gpu, which can offer a order of magnitude (or more!) performance for parallelizable problems. With more and more projects levaraging the computing-power locked inside the GPU and OpenCL() around the corner, I think it's safe to say that the dawn of GPU computing is upon us.
Since Lucene is already focusing on parallel computing, it might be very interesting to see whether some of these algorithms could be converted to run on a GPU. There are however a lot of problems with this idea. Code would need to be written specifically to run on a GPU, and hadoop would probably not be useful. Also, since most GPGPU libraries are written in C, it's questionable at best whether this is something that you'd want in a java library.
I'll use the MoSCoW method to prioritize my deliverables. A deliverable is complete when it's coded, tested and documented.
- Neural-network with back-propagation training implemented.
* Could have
- Recurrent Neural-networks or Hebbian learning implemented in Mahout / Lucene.
* Wanna have
- Have an algorithm in mahout running on a gpu
Coding officially begins at May the 23th, and ends around August the 10th, which allows for 11-12 weeks of dedicated work. To have room for some little other activities (see 'other activities' section), and allow for unforeseen circumstances and delays, I will plan for ten weeks.
- before: Get to know the project, the project-members, and get familiar with the code-base.
- Reading up on Neural networks. (~ 1 week)
- Making a simple 'first version' of the algorithm in serial code if necessary. (~ 2 weeks)
- implement the algorithm in the lucene framework. (~ 3 weeks)
- test the implementation (~ 1 week)
- thoroughly document the implementation. (~ 1 week)
- 2 weeks left for trying my hand at other types of neural network.
Let me introduce myself. I'm Tijs Zwinkels. I'm 25 years old, and I'm studying artificial intelligence at the University of Groningen in the Netherlands. I've been lucky enough to have been invited at Carnegie Mellon University last year, where I've worked on person tracking for the 'snackbot project' for my masters graduation project. Right now I'm back in the Netherlands, and in the final weeks of my masters project, which I'll finish with ample time before the start of the google Summer of Code.
A little over ten years ago, a local computer magazine put a cd-rom with 'Slackware Linux' on my doormat. I had no choice but to try to install it, of course rendering the house pc unusable for days until I had the bootloader figured out. Since then I've been fascinated by open source. It seems such a logical idea to publish your work for the gain of everyone, allowing in to grow and evolve, instead of releasing for the gain of a single company and a few people that are willing to pay for your work without anyone ever being able to really adopt your work to their own needs. It's wonderful to see this concept take root all around, and I'm eager to participate in this idea.
Then there's the field of artificial intelligence and machine learning. Machine learning is a powerful mechanism that could potentially be applied in many places for many tasks. Amazing things are being built in labs around the world, but most of it never makes it out of the lab. Scientific code is often not published, and software makers (commercial or otherwise) just don't seem that interested yet in employing these techniques for consumer software. We're not there yet, but having a stable, well-documented, programmer-friendly, maintained library that is ready for the computing platforms of the future such as Mahout can be, is definitely a step in the right direction.
Moreover, this project is the single Google summer of code project that has the highest probability of being useful to myself; I often need 'a quick machine learner' for little projects. I'm currently often using the OpenCV library, but the machine learning part is just not very good currently. Mahout is or has the potential to become an ideal machine learning library for many of these projects.
During my study artificial intelligence I've come across most machine learning algorithms mentioned on the wiki page, many of which I've used in some form. I've personal experience with all the algorithms that I propose to implement. I have a bachelor degree in computer science, for which I went to the University of 's Hertogenbosch for four years. This particular university was very java-oriented, which has given me experience and taught me to write 'proper' object-oriented java. Since then I've used java for my bachelor project at the University of Groningen and some other small projects.
I've worked in c++ for both my internship and my bachelor project in college, a part-time job for three years for a human-resources company, and for my current masters project. I would say that I'm most proficient in the c++ language right now, but I'm able to employ my knowledge in different programming languages as well. I have some experience with toolsets commonly used in open-source settings. I've used svn, mercurial, and bugzilla for small projects with around five persons using the systems concurrently at max, but I have no experience with these tools for bigger projects.
I'm mainly new to the open source community. I might have submitted a simple bugfix or two to a few projects in the past, but I have no prior experience with functioning in an open-source software team. I wholeheartedly believe in the principle of open source software, and I would love the opportunity that the google Summer of Code gives me to become introduced. I'm familiar with the workings of Map/Reduce, but have no personal experience with it. It is however a very interesting and widely deployed technique for simultaneous processing on multiple cores and/or machines that becomes more and more relevant, and I would like to gain broader knowledge, understanding, and experience with the Hadoop platform.
I'm not entirely comfortable with putting up my curriculum vitae for the whole world to see, but I'm more then willing to send it if you want a more detailed overview of my experience, knowledge, and qualifications. Just send me a mail, and I'll send my CV right away.
Other activities for this summer
I have a few minor other activities I might need to do for this summer, which can take up to a week of my time overall. This won't be a problem for my activities for this project, but I think I should be clear and frank about this so we won't run into misunderstanding later on.
Somewhere during the summer, I have to give a mandatory presentation about my masters project, for which I'll need a few days to prepare. I might want time to submit papers about my masters project to some conferences which might take me a few days overall, but I expect to do most of this work before the google summer of code starts and after it ends. Depending on whether there is time, I might decide to go to Rome for a few days. Overall, a commitment of ~30-40 hrs a week on average (possibly more) will not be a problem at all.
I consider this document to be (very) dynamic in nature. Nothing in this proposal is 'set in stone', and any additional ideas or any feedback is highly appreciated. Is this proposal detailed enough? Still brief enough to be readable? Do mentors and Apache Lucene programmers agree with my choose of machine learning algorithms, or would other algorithms be preferred? Is this enough detail about myself? Most importantly: Is this a good chunk of work for a Google Summer of Code project? Maybe I should include more? Or maybe I'm too ambitious? Any comments, feedback, and answers to my questions can be sent to my mail or added to this wiki, and I'll be more than happy to react and adapt.