On the Friday after Hadoop Summit 2009, a group of Hadoop committers and developers met at Cloudera's office in Burlingame to talk about Hadoop development challenges. Here are some notes and pictures (see the attachments) from the discussions we had.

Attendees

Eric Baldeschwieler, Dhruba Borthakur, Doug Cutting, Nigel Daley, Alan Gates, Jeff Hammerbacher, Russell Jurney, Jim Kellerman, Aaron Kimball, Mahadev Konar, Todd Lipcon, Alex Loddengaard, Matt Massie, Arun Murthy, Owen O'Malley, Johan Oskarsson, Dmitriy Ryaboy, Joydeep Sen Sarma, Dan Templeton, Ashish Thusoo, Craig Weisenfluh, Tom White, Matei Zaharia, Philip Zeiliger

Five things

Things we don't like about Hadoop

Things we like about Hadoop

Project challenges

Nigel brings up:
Patch Development Process:

... for testing we need to know that the tests are sufficient – this means that docs are required

Test plan template: What are you testing? What are the risks? How do you test this?

Phil: What are examples of *good* JIRAs?

Doug: Should we add a test plan field to JIRA?

Nigel: People need to take ownership of features and consider scalability, etc.

The patch queue is too long...

Doug: committers need to be better here

Dhruba: doing reviews right is hard

Phil: What do people do here?

Doug: We need a "wall of shame" to incentivise people to review other people's code rather than just write new patches of their own

Pushing Warnings to Zero:

Typo Fixes, etc:

Most things require both

Exceptions: www site, rolling a release

How do you continuously refactor?

Phil: Can we agree that it's ok to fix comments/typos/etc in the same file you're working in?

Doug: Yes, but people need to say that this is what they're doing in the JIRA comments

Tom: This should be clarified on the HowToCommit wiki page

Checkstyle:

Matt: Could we think about an auto-reformatting patch service? (lukewarm reaction here)

Testing:

Build System:

Wish Lists

MapReduce

MapReduce and HDFS

HDFS

Build And Test

Avro

Common

Pig

Subproject discussions

MapReduce

HDFS

Pig/Hive

Approaches to column storage were discussed. The Hive and Pig implementations are fundamentally different. Hive uses a PAX-like block-columnar approach, in which a block is organized so that all values of the same column are stored next to each other and compressed, with a header indicating where the different columns are stored inside the block. Pig can use this file format, RCStore, by writing a custom slicer/loader. The Pig representatives indicated that they were interested in experimenting with RCStore as an option for their workloads.
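
To make the PAX-like layout concrete, here is a minimal sketch in Java (illustrative only – not Hive's actual RCStore code; all class and method names are made up). The point is that values are grouped by column within a block, and a header records each column's offset so a reader can seek straight to the one column it needs:

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;

    public class ColumnarBlockWriter {
        // rows[i][j] is the value of column j in row i.
        public static byte[] writeBlock(String[][] rows, int numColumns)
                throws IOException {
            ByteArrayOutputStream body = new ByteArrayOutputStream();
            DataOutputStream bodyOut = new DataOutputStream(body);
            int[] columnOffsets = new int[numColumns];

            // Write all values of column j next to each other (each column
            // run could be compressed independently at this point).
            for (int j = 0; j < numColumns; j++) {
                columnOffsets[j] = body.size();
                for (String[] row : rows) {
                    bodyOut.writeUTF(row[j]);
                }
            }
            bodyOut.flush();

            // Header: row count, column count, and each column's offset,
            // so a reader can jump to the one column a query needs.
            ByteArrayOutputStream block = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(block);
            out.writeInt(rows.length);
            out.writeInt(numColumns);
            for (int offset : columnOffsets) {
                out.writeInt(offset);
            }
            block.write(body.toByteArray());
            out.flush();
            return block.toByteArray();
        }
    }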

The Pig model (Zebra) splits columns into separate files and indexes into them; a storage layer coordinates the separate files. That layer is (naturally) able to do projection pushdown, and the team is looking into pushing down filters as well. Pig integration is in progress. There are no technical impediments to implementing a Hive SerDe for Zebra.
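
By contrast, a minimal sketch of the file-per-column model (again illustrative – this is not Zebra's actual API) shows why projection pushdown falls out naturally: the reader only ever opens the files for the columns a query asks for, so unprojected columns cost no I/O at all:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    public class ColumnFileReader {
        // Assume each column lives in its own file, e.g. <tableDir>/col_<name>,
        // with one value per line.
        public static List<String[]> readProjection(String tableDir,
                String[] columns) throws IOException {
            BufferedReader[] readers = new BufferedReader[columns.length];
            for (int j = 0; j < columns.length; j++) {
                readers[j] = new BufferedReader(
                    new FileReader(tableDir + "/col_" + columns[j]));
            }
            List<String[]> rows = new ArrayList<String[]>();
            String first;
            // The storage layer coordinates the files: row i is line i in
            // every column file, so reading in lockstep reassembles rows.
            while ((first = readers[0].readLine()) != null) {
                String[] row = new String[columns.length];
                row[0] = first;
                for (int j = 1; j < columns.length; j++) {
                    row[j] = readers[j].readLine();
                }
                rows.add(row);
            }
            for (BufferedReader r : readers) {
                r.close();
            }
            return rows;
        }
    }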

Sharing code and ideas would be easier if there were a common place to look for this sort of thing. For example, RCStore could be pulled out of Hive – but where would it go? Commons? Avro?

Neither project has plans for storage-layer vector operations (à la Vertica).

Metadata discussion pushed off to next week.

Join support is roughly equivalent in both systems (map-side join, aka FRJoin, and regular hash join). Neither supports or plans to support bloom joins; there is no real demand, though whether that is because users aren't familiar with them or because the workloads don't need them is unknown. Either way, "we accept patches (wink)"
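
For context, the bloom join idea is to build a Bloom filter over the small relation's join keys and use it map-side to drop non-matching rows from the big relation before the shuffle. A self-contained sketch of the filter itself (illustrative, not either project's code):

    import java.util.BitSet;

    public class BloomJoinFilter {
        private final BitSet bits;
        private final int size;

        public BloomJoinFilter(int size) {
            this.size = size;
            this.bits = new BitSet(size);
        }

        // Two cheap hash functions derived from the key's hashCode;
        // masking keeps the result non-negative before the modulo.
        private int h1(String key) {
            return (key.hashCode() & 0x7fffffff) % size;
        }
        private int h2(String key) {
            return ((key.hashCode() * 31 + 17) & 0x7fffffff) % size;
        }

        // Build phase: add every join key from the small relation.
        public void add(String key) {
            bits.set(h1(key));
            bits.set(h2(key));
        }

        // Probe phase, run map-side over the big relation: false means the
        // key is definitely absent, so the row can be dropped before the
        // shuffle; true may be a false positive, which the real join still
        // verifies downstream.
        public boolean mightContain(String key) {
            return bits.get(h1(key)) && bits.get(h2(key));
        }
    }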

Blue-sky stuff:
Oozie integration – a way to define a Hive node? Some other custom processing node? Having Hive and Pig clients automatically push queries off as Oozie jobs so that they are restartable, etc.?

Pig is looking to move off JavaCC and is currently leaning towards CUP+JFlex. Hive uses ANTLR, which is now much less memory-intensive since they started using lookaheads; this might put ANTLR back on the table for Pig.

Avro/Common

Configuration

Registry class for configuration that allows you to "register" configuration (see the sketch below)

Hadoop configs should be read from a distributed filesystem (HADOOP-5670)
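
A minimal sketch of what such a registry might look like (hypothetical – the class and method names are not an agreed design): components declare the keys they own, with a default and a description, so misspelled keys fail fast and documentation could be generated from one place:

    import java.util.HashMap;
    import java.util.Map;

    public class ConfigRegistry {
        private static class Entry {
            final String defaultValue;
            final String description;
            Entry(String defaultValue, String description) {
                this.defaultValue = defaultValue;
                this.description = description;
            }
        }

        private final Map<String, Entry> entries =
            new HashMap<String, Entry>();

        // Components call this at load time to declare the keys they own.
        public void register(String key, String defaultValue,
                String description) {
            entries.put(key, new Entry(defaultValue, description));
        }

        // Lookups of unregistered keys fail fast instead of silently
        // falling through to a typo'd default.
        public String get(Map<String, String> conf, String key) {
            Entry e = entries.get(key);
            if (e == null) {
                throw new IllegalArgumentException(
                    "Unregistered config key: " + key);
            }
            String value = conf.get(key);
            return value != null ? value : e.defaultValue;
        }
    }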

Avro

Build And Test

Top-level components Nigel proposed:

Backwards Compatibility Testing:

System Test Framework:

Patch Testing:

Test Plan Template:

Mock Objects: