Apache DRAT Proposal
Apache Distributed Release Audit Tool (DRAT) is a distributed, parallelized (Map Reduce) wrapper around Apache™ RAT to allow it to complete on large code repositories of multiple file types where Apache™ RAT hangs forever.
Apache DRAT is a distributed, parallelized (Map Reduce) wrapper around Apache™ RAT (Release Audit Tool). RAT is used to check for proper licensing in software projects. However, RAT takes a prohibitively long time to analyze large repositories of code, since it can only run on one JVM. Furthermore, RAT isn't customizable by file type or file size and provides no incremental output. This wrapper dramatically speeds up the process by leveraging Apache™ OODT to parallelize the following components and run them as a workflow:
- Apache™ Solr based exploration of a CM repository (e.g., Git, SVN, etc.) and classification of that repository based on MIME type using Apache™ Tika.
- A MIME partitioner that uses Apache™ Tika to automatically deduce and classify by file type and then partition Apache™ RAT jobs based on sets of 100 files per type (configurable) -- the M/R "partitioner"
- A throttle wrapper that runs Apache™ RAT against each MIME-targeted partition -- the M/R "mapper"
- A reducer to "combine" the produced RAT logs together into a global RAT report that can be used for stats generation. -- the M/R "reducer"
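The partition / map / reduce flow listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not DRAT's actual implementation: the standard-library `mimetypes` module stands in for Apache™ Tika, the `has_license` callback is a hypothetical placeholder for an Apache™ RAT run, and the function names `partition_by_mime`, `audit_chunk`, and `combine_reports` are invented for this example.

```python
import mimetypes
from collections import Counter, defaultdict

CHUNK_SIZE = 100  # files per job; the proposal describes this as configurable


def partition_by_mime(paths, chunk_size=CHUNK_SIZE):
    """'Partitioner': group paths by guessed MIME type, then chunk each group."""
    groups = defaultdict(list)
    for path in paths:
        # Assumption: mimetypes is a stand-in for Apache Tika detection.
        mime, _ = mimetypes.guess_type(path)
        groups[mime or "application/octet-stream"].append(path)
    jobs = []
    for mime, members in sorted(groups.items()):
        for start in range(0, len(members), chunk_size):
            jobs.append((mime, members[start:start + chunk_size]))
    return jobs


def audit_chunk(job, has_license):
    """'Mapper': audit one chunk and return a partial report.

    has_license is a hypothetical callback standing in for a RAT run.
    """
    mime, chunk = job
    report = Counter()
    for path in chunk:
        status = "licensed" if has_license(path) else "unlicensed"
        report[(mime, status)] += 1
    return report


def combine_reports(reports):
    """'Reducer': merge the partial reports into one global tally."""
    total = Counter()
    for partial in reports:
        total.update(partial)
    return total
```

Each `(mime, chunk)` job is independent of the others, so the mapper calls could be handed to separate workers; the reducer only needs the partial tallies, which is what would make incremental output possible.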
Background and Rationale
The Release Audit Tool (RAT) was developed as part of Apache Creadur, an Apache Software Foundation (ASF) project, in response to demand from the ASF and its hundreds of projects for a release-auditing capability that could be integrated into projects. RAT's primary function is automated code auditing and open-source license analysis, focusing on headers. RAT is a natural language processing tool written in Java so that it runs on any platform, and it audits code written in many source languages (e.g., C, C++, Java, Python). RAT can also add license headers to source files that lack them.
Current developers for the project are all ASF members, experienced with ASF processes and procedures. We know how to grow an Apache community and to develop a meritocratic free and open source project.
Mention JPL folks; Tyler is at Google; Karanjeet, formerly of USC + JPL, is now at Apple.
Apache is, by far, the most natural home for taking the AsterixDB project forward. A large fraction of today's top Big Data technologies have their homes in Apache, including Hadoop, YARN, Pig, Hive, Spark, Flink, HBase, Cassandra and others. AsterixDB fills a significant gap -- the parallel data management system gap -- that exists in the Big Data open source world. It is well-aligned with a number of the Apache projects, e.g., it has strong support for accessing and indexing external data in HDFS, and it uses YARN as an answer to basic cluster resource management. AsterixDB also seeks to achieve an Apache-style development model; it is seeking a broader community of contributors and users in order to achieve its full potential and value to the Big Data community.
There are also a number of related Apache projects and dependencies that will be mentioned below in the Relationships with Other Apache products section.
Given the current level of intellectual investment in AsterixDB, the risk of the project being abandoned is very small. The UCI/UCR faculty team leads are highly incentivized to continue development since the database groups at UC Irvine and UC Riverside are both reliant on AsterixDB as a platform for long-term graduate research projects. UC San Diego is also beginning to contribute to the code base, and a collaboration involving public health applications is forming with UCLA. The work on AsterixDB is managed via a mix of mailing list discussions supplemented by weekly project status meetings which are summarized on the mailing list. Typical (local plus Skype-in) attendance to the weekly status meetings runs at about 20 active contributors.
Inexperience with Open Source
AsterixDB and Hyracks were completely developed in Open Source under the ALv2. The source code repositories, issue tracker, and mailing lists are available on Google Code and discussions and decisions happen on the mailing lists (which is necessary due to the geographic distribution of the current developers).
Also a few of the initial committers have contributed to Apache projects. Vinayak Borkar is a committer on the Apache Helix and Apache VXQuery projects. Till Westmann is the VP VXQuery at the ASF and an IPMC member. Preston Carman and Steven Jacobs are committers on the Apache VXQuery project.
Relationships with Other Apache Products
Apache VXQuery is based on the Hyracks data-parallel runtime, which is also included in the AsterixDB code base.
AsterixDB is closely related to Apache Hadoop. Included in AsterixDB is support for accessing external data in HDFS (and Hive formats), and resource management and system administration features are in the process of being migrated to YARN.
AsterixDB's AQL query facilities offer comparable query power to Apache's Pig and Hive systems for big data analytics. AsterixDB differs in storing and indexing data and thus being able to quickly answer small and medium queries without large HDFS data scans - thereby targeting a different class of use cases.
AsterixDB's data storage and indexing facilities are similar to those of HBase, but AsterixDB differs in being a much more complete and queryable BDMS (not just a key-value style store).
AsterixDB's target use cases are not in-memory processing or iterative algorithm support, making AsterixDB complementary to the Apache Spark platform. (Spark interoperability is on our longer-term to-do wishlist.)
As mentioned before the current community is already organizationally and geographically distributed - and we would like to increase the heterogeneity.
Reliance on Salaried Developers
Of the initial committers only 3 are full-time UCI staff. The other committers are a mix of students, alumni who continue to contribute to the effort, and individuals working with permission part-time (or in spare time) on this project.
An Excessive Fascination with the Apache Brand
We believe in the processes, systems, and framework Apache has put in place. Apache is also known to foster a great community around their projects and provide exposure. While brand is important, our fascination with it is not excessive. We believe that the ASF is the right home for AsterixDB and that having AsterixDB inside of the ASF will lead to a better long-term outcome for the Big Data community.
Documentation and publications related to AsterixDB can be found at http://asterixdb.ics.uci.edu/.
Current source resides on Google Code: https://code.google.com/p/asterixdb/ (query language and upper system layers) and https://code.google.com/p/hyracks/ (dataflow runtime system and storage management libraries).
AsterixDB depends on a number of Apache projects:
- ApacheDB JDO
- Jakarta ORO
and other open source projects (organized by license):
- Google Guava
- Google Guice
- Microsoft Azure SDK
- Datanucleus (JDO)
- CDDL 1.0
- Java Activation Framework
- Java Transactions
- Java Servlet API
- CDDL 1.1
- JAXB Reference Implementation
- JSON License
- EPL 1.0
- JDOM License
- Public Domain
As all dependencies are managed using Apache Maven, none of the external libraries need to be packaged in a source distribution.
Developer and user mailing lists
firstname.lastname@example.org (with moderated subscriptions)
A git repository
A JIRA issue tracker
The following is a list of the planned initial Apache committers (the active subset of the committers for the current repository on Google Code).
Abdullah Alamoudi (email@example.com)
Cameron Samak (firstname.lastname@example.org)
Chen Li (email@example.com)
Ian Maxon (firstname.lastname@example.org)
Inci Cetindil (email@example.com)
Ildar Absalyamov (firstname.lastname@example.org)
Jianfeng Jia (email@example.com)
Keren Ouaknine (firstname.lastname@example.org)
Markus Dreseler (email@example.com)
Mike Carey (firstname.lastname@example.org)
Murtadha Hubail (email@example.com)
Pouria Pirzadeh (firstname.lastname@example.org)
Preston Carman (email@example.com)
Raman Grover (firstname.lastname@example.org)
Sattam Alsubaiee (email@example.com)
Steven Jacobs (firstname.lastname@example.org)
Taewoo Kim (email@example.com)
Till Westmann (firstname.lastname@example.org)
Vassilis Tsotras (email@example.com)
Vinayak Borkar (firstname.lastname@example.org)
Yingyi Bu (email@example.com)
Young-Seok Kim (firstname.lastname@example.org)
Zach Heilbron (email@example.com)
- Mike Carey
- Chen Li
- Ian Maxon
- Inci Cetindil
- Yingyi Bu
- Raman Grover
- Pouria Pirzadeh
- Young-Seok Kim
- Cameron Samak
- Taewoo Kim
- Jianfeng Jia
- Murtadha Hubail
- Markus Dreseler
- Ildar Absalyamov
- Preston Carman
- Steven Jacobs
- Vassilis Tsotras
- Keren Ouaknine
- Till Westmann
- Vinayak Borkar
- Zach Heilbron
KACST Saudi Arabia
- Sattam Alsubaiee
- Abdullah Alamoudi
Carey, Li, and Maxon are full-time UCI (UC Irvine) staff, Tsotras is full-time UCR (UC Riverside) staff, with the remaining UCI and UCR affiliates being students. The non-UC committers are a mix of alumni who continue to contribute to the effort and individuals working with permission part-time (or in spare time) on this project.
Chris Mattmann (NASA/JPL)
- Henry Saputra
- Jochen Wiedmann
- Ted Dunning
- Ate Douma
The Apache Incubator