Differences between revisions 16 and 17
Revision 16 as of 2017-08-02 16:38:12
Size: 14153
Comment:
Revision 17 as of 2017-08-02 16:48:45
Size: 7715
Comment:
Line 21: Line 21:
All XDATA software is open source and is ingested into DARPA’s Open Catalog [6] that points to outputs of the program including its source code and metrics on the repository. Because of this, one of core products of XDATA is the internal Git repository. Since XDATA brought together open source software across multiple performers, having an understanding of the licenses that the source codes used, and their compatibilities and differences was extremely important and since there repository was so large, our strategy was to develop an automated process using Apache RAT. All XDATA software is open source and is ingested into [[https://opencatalog.darpa.mil/|DARPA’s Open Catalog]] that points to outputs of the program including its source code and metrics on the repository. Because of this, one of core products of XDATA is the internal Git repository. Since XDATA brought together open source software across multiple performers, having an understanding of the licenses that the source codes used, and their compatibilities and differences was extremely important and since there repository was so large, our strategy was to develop an automated process using Apache RAT.
Line 23: Line 23:
The lessons learned navigating these issues have motivated to create “DRAT”, which stands for "Distributed Release Audit Tool". DRAT directly overcomes RAT's limitations and brings code auditing and open source license analysis into the realm of Big Data using scalable open source Apache technologies. DRAT is already being applied and transitioned into the government agencies. DRAT currently exists at Github under the ALv2 The lessons learned navigating these issues have motivated to create “DRAT”, which stands for "Distributed Release Audit Tool". DRAT directly overcomes RAT's limitations and brings code auditing and open source license analysis into the realm of Big Data using scalable open source Apache technologies. DRAT is already being applied and transitioned into the government agencies. DRAT currently exists at Github under the ALv2 under Chris Mattmann's GitHub account. Chris Mattmann was the PI of DARPA XDATA at JPL.
Line 44: Line 44:
former USC students
Line 48: Line 49:
Apache is, by far, the most natural home for taking the AsterixDB
project forward. A large fraction of today's top Big Data
technologies have their homes in Apache, including Hadoop, YARN, Pig,
Hive, Spark, Flink, HBase, Cassandra and others. AsterixDB fills a
significant gap -- the parallel data management system gap -- that
exists in the Big Data open source world. It is well-aligned with a
number of the Apache projects, e.g., it has strong support for
accessing and indexing external data in HDFS, and it uses YARN as an
answer to basic cluster resource management. AsterixDB also seeks to
achieve an Apache-style development model; it is seeking a broader
community of contributors and users in order to achieve its full
potential and value to the Big Data community.

There are also a number of related Apache projects and dependencies
that will be mentioned below in the Relationships with Other Apache
products section.
TBD
Line 70: Line 56:
Given the current level of intellectual investment in AsterixDB, the
risk of the project being abandoned is very small. The UCI/UCR
faculty team leads are highly incentivized to continue development
since the database groups at UC Irvine and UC Riverside are both
reliant on AsterixDB as a platform for long-term graduate research
projects. UC San Diego is also beginning to contribute to the code
base, and a collaboration involving public health applications is
forming with UCLA. The work on AsterixDB is managed via a mix of
mailing list discussions supplemented by weekly project status
meetings which are summarized on the mailing list. Typical (local
plus Skype-in) attendance to the weekly status meetings runs at about
20 active contributors.
JPL making a commitment to run DRAT on our internal code repos
TBD
Line 85: Line 61:
AsterixDB and Hyracks were completely developed in Open Source under
the ALv2. The source code repositories, issue tracker, and mailing
lists are available on Google Code and discussions and decisions
happen on the mailing lists (which is necessary due to the geographic
distribution of the current developers).

Also a few of the initial committers have contributed to Apache
projects. Vinayak Borkar is a committer on the Apache Helix and
Apache VXQuery projects. Till Westmann is the VP VXQuery at the ASF
and an IPMC member. Preston Carman and Steven Jacobs are committers
on the Apache VXQuery project.
TBD
Line 100: Line 66:
Apache VXQuery is based on the Hyracks data-parallel runtime, which
is also included in the AsterixDB code base.

AsterixDB is closely related to Apache Hadoop. Included in AsterixDB
is support for accessing external data in HDFS (and Hive formats),
and resource management and system administration features are in the
process of being migrated to YARN.

AsterixDB's AQL query facilities offer comparable query power to
Apache's Pig and Hive systems for big data analytics. AsterixDB
differs in storing and indexing data and thus being able to quickly
answer small and medium queries without large HDFS data scans -
thereby targeting a different class of use cases.

AsterixDB's data storage and indexing facilities are similar to those
of HBase, but AsterixDB differs in being a much more complete and
queryable BDMS (not just a key-value style store).

AsterixDB's target use cases are not in-memory processing or
iterative algorithm support, making AsterixDB complementary to the
Apache Spark platform. (Spark interoperability is on our longer-term
to-do wishlist.)
RAT, OODT, Tika, Lucene, Solr, Wicket
Line 126: Line 70:
As mentioned before the current community is already organizationally
and geographically distributed - and we would like to increase the
heterogeneity.
TBD
Line 133: Line 75:
Of the initial committers only 3 are full-time UCI staff. The other
committers are a mix of students, alumni who continue to contribute
to the effort, and individuals working with permission part-time (or
in spare time) on this project.
TBD
Line 141: Line 80:
We believe in the processes, systems, and framework Apache has put in
place. Apache is also known to foster a great community around their
projects and provide exposure. While brand is important, our
fascination with it is not excessive. We believe that the ASF is the
right home for AsterixDB and that having AsterixDB inside of the ASF
will lead to a better long-term outcome for the Big Data community.
TBD
Line 151: Line 85:
Documentation and publications related to AsterixDB can be found at
http://asterixdb.ics.uci.edu/.
Documentation including code, a wiki, and publications surrounding DRAT can be found at http://github.com/chrismattmann/drat/.
Line 157: Line 90:
Current source resides in Google code:
https://code.google.com/p/asterixdb/ (query language and upper system
layers) and https://code.google.com/p/hyracks/ (dataflow runtime
system and storage management libraries).
Documentation including code, a wiki, and publications surrounding DRAT can be found at http://github.com/chrismattmann/drat/
Line 167: Line 97:
 * Ant
 * Avro
 * ApacheDB JDO
 * Commons
 * Derby
 * Hadoop
 * Hive
 * HTTPComponents
 * Jakarta ORO
 * Maven
 * Tomcat
 * Thrift
 * Velocity
 * OODT
 * Lucene
 * RAT
 * Solr
 * Tika
Line 181: Line 103:
 * Xerces
Line 185: Line 106:
 * ALv2:
  * Jackson
  * Google Guava
  * Google Guice
  * JSON-simple
  * BoneCP
  * Microsoft Azure SDK
  * Netty
  * Rome
  * !JetS3t
  * Groovy
  * Jettison
  * Plexus
  * Datanucleus (JDO)
  * Jetty
  * Twitter4J
  * Snappy-java

 * BSD:
  * Antlr
  * !ObjectWeb ASM
  * Protobuf
  * JSCH
  * JavaCC
  * Paranamer
  * JLine
  * Stax
  * !StringTemplate
  * xmlEnc

 * MIT
  * !AppAssembler
  * SimpleLog4J

 * CDDL 1.0
  * Java Activation Framework
  * Java Transactions
  * Java Servlet API
  * Grizzly
  * gmbal
  * Glassfish

 * CDDL 1.1
  * Jersey
  * JAXB Reference Implementation

 * JSON License
  * JSON

 * EPL 1.0
  * JUnit

 * JDOM License
  * JDOM

 * Public Domain
  * xz
  * AOPAlliance
Line 252: Line 115:
 * private@asterixdb.incubator.apache.org (with moderated subscriptions)
 * commits@asterixdb.incubator.apache.org
 * dev@asterixdb.incubator.apache.org
 * users@asterixdb.incubator.apache.org
 * private@drat.incubator.apache.org (with moderated subscriptions)
 * commits@drat.incubator.apache.org
 * dev@drat.incubator.apache.org
Line 258: Line 120:
A git repository A gitbox repository at:
Line 260: Line 122:
https://git-wip-us.apache.org/repos/asf/incubator-asterixdb.git https://github.com/apache/drat.git
Line 262: Line 124:
Issue tracking
Line 263: Line 126:
A JIRA issue tracker

https://issues.apache.org/jira/browse/ASTERIXDB
We will use the GitHub issue tracker.
Line 274: Line 135:
 * Abdullah Alamoudi (bamousaa@gmail.com)
 * Cameron Samak (eufery@gmail.com)
 * Chen Li (chenli@gmail.com)
 * Ian Maxon (imaxon@uci.edu)
 * Inci Cetindil (icetindil@gmail.com)
 * Ildar Absalyamov (ildar.absalyamov@gmail.com)
 * Jianfeng Jia (jianfeng.jia@gmail.com)
 * Keren Ouaknine (kereno@gmail.com)
 * Markus Dreseler (apache@dreseler.de)
 * Mike Carey (dtabass@apache.org)
 * Murtadha Hubail (hubailmor@gmail.com)
 * Pouria Pirzadeh (pouria.pirzadeh@gmail.com)
 * Preston Carman (prestonc@apache.org)
 * Raman Grover (ramangrover29@gmail.com)
 * Sattam Alsubaiee (salsubaiee@gmail.com)
 * Steven Jacobs (sjaco002@apache.org)
 * Taewoo Kim (wangsaeu@gmail.com)
 * Till Westmann (tillw@apache.org)
 * Vassilis Tsotras (tsotras@cs.ucr.edu)
 * Vinayak Borkar (vinayakb@apache.org)
 * Yingyi Bu (buyingyi@gmail.com)
 * Young-Seok Kim (kisskys@gmail.com)
 * Zach Heilbron (zheilbron@gmail.com)
 * Chris Mattman
 * Tyler Palsulich
 * Paul Ramirez
 * Lewis John McGibbney
 * Karanjeet Singh
 * Steven Francus
 * Michael Joyce
Line 301: Line 146:
UC Irvine
 * Mike Carey
 * Chen Li
 * Ian Maxon
 * Inci Cetindil
 * Yingyi Bu
 * Raman Grover
 * Pouria Pirzadeh
 * Young-Seok Kim
 * Cameron Samak
 * Taewoo Kim
 * Jianfeng Jia
 * Murtadha Hubail
 * Markus Dreseler
NASA JPL
 * Chris Mattmann
 * Paul Ramirez
 * Lewis John McGibbney
 * Michael Joyce
Line 316: Line 152:
UC Riverside
 * Ildar Absalyamov
 * Preston Carman
 * Steven Jacobs
 * Vassilis Tsotras
Apple
 * Karanjeet Singh
Line 322: Line 155:
Hebrew University
 * Keren Ouaknine
Google
 * Tyler Palsulich
Line 325: Line 158:
Oracle
 * Till Westmann
Chronaly
 * Steven Francus
Line 328: Line 161:
X15 Software
 * Vinayak Borkar
 * Zach Heilbron

KACST Saudi Arabia
 * Sattam Alsubaiee

Saudi Aramco
 * Abdullah Alamoudi

Carey, Li, and Maxon are full-time UCI (UC Irvine) staff, Tsotras is
full-time UCR (UC Riverside) staff, with the remaining UCI and UCR
affiliates being students. The non-UC committers are a mix of alumni
who continue to contribute to the effort and individuals working
with permission part-time (or in spare time) on this project.
Line 353: Line 171:
 * Henry Saputra
 * Jochen Wiedmann
 * Ted Dunning
 * Ate Douma
 * Chris Mattmann
 * Paul Ramirez
 * Lewis John McGibbney
[others]
Line 360: Line 178:
The Apache Incubator The Apache Board (pTLP) or...the Apache Incubator.

Apache DRAT Proposal

Abstract

Apache Distributed Release Audit Tool (DRAT) is a distributed, parallelized (Map Reduce) wrapper around Apache™ RAT that allows it to complete on large code repositories of multiple file types, where Apache™ RAT on its own hangs forever.

Proposal

Apache DRAT is a distributed, parallelized (Map Reduce) wrapper around Apache™ RAT (Release Audit Tool). RAT is used to check for proper licensing in software projects. However, RAT takes a prohibitively long time to analyze large repositories of code, since it can only run on one JVM. Furthermore, RAT isn't customizable by file type or file size and provides no incremental output. This wrapper dramatically speeds up the process by leveraging Apache™ OODT to parallelize the following components and run them as a workflow:

  • Apache™ Solr-based exploration of a CM repository (e.g., Git, SVN) and classification of that repository by MIME type using Apache™ Tika.
  • A MIME partitioner that uses Apache™ Tika to automatically detect and classify files by type and then partition Apache™ RAT jobs into sets of 100 files per type (configurable) -- the M/R "partitioner"
  • A throttle wrapper that runs Apache™ RAT on each MIME-targeted set of files -- the M/R "mapper"
  • A reducer that combines the produced RAT logs into a global RAT report that can be used for stats generation -- the M/R "reducer"
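
The partitioning step described in the list above can be sketched roughly as follows. This is a minimal illustration, not DRAT's actual code: the function name partition_by_mime is hypothetical, and Python's standard mimetypes module (which guesses from the file extension only) stands in for Apache Tika, which also inspects file content.

```python
import mimetypes
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def partition_by_mime(
    paths: Iterable[str], batch_size: int = 100
) -> Iterator[Tuple[str, List[str]]]:
    """Group paths by guessed MIME type, then yield (mime, batch) pairs.

    Hypothetical helper: each batch holds at most batch_size paths of one type,
    mirroring the "sets of 100 files per type (configurable)" idea.
    """
    by_type: Dict[str, List[str]] = defaultdict(list)
    for path in paths:
        # Assumption: extension-based guessing stands in for Tika detection.
        mime, _encoding = mimetypes.guess_type(path)
        by_type[mime or "application/octet-stream"].append(path)

    for mime in sorted(by_type):
        files = by_type[mime]
        # Slice each type's file list into fixed-size batches.
        for start in range(0, len(files), batch_size):
            yield mime, files[start:start + batch_size]
```

Each yielded batch would then be handed to one audit job (the "mapper"). Keeping batches homogeneous by type follows the observation that RAT behaved better on similar sets of files.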

Background and Rationale

The Release Audit Tool (RAT) was developed as part of the Apache Creadur project at the Apache Software Foundation (ASF), in response to demand from the ASF and its hundreds of projects for a release auditing capability that could be integrated into projects. RAT's primary function is automated code auditing and open source license analysis, focusing on headers. RAT is a natural language processing tool written in Java so that it runs easily on any platform, and it audits code in many source languages (e.g., C, C++, Java, Python). RAT can also add license headers to source files that lack them.

In the summer of 2013, our team ran Apache RAT on source code produced by the Defense Advanced Research Projects Agency (DARPA) XDATA national initiative, whose inception coincided with the 2012 U.S. Presidential Initiative in Big Data. XDATA brought together 24 performers across academia, private industry and the government to construct analytics, visualizations, and open source software mash-ups that were transitioned into government projects and to the defense sector. XDATA produced a large Git repository consisting of ~50,000 files and tens of millions of lines of code. DARPA XDATA was launched to build a useful infrastructure for many government agencies, and ultimately it is an effort to avoid the traditional government-contractor software pipeline, in which additional contracts are required to reuse and unlock software previously funded by the government in other programs.

All XDATA software is open source and is ingested into DARPA’s Open Catalog, which points to outputs of the program, including its source code and metrics on the repository. Because of this, one of the core products of XDATA is the internal Git repository. Since XDATA brought together open source software across multiple performers, understanding the licenses that the source code used, and their compatibilities and differences, was extremely important. Since the repository was so large, our strategy was to develop an automated process using Apache RAT. We ran RAT on a 24-core, 48 GB RAM Linux machine at the National Aeronautics and Space Administration (NASA)’s Jet Propulsion Laboratory (JPL) to produce a license evaluation of the XDATA Git repository and to provide recommendations on how the open source software products can be combined to adhere to the XDATA open source policy encouraging permissive licenses. Against our expectations, however, RAT failed to audit XDATA’s large Git repository successfully and quickly.

Moreover, RAT provided no incremental output, producing only a final report when a task was completed. RAT’s crawler did not automatically discern between binary file types and other file types. It seemed that RAT performed better when similar sets of files were collected together (e.g., all JavaScript, all C++, all Java) and RAT jobs were then run individually by file type on smaller increments of files (e.g., 100 Java files at a time). The lessons learned navigating these issues motivated us to create “DRAT”, which stands for "Distributed Release Audit Tool". DRAT directly overcomes RAT's limitations and brings code auditing and open source license analysis into the realm of Big Data using scalable open source Apache technologies. DRAT is already being applied in, and transitioned into, government agencies. DRAT is currently hosted on GitHub under Chris Mattmann's account and is licensed under the ALv2. Chris Mattmann was the PI of DARPA XDATA at JPL.
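
To illustrate what incremental output could look like when batches are audited one at a time, here is a small sketch under stated assumptions. Both audit_incrementally and toy_audit are hypothetical names; toy_audit is only a stand-in for a single RAT run and simply checks whether a text mentions the Apache License. It is not how RAT classifies headers.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, Iterator, List


def toy_audit(batch: List[str]) -> Dict[str, int]:
    """Hypothetical stand-in for one RAT job over a batch of file texts."""
    counts: Counter = Counter()
    for text in batch:
        counts["ALv2" if "Apache License" in text else "unknown"] += 1
    return dict(counts)


def audit_incrementally(
    batches: Iterable[List[str]],
    audit_batch: Callable[[List[str]], Dict[str, int]],
) -> Iterator[Dict[str, int]]:
    """Yield the combined license counts after every batch.

    The last yielded value plays the role of the global report; the earlier
    values are the incremental output that a single long run does not give.
    """
    totals: Counter = Counter()
    for batch in batches:
        totals.update(audit_batch(batch))  # merge this batch's counts
        yield dict(totals)
```

A caller can print or store each yielded snapshot as it arrives, so a failure late in a long audit does not discard the results of earlier batches.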

Current Status

TBD

Meritocracy

Current developers for the project are all ASF members, experienced with ASF processes and procedures. We know how to grow an Apache community and to develop a meritocratic free and open source project.

Community

TBD

Core Developers

Mention JPL folks.
Tyler is at Google.
Karanjeet, formerly of USC + JPL, is now at Apple.
Former USC students.

Alignment

TBD

Known Risks

Orphaned products

JPL is making a commitment to run DRAT on our internal code repos.

TBD

Inexperience with Open Source

TBD

Relationships with Other Apache Products

RAT, OODT, Tika, Lucene, Solr, Wicket

Homogeneous Developers

TBD

Reliance on Salaried Developers

TBD

An Excessive Fascination with the Apache Brand

TBD

Documentation

Documentation including code, a wiki, and publications surrounding DRAT can be found at http://github.com/chrismattmann/drat/.

Initial Source

The initial source code, along with a wiki and publications surrounding DRAT, can be found at http://github.com/chrismattmann/drat/.

External Dependencies

DRAT depends on a number of Apache projects:

  • OODT
  • Lucene
  • RAT
  • Solr
  • Tika
  • Wicket

and other open source projects (organized by license):

As all dependencies are managed using Apache Maven, none of the external libraries need to be packaged in a source distribution.

Required Resources

Developer and user mailing lists

A gitbox repository at:

https://github.com/apache/drat.git

Issue tracking

We will use the GitHub issue tracker.

Initial Committers

The following is a list of the planned initial Apache committers (the active subset of the committers for the current repository at GitHub).

  • Chris Mattmann
  • Tyler Palsulich
  • Paul Ramirez
  • Lewis John McGibbney
  • Karanjeet Singh
  • Steven Francus
  • Michael Joyce

Affiliations

NASA JPL

  • Chris Mattmann
  • Paul Ramirez
  • Lewis John McGibbney
  • Michael Joyce

Apple

  • Karanjeet Singh

Google

  • Tyler Palsulich

Chronaly

  • Steven Francus

Sponsors

Champion

Chris Mattmann (NASA/JPL)

Nominated Mentors

  • Chris Mattmann
  • Paul Ramirez
  • Lewis John McGibbney

[others]

Sponsoring Entity

The Apache Board (as a pTLP) or the Apache Incubator.

DRATProposal (last edited 2017-08-30 16:54:36 by ChrisMattmann)