This wiki serves as a resource to document some of the reasoning behind the design decisions for Sqoop2.

What is Sqoop2 in a nutshell?

Apache Sqoop is a tool designed for efficiently transferring data between structured, semi-structured and unstructured data sources. Relational databases are examples of structured data sources with a well-defined schema for the data they store. Cassandra and HBase are examples of semi-structured data sources, and HDFS is an example of an unstructured data source that Sqoop can support. Sqoop2 achieves this transfer via a job that encapsulates the extract and load phases of the data transfer. A Sqoop2 job provides the FROM and TO abstractions for the extract and load phases, respectively.
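
To make the FROM/TO idea concrete, here is a minimal, purely illustrative sketch; the classes below are hypothetical and are not the real Sqoop2 client API. A job simply pairs a FROM link with a TO link, and the extract phase reads from one side while the load phase writes to the other.

```java
import java.util.Map;

// Purely illustrative sketch: these record types are hypothetical, not the Sqoop2 API.
public final class FromToJobSketch {

    // A link names a connector plus the configuration needed to reach one data system.
    record Link(String connector, Map<String, String> config) {}

    // A job pairs a FROM link (extract side) with a TO link (load side).
    record Job(String name, Link from, Link to) {}

    public static void main(String[] args) {
        Link from = new Link("generic-jdbc-connector",
                Map.of("connectionString", "jdbc:mysql://db.example.com/sales"));
        Link to = new Link("hdfs-connector",
                Map.of("outputDirectory", "/data/sales"));

        // Extract runs against 'from', load runs against 'to'.
        Job job = new Job("sales-import", from, to);
        System.out.println(job.name() + ": " + job.from().connector()
                + " -> " + job.to().connector());
    }
}
```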


Why Tomcat for the Sqoop2 Server?

Tomcat provides the basic web-server container to host Sqoop as a service. One of the design goals of Sqoop2 was to provide REST APIs for creating Sqoop jobs. Tomcat has its quirks, and in 2014 there are better alternatives for a JVM-based web server. By the way, we welcome patches to support Jetty or Netty for the Sqoop server.
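
As a quick illustration of the "Sqoop as a service" idea, the snippet below pings the server over plain HTTP from Java. The host, the default port 12000 and the /sqoop/version path are assumptions based on a typical deployment; check the REST API docs of your release for the exact endpoints.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public final class SqoopServerPing {
    public static void main(String[] args) throws Exception {
        HttpClient http = HttpClient.newHttpClient();
        // 12000 is the usual Sqoop2 server port; the exact REST path may differ by release.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:12000/sqoop/version"))
                .GET()
                .build();
        HttpResponse<String> response =
                http.send(request, HttpResponse.BodyHandlers.ofString());
        // The server replies with JSON describing the supported API versions.
        System.out.println(response.statusCode() + ": " + response.body());
    }
}
```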

 

What is the Sqoop2 Repository and why do we use Derby? Can we use a document store to save the Sqoop entities?

Sqoop2 job information is persisted in the repository. We chose Derby as the initial implementation largely for its simplicity. Since we have a well-defined Repository API, it is possible to add support for additional database implementations to store Sqoop2 jobs and their associated information. The Sqoop2 entities, such as the connector Configurables, Links, Jobs, LinkConfigs and JobConfigs, are currently modeled in a way that is best represented in a relational database, but it should be possible to store them in a document store such as MongoDB; constraints such as unique names across connectors would then have to be enforced in code rather than by the RDBMS.
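
To illustrate the last point, here is a hypothetical sketch (not the real Repository API) of what a document-store backed repository would have to do by hand: the unique-name constraint that an RDBMS enforces with a UNIQUE index has to be checked in code before the document is persisted.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch, not the real Repository API: a document store has no
// relational UNIQUE constraint, so the repository code must enforce it itself.
public final class DocumentLinkRepositorySketch {

    record LinkDocument(String name, String connector, Map<String, String> config) {}

    private final Map<String, LinkDocument> linksByName = new ConcurrentHashMap<>();

    public void saveLink(LinkDocument link) {
        // putIfAbsent stands in for the document store's conditional insert.
        if (linksByName.putIfAbsent(link.name(), link) != null) {
            throw new IllegalArgumentException("Link name already in use: " + link.name());
        }
    }
}
```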

What are the main design goals of Sqoop2?

The overarching goals are documented here, but the more subtle ones will be added on this page.

  • Allow development of data connectors against a stable API, independent of Sqoop2 implementation internals (such as the choice of execution engine or dependencies on Hadoop components). For example, the Oracle connector can't assume that a tnsnames.ora exists in the environment, and the Kite connector can't assume that hive-site.xml will exist. The connector can still ask for the location of hive-site.xml or tnsnames.ora as an input when creating a link, though.
  • Connectors focus on how to get data in and out of data systems. The framework handles the execution life-cycle: kicking off tasks, workers and so on. We never rely on the framework to handle data reads and writes (even though most frameworks have IO capability); that is the responsibility of the connectors (see the sketch after this list).
  • End-user actions should be exposed through a Java Client API, a REST API and a command line utility. All three are mandatory for new features.
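
The second bullet above is the heart of the connector contract. The sketch below is a simplified, hypothetical rendering of it (the real connector SDK interfaces differ): the connector supplies the extract and load logic, while the framework owns the record stream and decides when each side runs.

```java
// Simplified, hypothetical rendering of the connector contract; the real
// connector SDK classes differ, but the division of labour is the same.
public interface ConnectorContractSketch {

    // FROM side: the connector pulls records out of the external system
    // and hands them to the framework's record stream.
    interface Extractor<P> {
        void extract(DataWriter writer, P partition);
    }

    // TO side: the connector takes records from the framework's record
    // stream and writes them into the external system.
    interface Loader {
        void load(DataReader reader);
    }

    // The record stream itself is owned by the framework, never by connectors.
    interface DataWriter { void writeRecord(Object[] fields); }
    interface DataReader { Object[] readRecord(); }
}
```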

Does Sqoop2 use MR for data transfer?

Yes, MapReduce is one of the supported execution engines as of this writing. The Sqoop job life cycle is defined by the Sqoop connector API, and the Extract/Load phases are mapped to the Map/Reduce phases of the MR engine. Most often the Sqoop job is map-only unless it is explicitly configured to use the reduce step for loading, i.e., both Extract and Load happen in the map part of the MR pipeline.
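
Here is a hypothetical sketch of the map-only case described above (no real Sqoop2 or Hadoop classes are used): a single map task runs the extractor and the loader back to back, so no reduce phase is needed unless the job is configured to load in the reduce step.

```java
import java.util.Arrays;
import java.util.Collections;

// Hypothetical illustration of a map-only transfer; no real Sqoop2 or Hadoop
// classes are used here.
public final class MapOnlyTransferSketch {

    interface Extractor { Iterable<Object[]> extract(String partition); }
    interface Loader { void load(Iterable<Object[]> records); }

    // Stand-in for what a single map task does under the MR execution engine:
    // extract and load run back to back, so no reduce phase is required.
    static void runMapTask(String partition, Extractor extractor, Loader loader) {
        loader.load(extractor.extract(partition));
    }

    public static void main(String[] args) {
        Extractor fakeExtractor =
                p -> Collections.singletonList(new Object[] {p, 42});
        Loader stdoutLoader =
                records -> records.forEach(r -> System.out.println(Arrays.toString(r)));
        runMapTask("partition-0", fakeExtractor, stdoutLoader);
    }
}
```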

Does Sqoop2 ensure any particular ordering of the data transferred?

No, the ordering of the final output is not guaranteed. It can depend on the IntermediateDataFormat implementation used by the Sqoop job.

Adding fun facts about the design is encouraged!
