Use, or rather abuse, of PigContext

The current Pig code uses PigContext as a generic placeholder for anything and everything, to the extent that it even has rename and copy methods. This is confusing, and the class does not stick to the role its name implies. Below is the role I would prefer for PigContext and how I would like to see it used.

Role of PigContext

From the name I infer that the class should be a context in which Pig executes. So this should ideally be just a set of properties.

Current roles of PigContext

  1. It has a set of properties
  2. Maintains handles to the DataStores, both DFS and LFS
  3. It is tightly integrated with jar registration and object instantiation
  4. Even has methods that rename and copy files
  5. Maintains a reference to the current JobConf that is being executed
  6. Maintains a reference to the execution engine

Changes I would like to see

It should basically just be a set of properties. The parts that can easily be separated out are the DataStores and the handle to the execution engine. Classes that need a DataStore handle should use the PigContext instance only for its properties and create the DataStore internally, instead of depending on PigContext to provide the handle. The execution engine is something that PigServer should maintain. Better still, PigServer could keep a mapping between each execution engine and the PigContext that was used to invoke it. It is also probably not a good idea to make PigContext a singleton or a class with static methods: if we move to a server model where Pig has to talk to multiple backends, we will need a separate PigContext instance per backend and will use that instance to connect to it.
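To make the proposal above concrete, here is a rough sketch of how the server could own the engine-to-context mapping while other classes build their own store handles from plain properties. All names here (ExecEngine, DataStoreHandle, PigServerSketch, the "fs.default.uri" key) are made up for illustration and are not the existing Pig classes or property names.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Properties;

// Hypothetical stand-in for an execution engine; not Pig's actual interface.
interface ExecEngine {
    String name();
}

// Hypothetical data store handle that is built from properties alone.
final class DataStoreHandle {
    final String uri;

    private DataStoreHandle(String uri) {
        this.uri = uri;
    }

    // The caller creates the store itself from the properties,
    // instead of asking a context object for a ready-made handle.
    static DataStoreHandle fromProperties(Properties props) {
        // "fs.default.uri" is an illustrative key, not a real Pig property.
        return new DataStoreHandle(props.getProperty("fs.default.uri", "file:///"));
    }
}

// Sketch of the server owning the engines and the context each was started with.
final class PigServerSketch {
    private final Map<ExecEngine, Properties> contexts = new LinkedHashMap<>();

    // Register a backend with a private copy of its properties, so that
    // each backend has its own context instance rather than a shared static one.
    void addBackend(ExecEngine engine, Properties props) {
        Properties copy = new Properties();
        copy.putAll(props);
        contexts.put(engine, copy);
    }

    // Returns the context for the given engine, or null if it was never registered.
    Properties contextFor(ExecEngine engine) {
        return contexts.get(engine);
    }
}
```

With this shape, a class that needs a file system would call something like DataStoreHandle.fromProperties(server.contextFor(engine)), and nothing in the context itself would know about data stores or engines.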

Object instantiation is tightly coupled with other parts. We should add a separate utility class that does nothing but object instantiation, using the properties inside PigContext, such as the extra jars that were added during the execution of Pig. With this, the PigContext class in fact becomes redundant. It can be maintained as a variable of type Properties in PigServer, as the value mapped to each execution engine. PigServer can have a variable currentExecEngine that points to an entry in the mapping, so all accesses to PigContext can be replaced by accesses to a Properties object. However, this would mean that PigServer becomes the starting point for any operation. I guess that is the way it should be.
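A minimal sketch of such an instantiation utility might look like the following. It assumes the registered jars are recorded in the properties as a comma-separated list under a key I have invented here ("pig.extra.jars"); the class name ObjectInstantiator is likewise hypothetical, and the real property layout may well differ.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical utility whose only job is object instantiation.
final class ObjectInstantiator {
    // Assumed property key holding a comma-separated list of jar paths.
    static final String EXTRA_JARS_KEY = "pig.extra.jars";

    private ObjectInstantiator() {
    }

    // Creates an instance of className via its no-argument constructor,
    // looking the class up in the extra jars named in props as well as
    // on the normal class path.
    static Object instantiate(String className, Properties props)
            throws ReflectiveOperationException, MalformedURLException {
        List<URL> urls = new ArrayList<>();
        for (String jar : props.getProperty(EXTRA_JARS_KEY, "").split(",")) {
            String path = jar.trim();
            if (!path.isEmpty()) {
                urls.add(Paths.get(path).toUri().toURL());
            }
        }
        // The loader is left open on purpose: classes it defines may still
        // need to load their dependencies lazily from the same jars.
        ClassLoader parent = ObjectInstantiator.class.getClassLoader();
        ClassLoader loader = new URLClassLoader(urls.toArray(new URL[0]), parent);
        return Class.forName(className, true, loader)
                .getDeclaredConstructor()
                .newInstance();
    }
}
```

The point of the design is that the utility depends only on a Properties object, so it works the same way whether the properties come from PigContext today or from PigServer later.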

Comments

AlanGates 21 May 2008

I think it makes sense to associate data store handles, job configuration, properties, and jar registration with a given session. I like your idea of having a separate utility to handle object instantiation. I also agree that PigServer should be the central point for accessing this information. So maybe it makes sense to have something like this: a PigSessionContext class that holds all information relevant to a particular session. This would include the jars that have been registered, the properties specified for the job, the connection to an execution engine and/or file system, and any information about jobs running on that execution engine. The set of all PigSessionContext objects would be kept in PigServer, in a map keyed by session id. This way we can get access to all relevant session information, internally or externally, and we need only pass around the session key. For now there will only ever be one session, but that can change as we move to a client-server architecture in the future.
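As a rough illustration of the session idea, the sketch below keeps per-session state in a PigSessionContext and hands out integer session keys. Only the PigSessionContext name comes from the comment above; SessionRegistry, the integer ids, and the choice of fields are assumptions made for the example, and the engine and job information is left out for brevity.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;

// Holds the state that belongs to one session (a subset, for illustration).
final class PigSessionContext {
    private final Properties properties = new Properties();
    private final List<String> registeredJars = new ArrayList<>();

    Properties getProperties() {
        return properties;
    }

    void registerJar(String path) {
        registeredJars.add(path);
    }

    // Read-only view so callers cannot bypass registerJar.
    List<String> getRegisteredJars() {
        return Collections.unmodifiableList(registeredJars);
    }
}

// Hypothetical piece of PigServer that maps session ids to their contexts.
final class SessionRegistry {
    private final Map<Integer, PigSessionContext> sessions = new HashMap<>();
    private int nextId = 1;

    // Creates a fresh session and returns the key that callers pass around.
    int openSession() {
        int id = nextId++;
        sessions.put(id, new PigSessionContext());
        return id;
    }

    // Looks up a session by key; unknown keys are treated as a caller error.
    PigSessionContext get(int sessionId) {
        PigSessionContext ctx = sessions.get(sessionId);
        if (ctx == null) {
            throw new IllegalArgumentException("Unknown session id: " + sessionId);
        }
        return ctx;
    }
}
```

Code that needs session state would then take only the session key and ask the registry for the rest, which keeps the single-session case today and a multi-session server later on the same code path.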