HBase Token Authentication

While HBase security now supports Kerberos authentication for client RPC connections, this is only part of the puzzle for integration with secure Hadoop. In secure Hadoop, Kerberos authentication is used only for direct client access to HDFS. The Hadoop MapReduce framework instead uses a DIGEST-MD5 authentication scheme, where the client is granted a signed "delegation token" and a secret "token authenticator" (a keyed SHA-1 hash of the delegation token, computed with a NameNode (NN) secret key) when a MapReduce job is submitted. The token and authenticator are serialized into a secure location in HDFS, so that the spawned Child processes can deserialize the credentials and use them to re-authenticate to the NN as the submitting user.

Since Kerberos credentials are not available in the MapReduce task execution context, any attempt by a task to authenticate to HBase using Kerberos will fail. HBase connections therefore need to support an alternate authentication scheme, similar to the one used by the Hadoop MapReduce framework.

Goals

The main considerations for supporting MapReduce authentication are:

  1. The implementation should avoid any changes to core Hadoop code. Changes in Hadoop would need far more review and discussion before they could be accepted, and would necessitate running a forked version of Hadoop for some time.
  2. Any changes should be transparent to existing MapReduce user code. We shouldn't require any new APIs to be used for authentication, for example.
  3. Changes to the job submission process, such as using a wrapper or utility to submit MapReduce jobs, are preferable to any changes requiring code modifications.

HBase Authentication Tokens

While Hadoop user delegation tokens provide an existing means of MapReduce task authentication, their reliance on a secret key stored in memory on the NameNode makes them inaccessible for authentication in HBase. Fortunately, the Hadoop security implementation and the MapReduce job submission and execution code provide a generalized framework for token handling. Building on top of this, we can provide token-based authentication from MR tasks to HBase without any core Hadoop or MapReduce changes.

Proposal: Adding an HBase user token

  1. extend org.apache.hadoop.security.token.TokenIdentifier with our own token implementation

  2. implement org.apache.hadoop.security.token.SecretManager

  3. master will generate a secret key for signing and authenticating tokens
    1. the key will need to be persisted somewhere (ZooKeeper?) to allow for master restarts and failover
    2. the generated secret key will be distributed across all cluster nodes via ZooKeeper

      1. ZooKeeper access to keys will be secured by Kerberos authentication (ZOOKEEPER-938) and by ACLs limiting access to HBase principals

  4. add a helper like TableMapReduceUtil.initJob() to use when submitting a new job

    1. will obtain a new token from master
    2. add token to Credentials instance
    3. normal JobClient code will serialize Credentials for MR job

  5. when running MR job, Credentials will be deserialized from secure location
    1. HBaseClient will look in credentials for any relevant tokens

Limitations

  1. It doesn't appear that we'll be able to use the existing delegation token renewal mechanism (but do we really need to do token renewal?)

Token

The HBase authentication token is modeled directly on the Hadoop user delegation token. We have dropped support for a designated renewer, however, as we cannot support HBase token renewal without modifying core MapReduce code. The token will consist of:

Authentication

HBase token authentication builds on the DIGEST-MD5 authentication support provided by Hadoop RPC. It follows the same process the NameNode uses for Hadoop user delegation token authentication:

  1. Client sends TokenID to server

  2. Server uses TokenID and the in-memory master secret key to regenerate TokenAuthenticator

  3. Server validates TokenID, checks for expiration

  4. Server and client then use TokenAuthenticator as the shared secret to negotiate DIGEST-MD5 authentication
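
The steps above can be sketched in plain Java. This is a minimal illustration rather than HBase or Hadoop code: the class and method names are hypothetical, and it assumes the authenticator is an HMAC-SHA1 over the serialized token ID, as in Hadoop's delegation token scheme. The DIGEST-MD5 negotiation itself is left to the SASL layer and is not shown.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

// Hypothetical sketch: none of these names exist in HBase or Hadoop.
final class TokenAuthenticatorSketch {
    // Assumption: HMAC-SHA1, mirroring Hadoop delegation tokens.
    private static final String HMAC_ALGORITHM = "HmacSHA1";

    /** Derives the token authenticator from the serialized token ID and a secret key. */
    static byte[] computeAuthenticator(byte[] tokenId, byte[] secretKey) {
        try {
            Mac mac = Mac.getInstance(HMAC_ALGORITHM);
            mac.init(new SecretKeySpec(secretKey, HMAC_ALGORITHM));
            return mac.doFinal(tokenId);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("Unable to compute token authenticator", e);
        }
    }

    /**
     * Server side: reject expired tokens, then regenerate the authenticator
     * that will serve as the DIGEST-MD5 shared secret.
     */
    static byte[] regenerateForServer(byte[] tokenId, long expirationMillis,
                                      long nowMillis, byte[] secretKey) {
        if (nowMillis > expirationMillis) {
            throw new SecurityException("Token has expired");
        }
        return computeAuthenticator(tokenId, secretKey);
    }

    public static void main(String[] args) {
        byte[] secretKey = "demo-master-key".getBytes(StandardCharsets.UTF_8);
        byte[] tokenId = "user=alice;keyId=1".getBytes(StandardCharsets.UTF_8);

        // Issued to the client at job submission time.
        byte[] clientSecret = computeAuthenticator(tokenId, secretKey);
        // Regenerated by the server from the token ID the client presents.
        byte[] serverSecret = regenerateForServer(tokenId, Long.MAX_VALUE,
                System.currentTimeMillis(), secretKey);

        // Both sides hold the same secret although only the token ID was sent.
        System.out.println("secrets match: "
                + MessageDigest.isEqual(clientSecret, serverSecret));
    }
}
```

Because the server can rebuild the authenticator from the token ID and its own key, it does not need to store any per-token state.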

Master Secret Key

Authentication relies on a secret key generated and held in memory on the RPC server and used to generate authentication tokens for clients. The secret key used to generate tokens will be periodically rolled in order to limit exposure to brute-force reversing of the secret key from token signatures. However, to allow previously issued tokens to continue to work, N previously generated keys will be retained, with the oldest removed on rolling when the limit is reached. In order to simplify coordination of the key rolling process, all RPC servers in a cluster will select a "leader" which will perform the secret key rolling and expiration.
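
The rolling and retention policy can be sketched as a small in-memory key ring. This is an illustration under stated assumptions, not the proposed implementation: the class names, the 20-byte key length, and the integer key IDs are all hypothetical.

```java
import java.security.SecureRandom;
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch of the leader's key rolling policy.
final class RollingKeyRingSketch {
    /** A secret key tagged with the ID that tokens would reference. */
    static final class SecretKeyEntry {
        final int id;
        final byte[] material;

        SecretKeyEntry(int id, byte[] material) {
            this.id = id;
            this.material = material;
        }
    }

    private final Deque<SecretKeyEntry> keys = new ArrayDeque<>(); // newest first
    private final SecureRandom random = new SecureRandom();
    private final int retainedPrevious; // the "N" previous keys to keep
    private int nextId = 1;

    RollingKeyRingSketch(int retainedPrevious) {
        if (retainedPrevious < 0) {
            throw new IllegalArgumentException("retainedPrevious must be >= 0");
        }
        this.retainedPrevious = retainedPrevious;
        roll(); // start with one current key
    }

    /** Generates a new current key and drops the oldest keys beyond the limit. */
    synchronized SecretKeyEntry roll() {
        byte[] material = new byte[20]; // assumed key length
        random.nextBytes(material);
        SecretKeyEntry entry = new SecretKeyEntry(nextId++, material);
        keys.addFirst(entry);
        while (keys.size() > retainedPrevious + 1) {
            keys.removeLast();
        }
        return entry;
    }

    /** The key used to sign newly issued tokens. */
    synchronized SecretKeyEntry current() {
        return keys.peekFirst();
    }

    /** Finds the key a token was signed with, or null if it has been expired. */
    synchronized SecretKeyEntry lookup(int keyId) {
        for (SecretKeyEntry entry : keys) {
            if (entry.id == keyId) {
                return entry;
            }
        }
        return null;
    }
}
```

With `new RollingKeyRingSketch(2)`, a token signed with key 1 stays verifiable through two further rolls and stops validating after the third.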

Authentication tokens could be generated for Kerberos-authenticated clients by any RPC server holding the current secret key, but authentication using such a token must succeed on every server in the cluster. The leader node therefore needs a means of distributing secret key changes to the other cluster nodes.

The secret keys currently in use will also need to be kept in semi-persistent storage so that validation of previously issued authentication tokens survives a cluster restart. The keys themselves are transient by nature because of the expiration policy, but storing them allows any new RPC server to read the keys currently in use on startup and resume token generation and authentication.

ZooKeeper will be used to coordinate all three of these needs.

To coordinate leader node selection, RPC servers will race to obtain an ephemeral znode on startup, with the successful node starting the key rolling and expiration thread:

  <HBASE_ROOT>/
      tokenauth/
          znode('keymaster', serverName)

The process used is very similar to the handling of the active HMaster. If the server holding the "keymaster" node dies or loses its ZooKeeper session, the remaining nodes will again race to claim the znode first.
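
The race can be modelled without a ZooKeeper dependency. In the sketch below a ConcurrentMap stands in for ZooKeeper, with putIfAbsent playing the role of creating an ephemeral znode that fails if the node already exists. All names are hypothetical, and "/hbase" is only an assumed value for <HBASE_ROOT>.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical in-memory model of the "keymaster" election; not ZooKeeper code.
final class KeymasterElectionSketch {
    // Assumption: "/hbase" stands in for <HBASE_ROOT>.
    static final String KEYMASTER_PATH = "/hbase/tokenauth/keymaster";

    // Stand-in for the ZooKeeper namespace: znode path -> owning server name.
    private final ConcurrentMap<String, String> znodes = new ConcurrentHashMap<>();

    /**
     * Attempts to create the keymaster znode. Exactly one caller succeeds,
     * and that server would then start the key rolling and expiration thread.
     */
    boolean tryBecomeLeader(String serverName) {
        return znodes.putIfAbsent(KEYMASTER_PATH, serverName) == null;
    }

    /**
     * Models the loss of a server's session: its ephemeral znode disappears,
     * but only if that server actually owned it.
     */
    void sessionExpired(String serverName) {
        znodes.remove(KEYMASTER_PATH, serverName);
    }

    /** Returns the current leader's name, or null if the znode is unclaimed. */
    String currentLeader() {
        return znodes.get(KEYMASTER_PATH);
    }
}
```

In a real deployment the losing servers would set a watch on the znode and retry when it is deleted, rather than polling.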

To broadcast master key changes throughout the cluster and to provide key persistence between server restarts or failover, the leader will maintain a znode per secret key, with all other nodes watching for changes to the children of a known parent:

  <HBASE_ROOT>/
      tokenauth/
          keys/
              znode(keyID1, serialized DelegationKey1)
              znode(keyID2, serialized DelegationKey2)
              ...

Note that this depends on securing access to the key znodes via Kerberos authentication of ZooKeeper clients and setting of ZooKeeper ACLs.

Implementation

  1. Extend org.apache.hadoop.security.token.TokenIdentifier with new HBase type

  2. Implement org.apache.hadoop.security.token.TokenSelector to pull out HBase type tokens

  3. Extend org.apache.hadoop.security.token.SecretManager with implementation to generate HBase tokens. This will be used on HMaster to generate HBase tokens, and on HRegionServer to validate tokens for authentication.
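
The role of the token selector in item 2 can be illustrated with a simplified stand-in. The sketch below does not use the Hadoop classes: SimpleToken models only the kind and service fields of a token, and the kind string and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a TokenSelector; not the Hadoop interface.
final class TokenSelectionSketch {
    // Assumed token kind used to tag HBase authentication tokens.
    static final String HBASE_KIND = "HBASE_AUTH_TOKEN";

    /** Simplified token: only the fields the selector inspects are modelled. */
    static final class SimpleToken {
        final String kind;
        final String service;

        SimpleToken(String kind, String service) {
            this.kind = kind;
            this.service = service;
        }
    }

    /**
     * Returns the first HBase token issued for the given service,
     * or null if the credentials hold no matching token.
     */
    static SimpleToken selectToken(String service, List<SimpleToken> tokens) {
        for (SimpleToken token : tokens) {
            if (HBASE_KIND.equals(token.kind) && service.equals(token.service)) {
                return token;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<SimpleToken> credentials = new ArrayList<>();
        credentials.add(new SimpleToken("HDFS_DELEGATION_TOKEN", "namenode:8020"));
        credentials.add(new SimpleToken(HBASE_KIND, "hbase-cluster-1"));

        SimpleToken selected = selectToken("hbase-cluster-1", credentials);
        System.out.println("selected kind: " + (selected == null ? "none" : selected.kind));
    }
}
```

Filtering on the token kind is what lets HBase tokens share the job credentials with HDFS delegation tokens without either client picking up the wrong one.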

Map Reduce Flow

For all of this to work without changes to Hadoop and MapReduce code, we have two key requirements:

  1. We must be able to add our own tokens to the MR job Credentials instance at job submission time (and the job must be able to serialize our token correctly with the rest of the job info)
  2. The Child task executing on each node must deserialize our token and add it to the UserGroupInformation instance so it can later be picked up by the HBase client for authentication

Job Submission

  1. Add a new utility class SecureMapReduceUtil with a static helper method, something like void initAuthentication(Job job)

    1. Call Master to obtain a new authentication token for the logged-in user
      • Token will only be returned if user is authenticated via Kerberos, same as HDFS
    2. Add HBase token to job credentials -- job.getCredentials().addToken(Text alias, Token)

      • FileSystem.getCanonicalServiceName() is used as the alias for HDFS delegation tokens; what should we use?

  2. Job.submit() is later called normally, which should serialize the token with the rest of the job credentials

    1. JobTracker.submitJob() receives the credentials via RPC and adds them to a JobInProgress instance added to the job queue

    2. Scheduler will write out the tokens when the job is run. JobInProgress.initTasks() -> generateAndStoreTokens() -> Credentials.writeTokenStorageFile()

    3. The serialized tokens will be written to <jobdir>/jobToken

Job Execution on Task Nodes

  1. On task start, Child.main() will read in a copy of the tokens from the local filesystem using TokenCache.loadTokens(); the local path is passed in as an environment variable

  2. Each token is added to the child task UserGroupInformation instance used to run the local task

  3. Any HBase connections opened by the task will inherit the same UGI
  4. A TokenInfo annotation on the HRegionInterface and HMasterInterface protocol interfaces identifies the HBase TokenSelector implementation, which is then used to extract the relevant authentication token from the UGI's credentials

  5. Using the HBase authentication token, the authentication process proceeds as above