Multi tenant Impala Cluster

05 Mar 2014

The goal is to create a multi tenant Impala cluster. In other words, we need a server that authorizes each SQL statement for each user. Cloudera Sentry can probably play this role and might eventually replace this server.

The multi tenant server is not directly based on the Impala server (C++) but on Cloudera’s HiveServer2 (Java). Both the Impala server and HiveServer2 implement the same TCLIService protocol. This protocol is used in Cloudera’s ODBC 2.5 driver. It was easier to modify the HiveServer2 code than the Impala code, so I took the HiveServer2 implementation as the starting point.

The multi tenant server sits between the external ODBC clients and the actual Impala cluster. It inspects each SQL statement. The authorization itself is done by a remote server. If this server grants access, the intermediate server simply forwards the request to the actual Impala cluster.

Infrastructure multi tenant Impala cluster

The multi tenant server is nothing more than a delegator:

1. Capture each SQL statement and verify with the authorization service.
2. If the access is authorized, forward the query to the Impala cluster, otherwise return an error.
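The delegator flow described above can be sketched in a few lines of plain Java. This is only an illustration under stated assumptions: `AuthorizationService`, `QueryResult` and `StatementDelegator` are hypothetical names, and the backend is modelled as a simple function instead of a Thrift client.

```java
import java.util.function.Function;

// Hypothetical stand-in for the remote authorization service.
interface AuthorizationService {
    boolean isAllowed(String user, String sql);
}

// Hypothetical result type: payload holds the rows on success, the error message on failure.
final class QueryResult {
    final boolean ok;
    final String payload;

    QueryResult(boolean ok, String payload) {
        this.ok = ok;
        this.payload = payload;
    }
}

final class StatementDelegator {
    private final AuthorizationService authz;
    private final Function<String, String> backend; // stands in for the Impala cluster

    StatementDelegator(AuthorizationService authz, Function<String, String> backend) {
        this.authz = authz;
        this.backend = backend;
    }

    QueryResult execute(String user, String sql) {
        // Step 1: verify the statement with the authorization service.
        if (!authz.isAllowed(user, sql)) {
            // Step 2, denied: return an error without contacting the backend.
            return new QueryResult(false, "No permission to execute statement");
        }
        // Step 2, authorized: forward the statement unchanged.
        return new QueryResult(true, backend.apply(sql));
    }
}
```

The point of the sketch is that the backend is never reached for a denied statement; the real server applies the same decision to Thrift requests.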

Impala server versus HiveServer2

The Cloudera Impala server launches 2 external facing Thrift services

The Cloudera HiveServer2 launches one service

In order to create a multi tenant server, it is sufficient to implement the TCLIService. The options are:

Building on top of Cloudera’s HiveServer2 turned out to be the fastest way to get to a working prototype.

Delegation of the Thrift commands

The idea is to delegate all the Thrift requests (defined in ImpalaService.thrift), except for:

* OpenSession
* CloseSession
* ExecuteStatement

These 3 functions are intercepted, and a specific action is performed before or after the delegation.
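The article writes each forwarding method by hand. As an alternative illustration of the same "forward everything, intercept a few" idea, the sketch below uses a `java.lang.reflect.Proxy`. It is not the approach taken in this server: `MiniCliService`, `InterceptingProxy` and `Interceptor` are hypothetical names, and the two-method interface only mimics the shape of the real Thrift interface.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;
import java.util.Set;

// Hypothetical miniature of a service interface, with plain strings instead of Thrift structs.
interface MiniCliService {
    String GetTables(String req);
    String ExecuteStatement(String sql);
}

final class InterceptingProxy {
    /** Hook that is called instead of the delegate for intercepted method names. */
    interface Interceptor {
        Object intercept(Object delegate, Method method, Object[] args) throws Throwable;
    }

    @SuppressWarnings("unchecked")
    static <T> T wrap(Class<T> iface, T delegate, Set<String> intercepted, Interceptor hook) {
        InvocationHandler handler = (proxy, method, args) -> {
            try {
                if (intercepted.contains(method.getName())) {
                    // Intercepted call: the hook decides whether and how to delegate.
                    return hook.intercept(delegate, method, args);
                }
                // Every other call is forwarded to the delegate untouched.
                return method.invoke(delegate, args);
            } catch (InvocationTargetException e) {
                // Rethrow the delegate's own exception rather than the reflection wrapper.
                throw e.getCause();
            }
        };
        return (T) Proxy.newProxyInstance(iface.getClassLoader(), new Class<?>[] {iface}, handler);
    }
}
```

Hand-written forwarding, as used in the rest of this post, is more verbose but keeps every intercepted method visible and type-checked; the proxy variant trades that for less boilerplate.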

Construction of the delegate

During initialization of the server, the delegate is constructed. The delegate will forward the requests to the actual Impala server.

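  public synchronized void init(HiveConf hiveConf) {
    this.hiveConf = hiveConf;
    super.init(hiveConf);
    try {
        // IMPALA_SERVER_HOST and IMPALA_SERVER_PORT are settings added for this server,
        // not standard Hive options
        String host = HiveConf.getVar(hiveConf, IMPALA_SERVER_HOST);
        // getIntVar returns an int; getVar would return a String
        int port = HiveConf.getIntVar(hiveConf, IMPALA_SERVER_PORT);
        TSocket transport = new TSocket(host, port);
        transport.setTimeout(60000); // socket timeout in milliseconds
        TBinaryProtocol protocol = new TBinaryProtocol(transport);
        delegate = new TCLIService.Client(protocol);
        transport.open();
        LOG.info("ThriftCLIServiceDelegator connected to remote Impala daemon");
    }
    catch(Exception e) {
        // Log the full stack trace; without a connection the delegate is unusable
        LOG.error("Could not connect to the Impala daemon", e);
    }
  }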

Forward the requests to the actual Impala server

For most Thrift requests, the request is simply forwarded to the actual Impala server.

  @Override
  public TGetTablesResp GetTables(TGetTablesReq req) throws TException {
      return delegate.GetTables(req);
  }
  @Override
  public TGetColumnsResp GetColumns(TGetColumnsReq req) throws TException {
      return delegate.GetColumns(req);
  }

Capture the new sessions

The OpenSession request is first forwarded to the actual Impala server. The reply contains a session identifier which is then linked to the session’s user information. A simple mapping from a session identifier to the user information is maintained.

  @Override
  public TOpenSessionResp OpenSession(TOpenSessionReq req) throws TException {
      TOpenSessionResp resp = delegate.OpenSession(req);
      // store the identifier of the Impala session
      String username = req.getUsername();
      String password = req.getPassword();
      ...
      sessions.addSession(username,password,ip,resp.getSessionHandle());
      return resp;
  }
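The session mapping itself is not shown in this post. A minimal sketch of such a mapping could look as follows, assuming the session identifier is available as a string; `SessionRegistry` and its method names are hypothetical, and the password that the server passes along is deliberately left out here.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class SessionRegistry {
    // User information linked to one session.
    static final class SessionInfo {
        final String username;
        final String ip;

        SessionInfo(String username, String ip) {
            this.username = username;
            this.ip = ip;
        }
    }

    // Requests arrive on several server threads, so a concurrent map is used.
    private final Map<String, SessionInfo> bySessionId = new ConcurrentHashMap<>();

    // Called after OpenSession has been answered by the backend.
    void addSession(String sessionId, String username, String ip) {
        bySessionId.put(sessionId, new SessionInfo(username, ip));
    }

    // Returns null when the session is unknown.
    SessionInfo getSession(String sessionId) {
        return bySessionId.get(sessionId);
    }

    // Called when CloseSession is intercepted, so entries do not accumulate.
    void removeSession(String sessionId) {
        bySessionId.remove(sessionId);
    }
}
```

Removing the entry on CloseSession is the reason that request is one of the three intercepted calls in this sketch; a session that is never closed would still need some form of expiry.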

Capture the SQL statement

Each request that executes a SQL statement is intercepted. The mapping from session to user information tells which user executes the SQL statement. The remote authorization server informs us whether the user has permission to execute the statement. If permission is granted, the request is forwarded to the actual Impala server. Otherwise, a response with an appropriate error message is constructed.

  @Override
  public TExecuteStatementResp ExecuteStatement(TExecuteStatementReq req) throws TException {
      LOG.info("Execute statement : " + req.getStatement());
      TSessionHandle sessionHandle = req.getSessionHandle();
      LocalSessionManager.SessionInfo sessionInfo = sessions.getSession(sessionHandle);
      if (sessionInfo != null) {
          LOG.info("User " + sessionInfo.username + " at " + sessionInfo.ip);
      }
      // An unknown session is rejected instead of being dereferenced
      if (sessionInfo != null
              && validateUserStatement(sessionInfo.username, sessionInfo.ip, req.getStatement())) {
          return delegate.ExecuteStatement(req);
      }
      else {
          TStatus status = new TStatus(TStatusCode.ERROR_STATUS);
          status.setErrorMessage("No permission to execute statement");
          ...
          // The status is passed to the constructor, so no separate setStatus call is needed
          TExecuteStatementResp response = new TExecuteStatementResp(status);
          return response;
      }
  }
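The body of validateUserStatement is not shown in this post, and the protocol spoken with the remote server is not specified. The sketch below therefore only illustrates one design choice that seems reasonable for such a check: failing closed when the remote server cannot be reached. `StatementValidator` and `AuthorizationClient` are hypothetical names.

```java
final class StatementValidator {
    // Hypothetical client for the remote authorization server; the call may fail.
    interface AuthorizationClient {
        boolean isAuthorized(String username, String ip, String statement) throws Exception;
    }

    private final AuthorizationClient client;

    StatementValidator(AuthorizationClient client) {
        this.client = client;
    }

    boolean validateUserStatement(String username, String ip, String statement) {
        // Without a user or a statement there is nothing to authorize.
        if (username == null || statement == null) {
            return false;
        }
        try {
            return client.isAuthorized(username, ip, statement);
        } catch (Exception e) {
            // Fail closed: an unreachable authorization server means no access.
            return false;
        }
    }
}
```

Failing open would keep queries running during an outage of the authorization server, at the cost of running them unchecked; for a multi tenant setup the closed variant looks like the safer default.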

A small note on the secure communication with HiveServer2 server

As mentioned in the Thrift post, the server binds a TTransport and a TProcessor together. The type of transport (TServerSocket, TSSLTransport, …) is determined by the hive-site parameters file and is handled by the HiveAuthFactory.

SSL

SSL is handled by the Thrift transport. It does the initial handshake before accepting any communication. HiveServer2 supports the following options in hive-site.xml

HiveServer2 authentication

Impala Server authentication

As described in the documentation, Impala server supports the following authentication methods:

* No Authentication
* User Name (it merely labels a session)
* Kerberos

Two use cases

Client with authentication

If the client can authenticate, it is enough to deploy a single multi tenant server. The generic ODBC 2.5 driver supports

Client without authentication

If the client cannot authenticate (for example, the Tableau Impala ODBC driver does not allow authentication over SSL), a multi tenant server is deployed for each client. Each client targets a dedicated port and a dedicated multi tenant server. Each multi tenant server can then use its port (or an alias) to authorize the query.
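Using the port as the tenant identity amounts to a small lookup table. The sketch below shows one way to express it; `PortTenantResolver`, the port numbers and the alias strings are hypothetical and only serve as an illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical lookup from the listening port of a multi tenant server to a tenant alias.
final class PortTenantResolver {
    private final Map<Integer, String> aliasByPort = new HashMap<>();

    // Registers the alias that stands for the client behind the given port.
    PortTenantResolver register(int port, String alias) {
        aliasByPort.put(port, alias);
        return this;
    }

    // An unregistered port yields an empty result, so the caller can deny the query.
    Optional<String> resolve(int port) {
        return Optional.ofNullable(aliasByPort.get(port));
    }
}
```

The resolved alias would take the place of the user name when the statement is submitted for authorization.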

Of course, it is possible to find intermediate solutions, such as Impala ODBC with user name authentication over an SSL tunnel.