Kerberos is an authentication system that allows DC/OS Data Science Engine to read data from and write data to a Kerberos-enabled HDFS cluster securely. Long-running jobs renew their delegation tokens (authentication credentials).

This guide assumes you have already set up a Kerberos-enabled HDFS cluster.

Configuring Kerberos with DC/OS Data Science Engine

DC/OS Data Science Engine and all Kerberos-enabled components need a valid krb5.conf configuration file. The krb5.conf file tells data-science-engine how to connect to your Kerberos key distribution center (KDC). You can specify the properties for the krb5.conf file with the following service options:

{
  "security": {
    "kerberos": {
      "enabled": true,
      "kdc": {
        "hostname": "<kdc_hostname>",
        "port": <kdc_port>
      },
      "primary": "<primary_for_principal>",
      "realm": "<kdc_realm>",
      "keytab_secret": "<path_to_keytab_secret>"
    }
  }
}
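
As a minimal sketch of how these options might be filled in, the following script writes the Kerberos section to an options file. All values (the KDC hostname, port, primary, realm, secret path, and the file name options.json) are hypothetical placeholders rather than defaults of the service; the install command in the final comment assumes the package is named data-science-engine.

```shell
set -eu

# Hypothetical example values; replace them with the settings of your own KDC.
KDC_HOSTNAME="kdc.example.com"
KDC_PORT=88
PRIMARY="jupyter"
REALM="EXAMPLE.COM"
KEYTAB_SECRET="data-science-engine/keytab"
OPTIONS_FILE="options.json"

# Write the Kerberos section of the service options to a file.
cat > "${OPTIONS_FILE}" <<EOF
{
  "security": {
    "kerberos": {
      "enabled": true,
      "kdc": {
        "hostname": "${KDC_HOSTNAME}",
        "port": ${KDC_PORT}
      },
      "primary": "${PRIMARY}",
      "realm": "${REALM}",
      "keytab_secret": "${KEYTAB_SECRET}"
    }
  }
}
EOF

echo "Wrote ${OPTIONS_FILE}"
# The file can then be passed to the package installer, for example:
#   dcos package install data-science-engine --options=options.json
```

Note that the port is written without quotes so that it stays a JSON number, while the other values are JSON strings.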

Make sure your keytab file is in the DC/OS secret store, under a path that is accessible by the data-science-engine service.

Example: Using HDFS with Spark in a Kerberized Environment

Here is an example notebook of TensorFlow on Spark using HDFS as a storage backend in a Kerberized environment.

First, make sure that the HDFS service is installed and that DC/OS Data Science Engine is configured with its endpoint. For details on configuring the HDFS integration of DC/OS Data Science Engine, see the Using HDFS with DC/OS Data Science Engine section.

Make sure the HDFS Client service is installed and running with the “Kerberos enabled” option.

Run the following commands to set up a directory on HDFS with proper permissions:

# Suppose the HDFS Client version you are running is "2.6.0-cdh5.9.1"; then the commands will be:
dcos task exec hdfs-client /bin/bash -c '/hadoop-2.6.0-cdh5.9.1/bin/hdfs dfs -mkdir -p /data-science-engine'
# Suppose the name of the primary mentioned above is "jupyter"
dcos task exec hdfs-client /bin/bash -c '/hadoop-2.6.0-cdh5.9.1/bin/hdfs dfs -chown jupyter:jupyter /data-science-engine'
dcos task exec hdfs-client /bin/bash -c '/hadoop-2.6.0-cdh5.9.1/bin/hdfs dfs -chmod 700 /data-science-engine'
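
Because the Hadoop version and the primary appear in every command, it may help to keep them in variables. The sketch below only prints the three commands (a dry run) so they can be reviewed first. The variable names and the helper function hdfs_client_cmd are illustrative and not part of any DC/OS tooling.

```shell
set -eu

# Assumed values: adjust them to your HDFS Client version and principal primary.
HADOOP_VERSION="2.6.0-cdh5.9.1"
PRIMARY="jupyter"
TARGET_DIR="/data-science-engine"
HDFS_BIN="/hadoop-${HADOOP_VERSION}/bin/hdfs"

# Print the dcos command that runs one "hdfs dfs" subcommand in the hdfs-client task.
hdfs_client_cmd() {
  printf "dcos task exec hdfs-client /bin/bash -c '%s dfs %s'\n" "${HDFS_BIN}" "$1"
}

# Dry run: the commands are printed, not executed.
hdfs_client_cmd "-mkdir -p ${TARGET_DIR}"
hdfs_client_cmd "-chown ${PRIMARY}:${PRIMARY} ${TARGET_DIR}"
hdfs_client_cmd "-chmod 700 ${TARGET_DIR}"
```

Once the printed commands look right, they can be copied into a shell where the DC/OS CLI is attached to the cluster.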

Launch Terminal from the Notebook UI.

Clone the TensorFlowOnSpark repository and download a sample dataset:

rm -rf TensorFlowOnSpark && git clone https://github.com/yahoo/TensorFlowOnSpark
rm -rf mnist && mkdir mnist
curl -fsSL -O https://infinity-artifacts.s3-us-west-2.amazonaws.com/jupyter/mnist.zip
unzip -d mnist/ mnist.zip

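List the files in the target HDFS directory and, if the listing succeeds (that is, the directory already exists), remove the directory so that the example starts from a clean state.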

hdfs dfs -ls -R /data-science-engine/mnist_kerberos && hdfs dfs -rm -R /data-science-engine/mnist_kerberos

Generate sample data and save it to HDFS.

spark-submit \
  --verbose \
  $(pwd)/TensorFlowOnSpark/examples/mnist/mnist_data_setup.py \
  --output /data-science-engine/mnist_kerberos/csv \
  --format csv

hdfs dfs -ls -R /data-science-engine/mnist_kerberos

Train the model and checkpoint it to the target directory in HDFS.

You will need to specify two additional options to distribute the Kerberos ticket cache file to executors: --files <Kerberos ticket cache file> and --conf spark.executorEnv.KRB5CCNAME="/mnt/mesos/sandbox/krb5cc_99". The Kerberos ticket cache file will be used by executors for authentication with Kerberized HDFS:

spark-submit \
  --files /tmp/krb5cc_99 --conf spark.executorEnv.KRB5CCNAME="/mnt/mesos/sandbox/krb5cc_99" \
  --verbose \
  --py-files $(pwd)/TensorFlowOnSpark/examples/mnist/spark/mnist_dist.py \
  $(pwd)/TensorFlowOnSpark/examples/mnist/spark/mnist_spark.py \
  --cluster_size 4 \
  --images /data-science-engine/mnist_kerberos/csv/train/images \
  --labels /data-science-engine/mnist_kerberos/csv/train/labels \
  --format csv \
  --mode train \
  --model /data-science-engine/mnist_kerberos/mnist_csv_model
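
The ticket cache file name in the example above (krb5cc_99) depends on the user running the notebook. As a rough sketch, the two Kerberos options could be derived rather than hard-coded. This assumes a file-based ticket cache, with KRB5CCNAME taking precedence and /tmp/krb5cc_<uid> as the fallback location; the function names are illustrative only.

```shell
set -eu

# Assumption: a FILE-type ticket cache. KRB5CCNAME wins if it is set; otherwise
# fall back to /tmp/krb5cc_<uid> for the current user.
ticket_cache_path() {
  cache="${KRB5CCNAME:-/tmp/krb5cc_$(id -u)}"
  # Strip an optional "FILE:" prefix so the result is a plain file path.
  printf '%s\n' "${cache#FILE:}"
}

# Print the two spark-submit options for a given ticket cache file.
# /mnt/mesos/sandbox is the executor sandbox directory used in the example above.
kerberos_submit_opts() {
  printf '%s %s --conf spark.executorEnv.KRB5CCNAME=/mnt/mesos/sandbox/%s\n' \
    "--files" "$1" "$(basename "$1")"
}

kerberos_submit_opts "$(ticket_cache_path)"
```

The printed options can be placed directly after spark-submit in the training command. Running klist beforehand is a quick way to confirm that the cache holds a valid ticket.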

Verify that the model has been saved.

hdfs dfs -ls -R /data-science-engine/mnist_kerberos/mnist_csv_model
