

Using Kerberos with DC/OS Data Science Engine to retrieve and write data securely

Kerberos is an authentication system that allows DC/OS Data Science Engine to retrieve and write data securely to a Kerberos-enabled HDFS cluster. Long-running jobs will renew their delegation tokens (authentication credentials). This section assumes you have previously set up a Kerberos-enabled HDFS cluster.

Configuring Kerberos with DC/OS Data Science Engine

DC/OS Data Science Engine and all Kerberos-enabled components need a valid krb5.conf configuration file. The krb5.conf file tells data-science-engine how to connect to your Kerberos key distribution center (KDC). You can specify properties for the krb5.conf file with the following options.

  "security": {
    "kerberos": {
      "enabled": true,
      "kdc": {
        "hostname": "<kdc_hostname>",
        "port": <kdc_port>
      },
      "primary": "<primary_for_principal>",
      "realm": "<kdc_realm>",
      "keytab_secret": "<path_to_keytab_secret>"
    }
  }

Make sure your keytab file is in the DC/OS secret store, under a path that is accessible by the data-science-engine service.
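
As a rough illustration of how these options might be supplied, the sketch below writes them to an options file and checks that the file is valid JSON. The host name, realm, secret path, and file name are hypothetical example values, and the top-level "service" wrapper is an assumption; check all of them against your cluster and the package's configuration schema.

```shell
set -eu

# Hypothetical example values; replace them with the settings of your own KDC.
KDC_HOSTNAME="kdc.example.com"
KDC_PORT=88                                  # conventional Kerberos port
PRIMARY="jupyter"
REALM="EXAMPLE.COM"
KEYTAB_SECRET="data-science-engine/keytab"   # hypothetical secret path
OPTIONS_FILE="kerberos-options.json"         # hypothetical file name

# Assumption: the "security" block sits under a top-level "service" key.
cat > "$OPTIONS_FILE" <<EOF
{
  "service": {
    "security": {
      "kerberos": {
        "enabled": true,
        "kdc": {
          "hostname": "${KDC_HOSTNAME}",
          "port": ${KDC_PORT}
        },
        "primary": "${PRIMARY}",
        "realm": "${REALM}",
        "keytab_secret": "${KEYTAB_SECRET}"
      }
    }
  }
}
EOF

# Validate the JSON syntax, but only when python3 happens to be installed.
if command -v python3 >/dev/null 2>&1; then
  python3 -m json.tool "$OPTIONS_FILE" >/dev/null && echo "valid JSON: $OPTIONS_FILE"
fi
```

If the package is published as data-science-engine in your catalog, a file like this could be passed at install time with dcos package install data-science-engine --options=kerberos-options.json.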

Example: Using HDFS with Spark in a Kerberized Environment

This example runs TensorFlow on Spark from a notebook, using HDFS as the storage backend in a Kerberized environment. Before you begin, make sure that the HDFS service is installed and that DC/OS Data Science Engine is configured with its endpoint. For details on configuring the HDFS integration, see the Using HDFS with DC/OS Data Science Engine section.

  1. Make sure the HDFS Client service is installed and running with the “Kerberos enabled” option.

  2. Run the following commands to set up a directory on HDFS with proper permissions:

    # Suppose the HDFS Client version you are running is "2.6.0-cdh5.9.1"; then the commands will be:
    dcos task exec hdfs-client /bin/bash -c '/hadoop-2.6.0-cdh5.9.1/bin/hdfs dfs -mkdir -p /data-science-engine'
    # Suppose the name of the primary mentioned above is "jupyter"
    dcos task exec hdfs-client /bin/bash -c '/hadoop-2.6.0-cdh5.9.1/bin/hdfs dfs -chown jupyter:jupyter /data-science-engine'
    dcos task exec hdfs-client /bin/bash -c '/hadoop-2.6.0-cdh5.9.1/bin/hdfs dfs -chmod 700 /data-science-engine'
  3. Launch Terminal from the Notebook UI.

  4. Clone the TensorFlowOnSpark repository and download a sample dataset:

    rm -rf TensorFlowOnSpark && git clone
    rm -rf mnist && mkdir mnist
    curl -fsSL -O
    unzip -d mnist/
  5. List the files in the target HDFS directory and remove the directory if it already exists:

    hdfs dfs -ls -R /data-science-engine/mnist_kerberos && hdfs dfs -rm -R /data-science-engine/mnist_kerberos
  6. Generate sample data and save it to HDFS:

    spark-submit \
      --verbose \
      $(pwd)/TensorFlowOnSpark/examples/mnist/ \
      --output /data-science-engine/mnist_kerberos/csv \
      --format csv
    hdfs dfs -ls -R /data-science-engine/mnist_kerberos
  7. Train the model and checkpoint it to the target directory in HDFS. You need to specify two additional options to distribute the Kerberos ticket cache file to the executors: --files <Kerberos ticket cache file> and --conf spark.executorEnv.KRB5CCNAME="/mnt/mesos/sandbox/krb5cc_99". The executors use the ticket cache file to authenticate with the Kerberized HDFS:

    spark-submit \
      --files /tmp/krb5cc_99 --conf spark.executorEnv.KRB5CCNAME="/mnt/mesos/sandbox/krb5cc_99" \
      --verbose \
      --py-files $(pwd)/TensorFlowOnSpark/examples/mnist/spark/ \
      $(pwd)/TensorFlowOnSpark/examples/mnist/spark/ \
      --cluster_size 4 \
      --images /data-science-engine/mnist_kerberos/csv/train/images \
      --labels /data-science-engine/mnist_kerberos/csv/train/labels \
      --format csv \
      --mode train \
      --model /data-science-engine/mnist_kerberos/mnist_csv_model
  8. Verify that the model has been saved.

    hdfs dfs -ls -R /data-science-engine/mnist_kerberos/mnist_csv_model
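
The two Kerberos options from step 7 follow a simple pattern: the local ticket cache is shipped with --files, and KRB5CCNAME on the executors points at the same file name inside the Mesos sandbox. The sketch below derives both options from a cache path. The helper name kerberos_submit_opts is hypothetical and not part of Spark or DC/OS, and the sandbox directory default is taken from the example in step 7.

```shell
set -eu

# Hypothetical helper: print the two spark-submit options that ship a
# Kerberos ticket cache to the executors.
#   $1  path of the local ticket cache file
#   $2  executor sandbox directory (defaults to the path used in step 7)
kerberos_submit_opts() {
  cache_file="$1"
  sandbox_dir="${2:-/mnt/mesos/sandbox}"
  # Assumption, based on step 7: the shipped file keeps its base name
  # inside the executor sandbox.
  printf '%s\n' "--files $cache_file --conf spark.executorEnv.KRB5CCNAME=$sandbox_dir/$(basename "$cache_file")"
}

# Use KRB5CCNAME if it is set, otherwise fall back to the path from step 7.
CACHE_FILE="${KRB5CCNAME:-/tmp/krb5cc_99}"
CACHE_FILE="${CACHE_FILE#FILE:}"   # drop the optional FILE: prefix

# Warn early instead of failing later on the executors.
if [ ! -r "$CACHE_FILE" ]; then
  echo "warning: ticket cache $CACHE_FILE not found or unreadable" >&2
fi

kerberos_submit_opts "$CACHE_FILE"
```

The printed options can be pasted into the spark-submit command from step 7. With MIT Kerberos tools installed, klist -c /tmp/krb5cc_99 shows whether the cache still holds an unexpired ticket before you submit the job.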