Spark 2.11.0-2.4.6

Documentation for DC/OS Apache Spark 2.11.0-2.4.6

Welcome to the documentation for DC/OS Apache Spark. For more information about new and changed features, see the release notes.

Apache Spark is a fast, general-purpose cluster computing system for big data. It provides high-level APIs in Scala, Java, Python, and R, and an optimized engine that supports general computation graphs for data analysis. It also supports a rich set of higher-level tools, including the following:

  • Spark SQL for SQL and DataFrames
  • MLlib for machine learning
  • GraphX for graph processing
  • Spark Streaming for stream processing

For more information, see the Apache Spark documentation.

DC/OS Apache Spark consists of Apache Spark with a few custom commits and DC/OS-specific packaging.

DC/OS Apache Spark provides the following benefits and features.

Benefits

  • Utilization: DC/OS Apache Spark leverages Mesos to run Spark on the same cluster as other DC/OS services
  • Improved efficiency
  • Simple management
  • Multi-team support
  • Interactive analytics through notebooks
  • UI integration
  • Security, including file-based and environment-based secrets

Features

  • Multiversion support
  • Run multiple Spark dispatchers
  • Run against multiple HDFS clusters
  • Backports of scheduling improvements
  • Simple installation of all Spark components, including the dispatcher and the history server
  • Integration of the dispatcher and history server
  • Zeppelin integration
  • Kerberos and SSL support

Related services