Release Notes

ENTERPRISE

Discover the new features, updates, and known limitations in this release of the DC/OS Storage Service.

Release notes for DC/OS Storage Service version 1.0.0

  • This version of the DC/OS Storage Service is considered Generally Available and all users are strongly encouraged to upgrade from previous versions.
  • This version of the DC/OS Storage Service requires DC/OS Enterprise version 1.13.2 or later.

New Features

  • The devices volume provider now blacklists descendants of blacklisted devices by default. To override this default, blacklist a device explicitly with the blacklist-exactly configuration option.
  • The devices and lvm volume providers now emit metrics. For more information, see New Metrics.
  • DSS can manage storage providers and volumes on agents that also advertise GPU resources.
  • Operators can scrub volume removal operations that will never complete due to interrupted DESTROY_DISK operations.
  • Operators can scrub local volumes and local volume providers that DSS reports as MISSING.
  • Volume remove operations can be canceled. If no Mesos operations have been issued to remove the volume, you can cancel the removal request.
  • Operators can more easily remove failing providers from a node.
  • The dcos storage volume create command accepts creation parameters via a JSON file or stdin.
  • The dcos storage ... commands accept a -v flag to toggle verbose logging.
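
As a rough illustration of the descendant rule in the first item of this list, the following Python sketch computes which devices end up excluded. The device tree, the function name, and the reading of blacklist-exactly as "this device only, without its descendants" are assumptions made for illustration; none of this is taken from the DSS implementation.

```python
def effective_blacklist(parents, blacklist, blacklist_exactly):
    """Return the set of devices excluded under the assumed rules.

    parents: dict mapping each device name to its parent (None for roots).
    blacklist: devices whose descendants are excluded as well.
    blacklist_exactly: devices excluded on their own (assumed semantics).
    """
    excluded = set(blacklist_exactly)
    for device in parents:
        # Walk up the ancestor chain; a blacklisted ancestor (or the
        # device itself) excludes the device.
        node = device
        while node is not None:
            if node in blacklist:
                excluded.add(device)
                break
            node = parents.get(node)
    return excluded


# Hypothetical device tree: two disks, one of them partitioned.
parents = {"sda": None, "sda1": "sda", "sda2": "sda", "sdb": None}

print(sorted(effective_blacklist(parents, {"sda"}, set())))  # disk and partitions
print(sorted(effective_blacklist(parents, set(), {"sda"})))  # disk only
```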

Updates

  • Additional logging of API requests and responses.
  • Enforce uniqueness of device provider names.
  • More robust enforcement of non-overlapping devices among multiple lvm volume providers.
  • Device provider creation validates that the target node is known to DSS.
  • Prevent volume lifecycle operations when the parent provider is being modified, or is otherwise not ready.
  • Prevent provider modifications when that provider has an in-progress volume operation.
  • Removed permissions that are no longer needed by the storage principal (related to marathon, package, storage service).
  • DSS running on permissive mode clusters requires storage principal configuration.
  • DSS running on strict mode clusters requires enforce-authorization to be enabled.
  • dcos storage ... list commands display results in sorted order.
  • dcos storage provider list table header STATE is now called STATUS (for consistency).
  • Removed the --all flag from dcos storage provider list.
  • The --timeout flag sets a client-side timeout after which the CLI aborts its operation instead of relying on the server to time out the operation. The CLI keeps retrying internally until the operation succeeds, fails with a non-timeout error, or the timeout elapses.
  • Removed the previously deprecated “Artifacts Container” installation method.
  • Secondary DSS instances will refuse to start if a primary instance is already running.
  • Actively monitor Mesos heartbeats to DSS and trigger re-connection as needed.
  • The DSS package includes a LICENSES file that contains copies of all OSS licenses.
  • Service bug fixes, performance fixes, security fixes, as well as other doc fixes and improvements.
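
The --timeout behavior described in this list can be pictured as a client-side retry loop. The Python sketch below only shows the shape of such a loop; the function name, the exception class, and the pause between attempts are hypothetical and do not reflect the CLI's actual code.

```python
import time


class AttemptTimedOut(Exception):
    """Hypothetical marker for a single attempt that timed out."""


def call_with_deadline(attempt, timeout_s, pause_s=0.0, clock=time.monotonic):
    """Retry attempt() until it succeeds, fails with a non-timeout error,
    or the overall client-side deadline passes."""
    deadline = clock() + timeout_s
    while True:
        try:
            # Success ends the loop. Any exception other than
            # AttemptTimedOut propagates to the caller immediately.
            return attempt()
        except AttemptTimedOut:
            if clock() >= deadline:
                raise  # overall timeout reached: give up
            time.sleep(pause_s)


# Usage: an attempt that times out twice before succeeding.
outcomes = iter([AttemptTimedOut(), AttemptTimedOut(), "ok"])


def flaky():
    item = next(outcomes)
    if isinstance(item, Exception):
        raise item
    return item


print(call_with_deadline(flaky, timeout_s=5))
```

The deadline is tracked by the client, so the loop stops on its own schedule even if the server never reports a timeout.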

Limitations

  • Only local volume storage is currently supported.
  • Only manual upgrades of a running DC/OS Storage Service on an existing cluster are supported at this time.
  • Volume size must be a multiple of 4 MiB, which is the default size of an LVM extent. Otherwise, DSS returns an error when attempting to create the volume.
  • When planning to manually remove a logical volume via lvremove, the operator is responsible for zeroing the volume prior to removal.
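
To make the 4 MiB rule concrete, here is a small Python sketch that checks a requested size and rounds it up to the next valid value. Expressing sizes as whole MiB and the helper names are assumptions for illustration; DSS itself performs the authoritative validation.

```python
EXTENT_MIB = 4  # default LVM extent size, per the limitation above


def is_valid_size_mib(size_mib):
    """True if size_mib is a positive multiple of the 4 MiB extent size."""
    return size_mib > 0 and size_mib % EXTENT_MIB == 0


def round_up_to_extent_mib(size_mib):
    """Smallest multiple of 4 MiB that is greater than or equal to size_mib."""
    return -(-size_mib // EXTENT_MIB) * EXTENT_MIB


for requested in (4, 10, 1024):
    print(requested, is_valid_size_mib(requested), round_up_to_extent_mib(requested))
```

For example, a request for 10 MiB is not valid as given; the next valid size is 12 MiB.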

Known Issues

  • In the event of an unexpected device and/or volume change on an agent, you must restart the agent for the devices and lvm providers to reconcile the condition. For example, if you add or remove devices, restart the agent to update the devices volume provider with the changes.
  • dcos storage CLI subcommands may fail with a gateway timeout error, but still complete successfully in the background.
  • The Mesos SLRP implementation is not yet compatible with multiple profiles that consume capacity from the same provider in different ratios (for example, RAID1 and linear). To work around this, create multiple providers, each of which is wholly dedicated to linear or RAID1.
  • The storage service should only list providers that it currently manages; incompletely removed providers may be incorrectly listed in some cases.
  • Deleting a volume may fail with “Cannot allocate memory” on some versions of CoreOS. To avoid this issue, ensure you are using a supported version of CoreOS.
  • Kernel versions 3.10.0-862.6.3.el7 through 3.10.0-862.11.6.el7 (inclusive) may panic as a result of LVM operations (https://access.redhat.com/solutions/3520511).
  • The DC/OS installer may issue one or more WARNING messages regarding missing kernel modules:
    Checking if kernel module raid1 is loaded: WARNING Kernel module raid1 is not loaded. DC/OS Storage Service (DSS) depends on it.
    Checking if kernel module dm_raid is loaded: WARNING Kernel module dm_raid is not loaded. DC/OS Storage Service (DSS) depends on it.
    
    To resolve the issue, configure the raid1 and dm_raid kernel modules to load at OS boot time.
  • Using NVMe storage with DSS may require additional modifications to the underlying OS. For more information, see these suggested commands and helper scripts.
  • Device names (for example, sda) used to create volume providers can change over time. Take precautions to guard against this.
  • The DC/OS UI shows an incorrect unit for the DC/OS Storage volume size in the service create modal: the value is treated as MiB, not GiB as the UI states.
  • The total disk resources reported for the DC/OS cluster are inflated because DSS devices are double-counted.
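
For the kernel panic issue listed above, a quick way to check whether a node falls in the affected range is to compare its kernel release string against the two bounds. The Python sketch below is one way to do that, assuming the el7 release naming shown in the list; the helper names are hypothetical, and the linked Red Hat article remains the authoritative source.

```python
import re

FIRST_AFFECTED = "3.10.0-862.6.3.el7"
LAST_AFFECTED = "3.10.0-862.11.6.el7"


def el7_key(release):
    """Parse an el7 kernel release into (version, build) integer tuples.

    Returns None for strings that do not look like an el7 kernel release.
    """
    match = re.match(r"^(\d+(?:\.\d+)*)-(\d+(?:\.\d+)*)\.el7", release)
    if match is None:
        return None
    version, build = match.groups()
    return (
        tuple(int(part) for part in version.split(".")),
        tuple(int(part) for part in build.split(".")),
    )


def is_affected(release):
    """True if release lies within the affected range, bounds included."""
    key = el7_key(release)
    if key is None:
        return False
    return el7_key(FIRST_AFFECTED) <= key <= el7_key(LAST_AFFECTED)


# On a live node the string could come from platform.release() or `uname -r`.
for release in ("3.10.0-862.9.1.el7.x86_64", "3.10.0-957.1.3.el7.x86_64"):
    print(release, is_affected(release))
```

Comparing integer tuples rather than raw strings matters here: as text, "11" sorts before "6", which would put the upper bound below the lower one.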

New Metrics

All metrics related to the DC/OS Storage Service have a prefix of csidevices_, csilvm_, or dss_.

New devices provider metrics

  • csidevices_uptime: the uptime (in seconds) of the process
  • csidevices_requests: number of requests served, tagged by:
    • result_type: one of success, error
    • method: the RPC name, e.g., /csi.v0.Controller/ListVolumes
  • csidevices_requests_latency_(stddev,mean,lower,count,sum,upper): the request duration (in milliseconds), tagged by:
    • method: the RPC name, e.g., /csi.v0.Controller/ListVolumes
  • csidevices_devices: the number of devices reported by ListVolumes
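
The parenthesized list in the latency entry reads as shorthand for six separate metric names. Assuming that reading is correct, the following Python sketch expands the shorthand; the helper name is hypothetical, and the same expansion would apply to the csilvm_ prefix.

```python
LATENCY_SUFFIXES = ("stddev", "mean", "lower", "count", "sum", "upper")


def latency_metric_names(prefix):
    """Expand <prefix>_requests_latency_(...) into concrete metric names."""
    return [f"{prefix}_requests_latency_{suffix}" for suffix in LATENCY_SUFFIXES]


for name in latency_metric_names("csidevices"):
    print(name)
```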

New lvm volume provider metrics

  • csilvm_uptime: the uptime (in seconds) of the process
  • csilvm_requests: number of requests served, tagged by:
    • result_type: one of success, error
    • method: the RPC name, e.g., /csi.v0.Controller/CreateVolume
  • csilvm_requests_latency_(stddev,mean,lower,count,sum,upper): the request duration (in milliseconds), tagged by:
    • method: the RPC name, e.g., /csi.v0.Controller/CreateVolume
  • csilvm_volumes: the number of active logical volumes
  • csilvm_bytes_total: the total number of bytes in the volume group
  • csilvm_bytes_free: the number of bytes available for creating a linear logical volume
  • csilvm_bytes_used: the number of bytes allocated to active logical volumes
  • csilvm_pvs: the number of physical volumes in the volume group
  • csilvm_missing_pvs: the number of PVs given on the command line that are not found in the volume group
  • csilvm_unexpected_pvs: the number of PVs found in the volume group that were not given on the command line
  • csilvm_lookup_pv_errs: the number of errors encountered while looking up PVs specified on the command line
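
As one example of how these metrics might be consumed, the Python sketch below derives a usage fraction from csilvm_bytes_total and csilvm_bytes_used and lists any physical volume counters that are non-zero. The sample values are made up, and treating a non-zero counter as a problem worth investigating is an assumption, not documented DSS behavior.

```python
PV_COUNTERS = ("csilvm_missing_pvs", "csilvm_unexpected_pvs", "csilvm_lookup_pv_errs")


def volume_group_usage(bytes_total, bytes_used):
    """Fraction of the volume group allocated to active logical volumes."""
    if bytes_total <= 0:
        return 0.0
    return bytes_used / bytes_total


def pv_problems(sample):
    """Names of the physical volume counters that are non-zero."""
    return [name for name in PV_COUNTERS if sample.get(name, 0) > 0]


# Made-up sample for a single lvm volume provider.
sample = {
    "csilvm_bytes_total": 100 * 2**30,
    "csilvm_bytes_used": 25 * 2**30,
    "csilvm_missing_pvs": 1,
    "csilvm_unexpected_pvs": 0,
    "csilvm_lookup_pv_errs": 0,
}

usage = volume_group_usage(sample["csilvm_bytes_total"], sample["csilvm_bytes_used"])
print(f"usage: {usage:.0%}")
print("non-zero PV counters:", pv_problems(sample))
```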

New DSS metrics

  • dss_agent_lookups_hits: number of successful agent address lookups (via cache)
  • dss_agent_lookups_misses: number of failed agent address lookups (via cache)
  • dss_mesosclient_master_getAgents_shared: count of coalesced API calls
  • dss_obj_providers_missing: number of MISSING providers
  • dss_obj_volumes_missing: number of MISSING volumes
  • dss_ops_providers_create: duration of provider create operations
  • dss_ops_providers_modify: duration of provider modify operations
  • dss_ops_providers_remove: duration of provider remove operations
  • dss_ops_volumes_create: duration of volume create operations
  • dss_ops_volumes_remove: duration of volume remove operations
  • dss_sched_hb_disabled: non-zero if the scheduler is subscribed to Mesos without heartbeats enabled
  • dss_sched_hb_missed: number of missed Mesos heartbeats
  • dss_sched_hb_missed2Many: number of times that consecutively missed Mesos heartbeats triggered a reconnection to Mesos