配置更新错误

配置更改后，服务可能会进入不健康状态。这通常在用户进行无效配置更改时发生。某些配置值可能不会被更改，或可能不会被减少。要验证是否如此，对于任何错误，请检查服务的 deploy 计划。

$ dcos hdfs --name=<service-name> plan show deploy

访问日志

调度程序和所有服务节点的日志可从 DC/OS Web 界面查看。

调度程序日志对于确定节点未启动的原因会有帮助（这位于调度程序视界下）。
节点日志对于检查服务本身的问题有帮助。

在所有情况下，日志通常被管道传递到名为 stdout 和/或 stderr文件中。

要查看给定节点的日志，执行以下步骤：

访问<dcos-url> 以进入 DC/OS Web 界面。
导航至 Services 并单击要检查的服务。
在服务的任务列表中，单击要检查的任务（调度程序依据服务命名，节点则根据其类型命名为 node-<#>-server）。
在任务详细信息中，单击 Logs 选项卡进入日志查看器。默认情况下，您将查看 stdout，但 stderr 也很有用。使用右上方的下拉菜单选择要检查的文件。

您也可通过 Mesos UI 访问日志：

访问<dcos-url>/mesos 查看 Mesos UI。
单击左上方的 Frameworks 选项卡，以获取群集中运行的服务列表。
按需要导航至正确的框架。调度程序在 marathon 下运行，并有与服务名称匹配的任务名称（默认 hdfs）。服务节点在名称与服务名称匹配的框架下运行（默认 hdfs）。
您现在应该看到两个任务列表。Active Tasks 是当前正在运行的任务，Completed Tasks 是已退出的任务。单击您要检查的任务的 Sandbox 链接。
Sandbox 视图将列出名为 stdout 和 stderr的文件。单击文件名以在浏览器中查看文件，或单击 Download 将其下载到您的系统进行本地检查。请注意，很旧的任务将自动删除其沙盒，以限制磁盘空间的使用。

替换永久故障节点

DC/OS Apache HDFS 服务对临时 Pod 故障有复原能力，如果其停止运行，则会自动重新启动它们。然而，如果托管 Pod 的机器永久丢失，需要手动干预来丢弃停工的 Pod，并在新机器上重建它。

应使用以下命令获取可用 Pod 的列表。

$ dcos hdfs --name=<service-name> pod list

然后应借助上述列表中提供的适当的 pod_name，使用以下命令更换故障机器上的 Pod。

$ dcos hdfs --name=<service-name> pod replace <pod_name>

Pod 然后可通过 recovery 计划恢复。

$ dcos hdfs --name=<service-name> plan show recovery

重新启动节点

如果您必须强制重新启动 Pod 的进程，但不想清除该 Pod 的数据，使用以下命令在其当前所在的同一代理机器上重新启动该 Pod。这不会导致中断或数据丢失。

应使用以下命令获取可用 Pod 的列表。

$ dcos hdfs --name=<service-name> pod list

然后应使用以下命令重启 Pod，使用在上述列表中提供的相应的命令 pod_name 。

$ dcos hdfs --name=<service-name> pod restart <pod_name>

Pod 然后可通过 recovery 计划恢复。

$ dcos hdfs --name=<service-name> plan show recovery

任务未部署/资源匮乏

当调度程序尝试启动任务时，它将记录有关于其已收到资源的决定。在确定任务未能部署的原因时，这很有用。

最近的服务版本中，调度程序端点位于http://yourcluster.com/service/<service-name>/v1/debug/offer 将显示一个HTML表，其中包含最近评估的商品摘要. 此表的内容目前与日志中的内容非常相似，但格式更易于访问. 另外, 我们可以查看调度程序的日志 stdout。

以下示例假设一个需要预先确定资源数量的假定任务。滚动调度程序日志，我们看到了几个模式。首先，存在这样的故障，其中缺少唯一的东西是 CPU。剩余的任务需要两个 CPU，但此资源供应显然不够：

INFO  2017-04-25 19:17:13,846 [pool-8-thread-1] com.mesosphere.sdk.offer.evaluate.OfferEvaluator:evaluate(69): Offer 1: failed 1 of 14 evaluation stages:
  PASS(PlacementRuleEvaluationStage): No placement rule defined
  PASS(ExecutorEvaluationStage): Offer contains the matching Executor ID
  PASS(ResourceEvaluationStage): Offer contains sufficient 'cpus': requirement=type: SCALAR scalar { value: 0.5 }
  PASS(ResourceEvaluationStage): Offer contains sufficient 'mem': requirement=type: SCALAR scalar { value: 500.0 }
  PASS(LaunchEvaluationStage): Added launch information to offer requirement
  FAIL(ResourceEvaluationStage): Failed to satisfy required resource 'cpus': name: "cpus" type: SCALAR scalar { value: 2.0 } role: "hdfs-role" reservation { principal: "hdfs-principal" labels { labels { key: "resource_id" value: "" } } }
  PASS(ResourceEvaluationStage): Offer contains sufficient 'mem': requirement=type: SCALAR scalar { value: 8000.0 }
  PASS(MultiEvaluationStage): All child stages passed
    PASS(PortEvaluationStage): Offer contains sufficient 'ports': requirement=type: RANGES ranges { range { begin: 9042 end: 9042 } }
    PASS(PortEvaluationStage): Offer contains sufficient 'ports': requirement=type: RANGES ranges { range { begin: 9160 end: 9160 } }
    PASS(PortEvaluationStage): Offer contains sufficient 'ports': requirement=type: RANGES ranges { range { begin: 7000 end: 7000 } }
    PASS(PortEvaluationStage): Offer contains sufficient 'ports': requirement=type: RANGES ranges { range { begin: 7001 end: 7001 } }
    PASS(PortEvaluationStage): Offer contains sufficient 'ports': requirement=type: RANGES ranges { range { begin: 8609 end: 8609 } }
    PASS(PortEvaluationStage): Offer contains sufficient 'ports': requirement=type: RANGES ranges { range { begin: 8182 end: 8182 } }
    PASS(PortEvaluationStage): Offer contains sufficient 'ports': requirement=type: RANGES ranges { range { begin: 7199 end: 7199 } }
    PASS(PortEvaluationStage): Offer contains sufficient 'ports': requirement=type: RANGES ranges { range { begin: 21621 end: 21621 } }
    PASS(PortEvaluationStage): Offer contains sufficient 'ports': requirement=type: RANGES ranges { range { begin: 8983 end: 8983 } }
    PASS(PortEvaluationStage): Offer contains sufficient 'ports': requirement=type: RANGES ranges { range { begin: 7077 end: 7077 } }
    PASS(PortEvaluationStage): Offer contains sufficient 'ports': requirement=type: RANGES ranges { range { begin: 7080 end: 7080 } }
    PASS(PortEvaluationStage): Offer contains sufficient 'ports': requirement=type: RANGES ranges { range { begin: 7081 end: 7081 } }
  PASS(VolumeEvaluationStage): Offer contains sufficient 'disk'
  PASS(VolumeEvaluationStage): Offer contains sufficient 'disk'
  PASS(VolumeEvaluationStage): Offer contains sufficient 'disk'
  PASS(VolumeEvaluationStage): Offer contains sufficient 'disk'
  PASS(LaunchEvaluationStage): Added launch information to offer requirement
  PASS(ReservationEvaluationStage): Added reservation information to offer requirement

如果我们从此拒绝摘要中向上滚动，我们发现一条描述代理提供 CPU 相关资源的消息：

INFO  2017-04-25 19:17:13,834 [pool-8-thread-1] com.mesosphere.sdk.offer.MesosResourcePool:consumeUnreservedMerged(239): Offered quantity of cpus is insufficient: desired type: SCALAR scalar { value: 2.0 }, offered type: SCALAR scalar { value: 0.5 }

可以理解，当节点需要 2.0 CPU 而系统只剩 0.5 CPU 时，我们的调度程序拒绝在系统上启动节点。

我们看到的另一种模式是出于如下若干原因，所提供的资源被拒绝：

INFO  2017-04-25 19:17:14,849 [pool-8-thread-1] com.mesosphere.sdk.offer.evaluate.OfferEvaluator:evaluate(69): Offer 1: failed 6 of 14 evaluation stages:
  PASS(PlacementRuleEvaluationStage): No placement rule defined
  PASS(ExecutorEvaluationStage): Offer contains the matching Executor ID
  FAIL(ResourceEvaluationStage): Failed to satisfy required resource 'cpus': name: "cpus" type: SCALAR scalar { value: 0.5 } role: "hdfs-role" reservation { principal: "hdfs-principal" labels { labels { key: "resource_id" value: "" } } }
  PASS(ResourceEvaluationStage): Offer contains sufficient 'mem': requirement=type: SCALAR scalar { value: 500.0 }
  PASS(LaunchEvaluationStage): Added launch information to offer requirement
  FAIL(ResourceEvaluationStage): Failed to satisfy required resource 'cpus': name: "cpus" type: SCALAR scalar { value: 2.0 } role: "hdfs-role" reservation { principal: "hdfs-principal" labels { labels { key: "resource_id" value: "" } } } FAIL(ResourceEvaluationStage): Failed to satisfy required resource 'mem': name: "mem" type: SCALAR scalar { value: 8000.0 } role: "hdfs-role" reservation { principal: "hdfs-principal" labels { labels { key: "resource_id" value: "" } } }
  FAIL(MultiEvaluationStage): Failed to pass all child stages
    FAIL(PortEvaluationStage): Failed to satisfy required resource 'ports': name: "ports" type: RANGES ranges { range { begin: 9042 end: 9042 } } role: "hdfs-role" reservation { principal: "hdfs-principal" labels { labels { key: "resource_id" value: "" } } }
    FAIL(PortEvaluationStage): Failed to satisfy required resource 'ports': name: "ports" type: RANGES ranges { range { begin: 9160 end: 9160 } } role: "hdfs-role" reservation { principal: "hdfs-principal" labels { labels { key: "resource_id" value: "" } } }
    FAIL(PortEvaluationStage): Failed to satisfy required resource 'ports': name: "ports" type: RANGES ranges { range { begin: 7000 end: 7000 } } role: "hdfs-role" reservation { principal: "hdfs-principal" labels { labels { key: "resource_id" value: "" } } }
    FAIL(PortEvaluationStage): Failed to satisfy required resource 'ports': name: "ports" type: RANGES ranges { range { begin: 7001 end: 7001 } } role: "hdfs-role" reservation { principal: "hdfs-principal" labels { labels { key: "resource_id" value: "" } } }
    FAIL(PortEvaluationStage): Failed to satisfy required resource 'ports': name: "ports" type: RANGES ranges { range { begin: 8609 end: 8609 } } role: "hdfs-role" reservation { principal: "hdfs-principal" labels { labels { key: "resource_id" value: "" } } }
    FAIL(PortEvaluationStage): Failed to satisfy required resource 'ports': name: "ports" type: RANGES ranges { range { begin: 8182 end: 8182 } } role: "hdfs-role" reservation { principal: "hdfs-principal" labels { labels { key: "resource_id" value: "" } } }
    FAIL(PortEvaluationStage): Failed to satisfy required resource 'ports': name: "ports" type: RANGES ranges { range { begin: 7199 end: 7199 } } role: "hdfs-role" reservation { principal: "hdfs-principal" labels { labels { key: "resource_id" value: "" } } }
    FAIL(PortEvaluationStage): Failed to satisfy required resource 'ports': name: "ports" type: RANGES ranges { range { begin: 21621 end: 21621 } } role: "hdfs-role" reservation { principal: "hdfs-principal" labels { labels { key: "resource_id" value: "" } } }
    FAIL(PortEvaluationStage): Failed to satisfy required resource 'ports': name: "ports" type: RANGES ranges { range { begin: 8983 end: 8983 } } role: "hdfs-role" reservation { principal: "hdfs-principal" labels { labels { key: "resource_id" value: "" } } }
    FAIL(PortEvaluationStage): Failed to satisfy required resource 'ports': name: "ports" type: RANGES ranges { range { begin: 7077 end: 7077 } } role: "hdfs-role" reservation { principal: "hdfs-principal" labels { labels { key: "resource_id" value: "" } } }
    FAIL(PortEvaluationStage): Failed to satisfy required resource 'ports': name: "ports" type: RANGES ranges { range { begin: 7080 end: 7080 } } role: "hdfs-role" reservation { principal: "hdfs-principal" labels { labels { key: "resource_id" value: "" } } }
    FAIL(PortEvaluationStage): Failed to satisfy required resource 'ports': name: "ports" type: RANGES ranges { range { begin: 7081 end: 7081 } } role: "hdfs-role" reservation { principal: "hdfs-principal" labels { labels { key: "resource_id" value: "" } } }
  FAIL(VolumeEvaluationStage): Failed to satisfy required volume 'disk': name: "disk" type: SCALAR scalar { value: 10240.0 } role: "hdfs-role" disk { persistence { id: "" principal: "hdfs-principal" } volume { container_path: "hdfs-data" mode: RW } } reservation { principal: "hdfs-principal" labels { labels { key: "resource_id" value: "" } } }
  PASS(VolumeEvaluationStage): Offer contains sufficient 'disk'
  PASS(VolumeEvaluationStage): Offer contains sufficient 'disk'
  FAIL(VolumeEvaluationStage): Failed to satisfy required volume 'disk': name: "disk" type: SCALAR scalar { value: 10240.0 } role: "hdfs-role" disk { persistence { id: "" principal: "hdfs-principal" } volume { container_path: "solr-data" mode: RW } } reservation { principal: "hdfs-principal" labels { labels { key: "resource_id" value: "" } } }
  PASS(LaunchEvaluationStage): Added launch information to offer requirement
  PASS(ReservationEvaluationStage): Added reservation information to offer requirement

在这种情况下，我们发现在此系统上没有我们的任务需要的可用端口（更不用提缺少足够的 CPU 和 RAM）。当我们查看已经对其进行部署的代理时，通常会发生这种情况。此处的代理可能会为此服务运行一个现有节点，而其中我们已经将那些端口保留给自己了。

我们看到，集群中的剩余代理无一有余度适配我们的第三个节点。要解决此问题，我们需要将更多代理添加到 DC/OS 群集，或者我们需要降低服务的要求以使其适配。在后一种情况中，要了解如果资源使用量减少太多而可能导致的任何性能问题。CPU 配额不足会导致任务受限制，不足 RAM 配额将导致任务内存不足。

这是您可以通过撇开调度程序日志来执行的诊断类型的很好示例。

意外删除 Marathon 任务，而非服务

常见用户错误是从 Marathon 中删除调度程序任务，这不会自行卸载服务任务自身。如果您这样做，您有两个选择：

卸载其余服务

如果您真的想卸载服务，您只需要完成卸载下所述的正常 package uninstall 步骤。

恢复调度程序

如果您想让调度程序恢复，您可以运行 dcos package install 进程，使用您之前配置的选项。这将重新安装一个与前一个调度程序匹配的新的调度程序（假设您正确选择了您的选项），其将在上次离开的地方重新开始。为确保您不忘记配置服务的选项，我们建议在源控件中保留一份服务的 options.json 副本，以便后面轻松恢复。

“框架已删除”

该服务之前安装过同名的实例，且未正确卸载。有关完成不完全卸载的步骤，请参阅卸载。

内存不足 (OOMed) 任务

如果您没有对其提供足够的资源，您的任务可能会因为 OOM（内存不足错误）而被终止。这会通过日志任务中的意外的 Killed 消息而显示，有时持续这样，但通常不是。要验证原因是否为 OOM 错误，可以检查以下位置：

检查调度程序日志（或 dcos <svcname> pod status <podname>） 以查看给定出故障的 Pod 中的“TaskStatus”更新。
直接检查代理日志，看看是否有提及由于超过内存使用量 Mesos 代理节点杀死任务的消息。

在您确认问题确实是 OOM 错误之后，您也可以通过更新服务配置来保留更多内存加以解决。

故障排除已替换的日志节点

以下部分介绍如何执行日志节点的 replace。本指南使用日志节点 0 指示不健康的日志节点，因为它是被替换的日志节点。

替换命令

通过以下方式替换日志节点：

$ dcos hdfs --name=hdfs pod replace journal-0

检测到在`replace` 之后的不健康的日志节点

替换后的日志节点启动并运行后，您应该在 stderr 日志中看到以下内容：

org.apache.hadoop.hdfs.qjournal.protocol.JournalNotFormattedException: Journal Storage Directory

这表示此日志节点不健康。

确定健康的日志节点

从未替换的日志节点，确认日志节点是健康的：

检查 stderr 日志并检查是否存在错误。
在 journal-data/hdfs/current 中，检查以下方面：
连续 edits_xxxxx-edits_xxxxx 个文件，每个时间戳差别约 2 分钟。
一个 edits_inprogess_ 文件在过去 2 分钟内修改。

确定后，记录哪个日志节点是健康的。

修复不健康的日志节点

通过以下方式将 SSH 置入不健康日志节点的沙盒

$ dcos task exec -it journal-0 /bin/bash

在此沙盒中，创建目录 journal-data/hdfs/current：

$ mkdir -p journal-data/hdfs/current

从之前确定的健康日志节点中，复制 VERSION 文件的内容到 journal-data/hdfs/current/VERSION。
在不健康的日志节点上，创建与健康日志节点上的 VERSION 文件具有相同路径的文件： journal-data/hdfs/current/VERSION. 将复制的内容粘贴到此文件中。
通过以下方式重新启动不健康的日志节点：

$ dcos hdfs --name=hdfs pod restart journal-0

重新启动的日志节点启动并运行后，通过检查 stderr 日志确认节点已经是健康状态。您应该看到：

INFO namenode.FileJournalManager: Finalizing edits file

故障排除

诊断 HDFS 问题