Uninstalling

Uninstalling Confluent Kafka

DC/OS 1.10 or later, and package 2.0.0-x or later

If you are using DC/OS 1.10 or later and the installed service has a version later than 2.0.0-x, then uninstalling the service is simple.

From the DC/OS CLI, enter

dcos package uninstall --app-id=<service-name> confluent-kafka

Uninstall process

WARNING: Once the uninstall operation has begun, it cannot be cancelled because doing so can leave the service in an uncertain, half-destroyed state. Any data stored in reserved disk resources will be irretrievably lost.

Uninstalling the service consists of the following steps:

  1. The service scheduler is relaunched with the environment variable SDK_UNINSTALL set to “true”. This puts the service into an uninstall mode.
  2. The service scheduler performs the uninstall with the following actions:
    • All running tasks for the service are terminated.
    • All reserved resources are unreserved by the scheduler.
    • Once all known resources have been unreserved, the scheduler’s persistent state is deleted.
  3. The cluster automatically removes the scheduler task once it advertises the completion of the uninstall process.

Debugging an uninstall

There are certain situations under which the automated uninstall process cannot complete without manual intervention.

Zombie service scheduler

If the service scheduler reports in its stdout that the uninstall is complete, but the scheduler is not removed from the system, you can safely delete it via Marathon.

Missing DC/OS agent

The service will not complete the uninstall if it is not able to unreserve all the resources it has previously reserved. A common cause of this is a DC/OS agent for which the service had reserved resources, which is no longer part of the DC/OS cluster.

  • If the agent is only temporarily gone due to maintenance, the service will unreserve the resources when the agent returns to the cluster.

  • If the agent has been permanently removed, then the uninstall will hang indefinitely. The easiest way to prevent this, is to perform pod replace on any pod deployed to the lost agent before initiating the uninstall.

  • If the missing agent is only discovered after the uninstall is initiated, then the following process can be used to allow the uninstall to complete.

First determine what resources the service is waiting on.

  • CLI: dcos confluent-kafka --name=confluent-kafka plan show deploy (after running dcos package install --cli confluent-kafka if needed)
  • HTTP: https://yourcluster.com/service/confluent-kafka/v1/plans/deploy
dcos confluent-kafka --name=confluent-kafka plan show deploy
deploy (IN_PROGRESS)
├─ kill-tasks (COMPLETE)
│  ├─ kill-task-node-0-server__1a4114bc-48bb-47f6-be99-1b5ca6d55c4e (COMPLETE)
│  ├─ kill-task-node-1-server__0c42118e-04fd-40e1-b49d-0d3f71d2d243 (COMPLETE)
│  └─ kill-task-node-2-server__e00cad38-f27f-4332-b1df-5118ca480d50 (COMPLETE)
├─ unreserve-resources (IN_PROGRESS)
│  ├─ unreserve-f41351a2-b478-4e13-a94c-705f530989ef (COMPLETE)
│  ├─ unreserve-48f64612-8427-4cde-86f4-4edeb9efff37 (COMPLETE)
│  ├─ unreserve-402d51f5-6014-4ca3-bd13-324dae62b888 (PENDING)
│  ├─ unreserve-cb95e869-277f-48b9-954f-08c0d7a26bcf (PENDING)
│  ├─ unreserve-cbd748d0-df7b-4d01-b0b7-6acf915d8f98 (COMPLETE)
│  ├─ unreserve-00ed63d6-427c-4492-9713-772390cc5241 (COMPLETE)
│  ├─ unreserve-5dd56b1d-4522-4bbd-88b5-de9fa0f181f2 (PENDING)
│  └─ unreserve-c9915f07-f446-4e14-a6b4-12c8dd2f914b (COMPLETE)
└─ deregister-service (PENDING)
   └─ deregister (PENDING)

As we can see above, some of the resources to unreserve are stuck in a PENDING state. We can force them into a COMPLETE state, and thereby allow the scheduler to finish the uninstall operation. This may be done using either of the following methods:

  • CLI: dcos confluent-kafka --name=confluent-kafka plan show deploy
  • HTTP: https://yourcluster.com/service/confluent-kafka/v1/plans/deploy/forceComplete?phase=unreserve-resources&step=unreserve-<UUID>
dcos confluent-kafka --name=confluent-kafka plan force-complete deploy unreserve-resources unreserve-402d51f5-6014-4ca3-bd13-324dae62b888
dcos confluent-kafka --name=confluent-kafka plan force-complete deploy unreserve-resources unreserve-cb95e869-277f-48b9-954f-08c0d7a26bcf
dcos confluent-kafka --name=confluent-kafka plan force-complete deploy unreserve-resources unreserve-5dd56b1d-4522-4bbd-88b5-de9fa0f181f2

At this point the scheduler should show a COMPLETE state for these steps in the plan, allowing it to proceed normally with the uninstall operation:

dcos confluent-kafka --name=confluent-kafka plan show deploy
deploy (IN_PROGRESS)
├─ kill-tasks (COMPLETE)
│  ├─ kill-task-node-0-server__1a4114bc-48bb-47f6-be99-1b5ca6d55c4e (COMPLETE)
│  ├─ kill-task-node-1-server__0c42118e-04fd-40e1-b49d-0d3f71d2d243 (COMPLETE)
│  └─ kill-task-node-2-server__e00cad38-f27f-4332-b1df-5118ca480d50 (COMPLETE)
├─ unreserve-resources (COMPLETE)
│  ├─ unreserve-f41351a2-b478-4e13-a94c-705f530989ef (COMPLETE)
│  ├─ unreserve-48f64612-8427-4cde-86f4-4edeb9efff37 (COMPLETE)
│  ├─ unreserve-402d51f5-6014-4ca3-bd13-324dae62b888 (COMPLETE)
│  ├─ unreserve-cb95e869-277f-48b9-954f-08c0d7a26bcf (COMPLETE)
│  ├─ unreserve-cbd748d0-df7b-4d01-b0b7-6acf915d8f98 (COMPLETE)
│  ├─ unreserve-00ed63d6-427c-4492-9713-772390cc5241 (COMPLETE)
│  ├─ unreserve-5dd56b1d-4522-4bbd-88b5-de9fa0f181f2 (COMPLETE)
│  └─ unreserve-c9915f07-f446-4e14-a6b4-12c8dd2f914b (COMPLETE)
└─ deregister-service (PENDING)
   └─ deregister (PENDING)

NOTE: As it may take the operator some time to complete these actions, the service scheduler may be left in a zombie state after the uninstall is complete. See the "Zombie service scheduler" section above for instructions on how to correct this.

Manual uninstall

If it is not possible to recover in other ways, then the service can be manually uninstalled.

  • Delete the uninstalling scheduler from Marathon.
  • Unregister the service from Mesos using its UUID as follows:
    dcos service --inactive | grep confluent-kafka
    confluent-kafka     False     3    3.3  6240.0  15768.0  <UUID>
    
    dcos service shutdown <UUID>
    
  • Run janitor.py to clean up the remaining reserved resources, as described in the DC/OS 1.9 instructions below.

DC/OS 1.9 and earlier, or packages earlier than 2.0.0-x

If you are running DC/OS 1.9 or earlier, or a version of the Confluent Kafka service that is earlier than 2.0.0-x, follow these steps:

  1. Stop the service. From the DC/OS CLI, enter dcos package uninstall --app-id=<service-name> confluent-kafka.
  2. Clean up remaining reserved resources with the framework cleaner script, janitor.py.
MY_SERVICE_NAME=<service-name>
dcos package uninstall --app-id=$MY_SERVICE_NAME confluent-kafka`.
dcos node ssh --master-proxy --leader "docker run mesosphere/janitor /janitor.py \
    -r $MY_SERVICE_NAME-role \
    -p $MY_SERVICE_NAME-principal \
    -z dcos-service-$MY_SERVICE_NAME"

If you use a strict-mode clusters should also include either of the following credentials as janitor.py arguments:

  • -a <token>: Token to be used for authentication, such as the one printed by running dcos config show core.dcos_acs_token in the DC/OS CLI
  • -u <username> -p <password>: Username/password for a user which may be used to clean up the resources in question.

See DC/OS documentation for more information about the framework cleaner script.