Introduction

This article documents a case in which Elasticsearch began returning errors because Docker containers and images had filled the disk, and walks through the investigation and the fix. We hope it serves as a useful reference for anyone facing a similar issue.

The Problem

The following error occurred in a running Elasticsearch instance:

{
  "error": {
    "type": "search_phase_execution_exception",
    "reason": "all shards failed",
    "phase": "query",
    ...
  },
  "status": 503
}

Initial investigation revealed that the indices had entered the close state, so insufficient disk space was suspected as the cause.
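As a quick way to confirm this symptom, the index state can be checked via the _cat API. This is a sketch assuming Elasticsearch listens on localhost:9200 (adjust `ES` for your environment); closed indices show `close` in the status column.

```shell
# Assumption: Elasticsearch is reachable at localhost:9200 (adjust ES).
# Closed indices appear with "close" in the status column.
ES="${ES:-localhost:9200}"
curl -s --max-time 5 "http://$ES/_cat/indices?v&h=health,status,index,store.size" \
  || echo "Elasticsearch not reachable at $ES"
```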

Investigating Disk Usage

Checking Root Directory Usage

First, we checked the overall disk usage of the system.

sudo du -h --max-depth=1 / | sort -hr | head -n 20

Output:

60G     /
50G     /var
4.7G    /usr
2.1G    /home
1.2G    /opt
...

The /var directory was found to be abnormally large at 50 GB.
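Note that du sums file sizes under a path, while df reports usage at the filesystem level; Elasticsearch's disk-based shard allocation watermarks are evaluated against the filesystem figures, not individual directories, so it is worth checking both.

```shell
# Filesystem-level view of the same disk; the Avail/Use% columns are what
# disk-watermark style checks actually react to
df -h /var
```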

Detailed Investigation of /var Directory

sudo du -h --max-depth=1 /var | sort -hr

Output:

50G     /var
49G     /var/lib
342M    /var/log
240M    /var/cache
128M    /var/spool
...

Since /var/lib accounted for nearly the entire volume, we investigated further.

sudo du -h --max-depth=1 /var/lib | sort -hr

Output:

49G     /var/lib
49G     /var/lib/docker
256M    /var/lib/snapd
128M    /var/lib/apt
...

Root cause identified: Docker data was occupying 49 GB.

Analyzing Docker Disk Usage

We checked Docker’s detailed usage.

docker system df

Output:

TYPE            TOTAL     ACTIVE    SIZE      RECLAIMABLE
Images          38        5         39.8GB    35.99GB (90%)
Containers      5         4         10.44MB   0B (0%)
Local Volumes   2         1         646MB     32.57kB (0%)
Build Cache     129       0         2.972GB   2.972GB (100%)

Analysis Results

  • Images: 33 out of 38 (approximately 36 GB) were unused
  • Build Cache: All 3 GB were reclaimable
  • Containers: Most were active and not candidates for deletion
  • Volumes: Nearly all in use
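Before deleting anything, it can help to list the removal candidates explicitly. The sketch below assumes the docker CLI is installed and the daemon is reachable (falling back to a message otherwise); the `dangling=true` filter matches untagged images.

```shell
# List removal candidates before pruning: untagged (dangling) images and
# exited containers. Falls back to a message if docker is unavailable.
if command -v docker >/dev/null 2>&1; then
  { docker images --filter "dangling=true" &&
    docker ps --all --filter "status=exited"; } ||
    echo "Docker daemon not reachable"
else
  echo "docker CLI not available"
fi
```

For a per-image and per-volume breakdown of the sizes in the table above, `docker system df -v` prints verbose output.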

Performing Cleanup

Bulk Cleanup Command

The following command was used to remove all unused resources at once.

docker system prune -a --volumes

This command removes:

  • Stopped containers
  • Unused images (with -a, all images not referenced by any container are removed, not just dangling ones)
  • Unused networks
  • Unused volumes (the --volumes option; deleted volume data cannot be recovered, so use with care)
  • Build cache

Result

WARNING! This will remove:
  - all stopped containers
  - all networks not used by at least one container
  - all images without at least one container associated to them
  - all build cache

Are you sure you want to continue? [y/N] y

Total reclaimed space: 39.2GB

Approximately 39 GB of free disk space was recovered.

Preventing Recurrence

Configuring Docker Log Rotation

To prevent Docker container logs from accumulating indefinitely, we edited /etc/docker/daemon.json.

{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}

Configuration explanation:

  • max-size: maximum size of a single log file before it is rotated
  • max-file: maximum number of rotated log files retained per container

Note that these settings apply only to containers created after the daemon is restarted; existing containers keep their previous log configuration.
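Since a syntax error in daemon.json prevents the Docker daemon from starting, it is worth validating the file before restarting. This sketch uses Python's stdlib JSON checker; any JSON validator works.

```shell
# Validate daemon.json before restarting; an invalid file stops dockerd
# from booting. Prints the parsed JSON on success.
python3 -m json.tool /etc/docker/daemon.json \
  || echo "/etc/docker/daemon.json is missing or not valid JSON"
```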

Applying the Configuration

sudo systemctl restart docker

Be aware that restarting the Docker daemon also restarts running containers unless live-restore is enabled in daemon.json.

Considering Periodic Cleanup

In production environments, automating periodic cleanup can also be considered.

# Example crontab entry: remove dangling images every Sunday at 02:00
# (-f skips the confirmation prompt; add -a to also remove unused tagged images)
0 2 * * 0 /usr/bin/docker image prune -f

Results and Lessons Learned

Resolution Results

  • Elasticsearch errors were resolved
  • Disk usage was reduced from 60 GB to 21 GB
  • System stability improved

Lessons Learned

  1. Regular monitoring: disk usage should be watched continuously, not only after errors appear
  2. Docker operational management: unused images and build caches accumulate quickly, especially in development environments
  3. Log management: log rotation should be configured from the start
  4. Preventive maintenance: periodic cleanup before problems occur is cheaper than an emergency response
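The first lesson can be turned into a small cron-friendly check. This is a minimal sketch: the 80% threshold is an arbitrary example, GNU df is assumed, and the echo stands in for a real alert (mail, webhook, etc.).

```shell
# Minimal disk-usage threshold check (assumptions: GNU df, 80% threshold,
# and "echo" standing in for a real alerting mechanism)
THRESHOLD=80
USAGE=$(df --output=pcent / | tail -n 1 | tr -dc '0-9')
if [ "$USAGE" -ge "$THRESHOLD" ]; then
  echo "WARNING: / is ${USAGE}% full"
else
  echo "OK: / is ${USAGE}% full"
fi
```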

Summary

In environments using Docker, images, containers, and build caches tend to accumulate, making regular cleanup important. We recommend performing appropriate operational management using the investigation and resolution methods introduced in this article.

Through this work, we were able to restore stable server operation. We hope this helps those facing similar issues.


Reference Command List

# Investigate disk usage
sudo du -h --max-depth=1 /path | sort -hr

# Check Docker resources
docker system df
docker images
docker ps -a

# Cleanup
docker system prune -a --volumes  # Bulk deletion
docker image prune                # Dangling images only
docker container prune            # Stopped containers only
docker builder prune              # Build cache only