Introduction

This article documents a case in which Elasticsearch began returning errors because Docker containers and images had filled the disk, and walks through the investigation and the fix. We hope it serves as a useful reference for anyone facing a similar issue.

The Problem

The following error occurred in a running Elasticsearch instance:

{
  "error": {
    "type": "search_phase_execution_exception",
    "reason": "all shards failed",
    "phase": "query",
    ...
  },
  "status": 503
}

Initial investigation revealed that the indices had entered the close state, so insufficient disk space was suspected as the cause.
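As a quick way to confirm this symptom, the index state can be checked via the _cat API. This is a sketch assuming Elasticsearch listens on localhost:9200 (adjust `ES` for your environment); closed indices show `close` in the status column.

```shell
# Assumption: Elasticsearch is reachable at localhost:9200 (adjust ES).
# Closed indices appear with "close" in the status column.
ES="${ES:-localhost:9200}"
curl -s --max-time 5 "http://$ES/_cat/indices?v&h=health,status,index,store.size" \
  || echo "Elasticsearch not reachable at $ES"
```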

Investigating Disk Usage

Checking Root Directory Usage

First, we checked the overall disk usage of the system.

sudo du -h --max-depth=1 / | sort -hr | head -n 20

Output:

60G     /
50G     /var
4.7G    /usr
2.1G    /home
1.2G    /opt
...

The /var directory was found to be abnormally large at 50 GB.
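Note that du sums file sizes under a path, while df reports usage at the filesystem level; Elasticsearch's disk-based shard allocation watermarks are evaluated against the filesystem figures, not individual directories, so it is worth checking both.

```shell
# Filesystem-level view of the same disk; the Avail/Use% columns are what
# disk-watermark style checks actually react to
df -h /var
```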

Detailed Investigation of /var Directory

sudo du -h --max-depth=1 /var | sort -hr

Output:

50G     /var
49G     /var/lib
342M    /var/log
240M    /var/cache
128M    /var/spool
...

Since /var/lib accounted for nearly the entire volume, we investigated further.

sudo du -h --max-depth=1 /var/lib | sort -hr

Output:

49G     /var/lib
49G     /var/lib/docker
256M    /var/lib/snapd
128M    /var/lib/apt
...

Root cause identified: Docker data was occupying 49 GB.

Analyzing Docker Disk Usage

We checked Docker’s detailed usage.

docker system df

Output:

TYPE            TOTAL     ACTIVE    SIZE      RECLAIMABLE
Images          38        5         39.8GB    35.99GB (90%)
Containers      5         4         10.44MB   0B (0%)
Local Volumes   2         1         646MB     32.57kB (0%)
Build Cache     129       0         2.972GB   2.972GB (100%)

Analysis Results

  • Images: 33 out of 38 (approximately 36 GB) were unused
  • Build Cache: All 3 GB were reclaimable
  • Containers: Most were active and not candidates for deletion
  • Volumes: Nearly all in use
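Before deleting anything, it can help to list the removal candidates explicitly. The sketch below assumes the docker CLI is installed and the daemon is reachable (falling back to a message otherwise); the `dangling=true` filter matches untagged images.

```shell
# List removal candidates before pruning: untagged (dangling) images and
# exited containers. Falls back to a message if docker is unavailable.
if command -v docker >/dev/null 2>&1; then
  { docker images --filter "dangling=true" &&
    docker ps --all --filter "status=exited"; } ||
    echo "Docker daemon not reachable"
else
  echo "docker CLI not available"
fi
```

For a per-image and per-volume breakdown of the sizes in the table above, `docker system df -v` prints verbose output.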

Performing Cleanup

Bulk Cleanup Command

The following command was used to remove all unused resources at once.

docker system prune -a --volumes

This command removes:

  • Stopped containers
  • Unused images (with -a, all images not referenced by any container are removed, not just dangling ones)
  • Unused networks
  • Unused volumes (the --volumes option; deleted volume data cannot be recovered, so use with care)
  • Build cache

Result

WARNING! This will remove:
  - all stopped containers
  - all networks not used by at least one container
  - all images without at least one container associated to them
  - all build cache

Are you sure you want to continue? [y/N] y

Total reclaimed space: 39.2GB

Approximately 39 GB of free disk space was recovered.

Preventing Recurrence

Configuring Docker Log Rotation

To prevent Docker container logs from accumulating indefinitely, we edited /etc/docker/daemon.json.

{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "10m",
    "max-file": "3"
  }
}

Configuration explanation:

  • max-size: maximum size of a single log file before it is rotated
  • max-file: maximum number of rotated log files retained per container

Note that these settings apply only to containers created after the daemon is restarted; existing containers keep their previous log configuration.
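Since a syntax error in daemon.json prevents the Docker daemon from starting, it is worth validating the file before restarting. This sketch uses Python's stdlib JSON checker; any JSON validator works.

```shell
# Validate daemon.json before restarting; an invalid file stops dockerd
# from booting. Prints the parsed JSON on success.
python3 -m json.tool /etc/docker/daemon.json \
  || echo "/etc/docker/daemon.json is missing or not valid JSON"
```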

Applying the Configuration

sudo systemctl restart docker

Be aware that restarting the Docker daemon also restarts running containers unless live-restore is enabled in daemon.json.

Considering Periodic Cleanup

In production environments, automating periodic cleanup can also be considered.

# Example crontab entry: remove dangling images every Sunday at 02:00
# (-f skips the confirmation prompt; add -a to also remove unused tagged images)
0 2 * * 0 /usr/bin/docker image prune -f

Results and Lessons Learned

Resolution Results

  • Elasticsearch errors were resolved
  • Disk usage was reduced from 60 GB to 21 GB
  • System stability improved

Lessons Learned

  1. Regular monitoring: disk usage should be watched continuously, not only after errors appear
  2. Docker operational management: unused images and build caches accumulate quickly, especially in development environments
  3. Log management: log rotation should be configured from the start
  4. Preventive maintenance: periodic cleanup before problems occur is cheaper than an emergency response
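The first lesson can be turned into a small cron-friendly check. This is a minimal sketch: the 80% threshold is an arbitrary example, GNU df is assumed, and the echo stands in for a real alert (mail, webhook, etc.).

```shell
# Minimal disk-usage threshold check (assumptions: GNU df, 80% threshold,
# and "echo" standing in for a real alerting mechanism)
THRESHOLD=80
USAGE=$(df --output=pcent / | tail -n 1 | tr -dc '0-9')
if [ "$USAGE" -ge "$THRESHOLD" ]; then
  echo "WARNING: / is ${USAGE}% full"
else
  echo "OK: / is ${USAGE}% full"
fi
```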

Summary

In environments using Docker, images, containers, and build caches tend to accumulate, making regular cleanup important. We recommend performing appropriate operational management using the investigation and resolution methods introduced in this article.

Through this work, we were able to restore stable server operation. We hope this helps those facing similar issues.


Reference Command List

# Investigate disk usage
sudo du -h --max-depth=1 /path | sort -hr

# Check Docker resources
docker system df
docker images
docker ps -a

# Cleanup
docker system prune -a --volumes  # Bulk deletion
docker image prune                # Dangling images only
docker container prune            # Stopped containers only
docker builder prune              # Build cache only