Overview
This is a memo about issues I encountered when running ndlocr_cli (the NDLOCR (ver.2.1) application repository) and the steps I took to resolve them.
Note that many of these issues were caused by my own configuration oversights or atypical usage, and are unlikely to occur in normal use. Please refer to this article if you run into similar problems.
Shared Memory Shortage
When running ndlocr_cli, the following error occurred.
Predicting: 0it [00:00, ?it/s]ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
DataLoader worker (pid(s) 3999) exited unexpectedly
ChatGPT's response was as follows.
The "Unexpected bus error encountered in worker" error typically occurs when shared memory is insufficient while using PyTorch's DataLoader. It is especially common when the dataset is large or many workers are used.
And the following instructions were given.
If you are using Docker or another virtual environment, you need to increase the shared memory size. When using Docker, set the
--shm-size option when starting the container. For example, set it as docker run --shm-size 2G ....
When I checked my Docker launch command, the --shm-size option was indeed missing. The following script specifies --shm-size=256m.
https://github.com/ndl-lab/ndlocr_cli/blob/master/docker/run_docker.sh
After adding this option, the shared memory shortage error was resolved.
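As a sketch, the change amounts to adding the option to the docker run invocation. The image name, container name, and other options below are placeholders, not the actual contents of run_docker.sh; the value 2g is an example and should be sized to your dataset and worker count.

```shell
# Hypothetical docker run invocation with shared memory enlarged.
# Only --shm-size is the point here; everything else is a placeholder.
docker run -d --rm \
  --name ocr_cli_container \
  --shm-size=2g \
  ocr-image:latest
```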
(Reference) Checking Current Shared Memory Size
This can be checked with the following command.
df -h /dev/shm
When the above error occurred, the size was 64M.
KeyError: 'STRING'
I encountered KeyError: 'STRING' several times. To address this, I made changes to the following two files.
https://github.com/ndl-lab/ndlocr_cli/blob/master/cli/core/inference.py#L681
Errors were occurring at the line_xml.attrib['STRING'] and elm.attrib['STRING'] accesses, so I added a guard like the following (shown for line_xml; the elm access needs the same treatment).
if 'STRING' not in line_xml.attrib:
    continue
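The guard above can be illustrated with a self-contained sketch. The XML fragment below only mimics the general shape of NDLOCR's output (LINE elements carrying the recognized text in a STRING attribute) and is not taken from the actual format specification.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: LINE elements normally carry a STRING attribute
# with the recognized text, but some may lack it, which is what triggers
# KeyError: 'STRING' in the unguarded code.
XML = """
<PAGE>
  <LINE STRING="first line" />
  <LINE />
  <LINE STRING="second line" />
</PAGE>
"""

def collect_strings(page):
    texts = []
    for line_xml in page.iter('LINE'):
        # Skip lines without a recognition result instead of raising.
        if 'STRING' not in line_xml.attrib:
            continue
        texts.append(line_xml.attrib['STRING'])
    return texts

page = ET.fromstring(XML)
print(collect_strings(page))  # ['first line', 'second line']
```

The same pattern (check membership in .attrib, or use .get('STRING')) applies wherever the attribute is read.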
Reference: Adding a Progress Bar
In some cases I wanted to display a progress bar during OCR processing. Modify the following section.
https://github.com/ndl-lab/ndlocr_cli/blob/master/cli/core/inference.py#L213
Specifically, add tqdm as follows.
from tqdm import tqdm
# for img_path in single_outputdir_data['img_list']:
for img_path in tqdm(single_outputdir_data['img_list']):
...
This allows you to check the current progress and estimated remaining time.
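As a self-contained sketch of the change above, the loop can be wrapped in tqdm like this. The image list and the per-image step are placeholders, not NDLOCR's actual inference code.

```python
# Wrap an iterable in tqdm to get a progress bar with an ETA.
try:
    from tqdm import tqdm
except ImportError:  # degrade gracefully if tqdm is not installed
    def tqdm(iterable, **kwargs):
        return iterable

img_list = ['page_001.jpg', 'page_002.jpg', 'page_003.jpg']  # placeholder

processed = []
for img_path in tqdm(img_list, desc='OCR'):
    # ... run inference on img_path here; we just record the name ...
    processed.append(img_path)

print(processed)
```

tqdm prints the bar to stderr, so it does not interfere with output that the script writes to stdout.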
Summary
When using ndlocr_cli in the standard way, the error handling described in this article is likely unnecessary, but I hope it serves as a useful reference if you encounter similar issues.