OCR » Market Overview
I’ve already actively used or tested two free OCR systems: one is Microsoft’s Text Extractor utility from PowerToys, and the other is the OCR tool from the open-source ShareX package.
I tested both on a Serbian government website written in Cyrillic, using screenshots of perfect image quality with excellent font rendering. To my surprise, both performed abysmally.
It is now obvious why both underperformed: they are powered by the same engine, since ShareX uses the Microsoft OCR engine locally, as evidenced by the author's comment here.
The current stable version of Tesseract is major version 5, which started with release 5.0.0 on November 30, 2021.
Tesseract is used from the command line and does not have a built-in GUI, but there are plenty of independent GUI tools listed here.
Installation on native Windows is straightforward with Scoop, which installs the regularly updated build from UB Mannheim:
scoop install tesseract tesseract-languages
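Once Tesseract is installed, a quick smoke test from a POSIX shell (WSL or Git Bash, for instance) might look like the sketch below. The helper names join_langs and ocr_image and the file name screenshot.png are made up for illustration; the tesseract calls assume that the binary is on PATH and that the srp and eng language data are installed.

```shell
#!/bin/sh
# Join language codes with "+", the form tesseract's -l option expects,
# e.g. "srp eng" -> "srp+eng". The body runs in a subshell so the
# changed IFS does not leak into the caller.
join_langs() (
    IFS=+
    printf '%s\n' "$*"
)

# OCR one image and print the recognized text to standard output.
# $1 = image file, remaining arguments = language codes.
ocr_image() {
    image=$1
    shift
    tesseract "$image" stdout -l "$(join_langs "$@")"
}

# Example (screenshot.png is a hypothetical file):
#   tesseract --list-langs
#   ocr_image screenshot.png srp eng
```

Passing stdout as the output base makes Tesseract print the text instead of writing a .txt file, which is convenient for quickly comparing results between tools.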
As I am searching for the best tools, I will only mention those that are somewhat interesting.
Thus, I became interested in dpScreenOCR, which uses Tesseract, as its results were significantly better. I also downloaded a “larger” model from here, but my results were identical to those I obtained before.
The installation is completely manual, but the tool is not that bad.
OCR in PDFs
These are tools that can iterate through all PDFs on a disk and invisibly insert an OCR-generated text layer into the original PDF file, creating a searchable PDF. Both tools work on Windows, but better on WSL.
ocrmypdf/OCRmyPDF is a more polished tool, and although it works on native Windows, it can be installed within WSL with just one line:
apt install ocrmypdf
You also need to add languages. The available language packages can be listed with the command:
apt-cache search tesseract-ocr
Then I installed the ones I need:
apt install tesseract-ocr-srp tesseract-ocr-srp-latn tesseract-ocr-eng tesseract-ocr-deu
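To confirm that the language data actually landed where Tesseract looks for it, something like the following sketch can help. missing_langs is a hypothetical helper, not part of Tesseract; it assumes that tesseract --list-langs prints one language code per line on standard output, as recent versions do.

```shell
#!/bin/sh
# Read a list of installed language codes on stdin (one per line) and
# print every requested code that does not appear in that list.
missing_langs() {
    installed=$(cat)
    for lang in "$@"; do
        # -x: match the whole line, -F: treat the code as a fixed string
        printf '%s\n' "$installed" | grep -qxF -- "$lang" ||
            printf '%s\n' "$lang"
    done
}

# Example; prints nothing when all four languages are installed:
#   tesseract --list-langs | missing_langs eng deu srp srp_latn
```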
Although OCRmyPDF can do a lot with a PDF, its basic use is:
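ocrmypdf -l eng+deu+srp+srp_latn input_scanned.pdf output_searchable.pdf
Note that Tesseract's own language code for Serbian Latin is srp_latn, with an underscore, unlike the package name. Choosing the languages is optional, but it is said to speed up recognition and improve its quality.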
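Since the goal is to process every PDF on a disk, the single-file command can be wrapped in a loop. This is only a sketch: ocr_output_name and ocr_tree are hypothetical helpers, the _ocr.pdf suffix is an arbitrary naming choice, and --skip-text (which tells OCRmyPDF to leave pages that already contain text alone) is added so that PDFs with an existing text layer do not abort the run.

```shell
#!/bin/sh
# Derive the output path: /scans/a.pdf -> /scans/a_ocr.pdf
ocr_output_name() {
    printf '%s\n' "${1%.pdf}_ocr.pdf"
}

# OCR every PDF below a directory, writing a searchable copy next to it.
# $1 = root directory, $2 = languages in Tesseract syntax (e.g. eng+srp)
ocr_tree() {
    find "$1" -type f -name '*.pdf' ! -name '*_ocr.pdf' |
    while IFS= read -r pdf; do
        out=$(ocr_output_name "$pdf")
        # Skip files that were already processed in an earlier run.
        [ -e "$out" ] && continue
        # </dev/null keeps the tool from consuming the file list on stdin.
        ocrmypdf -l "$2" --skip-text "$pdf" "$out" </dev/null ||
            printf 'failed: %s\n' "$pdf" >&2
    done
}

# Example (the directory is a placeholder):
#   ocr_tree "$HOME/scans" eng+deu+srp+srp_latn
```

Writing to a separate file rather than overwriting in place keeps the original scans intact until the results have been checked.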
pdf2pdfocr can also use the incredibly good CuneiForm engine, and it even has a “relatively” easy installation on native Windows with the help of Scoop, as explained in the install_windows.txt file.
Interesting Projects and Libraries
Project Naptha and Other Magic
An interesting one is tesseract-ocr-for-php, a PHP wrapper library.
Not-Tesseract Open-Source OCR
The same Project Naptha author also evaluated the only two open-source alternatives to Tesseract, and wrote an interesting comment on the CuneiForm OCR engine, which I know used to be quite good:
GOCR and Ocrad are essentially the only other open source OCR engines (there’s technically also Cuneiform, but the source code is in a really really big zip file from some website in Russian and its also really slow according to benchmarks). And something I didn’t realize until I had peered into the source code is that they are powered by (presumably) painstakingly written rules for each and every detectable glyph and variation. This kind of blew my mind.