OCR » Market Overview

I’ve already used or tested two free OCR systems: one is Microsoft’s Text Extractor utility from PowerToys, and the other is the OCR tool from the open-source ShareX package.

I tested both on a Serbian government website, using screenshots of perfect image quality with cleanly rendered Cyrillic text. To my surprise, both performed abysmally.

Now it is obvious why both underperformed: they are powered by the same engine. ShareX uses the Microsoft OCR engine locally, as evidenced by this author comment here.

Tesseract

The current stable version is major version 5, which started with release 5.0.0 on November 30, 2021.

Classic Tesseract

Tesseract is used from the command line and has no built-in GUI, but there are plenty of independent GUI tools listed here.

Installation on native Windows is straightforward; the following command installs the regularly updated build from UB Mannheim:

scoop install tesseract tesseract-languages
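
Since Tesseract is driven from the command line, a typical call might look like the sketch below. The helper name ocr_image and the TESSERACT_BIN variable are made up for illustration; the sketch assumes the srp and eng language data are installed (tesseract --list-langs shows what is available).

```shell
# Hypothetical helper: OCR one image and print the recognized text to stdout.
# Usage: ocr_image screenshot.png srp+eng
ocr_image() {
    image=$1
    langs=${2:-eng}   # fall back to English when no languages are given
    # TESSERACT_BIN is an illustrative override, not a Tesseract setting.
    # "stdout" as the output base makes Tesseract print the text instead of writing a file.
    "${TESSERACT_BIN:-tesseract}" "$image" stdout -l "$langs"
}
```

For example, ocr_image screenshot.png srp+eng > screenshot.txt would save the recognized text of a Cyrillic screenshot to a file.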

GUI Tesseract

As I am searching for the best tools, I will only mention those that are somewhat interesting.

dpScreenOCR

After those poor results, I became interested in dpScreenOCR, which uses Tesseract; its results were significantly better. I also downloaded a “larger” model from here, but the results were identical to those I had obtained before.

The installation is completely manual, but the tool is not that bad.

OCR in PDFs

These tools can iterate through all PDFs on a disk and insert an invisible OCR-generated text layer into the original PDF file, creating a searchable PDF. Both tools work on Windows, but better on WSL.

OCRmyPDF

ocrmypdf/OCRmyPDF is the more polished of the two tools, and although it works on native Windows, it can be installed within WSL with just one line: apt install ocrmypdf.

You also need to add languages; the available language packages can be listed with the command: apt-cache search tesseract-ocr. Then I install the ones I need:

apt install tesseract-ocr-srp tesseract-ocr-srp-latn tesseract-ocr-eng tesseract-ocr-deu

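Although it can do a lot with the PDF, its basic use is as follows (note that the Tesseract language code is srp_latn, with an underscore, unlike the package name):

ocrmypdf -l eng+deu+srp+srp_latn input_scanned.pdf output_searchable.pdf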

Specifying the languages is optional, but doing so is said to speed up recognition and improve its quality.
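
To cover the "iterate through all PDFs" case with OCRmyPDF, a loop along these lines could work. This is only a sketch: ocr_tree and OCRMYPDF_BIN are hypothetical names, and because it overwrites each PDF in place, it is worth trying on a copy of the directory first.

```shell
# Hypothetical helper: add a text layer to every PDF under a directory, in place.
# Usage: ocr_tree ~/Documents eng+deu+srp+srp_latn
ocr_tree() {
    dir=$1
    langs=${2:-eng}   # fall back to English when no languages are given
    # OCRMYPDF_BIN is an illustrative override for dry runs, not an OCRmyPDF setting.
    # --skip-text leaves pages that already contain text untouched.
    # Passing the same path as input and output rewrites the file in place.
    find "$dir" -type f -name '*.pdf' -exec \
        "${OCRMYPDF_BIN:-ocrmypdf}" -l "$langs" --skip-text {} {} \;
}
```

Setting OCRMYPDF_BIN=echo before calling ocr_tree prints the commands instead of running them, which is a cheap way to check which files would be touched.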

pdf2pdfocr

pdf2pdfocr can also use the incredibly good CuneiForm engine and even has a “relatively” easy installation on native Windows with the help of Scoop, as explained in the install_windows.txt file.

Interesting Projects and Libraries

Project Naptha and Other Magic

Project Naptha would have been science fiction until a few years ago. It’s almost unbelievable that there’s an excellent pure JavaScript OCR, tesseract.js, with complete support for all languages, that works well and easily on both the server and the client. There’s a demo on the website and, of course, the legendary Chrome extension.

PHP Library

An interesting one is tesseract-ocr-for-php, a PHP wrapper library.

Not-Tesseract Open-Source OCR

The same Project Naptha author also evaluated the only two open-source alternatives to Tesseract and wrote an interesting comment on the CuneiForm OCR engine, which I know used to be quite good:

GOCR and Ocrad are essentially the only other open source OCR engines (there’s technically also Cuneiform, but the source code is in a really really big zip file from some website in Russian and its also really slow according to benchmarks). And something I didn’t realize until I had peered into the source code is that they are powered by (presumably) painstakingly written rules for each and every detectable glyph and variation. This kind of blew my mind.

Interesting…

date 03. Feb 2023 | modified 03. Feb 2023
filename: AI » OCR