ABBYY OCR Editor

Overview

Teaching: 30 min
Exercises: 5 min

Questions

How do I OCR PDFs with tables?

How do I change the language setting in ABBYY?

Objectives

Understand how to extract tabular data from a PDF and export to Excel

Learn how to add new patterns/characters to ABBYY

Working with Tables

Many PDFs include data in the form of tables. While recognizing the text and numbers in a table is straightforward, maintaining the table structure is more difficult.

From PDF to Excel

ABBYY can recognize the structure of data tables. It uses the lines separating rows and columns to identify the data in each cell.
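The idea of using ruling lines to find cells can be sketched in a few lines of Python. This is only an illustration of the general principle, not ABBYY's actual algorithm; the function name `cells_from_lines` and the page coordinates are made up for the example.

```python
def cells_from_lines(x_lines, y_lines):
    """Return a grid of cell boxes (left, top, right, bottom) bounded by ruling lines.

    x_lines: x-coordinates of the vertical lines (column separators)
    y_lines: y-coordinates of the horizontal lines (row separators)
    """
    xs = sorted(x_lines)
    ys = sorted(y_lines)
    grid = []
    # Each pair of neighbouring horizontal lines bounds one row,
    # and each pair of neighbouring vertical lines bounds one column.
    for top, bottom in zip(ys, ys[1:]):
        row = [(left, top, right, bottom) for left, right in zip(xs, xs[1:])]
        grid.append(row)
    return grid


# Hypothetical coordinates: 3 vertical and 3 horizontal lines give a 2 x 2 table.
grid = cells_from_lines(x_lines=[0, 100, 250], y_lines=[0, 40, 80])
for row in grid:
    print(row)
```

In this simplified model, adding a horizontal divider in the OCR Editor corresponds to adding one more value to `y_lines`, which splits an existing row in two.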

In the ABBYY folder, right-click ‘tables.pdf’ and open it in ABBYY FineReader.

Click the ‘Recognize Text’ drop-down and select ‘Open in OCR Editor’.

ABBYY should recognize the tables in this PDF. ABBYY designates tables with blue shading.

We can manually adjust columns and rows to match the table structure.

We can split, merge, and create new cells, columns, and rows.

Add a horizontal divider between rows 1 and 2.

Click ‘Save as XLSX’. This opens the table in Excel, where we can edit or save it.

Screenshot of ABBYY recognizing data tables

Non-Latin Text

ABBYY is not limited to English or Latin text. It supports a long list of languages, and you can even choose multiple languages if a document is multilingual.

Non-English OCR

Right-click the PDF named ‘Russian.pdf’ and select Open with ABBYY FineReader 14.

Click the ‘Recognize Text’ drop-down and select ‘Open in OCR Editor’.

Set the language to ‘Russian’ and ‘Russian (Old Spelling)’ if they are not detected automatically.

Click ‘Recognize’ again and see how the results have improved. We can edit this PDF in the same way we would an English text.
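One rough way to check whether the language setting took effect is to look at which script the recognized letters belong to. The sketch below is a hypothetical check run on text copied out of the OCR Editor, not an ABBYY feature; the function name `script_counts` and both sample strings are made up for illustration.

```python
import unicodedata
from collections import Counter


def script_counts(text):
    """Count letters by script, using the first word of each Unicode character name."""
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            # e.g. "CYRILLIC SMALL LETTER A" -> "CYRILLIC"
            name = unicodedata.name(ch, "UNKNOWN")
            counts[name.split()[0]] += 1
    return counts


# Made-up samples: a Russian word as it might come out with the wrong
# language selected (Latin look-alikes) and with Russian selected.
wrong_language = "MockBa"
right_language = "Москва"

print(script_counts(wrong_language))
print(script_counts(right_language))
```

If most letters in the output of a Russian document are counted as Latin, the language setting is a likely place to look first.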

Improving Pattern Recognition

ABBYY comes trained on many different alphabets and languages. Of course, ABBYY does not know every font used throughout history. Also, special characters or ligatures might not be in ABBYY’s dictionaries.

We can enhance ABBYY’s existing patterns through training.
You can custom train ABBYY from scratch or you can add to an existing pattern dictionary.
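For a sense of the kinds of characters involved, the sketch below lists a few ligatures and the long s that appear in historic printing and shows their modern equivalents. It uses Python's standard Unicode normalization and is a separate post-processing idea, not part of ABBYY's pattern training; the helper name `modernize` is made up for the example.

```python
import unicodedata

# Characters common in historic printing that have Unicode compatibility mappings.
samples = {
    "long s": "\u017f",
    "fi ligature": "\ufb01",
    "long s-t ligature": "\ufb05",
}

for label, ch in samples.items():
    modern = unicodedata.normalize("NFKC", ch)
    print(f"{label}: {ch!r} -> {modern!r}")


def modernize(text):
    """Replace compatibility characters (ligatures, long s) with modern letters."""
    return unicodedata.normalize("NFKC", text)
```

Note that NFKC normalization also rewrites other compatibility characters, such as superscripts and full-width forms, so it may change more than ligatures in a transcription.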

Pattern Training and Non-Roman Type

Modern English is typically printed in a Roman typeface. Other typefaces, such as Blackletter (Gothic), are no longer common but were used in historic texts. Many historic texts in English combine old and modern typefaces. We can extend our existing pattern dictionary by training ABBYY to recognize the characters in these older typefaces.

Right-click the PDF named ‘Non-Latin.pdf’ and select Open with ABBYY FineReader 14.

Click the ‘Recognize Text’ drop-down and select ‘Open in OCR Editor’.

In the Options menu, under OCR select

Screenshot of ABBYY's Pattern Trainer

Key Points

ABBYY can OCR data tables and non-English texts

We can train ABBYY to recognize new, unique characters
