OCR with Adobe Acrobat

Overview

Teaching: 20 min
Exercises: 0 min

Questions

How do I OCR a PDF using Acrobat?

Objectives

Understand how to OCR a document in Acrobat

Understand the limitations of Acrobat

OCR with Adobe Acrobat

Adobe Acrobat is a great entry-level tool for OCR. It works best on good-quality PDFs (we’ll use ABBYY on our ugly PDFs). It’s also free for all current Yale students, faculty, and staff.

First Steps with Acrobat

In the sample data, go to the ‘Acrobat’ folder and open ‘CeremonialMagic_1.pdf’ in Adobe Acrobat.

Right-click the file

Select ‘Open with’ and choose Adobe Acrobat

If this is your first time using Acrobat, you will be asked to sign in to your account. Use your Yale credentials (NetID and password).

Select ‘Scan & OCR’ from the ‘Tools’ menu.

Click the ‘Recognize Text’ drop-down.

Change ‘Settings’ and ‘language’ if necessary.

Click the blue ‘Recognize Text’ button to begin OCR.

Fixing errors

Acrobat cannot be 100% accurate with its OCR. It will highlight words when its confidence in the accuracy of the OCR is low. We can manually verify and edit any OCR text.

Correct Text

Click the ‘Recognize Text’ drop-down.

Select ‘Correct recognized text’

Each word with a low confidence rating will appear in a red box.

Click on a word in a red box.

Correct transcription as necessary.

Select ‘Apply’ and move to the next potential error.

Viewing the Text

When we OCR a PDF, Acrobat creates a hidden text layer. While invisible to us, this text layer allows us to copy, paste, and search the recognized text.

Hidden Text

We can view the hidden text layer in Acrobat as an additional means of quality control.

While the ‘Correct recognized text’ option is open, check the box for ‘Review recognized text’.

This option will show us the hidden text layer on top of the image of the text.

Now we can edit the text for each word, not just the words that Acrobat identified as potential errors.

You’ll notice that in our example PDF there are several words that are incorrectly recognized. These were not identified by Acrobat as potential issues. It’s important to remember that 100% accuracy with OCR software is nearly impossible.
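Because Acrobat does not flag every error, a second pass over the recognized text can help. The sketch below is one rough way to do that in Python, assuming we have copied the recognized text out of the PDF into a string. The function names and the rules are hypothetical heuristics of our own (a token is flagged if it mixes letters and digits, or contains a character other than letters, digits, apostrophes, and hyphens). This is not how Acrobat measures confidence, and it will miss misread words that still look like ordinary words.

```python
def looks_suspicious(token):
    """Hypothetical heuristic for spotting a possible OCR error in one token."""
    has_letter = any(c.isalpha() for c in token)
    has_digit = any(c.isdigit() for c in token)
    # Anything other than letters, digits, apostrophes, and hyphens counts as odd.
    has_odd_char = any(not (c.isalnum() or c in "'’-") for c in token)
    return (has_letter and has_digit) or has_odd_char


def flag_suspicious_tokens(text):
    """Return the tokens in `text` that look suspicious, in order of appearance."""
    flagged = []
    for raw in text.split():
        # Drop surrounding punctuation so 'world.' is checked as 'world'.
        token = raw.strip(".,;:!?()[]\"“”‘’'")
        if token and looks_suspicious(token):
            flagged.append(token)
    return flagged
```

Calling flag_suspicious_tokens on a pasted page returns a list of tokens to look up in the ‘Review recognized text’ view. Legitimate tokens such as decimal numbers may also be flagged, so treat the list as a set of places to check rather than a list of confirmed errors.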

Bulk processing

Using the ‘Action Wizard’

Adobe provides a way to create workflows through the Action Wizard. We can save these workflows and apply them to multiple PDFs or entire folders of PDFs.

From ‘Tools’, select ‘Action Wizard’

In the next menu, select ‘New Action’

There are several settings to change to complete our workflow.

Under ‘Files to be processed’, choose the ‘Acrobat’ folder. This is the folder where the PDFs you want to recognize are saved.

From ‘Recognize Text’, add ‘Recognize Text using OCR’.

Under ‘Save & Export’, add ‘Save’ twice.

Choose ‘Specify Settings’, change ‘Output Format’ to ‘Export File(s) to Alternative Format’, and select ‘Text (Plain)’ from the ‘Export to:’ drop-down list.

Rename the process and click ‘Save’. We can now apply these steps to any folder and Acrobat will OCR each file and save two versions: one PDF and one Text file.
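After running the action on a large folder, it can be worth confirming that every PDF actually produced a text file. The short Python sketch below does this under two assumptions that are ours, not Adobe's: that the exported text file sits in the same folder as the PDF, and that it has the same base name with a .txt extension. The function name is hypothetical. If your action saves its output elsewhere or under different names, the check needs adjusting.

```python
from pathlib import Path


def find_missing_text_exports(folder):
    """Return names of PDFs in `folder` that have no same-named .txt file.

    Assumes text exports are saved next to the PDFs with the same base name.
    Only files ending in lowercase '.pdf' are matched on case-sensitive systems.
    """
    folder = Path(folder)
    missing = []
    for pdf in sorted(folder.glob("*.pdf")):
        if not pdf.with_suffix(".txt").exists():
            missing.append(pdf.name)
    return missing
```

For example, find_missing_text_exports("Acrobat") would list any PDFs in the sample ‘Acrobat’ folder that still lack a text export, assuming the script is run from the folder that contains it.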

Key Points

How to enhance and OCR documents.

How to manually correct OCR mistakes.

How to set up a workflow using Action Wizard.
