Introduction to OCR

Overview

Teaching: 10 min
Exercises: 0 min

Questions

What are PDFs? What is OCR?

Objectives

Define OCR and explain how it works

Understand ‘good’ vs ‘bad’ PDFs

Vocabulary

PDF – Portable Document Format
- A commonly used file format that can include text, images, graphics, and interactive forms.
OCR – Optical Character Recognition
- The process of using a computer to extract text from an image.

What are PDFs?

PDF (Portable Document Format) files are commonly used to create documents which can combine images, graphics, and text.
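PDFs differ in how they are produced: some are created digitally and contain a text layer, while others are photos or scans that contain only an image of the text.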

What is OCR?

OCR stands for Optical Character Recognition. This is the process of using a computer to extract text from an image. OCR is usually run on PDFs, but it can be run on any image that contains typed text.
OCR works best on standard typefaces and is not effective at recognizing handwriting.
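
As a rough illustration of running OCR on a single image, the sketch below uses the third-party pytesseract and Pillow packages. Both are assumptions, since this lesson does not name a specific OCR tool, and the Tesseract engine itself must be installed separately. The function name `ocr_image` is hypothetical.

```python
import sys


def ocr_image(image_path, lang="eng"):
    """Return the text that Tesseract recognizes in one image file.

    Assumes the third-party packages Pillow and pytesseract are installed,
    along with the Tesseract engine. They are imported here so the rest of
    the script still loads if they are missing.
    """
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as img:
        return pytesseract.image_to_string(img, lang=lang)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python ocr_sketch.py path/to/scan.png
    print(ocr_image(sys.argv[1]))
```

Expect the output to be less accurate on unusual typefaces or low-quality images, and poor on handwriting.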

Good vs Bad PDFs

Most PDFs that we OCR come from photos or scans of physical items. PDFs created digitally, including journal publications, already have a text layer built in. We can check whether a PDF already has a text layer by trying to copy and paste its text or by searching the PDF (Ctrl+F or Cmd+F).
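
The copy/paste check can also be approximated in code. The sketch below is one possible approach, assuming the third-party pypdf package is installed; the function names and the `min_chars` threshold of 20 are hypothetical choices made for illustration, not part of this lesson.

```python
def has_text_layer(page_texts, min_chars=20):
    """Return True if any page yielded at least min_chars non-whitespace characters.

    page_texts is a list of strings, one per page. The threshold is an
    arbitrary, hypothetical cut-off that ignores stray characters.
    """
    return any(len("".join(text.split())) >= min_chars for text in page_texts)


def extract_page_texts(pdf_path):
    """Extract the embedded text of each page using pypdf (assumed installed)."""
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    # extract_text() can return an empty result for image-only pages.
    return [page.extract_text() or "" for page in reader.pages]


# Example with already-extracted text, so no PDF file is needed:
print(has_text_layer(["Abstract. This article examines scanned archives."]))
print(has_text_layer(["", "  \n"]))
```

A scanned PDF would typically give empty page text and be flagged as needing OCR, while a digitally created PDF would not.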
The quality of a scanned document strongly affects the accuracy of the OCR. It’s best to scan documents at high resolution (600 dpi) and with good, overhead lighting. Book scanners are great for this job.
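
To make the resolution figure concrete, the pixel size of a scan is the page size in inches multiplied by the dpi. The short sketch below works this out for a US Letter page (8.5 × 11 inches); the page size and the 300 dpi comparison value are assumptions chosen only for illustration.

```python
def scan_pixels(width_in, height_in, dpi):
    """Return (width, height) in pixels for a page scanned at the given dpi."""
    return round(width_in * dpi), round(height_in * dpi)


for dpi in (300, 600):
    width_px, height_px = scan_pixels(8.5, 11, dpi)
    print(f"{dpi} dpi: {width_px} x {height_px} pixels")
# 300 dpi: 2550 x 3300 pixels
# 600 dpi: 5100 x 6600 pixels
```

Doubling the dpi doubles the pixels along each side, so each character is drawn with four times as many pixels.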

Key Points

OCR is used to recognize text in PDFs and images

PDFs come in many formats

Not all PDFs have a built-in text layer

Enhancing a PDF can improve OCR results
