Introduction to OCR
Overview
Teaching: 10 min
Exercises: 0 min
Questions
What are PDFs? What is OCR?
Objectives
Define OCR and how it works
Understand ‘good’ vs ‘bad’ PDFs
Vocabulary
- PDF – Portable Document Format
- A commonly used file format that can include text, images, graphics, and interactive forms.
- OCR – Optical Character Recognition
- The process of using a computer to extract text from an image.
What are PDFs?
- PDF (Portable Document Format) files are commonly used to create documents which can combine images, graphics, and text.
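- There are many types of PDFs: some are created digitally and already contain a text layer, while others are scans or photos that contain only images of text.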
What is OCR?
- OCR stands for Optical Character Recognition. This is the process of using a computer to extract text from an image. OCR is most often run on PDFs, but it can be run on any image that contains typed text.
- OCR works best on standard typefaces and is not effective at recognizing handwriting.
Good vs Bad PDFs
- Most PDFs that we OCR come from photos or scans of physical items. PDFs created digitally, including journal publications, already have a text layer built in. We can check whether a PDF already has text built in by trying to copy and paste its text or by searching the PDF (Ctrl+F or Cmd+F).
- The quality of a scanned document strongly affects the accuracy of the OCR. It’s best to scan documents at high resolution (600 dpi) and with good, overhead lighting. Book scanners are great for this job.
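The copy/paste or search check described above can also be done programmatically. The following is a minimal sketch, not a definitive method: the function names `pages_missing_text` and `extract_page_texts` and the `min_chars` threshold are hypothetical, chosen for illustration, and the extraction helper assumes the third-party `pypdf` package is installed. Pages that come back with little or no text are likely scans that still need OCR.

```python
def pages_missing_text(page_texts, min_chars=20):
    """Return 1-based page numbers whose extracted text is empty or nearly so.

    min_chars is an arbitrary illustrative threshold, not a standard value.
    """
    missing = []
    for number, text in enumerate(page_texts, start=1):
        # Count only non-whitespace characters; stray blanks are not real text.
        visible = "".join((text or "").split())
        if len(visible) < min_chars:
            missing.append(number)
    return missing


def extract_page_texts(pdf_path):
    """Read the text layer of each page (assumes the pypdf package)."""
    # Imported lazily so the checker above works without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    return [page.extract_text() for page in reader.pages]


# Illustrative input standing in for extract_page_texts("example.pdf"):
sample = ["Introduction to optical character recognition.", "", "   \n"]
print(pages_missing_text(sample))  # → [2, 3]
```

With a real file, `pages_missing_text(extract_page_texts("example.pdf"))` would list the pages without a usable text layer; `example.pdf` is a placeholder path. An empty result suggests the PDF was created digitally or has already been through OCR.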
Key Points
OCR is used to recognize text in PDFs and images
PDFs come in many formats
Not all PDFs have a text layer
Enhancing a PDF can improve OCR results