What is OCR?

OCR means Optical Character Recognition. It is the process of turning text that appears inside a scan, photograph or image-based PDF into machine-readable text that software can search, copy, index and reuse. In practical terms, OCR creates a text layer from documents that would otherwise behave like pictures. That matters for search, accessibility, document review, workflow automation and AI retrieval.

Reviewed by Jackie, Head of Learning & Development, Levellers - Last reviewed 8 June 2026

What this means

A plain-English way to think about OCR is this: a scanned contract may look readable to a person, but to a computer it is often just a page-shaped image. Until OCR runs, the document might not be searchable, selectable or easy for assistive technologies to interpret. After OCR, the system can usually recognise words, record where they appear on the page and make the document far more usable in digital workflows.

That is why OCR matters in ordinary operational work, not just in archives. Many organisations still hold years of invoices, letters, forms, case files, meeting packs and signed documents as scans or image-heavy PDFs. If those materials are not converted into text, staff spend time opening files one by one, search tools miss relevant content, and retrieval systems have little to work with. OCR is often the step that turns "we have the documents somewhere" into "we can actually find and review them".

Why it matters

Accessibility is one reason. A scanned PDF made only of images is not the same as a born-digital PDF with real text. People using screen readers or other assistive technologies may not be able to read or navigate image-only files effectively. OCR can help by creating actual text, but it is not a complete accessibility programme on its own. The text still needs to be checked, and the document may still need tagging, structure and remediation.

OCR is also highly relevant to enterprise search and retrieval-augmented generation (RAG). If the source documents are scans, retrieval quality depends on whether the text was recognised accurately and linked to the right permissions and metadata. Poor OCR can pollute a search index with junk text, causing irrelevant hits or missing the documents that matter. Good OCR, on the other hand, can make decades of document history available to staff without forcing them to retype or manually summarise everything first.

For back-office teams, OCR can be the difference between a document repository and a working knowledge source. Finance, legal, operations and customer support all benefit when the basic text inside old files becomes searchable. The point is practical: less hunting, better review, faster response and fewer decisions based on partial context.

How it works

A typical OCR workflow starts with the input file itself. The quality of that file has a huge effect on results. Resolution, contrast, skew, lighting, rotation, compression and page damage all influence recognition accuracy. A clean, high-resolution scan of typed text is much easier than a faint photocopy, a camera photo taken at an angle, or a page covered in stamps and handwritten notes. Good OCR projects begin by sampling representative documents, not by assuming that one test file tells the whole story.

Once the file is ingested, OCR software usually performs preprocessing. It may straighten the page, detect text zones, reduce noise, separate lines and identify languages. It then predicts characters and words and often returns confidence scores. Some tools simply create a searchable text layer. Others go further and attempt to detect layout, tables, key-value pairs or handwriting. That extra structure can be useful, but it also introduces more room for error if the document format is messy.

This is why leaders should understand the difference between readable, searchable and reliable. A document can become searchable after OCR and still contain mistakes. Names can be misread. Tables can lose their row and column logic. Multi-column pages can be read in the wrong order. Handwriting may be partially recognised or ignored altogether. Confidence scores help, but they do not replace testing. They are a way to route low-confidence results into review queues rather than pretending automation is infallible.
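One common way to test OCR output, offered here as a minimal sketch and not as the method any particular vendor uses, is to compare it against a hand-checked transcription of the same page and compute a character error rate: the number of single-character edits needed to turn the OCR text into the correct text, divided by the length of the correct text. The function names and the sample strings below are illustrative assumptions.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: fewest insertions, deletions and substitutions."""
    # previous[j] holds the distance between the reference prefix processed
    # so far and the first j characters of the hypothesis.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # delete a reference character
                current[j - 1] + 1,      # insert a hypothesis character
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edits per reference character; 0.0 means the OCR text is identical."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Hypothetical example: a hand-checked line and a plausible OCR misreading
# ("l" read as "1", "0" read as "O").
checked_text = "Invoice total: 1,250.00"
ocr_text = "Invoice tota1: 1,25O.00"
print(f"CER: {character_error_rate(checked_text, ocr_text):.3f}")
```

A low error rate on a page can still hide a serious mistake, as here, where both errors fall inside the fields that matter most. That is one reason to measure high-stakes fields separately from the page as a whole.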

It is also useful to distinguish OCR from wider document processing. Basic OCR identifies text. More advanced systems may also try to understand form fields, table structure and layout relationships. Those tasks overlap, but they are not identical. If your goal is keyword search across scanned PDFs, OCR may be enough. If your goal is extracting exact invoice totals or case references into a live workflow, you need a stricter review design.

Where it shows up in real workflows

Real workflow examples make the point clearer. A finance team can OCR incoming invoices so that staff can search supplier names, amounts and dates instead of opening hundreds of attachments manually. A legal or procurement team can OCR legacy contracts before indexing them for clause review and retrieval. A records team can OCR historical planning files or correspondence so that searches work beyond title metadata alone. In each case, the value is not "AI magic"; it is the very practical improvement in findability and review speed.

OCR is also relevant to knowledge capture. A business may have a large archive of scanned board packs, signed policies, operational forms or supplier documents. Without OCR, that archive is only partly usable because staff can search the file name but not necessarily the content. With OCR plus sensible permissions and metadata, the archive becomes far easier to navigate and much more useful as a source for enterprise search or internal retrieval.

A final example sits in service and compliance review. If a team receives a data subject access request (DSAR), a complaint or an internal investigation request, scanned records become much easier to search once OCR has created machine-readable text. But that also raises the bar for governance, because the same change that makes documents easier to find for legitimate review also makes them easier to expose if permissions are weak.

Common misunderstandings

A common misunderstanding is that OCR turns every PDF into trustworthy text. It does not. Some PDFs already contain real text and do not need OCR. Others are scanned images and do need it. Even after OCR, the output may be wrong in subtle ways that matter a lot in high-stakes contexts, such as patient identifiers, invoice totals, legal names or case reference numbers. Searchability is a useful milestone, not a guarantee of correctness.

Another misunderstanding is that OCR and accessibility are the same thing. OCR can create actual text, which is a major step because image-only PDFs are inherently difficult for assistive technologies. But fully accessible PDFs may still need reading-order fixes, headings, tags, table structure and other remediation work. OCR improves the foundation. It does not finish the job on its own.

It is also wrong to assume that modern OCR handles every difficult case equally well. Printed text, handwriting, tables, low-quality scans and multilingual pages do not all behave the same way. Support varies by model and vendor, especially for handwriting and complex layouts.

Risks and boundaries

The main risks and boundaries are operational and governance-related. Low-quality scans, unusual fonts, mixed languages, rotated pages, stamps, annotations and handwritten notes can all reduce accuracy. Tables and forms are especially tricky because recognising text is not the same as preserving structure. Some tools support many printed languages but a narrower set of handwriting scenarios. That means an apparently successful pilot on clean English documents may fail on multilingual forms or legacy archives.

Human review becomes essential when the consequence of error is meaningful. If the OCR output will decide a payment, populate a case record, support a DSAR search or feed a sensitive AI workflow, define review thresholds up front. Confidence scores should push uncertain results into a queue, not hide them. Keep the original image, the OCR text and the audit trail linked together so that staff can verify what the system thought it saw. That matters for accountability as much as for quality.
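As a rough sketch of what "linked together" could look like in practice, the record below keeps a pointer to the original image, a hash of its bytes, the OCR text and a running audit trail in one structure. Every name here (OcrRecord, AuditEvent, make_record, the file path) is a hypothetical illustration, not the schema of any specific product.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class AuditEvent:
    actor: str       # who or what acted, e.g. "ocr-engine" or a reviewer ID
    action: str      # what happened, in plain words
    timestamp: str   # ISO 8601, UTC


@dataclass
class OcrRecord:
    image_path: str        # where the untouched original scan is stored
    image_sha256: str      # fingerprint proving which image was read
    ocr_text: str          # what the system thought it saw
    mean_confidence: float
    audit_trail: list[AuditEvent] = field(default_factory=list)

    def log(self, actor: str, action: str) -> None:
        """Append an event; earlier events are never edited or removed here."""
        now = datetime.now(timezone.utc).isoformat()
        self.audit_trail.append(AuditEvent(actor, action, now))


def make_record(image_path: str, image_bytes: bytes,
                ocr_text: str, mean_confidence: float) -> OcrRecord:
    """Create a record at recognition time so the link exists from the start."""
    record = OcrRecord(
        image_path=image_path,
        image_sha256=hashlib.sha256(image_bytes).hexdigest(),
        ocr_text=ocr_text,
        mean_confidence=mean_confidence,
    )
    record.log("ocr-engine", "text recognised")
    return record


# Hypothetical usage with placeholder bytes standing in for a real scan.
record = make_record("scans/contract-p1.png", b"placeholder image bytes",
                     "This agreement is made on ...", 0.91)
record.log("reviewer-17", "checked against original image")
```

Storing the hash alongside the text means a reviewer can later confirm that the image on file is the same one the OCR text was produced from.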

There is also a data-protection dimension. OCR can surface personal data that was previously buried inside images and therefore harder to search at scale. That can be helpful for legitimate review, but it also expands the need for proper permissions, retention controls and minimisation. Once image-based content becomes indexable text, it is easier to copy, search and expose. Organisations should treat that as a governance change, not just a convenience upgrade.

What leaders should do next

If you are deciding what to do next, begin with a document inventory and a representative sample. Separate born-digital files from image-only scans. Group documents by type, layout, language and consequence of error. Test OCR on the messy real-world sample, not just the cleanest examples. Decide whether you need searchable text, field extraction, accessibility remediation or all three, because each requires a slightly different workflow and quality bar.
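The grouping and sampling step can be sketched as a simple stratified sample: bucket the inventory by the attributes you care about, then draw a few documents from every bucket so that rare but difficult groups are not crowded out by the clean majority. The field names, the inventory entries and the sample size below are assumptions for illustration only.

```python
import random
from collections import defaultdict


def stratified_sample(documents, keys, per_group, seed=0):
    """Pick up to `per_group` documents from every combination of `keys`."""
    rng = random.Random(seed)  # fixed seed so the sample can be reproduced
    groups = defaultdict(list)
    for doc in documents:
        groups[tuple(doc[key] for key in keys)].append(doc)

    sample = []
    for group_key in sorted(groups):
        members = groups[group_key]
        # Small groups contribute everything they have.
        sample.extend(rng.sample(members, min(per_group, len(members))))
    return sample


# Hypothetical inventory: in practice this would come from a file listing.
inventory = [
    {"id": "inv-001", "type": "invoice", "language": "en", "risk": "high"},
    {"id": "inv-002", "type": "invoice", "language": "en", "risk": "high"},
    {"id": "inv-003", "type": "invoice", "language": "en", "risk": "high"},
    {"id": "let-001", "type": "letter", "language": "cy", "risk": "low"},
    {"id": "frm-001", "type": "form", "language": "en", "risk": "high"},
]

pilot = stratified_sample(inventory, keys=("type", "language", "risk"),
                          per_group=2)
for doc in pilot:
    print(doc["id"], doc["type"], doc["language"], doc["risk"])
```

With this inventory, the single Welsh-language letter and the single form are always included, while the larger invoice group is capped at two.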

Then define your acceptance model. Set confidence thresholds by document type, create a human review path for low-confidence pages or fields, preserve originals, and make sure the resulting text inherits the right access permissions when it is indexed. If the output will support enterprise search or RAG, include retrieval testing as part of quality assurance. The right question is not "Did OCR run?" but "Can the right person now find and trust the right information?"
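An acceptance model of this kind can be written down as a small routing rule. The sketch below assumes the OCR tool reports a confidence between 0 and 1 for each page; the document types and threshold values are placeholders to show the shape of the rule, not recommended settings.

```python
from dataclasses import dataclass

# Illustrative thresholds only: stricter where an error costs more.
REVIEW_THRESHOLDS = {
    "archive_letter": 0.70,  # low-risk keyword search
    "contract": 0.90,
    "invoice": 0.95,         # totals may drive a payment
}
DEFAULT_THRESHOLD = 0.90     # used for document types not listed above


@dataclass
class PageResult:
    document_id: str
    document_type: str
    page: int
    confidence: float  # assumed to be reported by the OCR tool, 0.0 to 1.0


def route(results):
    """Split pages into those accepted automatically and those for review."""
    accepted, needs_review = [], []
    for result in results:
        threshold = REVIEW_THRESHOLDS.get(result.document_type,
                                          DEFAULT_THRESHOLD)
        if result.confidence >= threshold:
            accepted.append(result)
        else:
            needs_review.append(result)
    return accepted, needs_review


# Hypothetical batch: the same confidence is fine for a letter,
# but not for an invoice.
batch = [
    PageResult("let-001", "archive_letter", 1, 0.82),
    PageResult("inv-001", "invoice", 1, 0.82),
    PageResult("inv-002", "invoice", 1, 0.97),
]
accepted, needs_review = route(batch)
print("accepted:", [r.document_id for r in accepted])
print("review queue:", [r.document_id for r in needs_review])
```

Keeping the thresholds in one table makes them easy to review and adjust per document type as testing reveals where errors actually occur.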

For leaders, that usually means resisting one-size-fits-all rollout. A low-risk archive search project, a public accessibility project and a high-stakes document extraction workflow should not all be judged by the same accuracy threshold or review model.

FAQs

Can OCR reliably read handwriting?

Sometimes, but with important limits. Modern systems can handle some handwriting or cursive, especially in supported languages and on cleaner forms, but performance is usually less consistent than for typed text. Handwritten annotations, cramped forms and mixed printed-handwritten pages often need manual review or a narrower automation scope.

Does OCR make a scanned PDF fully accessible?

Not by itself. OCR can create actual text, which is a major step because image-only PDFs are inherently hard for assistive technologies to use. But fully accessible PDFs may still need tagging, reading order fixes, headings, table structure and other remediation work. OCR improves the foundation; it does not finish the job on its own.

Is OCR good enough for enterprise search and AI retrieval?

It can be, if the organisation tests it properly and ties it to document permissions, metadata and review rules. OCR is often the missing step that makes scanned archives usable, but low-quality recognition can damage retrieval just as easily as high-quality recognition can improve it. For important collections, treat OCR quality as part of search quality.

Sources

W3C: PDF7 Performing OCR on a scanned PDF document to provide actual text - Support for claims about scanned PDFs as image-only content, OCR creating actual text and OCR's role in accessibility workflows.
US National Archives: New Search Feature Optical Character Recognition - Support for the explanation that OCR converts images containing text into text that can be read and searched by a computer.
The National Archives UK: Digitisation - Support for quality-input and verification points, including not assuming OCR output is automatically accurate.
US National Archives and FADGI: Technical Guidelines for Digitizing Archival Materials for Electronic Access - Support for OCR as conversion of raster text images into searchable data and the need for testing and image quality.
Information Commissioner's Office: Principle accuracy - Support for accuracy cautions where OCR output is used to process personal data.
Information Commissioner's Office: Principle data minimisation - Support for minimisation cautions once scanned content becomes searchable and easier to index broadly.
