My company needed to OCR (recognise text in) about 120,000 pages of data and export the results into searchable multi-page PDFs and XML files.  The quality had to be very high to meet the client’s expectations.

This was accomplished in 2004, so please adjust for inflation as appropriate.

The Problem

Acrobat was far too slow, had limited batch control, and its recognition was sub-standard, especially with certain fonts.  We tried many alternative products (few could create searchable PDFs) and eventually came across one that was well ahead of the rest, producing the best results within a reasonable time frame.  This product came in three considerably different versions:

1) A professional, partly manual OCR system that could handle at best 50 pages at a time before performance suffered.  This cost about £40.

2) A version that was capable of processing up to a certain total number of pages for a fixed price.  The best-value (and largest) price tier for this was around £5,000 for 100,000 pages.

3) A version with unlimited pages for one year only, costing ~£6,500.

Option 1 was not feasible, as the user would have to manually select 50 files, carry out all the required actions, and wait whilst the PC was busy each time.  For 120,000 pages, that meant roughly 2,400 batches.

Option 2 would not quite do the job, as our 120,000 pages exceeded its 100,000-page limit, and it carried the risk that we could not reuse the software if something unexpected happened.

Option 3 was rather expensive and did not give us much room for using it on another job.

The Solution

Most companies simply pay for the higher-level products and pass the costs on to clients, increasing job prices.  I had the idea that I could probably write a front-end to simulate all the complex (but relatively consistent) user actions involved, thus spending only £40 on the basic product but getting a version that could actually do more than the most expensive one.  There was no practical or legal reason we could not do it ourselves, and it’d only take me a few days to design.

Since the user-controlled version could only handle 50 pages at a time (before the time taken exceeded a reasonable amount), I would break up the original TIFFs into blocks of 50 single-page files, number them sequentially, and then perform the following simulations:

  1. Feed each block of 50 into the OCR product, exactly like a user would do manually.
  2. Perform all the actions required for the OCR process (these could vary, so the front-end had to cater for every possibility and dialog box).
  3. Perform the XML export process.
  4. Export the blocks of files as single-page PDFs.
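
The batching step described above can be sketched roughly as follows.  This is only an illustration in Python, not the code that was actually used: the function names and the six-digit naming scheme are assumptions made for the example.  The key idea is to record each page's original name next to its new sequential name, so the renaming can be undone later.

```python
BLOCK_SIZE = 50  # pages per batch, the limit mentioned in the text


def plan_blocks(pages, block_size=BLOCK_SIZE):
    """Give each single-page TIFF a sequential name and a block number.

    Returns a list of (block_index, new_name, original_name) tuples.
    Keeping the original name is what makes the renaming reversible.
    """
    plan = []
    for seq, original in enumerate(pages):
        block_index = seq // block_size
        new_name = f"{seq:06d}.tif"  # hypothetical naming scheme
        plan.append((block_index, new_name, original))
    return plan


def group_by_block(plan):
    """Collect the new names belonging to each block, in block order."""
    blocks = {}
    for block_index, new_name, _original in plan:
        blocks.setdefault(block_index, []).append(new_name)
    return [blocks[i] for i in sorted(blocks)]


if __name__ == "__main__":
    # 120 made-up page names stand in for the real single-page TIFFs.
    pages = [f"doc_p{i}.tif" for i in range(120)]
    for number, block in enumerate(group_by_block(plan_blocks(pages))):
        print(number, len(block), block[0], block[-1])
```

Each list returned by `group_by_block` is one batch to feed into the OCR product; the simulated user actions themselves depend on the product's interface and are not shown here.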

After the process finished, I would then:

  1. Check that all the files had been processed successfully.
  2. Reverse the process used to break up and rename the TIFFs.
  3. Use an open-source command-line tool to combine the single-page PDFs into multi-page documents, giving the client their desired result.
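
The checking and reassembly steps above might look something like this.  Again, this is a sketch under assumptions rather than the original code: the `mapping` structure (sequential name to original document and page number) and both function names are invented for the example, and the merge tool is left out because which one was used is not stated here.

```python
from collections import defaultdict


def find_missing(mapping, produced_files):
    """Return the sequential names that have no PDF among the outputs."""
    produced = set(produced_files)
    return sorted(stem for stem in mapping if f"{stem}.pdf" not in produced)


def merge_plan(mapping):
    """Group single-page PDFs by original document, in page order.

    `mapping` maps a sequential name to (document, page_number); this is
    an assumed record kept from the renaming step.
    """
    by_document = defaultdict(list)
    for stem, (document, page_number) in mapping.items():
        by_document[document].append((page_number, f"{stem}.pdf"))
    return {
        document: [name for _page, name in sorted(pages)]
        for document, pages in by_document.items()
    }


if __name__ == "__main__":
    mapping = {
        "000000": ("report", 1),
        "000001": ("report", 2),
        "000002": ("letter", 1),
    }
    print(find_missing(mapping, ["000000.pdf", "000002.pdf"]))
    print(merge_plan(mapping))
```

Any batch containing a name reported by `find_missing` can be re-run before merging, and each list produced by `merge_plan` gives the page order to pass to whichever PDF merging tool is in use.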

After various tests were carried out, the process ran continuously for two weeks and needed only basic check-ups to ensure it was all working.

The result: All the documents were OCRed to the highest possible standard, and the client was very happy with the results.  We received a complimentary response on our job turnaround and quality.

We spent just £40 to do something that cost other companies thousands.