I wanted the text in my scans to be embedded in the PDF document itself, which the PDF format supports. Other solutions I’ve seen include a copy of the text in a separate text file with the same name as the scan. That seems like a clunky solution. So I set out to find the best OCR software for PDFs, ideally something I could fit into my workflow.
The first thing to tackle was how to get something from the physical world into an electronic format. As usual, I did this in collaboration with Dave Bradford, my Apple counterpart. He used an app called PDFPen, which also has OCR built in. Unfortunately it’s Mac / iOS software only. Our benchmark image for testing OCR capabilities was a leaflet from a pub, scanned at 300 DPI on the flatbed scanner of an Epson multifunction printer.
PDFPen handled the OCR extremely well, picking up the title at the top including the overlaid green text. This led me on a quest to find something as good as, if not better than, PDFPen. I tried the following OCR software on Windows, in no particular order:
None of them measured up to PDFPen, even though I believe PDFPen uses OmniPage’s OCR engine. The main problem for the Windows software I tested was the text on the green background, which PDFPen managed to pick up. Here’s the sample from PDFPen.
I was unable to match the functionality of PDFPen on the Windows side. I used trial versions on a Windows 7 computer to test all of them. The tests were brief, so I did not tweak any settings that might have increased OCR accuracy, but PDFPen didn’t need any tweaking either.
I’d be interested in hearing from anyone who has solved this problem on the Windows side. The solution must be able to run from the command line so that it can fit into my workflow.
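To show what I mean by a command-line workflow, here is a rough Python sketch of the kind of batch step I have in mind. It only builds the commands that would be run for each scanned PDF in a folder. The program name `ocr-tool` and its `--input` / `--output` flags are placeholders I made up for illustration, not a real product, and the function name `plan_ocr_jobs` is likewise my own; swap in the command line of whatever OCR tool turns out to work.

```python
import shlex
from pathlib import Path

# Hypothetical command template: "ocr-tool" and its flags are placeholders,
# not a real program. Replace with the command line of your actual OCR tool.
OCR_COMMAND_TEMPLATE = "ocr-tool --input {src} --output {dst}"


def plan_ocr_jobs(scan_dir, out_dir, template=OCR_COMMAND_TEMPLATE):
    """Return one argument list per PDF in scan_dir, sorted by file name."""
    scan_dir, out_dir = Path(scan_dir), Path(out_dir)
    jobs = []
    for src in sorted(scan_dir.glob("*.pdf")):
        # Keep the same file name so the searchable copy is easy to match up.
        dst = out_dir / src.name
        # Split the template first, then fill in the paths, so that file
        # names containing spaces stay a single argument.
        args = [part.format(src=src, dst=dst) for part in shlex.split(template)]
        jobs.append(args)
    return jobs
```

Each returned list can be handed to `subprocess.run(args, check=True)` to run the tool for real. Keeping the planning separate from the running means the commands can be printed and checked before anything touches the scans.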