ABBYY Recognition Server OCR, Document Conversion and Archiving Software

August 27, 2019 By Kailee Schamberger


Hello and welcome to a brief introduction to ABBYY’s Recognition Server. Just to explain what Recognition Server is: it is a product designed to convert image-only documents, such as TIFFs or PDFs, to text-searchable PDFs, to other image formats, or to document types such as Microsoft Office documents, rich text format documents, CSV or XML files, and so on. It is operated through an administration console, which you can see on the screen now, and which is basically an extension of MMC, for anyone who is familiar with Microsoft MMC.

Within that administration console there are a number of tools that can be used. The first one is ‘Workflow’. ‘Workflow’ allows you to configure how documents will be processed, from import through to export. The first stage is to choose the import source: batch, shared folder, FTP folder, Exchange or POP3 email.

You then choose the process. That allows you to select which recognition languages you want to use, to prioritize the system between quality and speed, and also to apply things like custom dictionaries and similar options.

The next one is document separation. This is for where you have a large PDF with lots and lots of documents in it, and it decides how to split those documents up. That can be done in a number of ways, including by blank pages or by barcode.

The next step is quality control. This relates directly to what we call our ‘Verification Station’. Quality control allows you to select whether or not you want to verify the documents that have been processed. By verifying, we mean checking that the characters recognized by the system are correct. Sometimes you find things like a capital ‘B’ that has been recognized as an ‘8’, or vice versa, or an ‘I’ recognized as a ‘1’. So this lets you go through the document to make sure that any such issues can be corrected. You can set a threshold.

Indexing is about extracting metadata on the documents that you are processing.
Again, this is an optional stage. If you choose to do it, the first step is to select the document type that is being processed. Here you can see I’ve got a letter and a document.

And then finally you select the output, which is the export. As I said, there is a very wide range of options there for export. Within each export setting there are then sub-options for things like quality control and, with PDFs, whether you want them to be searchable, where you want to put the text layer, and so on. So that is really how you go about setting up ‘Workflow’, which describes, step by step, how a document moves through the system.

There are other options in the list. There is, as you can see, ‘Jobs’, which allows you to monitor the progress of an individual document through the system. There is also a log file associated with that. Then there are the controls for the ‘Scanning Station’, which is an image import profile, if you like. Instead of scanning images directly, you load them from a folder. That process can be fully automated, by the way. It’s just that there is a station there to do it manually, if you wish, if it suits your environment.

We then have ‘Processing Stations’. ‘Processing Stations’ are the workhorse of Recognition Server. They do the number-crunching, if you like: all of the actual recognition process. You can configure them in a number of different ways. You can have it all happening on a single server with the processing stations installed on it, or you can break out the processing stations. Say you’ve got an office environment where you might have 20-30 desktop PCs that go into standby in the evening when staff go home. What you can do is utilize the processing power of those 20-30 desktop PCs to carry out the recognition process.

And then you’ve also got things like the licensing server logs and user control.
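To recap the ‘Workflow’ stages described earlier, here is a minimal Python sketch of how such a configuration could be modelled. This is not Recognition Server’s configuration format or API: every name in it (`WorkflowConfig`, `stages`, the field names and their string values) is hypothetical and only mirrors the options mentioned in this walkthrough, with quality control and indexing treated as the optional stages.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class WorkflowConfig:
    """Hypothetical model of a workflow; not an actual Recognition Server object."""

    import_source: str = "shared_folder"    # e.g. "batch", "ftp_folder", "email"
    languages: List[str] = field(default_factory=lambda: ["English"])
    prefer_quality_over_speed: bool = True  # the quality vs. speed priority
    separation_method: str = "blank_page"   # or "barcode"
    verify: bool = False                    # quality control is optional
    index: bool = False                     # indexing is optional
    output_format: str = "searchable_pdf"

    def stages(self) -> List[str]:
        """Return the stages a document passes through, in order."""
        stages = ["import", "process", "document separation"]
        if self.verify:
            stages.append("quality control")
        if self.index:
            stages.append("indexing")
        stages.append("output")
        return stages


if __name__ == "__main__":
    # A workflow with both optional stages switched on.
    print(WorkflowConfig(verify=True, index=True).stages())
```

Keeping the optional stages as plain flags makes the order of the pipeline explicit: a document always goes from import to output, and verification and indexing are only added when they are switched on.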
So if I drop a couple of files into Recognition Server: I’ve got a PDF here and I’ve got a Word document. This is just to show you how different document types are handled. As you can see, I have started the process and those documents are being picked up. If we go into the ‘Jobs’ log, you’ll see that there is an error. That’s because the Word document isn’t a recognizable format. For Recognition Server to be able to recognize a document, it needs to be in an image format: a bitmap, a TIFF, a PNG file, a PDF. Anything that’s an image file, the system is able to read.

So the other document, the ‘Letters’ PDF, is now being processed. Once it has been processed, you will be able to find it within the Verification Station. If I open up the Verification Station, you’ll see that there are two windows, one on the left and one on the right: the image that is being recognized and the result of that recognition. You can see that uncertain characters are highlighted in turquoise. On the far left-hand side you can also see the recognition results by page, so you can see the number of uncertain characters and the percentage accuracy. Obviously these are all very clean, crisp images, so the recognition results are very, very good. But it may be that if you’re working with older images, a computer output from the late nineties, things like that, the recognition is not as good.

Once you’ve verified, you accept the document and it moves through to the next stage, which as I said is an optional stage. That stage is ‘Indexing’. Within Indexing, you select the document type that you want to process. Here I will select ‘Letter’. You can see that by doing that, we now have two fields: one for the address and one for the name of the person the letter is addressed to. All you do is select, from the image, the text that you want to populate those fields with.
So it’s very, very quick and easy, and you can index a large number of documents very quickly that way. Once you have finished indexing, the next stage of the process is to export those documents automatically. If I go into the ‘Export’ folder, you can see three exports: the one we’ve just done and two previous exports from other demonstrations. What you can also do, as I said, is output metadata. So in this folder I’ve got an XML file, which is the exported metadata from that ‘Letters’ file. Also in there we’ve got the unrecognized Word document. So we can place those for someone to have a look at manually, if we need to.

And that concludes the demonstration. I hope that gave you a good indication of what is possible. Many thanks.
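The way the demonstration handled the two input files (the PDF was processed, the Word document was set aside for a manual look) can be illustrated with a short Python sketch. It is an assumption-laden illustration, not how Recognition Server decides: the function name `triage`, the result keys, and the extension list (taken from the formats named in the talk: bitmap, TIFF, PNG, PDF) are all hypothetical, and a real system may inspect file contents rather than file names.

```python
from pathlib import Path
from typing import Dict, List

# Assumed list, based only on the image formats mentioned in the walkthrough.
IMAGE_EXTENSIONS = {".bmp", ".tif", ".tiff", ".png", ".pdf"}


def triage(filenames: List[str]) -> Dict[str, List[str]]:
    """Sort incoming files into those to process and those needing manual review."""
    result: Dict[str, List[str]] = {"process": [], "manual_review": []}
    for name in filenames:
        # Compare the lower-cased extension so "SCAN.TIF" and "scan.tif" match.
        if Path(name).suffix.lower() in IMAGE_EXTENSIONS:
            result["process"].append(name)
        else:
            result["manual_review"].append(name)
    return result


if __name__ == "__main__":
    print(triage(["Letters.pdf", "Report.docx"]))
```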