The following are common scenarios in which Monarch may be unable to import a particular PDF document, along with some suggestions for handling them.
- 1. Scanned PDF Files - If a PDF file contains no text, it may be a scanned image or some other embedded image. A scanned image is a picture of a document, taken by a scanner and then embedded into a PDF document. Monarch cannot extract text from a picture. The only way to deal with images is to use OCR (optical character recognition) software to try to recognize and extract text from them. CAUTION: Using OCR software with critical financial documents is NOT recommended, because extraction accuracy varies with each document and with the OCR software used. Small recognition errors creep in easily and may go unnoticed until the data is reviewed or audited.
- 2. Damaged PDF Files - Even though a PDF file may display correctly in Adobe Acrobat, its text layer may have been damaged beyond repair during the creation process, leaving Monarch unable to extract text from it. Adobe Acrobat can detect and repair many small errors in PDF documents, so opening the offending PDF file in Acrobat and using the File>Save As menu option to re-save it as a new PDF file may correct the problem.
- 3. Text Extraction Prohibition - When a PDF file is published, security options can be set to prevent content from being extracted from it. When you attempt to import a PDF document for which content extraction has been prohibited, Monarch issues the message "Cannot import from PDF file because it does not allow text extraction". If this occurs, you will have to ask the publisher of the PDF file to republish it with content extraction allowed.
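For scenario 3, the restriction is recorded inside the PDF file itself. In the PDF specification, the encryption dictionary of a protected file carries an integer /P of permission flags, and bit 5 (value 16) governs copying or extracting text and graphics. The Python sketch below only decodes that one flag from a /P value you have already read with some other tool. The function name is hypothetical, and this is not a description of how Monarch itself performs the check.

```python
# Bit 5 of the /P permission flags (value 16): "copy or otherwise extract
# text and graphics" in the PDF standard security handler.
EXTRACT_BIT = 1 << 4


def allows_text_extraction(p_value: int) -> bool:
    """Return True if the /P flags permit content extraction.

    p_value is the signed 32-bit integer stored under /P in the PDF's
    encryption dictionary. Reading it from the file is assumed to have
    been done elsewhere; this helper only interprets the flag.
    """
    return bool(p_value & EXTRACT_BIT)
```

A /P value of -4 has every permission bit set, so extraction is allowed; a value of -3904 has the extraction bit cleared, which is the kind of file that would need to be republished.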
To determine whether or not a PDF file contains text:
A quick way to check whether a PDF file actually contains text is to open it in Adobe Acrobat and use the Find feature to search for some text you can plainly see on screen. If the text is not found, the text layer is either damaged or missing. If it is missing, the document is most likely an image. In either case, neither Monarch nor Acrobat can read the text.
Another test is to use the text extract tool in Acrobat to copy some text, and then paste it into Notepad. (Note: If the text extract tool fails to highlight any text when you left-click and drag over it, then the text you can see on screen is an image.) If the text you pasted into Notepad is not the same as the text you can see on the page of the PDF file, then the text layer is damaged.
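The two manual checks above boil down to a small decision rule, sketched here in Python. It assumes you have already obtained whatever text can be copied or extracted from the page (for example, the text pasted into Notepad) and a phrase you can plainly see on screen. The function name and the result strings are illustrative only, not part of Monarch or Acrobat.

```python
def diagnose_text_layer(extracted_text: str, visible_phrase: str) -> str:
    """Classify a PDF page's text layer from one copy/extract attempt."""

    def normalize(s: str) -> str:
        # Collapse whitespace and ignore case so line wrapping does not matter.
        return " ".join(s.split()).lower()

    extracted = normalize(extracted_text)
    if not extracted:
        # Nothing could be selected or copied: the page is probably a picture.
        return "no text layer (likely a scanned image)"
    if normalize(visible_phrase) in extracted:
        # The phrase seen on screen matches what was extracted.
        return "text layer looks usable"
    # Text came out, but it does not match what is shown on the page.
    return "text layer present but possibly damaged"
```

A mismatch can also mean the phrase was simply taken from a different page than the extracted text, so it is worth comparing text from the same page before concluding that the file is damaged.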
To reiterate, Monarch is not capable of capturing images or graphics.
Hope this helps. Thanks!
Datawatch Global Support Supervisor