The following are common scenarios in which Monarch may not be able to import a particular PDF document, along with some suggestions for handling them.
Scanned PDF Files
If a PDF file contains no text, it may actually be a scanned image or some other embedded image. A scanned image is a picture of a document, taken by a scanner, which is then embedded into a PDF document. Monarch cannot extract text from a picture. The only way to deal with images is to use OCR (optical character recognition) software to try to recognize and extract text from them. CAUTION: It is NOT recommended that OCR software be used with critical financial documents, because extraction accuracy varies with each document and with the OCR software being used. Small recognition errors can easily creep in and may go unnoticed until the data is reviewed or audited.
Damaged PDF Files
Even if a PDF file displays correctly in Adobe Acrobat, the text layer may have been damaged during the creation process, with the result that Monarch is unable to extract text from it. Adobe Acrobat is able to detect and repair many small errors in PDF documents, so opening the offending PDF file in Acrobat and using the File > Save As menu option to re-save it as a new PDF file may correct the problem.
Text Extraction Prohibition
When a PDF file is published, there are security options that can be specified to prevent the extraction of content from it. When you attempt to import a PDF document for which content extraction has been prohibited, Monarch will issue a message "Cannot import from PDF file because it does not allow text extraction". If this occurs, you will have to ask the publisher of the PDF file to republish it for you, and to allow content extraction when doing so.
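As a rough illustration of what this security option looks like inside the file: under the PDF standard security handler, permissions are stored as a signed integer (the /P entry of the encryption dictionary), and bit 5 controls copying or extracting text and graphics. The Python sketch below decodes such a value. The helper name `decode_permissions` and the sample value are illustrative assumptions, and nothing here describes how Monarch itself performs the check.

```python
# Permission flags from the PDF standard security handler (/P entry).
# The PDF specification numbers bits from 1, so "bit 5" is 1 << 4.
PERMISSION_BITS = {
    "print": 1 << 2,
    "modify": 1 << 3,
    "extract_text": 1 << 4,
    "annotate": 1 << 5,
    "fill_forms": 1 << 8,
    "extract_for_accessibility": 1 << 9,
    "assemble": 1 << 10,
    "print_high_quality": 1 << 11,
}


def decode_permissions(p_value: int) -> dict:
    """Return which operations a /P value allows (hypothetical helper)."""
    # /P is a signed 32-bit integer; mask it so negative values act as bit flags.
    flags = p_value & 0xFFFFFFFF
    return {name: bool(flags & bit) for name, bit in PERMISSION_BITS.items()}


if __name__ == "__main__":
    # -3904 is used here only as a sample value with the extraction bit cleared.
    for name, allowed in decode_permissions(-3904).items():
        print(f"{name}: {'allowed' if allowed else 'prohibited'}")
```

If `extract_text` comes back as prohibited, the file has to be republished with content extraction allowed, as described above.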
A quick and easy way to check to see if any text actually exists in a PDF file is to open it in Adobe Acrobat and use the Find feature to search for some text you can plainly see on screen. If the text is not found, the text layer has been damaged or does not exist, in which case the document is most likely an image and is therefore unreadable by Monarch or Acrobat.
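When Acrobat is not available, the same question (is there any text in the file at all?) can be approximated with a script. The Python sketch below is a heuristic under stated assumptions: it only inspects uncompressed or Flate-compressed streams in an unencrypted file, and it simply looks for the PDF text operators (BT followed by Tj or TJ). The function name `count_text_streams` is hypothetical, and a count of zero only suggests, rather than proves, that the pages are images.

```python
import re
import zlib

# Body of every "stream ... endstream" section in the raw file.
_STREAM = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)
# A text object begins with BT and shows text with the Tj or TJ operator.
_TEXT_OBJECT = re.compile(rb"\bBT\b.*?\b(?:Tj|TJ)\b", re.DOTALL)


def _inflate(raw: bytes) -> bytes:
    """Try Flate decompression; fall back to the raw bytes for anything else."""
    try:
        return zlib.decompressobj().decompress(raw)
    except zlib.error:
        return raw


def count_text_streams(data: bytes) -> int:
    """Count streams that appear to contain text-showing operators."""
    return sum(
        1
        for match in _STREAM.finditer(data)
        if _TEXT_OBJECT.search(_inflate(match.group(1)))
    )
```

A call such as `count_text_streams(open("report.pdf", "rb").read())` (the file name is a placeholder) returning 0 points toward a scanned or image-only document.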
Another test is to use the text extract tool in Acrobat. Copy some text and then paste it into Notepad. (Note: If the text extract tool fails to highlight any text when you left-click and drag over it, then the text you can see on screen is an image.) If the text you pasted into Notepad is not the same as the text you can see on the page of the PDF file, then the text layer is damaged.
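The comparison in this test can also be done mechanically when the pasted text is long. The following Python sketch compares what you typed from the screen with what was pasted from the PDF, ignoring differences in line wrapping. The function names and the 0.9 threshold are arbitrary assumptions for illustration, not values from Monarch or Acrobat.

```python
import difflib


def normalize(text: str) -> str:
    """Collapse runs of whitespace so line wrapping does not count as a difference."""
    return " ".join(text.split())


def text_layer_matches(visible: str, pasted: str, threshold: float = 0.9) -> bool:
    """Return True when the pasted text is close to the text seen on screen."""
    if not pasted.strip():
        # Nothing could be copied, so the visible "text" is probably an image.
        return False
    ratio = difflib.SequenceMatcher(
        None, normalize(visible), normalize(pasted)
    ).ratio()
    return ratio >= threshold
```

For example, `text_layer_matches("Invoice Total: 1,234.56", pasted_text)` returning False would suggest a damaged or missing text layer.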
We reiterate that Monarch cannot capture images/graphics.
The PDF engine in Monarch gets better with each release, so it may be that v14 will handle it. In Acrobat, you can view the PDF properties, such as which program generated the file. If you email these to support, they may be able to confirm whether there's a known issue.
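If Acrobat isn't to hand, the generating program can sometimes be read straight from the file. Here is a minimal Python sketch, assuming the document information dictionary is stored uncompressed and as plain literal strings (many newer PDFs compress or encrypt it, in which case this finds nothing). The function name `find_pdf_generator` is hypothetical.

```python
import re

# Matches e.g. "/Producer (Some PDF Library 1.0)"; backslash escapes are tolerated.
_INFO_ENTRY = re.compile(rb"/(Producer|Creator)\s*\(((?:\\.|[^\\()])*)\)")


def find_pdf_generator(data: bytes) -> dict:
    """Best-effort lookup of /Producer and /Creator in raw PDF bytes."""
    found = {}
    for key, value in _INFO_ENTRY.findall(data):
        # Keep the first occurrence of each key; latin-1 decoding cannot fail.
        found.setdefault(key.decode("ascii"), value.decode("latin-1"))
    return found


if __name__ == "__main__":
    # Made-up sample bytes standing in for the contents of a real file.
    sample = (
        b"1 0 obj\n<< /Producer (Example PDF Writer 2.0) "
        b"/Creator (Report Tool) >>\nendobj"
    )
    print(find_pdf_generator(sample))
```

Running this against last month's file and this month's file would show whether the producing program changed between them.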
I've also found that generating a new XPS file or PDF file from a damaged PDF sometimes fixes this.
Thanks Olly! I tried saving it as a new PDF and that did not work. Never thought to print it to XPS, but just tried it and ended up with a blank document. The PDF program that generated the file was different from the prior month. I will ask Datawatch support if that is the problem. Thanks again. As always you are a fountain of knowledge. Happy Holidays!
If the PDF is coming from a different source (i.e. maybe a different host program, which would explain the different PDF program), it may have been set to produce output with alternative parameters, with text output disabled, for example.
In some situations parts of the file - formatting instructions perhaps or something completely unrelated - may look like text within the file when another application looks at it. Seeing something that presents as text is often enough for the interpreter to think it has "done the job".
It may be that the new producer program decided to encapsulate most of the content as graphics. If you have any OCR tools available, you could try opening the PDF with them to see what you get.
Is there any possibility you might be able to share this problem file privately?
I recently stumbled across something when working with a PDF file with similar challenges and could use a few more sample files to see if the "fix" I came across may be applicable more widely. Your file sounds like a good one to check out although as it seems to find something it thinks is text I suspect it might not work.
For your purpose it sounds like trying to "fix" the source would get you back on track and would be a better option, assuming it is possible to fix, than introducing a change to the process. UNLESS you are stuck with the new output no matter what - in which case you may have no choice but to change the existing process a little.
Grant - The file is from a client and contains some sensitive data, so I am unable to share it. Previously they ran the report off a desktop, but that unit was removed and they now use thin clients. Fortunately they had another desktop available and ran the report from there, so we do have a good copy. I will pass the info along regarding parameters within the PDF producer, but I'm guessing they are using default parameters and changing those may be outside their comfort zone.