2 Replies Latest reply: May 15, 2014 9:55 AM by Grant Perkins

    Tip for PDF's that Monarch won't import

    Data Kruncher



      I ran into another one of these scanner -> PDF type documents today. Monarch sees this as a graphic and can't import it.


      My solution is similar to this thread on PDF challenges (http://mails.datawatch.com/cgi-bin/ultimatebb.cgi?ubb=get_topic;f=1;t=001027#0000010), but with a twist I thought you might be able to use.


      I happen to have Acrobat Pro, but re-saving to a new PDF file didn't do the trick.


      Instead, I discovered Acrobat's built-in OCR feature. Thought I'd share it as I'd not seen it discussed here previously. Forgive me if I missed it in searching the archives.


      Anyway, once you open the troublesome PDF in Acrobat Pro, go to the Document, Paper Capture, Start Capture... menu item (I have 6.0, yours may be different). This is Acrobat's built-in character recognition. I didn't know it even had it until today.


      Select All pages, and in the settings change PDF Output Style to Formatted Text & Graphics.


      Save to a new PDF file, and open it with Monarch.


      Click the Optimize button, and you're good to go.


      Now, Acrobat didn't do a perfect job for me. The figures weren't nicely aligned as they were in the original.


      If you're lucky, a floating trap will resolve your final problem. If not, as in my case - I had too many items too far off - it's not entirely insurmountable.


      I trapped each line as one huge character field, and copied the Table records to Word with Courier font. A little housekeeping later, I saved to a new, final, text file.
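For anyone who would rather script the housekeeping than do it by hand in Word, here is a rough sketch in Python. It is not what I did above, only an illustration of the same idea: it assumes the exported lines separate their fields with two or more spaces, and the function name `realign` is made up for this example. It pads every field into fixed-width columns and right-aligns anything that looks like a number.

```python
import re

# Hypothetical helper: assumes fields are separated by runs of 2+ spaces.
NUMERIC = re.compile(r"^-?[\d,]+(\.\d+)?$")

def realign(lines):
    """Pad ragged report lines into fixed-width columns."""
    # Split each non-blank line into fields on runs of two or more spaces.
    rows = [re.split(r"\s{2,}", line.strip()) for line in lines if line.strip()]
    if not rows:
        return []
    # Give every row the same number of fields.
    ncols = max(len(r) for r in rows)
    rows = [r + [""] * (ncols - len(r)) for r in rows]
    # Each column is as wide as its longest field.
    widths = [max(len(r[i]) for r in rows) for i in range(ncols)]
    out = []
    for r in rows:
        cells = [
            c.rjust(w) if NUMERIC.match(c) else c.ljust(w)
            for c, w in zip(r, widths)
        ]
        out.append("  ".join(cells).rstrip())
    return out

sample = ["Widget A   12.50     3", "  Gadget Long Name  1,200.00   10"]
aligned = realign(sample)
print("\n".join(aligned))
```

Lines where the OCR left only a single space between two fields will still need fixing by hand, since the split cannot tell them apart from ordinary words.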


      Time to start modeling.


      Hope this helps.



        • Tip for PDF's that Monarch won't import
          WaldenS

          What if you do not have Acrobat Pro, etc.? How else can you get the 'blank page' to show the PDF data?


          Thanks, (first time user of this Forum)


          SDW

          • Tip for PDF's that Monarch won't import
            Grant Perkins

            SDW, welcome.


            The basic approach for a pdf file which you know contains text information (not just graphics data) but which cannot be read is to try re-writing the file to see if the new version is more readable. The problem is that there are now many pdf file writing programs, not all of them are consistent in the rules they apply, and not all documents have been well structured to start with.
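If you are not sure whether a file holds text at all, a very rough check is possible before trying a re-write. The Python sketch below is only a heuristic and an assumption on my part, not anything Monarch or Acrobat does: it scans the raw bytes for a font resource (`/Font`) and for an image object (`/Subtype /Image`). The function name `looks_image_only` is hypothetical. Newer PDFs can keep these dictionaries inside compressed object streams, in which case the markers are hidden and the answer will be wrong.

```python
import re

def looks_image_only(pdf_bytes: bytes) -> bool:
    """Heuristic: True if the file shows image objects but no font resources.

    Assumption: the relevant dictionaries are stored uncompressed, which is
    not guaranteed (PDF 1.5+ may pack them into compressed object streams).
    """
    has_font = re.search(rb"/Font\b", pdf_bytes) is not None
    has_image = re.search(rb"/Subtype\s*/Image\b", pdf_bytes) is not None
    return has_image and not has_font

# Tiny hand-made fragments standing in for real file contents.
scanned = b"<< /Type /XObject /Subtype /Image >>"
texty = b"<< /Type /Font /Subtype /Type1 >> << /Subtype /Image >>"
print(looks_image_only(scanned), looks_image_only(texty))
```

On a real file you would pass in the bytes read from disk, for example `looks_image_only(open("report.pdf", "rb").read())`, where `report.pdf` is a placeholder name. A result of True suggests a scan that needs OCR rather than a re-write.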


            Using Acrobat to open and then save the file can be expected to create a new file which does obey the Acrobat standards - or at least a file which obeys standards that Monarch will recognise! In effect it takes the components back out of the original file and starts again.


            Any other program that can deconstruct a pdf in its entirety (not just seek to extract the text it contains) and then create a new file from the result should be able to do the same thing.


            Provided it was not that program which created the uninterpretable file in the first place, you would likely get a good result.


            One alternative would be to recreate the pdf file using a different pdf writer program.