1 Reply Latest reply: May 15, 2014 9:53 AM by Grant Perkins

    PDF Files

    JoeB

      What are the limitations on Monarch 8 reading PDF files?

       

      Our company just updated our Monarch licenses to version 8, mainly because of the new ability to read PDF files. However, we have not had good luck with it. Most of the PDF files that we have tried to open in Monarch either are not read at all or have their formatting distorted to the point that we cannot extract the data. Here are a few examples of the PDF files that we have had problems with.

       

      Scanned Documents - We have a new high-end scanner that creates PDF documents, but we have not been able to open any of those PDF files in Monarch, regardless of the resolution settings. We have tried both complex documents like medical billings and simple two-column reports. In either case these PDF files do not show up in the Import screen.

       

      Hyperion Explorer - We use this program to query our data warehouse, and it can export tables in PDF format. However, when we open these PDF documents, Monarch locks up and we have to shut it down.

       

      IRS Forms - I have downloaded tax forms in PDF format from the IRS website, but we are unable to use them with Monarch because the formatting is too distorted.

       

      Don’t get me wrong, I love Monarch. But I was wondering if there is a more detailed explanation, other than the Version 8 Learning Guide, of how to mine data from PDF files using Monarch.

        • PDF Files
          Grant Perkins

          Hi Joe,

           

          I think you would do best to get a specific technical contact for most of your questions. This is a first-phase development, and PDFs are potentially complex files created by many different programs in many different ways, so it is likely that there will be PDF files that Monarch is not currently able to address.

           

          However, I do know that the development team considered and tested many files (with their fonts, formatting, different ways of handling graphics positioning and so on) that had been put forward by users or potential users looking for a PDF translation facility for data presentation files, i.e. Monarch looks for text and table data, not graphics.

           

          There are restrictions if the file is secure or password protected, which seems reasonable.

           

          Here are some guesses (and they are purely guesses!) about your list of problem inputs. They may suggest a few other things for you to check up on usefully before getting into technical discussions with the support team.

           

          High-end scanner.

           

          Is this a physical document scanner rather than an electronic document conversion tool?

           

          If so, my guess is that it is producing a PDF file that is a graphic/picture, not a text or mixed-content file. Not much can be done with graphics. Can the process include an OCR step to get the text content in place?
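
          A rough way to test that guess before talking to support is to look inside the file for font resources (a sign of real text) and image objects (how scanned pages are normally stored). The Python sketch below is only a heuristic and has nothing to do with Monarch itself: the function name triage_pdf_bytes is made up for illustration, it scans the raw bytes rather than parsing the PDF properly, and it can miss markers in newer PDFs that compress their object dictionaries. It also flags the /Encrypt marker that secured or password-protected files carry.

```python
import re


def triage_pdf_bytes(data: bytes) -> dict:
    """Heuristic scan of raw PDF bytes (hypothetical helper, not a PDF parser)."""
    # An /Encrypt entry in the trailer marks a secured or password-protected file.
    encrypted = re.search(rb"/Encrypt\b", data) is not None
    # Font resources suggest the file carries real, extractable text.
    has_fonts = re.search(rb"/Font\b", data) is not None
    # Image XObjects are how scanned page pictures are normally stored.
    has_images = re.search(rb"/Subtype\s*/Image\b", data) is not None
    return {
        "encrypted": encrypted,
        "has_fonts": has_fonts,
        "has_images": has_images,
        # Images with no fonts at all: probably a picture of a page with no text layer.
        "likely_scanned_image": has_images and not has_fonts,
    }
```

          Pass it the bytes of a file opened in binary mode. If likely_scanned_image comes back true, the scanner output probably needs an OCR step before any text-based tool can read it.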

           

          Hyperion Explorer Output.

           

          Difficult to guess. Do you have a copy of Acrobat available to see what happens if you open the files that way?

           

          IRS forms.

           

          Do the forms contain data, or are they PDFs with embedded fields ready to be completed?

           

          Can you point me to a URL or are they the sort of documents that are only released to specific registered individuals or companies?

           

           

          I am afraid I cannot comment on the V8 Learning Guide as I have not seen that yet.

           

          When I have tried to use the Adobe export facility in the past, I have come across many documents that simply did not export to text at all well. So we might expect some limitations, possibly severe, from time to time. I'm sure the development team, where they have half a chance of being able to extract successfully, will be more than willing to look at the problem and seek a way to deliver what you need, despite the immense variability to be found in PDF files from many different source programs. After all, even individual characters, as they appear on screen or in printed output, may be graphics, not text, in the file.

           

          This is no simple exercise. If it were, the all-conquering solution would already be out there somewhere.

           

          Sorry to ramble, but this is a big subject.

           

           

          Grant