1 Reply Latest reply: May 15, 2014 9:57 AM by Grant Perkins

    PDF created from a printed document

    srco

      In version 8 I can open PDF files created from a software program, but not ones created from a printed document. Is this expected, or am I missing a step?

        • PDF created from a printed document
          Grant Perkins

          If the printed document has been scanned to a PDF file (which is how I understand what you have written, though I could be wrong), then the PDF will contain the scanned page as a graphic image rather than as text.


          Monarch can deal with text in PDF files but makes no attempt to identify text inside a graphic block. I would imagine that Monarch is opening the file but returning nothing, or at least nothing usable, to the screen.


          If you use a PDF reader tool (the Adobe PDF reader, for example) to open the file and then try to export it to text, you should get a similar result in the exported file: little text and perhaps a few control characters. (If that is not the case and you do get some real text, further investigation may be required.)
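
          As a rough illustration of that check, the sketch below looks at a string of exported text and guesses whether it contains real text or just a few stray control characters. It assumes you have already exported the text by some other means; the function name and both thresholds are made up for this example and are not anything Monarch or Adobe provides.

```python
def looks_like_real_text(exported: str, min_chars: int = 20, min_ratio: float = 0.9) -> bool:
    """Guess whether text exported from a PDF is real text.

    Hypothetical helper: the thresholds are arbitrary starting points.
    """
    # Ignore whitespace (spaces, newlines, form feeds) in both counts.
    visible = [ch for ch in exported if not ch.isspace()]

    # A scanned page usually exports almost nothing at all.
    if len(visible) < min_chars:
        return False

    # Control characters are not printable, so they pull the ratio down.
    printable = sum(1 for ch in visible if ch.isprintable())
    return printable / len(visible) >= min_ratio


if __name__ == "__main__":
    print(looks_like_real_text("\x0c\x00\x01"))
    print(looks_like_real_text("Invoice total 1,234.56 for account 000123"))
```

          If this returns False for the exported file, the PDF is most likely an image-only scan of the kind described above.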


          I hope this information is helpful. If it does not correspond with what you have found, please let us know and we can seek an alternative explanation.


          If you are interested in why only text is available from a PDF file, my observation and explanation is below. Not sure if it agrees with the official explanation!






          The problem with text as graphics is that it has to be interpreted from patterns in the graphic area. This is typically the task of an Optical Character Recognition (OCR) program. Whilst many of these can produce excellent results, they are complex pieces of code, not very fast by comparison with reading text, and likely to produce some errors, or at least areas of doubt about the interpretation which the user needs to respond to. Not so bad for a letter maybe, but not very appropriate for data processing in bulk. In other words, not currently very suitable for the sort of things that Monarch is mostly used for.


          As an example, if you run an OCR program on a file which contains formatted dates "mm/dd/yy" it may well, often encouraged by the font that has been used, interpret this as "mm1dd|yy" or something like that. It may not treat each occurrence of a formatted date in the same way. If you have a 1000-page report full of dates, the last thing you need is a process that constantly asks if it has got a date format right. That is just a simple and obvious example. Other difficult situations are much more complex, harder to identify and predict, and harder to offer a solution for.
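
          To make the date example concrete, here is a small sketch that sorts whitespace-separated tokens from OCR output into well-formed dates, date-shaped tokens whose separators look misread, and everything else. The list of characters that might stand in for "/" is my own guess and certainly incomplete, and the names are invented for the example.

```python
import re
from collections import Counter

# A well-formed two-digit date such as 05/15/14.
GOOD_DATE = re.compile(r"\d{2}/\d{2}/\d{2}")

# Same shape, but each separator may be one of several characters an OCR
# engine might substitute for "/" (an assumed, incomplete list).
DATE_SHAPED = re.compile(r"\d{2}[/1|lI\\]\d{2}[/1|lI\\]\d{2}")


def classify_token(token: str) -> str:
    """Return 'ok', 'suspect' or 'other' for a single token."""
    if GOOD_DATE.fullmatch(token):
        return "ok"
    if DATE_SHAPED.fullmatch(token):
        return "suspect"
    return "other"


def summarise(text: str) -> Counter:
    """Count how many tokens fall into each category."""
    return Counter(classify_token(tok) for tok in text.split())


if __name__ == "__main__":
    sample = "Posted 05/15/14 due 05115|14 paid 05/15l14"
    print(summarise(sample))
```

          Note that a genuine eight-digit number such as 12145178 would also be flagged as suspect, which is exactly the kind of doubt that ends up needing a person to resolve it.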


          I have not had a chance (or the need) to experiment with the latest enterprise-scale OCR programs to see what they can do in terms of speed, accuracy and volume handling. However, even if they can perform to a much better level than my description above might suggest, I am not sure that it would be appropriate to build that functionality into Monarch. It would seem more sensible to use the specialist software to create an output file that Monarch can then read and interpret as usual, and then automate a production process to do the whole thing in one step, PROVIDED the results from the file interpretation were consistent enough day by day or week by week. (I am assuming that such a requirement would have to satisfy more than one-off activities.)