8 Replies Latest reply: Dec 19, 2016 8:26 AM by Dirk Schulze

    PDF Looks Like Gibberish

    Dirk Schulze

      I have a large PDF file that in Classic v12.4 and v13.5 looks like this:

       

      #$%        %$ &      '     (                   '  '         &

             (                     &   $    )     (    &      & & )

      7 8 ( &      %$     &$  ( &         '          9    $ &

       

      So it is all gibberish.  Anyone ever had this happen?  Any suggestions?

        • Re: PDF Looks Like Gibberish
          Edrun Yuen

          Hi Dirk,

           

          The following are common scenarios during which Monarch may not be able to import a particular PDF document, as well as some suggestions on handling them.

           

          Scanned PDF Files

          If a PDF file contains no text, it may actually be a scanned image or some other embedded image. A scanned image is a picture of a document, taken by a scanner, which is then embedded into a PDF document. Monarch cannot extract text from a picture. The only way to deal with images is to use OCR (optical character recognition) software to try and recognize and extract text from them. CAUTION: It is NOT recommended that OCR software be used with critical financial documents, due to the fact that the extraction accuracy varies with each document and the OCR software being used. It is very easy for small errors in the recognition to creep in when using OCR software, which may not be noticed until a review or audit of the data is performed.

           

          Damaged PDF Files

          Even if a PDF file appears correct in Adobe Acrobat, the text layer may have been damaged beyond repair during the creation process, in which case Monarch is unable to extract text from it. Adobe Acrobat is able to detect and repair many small errors in PDF documents, so opening the offending PDF file in Acrobat and using the File > Save As menu option to re-save it as a new PDF file may correct the problem.

           

          Text Extraction Prohibition

          When a PDF file is published, there are security options that can be specified to prevent the extraction of content from it. When you attempt to import a PDF document for which content extraction has been prohibited, Monarch will issue a message "Cannot import from PDF file because it does not allow text extraction". If this occurs, you will have to ask the publisher of the PDF file to republish it for you, and to allow content extraction when doing so.
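
          As a rough illustration of the security option described above: in PDFs protected with the standard security handler, the permissions are stored as a signed integer (the /P entry of the encryption dictionary), and bit 5 (counting from 1 at the low end) governs copying or extracting text and graphics. The sketch below only decodes that one bit from a value you have already obtained with some other tool. The function name and the sample values are hypothetical, and the thread does not say how Monarch itself performs this check.

```python
# Bit 5 (1-indexed from the low-order end) of the /P permissions integer:
# "copy or otherwise extract text and graphics".
EXTRACT_BIT = 1 << 4


def extraction_allowed(p_value: int) -> bool:
    """Return True if the extract-content bit is set in a /P value.

    p_value is the signed 32-bit integer stored in the PDF's encryption
    dictionary. Python's & operator treats negative ints as two's
    complement, so the signed value can be tested directly.
    """
    return bool(p_value & EXTRACT_BIT)


# Hypothetical sample values, for illustration only.
print(extraction_allowed(-1))     # every permission bit set
print(extraction_allowed(-3904))  # extract bit cleared
```

          A file with no encryption dictionary has no /P entry at all, so this check applies only to protected files.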

           

          A quick and easy way to check to see if any text actually exists in a PDF file is to open it in Adobe Acrobat and use the Find feature to search for some text you can plainly see on screen. If the text is not found, the text layer has been damaged or does not exist, in which case the document is most likely an image and is therefore unreadable by Monarch or Acrobat.

           

          Another test is to use the text extract tool in Acrobat. Copy some text and then paste it into Notepad. (Note: If the text extract tool fails to highlight any text when you left-click and drag over it, then the text you can see on screen is an image.) If the text you pasted into Notepad is not the same as the text you can see on the page of the PDF file, then the text layer is damaged.
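
          The copy-and-paste test above can be roughed out in code once the pasted text is saved somewhere. This is a minimal sketch, not anything Monarch or Acrobat does: it assumes that readable report text consists mostly of letters and digits, and the function names and the 0.5 threshold are illustrative assumptions rather than tested values.

```python
def alnum_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are letters or digits."""
    visible = [ch for ch in text if not ch.isspace()]
    if not visible:
        # No visible characters at all: treat as a ratio of zero.
        return 0.0
    return sum(ch.isalnum() for ch in visible) / len(visible)


def looks_garbled(text: str, threshold: float = 0.5) -> bool:
    """Heuristic: flag text whose letter/digit share falls below threshold.

    The 0.5 default is an assumption chosen for illustration.
    """
    return alnum_ratio(text) < threshold


# First line of the sample from the original post, and a readable sentence.
pasted = "#$%        %$ &      '     (                   '  '         &"
readable = "So it is all gibberish.  Anyone ever had this happen?"

print(looks_garbled(pasted))    # True
print(looks_garbled(readable))  # False
```

          A heuristic like this only flags obviously symbol-heavy output. A damaged text layer that maps characters to the wrong letters would pass it, so comparing against the text visible on screen remains the real test.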

           

          We reiterate that Monarch cannot capture images/graphics.

           

          HTH

            • Re: PDF Looks Like Gibberish
              Dirk Schulze

              Thanks Edrun.  It appears that the text is damaged.  It isn't a picture because I can copy and paste.  But it pastes the gibberish and the search does not return any results.  And I did not get the extraction prohibition message.  Good to know.  Thanks again.

                • Re: PDF Looks Like Gibberish
                  Olly Bond

                  Hello Dirk,

                   

                  The PDF engine in Monarch gets better with each release, so it may be that v14 will handle it. In Acrobat, you can view the PDF's document properties, including which program generated it. If you email these to support, they might confirm whether there's a known issue.
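
                  If Acrobat is not to hand, the generating program can sometimes be read straight from the file. PDFs usually record it in /Producer and /Creator entries of the document information dictionary. The sketch below is a crude scan of the raw bytes, assuming those entries are stored uncompressed as simple literal strings. It will miss metadata held in compressed object streams, hex strings, or strings containing escaped parentheses. The function name and sample bytes are hypothetical.

```python
import re

# Matches entries such as: /Producer (Some PDF Writer 1.0)
_FIELD = re.compile(rb"/(Producer|Creator)\s*\(([^)]*)\)")


def find_generator_fields(pdf_bytes: bytes) -> dict:
    """Return the first /Producer and /Creator literal strings found, if any."""
    found = {}
    for key, value in _FIELD.findall(pdf_bytes):
        # Keep the first occurrence of each key; decode leniently.
        found.setdefault(key.decode("ascii"), value.decode("latin-1"))
    return found


# Hypothetical fragment standing in for the contents of a real file.
sample = b"1 0 obj\n<< /Producer (Example PDF Writer 1.0) /Creator (Report Tool) >>\nendobj"
print(find_generator_fields(sample))
```

                  For a real file, read it in binary mode (for example `open(path, "rb").read()`) and pass the bytes in. Running this on last month's file and this month's file would show whether the generating program changed.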

                   

                  I've also found that generating a new XPS file or PDF file from a damaged PDF sometimes fixes this.

                   

                  Best wishes,

                   

                  Olly

                   

                  Olly Bond

                  MONARCH EXPERTS

                  www.monarchexperts.com

                  olly@monarchexperts.com

                    • Re: PDF Looks Like Gibberish
                      Dirk Schulze

                      Thanks Olly!  I tried saving it as a new PDF and that did not work.  Never thought to print it to XPS, but just tried it and ended up with a blank document.  The PDF program that generated the file was different from the prior month.  I will ask Datawatch support if that is the problem.  Thanks again.  As always you are a fountain of knowledge.  Happy Holidays!

                        • Re: PDF Looks Like Gibberish
                          Grant Perkins

                          Dirk,

                           

                          If the PDF is coming from a different source (i.e., a different host program, which would explain the different PDF program), it may have been set to produce output with alternative parameters, for example with text output disabled.

                           

                          In some situations parts of the file - formatting instructions perhaps or something completely unrelated - may look like text within the file when another application looks at it. Seeing something that presents as text is often enough for the interpreter to think it has "done the job".

                           

                          It may be that the new producer program decided to encapsulate most of the content as graphics. If you have OCR tools available, you could try opening the PDF in one to see what you get.

                           

                          Is there any possibility you might be able to share this problem file privately?

                           

                          I recently stumbled across something when working with a PDF file with similar challenges and could use a few more sample files to see if the "fix" I came across may be applicable more widely. Your file sounds like a good one to check out although as it seems to find something it thinks is text I suspect it might not work.

                           

                          For your purpose it sounds like trying to "fix" the source would get you back on track and would be a better option, assuming it is possible to fix, than introducing a change to the process. UNLESS you are stuck with the new output no matter what, in which case you may have no choice but to change the existing process a little.

                           

                           

                          Grant

                            • Re: PDF Looks Like Gibberish
                              Olly Bond

                              Hello Grant,

                               

                              I was going to suggest OCR – if printing to PDF or XPS doesn’t fix it, running the PDF through ABBYY or similar might solve the problem.

                               

                              Hello Dirk, I’ve got ABBYY FineReader on my Munich PC and would be happy to convert it for you to see if it solves the problem.

                               

                              Best wishes,

                               

                              Olly

                              • Re: PDF Looks Like Gibberish
                                Dirk Schulze

                                Grant -  The file is from a client and contains some sensitive data so I am unable to share it.  Previously they ran the report off a desktop but that unit was removed and they now use thin clients.  Fortunately they had another desktop available and ran the report from there so we do have a good copy.  I will pass the info along regarding parameters within the PDF producer but I'm guessing they are using default parameters and changing those may be outside their comfort zone.