19 Replies Latest reply: May 15, 2014 10:04 AM by Olly Bond

    Reoccuring PDF Problem!!!

    jbvinny _

      I have a PDF file that I opened using Adobe Acrobat 9 Pro. It was a scanned image document that I converted to text using Acrobat's OCR Text Recognition feature.  I then saved the document.  I am trying to import this file into Monarch Version 9.0 but I keep getting this error:

      "File is not a valid PDF File. It will be opened as plain text"  If I click OK, I get the error "input file has lines longer than the 4000 character maximum. they will be truncated."

      The file appears to have been created using the Adobe Acrobat 9.1 Paper Capture plug-in.

      This is a recurring problem.  Once imported, the file is a mess. Am I doing something wrong here? Any ideas? Oh, and I should mention that I already tried exporting the document as TIFF files and then converting them back to a PDF, which resulted in the same error message.

        • Reoccuring PDF Problem!!!
          Grant Perkins

          "I then saved the document."

           

          Saved as what sort of file?

            • Reoccuring PDF Problem!!!
              jbvinny _

              As a PDF file.

                • Reoccuring PDF Problem!!!
                  Grant Perkins

                  "As a PDF file."

                   

                   

                  Does this mean it was saved as a text file with a PDF extension, or written (printed) to a PDF file by a PDF generation engine (presumably Acrobat)?

                   

                  I have to say that my initial thought is this: since Monarch seeks to work with a text document, and you used Acrobat to create a text document using OCR, I would have been tempted to look at working with whatever the OCR had produced ('fixed' where required) rather than take it back to a PDF. However, as I have not used Acrobat for some versions, there may well be some part of the process that prevents such an approach or makes it unusable in some other way.

                   

                  Bear in mind that Monarch, given a text-based PDF, will endeavour to convert it to a text file as the first part of the analysis process.

                   

                  What have I missed?

                   

                   

                   

                  Grant

                    • Reoccuring PDF Problem!!!
                      jbvinny _

                      The file was originally a scanned image document.  With Adobe Acrobat Pro V9 you can convert scanned image documents to searchable text PDFs using Adobe's OCR tool.  Once that completed, I saved the file.  The file is still a searchable text PDF document.

                        • Reoccuring PDF Problem!!!
                          Olly Bond

                          Hello jbvinny,

                           

                          I've not used Adobe Acrobat Pro, but I've done the same thing with Abbyy's PDF Transformer and that worked. www.pdftransfomer.com has a 15-day free trial, so it might be worth testing that. Failing that, if you want to email me the PDF file you've produced, I'll see if I can replicate things here.

                           

                          Best wishes,

                           

                          Olly

                            • Reoccuring PDF Problem!!!
                              jbvinny _

                              Since this is a recurring problem, I would prefer to find a solution using the tools at my disposal (Acrobat and other standard Windows programs) if possible. I doubt I can talk management into another software tool given the economic times.

                               

                              I cannot forward you the document (confidential). 

                               

                              Thank you for taking the time to respond though.

                                • Reoccuring PDF Problem!!!
                                  Grant Perkins

                                  Adobe PDF Reader has had a Convert/export to text facility for a version or so. Presumably Acrobat also has this?

                                   

                                  I would be tempted to skip the searchable PDF creation for the purpose of using Monarch and see what results you get with a straight conversion to text using the OCR component. Is that an option?

                                   

                                  Experience with PDF documents of unknown parentage has, in the past, produced variable results using the Adobe conversion to text or any similar tool, but I would imagine anything generated with more recent PDF writers would be more consistent.

                                   

                                  I might hazard a guess that the Acrobat V9.1 Paper Capture plug-in (is this an Adobe development or a third-party program?) may have added something to the file that is not yet identified by the Monarch PDF validation routine, hence the message. However, that is pure speculation on my part, and you would probably need to check that possibility through Datawatch Technical Support.
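
                                  One way to look into that guess without sending the file anywhere is a quick check of the raw bytes. What follows is only a minimal Python sketch, not anything Monarch or Acrobat provides, and the function names are made up for illustration. It assumes only that a well-formed PDF carries a "%PDF-" marker near the start of the file and a "%%EOF" marker near the end. What Monarch's own validation routine actually checks is not known here.

```python
import os


def find_pdf_header(path, window=1024):
    """Return (offset, version) of the '%PDF-' marker, or None if it is
    not found within the first `window` bytes of the file."""
    with open(path, "rb") as fh:
        head = fh.read(window)
    offset = head.find(b"%PDF-")
    if offset == -1:
        return None
    # The three bytes after the marker normally hold the version, e.g. "1.7".
    version = head[offset + 5:offset + 8].decode("ascii", errors="replace")
    return offset, version


def has_eof_marker(path, window=1024):
    """Return True if '%%EOF' appears within the last `window` bytes."""
    with open(path, "rb") as fh:
        fh.seek(0, os.SEEK_END)
        size = fh.tell()
        fh.seek(max(0, size - window))
        tail = fh.read()
    return b"%%EOF" in tail
```

                                  If the header is missing, or sits at an offset other than 0 because something was written in front of it, that would be a useful detail to pass on to Technical Support.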

                                   

                                  HTH.

                                   

                                   

                                  Grant

                                    • Reoccuring PDF Problem!!!
                                      jbvinny _

                                      Grant

                                       

                                      Exporting to text creates a blank text document, so that does not work.

                                       

                                      Acrobat only allows you to use their OCR tool to create a searchable text pdf file.  It does not allow for an export directly to text from the OCR tool. 

                                       

                                      I understand that the parentage could be an issue, which is why I tried converting the PDF to TIFF files and then converting back to PDF, a solution I saw from another board member.   Still no luck!

                                       

                                      Not sure of the origin of the paper capture plugin.  This was a document provided to us by a third party.  What is the easiest way to contact Datawatch Technical Support?  This is just frustrating because you buy these tools and I have repeatedly had problems importing PDF documents into Monarch!!! 

                                       

                                      Once again, thanks to all of you for your help, and if anyone has any other ideas I am more than willing to try them.

                                        • Reoccuring PDF Problem!!!
                                          Grant Perkins

                                          "Exporting to text does creates a blank text document and does not work.

                                          Acrobat only allows you to use their OCR tool to create a searchable text pdf file. It does not allow for an export directly to text from the OCR tool."

                                           

                                          Ah.

                                           

                                          At the back of my mind there is a vague recollection that I read something about this: the searchable OCR'd document is actually the original PDF with a mapped search word table embedded in it, rather than a text conversion as such. I could be wrong, but it would explain the 'Not a PDF file' message to some extent.

                                           

                                          I do appreciate that an externally supplied document can present greater problems than something produced in-house, over which you may have control, but I didn't make the link earlier to this being an externally provided document.

                                           

                                          I would be tempted to take Olly's suggestion for trying one or more OCR to text converters just to check the feasibility. It should (subject to in-house IT constraints of course) be a relatively quick and inexpensive way to identify a route to any potential solutions OR a way of ascertaining that it is likely to be something of a challenge.

                                           

                                          Either way you would be in a better position to decide on how to progress.

                                           

                                          Contact Support here:

                                           

                                          http://www.datawatch.com/_support/contact_tech_support.php

                                           

                                           

                                           

                                          HTH.

                                           

                                           

                                          Grant

                                            • Reoccuring PDF Problem!!!
                                              jbvinny _

                                              Thanks again Grant and Olly.   I will proceed with Tech Support.

                                              • Reoccuring PDF Problem!!!
                                                Steve Caiels

                                                Hi,

                                                 

                                                It's worth updating to 9.01 from http://www.datawatch.com/_support/downloads_updates.php if you don't already have it.

                                                PDF import improves version on version.

                                                 

                                                Regards,

                                                Steve.

                                                  • Reoccuring PDF Problem!!!
                                                    jbvinny _

                                                    I tried downloading the update. The file still does not work. It's very frustrating to purchase a product that is not able to do the exact thing you purchased it for.  Does anyone else have any ideas for a workaround? Otherwise, the software is useless to me if it cannot handle simple PDF conversions that are created and/or saved by Adobe Pro 9.0.  I wish I had known that before buying the product.

                                                      • Reoccuring PDF Problem!!!
                                                        Data Kruncher

                                                        As Grant alluded to earlier, there are any number of possible complications and complexities with this type of activity (scanned -> OCR -> PDF).

                                                         

                                                        Did you proceed with contacting Tech Support directly as you mentioned in April? And if so, did they attempt to extract data from the file, and what were their results?

                                                          • Reoccuring PDF Problem!!!
                                                            jbvinny _

                                                            Data,

                                                             

                                                            The problem is that I do not want to have to go to Tech Support each time I need to open a file that was saved using Adobe Pro.  There is no particular file that I need to open at this time (other than the one in my first post). I have just run repeated tests with these types of documents, and each time I get the same error, which opens the documents in plain text.

                                                              • Reoccuring PDF Problem!!!
                                                                Grant Perkins

                                                                As I understand it from the post trail here, you get the same result if you open the processed PDF using the Acrobat Reader: a text file that is empty.

                                                                 

                                                                So your scanned document is held as an image. The OCR routine leaves it like that and creates a word index connected to pixel positions on the image, so it makes no difference whether you re-process it using the Reader or Monarch. In other words, the Adobe OCR and index routine is not making a conversion. To really convert the image to text (the index having little relevance here, it seems) you need a different tool, either from Adobe or a third party, that converts directly to a text file or to a 'text' PDF file of some sort. As I recall, Olly suggested a useful tool for that process in an earlier post. It's a very specialised area of functionality, and Monarch has never claimed to address that aspect of PDF files.

                                                                 

                                                                HTH.

                                                                 

                                                                 

                                                                Grant

                                                                  • Reoccuring PDF Problem!!!
                                                                    jbvinny _

                                                                    The problem is not just when converting scanned images. This appears to be an issue any time I save with Acrobat Pro 9.  I understand Monarch is not going to be able to solve all the world's problems; I just have a hard time believing I am the only one using Monarch and Acrobat Pro 9.  Are other people having these problems? Or is it just operator error on my part?

                                                                     

                                                                    Acrobat Pro does have a convert-to-text option, but this creates a very ugly file that is not the most user-friendly when pulled into Monarch. I appreciate your replies.

                                                                      • Reoccuring PDF Problem!!!
                                                                        Grant Perkins

                                                                        "Acrobat Pro does have a convert to text option but this creates a very ugly file that is no the most user friendly when pulled into Monarch."

                                                                         

                                                                        That is not uncommon with PDF files converted to text, by whatever route. Despite that, there are certain approaches that often help to get the text contents into a usable form: either using the adjustment facilities provided or, if they fail to deal with the embedded fonts and spacing instructions, by taking an approach based on (sometimes fairly drastic) slicing and dicing text manipulation techniques.
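
                                                                        As an illustration of what slicing and dicing can mean in practice, here is a minimal Python sketch that tidies an exported text file before it is opened in Monarch. It is not a Monarch feature; the name clean_text and the particular clean-up rules are assumptions chosen for illustration, and the 4000 figure is simply the limit quoted in the error message at the top of the thread. It removes stray control characters, treats form feeds as line breaks, and splits over-long lines so that nothing is silently truncated.

```python
import re

MAX_LINE = 4000  # limit quoted in the error message; adjust if yours differs


def clean_text(raw, max_line=MAX_LINE):
    """Return `raw` with control characters removed, trailing spaces
    stripped, and any line longer than `max_line` split into pieces."""
    # Normalise line endings and treat form feeds (page breaks) as newlines.
    text = raw.replace("\r\n", "\n").replace("\r", "\n").replace("\f", "\n")
    # Drop the remaining control characters, keeping tab and newline.
    text = re.sub(r"[\x00-\x08\x0b\x0e-\x1f\x7f]", "", text)
    out = []
    for line in text.split("\n"):
        line = line.rstrip()
        # Hard-split long lines; note this may cut through a field.
        while len(line) > max_line:
            out.append(line[:max_line])
            line = line[max_line:]
        out.append(line)
    return "\n".join(out)
```

                                                                        A hard split like this is crude and can break a field across two lines, so whether it helps depends entirely on the layout of the particular file.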

                                                                         

                                                                        The problem is that without an example file to experiment with it is extremely difficult to know where to start to provide suggestions. Also there is no absolute guarantee that any approach will be successful and consistent.

                                                                         

                                                                        Does Acrobat 9 have any options to write files as if they were produced by earlier versions?

                                                                         

                                                                         

                                                                        Grant