You had me at beer.
I run into this problem daily, and there are several things I try in order to fix it. Here are some options:
- Option 1 - Open the PDF in Adobe and go to Print. Select Advanced beside the printer and, at the bottom, select "Print to File". Then print it. A save box will pop up; save the .ps (PostScript) file in the same directory as the PDF you wish to alter. Once you have that .ps file, double-click it and it will re-compile your PDF, potentially making it cleaner when you pull it into Monarch.
- Option 2 - Try using Foxit Reader instead of Adobe to re-print to PDF. It encodes things a bit differently than Adobe, so that could help.
- Option 3 - In Monarch's PDF settings, play around with the different PDF engines. I've had a lot of success using some of the older engines on certain files.
- Option 4 - Depending on how the PDF data is moving around in Monarch, you could create a template that simply selects the entire row of data as a detail line, export that to a fixed-length text file, and then import the fixed-length text file back into Monarch. That way, at least all of the data will start in the same column.
- Option 5 - Write a fairly complex regular expression trap using explicit capture.
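The idea behind Option 4, forcing every record to start in the same column, can be sketched outside Monarch as well. This is a minimal Python illustration, not Monarch's own export; the record width and sample lines are invented for the example:

```python
# Sketch: normalize ragged report lines to one fixed record width so that
# every record starts in column 1 and fields land at predictable offsets.
# RECORD_WIDTH is an assumption; in practice, use the longest line in the report.

RECORD_WIDTH = 40

def to_fixed_width(lines, width=RECORD_WIDTH):
    """Pad (or truncate) each line so all records share a single width."""
    return [line.rstrip("\n").ljust(width)[:width] for line in lines]

ragged = ["ACME   100  2.50", "  BOLTS 7 0.10"]
fixed = to_fixed_width(ragged)
# Every record is now exactly RECORD_WIDTH characters long.
```

A fixed-length text file built this way imports into Monarch with every row anchored to the same columns, which is the whole point of the round-trip.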
Give these a shot. I generally try one, or some combination of all of them, in order to get a file in.
When this works for you, I'll get you the address to ship to
As a side note to Austin's Option 4: you may be able to improve the first pass by using a floating trap to extract several fields' worth of data into a single field, then tidying it with calculated fields in the table. The Extract, InTrim, Split, and If functions are likely to be the most useful.
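That tidy-up step can be sketched as a rough Python analogue; inside Monarch you would do the equivalent with Extract, InTrim, Split, and If in calculated fields. The field layout below is hypothetical:

```python
# Rough analogue of splitting one over-wide captured field into parts.
# The name/qty/price layout is invented for illustration only.

def split_combined(field: str) -> dict:
    """Split a single captured field into parts, trimming whitespace."""
    parts = field.split()                              # like Monarch's Split
    name = parts[0].strip() if parts else ""           # like Extract/InTrim
    qty = int(parts[1]) if len(parts) > 1 else 0       # If-style guard for missing data
    price = float(parts[2]) if len(parts) > 2 else 0.0
    return {"name": name, "qty": qty, "price": price}

row = split_combined("  BOLTS   7   0.10 ")
```

The guards matter because a floating trap will occasionally capture short or empty rows, and the calculated fields need to survive those.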
As Austin mentioned, it is definitely worth trying the older PDF engines in some cases.
If you can post a screen shot showing the variability you are seeing, then we may be able to comment some more.
Steve makes a good point: try sharing some examples with us so we can home in on a more specific approach. I'm a bit partial to regular expressions, specifically regular expressions using explicit capture, because you can set a pattern to capture that is dynamic in length and does not capture things it isn't supposed to. I taught myself regex using a website (RegexOne - Learn Regular Expressions - Lesson 1: An Introduction, and the ABCs), and although that regex framework is different, it helped me understand the concepts. I believe Monarch uses the .NET flavour of regex, but there are other flavours. Using the explicit capture option requires the syntax (?<FieldName>[regex pattern]).
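To make the idea concrete, here is a small named-capture example in Python. Note the syntax difference: .NET (and so Monarch) uses (?<Name>...), while Python's re module needs (?P<Name>...). The report line and field names are invented for illustration:

```python
import re

# Named-capture trap: pull fields out of a report line whose columns drift.
# .NET/Monarch syntax would be (?<Invoice>\d{5}); Python uses (?P<Invoice>...).
pattern = re.compile(
    r"(?P<Invoice>\d{5})\s+(?P<Customer>[A-Za-z ]+?)\s{2,}(?P<Amount>\d+\.\d{2})"
)

line = "10234  Acme Widgets     199.95"
m = pattern.search(line)
if m:
    fields = m.groupdict()  # {'Invoice': ..., 'Customer': ..., 'Amount': ...}
```

Because the customer name is captured lazily and bounded by runs of whitespace, the pattern keeps working even when the columns shift left or right between rows, which is exactly the variability being described in this thread.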
Hello Austin, Steve,
Thank you both very much for the quick reaction. RegexOne was very helpful, and I can handle regex now; it suits my needs well, as do floating traps and the other features.
But since not all my colleagues have v14, and so cannot use regex, my goal is to find a way to rewrite the PDF so that, after import, it appears in Monarch exactly the way it appears when you open the PDF in Adobe. That way they can work the reports easily and the errors are minimal. We pull data from different reports on a daily basis.
Tried it today: I converted a PDF to a PDF made of JPEGs, which I then converted back to a PDF with an online OCR tool. The result was a mess. I am going to try with the full version of Adobe as soon as possible and will let you know if I am successful.
Like Austin and Steve, we often come across this, and my observation is that if it is likely to be bad, it will be bad, and probably very bad.
Unless your colleagues tasked with working with these files are at least as competent and interested as you are ... I doubt you will find a "one fix fixes all" solution.
There are many possibilities for addressing and resolving the problems, but they are usually quite complex and may involve several steps. The maximum benefit from doing that development work comes where the results are to be automated, in which case you are working for the needs of automation, not human colleagues. That probably makes the task easier!
I have successfully found a way to convert a JPG to text via software scanning (it surprised me at the time), but I'm not sure how reliable it would be (testing was limited), and it does involve human interaction. The downside is that I am not at all sure the software I used is still available. Early this year the developers merged two related products, and it is possible that the functionality as I used it no longer exists. At the same time they moved to a subscription model (IIRC) and doubled the price of the main product (which I did not use anyway). As I use the software for no more than a few minutes once every two or three years, I chose not to sign up for the new version at that time.
However, if you have a sample file that might be possible to share, I would be happy to try it with what I DO have and let you know whether it looks likely to be of use.
Thanks a lot for sharing your experience. Unfortunately the data I work with and the sources are sensitive.
I tried to convert the original PDF report to a PDF made of JPEGs and then used Adobe's OCR to convert that back to a text PDF. The resulting PDF was worse than the original. However, I will still try to find software that will help us display the data in Monarch approximately the same way it is displayed when opened with Adobe Reader. Maybe you can share the name of the software you used?
Thank you all for your time and efforts. Best of luck this week as well.
I am reluctant to post the name of the software I used, for a number of reasons. Note that I only looked at the requirement to convert from a non-text PDF (effectively a JPG).
Firstly, on a public forum, doing so might appear to be a semi-official recommendation, and that would not be the case.
Secondly, I have not had an opportunity to try it on a usefully wide range of problem reports to see whether the process performs sufficiently well in all cases.
Thirdly, the product I used is no longer available as a stand-alone product; at the last new version it was "merged" into a larger, more comprehensive application from the same vendor. The previous version of the main product (which I tested but do not have a licence for) did not seem to have the same functionality, and the one before that (for which I do have a licence) certainly does not, according to my investigation. So it is quite possible that the option has been lost from the new version too.
As I recall, I was unable to run a quick trial installation because the application seemed to want to delete earlier versions as part of its installation, and at the time I had no good reason to go through that exercise and then have to re-install the older applications, assuming they would re-install without problems.
If you have corporate access to fairly modern OCR and PDF scanning software that offers the ability to work with existing electronic documents (although not Adobe, it seems, from your comments above), it would be a good idea to check what functionality it (or they, if you have more than one candidate) might offer.
However, from my past testing, the tools I used as they are intended to be used produced a less-than-useful result, which is why I was surprised to stumble across the successful one. I have concerns that it may have been an unintended feature that may not persist through future "upgrades". It may not even work for most of the reports that defeat the main applications; my few successful uses may have been no more than luck.
I would really appreciate an opportunity to do some more testing (off-line) with known examples of problem reports, but it's not so easy to obtain them!
I do understand why that is the case.
For text-based PDFs that are not well interpreted, there are techniques for making a "generic" extraction from badly formatted results and then using text-manipulation options to make the required data selection and get a workable data table. I have seen only a very small number of reports that remain challenging even after that approach is taken. Usually the most unhelpful ones come from report writers with fancy features for inserting clever headers and footers that look fine on screen or in print but are not well handled internally for text-extraction purposes. The problem perhaps arises from how the PDF writer program deployed by the application creating the document does its work.
If you are dealing with reports coming from different source systems, then even if they are the same report in the same format, the internal structure as written to disk may be very different, and so interpret differently when presented to a PDF reader program. If it is possible to create a generic model that works around such challenges, that is usually a good idea for long-term effective usage.
In some cases you may find you need a multi-step process involving two or three models to be reasonably confident that you have a solution that can be deployed for any incoming PDF of a particular report, irrespective of the report's source.
I've used Abbyy FineReader to good effect for manual jobs, desktop batch conversion, and server work with a monitored folder. I believe other Datawatch customers have also used Abbyy for fairly serious processes.
I've seen posts here going back a few years where other users were doing desktop PDF conversion with CutePDF and NitroPDF, but I've no experience with these, either on the desktop for manual work or scaled up to a server or automated solution.
Please do bear in mind that Monarch's PDF parsing libraries evolved from v8 to the latest v14.2, so if you're trying to open a PDF generated by a modern engine using Monarch Pro v9.0 or similar, you shouldn't expect perfect results.