There are Region traps in the latest version of Monarch, which handle this problem vertically, so you can exclude the first n pages of the annual report and just tell Monarch to trap on the balance sheet, for example. But that's not what you need.
You could explore the Regex trap - introduced in 13.2, I think (it's hard to tell, as releases come so fast these days but aren't announced here in the community) - or use the Not trap function to exclude lines that have numbers or / characters in columns 1 to 10 or 30 to 40, plus a character list trap to look for numbers, /, and spaces in the intervening area.
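For what it's worth, the column logic I have in mind can be sketched outside Monarch in a few lines of Python. The sample lines are invented, purely to show the shape of the test (Monarch columns are 1-based; Python slices are 0-based):

```python
import re

def matches(line):
    """Keep a line only if columns 1-10 and 30-40 are free of digits
    and slashes (the Not-trap part) while the intervening area does
    contain them (the character-list part)."""
    first  = line[0:10]    # columns 1-10
    middle = line[10:29]   # columns 11-29, the intervening area
    last   = line[29:40]   # columns 30-40
    if re.search(r"[0-9/]", first) or re.search(r"[0-9/]", last):
        return False
    return bool(re.search(r"[0-9/]", middle))

# Hypothetical sample lines -- not from any real report.
print(matches("ACCOUNT SUMMARY   01/02/03        TOTALS"))    # True
print(matches("12/31/24 PAYMENT RECEIVED        1,200.00"))   # False
```

It's only a sketch of the selection logic, not how Monarch evaluates traps internally, but it may help when reasoning about which lines the combination would catch.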
But without sight of your PDF, it's pretty tricky to advise, and I'd be reluctant to shout for a new feature if there's already a trick to handle it.
Thank you for the reply. You are correct, the region traps have been useful since they were released, but unfortunately they aren't helpful in this situation, as you stated. I have never used the Regex trap (haven't done much research or found any training on that yet - certainly open to learning). I use the NOT trap often, but again, it can only do so much when the column data is moving left and right throughout the report. This was just a general suggestion; I could find an example around here somewhere, but I'd have to work to find something without personally identifiable information, as the work I do is patient related.
I'll look into the Regex traps and once I've figured it out, I'll post my solution (assuming it fixes it!).
In addition to Olly's suggestions, the quick and classic way to get a result in a hurry might be to get the trap selection as accurate as possible, extract the records, and then filter for what you need.
That's a very crude description of the outline of an approach and may not always work for you if the data in the rows can be extremely random and perhaps result in false record identification.
Another approach would be to create a basic model that simply selects every row as a complete field. Then add a calculated field to the table that extracts whatever is in positions 10 to 40 AND excludes (by a filter factor built into the formula) anything you don't want. The target result is that the field is only populated for the records you need, so it can be used as a pseudo trap.
Now output the entire table as a new "text" file of some sort with fixed-width fields. Use a fixed-pitch font.
Open the new text file and create a second model with a trap on the new calculated field wherever it holds a value. Select the required data fields as normal from the main lines of the report using templates.
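To make the pseudo-trap idea concrete, here is the calculated-field logic sketched in Python with invented lines (the real positions, layout, and filter condition will of course differ for your report):

```python
import re

def pseudo_trap(line):
    """Mimic the calculated field: keep positions 10-40 only when
    they look like detail data (here: they contain a date-like
    token), otherwise return an empty value."""
    window = line[9:40]   # positions 10-40, 1-based in Monarch terms
    if re.search(r"\d\d/\d\d/\d\d", window):
        return window.strip()
    return ""             # empty = not a record line, filtered out

# Hypothetical report lines; only the middle one is a detail record.
report = [
    "PAGE 1 OF 12          PROVIDER SUMMARY",
    "          01/15/24  ROOM CHARGE  450.00",
    "   TOTAL CHARGES                  450.00",
]
for line in report:
    print(repr(pseudo_trap(line)))
```

In the exported fixed-width file, the second model then simply traps on this field being non-empty, which is far less fragile than chasing floating columns.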
As solutions go it is not very elegant but it should be relatively simple to understand and make the model easily adaptable for any future changed needs. It also offers the benefit of quickly deliverable functionality - especially if you have Automation available to you.
I am assuming you have thoroughly investigated the possibilities for "tuning" the PDF in an attempt to optimize output consistency. (And yes I am aware that there can be extreme challenges with some files!)
Thank you for taking the time to respond! That is essentially the method I currently use to work around this issue. I'll select the entire width as a field, then export it as a fixed-length text file so that all of the data starts in the same column. Not sure if you all have ever seen a hospital PS&R (Provider Statistical and Reimbursement System) report, but that is one file I generally have to do that with.
I appreciate both of you responding so quickly. I just became a member yesterday but am looking forward to learning many things from you all.
From what I have seen the Hospital Reporting systems (and indeed all Healthcare systems) produce a number of reports which may have changed over time and be constantly evolving. I assume that's why they absorb so much time and attract administrators to Monarch!
That said I have seen very few reports from any source (from memory) where a line containing only uppercase characters was an evident record selection criterion. Certainly none where the Uppercase aspect would have been the key to unlocking the data extraction trap. So you have piqued my interest.
When we add in a PDF as the source the game often becomes a bit more random. The internal data content of a PDF file can be highly variable and often heavily influenced by whichever program was used to write the file. Some are better than others when you move beyond the original PDF concept of fixing a document to a well controlled format and allowing the inclusion of graphics elements.
Similarly a frequently edited PDF file (or PDF file template) can become challenging to read, adding to the potential for the output program to deliver something that looks right on screen and in print but is a mess behind the scenes.
We have been told of instances where host system upgrades broke well-proven PDF-based extraction models simply because the update included a new PDF writer routine: it produced the same output on screen and paper but differed internally, so the model built for the previous version of the output no longer worked.
Things have greatly improved over the years that extraction from PDF functionality has been available, but there are still plenty of systems out there with old and potentially unpredictable PDF writer programs embedded within them.
So you are potentially dealing with 2 separate challenges - whatever the PDF extraction can be tuned to produce and then how to model for the records you need to extract.
It sounds like you have taken the pragmatic approach. However, I am wondering if there is another option that might be identifiable by working with a sample of the report.
I appreciate that you are unlikely to be able to share a report with anyone due to confidentiality requirements, but is there any possibility of a report from a test or training system that contains dummy data?
Alternatively, I have seen the system training/User Guide documentation online (whether it is up to date, I am not sure).
The systems seem to have several reports illustrated. If you could point us to any links to those that are challenging I would be interested to make a visual assessment to see if any ideas and suggestions come to me. It's not the same as dealing with the output from a PDF of course but it may produce some ideas for alternative approaches.
If not it would mean that you are on top of the game as it currently stands and the feature suggestion may well be a powerful and useful idea.
I believe I may have found the solution through Olly's suggestion regarding Regular Expression traps. I did a brief RegEx training at regexone.com to understand the syntax and started experimenting with the traps in Monarch. The sample report I was working with was a PDF in which the data moved substantially from left to right, and I was having trouble trapping the detail line because the floating trap kept catching the wrong rows. The first column of the detail contained a date, and a simple RegEx expression captured all of the detail rows exactly.
I plan on exploring the uses through the various other reports that I have (including the hospital Provider Statistical and Reimbursement System report) to see if that can eliminate the need for all the steps I'm having to use. I'll keep everyone updated on that. Do either of you use RegEx traps very often?
I like the news of the exact capture using Regex. Excellent!
Normally a row with a date in the first column would be quite trappable (at least as a row, even if it's not so easy to map the fields within it) without too much of a challenge, no matter what the source PDF was doing. So I do wonder whether there are some special complications in the PDF's internal structure, or some as-yet-undiscovered possibilities for adjusting the way the PDF is being interpreted.
As for Regex - it's a recent addition to the Monarch toolkit so I suspect few people will yet have extensive experience of it working with Monarch. You may be a pioneer here!
I would certainly encourage you to tell us all about your experiences with this new feature so that we can all join in the discovery and quickly develop a thorough working knowledge of what should prove to be a very powerful addition to the tool set.
By the way, have you used the Auto Define Trap feature yet?
My personal opinion is that it will often seem to be overkill for the purpose. Many traps can be remarkably simple and simple tends to be faster and easier to understand and maintain if things change over time.
However as a quick guide to expose the trapping possibilities available in the report files - especially if initial "manual" attempts prove unsatisfactory - it is an invaluable tool.
While there are many input reports that are well structured and have consistent data, making modelling fairly straightforward, there are still a large number of often data-heavy reports out in the wild that need special attention to work around "interesting" programming of the outputs.
Auto Define can be very handy in such situations by providing clues about what you might be up against.
Whilst Monarch can always (in my experience) provide a way to deliver the desired result, being able to work out early in the process what sort of approach might be most effective can be important for responding to requests or needs promptly. Auto Define can be part of the tool kit for that, even if you choose not to simply accept its suggestions.
Well done and have fun with Monarch.
I do on occasion use the auto-define feature; however, it rarely helps with the PDFs that I get. I was able to trap the sample file earlier using the following RegEx expression: ^\s+(\w\w/\w\w/\w\w)\s+\w (where ^\s+ anchors the start of the line and matches one or more spaces before the date, and the remainder matches the date format). I will be the first to tell you that my syntax is sloppy and there are better, cleaner ways of writing expressions, but that will take more than 20 minutes of online training to achieve. I was also able to play around with trapping some other items where a RegEx wasn't strictly necessary; for instance, I trapped the "Insurance Company" using: ^\s+Insurance Company:\s*(?<company>[A-Z].*) (this expression accounts for leading spaces, then looks for "Insurance Company:", then any further spaces, and captures the text right after it).
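If anyone wants to play with those two patterns outside Monarch, here's a quick Python check with made-up sample lines. One wrinkle: Python's re module spells named groups (?P<name>...), whereas the trap accepted the (?<name>...) style:

```python
import re

# The two patterns from the post, with the named group converted to
# Python's (?P<name>...) syntax. Sample lines are invented.
date_trap = re.compile(r"^\s+(\w\w/\w\w/\w\w)\s+\w")
company_trap = re.compile(r"^\s+Insurance Company:\s*(?P<company>[A-Z].*)")

detail = "    01/15/24  ROOM AND BOARD       450.00"
header = "    Insurance Company: ACME HEALTH PLAN"

print(date_trap.match(detail).group(1))           # 01/15/24
print(company_trap.match(header).group("company"))  # ACME HEALTH PLAN
```

A totals line with no leading date simply fails to match, which is exactly the pseudo-trap behaviour wanted here.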
Again, I'd post sample reports but it would take a decent amount of time to clean something up so that all of the patient data is removed and unfortunately I have to account for every 15 minutes of my time and I doubt my boss would like to see that on the hour detail! Gotta love the CPA firm time management system.