    Processing drawn tables in PDFs with variable records

    esegura

      Hi there!


      I'm in need of advice. I'm new to Monarch and I need to process a 1000-page PDF file that looks like this:




      As you can imagine, this layout represents a challenge. To start with, those vertical lines seem like they won't make it to the text version in any form. Second, as you can see in the picture, even the headers are not aligned properly. Third, I'm seeing some variance in the number of spaces that Monarch inserts when I create the text version, as in the following example:


      (on page 3)

      Doe, John


      (and then on page 629)

      Doe, John


      So, I can't even rely on the number of spaces. Can you folks here help me develop a strategy to address this problem?


      Thanks in advance!


          Grant Perkins

          Hi Ed and welcome to the forum.


          Y'know, it seems that 90% of the people who suddenly find themselves with Monarch as a new workmate also discover they have been dropped in at the deep end of the pool. Welcome to that club, Ed!


          PDFs are interesting beasts and, looking at the relative complexity of the contents of your challenge, you have one of the more interesting ones.


          The report content looks familiar (though not in PDF form) from earlier posts so I wonder if any of the other members have already grappled with this one.


          Some thoughts as a start.


          1. What does Adobe Reader make of the report if you get it to convert to text? (Try it with a recent version. If it looks OK, you are in with a chance.)


          2. Have you experimented with the parameters over which you have manual control? If so, what sort of results could you get?


          3. What sort of font does the document use? I'm sort of wondering if it is proportional or maybe some of the columns are centred? Sort of. Fixed fonts for input and output are invariably more consistent, but on input you have to work with what you have.


          4. If all else fails, you may need to get into very wide fields to capture horizontally 'moving' strings (one or more fields) and then tidy up or use slice and dice techniques to liberate the specific data strings you need. Now that's not usually too difficult for the most part BUT can look like a bit of a challenge when just starting out, so for now I won't attempt to offer any details. We can go there if or when we need to, which is entirely dependent on the success factor of the text 'file' that you are able to extract from the PDF.
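
          To give a feel for the "wide field, then slice and dice" idea in point 4, here is a rough sketch in Python rather than in Monarch itself, so treat it as an illustration of the logic only. It assumes (and this is only an assumption about your data) that a genuine value never contains two or more spaces in a row, so any longer run of spaces can be treated as a column gap no matter how wide it is. The sample lines and the name split_columns are made up for the example.

```python
import re

# Assumption: a real column value never contains two or more consecutive
# spaces, so any run of 2+ spaces marks a gap between columns.
COLUMN_GAP = re.compile(r" {2,}")


def split_columns(line):
    """Split one extracted text line into values, ignoring how wide each gap is."""
    return [part for part in COLUMN_GAP.split(line.strip()) if part]


# Hypothetical lines: same record, different amounts of padding.
page_3 = "Doe, John     12345   Active"
page_629 = "Doe, John  12345        Active"

print(split_columns(page_3))
print(split_columns(page_629))
```

          If the spacing also varies inside a value (for example between the surname and first name), this rule would split that value in two, and you would need to capture the whole stretch as one wide field and collapse the internal spaces instead.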


          The key to the next stage is getting the most useful extraction (not necessarily a perfect rendition) from the PDF and working out the details of moving forward from there. Could be a steep learning curve - but interesting.