    PDF misalignment

    Karen B

      I have a PDF whose columns don't align properly in Monarch. Most of the columns at the beginning and end of the lines don't "overlap," so I've been able to capture those fields just by their width, but three columns in the middle overlap, and I'm trying to figure out how to separate out the three fields.


      What I've done so far is to capture all three columns into one field, and I'm trying to figure out how I can make calculated fields from that. The three columns are just text with no consistent identifying text; any one, two, or three of them may be blank; but there are at least two spaces between columns. So several lines might look like this (I've put in underscores to stand for spaces):


      __Smith, Mary_____descriptive text____12345 - more text

      The Name of a Company____description______Some other text

      Mary Smith____________________12345 - in 3rd column

      _______________________________third column text only

      _________Second column text only________

      ___First only___________________________

      ____________Second col___________Third col


      How can I separate out these three fields?

          Data Kruncher

          Hello and welcome, Karen.


          One of the first things that I like to try with this type of problem is simply increasing the Output Scaling number when importing the PDF file.


          This type of problem often, but not always, disappears at higher scale values.


          It's worth trying, then we'll go from there. OK?



              Karen B

              I've tried that already. If I click the "optimize import options for sample page" button, the scaling is set to 2.0. That doesn't eliminate the problem. Anything larger, and more lines and then eventually more columns have the overlap problem. I get the least overlap with 1.0.

                  Karen B

                  The deadline has passed on this, which was our pilot project for using Monarch. When I sent in the results, I had to report that I had trouble with the three columns overlapping once the text was imported into Monarch and that consequently some text from one field for some records ended up in the wrong field. I had to report that, for instance, if the original transaction looked like this in the PDF:


                  ABC Company M...     inv#131135     630050 • Hous...


                  The name, memo, and split fields might have ended up like one of these:


                  Name                                Memo                    Split

                  ABC Company M               ...   inv#131135           630050 • Hous...

                  ABC Company M   .           ..   inv#131135   6       30050 • Hous...

                  ABC Company M   ..          .   inv#131135  63       0050 • Hous...

                  ABC Company M   ...         inv#131135                 630050 • Hous...


                  Since those names look like four different entities when aggregating, I created a shortened name field that contains only the first 17 characters to add up the totals by name.
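
                  Outside Monarch, the shortened-name workaround could be sketched roughly as follows. The 17-character width comes from the description above; stripping trailing spaces and stray dots, the function names, and the amounts are assumptions added for illustration only.

```python
from collections import defaultdict


def name_key(name, width=17):
    """Build a grouping key from a damaged name field.

    Keeps the first `width` characters, then drops trailing spaces and
    stray dots (the stripping step is an assumption, not part of the
    original workaround).
    """
    return name[:width].rstrip(" .")


def totals_by_name(rows):
    """Sum amounts per shortened name; `rows` holds (name, amount) pairs."""
    totals = defaultdict(float)
    for name, amount in rows:
        totals[name_key(name)] += amount
    return dict(totals)


# Hypothetical rows: the same entity with differently misplaced dots.
rows = [
    ("ABC Company M               ...", 10.0),
    ("ABC Company M   .           ..", 5.0),
    ("ABC Company M   ..          .", 2.5),
]
print(totals_by_name(rows))
```

                  This only groups names that agree within the first 17 characters, so two different entities sharing a long prefix would still be merged.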


                  I promised that I'd continue to explore a resolution to this problem, so here I am again asking: Does anyone have any ideas? As I mentioned before, I've already experimented with the output scaling and haven't found a scale that entirely eliminates the problem.


                  Thanks for any suggestions!

                      Data Kruncher

                      Hi Karen,


                      This latest data sample looks much more manageable, but I have a couple of questions.


                      Does the middle column always begin with the text inv#?


                      And is it always only one invoice number, followed by what's in the third column?


                      Or do the implied rules of your first post still stand? Originally it looked like you could have what was essentially free form text in any or all of what were meant to be three distinct columns, with no clear way to differentiate what really belonged where.

                          Grant Perkins



                          If you have a recent version of the Acrobat Reader for PDFs it should have the ability to convert to text - rather like Monarch does.


                          Have you tried that and if so what did the results look like?


                          There have been a number of posts saying that opening a difficult pdf in Acrobat and re-writing it as a new pdf can also work wonders.


                          Other than that, I think what you are seeing is an attempt at proportional spacing on rather empty lines: things look strange as the interpreter tries to work out how to position things. Sometimes one can use an approach that creates a very, very wide page in such a way that, although the fields don't line up, they also don't overlap, though they might be extra wide.


                          On the other hand if the overlap is predictable, which is where Kruncher is coming from I think, then fields can be captured as a single string and split using slice and dice techniques.
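
                          As a rough illustration of that slice-and-dice idea outside Monarch, here is a minimal Python sketch. It assumes the combined field keeps its leading spaces, that columns are separated by at least two spaces, and that each chunk of text starts close enough to its intended column that fixed break offsets can tell the columns apart. The offsets 18 and 38 and the name split_columns are hypothetical; they would need tuning against real data.

```python
import re

# Hypothetical break offsets: a chunk starting before 18 goes to column 1,
# before 38 to column 2, and anything later to column 3.
COLUMN_BREAKS = (18, 38)

# A chunk is a run of words joined by single spaces; two or more spaces end it.
CHUNK = re.compile(r"\S+(?: \S+)*")


def split_columns(line, breaks=COLUMN_BREAKS):
    """Assign each chunk of text to a column by its starting offset."""
    fields = ["", "", ""]
    for match in CHUNK.finditer(line):
        # Count how many break offsets the chunk start has reached.
        col = sum(match.start() >= b for b in breaks)
        # If two chunks land in one column, keep both, separated by spaces.
        fields[col] = (fields[col] + "  " + match.group()).strip()
    return fields


# Build sample lines with explicit padding so the offsets are unambiguous.
full = "Smith, Mary".ljust(18) + "descriptive text".ljust(20) + "12345 - more text"
third_only = " " * 40 + "third column text only"
print(split_columns(full))
print(split_columns(third_only))
```

                          If the drift in the imported text is larger than the gap between columns, fixed offsets like these will misassign chunks, and some other distinguishing rule would be needed.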


                          Unfortunately, without some original sample files to work with, offering advice in this sub-field of report mining is usually a rather imprecise activity.






                              BaWahoo _

                              I have had a similar problem. The model will work with one PDF file, but another PDF with the same layout will have its data shifted on import, so the trapping is misaligned. Any suggestions?

                                  Grant Perkins

                                  If the available adjustments in the PDF interpretation routines don't allow you to gain some control over trapping, no matter how extreme the settings you use, then you may need to consider whether you can do a rough extract of entire lines, reformat them, and then export the result as something more manageable.
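
                                  One way to picture that extract, reformat, and export route is the small Python sketch below. It assumes the whole lines have already been captured as text and that fields are separated by runs of two or more spaces; the function name lines_to_csv is hypothetical, and blank fields are not recovered by this approach.

```python
import csv
import io
import re


def lines_to_csv(lines):
    """Split each non-blank line on runs of 2+ spaces and return CSV text."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in lines:
        text = line.strip()
        if not text:
            continue  # skip empty lines
        writer.writerow(re.split(r"\s{2,}", text))
    return out.getvalue()


# Hypothetical captured lines, loosely based on the sample in this thread.
sample = ["ABC Company   inv#131135   630050", "", "  X  Y "]
print(lines_to_csv(sample))
```

                                  The resulting delimited file could then be opened as a fresh input, where the columns no longer depend on character positions.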


                                  Also, you could try editing the PDF using Adobe Acrobat or perhaps a third-party alternative and simply re-create it from the opened file to see if you get a better starting point from the new version. Many have been successful with that approach.


                                  In the unlikely event that you have some influence over the production of the original PDF file you could seek a change of font (ensure it is fixed and not proportional) and a different PDF writer program in an attempt to find something that is more consistent than you seem to be getting currently.