    How to keep Monarch (10.5) from changing PDF spacing?

    rattlehead02 _

      Hi all, I'm new to Monarch. I'm importing PDF data so that I can export it to CSV files. The problem is that some of the fields in the original PDF are rather close together, so on import Monarch changes the spacing, treating 2 fields as 1. Here is a sample:


      Area   SIC     ListNUM  Suburb                    ZIP   Address                                           School               

      E09    KM      1246537  Kings Mill              45034   5473 Oak St Kings Mills, OH 45034                 Kings Local SD                                                                               

      E09    HA      1248268  Hamilton Twp.           45039   7922 Sycamore St Hamilton Twp, OH                 Little Miami Local S                                                                               

      E09    DE      1249832  Deerfield Twp.          45040   4511 N Shore Dr Deerfield Twp., OH 45040          Mason City SD                                                                               

      E09    MA      1249085  Mason                   45040   4190 Spyglass Hill Mason, OH 45040                Mason City SD                                                                               

      E09    HA      1259174  Hamilton Twp.           45039   7940 Sycamore St Hamilton Twp, OH 45039           Little Miami Local S                                                                               

      E09    MA      1257884  Mason                   45040   209 W Church St Mason, OH 45040-1607              Mason City SD                                                                               

      E09    MA      1259498  Mason                   45040   927 Cambridge Dr Mason, OH 45040-1006             Mason City SD                                                                               

      E09    DE      1255240  Deerfield Twp.          45140   9989 Columbia Rd Deerfield Twp., OH 45140 Kings Local SD               



      Note that in the last record, the school district is butted up against the address. In the original PDF the fields are all fixed width, so there is no overlap. I could use floating traps, I guess, but it seems it would be easier to ask Monarch to, pretty please, not change the PDF spacing.


      How might I do that?





          Grant Perkins

          Hi Jeff and welcome to the forum.


          PDFs can be a bit of a black art, and the standards of the PDF writers found scattered around the IT world can be variable. Even the text extraction facilities of Adobe Reader can struggle with some files. What does it do with this file? Knowing the answer may indicate something of the chances for success.


          Monarch provides some adjustment tools to help with this, not so much to make the extraction look pretty - more to make the text extractable for re-purposing. Have you played with the settings adjustments available? If so did they take you closer to a solution or further away?


          If I read it correctly, the sample you posted is the result of the Monarch extraction. Is that correct?


          If the adjustment tools don't offer a viable approach that splits things neatly into fields, a common alternative is to capture the problem parts of the line(s), sometimes entire lines, as a single field and then use calculated fields to slice them up again, provided the field content allows the rules for doing so to be defined.


          There may not be a one-size-fits-all solution for you here but there should be a solution of some sort that is better by far than re-keying everything! Let us know where you get to with answers to the questions. Is there a representative sample file that you could share for people to experiment with if necessary?




              rattlehead02 _



              Thanks for the reply! I've uploaded a PDF that I'm having issues with here: https://docs.google.com/viewer?a=v&pid=explorer&chrome=true&srcid=0B93nXDEmUOlPNDczODBlMGUtYjczNS00NGExLWIyMWQtZDcxNzg0NjhkZTE2&hl=en&authkey=CPWbuMoI


              Note that there are 3 "sections" in it. It's the first third that is giving me fits.


              Monarch's PDF import adjustment settings didn't help my problem, so for now I'll try the single-field, calculated fields method you mentioned.





                  rattlehead02 _

                  It's not the prettiest, I'm sure, but I managed to use calculated fields to get the results I wanted.


                  First I set the template up normally for all the fields I could. Then for Address and School District I combined them into a single field called "SplitField."


                  Then I created a calculated field to extract Address. Since the state and zip code info are of no use, and "OH" would be in nearly all fields, I used this formula to keep only the text before the first comma:


                  LSplit(SplitField, 2, ",",1)
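For anyone who wants to check the logic outside Monarch, here is a rough Python sketch of what that formula does. It assumes LSplit with these arguments returns the text before the first comma; the function name `address_part` is made up for illustration and is not a Monarch function.

```python
def address_part(split_field: str) -> str:
    """Mimic LSplit(SplitField, 2, ",", 1): split once at the first
    comma (from the left) and keep the first piece.

    Assumption: if there is no comma at all, this sketch returns the
    whole string; Monarch's behavior in that case is not checked here.
    """
    return split_field.split(",", 1)[0].strip()


if __name__ == "__main__":
    sample = "5473 Oak St Kings Mills, OH 45034                 Kings Local SD"
    print(address_part(sample))
```

The state, zip code, and school district all sit after the first comma, so they are dropped in one step.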


                  Here's where it gets ugly. To separate the school district I could rely on "OH" 99% of the time. And for the 1 percent where "OH" was no good I found that there were always a few extra spaces separating Address from School District. So, I hacked out this mess:


                  intrim(if(instr("    ", SplitField) > 0, rsplit(splitfield,2, "    ", 1),replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(replace(rsplit(SplitField,2, "OH",1),"0",""),"1", ""), "2", ""), "3", ""), "4", ""), "5", ""), "6", ""), "7", ""), "8", ""), "9", ""), "-", "")))
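Spelled out step by step, the formula reads roughly like the Python sketch below. This is only my reading of it, assuming RSplit with these arguments returns the text after the last occurrence of the separator and Intrim collapses runs of spaces; `school_district` is a made-up name, not something from Monarch.

```python
def school_district(split_field: str) -> str:
    """Approximate the calculated field for School District."""
    text = split_field.rstrip()
    if "    " in text:
        # A run of four spaces exists: keep what follows the last run.
        tail = text.rsplit("    ", 1)[1]
    else:
        # Fields are butted together: keep what follows the last "OH",
        # then drop the zip code digits and hyphens one at a time,
        # like the nested replace() calls in the formula.
        tail = text.rsplit("OH", 1)[-1]
        for ch in "0123456789-":
            tail = tail.replace(ch, "")
    # Collapse leftover whitespace (assumed to be what intrim does).
    return " ".join(tail.split())


if __name__ == "__main__":
    print(school_district("9989 Columbia Rd Deerfield Twp., OH 45140 Kings Local SD"))
```

Note that the digit stripping only runs in the "OH" branch, so a district name containing a numeral would survive whenever the four-space split applies.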


                  I am wondering if there isn't a more elegant way to replace more than one character at a time, or even to replace all numerals. I'd much rather use something like REPLACE(SplitField, "0123456789-", ""), but that treats "0123456789-" as one exact string rather than a set of characters.
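To show the kind of "replace any character from this set" behavior I mean, here is how it looks in Python. This is purely an illustration of the idea; I don't know whether Monarch 10.5 has an equivalent function, and `strip_chars` is a hypothetical helper name.

```python
def strip_chars(text: str, unwanted: str = "0123456789-") -> str:
    """Remove every character that appears in `unwanted`, treating it
    as a set of characters rather than one exact substring."""
    table = str.maketrans("", "", unwanted)
    return text.translate(table)


if __name__ == "__main__":
    leftover = " 45040-1607 Mason City SD"
    cleaned = strip_chars(leftover)
    # Collapse the spaces left behind where the zip code used to be.
    print(" ".join(cleaned.split()))
```

One call does the work of the eleven nested replaces above.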


                  All in all though, the task has been handled.