3 Replies. Latest reply: Sep 23, 2016 10:22 AM by Data Kruncher

    Trying to mine names and addresses (tricky)

    MikeMetta

      I have converted some PDF files into two-column text with Monarch, but the data is inconsistent. Here is a representative sample. Each PDF had a graphic in the upper left-hand corner, which is why there is a blank area there:


                                                                              Smith Company                                                                               

      10 Company Street, Sacramento CA 95630-6798                                                                               

      Telephone: (555) 555-0000                                                                               

      URL: www.website1.com

      J and K Company                                                                               

      123456 New Road, Los Altos Hills CA 94022-4599                                                                               

      Telephone: (555) 555-2468                                                                               

      URL: www.website2.com

      East Company                                                         XYZ Company                                                      

         Branch campus of North Orange County Community Company District      Subsidiary of Johnson and James Company District Office located in      

         located in Anaheim, CA                                               Fresno, CA                                                               

         9999 East Street, Cypress CA 90630-5897                              1111 East XYZ Avenue, Fresno CA 93741-0002                           

         Telephone: (555) 555-2222                                            Telephone: (555) 555-1357                                                

         URL: www.anotherwebsite.com                                          URL: www.website3.com

      West Company                                                         ABC123 Company                                                        

         112233 West Boulevard, Cupertino CA                            Subsidiary of Smith and Jones Company District Office located in         

         95014-5793                                                    Anaheim, CA                                                   

         Telephone: (555) 555-4444                                       98765 Another Avenue, Fullerton CA 92832-2095                            

         URL: www.websitehere.com                                             Telephone: (555) 555-7777

      URL: www.website4.com



      Each entry may or may not have 2 extra descriptive lines of information after the company name.


      There is always a telephone number and a URL for each entry.


      Sometimes the zip code appears on its own line.


      This is just one part of one of many PDF pages (separate files), but as far as I know there are no other anomalies in the data besides the ones listed above.


      Could you please help me create model(s) to extract the data? Thank you.

        • Trying to mine names and addresses (tricky)
          Olly Bond

          Hello Mike,


          I fear you might need two passes at this one.


          Firstly, create a two-column multi-column region (MCR) that fits the PDF scaling, or use the trick for handling variable column widths that is written up elsewhere.


          Then trap using a blank in an empty column to capture every single line as one row of data.


          Then use the Page(), Line() and Column() functions to order the data correctly, and re-export it as a single-column, fixed-width text report, possibly using a summary with Page() as a key value.


          Then use standard address block trapping.
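Outside Monarch, the same two-pass idea can be illustrated in a few lines of Python. This is only a rough sketch, not the Monarch model itself: it assumes the two columns can be split at one fixed character position (the hypothetical `split_at` parameter), tags every line with its page, column and line number, and then sorts so that the second column reads as if it sat underneath the first. The function name `to_single_column` and the sample data are made up for illustration.

```python
def to_single_column(pages, split_at):
    """Flatten two-column pages into one column.

    pages    -- list of pages, each a list of text lines
    split_at -- assumed character position where column 2 starts
    """
    rows = []
    for page_no, page in enumerate(pages, start=1):
        for line_no, line in enumerate(page, start=1):
            halves = (line[:split_at], line[split_at:])
            for col_no, text in enumerate(halves, start=1):
                rows.append((page_no, col_no, line_no, text.strip()))
    # Order by page, then column, then line, so that column 2
    # follows column 1 within each page.
    rows.sort(key=lambda row: row[:3])
    return [text for _, _, _, text in rows]


# Made-up sample page, padded so that column 2 starts at position 30.
sample_page = [
    "East Company".ljust(30) + "XYZ Company",
    "9999 East Street".ljust(30) + "1111 East XYZ Avenue",
    "Telephone: (555) 555-2222".ljust(30) + "Telephone: (555) 555-1357",
]
single_column = to_single_column([sample_page], split_at=30)
```

Blank halves are kept as empty strings, so the blank lines that separate records survive the reordering.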





            • Trying to mine names and addresses (tricky)
              Grant Perkins

              Hi Mike and welcome to the forum.


              Taking Olly's lead and using your sample layout, it looks like you need a maximum of 6 lines to create a single Detail template that captures all fields using the MCR concept.

              The smallest record will, presumably, be 4 lines, and there are 2 blank lines between records. If that is consistent - big if? - then this idea may work ... or it may not.


              Set up the MCR stuff and then select a six-line sample ending with a URL line (for reference). Create a trap using the word "Telephone:" and tell Monarch that it is on the 5th line of the 6-line detail block.


              'Paint' the fields for the phone number and URL on lines 5 and 6. Be sure the URL field is wide enough for all possibilities!


              On line 1, paint a full-width field (the full width of the column) for the address. Right-click the field, go to the advanced properties and set it to end after 4 lines. OK that, and head for the table to see what it gives you.
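For readers who want to see the logic of that template outside Monarch, here is a minimal Python sketch. It assumes the text has already been reduced to a single column, anchors on the "Telephone:" line, treats the (up to 4) non-blank lines before it as the name and address block, and takes the next "URL:" line as the URL. The function name `parse_records` and the dictionary keys are hypothetical, and treating the first line of the block as the company name is an assumption of this sketch rather than something the template does.

```python
def parse_records(lines, max_block_lines=4):
    """Group single-column lines into records anchored on 'Telephone:'."""
    records = []
    block = []  # non-blank lines collected since the previous record
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # blank separator lines carry no data
        if line.startswith("Telephone:"):
            block = block[-max_block_lines:]  # keep at most 4 lines
            records.append({
                "name": block[0] if block else "",
                "rest": block[1:],  # description and/or address lines
                "phone": line[len("Telephone:"):].strip(),
                "url": "",
            })
            block = []
        elif line.startswith("URL:") and records and not records[-1]["url"]:
            records[-1]["url"] = line[len("URL:"):].strip()
        else:
            block.append(line)
    return records


# Made-up input in the style of the sample, zip code on its own line.
sample_lines = [
    "West Company",
    "112233 West Boulevard, Cupertino CA",
    "95014-5793",
    "Telephone: (555) 555-4444",
    "URL: www.websitehere.com",
    "",
    "",
    "Smith Company",
    "10 Company Street, Sacramento CA 95630-6798",
    "Telephone: (555) 555-0000",
    "URL: www.website1.com",
]
parsed = parse_records(sample_lines)
```

Because the address lines are kept as a list in `rest`, a zip code that wrapped onto its own line can be rejoined later with `" ".join(...)` once the descriptive lines have been separated out.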


              Using the sample, I had a slight problem with the record at the top of the second column because there are not enough lines available to fit the template. (Think of the second column being shifted to sit under the first column and read as one long page.) Adding 2 or 3 blank rows to the top or bottom of the report fixed that. You may not have the problem with your extract from PDF. If you do ... hmm. You could create the text from the conversion and edit in a couple of extra lines (or automate that and concatenate them), but whether that is a practical solution for you I don't know. It depends on how many files you have and how often the process is required.
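If the padding does turn out to be necessary for many files, it could be automated before the text reaches Monarch. The snippet below is a small sketch of that idea only; the function name `pad_report` and the default of 3 blank lines are assumptions, and reading and writing the actual files is left out.

```python
def pad_report(text, blank_lines=3):
    """Return the report text with blank lines added at top and bottom."""
    padding = "\n" * blank_lines
    # Normalise the ending to exactly one newline before adding padding.
    return padding + text.rstrip("\n") + "\n" + padding


padded = pad_report("Smith Company\nTelephone: (555) 555-0000\n", blank_lines=2)
```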


              I'll leave it at that point for now, since if that does not work for you, the rest of the process, such as it is, will be no help either.







                • Trying to mine names and addresses (tricky)
                  Data Kruncher

                  Greetings all.


                  I have a solution to this challenge that is different enough from those that have been posted here to date to warrant suggesting it as an alternative.


                  As certain components of the solution reinforce some topics recently posted on [URL is no longer valid][URL is no longer valid], I've posted the details of this proposed solution as part of the [URL is no longer valid] series.