15 Replies Latest reply: May 15, 2014 10:13 AM by Grant Perkins RSS

    Can a template be made to work with badly formatted data?

    Assured _

      So I have a lovely piece of data coming to me in the form of 444 non-conforming pages.

       

      I have tried in vain to set up traps, only to have an error (too far left or right and running into another field) on at least one field.

       

      A hard "/" on the first date field almost worked, but 3 units of the 444 pages worth of data were non-conforming as you can see in the sample.

       

      The CA# field works best, but I need to use a multi-line field in order to capture the full "legal name" / "dba name" etc, and then my phone is jumping all around as well as other fields.

       

      I'm using 9.01, is there hope for me?  Any trapping/ field suggestions? (besides getting new data)

      The data is from the CA website, it is downloaded as a PDF (only), and that is how it looks from the get go!

      Thanks.

                                                                                      MCP Active Carrier Listing                                                                               

      Last Updated:   08/05/2011                                                                               

      CA #                   Legal Name                         Dba Name                   Business                                   Liability                               Date                Workers' Comp.                    Date         Private      For Hire                                                                               

      Phone                                Insurance Company                         Effective          Insurance Company                 Effective         Fleet         Fleet                                                                               

      0410833      A & S CONSTRUCTION INC                                                (209) 334-5200             GOLDEN EAGLE INSURANCE                                  05/04/2011       STATE COMPENSATION 12/30/2010 10                                                                               

      CORPORATION                                           INSURANCE FUND                                                                               

      0367653      A & S LANDSCAPING INC                                                        (818) 882-6998          ARCH INSURANCE COMPANY                              05/18/2011          TOWER INSURANCE 04/01/2011  1                                                                               

      COMPANY OF NY DBA                                                                               

      TOWER SELECT INSURAN                                                                               

      0147929      A & S MANUFACTURING &                             S & A                      (209) 394-8048          GOLDEN EAGLE INSURANCE                              10/01/2002          DELOS INSURANCE 04/01/2009                                                   

                                WELDING INC                    MANUFACTURING                                                                 CORPORATION                                        COMPANY NAIC 35408                                                                               

      0000025      A & S METAL RECYCLING                           A AND S               (213) 623-9443          ACE FIRE UNDERWRITERS INSURANCE 06/01/2011                                  STATE COMPENSATION 10/01/2003                                                                               

      INC                      ENVIRONMENTAL                                                        COMPANY                                                         INSURANCE FUND                                                                               

      RECOVERY                                                                               

      0223184          A & S MOLD AND DIE                                                       (818) 341-5393       ALLSTATE INDEMNITY COMPANY                             11/23/2001       STATE COMPENSATION 01/01/2004  2                                                

                         CORPORATION                                                                                INSURANCE FUND                                                                               

      0130144          A & S PRODUCE INC                    GREENLAND FOOD               (323) 260-5100             GOLDEN EAGLE INSURANCE                                  05/13/2011            FREMONT       07/22/1999  9                                                                               

      COMPANY                                                                  CORPORATION                                             COMPENSATION                                                                               

      INSURANCE COMPANY                                                                               

      0364432          A & S PUMP SERVICE                                                (559) 322-5838             STAR INSURANCE COMPANY                                  12/22/2010      NATIONAL UNION FIRE 09/01/2010 10                                                                               

      INSURANCE COMPANY                                                                               

      OF PITTSBURGH PENNSY                                                                               

      0290647        A & S TRANSPORTATION                                                       (805) 218-3292      NEW YORK MARINE AND GENERAL                             03/22/2011          DELOS INSURANCE 08/27/2007                                                                               

      INC                                                                                INSURANCE COMPANY                                      COMPANY NAIC 35408                                                                               

      0342868          A & S TRUCKING                                                                  -                                    UNITED FINANCIAL CASUALTY COMPANY04/21/2011                                                                               

      0264043          A & S TRUCKING LLC                                                              -            SUA INSURANCE COMPANY 40134                             10/05/2008                                                                                1                                                                               

      0367457         A & T ARBORISTS AND                                                (805) 431-2581        HARTFORD UNDERWRITERS INSURANCE                              03/01/2010       STATE COMPENSATION 07/01/2011  5                                                

                        VEGETATION MGMT INC                                                                                COMPANY                                                         INSURANCE FUND                                                                               

      0250821        A & T HARVESTING AND                                                       -              FINANCIAL PACIFIC INSURANCE COMPANY                          03/22/2011       STATE COMPENSATION 04/19/2011                                                                               

      TRUCKING                                                                                INSURANCE FUND                                                                               

      Page 9 of 444                                                                               

      /CODE

        • Can a template be made to work with badly formatted data?
          elginreigner _

          What are your PDF input settings? Do you have monospaced or freeform checked? I typically start with the autoadjust option, then checking off monospace.

            • Can a template be made to work with badly formatted data?
              Assured _

              What are your PDF input settings? Do you have monospaced or freeform checked? I typically start with the autoadjust option, then checking off monospace.[/QUOTE]

               

              I tried both monospaced and the other button, along with the auto adjust, the size option at the top left.  Most everything.  Looking at the original information, the formatting just stinks. Even grabbing a new copy, the data is jumbled, and fields overlap.

               

              I've got 98% done with success at least, so, some manual correction is in order for the last few of the 5500 plus items. 

              I was hoping there was a silver bullet in there somewhere.

                • Can a template be made to work with badly formatted data?
                  Olly Bond

                  Hello there,

                   

                  There's no document that can be read by a human that can't be parsed with Monarch, or at least I've not found one yet, so if you can email me your model and report I'll have a bash.

                   

                  Best wishes,

                   

                  Olly

                    • Can a template be made to work with badly formatted data?
                      Assured _

                      OK Olly, sounds great, data on the way

                        • Can a template be made to work with badly formatted data?
                          Grant Perkins

                          I'm not sure what you are getting from the downloaded PDF, Assured, but I get a Table representation which seems to have centre aligned data.

                           

                          Out of interest I tried am Adobe Reader 'Save as Text' run and got the document presneted as if read in columns  but with the column headers below the column data. Interesting idea but I'm not sure if I could make use of it.

                           

                          However opeing in Monarch using the v10 or 10.5 pdf engine gives a pretty good greenbar style representation of the data for which it looks pretty easy to position fields in columns with a few small lissues to deal with. (So far ...)

                           

                          The common issue would be the field centered alignment. That would require a calculated field in addition to the extracted field in order to strip leading spaces and any odd gaps that the regular multi-line field extraction may not deal with. Easy to do.

                           

                          Slightly more complicated is that sometimes the insurance company name columns and their following date columns may 'merge' on the first line. Not always though. It seems to be something to do with the fill of the first line within the centring of the name string.

                           

                          Another calculated field (or 4 I think in this case) should be the answer to that. Slightly more complex that straight formating but not too bad from what I have seen so far.

                           

                          However ... if you are not seeing something that looks pretty much like a table without the 'table lines' when you are assessing the results of Monarch's PDF interpretation then we need to take a step back and find out what is going on there. I did look at the results form the 9.01 PDF engine and whilst they looked not quite as good as the 10 or 10.5 results they seemed much better than the results from the sample you posted so I'm wondering what is different. You certainly don't want Monospaced or Free Form set for use. Experimenting with the other adjustments (the "Stretch" slider in V10) shows that the default analysis seems to work pretty well.

                           

                          Let's see if Olly has the same results that I see.

                           

                           

                          Grant

                            • Can a template be made to work with badly formatted data?
                              Olly Bond

                              Hello,

                               

                              I've had a first bash at it - the CA column is very well behaved, with 5324 unique rows as you'd expect, but as for the rest of the data, it's got some annoying inconsistencies. Not every record has a Liability insurer, for example, so hanging a floating trap on one date can give you false positives.

                               

                              It's do-able - as with some judicious use of Trim and Instr you can attack the data multiple ways, and glue it together using a the CA record as the key, but I don't think that one detail template is going to suffice.

                               

                              I think you need five models - one to pick out CA and Legal & DBA names, with a bit of a clean-up, another to fetch in those with phone numbers, another to fetch the Liability insurer, another to fetch those with  a Workers insurer, and a final one which grabs the crucial measures. From the first, make external lookups to the rest, and you've got your data.

                               

                              I'd be happy to do it, but it's a couple of hours work, and I've promised another forum user (Oli from Ingolstadt) that I'll have a look at a tricky model for him first, so I'd like to leave this one to the weekend.

                               

                              Best wishes,

                               

                              Olly

                                • Can a template be made to work with badly formatted data?
                                  Grant Perkins

                                  It's interesting to see Olly's reply.

                                   

                                  I downloaded a couple of PDFs directly from the  MCP site and found them to be reasonably well behaved.

                                   

                                  The first 4 columns (CA# thru Business Phone) and the last 2 (The 'fleet' numbers) look OK so far.

                                   

                                  The remaining 4 columns of Ins Co names and dates are less well behaved but not too bad (so far ...)

                                   

                                  A simple trap for having something in the CA number position seems to pick the lines. The fields above can be mapped directly (so far). The 4 fields in the middle require a little more work.

                                   

                                  My current approach is to create a single field for each Ins Co. Name and related Date column. i.e. to go for 2 fields not 4. Make then multi-row fields using the field Advanced Properties. ( I originally went for 4 fields and some tweaks for misfits but there are enough odd misfits to suggest doing this differently.)

                                   

                                  The multi row fields created can then be 'sliced and diced' quite easily to get the Dates Effective out of them into its own field and then to remove the date (where it exists) from the middle of the name string.

                                   

                                  I suspect there may be some potential for a few even messier records - in which case I would consider taking all 4 columns into a single field and then seek to slice and dice that unless a few tweaks to the PDF interpretation will overcome the anomalies OR, for some reason, the errant data could be ignored for the extraction.

                                   

                                  However, before describing that I have to say that what I see on screen is much better formatted then the posted data sample so I am wondering if there are other factors involved here that mean my model will be of less use to the file extractions  that you are seeing.

                                   

                                  Grant

                                    • Can a template be made to work with badly formatted data?
                                      Assured _

                                      It's interesting to see Olly's reply.

                                       

                                      I downloaded a couple of PDFs directly from the  MCP site and found them to be reasonably well behaved.

                                       

                                      The first 4 columns (CA# thru Business Phone) and the last 2 (The 'fleet' numbers) look OK so far.

                                       

                                      The remaining 4 columns of Ins Co names and dates are less well behaved but not too bad (so far ...)

                                       

                                      A simple trap for having something in the CA number position seems to pick the lines. The fields above can be mapped directly (so far). The 4 fields in the middle require a little more work.

                                       

                                      My current approach is to create a single field for each Ins Co. Name and related Date column. i.e. to go for 2 fields not 4. Make then multi-row fields using the field Advanced Properties. ( I originally went for 4 fields and some tweaks for misfits but there are enough odd misfits to suggest doing this differently.)

                                       

                                      The multi row fields created can then be 'sliced and diced' quite easily to get the Dates Effective out of them into its own field and then to remove the date (where it exists) from the middle of the name string.

                                       

                                      I suspect there may be some potential for a few even messier records - in which case I would consider taking all 4 columns into a single field and then seek to slice and dice that unless a few tweaks to the PDF interpretation will overcome the anomalies OR, for some reason, the errant data could be ignored for the extraction.

                                       

                                      However, before describing that I have to say that what I see on screen is much better formatted then the posted data sample so I am wondering if there are other factors involved here that mean my model will be of less use to the file extractions  that you are seeing.

                                       

                                      Grant[/QUOTE]

                                       

                                      if you have looked directly at the site, you know that the data is bad out of the box.  I think for the letter "d" on the first page the name actually continues into the date filed and types over the 2 digits in the month.

                                        • Can a template be made to work with badly formatted data?
                                          Grant Perkins

                                          if you have looked directly at the site, you know that the data is bad out of the box.  I think for the letter "d" on the first page the name actually continues into the date filed and types over the 2 digits in the month.[/QUOTE]

                                           

                                          Hadn't seen that in the reports I downloaded but looking at the 'd' report specifically the problem is clear.

                                           

                                          I'm surprised in some ways. It's been a while since I can recall anything quite this poor from a PDF writing engine that is only attempting to produce a tabular output and which, quite clearly, has been designed to wrap long names.

                                           

                                          Another file to experiment with ...

                                           

                                           

                                          Grant

                                            • Can a template be made to work with badly formatted data?
                                              Grant Perkins

                                              Hadn't seen that in the reports I downloaded but looking at the 'd' report specifically the problem is clear.

                                               

                                              I'm surprised in some ways. It's been a while since I can recall anything quite this poor from a PDF writing engine that is only attempting to produce a tabular output and which, quite clearly, has been designed to wrap long names.

                                               

                                              Another file to experiment with ...

                                               

                                               

                                              Grant[/QUOTE]

                                               

                                              That said, the first example of the overlap problem that Monarch comes to on the 'd' list is not a problem in the extraction. Name and date are well separated.

                                               

                                              To be honest this is what one would expect since the PDF data should be separate (in a straight text pdf rather than a 'graphics image' pdf) and it will be the display layer that it not interpreting the display instuctions correctly. (Sort of).

                                               

                                              More surprising is the well separated name and date fields (the same 2 columns) which ARE separated in the table view but are displayed as a contiguous string after extraction analysis.

                                               

                                              It suggests that there is something not quite right in the way the PDF engine is dealing with those two columns when it is doing its encoding.

                                               

                                               

                                              Grant

                                          • Can a template be made to work with badly formatted data?
                                            Grant Perkins

                                            It's interesting to note that the sample report posted by Assured at the top of the thread looks like it is complletely mixed up BUT, in fact, if you click on 'Reply' to the thread and then scroll down to see the previous posts the result looks much more reasonable.

                                             

                                            For some reason my extraction looks better than Assured's but not by much.

                                             

                                            The only difference I can see is that I am using 10.5 rather than 9.01 (albeit with the 9.01 'engine' ) and it is just possible that the slightly different control method offered is making the difference to the extraction. Just enough to make life easier.

                                             

                                             

                                            Grant

                                              • Can a template be made to work with badly formatted data?
                                                Assured _

                                                So the additional model was a great idea, I just created a single trap of the CA number and a fairly broad character trap to grab the first effective date.  Simple calculated field right(effective, 10) and I have perfect dates. I can join them on the common CA number later...unless it is possible to run multiple models at the same time? or a macro to run them for me. Ha, then I need to run them on all the rest of the letters of the alphabet.

                                                 

                                                Thanks for a great kick in the right direction.

                                                  • Can a template be made to work with badly formatted data?
                                                    Grant Perkins

                                                    So the additional model was a great idea, I just created a single trap of the CA number and a fairly broad character trap to grab the first effective date.  Simple calculated field right(effective, 10) and I have perfect dates. I can join them on the common CA number later...unless it is possible to run multiple models at the same time? or a macro to run them for me. Ha, then I need to run them on all the rest of the letters of the alphabet.

                                                     

                                                    Thanks for a great kick in the right direction.[/QUOTE]

                                                     

                                                    If you download all of the pdf files you should be able to open several of them at the same time - maybe all of them depending on how big it all gets.

                                                     

                                                    Assuming the extracted format inconsistencies are, er, consistent (i.e. there are no other strange formatting events due to specific start letters or something equally unlikely), then a common model (or models) should suffice for all of them so long as the same PDF generator engine is used by the source.

                                                     

                                                     

                                                    Grant

                                              • Can a template be made to work with badly formatted data?
                                                Assured _

                                                Very smart, I had not thought about writing multiple models, with the same key, pulling specific items per model, and joining all the output.

                                                 

                                                Genius.