13 Replies. Latest reply: Aug 16, 2016 4:13 AM by Chris Porthouse

    PDF Fields inconsistent horizontally

    Craig Kortlandt

      Example Report 1:

      Sectors                     2016 2017 2018 2019 20,020 2021 2022 2023 2024 2025+

       

      Example Report 2:

          Sectors             2016    2017 2,018 2019      2020 2021 2022 2023 2024    2025+

       

      Example Report 3:

      Sectors             2016         2017      2018      1000,002,019      2020      2021      2022      2023      2024      2025+


      I'm bringing in a PDF that has consistent spacing when viewed in Acrobat Reader.


      When it is brought into Monarch, the spacing is inconsistent, as I tried to illustrate above.


      I believe I've figured out how to tell Monarch which lines to look at with a floating trap, but I'm not sure how to set the fields.

       

      The field size is inconsistent as well; it could be anywhere from 1 to 15 digits.

       

      Any ideas?

        • Re: PDF Fields inconsistent horizontally
          Austin Perkins

          Craig,

           

          Can you share what version of Monarch you are using and if you have Data Prep Studio?

           

          Austin

          • Re: PDF Fields inconsistent horizontally
            Grant Perkins

            Hi Craig,

             

            Assuming your sample line is a representation of an actual line extracted ... there are clearly some problems with what is appearing line by line and they look likely to make the next processing steps unnecessarily complicated.

             

            What do you get if you use something like Adobe PDF Reader or Acrobat to extract the text, as a test?

             

            Some other applications will make an attempt to convert to text too.

             

            The internal structure of a PDF can often be nasty even if it looks OK when viewed.

             

            Grant

              • Re: PDF Fields inconsistent horizontally
                Craig Kortlandt

                It's quite a mess in text.  I think I'm headed down Austin's path of grabbing complete lines and separating into columns.

                 

                (I've also inquired with the author of the PDF about grabbing the base data this report is made out of.)

                  • Re: PDF Fields inconsistent horizontally
                    Grant Perkins

                    Seeking an alternative source type is never a bad idea for a PDF when you cannot be sure that it will be consistent in itself page by page (etc.), let alone day by day, week by week, month by month.

                     

                    Whole report lines (or even text blocks in some circumstances) and slicing and dicing can be a very successful technique indeed. In fact there are times when it is easier to go straight for that approach than to try to work with columns and floating traps.
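
To make the "whole line, then slice and dice" idea concrete outside Monarch, here is a rough Python sketch. It is not Monarch syntax. It assumes the row label ("Sectors") is a single word and that columns are separated by one or more spaces, which holds for the three sample lines but may not hold for a real report. The name slice_line is made up for the illustration.

```python
# Sample lines modelled on the three examples in the original post.
SAMPLE_LINES = [
    "Sectors                     2016 2017 2018 2019 20,020 2021 2022 2023 2024 2025+",
    "    Sectors             2016    2017 2,018 2019      2020 2021 2022 2023 2024    2025+",
    "Sectors             2016         2017      2018      1000,002,019      2020      2021      2022      2023      2024      2025+",
]


def slice_line(line):
    """Split one whole report line into (label, list of column tokens).

    Assumption: the label is the first whitespace-delimited word and every
    later word is one column, however many spaces sit between them.
    """
    tokens = line.split()  # split() with no argument collapses runs of spaces
    return tokens[0], tokens[1:]


for sample in SAMPLE_LINES:
    label, columns = slice_line(sample)
    print(label, len(columns), columns)
```

All three variants come out with ten column tokens, so the spacing problem goes away; the spurious values are still in the tokens and need separate handling.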

                     

                    However ....

                     

                    In your sample lines you seem to have some spurious data values on 2 lines that make an extra mess in the year numbers.

                     

                    If that is reality that makes life much more difficult when trying to slice and dice the fields you need.

                     

                    If the "inserts" are variable in their appearance in a line or in the content they insert, things may get even more complicated - especially if your objective is to produce a model that can be applied, time after time, with no further adjustments.

                     

                    It may well be possible to make such a model - but if the incoming file is inconsistent it's difficult to be certain about that.

                     

                    If that IS a problem and no alternative better source can be made available then one strategy is to create a model that allows for interaction when used but endeavours to keep the effort required to a minimum and has built in "fix" options.

                     

                    As a simple example here - if the insert data always trashes the reporting of the same year position, you could make the model use a calculated field for that year that simply inputs the value you need. Remember to change it as the years change over!

                     

                    Alternatively simply provide an option of a field for the user to input the value to appear in a "bad" field and replace the extracted string that way using a conditional value formula field or similar.
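
As a rough illustration of that conditional replacement, here is a Python sketch (again, not Monarch formula syntax). It assumes the tokens have already been sliced out of the line, that a "good" value is exactly four digits with an optional trailing "+", and that the user supplies the first year; repair_years and first_year are hypothetical names.

```python
import re

# Assumption: a valid header value is four digits, optionally followed by "+".
YEAR = re.compile(r"\d{4}\+?")


def repair_years(tokens, first_year):
    """Keep tokens that look like a year; replace the rest by position.

    first_year is the user-supplied value for the first column, so the
    expected value for column n is first_year + n.
    """
    repaired = []
    for offset, token in enumerate(tokens):
        if YEAR.fullmatch(token):
            repaired.append(token)  # extracted value is usable as-is
        else:
            repaired.append(str(first_year + offset))  # "bad" field: substitute
    return repaired


print(repair_years(["2016", "2017", "2018", "2019", "20,020", "2021"], 2016))
```

The user-supplied starting year is the one thing that has to be maintained as the years roll over.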

                     

                    There are several options. Which to use to be most effective depends on circumstances and requirements on a case by case basis.

                     

                    HTH.

                     

                    Grant

                • Re: PDF Fields inconsistent horizontally
                  Chris Porthouse

                  Just to throw my 2 cents in as well... Austin's suggestion is good for DPS, but if you want to stick with classic, you can always try a regular expression trap, which is pretty good at dealing with "floating" and "space" issues.

                    • Re: PDF Fields inconsistent horizontally
                      Grant Perkins

                      Chris,

                       

                      I'm not sure that the majority of existing or new Monarch users will be familiar with regex concepts.

                       

                      I don't recall seeing anything mentioned - I may have missed it - but is there a tutorial for Regex functionality that people refer to?

                       

                      Craig's sample lines are representative of the sort of challenges that emerge from PDF files on a regular basis. They make a good basis for a basic use case example of a solution - especially in a situation where the expectation and preference would be for one model to understand and correctly interpret all three line variations.

                       

                      If you could put together a brief description of how Regex could be best deployed in this case it would surely provide a useful introduction for those who are not familiar with Regular Expressions.

                       

                      Would that be possible?

                       

                      It would be great to see an uptick in interest generated through the community.

                       

                       

                      Grant

                      • Re: PDF Fields inconsistent horizontally
                        Austin Perkins

                        I always tend to call out Chris as being able to do a much better job of writing RegEx syntax, but I created an example text file with the 3 format variations from the original post and then trapped it with RegEx. I used the following syntax:

                         

                        .\s*(\d+\,?\d+?\,?\d+)\s+(\d+\,?\d+?\,?\d+)\s+(\d+\,?\d+?\,?\d+)\s+(\d+\,?\d+?\,?\d+)\s+(\d+\,?\d+?\,?\d+)

                         

                        It's not pretty but it grabs it all. The important thing to note here is that when you right-click each "orange" highlighted field and create a field from the capture, be sure to expand the width of the capture for each field so that it is as wide as the largest number you may end up with. Don't worry about it being so wide that it runs into the next field, because it should stop at the beginning of the next field.
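
For anyone who wants to experiment with the expression outside Monarch, here is a small Python sketch that runs it against lines like the three in the original post. Python's re module is only a stand-in here; Monarch's regex engine may behave differently in places. The function name first_five_columns is made up. As posted, the expression has five capture groups, so it returns the first five columns of each line.

```python
import re

# The expression exactly as posted above, split over two string literals
# only to keep the line short. Five capture groups, one per column.
PATTERN = re.compile(
    r".\s*(\d+\,?\d+?\,?\d+)\s+(\d+\,?\d+?\,?\d+)\s+(\d+\,?\d+?\,?\d+)"
    r"\s+(\d+\,?\d+?\,?\d+)\s+(\d+\,?\d+?\,?\d+)"
)

SAMPLE_LINES = [
    "Sectors                     2016 2017 2018 2019 20,020 2021 2022 2023 2024 2025+",
    "    Sectors             2016    2017 2,018 2019      2020 2021 2022 2023 2024    2025+",
    "Sectors             2016         2017      2018      1000,002,019      2020      2021",
]


def first_five_columns(line):
    """Return the five captured values, or None if the line does not match."""
    match = PATTERN.search(line)
    return match.groups() if match else None


for sample in SAMPLE_LINES:
    print(first_five_columns(sample))
```

Each group needs at least three digits and tolerates up to two embedded commas, which is why values such as 2,018 and 1000,002,019 are captured whole regardless of the spacing around them.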

                         

                        Hope this helps.