11 Replies Latest reply: May 15, 2014 10:07 AM by Grant Perkins

    Replacing text with blanks when importing a PDF

    Kyle_M _

      I have a client using Monarch 9 having an issue when importing a PDF.

       

      The PDF contains City/Town names in one column, and any city name containing the text "field" has that text replaced with spaces/blank characters. So "Enfield" becomes "En     ", "Richfield" becomes "Rich     ", etc. The same substitution happens to "Efland", which becomes "E     ".

       

      I have a copy of the PDF and have reproduced the problem on Monarch 9; the same thing happens with every PDF Conversion option. I tried Monarch 10.5 to see if it had been fixed, but PDF Conversion option 10.00/10.50 changes "Enfield" to "En  eld" and "Efland" to "E  and".

       

      I tried copying the data into MS Word and generating a PDF from it to see if I could replicate the replacement of data with blanks, but there was no problem with the PDFs I created.

       

      Any ideas on what is causing the problem?

        • Replacing text with blanks when importing a PDF
          Olly Bond

          Hello Kyle,

           

          My first thought was that the client had perhaps used "field" as a floating trap and so Monarch was selecting everything to the left and right of that string, but from your investigation that doesn't sound likely.

           

          In Monarch v10, can you open an XPS of the PDF cleanly? And in Acrobat Reader, can you discover any more information about the PDF generator? (File > Properties usually reveals the version of PDF and the distiller program used to create it.)

           

          Best wishes,

           

          Olly

            • Replacing text with blanks when importing a PDF
              Grant Perkins

              Just to add a couple of ideas to Olly's suggestions.

               

              If there is a problem with the file, you may discover it by opening the PDF in Acrobat Reader. It will sometimes tell you if it finds potential problems.

               

              If you have a suitable program for editing PDF files load the problem file and re-write it (force it to re-write by editing if necessary, the edit does not matter for your purpose here) and see if the resulting output gives a different result. Many posts in the forum suggest that it probably will - but it's not guaranteed.

               

              HTH.

               

               

              Grant

              • Replacing text with blanks when importing a PDF
                Kyle_M _

                The client just contacted me again because he discovered it's also happening with "Ruffin" and "Woodfin", which become "Ruf   " and "Wood   " in 9 and "Ruf  n"/"Wood  n" in 10.

                 

                Properties from Adobe Reader:

                Application: Adobe InDesign CS2 (4.0.5)

                 

                PDF Producer: Adobe PDF Library 7.0

                PDF Version: 1.6 (Acrobat 7.x)

                 

                I've tried saving the PDF as an XPS but opening the XPS in Monarch 10 results in the entire field being blank. I also tried printing to the XPS writer but checking the "Print to file" option so it saves as a .prn file. When I open the resulting .prn file in Monarch, Monarch crashes.

                  • Replacing text with blanks when importing a PDF
                    Grant Perkins

                    InDesign is, if I am not mistaken, an Adobe DTP/Graphics package. An unusual place to start a report for PDF output, maybe?

                     

                    I would guess that the PDF creator is either treating certain characters as graphics for some reason (Monarch will not read them), or what it is inserting is being interpreted as some sort of formatting command or program instruction rather than text.

                     

                    Does the Adobe PDF reader read the file OK? Can the Reader export (or is it Save?) the file to something that looks about right as text? (It may lose all the format but do the words appear correctly?)

                     

                     

                    Grant

                      • Replacing text with blanks when importing a PDF
                        Kyle_M _

                        Both Foxit and Adobe Reader read the file without issue. I saved the file as text from Adobe and all the names are there as expected (no spaces in place of certain text).

                         

                        It's weird that all instances of certain patterns of text are affected. What do "field", "fin" and "fland" have in common? Well, I checked, and all instances of "fi" are replaced with blanks in Monarch 10. Monarch 9 is much worse in that some arbitrary amount of text is removed, like "Foxfire Village" becoming "Foxillage". There are several other instances of "fl" that Monarch leaves as-is, so I'm unsure why Efland is altered since it doesn't seem to match.

                          • Replacing text with blanks when importing a PDF
                            Olly Bond

                            Hello Kyle,

                             

                            It's just struck me that this looks like a kerning pair problem. Design programs like InDesign store tables of pairs of letters for each font with instructions for the fine adjustment of the kerning space between the letters when they appear. Most letters are an "en" or an "em" wide, but very thin letters like "i" and "l" can be kerned close to the letters around them, especially after "f". So my hunch is that the PDF has got some special code like "En//fi//eld" where the // sets the kerning width. Monarch sees this and can't work out the right column to put the letters in so drops in blanks.

                             

                            Two, no three, possible solutions. Firstly, make sure that Monarch isn't thinking that the text is fixed-width. This will perhaps make Monarch look more carefully for character spacing issues. Secondly, increase the scaling, perhaps to a ridiculous value like 9.9, so that Monarch puts more space between each character. This may overcome the kerning instruction, but it may also inject more spaces than you want to have elsewhere. Thirdly, see if the PDF can be regenerated using a fixed width font, or, in Adobe InDesign, turning off the kerning options.

                             

                            HTH,

                             

                            Olly

                            • Replacing text with blanks when importing a PDF
                              Gareth Horton

                              Hi Kyle,

                               

                              The problem seems to be that the application is exporting the text as ligatures, rather than as “normal” text.

                               

                              This means that fi, ff and fl are each written into the PDF as a single "glyph", which has no equivalent in normal ASCII/ANSI text.

                               

                              It also appears it is not doing this consistently, as you saw from your investigations.

                               

                              See here for a discussion on ligatures:

                               

                              http://en.wikipedia.org/wiki/Typographical_ligature

                               

                              It seems a very bizarre way of producing a PDF.

                               

                              There is nothing we can do with this, unfortunately.

                               

                              Workarounds may be to do a binary search and replace on the character in the PDF for each ligature, expanding it to the text characters, or perhaps finding some PDF utility that can replace them with normal text.
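
A minimal sketch of the "expand each ligature to its text characters" idea, applied to text that has already been extracted from the PDF rather than to the PDF binary itself. It assumes the extractor emits the standard Unicode ligature code points (U+FB00 to U+FB04); the name `expand_ligatures` is hypothetical, and a PDF that encodes ligatures as font-specific glyph codes would need a different approach.

```python
# Map the common Latin "f" ligature code points to their plain-letter spellings.
# Assumption: the extracted text uses these Unicode presentation forms.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}

# str.maketrans accepts single-character keys mapped to replacement strings.
_TABLE = str.maketrans(LIGATURES)


def expand_ligatures(text: str) -> str:
    """Return text with each ligature character replaced by ordinary letters."""
    return text.translate(_TABLE)


if __name__ == "__main__":
    for name in ("En\ufb01eld", "E\ufb02and", "Ru\ufb03n"):
        print(expand_ligatures(name))
```

Python's `unicodedata.normalize("NFKC", text)` performs the same expansion for these characters, but it also rewrites other compatibility characters, so an explicit table keeps the change narrow.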

                               

                              It would also be worth checking if the producing application has an option to suppress ligatures.

                               

                              EDIT: Just found a link on how to disable ligatures in Adobe InDesign

                               

                              "To (disable ligatures) globally, first close all InDesign documents. Then open the Character palette and uncheck the "Ligatures" menu item.

                               

                              Now if you create a new document, you'll see that Ligatures should now be unchecked by default."

                               

                              http://www.mombu.com/computer_design/indesign/t-indesign-cs3-and-glyphs-2674328.html

                               

                              Gareth

                               

                               


                                • Replacing text with blanks when importing a PDF
                                  Kyle_M _

                                  I tried using a hex editor to replace the ligatures with other Unicode characters, but it didn't work. The data within the PDF is a binary object (BLOB), and I suspect it's compressed, so you can't change the binary directly in that manner.
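
A rough sketch of why a hex editor sees nothing searchable: PDF content streams are commonly Flate (zlib) compressed, so the text only becomes visible after inflating them. This assumes plain FlateDecode streams delimited by `stream` / `endstream`; the function name `inflate_streams` is hypothetical, and real files may use other filters, object streams, or font-specific glyph codes that this does not handle.

```python
import re
import zlib
from typing import List

# Capture the raw bytes between the "stream" and "endstream" keywords.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)\r?\nendstream", re.DOTALL)


def inflate_streams(pdf_bytes: bytes) -> List[bytes]:
    """Return the inflated payload of every stream that zlib can decode."""
    inflated = []
    for match in STREAM_RE.finditer(pdf_bytes):
        raw = match.group(1)
        try:
            # decompressobj tolerates a trailing byte trimmed by the regex.
            inflated.append(zlib.decompressobj().decompress(raw))
        except zlib.error:
            # Not Flate data (or a different filter chain); skip it.
            continue
    return inflated


if __name__ == "__main__":
    # Hypothetical stand-in for a real file: one compressed text-showing stream.
    payload = b"BT (Enfield) Tj ET"
    fake_pdf = (
        b"1 0 obj\n<< /Filter /FlateDecode >>\nstream\n"
        + zlib.compress(payload)
        + b"\nendstream\nendobj\n"
    )
    print(inflate_streams(fake_pdf))
```

Even with the streams inflated, editing them in place would also mean recompressing and fixing the stream lengths and cross-reference table, which is why a dedicated PDF utility is the more realistic route.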

                                   

                                  Neither I nor the client had access to InDesign, so we couldn't try out that option.

                                   

                                  In the end I offered the client two text files, one produced via Adobe Reader, the other via A-PDF Text Extractor. The Adobe Reader version has all the correct text, but virtually no formatting. The A-PDF version has a fair bit of the formatting, but was missing all the letter "f"s from the ligatures. I manually went through it and reinserted the missing "f"s for them. I let the client know about the missing letters and that there could be other missing text that I didn't catch, so it was up to them which version they wanted to use.
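
One way to reduce the risk of missed text is to cross-check the two extractions automatically. The sketch below assumes the Adobe Reader text is the trusted reference and flags any word in the other extraction that never appears in it; the function name `suspicious_words` and the sample strings are hypothetical.

```python
import re

# Treat runs of ASCII letters as words; digits and punctuation are ignored.
WORD_RE = re.compile(r"[A-Za-z]+")


def suspicious_words(reference_text: str, candidate_text: str) -> list:
    """List words in candidate_text that do not occur in reference_text.

    Comparison is case-insensitive; each flagged word is reported once,
    in order of first appearance.
    """
    known = {word.lower() for word in WORD_RE.findall(reference_text)}
    seen = set()
    flagged = []
    for word in WORD_RE.findall(candidate_text):
        key = word.lower()
        if key not in known and key not in seen:
            seen.add(key)
            flagged.append(word)
    return flagged


if __name__ == "__main__":
    reference = "Enfield Richfield Durham"   # e.g. text saved from Adobe Reader
    candidate = "Enield Richield Durham"     # e.g. extraction with dropped letters
    print(suspicious_words(reference, candidate))
```

This only catches damage that produces a word absent from the reference; a dropped letter that happens to leave another valid word would still need a manual check.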

                                    • Replacing text with blanks when importing a PDF
                                      Grant Perkins

                                      It may not be worth it, but is there any potential for taking the Adobe Reader version of the text and re-formatting it back to the original layout using Monarch?

                                       

                                      Or, perhaps more pertinently, is the client doing anything with the converted PDF in terms of analysis? If so, could the analysis be obtained from the Adobe Reader derived text version? Whilst not great, at least you seem to feel it has all the right characters in the right places.

                                       

                                       

                                      Grant