Thank you, Olly, for the quick reply. I tried what you suggested, but it does not give me the results I want. It is not joining the part of a line that flows onto the next line. For example, on the sixth line down in column A, "Advertising, community health", I can't get the next line down, "education", added to the "Advertising, community health" line so that the whole entry reads as one: "Advertising, community health education".
It just adds it as a separate line below the one above.
Advertising, community health
Not like what I want:
Advertising, community health education
Thanks again for all the help you give on this forum.
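For anyone who wants to check the desired result outside Monarch, here is a minimal Python sketch of the wrapped-line join described above. It is not how Monarch does it. The rule it uses (a row whose column A text starts with a lowercase letter belongs to the row above) is only an assumption based on the "education" example, and the values in the second column are made up for illustration.

```python
def merge_wrapped_rows(rows):
    """Join a wrapped description onto the row above it.

    rows: list of (description, value) tuples in report order.
    Assumed rule (hypothetical): a description that starts with a
    lowercase letter is the tail of the previous description.
    """
    merged = []
    for description, value in rows:
        text = description.strip()
        is_continuation = bool(merged) and text[:1].islower()
        if is_continuation:
            prev_text, prev_value = merged[-1]
            # Keep whichever of the two rows actually carried a value.
            merged[-1] = (f"{prev_text} {text}", prev_value or value)
        else:
            merged.append((text, value))
    return merged


# Made-up sample rows; only the description text comes from the thread.
sample = [
    ("Advertising, community health", ""),
    ("education", "X"),
    ("Uniforms (hospital furnished):", ""),
]
result = merge_wrapped_rows(sample)
for row in result:
    print(row)
```

A rule this simple will also join lines that do not belong together if the report capitalizes inconsistently, so it is a starting point to adjust rather than a finished fix.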
Thanks again for the help. Adding the N trap did help, but the inconsistency of my data causes Column A to put together several lines that do not go together. My example did not show all of the crazy formatting that is causing the issues.
Again thank you for the help. I think I can work with your suggestions to get what I need.
PDF files very often produce extremely erratic text outputs no matter which of many available commercial tools you might try to use for the extraction.
Adobe's own tools, when used with files produced by their own software, are usually not too bad (unless the PDFs were written by some old versions of the Writer program), but anything produced by a third-party writer program, as often built into some large database-based applications, can vary from excellent to unusable.
There are usually ways around the problems but sometimes they are not at all obvious.
Is this a report you could share at all? Not publicly perhaps, but maybe on a secure server or perhaps with an anonymized "sample" version?
The file is a government produced file available to anyone.
Here is the link to the original file if you want to take a look. We need it in an Excel format for additional manipulation and reporting.
Well there are some "interesting" structure and formatting decisions but try these settings for the PDF interpretation stage of the process before going to model creation:
You'll be using Classic here.
Set Stretch to 6.3
I ended up with crop set to "1" but I don't think that much matters (other than that it may confuse any previously generated "Auto" definition).
PDF Engine set to the default 4.1
That looks like it traps most lines in good shape and in presentable columns. It will allow use of the multi-line method for field definition.
However there are a few sections where the regular format "rule" is "modified".
The examples seem to have a number of subsections which, in the normal way of things, would have been on their own lines.
Uniforms (hospital furnished):
Are examples. However, I noticed 2 sections under "repairs" that do not present in the same way.
I assume, based on the date of the document, that this is effectively a fixed and non-changing file for practical purposes.
If so I'm tempted to suggest that, although I can imagine ways to deal with this anomaly of style within Monarch, the pragmatic solution would be to consider excluding these lines from the processing to Excel and simply add them afterwards via cut and paste or similar.
However, if you have further processing to do on the extracted data before passing it to Excel there are some approaches that would be interesting to compare for ease of development and practicality in use.
One approach would be to identify a trap that would include the indented lines of the sections mentioned and treat them as "normal" lines.
Another might be to trap them as multiple text lines and then split them out into sets of calculated fields using the TEXTLINE() function. Ultimately those sets would need to be exported to become separate records in the Excel document. There are 2 or 3 practical ways to do that - especially for a one-off creation.
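To make the second idea concrete outside Monarch, here is a rough Python sketch of the same shape of operation: take one record holding a multi-line block of text and turn it into one record per text line, ready for export. This is an analogy only, not Monarch syntax. The field names ("section", "detail") and the sub-item text are hypothetical placeholders, not taken from the report.

```python
def explode_text_lines(record, field="detail"):
    """Turn one record with a multi-line field into one record per line.

    Blank lines are skipped and each line is stripped of the
    indentation it had in the original block.
    """
    lines = [ln.strip() for ln in record[field].splitlines() if ln.strip()]
    # Copy every other field unchanged onto each new record.
    return [{**record, field: line} for line in lines]


# Hypothetical record: the sub-item text is placeholder data.
block = {
    "section": "Uniforms (hospital furnished):",
    "detail": "first sub-item\n  second sub-item\n\nthird sub-item",
}
records = explode_text_lines(block)
for rec in records:
    print(rec)
```

Each output record repeats the parent section, which is usually what you want once the rows land in Excel and need to be filtered or pivoted independently.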
See how you get on with these ideas. If you get stuck we can pick it up again and see where it takes us.
Do you need to extend the output at all - for example by providing the text associated with the (a), (b), etc. notes from Page 3 - before sending on to Excel? Or maybe add your locally relevant codes to fully and explicitly interpret the notes?
Thank you for diving further into this Grant. I’ll look at your suggestions. Since this is just a one-time extraction, the notes you and Olly provided will work. It will save me having to do a lot of the work in Excel.
Thanks to you both.
San Joaquin General Hospital – General Accounting