Nice Acrobat - even my version of the Acrobat Reader can only export the first 2 pages as text ...
In terms of the 'standard' of the PDF, doing what you have done is probably about as good as it gets.
As for the change of layout of the data: if that is what the authors decide to do, knowingly or unknowingly, all you can do is modify your model and hope it does not change too often. Keep a version of the original models as they change so that you can still use them on the older files.
The only alternative that I can think of that might deal with run-by-run changes, as long as the basic columnar format does not alter, would be to generate a generic model that traps each line as a single field and then slices and dices it into the separate fields independently, as far as possible, of the column widths.
Obviously this could be tricky. You may need to pre-process the column header line to establish the start positions of the columns. That would mean that the entire process would likely have to be run separately for each PDF unless you already knew that several files had the same format.
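To illustrate that pre-processing idea, here is a minimal sketch (in Python, outside Monarch) that guesses column start positions from a header line and slices each data row at those offsets. The two-space heuristic and the example field layout are assumptions, not the actual report format.

```python
import re

def column_starts(header_line):
    # Assume a column heading begins at the start of the line or
    # immediately after a run of two or more spaces.
    return [m.start() for m in re.finditer(r"(?:(?<=\s\s)|^)\S", header_line)]

def slice_record(line, starts):
    # Cut a fixed-width data line at the header-derived offsets.
    ends = starts[1:] + [len(line)]
    return [line[s:e].strip() for s, e in zip(starts, ends)]
```

Single-word headings separated by two or more spaces are found reliably; a heading containing a single internal space (like "Issuer Name") is kept as one column, since only 2+ spaces count as a separator.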
From the file properties it looks like the PDF writer is now an MS Access utility, whereas, for example, the 3/2006 file was produced by Actuate.
The format change may simply be a function of the writer engine, I guess.
Sorry this is not too much help.
Acrobat reader V7.0.8 has a setting that allows me to save the entire document as text (by default it only saves the first 2 pages as text).
Go to Edit > Preferences > "Reading" and tick the "Override reading order in tagged documents" option. I also set the Reading Order to "Use reading order in print stream".
Old and new files (if exported to text) will then be consistent (left justified text) EXCEPT that the column headings and page heading info are swapped around a bit. That may not matter.
In theory this would slightly open up the potential for a slice and dice approach for data extraction using some conditional field handling.
PS: If you are using Acrobat Reader 8 there is a dedicated Accessibility Setup Assistant function to be used rather than the Preferences.
I have been playing with the single field slice and dice idea. (Two fields actually but the Issuer field will be several fields in one.)
About 85% of the records can be established relatively easily.
Of the remainder, I estimate that around 50 to 60% will succumb without too much more effort. Splitting the Issuer Name and Issuer Description is the challenge. Some of the description entries appear to be entirely free form.
On the other hand this may not be important to split out or the related records may not be of interest for the analysis intended so I will wait for some feedback before taking the model any further.
I appreciate your time.
I did notice that Adobe could export text - and tried that a few times.
One major problem is that the "*" character between the CUSIP and the Sec Description is a necessary field (it indicates whether the security is an option). This character is culled out separately, and if there is no "*", then a blank is NOT generated.
For example, if there was one option indicator on the page, the text would show one "*" and I could not match it to the preceding list of securities.
What I did was push out the best I could and then, with a series of calculated fields, make the data acceptable, where only 40 to 60 records had to be manually adjusted.
The save-as-text option does not always work consistently from report to report, or even from reader version to reader version, it seems. Internal settings sometimes help, sometimes not.
The "*" ident and separation was relatively easy providing you use the ASCii code for the character (42). So that can be stripped off the front. Also the Status seems consistent and so can be stripped off the back.
If the Issuer description is COM, CALL, PUT, SHS the ID is easy as well. It's the rest that cause more work though some of them are probably easy enough to dig out but will take time to check for consistency and avoidance of possible anomalies. "COM NEW" and a few others are not too big a challenge for example but other strings have no reliable and obvious handle by which to ID them.
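A rough Python sketch of that strip-front/strip-back approach (hypothetical, since the real work here happens in Monarch calculated fields; the status codes are assumed examples and the keyword list would grow as anomalies turn up):

```python
# Assumed keyword lists; longest first so "COM NEW" matches before "COM".
KNOWN_DESCRIPTIONS = ("COM NEW", "CALL", "PUT", "COM", "SHS")
KNOWN_STATUSES = ("SH", "PRN")  # hypothetical status codes

def parse_issuer_field(raw):
    text = raw.strip()
    # Strip the option indicator (ASCII 42, "*") off the front.
    is_option = text.startswith(chr(42))
    if is_option:
        text = text[1:].lstrip()
    # Strip a known status off the back.
    status = ""
    for s in KNOWN_STATUSES:
        if text.endswith(" " + s):
            status, text = s, text[:-len(s) - 1].rstrip()
            break
    # Strip a known description suffix; free-form descriptions
    # simply stay attached to the name for manual review.
    description = ""
    for d in KNOWN_DESCRIPTIONS:
        if text.endswith(" " + d):
            description, text = d, text[:-len(d) - 1].rstrip()
            break
    return {"option": is_option, "name": text,
            "description": description, "status": status}
```

Records whose description is not recognized come back with an empty description field, which makes the leftovers easy to count and eyeball.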
I use Acrobat Pro v8 and was able to save the file as an optimized PDF file. From there it created a readable PDF file that I imported into Monarch Pro v9. The resulting file looked somewhat like the previous quarters', and I was able to define all of the fields without too much difficulty. I did define the Issuer Name, Status, and Description as one field and then parsed that field to get three separate fields. Acrobat Pro is an expensive solution to the problem, however.
Thanks Very Much!
I did much the same.
I was finally able to get usable data from a Monarch model using several calculated fields. Then I just did several data-integrity checks once it was exported to Access. I only had to manually change 40 to 60 records (which is not bad for almost 15,000 records).
(This is my first post, but I've been working for Datawatch in Europe for three years. Grant suggested I have a look as we've come across a few PDF-related problems and have been looking for fixes.)
I've used ABBYY PDF TRANSFORMER to force uncooperative PDFs into a clean format for work in Monarch. This gives you the advantage of retaining the layout, which you would lose with a text export.
I've put the converted PDF file online at:
As you and Grant have found, it's frequently necessary to trap a large input field and use some judicious calculated fields to get the data you want.
If you were automating this, and couldn't tell in advance whether the input PDF was of one type or another, you could either convert all the PDFs using ABBYY beforehand, or, with DataPump, use automatic verification to determine whether modelA works, and if it fails, use modelB.
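Outside DataPump, the same model-A-then-model-B fallback could be sketched generically like this (the parser functions and the verification rule are placeholders, not DataPump's actual API):

```python
def extract_with_fallback(lines, parsers, min_rows=1):
    # Try each candidate parser (e.g. one per known PDF layout) in order;
    # accept the first whose output passes a simple verification step.
    for parse in parsers:
        try:
            rows = parse(lines)
        except ValueError:
            continue  # wrong layout for this model; try the next one
        if len(rows) >= min_rows:
            return rows
    raise RuntimeError("no model produced usable output")
```

The verification step here is just a minimum row count; in practice you would check whatever DataPump's verification checks, such as field counts or key-field formats.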