You have probably already searched and found [url="http://mails.datawatch.com/cgi-bin/ultimatebb.cgi?ubb=get_topic;f=1;t=000266#000000"]this[/url] post from a few years ago.
If the address block gives you a single field for everything EXCEPT the state and zip, some of the work will have been done for you, but unfortunately in this case that was the easier part!
It might be worth looking for obscure separators between each section - double spacing instead of single, for example - but that is a long shot and in any case may not be reliable.
As mentioned in the post referenced above, using strings like "PO Box" to split on can be a good start, but of course it will only apply to some addresses unless you are very lucky with the content.
I suppose it may be possible to come up with a small number of variations that work successfully with, say, 80% of addresses, and build in a flag that identifies records where there is no clear success so that they can be segregated for re-assessment and possibly manual conversion. Better that than having to deal with large numbers manually.
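Just to illustrate the idea outside of Monarch: the "try a few patterns, flag the rest" approach might look something like this Python sketch. The sample records, pattern list, and function name are all hypothetical - the point is only that anything no pattern matches gets a False flag for manual review.

```python
import re

# Hypothetical samples: the free-form part of each address block,
# assuming state and ZIP have already been stripped off.
RECORDS = [
    "PO Box 123 Springfield",
    "45 Elm Street Riverton",
    "Suite 9 The Old Mill",   # no recognizable pattern -> flagged
]

# A small number of split variations, tried in order.
PATTERNS = [
    re.compile(r"^(P\.?O\.?\s*Box\s+\d+)\s+(.+)$", re.I),   # PO Box nnn <city>
    re.compile(r"^(.+?\b(?:Street|St|Drive|Dr|Road|Rd|Avenue|Ave)\.?)\s+(.+)$", re.I),
]

def split_address(line):
    """Return (street, city, ok_flag); ok_flag False means manual review."""
    for pat in PATTERNS:
        m = pat.match(line)
        if m:
            return m.group(1), m.group(2), True
    return line, "", False   # segregate for re-assessment

for rec in RECORDS:
    print(split_address(rec))
```

In Monarch itself this would presumably become a string of calculated-field functions plus a filter on the flag, but the structure - ordered patterns, then a catch-all flag - carries over.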
Alternatives to consider include the capabilities of name-and-address checking software, or de-duplication programs, which may have similar functionality and will have been based on analysis of all allowable address formats. For a small one-off extraction that may not be viable; for a very large one-off extraction it might be justifiable. For a task that will be regularly repeated, almost regardless of data volume, anything that avoids manual processing would likely be desirable.
Oh for the days when punctuation was used (mostly correctly) everywhere. Commas at the end of each address 'line' would make this split job very simple.
Thanks ever so much, Grant, for pointing out that other thread. I had looked before I posted my question but hadn't seen it.
Regarding correct punctuation, the problem is kind of relative. It depends on what you're used to. This client is actually consistent, which I consider a real gift!
I'll have to study the other thread more completely when I have time and put together a string of functions.
Thanks for your help.
[quote]Thanks ever so much, Grant, for pointing out that other thread. I had looked before I posted my question but hadn't seen it.[/quote]Things are easy to miss - especially when they go back some way.
[quote]Regarding correct punctuation, the problem is kind of relative. It depends on what you're used to. This client is actually consistent, which I consider a real gift![/quote]Agreed!
[quote]I'll have to study the other thread more completely when I have time and put together a string of functions.[/quote]I wonder how close you need to get to perfection for the purpose? Most name and address databases have significant anomalies in which address elements end up in which fields. Since State/ZIP seem reasonably easy to identify, it may be that only City is critical from the rest of the line? In which case I guess it would be possible to build a lookup table (external?) from a master list of all possible 'city' names and set a flag for 'Found' (or not found) in order to filter all those records that are questionable.
If that gave you an automated hit rate of, say, 80% or more, it would probably be a good result.
The failing 20% could then be assessed and 'corrected', or the lookup list added to, thus improving automation over time.
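As a rough sketch of the lookup-table idea in Python (the city list and function name are made up for illustration - in Monarch this would be an external lookup): match the longest plausible ending of the line against the master list, and flag anything not found so it can be reviewed or added to the list.

```python
# Hypothetical master list of 'city' names seen in these clients' data;
# in practice this would be an external lookup table that grows over time.
KNOWN_CITIES = {"Springfield", "Riverton", "New Hope", "Mount Airy"}

def tag_city(street_city):
    """Try progressively shorter endings of the line against the city list.
    Returns (street, city, found_flag); found_flag False means review."""
    words = street_city.split()
    # Longest plausible city name first (three words), then shorter.
    for n in (3, 2, 1):
        if len(words) > n:   # leave at least one word for the street
            candidate = " ".join(words[-n:])
            if candidate in KNOWN_CITIES:
                return " ".join(words[:-n]), candidate, True
    return street_city, "", False   # not found: correct manually or extend the list
```

Each failed record either gets corrected by hand or its city gets added to KNOWN_CITIES, so the automated hit rate should climb with use.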
A similar approach could be taken to the analysis of the other parts of the address line, but with more variables to consider and many likely anomalies in format and content, that could be a frustrating task.
Have you checked with the client to see if there is anything they can do to assist with identifying the break points? It would be a shame to be re-inventing the wheel.
Perfection is not necessary. Nor is it an emergency, since another dept. enters it manually until it can be automated.
Yes, Address Block can break off the state and Zip easily, so it IS only the street/city that needs to be separated.
We have 4 clients who have this exact same format, so it's not just someone in their offices being creative. I suspect there is some small-medical-office software that produces this Excel report. It's possible they have some options when printing it, but most of our small-medical-office clients are unsophisticated in their computer usage. (They do, however, make up for it with their kindliness, for the most part.)
External lookup is a good idea. I haven't done that with Monarch yet. Something like looking for towns that start with New or Mount or Saint, etc., and streets that end in Street, Drive, Road, etc., and I guess default to a one-word city when in doubt. The Zip will get the mail delivered, so perfection isn't necessary.
I will work on a string of functions when I get the time (if that ever happens!). In the meantime, our data entry people will just have to do it.
Thanks for your help in thinking about this problem.
Just thinking aloud here ...
It may be that you could provide a list in, say, Excel form where the address block has done what it can resulting in possibly some OK records and others with error flags provided by the address block functionality.
As you know, Street, Drive, Road, etc., are useful ways to strip off the front part of the address when they are present, and State and Zip will come off the back end. So you could provide a field that just needs the split point identified by manual changes - even just entering the punctuation, perhaps - and then re-present that to Monarch (or maybe even leave it in Excel) for the final stage of automation.
I get the impression that the volume of records with the problem is not enormous so at some point comes the decision about how far to take the development.
In fact, if only a small number of sources are involved, that may limit the range of possible city and town names anyway. If so, the lookup table option - perhaps set up per source to limit the matching, whether by program or within Monarch - could be relatively viable compared to needing national-level identification.
But I am guessing.
In the end someone is going to have to review the flagged errors so doing much more than highlighting them may be overkill.