4 Replies Latest reply: May 15, 2014 9:55 AM by Grant Perkins

    HTML parsing with V7

    Ozgur Sanli

      Hi ;

      I am new to Monarch software. I am trying to figure out how to extract HTML segments with floating traps. The main problem is how to specify the "end" of an HTML chunk that spans multiple lines, such as

       

      <P> This is a sample paragraph that needs to be extracted which occupies several lines <a href="sample url">sample url text</a>. Another line for the paragraph.

       

      I want to extract the segments which start with a <p> tag and end when an <a href> tag is seen. I can specify the beginning but have not succeeded with the end condition. Thanks for any help.

        • HTML parsing with V7
          Grant Perkins

          Hi,

           

          Firstly, which version of Monarch 7 are you using? (Standard or Pro)

           

          Secondly, are you reading a .htm(l) file into Monarch, or is it an HTML file that is not recognised as HTML?

           

          I ask that because Monarch 7 Pro HTML analysis functionality will normally re-interpret the html code and display only the content directly, line by line, with some reference lines interspersed. Your description seems to suggest that you are analysing 'raw' html code.

           

          The interpreted HTML would typically produce a block of text for a paragraph that ignores the href reference and includes the rest of the paragraph text automatically. It sounds like you don't want it to do that. If so, is there a reason, or some other way of identifying what you wish to exclude?

           

          Sorry for the questions, but I will feel more comfortable suggesting things if I have a better understanding of what you are doing and what you require.

           

           

          Grant

           


          [ June 08, 2004, 06:53 AM: Message edited by: Grant Perkins ]

          • HTML parsing with V7
            Ozgur Sanli

            I am using Monarch Pro 7. I am processing raw HTML code with tags maintained in the code. Basically, I've been converting a text based web site into a relational one. Therefore links have to be identified and stored in a separate table from the HTML code. The HTML chunks and links will then be displayed in the order by server side code.

             

            In the example paragraph I've given, there are three chunks for the paragraph, the first part till the link, the link itself, and the last part from the end of the link to the end of the paragraph.

             

            thanks..

            • HTML parsing with V7
              Grant Perkins

              Hi Ozgur,

               

              Interesting project.

               

              My first reaction is that this complete requirement is not something I have looked at previously. Mostly people have been trying to extract just the text from the HTML.

               

              My second reaction is that you obviously need to bypass the specific HTML functionality provided, since you need the coded markers. Whether there is an option to extract the text in one process and the markers in another and then marry the two together has crossed my mind, but I'm not familiar enough with your requirement to feel comfortable with that idea.

               

              My third reaction is that maybe the way to approach this is to simply include the sections of code in a text field and then extract them from the text via a calculated field. However I could see some potential problems with that approach. For example any paragraph with more than one link coded might be a little difficult to process.

               

              If we take your simple paragraph example, if you can extract the multi-line raw HTML for the entire paragraph (I am optimistically assuming there is a way to identify the end of the paragraph text for now ...) then it would be possible to split the resulting field into 3 calculated fields using the LSPLIT and RSPLIT functions, or any of the other functions which might prove appropriate.

               

              So, if you were to LSPLIT the extracted field into 2 parts using the left pointer ("<") as the split point identifier, part 1 would give you the text before the (first) tag.

               

              An RSPLIT using the right pointer (">") as the identifier would give you the text after the (last) tag.

               

              Presumably your link would then be the section in the middle. So all we need to do is work out how to extract that most effectively. In a single-link occurrence example the EXTRACT function should do it nicely. Identify the start and end strings of the section you want and extract it. If necessary, add the selection strings back into the calculation as part of the formula for the calculated field.
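
A rough sketch of the three-way split described above, written in Python purely for illustration. It is not Monarch formula syntax, and `split_paragraph` is a hypothetical helper name. It assumes a single link per paragraph and splits on the link's opening and closing markers rather than on a bare arrow, since the paragraph itself begins with a tag.

```python
def split_paragraph(html):
    """Split raw paragraph HTML into (before, link, after) around one <a ...>...</a>.

    Assumption: at most one link per paragraph. If no link is found,
    the whole text is returned as 'before'.
    """
    lower = html.lower()              # tags may be upper or lower case
    start = lower.find("<a ")         # start of the link's opening tag
    end = lower.rfind("</a>")         # start of the link's closing tag
    if start == -1 or end == -1 or end < start:
        return html, "", ""
    end += len("</a>")                # include the closing tag in the link chunk
    return html[:start], html[start:end], html[end:]


sample = ('<P> This is a sample paragraph '
          '<a href="sample url">sample url text</a>. '
          'Another line for the paragraph.')
before, link, after = split_paragraph(sample)
```

Keeping the start and end strings inside the middle chunk corresponds to adding the selection strings back into the calculated field.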

               

              Will it work?

               

              No idea - I'm just typing the thoughts going through my head at the moment - but in theory it should be feasible so long as nothing else conflicts with the concept. Multiple separated links would make life a little more interesting, although more complex versions of the same idea as above should still be possible using a few additional functions.
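
For the multiple-link case, one possible approach is sketched below in Python (again for illustration only, not Monarch syntax; `LINK_RE` and `chunk_paragraph` are hypothetical names). It assumes links are not nested and returns the text and link chunks in their original order, which matches the stated need to redisplay them in sequence.

```python
import re

# Assumption: links look like <a ...>...</a> and are never nested.
LINK_RE = re.compile(r'(<a\s[^>]*>.*?</a>)', re.IGNORECASE | re.DOTALL)


def chunk_paragraph(html):
    """Return an ordered list of ("text" | "link", chunk) pairs."""
    # With a capturing group, re.split keeps the matched links:
    # even positions hold plain text, odd positions hold links.
    parts = LINK_RE.split(html)
    return [("link" if i % 2 else "text", part)
            for i, part in enumerate(parts) if part]


chunks = chunk_paragraph('A <a href="x">one</a> B <a href="y">two</a> C')
```

Each pair could then be stored as a row, with the list position serving as the display order.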

               

              So, can you get at the whole paragraph and define the end successfully?

               

              Grant

               

               


              • HTML parsing with V7
                Grant Perkins

                Hi Ozgur,

                 

                Just spent a few minutes playing with this and a couple of points come to mind.

                 

                Text fields - you may need to create these as MEMO field types, as character types are limited to 256 chars.

                 

                However when you slice and dice a field you cannot create a calculated field with a memo field type. For large paragraphs that might cause a problem.

                 

                The split on the 'arrows' needs to allow for the start and end arrows respectively and any others in between. The idea may still be sound but require some more thought in its application. There are alternative approaches if needed.

                 

                You could also consider pre-processing the file (using the MSRP utility, for example) to alter the content to make the analysis more accessible. However, the effects may be unpredictable given the nature of HTML code. Some trial and error work may be required.
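
To illustrate the kind of pre-processing meant here, the Python sketch below rewrites the raw HTML so that each paragraph start and each link sits on its own line, which should make the chunks easier to trap line by line. It is only an assumption about what a useful rewrite might look like, it does not use or imitate MSRP, and `preprocess` is a hypothetical name.

```python
import re


def preprocess(raw):
    """Put each <p> start and each <a ...>...</a> link on its own line."""
    # Join multi-line paragraphs by collapsing all whitespace runs to one space.
    text = re.sub(r'\s+', ' ', raw)
    # Start a new line before every paragraph tag (\b avoids matching <pre>).
    text = re.sub(r'(<p\b)', r'\n\1', text, flags=re.IGNORECASE)
    # Isolate every link on a line of its own (assumes links are not nested).
    text = re.sub(r'(<a\s[^>]*>.*?</a>)', r'\n\1\n', text, flags=re.IGNORECASE)
    return text.strip()


raw = '<P> First line\nsecond line <a href="u">link</a>. Tail.\n<p>Next'
lines = preprocess(raw).split('\n')
```

As noted, real pages are less tidy than this sample, so the patterns would likely need adjusting by trial and error.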

                 

                BTW I'm not sure how the floating trap would help you with this particular requirement.

                 

                Grant