Pull Out Text from HTML Code
The ScrapeText function accepts an HTML string as input and returns a plain text string by simply skipping over any text contained within the tags tags.
August 22, 2006
ScrapeText is a simple function that I use to pull plain text out of strings that contain markup-language formatting. I wrote this function to help build a report for a questionnaire system. The report needed to display only the plain text of each question. However, the text of each question was stored in heavily formatted HTML.
I searched online for a solution but didn’t find anything useful.Then, in a moment of clarity,a solution came to me:Write a function that accepts an HTML string as input and returns a plain text string by simply skipping over any text contained within the tags. For example, if you run the code
SELECT dbo.ScrapeText (‘ SQL Server
the result is "SQL Server".
I wrote the ScrapeText function for SQL Server 2005 and SQL Server 2000. Listing 1 shows the code that does the main processing.This code first wraps the input string (which can be a string or an alphanumeric column in a table) with the > and
After the initial data cleansing is complete, the code examines the input string one character at a time from left to right. An imaginary pen is ready to start copying the desired characters to an output string. When a > tag is encountered, the pen is put to paper because it has reached the beginning of a desired string of text. The pen then writes each character to the output string until a —Bill McEvoy
About the Author
You May Also Like