Pull Out Text from HTML Code

The ScrapeText function accepts an HTML string as input and returns a plain text string by simply skipping over any text contained within the tags tags.

Bill McEvoy

August 22, 2006

1 Min Read
ITPro Today logo


ScrapeText is a simple function that I use to pull plain text out of strings that contain markup-language formatting. I wrote this function to help build a report for a questionnaire system. The report needed to display only the plain text of each question. However, the text of each question was stored in heavily formatted HTML.

I searched online for a solution but didn’t find anything useful.Then, in a moment of clarity,a solution came to me:Write a function that accepts an HTML string as input and returns a plain text string by simply skipping over any text contained within the tags. For example, if you run the code

SELECT dbo.ScrapeText  (‘  SQL Server

the result is "SQL Server".

I wrote the ScrapeText function for SQL Server 2005 and SQL Server 2000. Listing 1 shows the code that does the main processing.This code first wraps the input string (which can be a string or an alphanumeric column in a table) with the > and

After the initial data cleansing is complete, the code examines the input string one character at a time from left to right. An imaginary pen is ready to start copying the desired characters to an output string. When a > tag is encountered, the pen is put to paper because it has reached the beginning of a desired string of text. The pen then writes each character to the output string until a —Bill McEvoy

Sign up for the ITPro Today newsletter
Stay on top of the IT universe with commentary, news analysis, how-to's, and tips delivered to your inbox daily.

You May Also Like