Naïve Bayesian Classification Using C# and ASP.NET
A Detailed Analysis of the Bayes Rule, Its Design, and Implementation in Contemporary Applications
October 30, 2009
Machine learning is a fascinating area of computer science in which we study the techniques and implementations of algorithms that allow computers to learn. Its applications include, but are not limited to, automatic text classification, search engines, stock market analysis, financial threshold alerts, speech and handwriting recognition, medical diagnostics, data mining, and autonomous land vehicles. In this article, we'll discuss naive Bayesian text classification techniques and their implementation.
Text classification can be used to determine a category for a document or block of text. There are three commonly known classification techniques: naive Bayesian classification, support vector machines, and semantic indexing. The naive Bayesian technique is the most popular and is used in contemporary spam filtration, document categorization, and help desk systems because of its accuracy, robustness, and ease of implementation. The Bayesian classifier is attributed to a British mathematician, Thomas Bayes, who defined probability as:
The probability of any event is the ratio between the value at which an expectation depending on the happening of the event ought to be computed, and the chance of the thing expected upon its happening.
So what's the big deal about this? Why are naive Bayesian classifiers used by industry leaders like Microsoft in its notification platform to filter unwanted messages, Google to provide probable results, and Autonomy to build document classifiers? Because they provide an efficient way to reason under uncertainty. Let's consider an example: Have you ever wondered what happens behind the scenes when you receive a spam message in your e-mail inbox and you click the "Mark as spam" or "Send to Junk mail" button? The message gets passed to an indexer that tokenizes the contents, calculates the probability, and adds it to a spam repository, memorizing that the particular message had junk mail attributes. Likewise, for good messages, it learns from the environment and classifies the text as not junk. When the next message comes in, the system automatically calculates the prior probability from past experience and can recommend whether the new message falls under the category of ham or spam. This learning is based on the category of each word and its frequency of occurrence.
The Bayes Rule
Before coding the naive Bayesian algorithm, we'll do some math to understand the procedure. The primary equation for the Bayes rule is:
P(A|B) = (P(B|A) * P(A)) / P(B)
This states mathematically that the posterior probability, or probability of future occurrence, can be calculated from the product of the previous belief P(A) and the likelihood of B if A is true, i.e., P(B|A). P(A|B) is called the posterior probability, P(A) is called the prior probability, and P(B) is the normalization constant. This equation enables us to calculate the probability that A would occur given that B has happened.
Here is an example to further clarify what was just said. Let's take a scenario where we have three teams of developers working on different parts of a problem, each producing a share of the KLOC (thousands of lines of code). Team A is working on 50% of the modules and on average produces 3% buggy code. Team B is responsible for 30% of the class libraries written and has a 4% chance of having failure assertions and test case flaws in its code. Team C is working on the remaining 20% of the source but, because the technology used is in beta, the percentage of their errors rises to 5%. Now we come to the classification problem: What is the probability that a randomly selected line of code is faulty?
This is the normalization constant we've seen in the previous equation, which can be calculated as the sum of the products of each team's share and its chance of producing erroneous code:
P(X) = P(A) * P(D|A) + P(B) * P(D|B) + P(C) * P(D|C)
P(X) = (0.5)(0.03) + (0.3)(0.04) + (0.2)(0.05) = 0.037 = 3.7%
This implies there is a 3.7% chance that a randomly selected line of code would not pass a unit test. However, the power of classification under uncertainty lies in calculating the probability that a randomly selected faulty line came from team X. This is where the Bayes rule comes to the rescue. The Bayes rule tells us that the posterior probability can be found as follows:
P(A|X) = P(X|A) * P(A) / P(X)
P(A|X) = (0.03)(0.5) / 0.037 = 0.405405... ≈ 40.5%
Similarly, the probability that the erroneous code belongs to team B or team C is 32.4% and 27%, respectively.
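To make the arithmetic concrete, here's a minimal C# sketch (my illustration, not part of the article's downloadable code) that reproduces the team calculation using the law of total probability and the Bayes rule:

using System;

class BayesTeamsExample
{
    static void Main()
    {
        double[] share  = { 0.5, 0.3, 0.2 };    // P(A), P(B), P(C): share of the code written
        double[] defect = { 0.03, 0.04, 0.05 }; // P(D|A), P(D|B), P(D|C): defect rates

        // Normalization constant: P(X) = sum of P(team) * P(D|team)
        double pX = 0;
        for (int i = 0; i < share.Length; i++)
            pX += share[i] * defect[i];
        Console.WriteLine("P(X) = {0}", pX); // 0.037

        // Posterior for each team: P(team|X) = P(D|team) * P(team) / P(X)
        string[] names = { "A", "B", "C" };
        for (int i = 0; i < share.Length; i++)
            Console.WriteLine("P({0}|X) = {1:P1}", names[i], defect[i] * share[i] / pX);
    }
}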
This gives you an idea of how the posterior probabilities are calculated and weighted by the normalization constant. If the words "free" and "poker" occur in an e-mail and it's marked as junk, there is a big chance that the next e-mail arriving in my inbox with similar contents is also a junk e-mail. It may also mean that I'm being invited to a friend's house for a movie night with free pizza and games, but this leads us to a discussion of training domain, false positives, and reduction techniques, which are beyond the scope of this article. So let's get coding.
The Implementation
To mimic the two major operations of a text classifier, learning and classifying, we can create two methods for implementing the classification: AddWords and Classify.
The AddWords method works as a simple indexer, using a hashtable to tokenize the text and assign it to a category with a count (see Figure 1). The count increases every time we add the same word to the same category. The way a token (word) is defined is very simple in this approach.
public int AddWords(string sCategory, string sWords)
{
    // Tokenize the input into a word -> count hashtable.
    Hashtable htWordsInFile = StringtoHashtable(sWords);

    // Accumulate each word's count under a "category-word" key.
    foreach (string sWord in htWordsInFile.Keys)
    {
        _htWords[sCategory + "-" + sWord] =
            SafeKeyNum(_htWords, sCategory + "-" + sWord) +
            Double.Parse(htWordsInFile[sWord].ToString());
    }

    bDirtyFlag = true;

    //The index count
    return _htWords.Count;
}
Figure 1: The AddWords method.
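The helpers that AddWords (and, later, Classify) rely on, StringtoHashtable, SafeKeyNum, and SafeKeyDouble, aren't listed in the article. Based on how the calling code uses them, minimal sketches might look like this (my inference, not the original source):

// Hypothetical sketches of the helper members, inferred from the calling code.

// Tokenizes a string on whitespace and counts occurrences of each word.
public static Hashtable StringtoHashtable(string sWords)
{
    Hashtable ht = new Hashtable();
    foreach (string sWord in Regex.Split(sWords.Trim(), @"\s+"))
        ht[sWord] = (ht.ContainsKey(sWord) ? (double)ht[sWord] : 0.0) + 1;
    return ht;
}

// Returns the stored count for a key, or zero if the key is absent.
private static double SafeKeyNum(Hashtable ht, string sKey)
{
    return ht.ContainsKey(sKey) ? Double.Parse(ht[sKey].ToString()) : 0;
}

// Same idea for the score hashtable used by Classify.
private static double SafeKeyDouble(Hashtable ht, string sKey)
{
    return ht.ContainsKey(sKey) ? (double)ht[sKey] : 0;
}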
The Classify method performs the matching between the provided text and the available data in the words.bayes file (see Figure 2). It creates a hashtable of the input for traversing, calculates the score (prior probability) of every individual token and assigns it to another hashtable, then performs a sorting operation on the probabilities. If the probability of multiple items is the same, this implies a neutral category (the classifier can't really determine which category the text belongs to because of a lack of information).
public string Classify(Hashtable htWordsInFile)
{
    double iTotal = 0;
    string Response = string.Empty;
    Hashtable htCategoryCount = new Hashtable();

    // Total up the word counts per category and overall.
    foreach (string sEntry in _htWords.Keys)
    {
        string sCategory = (Regex.Split(sEntry, "-"))[0];
        htCategoryCount[sCategory] = (double)(SafeKeyNum(htCategoryCount, sCategory) +
            Double.Parse(_htWords[sEntry].ToString()));
        iTotal += Double.Parse(_htWords[sEntry].ToString());
    }

    // This is a ht of doubles: the accumulated log-probability per category.
    _htScore = new Hashtable();
    foreach (string sWord in htWordsInFile.Keys)
    {
        foreach (string sCategory in htCategoryCount.Keys)
        {
            if (_htWords.ContainsKey(sCategory + "-" + sWord))
            {
                // Known word: add log(count in category / category total).
                _htScore[sCategory] = SafeKeyDouble(_htScore, sCategory) +
                    Math.Log(Convert.ToDouble(_htWords[sCategory + "-" + sWord]) /
                    Convert.ToDouble(htCategoryCount[sCategory]));
            }
            else
            {
                // Unknown word: add a small smoothing probability instead.
                _htScore[sCategory] = SafeKeyDouble(_htScore, sCategory) +
                    Math.Log(0.01 / Convert.ToDouble(htCategoryCount[sCategory]));
            }
        }
    }

    // Add each category's prior probability.
    foreach (string sCategory in htCategoryCount.Keys)
    {
        _htScore[sCategory] = SafeKeyDouble(_htScore, sCategory) +
            Math.Log(Convert.ToDouble(htCategoryCount[sCategory]) /
            Convert.ToDouble(iTotal));
    }

    //The sorter
    bool firstLine = true;
    string firstElement = string.Empty;
    string secondElement = string.Empty;
    SortedList sorter = new SortedList(_htScore, this);
    foreach (string sCategory in sorter.Keys)
    {
        if (firstLine)
        {
            Response += sCategory + "\t" + _htScore[sCategory] + " (most likely candidate)";
            firstLine = false;
            firstElement = _htScore[sCategory].ToString();
        }
        else
        {
            Response += sCategory + "\t" + _htScore[sCategory];
            secondElement = _htScore[sCategory].ToString();
        }
    }

    if (firstElement.Equals(secondElement))
        return "Neutral Category:" + firstElement;
    else
        return Response;
}
Figure 2: The Classify method.
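One detail worth calling out: Classify sums Math.Log values rather than multiplying raw probabilities. Multiplying many small word probabilities would quickly underflow a double, whereas summing their logarithms stays numerically stable; the category with the largest (least negative) sum wins. Here's a tiny stand-alone demonstration of the idea (mine, not from the article):

using System;

class LogProbDemo
{
    static void Main()
    {
        // Multiplying 1,000 small probabilities underflows to zero...
        double product = 1.0;
        for (int i = 0; i < 1000; i++) product *= 0.001;
        Console.WriteLine(product);  // 0 (underflow)

        // ...but summing their logarithms stays well within range.
        double logSum = 0.0;
        for (int i = 0; i < 1000; i++) logSum += Math.Log(0.001);
        Console.WriteLine(logSum);   // approximately -6907.76
    }
}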
Now let's use these disjointed pieces in a solution. To test the library functions defined in NaiveBayesClassifier, I've created a Web service called ClassifierService that exposes these methods. Figure 3 shows the class diagram for NaiveBayesClassifier and ClassifierService.
Figure 3: Class diagrams for the ClassifierService and NaiveBayesClassifier classes.
The Web service methods are used to access our written functionality. Note the extra method named ClearIndex (see Figure 4). This method renames the existing words.bayes file, if there is one, thereby clearing the existing data. The next time the Learn method is called, a new data file will be created.
[WebMethod]
public string Classify(String words)
{
    return new NaiveBayesClassifier().Classify(
        NaiveBayesClassifier.StringtoHashtable(words));
}

[WebMethod]
public string Learn(String Category, String Keywords)
{
    return new NaiveBayesClassifier().AddWords(
        Category, Keywords).ToString();
}

[WebMethod]
public bool ClearIndex()
{
    return new NaiveBayesClassifier().FlushIndex();
}
Figure 4: The Classifier Web service methods.
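The FlushIndex method that ClearIndex delegates to isn't listed in the article. Based on the behavior described above (rename the existing words.bayes, if any, so a fresh file is created on the next Learn call), a minimal sketch might look like this; the backup naming scheme is my assumption, and it presumes references to System.IO and System.Web:

// Hypothetical sketch of FlushIndex; the article describes the behavior
// but does not show the body.
public bool FlushIndex()
{
    string sPath = HttpContext.Current.Server.MapPath("~/App_Data/words.bayes");
    if (File.Exists(sPath))
    {
        // Rename rather than delete, so the old index is kept as a backup.
        File.Move(sPath, sPath + "." + DateTime.Now.Ticks + ".bak");
        return true;
    }
    return false;
}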
In a simple example to test the classifier's capabilities, let's create two basic categories, Cats and Dogs, with the attributes outlined in Figure 5.
Cats | Dogs
---|---
cat kitten meow barf furry hairball playful | dog puppy bark woof drool smell stink wretch hairy loyal loud stupid
Figure 5: Two simple categories and their attributes.
To make the system learn which attributes are associated with each category, we'd have to invoke the Web service method Learn with the parameters shown in Figure 6.
Figure 6A: Invoking the Learn Web method for category creation.
Figure 6B: Invoking the Learn Web method for category creation.
As you can see in Figure 6, the Web service response provides the number of keys added to the hashtable and, as mentioned earlier, if the file words.bayes doesn't exist in the Web service's App_Data folder, the program will create it and add these keys to the corresponding categories.
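Outside the browser test page shown in Figure 6, the same calls can be made from code through a generated proxy. Assuming a proxy class named ClassifierService produced by Add Web Reference (the type name is my assumption), training both categories looks roughly like this:

// Hypothetical client-side calls via a Visual Studio-generated proxy.
ClassifierService svc = new ClassifierService();

// Teach the classifier both categories from Figure 5.
svc.Learn("Cats", "cat kitten meow barf furry hairball playful");
svc.Learn("Dogs", "dog puppy bark woof drool smell stink wretch hairy loyal loud stupid");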
Now it's time to see if this really worked. When you call the Classify method, providing the text "meow", ClassifierService returns the following response:
Cats -3.04452243772342 (most likely candidate)
Dogs -7.64969262371151

However, when you call it passing an unknown keyword, for example "Turtle", it is interpreted as a neutral category because the keyword Turtle can't be found in the attributes of either Cats or Dogs. Therefore, the probability of its matching either of the two categories is the same:

Neutral Category:-7.64969262371151

What if we look for the keyword "loyal"? Yes, you got it: it returns the Dogs category. Now you might be wondering why one should use naive Bayes at all. This looks so trivial that it could be achieved simply by keeping a list of known keywords assigned to each category and matching against it; but what if both categories contain the same keyword? Naive Bayes doesn't just do a string match; it weighs a particular keyword by its frequency of occurrence in a category and then decides (i.e., assigns it a weight). You'll see this in the next example.

This example shows a personal blog where the subject posts book reviews. In this implementation of a naive Bayesian classifier, the goal is to categorize text; i.e., to make the system learn what is junk and what is really useful. Spam comments are a big headache for bloggers: spammers, via automated bots using the blogging engine APIs or just by mere post request, post comments on an unwary blogger's Web site, turning it into a zombie spam link farm. CAPTCHA (Completely Automated Public Turing Test to Tell Computers and Humans Apart) has proven to be a very effective measure against these bots. However, for actual human posters this approach doesn't work. People have moved to moderated comments, but wouldn't it be better if you were notified only about a spammish comment on your Web site, instead of having to approve comments every time one is posted? This can also provide a resolution for Web sites that offer customers a facility to write product reviews, because the moderator will be notified only if the score of a particular message is higher than the spam threshold. Having said that, it's very important to keep the system updated with the newest spam vs. ham feeds so the classifier can understand the difference. Otherwise, you'll end up seeing either too many false positives or spam comments making their way through the classifier.

One of the new features of the Visual Studio .NET 2005 IDE is that you aren't required to have a Web server in order to run a Web service/Web site. The IDE comes with a Web server (Cassini), which is only accessible from localhost and provides development flexibility in case you don't have IIS installed on your machine or you have restrictions on running a Web server locally.

To provide a real-world problem simulation, I've used the design illustrated in Figure 7, wherein the service resides at a different tier along with data and implementation while the facade utilizes it. (For extensibility, security, and various other good reasons, the data layer should also be introduced and separated from the business layer.) In the diagram in Figure 7, the boundaries are represented as Blog (front-end facade) and Processing Engine, which has the ClassifierService and NaiveBayesClassifier assemblies with words.bayes, the classifier file.

Figure 7: Classifier's service design.

I've provided a collection of spam messages and proper comments in two files named comments.xml and SpamComments.xml (available for download; see end of article for details). These files are used to train the classifier; a minimal sketch of a batch-training loop follows.
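The article doesn't show the batch-training step itself. Assuming each comment's text sits in a <text> element of the XML files (the element name and file layout are my assumptions, since only excerpts are shown), a loop along these lines would feed both files to the indexer:

// Hypothetical batch-training loop; the XML element names are assumptions.
using System.Xml;

static void TrainFromFile(NaiveBayesClassifier classifier,
                          string sCategory, string sFile)
{
    XmlDocument doc = new XmlDocument();
    doc.Load(sFile);

    // Feed the text of every comment to the indexer under one category.
    foreach (XmlNode node in doc.SelectNodes("//text"))
        classifier.AddWords(sCategory, node.InnerText);
}

// Train "comment" from the good feed and "spam" from the junk feed:
// TrainFromFile(classifier, "comment", "comments.xml");
// TrainFromFile(classifier, "spam", "SpamComments.xml");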
An excerpt from these files is shown in the table in Figure 8.

Comments | Spam Comments
---|---
Pearl's "Probabilistic Reasoning in Intelligent Systems" is elegantly done seminal work on uncertainty, probabilistic reasoning and all things related inference. As the author says, "This book is a culmination of an investigation into the applicability of probabilistic methods to task requiring ... | Please visit some helpful info in the field of online poker http://www.williamchibbard.com/online-poker.html texas hold em http://www.texas-hold-em-world.com/texas-hold-em.html online poker http://www.texas-hold-em-world.com/online-poker.html texas hold em http://www.e-online-poker-777.com/texas-hold-em.html online poker http://www.e-online-poker-777.com/online-poker.html ...

Figure 8: Excerpts from comments.xml and SpamComments.xml. (For engine training purposes, SpamArchive's repository of spam e-mails can also be used: http://www.spamarchive.org.)

There are two ways to create the spam index file: by using the command-line executable NaiveBayesFileClassifier.exe or via the Web service's Learn method; you can use either. Here is an important point for the reader to note: if I only train the classifier using SpamComments.xml and post a spam comment like "texas and poker", I'll get the correct response from the Web service stating the comment is a spam message. However, for a legitimate comment like the one shown in Figure 9, the Web service will still regard it as spam, the reason being the amount of training data; a Bayesian classifier can't automatically know the difference between categories unless it's trained on different datasets. This is well summed up in a sentence by Martin Overton of IBM Global Services. "As with most things in life," he wrote, "Bayesian filtering only works well when it is properly trained."

Figure 9: Bayesian filtering is only effective when it is properly trained (a positive response, and a false positive due to lack of training data).

To avoid this insufficient training data problem, we'd further tutor the classifier with comments.xml, which will give it a perspective on what is spam and what is a comment. With the new definitions recorded in words.bayes, when I try to post the correct comment again, the results are different. This time, upon entering spammish keywords, the program classifies them correctly, as seen in Figure 10.

spam -4.01443287321891 (most likely candidate)
comment -15.1548443302207

comment -10.5496741442326 (most likely candidate)
spam -15.1548443302207

Figure 10: Tutor the classifier to get different results.

The classifier used in these examples can be modified and enhanced for optimization. For instance, the classifier is currently case-insensitive; however, case sensitivity could provide better approximation. Also, the classifier loads the file every time it's called. For optimization and better performance, I/O caching could be used to preserve the hashtable in memory and invalidate it upon new entries. Multiple-file support to store different categories, and strongly typed parameters to make use of the spam threshold on the front end, are other useful features to add to this basic implementation.
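As a sketch of the caching idea just mentioned (my illustration, not part of the article's downloadable code), the index could be held in a static field and invalidated whenever new words are persisted:

// Hypothetical in-memory cache for the index; ReadIndexFromDisk stands in
// for the classifier's existing file-loading logic.
private static Hashtable _htCachedWords;
private static readonly object _cacheLock = new object();

private Hashtable LoadWords()
{
    lock (_cacheLock)
    {
        // Reload from words.bayes only when the cache has been invalidated.
        if (_htCachedWords == null)
            _htCachedWords = ReadIndexFromDisk();
        return _htCachedWords;
    }
}

// Call after AddWords persists new entries, so the next read reloads.
private static void InvalidateCache()
{
    lock (_cacheLock) { _htCachedWords = null; }
}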
Conclusion

Bayesian classification is an excellent technique for text classification and categorization; it's simple, effective, and easy to implement. It's also not language-dependent; i.e., it can support other languages, such as Spanish, Chinese, Japanese, French, Portuguese, Russian, Arabic, etc. And what we covered is only the tip of the iceberg; Bayesian classification can be used for a variety of purposes. DIGIMIMIR, subtitled "a Tool for Rapid Situation Analysis of Helpdesk and Support E-mail", is a classic example of its diverse uses.

However, this classification is not a silver bullet, either; bad guys are already out there trying to fool Bayesian spam filters by using wordlists (or "dictionary salads"), passages from books, and by training classifiers with bad data. However, with a large amount of correct data (ham), a Bayes algorithm is capable of nullifying their probabilities and, hence, thwarting their plans.

Academia and implementation have never been as close as they are now. In this rapidly evolving world of IT, the success of a software enterprise is inversely proportional to the time it takes for its products to come out of the lab and get to the masses. The shorter the time, the greater the chance of success, provided that the product delivers an innovative solution the industry wants. The implementation of complex algorithms is now an integral part of our everyday lives. From MapQuest (shortest-path finding) and Google search to advanced traffic-routing algorithms in the cellular industry, complex algorithms are out of the books and are living and breathing among us.

Special thanks to Neal Hardesty for his support and the classifier's port from Perl. The sample code accompanying this article is available for download.

About the Author

Adnan Masood works as a Sr. Software Engineer at a Monrovia-based financial institution, where he develops middle-tier architectures, decoupled designs, and Web services using Microsoft's .NET Framework. He holds various professional memberships (ACM, BCS, and ACS) and several technical certifications (MCSD.NET, MCAD.NET, and SCJP-II). Adnan is attributed and published in print media and on the Web and is currently pursuing his MS (Computer Science) at Nova Southeastern University, FL. He is actively involved in the .NET community as co-founder of The San Gabriel Valley .NET Developers Group. His blog can be found at http://www.axisebusiness.com/adnano and he can be reached via e-mail at [email protected].