Using SSAS and SSIS to Build an IIS Web Log Cube

Use these tools to track historical statistics and usage

David Pruden

January 11, 2010


Manually combing through the data in the web log files created by a Microsoft IIS server can be tedious. It can be a quick way to look at an application or end-user error, but it's not practical for trend analysis or usage statistics. IIS doesn't have a good built-in tool for compiling the data and transforming it into useful information. There are third-party applications and services that you can use, but they may provide only basic statistics, or reports that can't be customized to filter out “bots” or to identify potential threats. Our office did not have the budget for these tools.

The solution I outline in this article integrates the IIS web log file, Microsoft SQL Server 2005, SQL Server 2005 Analysis Services (SSAS), SQL Server 2005 Integration Services (SSIS), and SQL Server 2005 Reporting Service (SSRS). You can customize the cube or cubes to provide historical statistics and analyze many levels of usage.

Step 1: Create the Log File

You create and configure the web log file via Site Properties, which you access through the IIS Admin tool. Or if you don't have access to the IIS Admin tool, you can use Web Sites properties, which you access from the Computer Management console in the Control Panel Administrative Tools applet, as Figure 1 shows. The log file that this step creates is a simple text file that can be read by third-party applications or opened in Notepad for manual review. The file has the following naming convention:

C:\WINDOWS\system32\LogFiles\W3SVC{#}\ex{yymmdd}.log

Our office had used Analog 6.0 and Report Magic 2.21 for some time. This combination provides a good overall snapshot of the site's cumulative history through a web interface. So, instead of changing the log format or fields for my process, I left the configuration at its default settings. Although the statistics logged are helpful, it's hard to extract from the simple text file both the granular information and the summarized data that the office would like to see. The reports from our other applications provided only basic OS and browser data.

To enable logging on the IIS host, open the IIS Admin tool, expand the Web Sites tree, and right-click the Default Web Site. Select the Enable logging checkbox. Click Properties under Active log format, which is set to W3C Extended Log File Format, as Figure 1 shows. Click the Advanced tab to open the Logging Properties screen and select these fields: Date, Time, Client IP Address, User Name, Server IP Address, Server Port, Method, URI Stem, URI Query, Protocol Status, Protocol Substatus, User Agent, as Web Figure 1 shows.

Step 2: Load the Data

Next, you copy the log file to a location from which you can retrieve it. I decided to use DOS commands in .bat files run on the web server to keep the process simple and to avoid installing additional software. Our external web server didn't run SQL Server and couldn't execute SSIS packages remotely. I created two batch files: Copy_log.bat, which Listing 1 shows, and Delete_log.bat, which Listing 2 shows.

@ECHO OFF
for /f "tokens=2,3,4 delims=/ " %%i in ('date /t') do (
set my_day=%%j
set my_month=%%i
set my_year=%%k
)
set my_year=%my_year:~2,4%
set fileName=ex%my_year%%my_month%%my_day%.log
copy "C:\WINDOWS\system32\LogFiles\W3SVC1234567890\%fileName%" "C:\weblog"
EXIT
@ECHO OFF
for /f "tokens=2,3,4 delims=/ " %%i in ('date /t') do (
set my_day=%%j
set my_month=%%i
set my_year=%%k
)
set my_year=%my_year:~2,4%
set fileName=ex%my_year%%my_month%%my_day%.log
DEL "C:\weblog\%fileName%"
EXIT

Copy_log.bat copies the log file to a directory on the web server so that you can use FTP to GET the file to your production database server. I set up a Windows Scheduled Task to execute the file on the web server nightly during the week. As an extra security precaution, I scheduled delete_log.bat to run 15 minutes after copy_log.bat to delete the copied file. This gives the FTP process on the SQL Server enough time to copy the new file before it's deleted.

I created a nine-step SSIS package, which Figure 3 shows, to load the file, parse the fields, archive the data, and process the SSAS cube. Before you begin creating the package, execute the Table DDL file, which Web Listing 1 shows.

CREATE TABLE [dbo].[tblTreLog](
      [entry] [varchar](4096) NULL
) ON [PRIMARY]

CREATE TABLE [dbo].[tblTreWebLog](
      [logID] [int] IDENTITY(1,1) NOT NULL,
      [logDateTime] [datetime] NULL,
      [hostIP] [varchar](15) NULL,
      [method] [varchar](8) NULL,
      [directory] [varchar](2048) NULL,
      [port] [varchar](3) NULL,
      [sourceIP] [varchar](15) NULL,
      [browser] [varchar](2048) NULL,
      [status] [int] NULL,
      [substatus] [int] NULL,
 CONSTRAINT [PK_tblTreWebLog] PRIMARY KEY CLUSTERED
(
      [logID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]

CREATE TABLE [dbo].[tblTreWebLogArchive](
      [logID] [int] IDENTITY(1,1) NOT NULL,
      [logDateTime] [datetime] NULL,
      [hostIP] [varchar](15) NULL,
      [method] [varchar](8) NULL,
      [directory] [varchar](2048) NULL,
      [port] [varchar](3) NULL,
      [sourceIP] [varchar](15) NULL,
      [browser] [varchar](2048) NULL,
      [status] [int] NULL,
      [substatus] [int] NULL,
 CONSTRAINT [PK_tblTreWebLogArchive] PRIMARY KEY CLUSTERED
(
      [logID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO

CREATE NONCLUSTERED INDEX [tblTreWebLogArchive_logDate_IDX] ON [dbo].[tblTreWebLogArchive]
(
      [logDateTime] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
GO

This creates the destination tables (tblTreLog, tblTreWebLog, and tblTreWebLogArchive) and a supporting index in the database to which you plan to connect the SSAS cube. (The stored procedure used in Task 4 is created separately, in Listing 6.)

To create your SSIS package, open Microsoft Visual Studio 2005. Then create the following tasks for your package.

Execute Process Task. The first task of the package executes the TreLogFTP.bat file, which Listing 3 shows, to retrieve the web log file.

@ECHO OFF
REM 5/09/07 - David Pruden
REM copy web log files from web site to load into tblTreLog
cd C:\
REM delete old web log files
DEL ex*.log
REM connect to www.website.com, get current files
ftp -s:"C:\Program Files\Microsoft SQL Server\MSSQL\Scripts\TreLogFTP.src" 127.0.0.1
copy ex*.log TreLog.log
EXIT

The file is stored in C:\Program Files\Microsoft SQL Server\MSSQL\Scripts along with the TreLogFTP.src file, which Listing 4 shows.

[LOGIN]
[PASSWORD]
cd log
prompt
mget *.log
quit

To create this task, open the Visual Studio Toolbox and drag the Execute Process Task to the Control Flow panel. Double-click the new task and click Process in the left panel, as Figure 4 shows. Click in the Executable field and browse to TreLogFTP.bat. Manually run copy_log.bat on the web server, then execute this task to make sure the .bat file copies the current log file to your SQL Server. You'll need the file to test and set up the third task.
Truncate tblTreLog. The second task truncates the tblTreLog table. To create this task, drag an Execute SQL Task to the Control Flow panel. Connect the green connection arrow from the Execute Process Task to the SQL task. Double-click the SQL task and click Connection. Add a new connection to your destination database. Click the SQL Statement and enter the code for SSIS Task 2, which Listing 5 shows. We loaded data only Monday through Friday, so I added an IF Exists statement, as Figure 5 shows, to test the day of the week before truncating the table.

IF EXISTS (
      SELECT DATEPART(dw,GETDATE()) AS weekday
      WHERE DATEPART(dw,GETDATE()) BETWEEN 2 AND 6
      )
BEGIN
      TRUNCATE TABLE tblTreLog
END
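
One caveat: DATEPART(dw, ...) depends on the server's DATEFIRST setting, and the check above assumes the US English default (Sunday = 1). A variant that doesn't depend on DATEFIRST, shown here only as a sketch (it does assume an English language setting), tests the day name instead:

IF DATENAME(dw, GETDATE()) NOT IN ('Saturday', 'Sunday')
BEGIN
      -- same weekday-only load as Listing 5, without relying on DATEFIRST
      TRUNCATE TABLE tblTreLog
END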


Load TreLog. Next you add a Data Flow task. Double-click the task and drag a flat file source and an OLE DB destination to the Data Flow window, as Figure 6 shows. Double-click the flat file source and add a new connection to the C:\TreLog.log file. Under the Connection Managers tab at the bottom of the Visual Studio window you will see the new flat file connection. Double-click the connection and change the Header rows to skip setting to 4, as Web Figure 1a shows. Click Advanced, change the Column 0 name to Entry, and change the OutputColumnWidth to 4096. Connect the green connection arrow from the previous step to this one.
Delete Last Row. Task 4 executes the delete_tblTreLog_lastRow_SP code, which Listing 6 shows.

CREATE PROC delete_tblTreLog_lastRow_SP
AS
DELETE FROM tblTreLog
WHERE entry = '' OR entry LIKE '#%'

This code removes the rows that can't be parsed, that is, blank entries and the header lines that begin with #, from tblTreLog. Add a new Execute SQL Task to the Control Flow window, as Figure 7 shows. Connect the green connection arrow from the previous step to this one.
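
With the procedure created, the SQL statement for this Execute SQL Task can be as simple as a call to it:

EXEC dbo.delete_tblTreLog_lastRow_SP
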
Truncate tblTreWebLog. Add another Execute SQL Task for Task 5. This task simply truncates the tblTreWebLog table, clearing the log table that contains the previous day’s data. The table is used for the daily Reporting Services reports. Connect the green connection arrow from the previous step to this one.
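
The SQL statement for this task isn't shown as a separate listing; at its simplest it's just the TRUNCATE itself (wrap it in the same weekday check used in Listing 5 if you want the step skipped on weekends):

TRUNCATE TABLE tblTreWebLog
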
Load tblTreWebLog. Task 6 executes some complicated SQL code to convert the single-column data from Task 3 into specific columns. Add a new Data Flow Task and an OLE DB source and OLE DB destination. Double-click the OLE DB source and select SQL Command from the Data Access Mode, as Web Figure 2 shows. Add the SQL code for SSIS Task 6, which Web Listing 2 shows.

select   dateadd(hh,-4,convert(datetime,left(entry,19))) as logDateTime,
         right(left(entry,33),13) as hostIP,
         right(left(entry,(charindex('/',entry)-2)),len(left(entry,(charindex('/',entry)-2)))-34) as method,
         case
               when entry like '% - % - %'
               then right(left(entry, charindex (' - ', entry)), len(left(entry, charindex (' - ', entry))) - charindex('/', entry)+2)
               else right(left(entry, charindex (' - ', entry)-3) ,(len(left(entry, charindex (' - ', entry)-3)))-(charindex('/',entry)-2))
         end as directory,
         case
               when entry like '% - % - %'
               then ltrim(rtrim(right(left(entry, charindex(' - ',entry)+5),4)))
               else ltrim(rtrim(right(left(entry,charindex(' - ',entry)),4)))
         end as port,
         case
               when entry like '% - % - %'
               then left(right(entry, len(entry) - charindex(' - ',entry,charindex(' - ', entry)+1)-2 ),charindex(' ',right(entry, len(entry) - charindex(' - ',entry,charindex(' - ', entry)+1)-2 )))
               else left(right(entry, len(entry)-charindex(' - ', entry)-2),charindex(' ',right(entry, len(entry)-charindex(' - ', entry)-2)))
         end as sourceIP,
         left(right(entry, len(entry)-charindex(
         case
               when entry like '% - % - %'
               then left(right(entry, len(entry) - charindex(' - ',entry,charindex(' - ', entry)+1)-2 ),charindex(' ',right(entry, len(entry) - charindex(' - ',entry,charindex(' - ', entry)+1)-2 )))
               else left(right(entry, len(entry)-charindex(' - ', entry)-2),charindex(' ',right(entry, len(entry)-charindex(' - ', entry)-2)))
         end , entry)-
         len(case
               when entry like '% - % - %'
               then left(right(entry, len(entry) - charindex(' - ',entry,charindex(' - ', entry)+1)-2 ),charindex(' ',right(entry, len(entry) - charindex(' - ',entry,charindex(' - ', entry)+1)-2 )))
               else left(right(entry, len(entry)-charindex(' - ', entry)-2),charindex(' ',right(entry, len(entry)-charindex(' - ', entry)-2)))
               end )
         ),
         charindex(' ',right(entry, len(entry)-charindex(
         case
               when entry like '% - % - %'
               then left(right(entry, len(entry) - charindex(' - ',entry,charindex(' - ', entry)+1)-2 ),charindex(' ',right(entry, len(entry) - charindex(' - ',entry,charindex(' - ', entry)+1)-2 )))
               else left(right(entry, len(entry)-charindex(' - ', entry)-2),charindex(' ',right(entry, len(entry)-charindex(' - ', entry)-2)))
         end , entry)-
         len(case
               when entry like '% - % - %'
               then left(right(entry, len(entry) - charindex(' - ',entry,charindex(' - ', entry)+1)-2 ),charindex(' ',right(entry, len(entry) - charindex(' - ',entry,charindex(' - ', entry)+1)-2 )))
               else left(right(entry, len(entry)-charindex(' - ', entry)-2),charindex(' ',right(entry, len(entry)-charindex(' - ', entry)-2)))
         end )
         )))   AS browser,
         left(ltrim(rtrim(right(entry,6))),3) as status,
         ltrim(rtrim(right(ltrim(rtrim(right(entry,6))),2))) as subStatus
from     tblTreLog
order by convert(datetime,left(entry,19))

Connect the green connection arrow from the previous step to this one. Because the flat file is neither delimited nor fixed width, the bulk of the work was writing a SQL query to parse out the data fields. I also had to convert the logged UTC time to local time. By using CHARINDEX and identifiable key patterns, such as '%.%.%.%' and '%/%', you can break out the fields into the date, host IP, method, directory, port, source IP, browser, and status codes.
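
To make that approach easier to follow, here's a stripped-down, purely illustrative example. The sample string, character offsets, and -4 hour adjustment mirror the idea behind Web Listing 2 but are not the production query:

-- Illustrative only: pull the date/time and the next space-delimited field out of one raw line
DECLARE @entry varchar(4096)
SET @entry = '2010-01-11 14:30:05 192.0.2.10 GET /office/default.aspx'

SELECT
      -- the first 19 characters hold the logged UTC date and time; shift by -4 hours as in Web Listing 2
      DATEADD(hh, -4, CONVERT(datetime, LEFT(@entry, 19))) AS logDateTime,
      -- everything after the date/time, up to the next space, is the next field
      LEFT(SUBSTRING(@entry, 21, LEN(@entry)),
           CHARINDEX(' ', SUBSTRING(@entry, 21, LEN(@entry))) - 1) AS nextField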

Click Columns to preview the list of columns that the select code creates. Now connect the OLE DB source to the OLE DB destination and double-click the OLE DB Destination to add the tblTreWebLog to the Name of the table or the view field, as Web Figure 3 shows. Finally, click Mappings in the left panel to make sure the columns are mapped correctly, as Web Figure 4 shows.
Delete old tblTreWebLogArchive records. Next, add a new Execute SQL Task and add the SSIS Task 7 code, which Listing 7 shows, to the SQL statement.

DELETE FROM tblTreWebLogArchive
WHERE logDateTime < dateadd(d, -365, getdate())

You can modify the date criteria to store either more or less data. Our office had seasonal rather than monthly trends. If you have a lot of site traffic, you might decide to keep less than a year's worth of data. Connect the green connection arrow from the previous step to this one.
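
For example, to keep roughly three months of history instead of a year, the statement might become:

DELETE FROM tblTreWebLogArchive
WHERE logDateTime < dateadd(d, -90, getdate())
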
Archive tblTreWebLog. Task 8 adds another Execute SQL Task that copies the daily data into an archive table (tblTreWebLogArchive). Add the code for SSIS Task 8, which Listing 8 shows.

INSERT   INTO tblTreWebLogArchive
SELECT   logDateTime,
         hostIP,
         method,
         directory,
         port,
         sourceIP,
         browser,
         status,
         substatus
FROM     tblTreWebLog

Connect the green connection arrow from the previous step to this one.
Process the web log cube. The last SSIS task processes the cube that you'll create in Step 4: Build the Cube. (You must build the cube before you can add this step to the SSIS package.) Drag an Analysis Services Processing Task to the Control Flow window. Connect the green connection arrow from the previous step to this one. Double-click the Analysis Services Processing Task and click Analysis Services. Click New next to Analysis Services connection manager, as Web Figure 5 shows, and create a new connection to your cube. Click Add and navigate to your web log cube to select it for processing.

Step 3: What to Analyze

Originally, I had been loading the daily files so I could analyze potential threats or unauthorized activity. I created a few Reporting Services reports that search for keywords, directories, or times. I was also able to get statistics on each directory's hits. This worked well for the small amount of daily data. However, when the office wanted historical statistics by specific dates and directories, the reports took an extremely long time to generate. The office wanted to be able to click on a Reporting Services report and see it in seconds rather than minutes. I also wanted to be able to analyze error codes over time as we made changes to our web applications. I decided to create an SSAS cube from the archived data so that I could slice and dice it as needed.

Step 4: Build the Cube

I built the first cube on the archive table and included all the dimensions as well as the directory data exactly as it was logged. This was too granular and contained too much information: it had all of the directories, subdirectories, files, and image files, in addition to bot data.

The office just wanted to see the main directories as they relate to the web site layout. Using the code in Web Listing 3, I created a view that limits the data returned and counts hits against the main directory rather than the sublevels, as Web Figure 6 shows.

CREATE VIEW vwTreWebLogCube
AS
SELECT   DATEPART(yy,logDateTime) AS logYear,
         DATEPART(mm,logDateTime) AS logMonth,
         CONVERT(DATETIME,CONVERT(VARCHAR(12),logDateTime,101)) AS logDateTime,
         CASE
            WHEN CHARINDEX('/', directory, 2) > 0
            THEN LEFT(directory, CHARINDEX('/',directory,2)-1)
            ELSE directory
         END AS directory,
         [status],
         1 AS Request
FROM     tblTreWebLogArchive
WHERE    CASE
            WHEN CHARINDEX('/', directory, 2) > 0
            THEN LEFT(directory, CHARINDEX('/',directory,2)-1)
            ELSE directory
         END = '/index.html'
   OR    CASE
            WHEN CHARINDEX('/', directory, 2) > 0
            THEN LEFT(directory, CHARINDEX('/',directory,2)-1)
            ELSE directory
         END = '/accounting'
   OR    CASE
            WHEN CHARINDEX('/', directory, 2) > 0
            THEN LEFT(directory, CHARINDEX('/',directory,2)-1)
            ELSE directory
         END = '/bankinvermont'
   OR    CASE
            WHEN CHARINDEX('/', directory, 2) > 0
            THEN LEFT(directory, CHARINDEX('/',directory,2)-1)
            ELSE directory
         END = '/cash'
   OR    CASE
            WHEN CHARINDEX('/', directory, 2) > 0
            THEN LEFT(directory, CHARINDEX('/',directory,2)-1)
            ELSE directory
         END = '/debt'
   OR    CASE
            WHEN CHARINDEX('/', directory, 2) > 0
            THEN LEFT(directory, CHARINDEX('/',directory,2)-1)
            ELSE directory
         END = '/literacy'
   OR    CASE
            WHEN CHARINDEX('/', directory, 2) > 0
            THEN LEFT(directory, CHARINDEX('/',directory,2)-1)
            ELSE directory
         END = '/office'
   OR    CASE
            WHEN CHARINDEX('/', directory, 2) > 0
            THEN LEFT(directory, CHARINDEX('/',directory,2)-1)
            ELSE directory
         END = '/pension'
   OR    CASE
            WHEN CHARINDEX('/', directory, 2) > 0
            THEN LEFT(directory, CHARINDEX('/',directory,2)-1)
            ELSE directory
         END = '/reporting'
   OR    CASE
            WHEN CHARINDEX('/', directory, 2) > 0
            THEN LEFT(directory, CHARINDEX('/',directory,2)-1)
            ELSE directory
         END = '/retirement'
   OR    CASE
            WHEN CHARINDEX('/', directory, 2) > 0
            THEN LEFT(directory, CHARINDEX('/',directory,2)-1)
            ELSE directory
         END = '/unclaimed'
   OR    CASE
            WHEN CHARINDEX('/', directory, 2) > 0
            THEN LEFT(directory, CHARINDEX('/',directory,2)-1)
            ELSE directory
         END = '/VendorPortal'

I used the view as the fact table for the cube and also built the dimensions off the same view. I revised the cube to include only a Directory dimension and a Status dimension, as well as two time dimensions. This saves extract, transform, and load (ETL) time, and the cube processes in about 10 minutes.
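
Incidentally, the repeated CASE expression in Web Listing 3 can be collapsed by computing the top-level directory once in a derived table and filtering with an IN list. The following is only a sketch of an equivalent form, with a placeholder name, not the view I actually deployed:

-- Sketch: same result as vwTreWebLogCube, computing the top-level directory once
CREATE VIEW vwTreWebLogCube_Alt
AS
SELECT logYear, logMonth, logDateTime, directory, [status], Request
FROM (
      SELECT DATEPART(yy,logDateTime) AS logYear,
             DATEPART(mm,logDateTime) AS logMonth,
             CONVERT(DATETIME,CONVERT(VARCHAR(12),logDateTime,101)) AS logDateTime,
             CASE
                WHEN CHARINDEX('/', directory, 2) > 0
                THEN LEFT(directory, CHARINDEX('/',directory,2)-1)
                ELSE directory
             END AS directory,
             [status],
             1 AS Request
      FROM tblTreWebLogArchive
     ) AS t
WHERE directory IN ('/index.html','/accounting','/bankinvermont','/cash','/debt','/literacy',
                    '/office','/pension','/reporting','/retirement','/unclaimed','/VendorPortal')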

Step 5: Create Reports

In this last step, we create some Reporting Services reports from the cube data. I created one report with a graph that can be sliced and diced by date, month, year, and directory. This lets the office see usage trends tied to a specific date when a press conference was held, an event was attended, or an advertisement was run. Over a year, we can see which months' activity is low. I also created a pie chart to analyze directory hits and a third report that tracks error and status returns. You can view these reports in Web Figures 7, 8, and 9.
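
The reports themselves are driven by the cube, but the aggregation behind the monthly trend report is easy to picture as a plain T-SQL query against the view. This is only an illustration of the shape of the data, not the report's actual cube dataset:

-- Requests per month per main directory, drawn from the relational view
SELECT   logYear,
         logMonth,
         directory,
         SUM(Request) AS requests
FROM     vwTreWebLogCube
GROUP BY logYear, logMonth, directory
ORDER BY logYear, logMonth, directory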

What's Next?

The next step would be to build a cube based on potential threats or bots. A view that searches the URL for keywords associated with SQL injection, or that excludes all known “good” URLs, could be used to analyze threat trends over time. A bot or crawler cube could be helpful in predicting when the site will be crawled and by whom. If site maintenance is scheduled at a time when your site is being crawled, the site may be slower.
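
A rough sketch of what such a view might look like follows; the view name, keyword list, and LIKE patterns are illustrative only and would need tuning for a real site:

-- Sketch: flag requests whose URL contains common SQL injection or traversal fragments
CREATE VIEW vwTreSuspiciousRequests
AS
SELECT logDateTime,
       sourceIP,
       directory,
       [status]
FROM   tblTreWebLogArchive
WHERE  directory LIKE '%xp_cmdshell%'
   OR  directory LIKE '%;--%'
   OR  directory LIKE '%union%select%'
   OR  directory LIKE '%../../%'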

The last key component would be building a translation table for the browser and OS data. Instead of displaying the cryptic user-agent information captured in the log, the browser version and the OS could be presented as more user-friendly values. However, manually creating a reference table could be quite time consuming, and attempting to automate it with dynamic SQL would probably be more of an academic exercise, with results of technical interest rather than an improvement to trend analysis or threat detection.
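
A starting point for such a translation table might look like the following sketch. The table name, patterns, and friendly names are hypothetical, the list would grow considerably in practice, and a user agent that matches more than one pattern would need extra handling:

-- Hypothetical lookup table of user-agent patterns and friendly names
CREATE TABLE dbo.tblBrowserLookup (
      pattern      varchar(100) NOT NULL,   -- LIKE pattern matched against the raw browser column
      friendlyName varchar(50)  NOT NULL
)

INSERT INTO dbo.tblBrowserLookup (pattern, friendlyName) VALUES ('%MSIE%', 'Internet Explorer')
INSERT INTO dbo.tblBrowserLookup (pattern, friendlyName) VALUES ('%Firefox%', 'Firefox')
INSERT INTO dbo.tblBrowserLookup (pattern, friendlyName) VALUES ('%Googlebot%', 'Google crawler')

-- Translate the cryptic user-agent strings to friendly names
SELECT w.logDateTime,
       COALESCE(b.friendlyName, 'Other') AS browserName
FROM   tblTreWebLogArchive AS w
LEFT JOIN dbo.tblBrowserLookup AS b
       ON w.browser LIKE b.pattern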

 

 
