WebHawk 3.0

   

NEXT GENERATION FILTERING TECHNOLOGY

WebHawk’s next generation content filtering technology adds a Real Time Content Recognition layer (RCR).

A large URL database is a vital component of a content filtering solution. But for prolific categories (such as adult content), which number in the millions of sites and change URL locations rapidly, a URL database on its own is ineffective.

In order to provide a robust filter for categories such as adult content, it is necessary to deploy more than a database technology.

WebHawk uses Real Time Content Recognition (RCR) to analyze URL requests and detect the software coding and image patterns common to adult content, gambling, proxy or dating sites before a block or allow decision is made.

Real-Time URL Classification
Most real time URL classification engines are based upon keyword analysis. WebHawk’s Real Time Content Recognition (RCR) layer combines keyword analysis with software algorithms that analyze other information available within the mark-up language of the web page.

 

WebHawk’s Real Time Content Recognition (RCR) classification engine analyzes language-independent parameters, which play a much more important part than keywords in classifying a web site accurately. By analyzing the contextual aspects of the web page, the RCR reaches an accuracy in the mid to high nineties percent.

WebHawk uses a single-pass real time engine to classify web pages. This approach requires much less overhead, and as a result the RCR is fast and scalable and has a small footprint. It classifies sites in a few milliseconds.

Other filtering layers are also present so that an administrator (a teacher, for example) can reclassify a URL directly as needed, rather than submitting a reclassification request and waiting for a reply.
 
 

Content Classification Flow Diagram

WebHawk Classification Process
WebHawk’s innovative multilayer content categorization technology combines dynamic RCR (Real Time Content Recognition) with a URL classification cache, a pre-classified dynamic database, a customer-customizable database and URL tagging capabilities to provide the most robust URL categorization engine in the content filtering industry today.

The flow diagram and the explanation below describe how the categorization process works.

WebHawk’s dynamic database supports reverse lookup: if an IP address is submitted instead of a URL, the corresponding URL is found and its category is returned, provided it is in the database.
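The reverse lookup can be pictured with a minimal sketch. The two dictionaries, the example hosts and the function name `categorize` are illustrative assumptions, not WebHawk’s actual data structures or API.

```python
# Hypothetical in-memory stand-ins for the dynamic database.
URL_CATEGORIES = {"casino.example": "gambling", "news.example": "news"}
IP_TO_URL = {"203.0.113.7": "casino.example"}


def categorize(key):
    """Return (url, category) for a URL or an IP address, or None if unknown."""
    # Reverse lookup: if the key is a known IP address, map it back to its URL.
    url = IP_TO_URL.get(key, key)
    category = URL_CATEGORIES.get(url)
    if category is None:
        return None
    return url, category
```

Submitting either `"casino.example"` or `"203.0.113.7"` to this sketch yields the same category, which is the behaviour the paragraph above describes.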
 


Multilayer Categorization Process
When the URL request is made, the URL is first looked up in the User Custom Database. If the URL is found, the category will be returned to the application for an appropriate action such as block or allow. If the URL is not in the User Database, it passes to the next layer, WebHawk’s Dynamic Database.

If the URL is present in the Dynamic Database, the category is returned to the application. If the URL is not present, the request passes to the next layer.

The URL is then looked up in the URL Classification Cache. This cache contains the results of previous categorizations made by WebHawk’s RCR. If the URL is present, the category is returned to the application. If not, all methods of categorizing based upon the outbound URL request have been exhausted. The next stage involves analysing the inbound data that is returned as a result of the URL request.

As the data is returned from the URL request, the headers are first checked for tags such as ICRA or SafeSurf, which may reveal the category of the URL. (‘Responsible’ adult sites such as Penthouse and Playboy declare their URL category in this manner; many millions of adult sites regrettably do not.) If tagging information is present, it is used and the category of the URL is returned to the application. If no tag information is present, the data is analysed by the RCR.
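A tag check of this kind might look like the sketch below. It assumes the label arrives in a `PICS-Label` response header and that the rating service can be recognised by a substring of its name; the marker strings and the function name `find_rating_service` are illustrative assumptions, not WebHawk’s implementation.

```python
# Assumed markers that identify a rating service inside a label string.
RATING_SERVICES = {"icra.org": "ICRA", "safesurf": "SafeSurf"}


def find_rating_service(headers):
    """Return the name of the rating service found in the headers, or None."""
    label = ""
    for name, value in headers.items():
        # Header names are case-insensitive, so compare in lower case.
        if name.lower() == "pics-label":
            label = value.lower()
    for marker, service in RATING_SERVICES.items():
        if marker in label:
            return service
    return None
```

A real filter would go on to read the rating values inside the label to decide the category; this sketch only detects that a label is present.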

WebHawk’s RCR parses and analyses the markup language returned from the web site. This determines the URL category dynamically, in real time, and returns the category to the application.
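The layered lookup described above can be sketched as a simple cascade. Every name here (`classify`, `fetch`, `read_tag`, `rcr`, and the dictionary-based databases) is a hypothetical stand-in chosen for illustration; only the order of the layers comes from the description.

```python
from typing import Callable, Dict, Optional, Tuple


def classify(url: str,
             user_db: Dict[str, str],
             dynamic_db: Dict[str, str],
             cache: Dict[str, str],
             fetch: Callable[[str], Tuple[dict, str]],
             read_tag: Callable[[dict], Optional[str]],
             rcr: Callable[[str], str]) -> Tuple[str, str]:
    """Return (category, layer), where layer names the stage that answered."""
    # Layers 1-3 need only the outbound URL: user database, dynamic
    # database, then the cache of earlier RCR results.
    for layer, table in (("user", user_db), ("dynamic", dynamic_db), ("cache", cache)):
        if url in table:
            return table[url], layer

    # Layers 4-5 inspect the inbound response.
    headers, body = fetch(url)
    tagged = read_tag(headers)
    if tagged is not None:
        return tagged, "tag"

    category = rcr(body)
    cache[url] = category  # remember the RCR result for later requests
    return category, "rcr"
```

In this sketch a second request for the same uncategorized URL is answered from the cache, so the page is only analysed once.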

The RCR analysis of the data concerns itself with both language-related and non-language-related markup parameters. When WebHawk’s RCR receives a page of streaming data, it parses the data and creates what is called an RDV, or Raw Data Vector. This is a 700-dimensional mathematical representation of the data on the page.

Next, a learning algorithm reduces the RDV to a PDV, or Processed Data Vector, cutting the 700 dimensions down to about twenty-five feature sets. A feature might comprise, for example, a black background, a bright foreground font, a number of links and the word ‘sex’ in a meta tag. These features are then run through a clustering mechanism based upon a neural network trained by backpropagation, which looks at all the features, groups them and produces a percentage likelihood that the web site concerned fits one of the RCR categories, such as adult content, dating, gambling or sport.
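The shape of this pipeline can be illustrated with a small sketch. It is not WebHawk’s algorithm: the weights are random placeholders rather than learned values, the reduction is a plain linear projection with a `tanh` squashing step, and the softmax output is an assumed way of turning scores into percentages. Only the dimensions (700 and about 25) and the category names come from the text.

```python
import math
import random

RDV_DIMS, PDV_DIMS = 700, 25
CATEGORIES = ["adult content", "dating", "gambling", "sport"]

rng = random.Random(0)
# Placeholder weights; a real system would learn these from labelled pages.
projection = [[rng.uniform(-1, 1) for _ in range(RDV_DIMS)] for _ in range(PDV_DIMS)]
output_weights = [[rng.uniform(-1, 1) for _ in range(PDV_DIMS)] for _ in CATEGORIES]


def reduce_rdv(rdv):
    """Reduce a 700-dimensional RDV to a 25-dimensional PDV."""
    return [math.tanh(sum(w * x for w, x in zip(row, rdv))) for row in projection]


def score(pdv):
    """Return a percentage likelihood per category (softmax over the outputs)."""
    logits = [sum(w * x for w, x in zip(row, pdv)) for row in output_weights]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return {name: 100.0 * e / total for name, e in zip(CATEGORIES, exps)}
```

Calling `score(reduce_rdv(rdv))` on a 700-element list returns one percentage per category, with the percentages summing to 100. Training the weights by backpropagation is outside the scope of this sketch.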

This approach is both accurate and fast, and the AI techniques produce a result in one pass. Small keyword dictionaries of only twenty to thirty words can be used because much of the analysis is performed on the non-language-dependent parameters. Together, the keyword check and the ‘look and feel’ analysis make the RCR a powerful tool for dynamic real-time URL classification.

The multilayered filtering process described above provides a rapid and accurate method of categorizing URLs.

 

© 2007 - Tangent Inc. All rights reserved.                Call 1-888-TANGENT