WebHawk’s Real Time Content Recognition (RCR) classification engine analyzes language-independent parameters, which play a much more important part in the accurate classification of a web site than language-dependent keywords do. By analyzing the contextual aspects of the web page, the RCR achieves a classification accuracy in the mid to high nineties percent.
WebHawk uses a single-pass real time engine to classify web
pages. This approach requires much less overhead and, as a
result, the RCR is fast, scalable and has a small
footprint. It classifies sites in a few milliseconds.
Other filtering layers are also present to permit an
administrator (a teacher, for example) to reclassify a URL
as needed, rather than submitting the request and waiting
for a reply.

Content Classification Flow Diagram
WebHawk Classification Process
WebHawk’s innovative multilayer content
categorization technology combines dynamic RCR (Real
Time Content Recognition) with a URL classification
cache, a pre-classified dynamic database, a
customer-customizable database and URL tagging
capabilities to provide the most robust URL
categorization engine in the content filtering
industry today.
The diagram and explanation to the left describe how the categorization process works.
WebHawk’s dynamic database supports reverse lookup,
so if an IP address is submitted, the matching URL is
found and its category is returned, provided it is in
the database.
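The reverse lookup described above can be pictured as two table lookups in a row. The following is only a minimal sketch: the function name, the two mappings and the sample addresses are hypothetical, and the source does not describe how the dynamic database actually stores its records.

```python
from typing import Mapping, Optional


def category_for_ip(
    ip: str,
    ip_to_url: Mapping[str, str],
    url_to_category: Mapping[str, str],
) -> Optional[str]:
    """Resolve an IP address to a URL, then return that URL's category.

    Both mappings are hypothetical stand-ins for the dynamic database.
    Returns None when either the IP or the URL is unknown.
    """
    url = ip_to_url.get(ip)
    if url is None:
        return None
    return url_to_category.get(url)


# Illustrative data only (documentation address range and example domain).
ip_index = {"192.0.2.10": "http://casino.example/"}
categories = {"http://casino.example/": "gambling"}
found = category_for_ip("192.0.2.10", ip_index, categories)
missing = category_for_ip("192.0.2.99", ip_index, categories)
```

A lookup that fails at either step returns nothing, which matches the condition in the text that the category is only returned if the entry is in the database.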
Multilayer Categorization Process
When the URL request is made, the URL is first looked up in
the User Custom Database. If the URL is found, the category
will be returned to the application for an appropriate
action such as block or allow. If the URL is not in the User
Custom Database, it passes to the next layer, WebHawk’s
Dynamic Database.
If the URL is present in the Dynamic Database, the category
is returned to the application. If the URL is not present,
the request passes to the next layer.
The URL is looked up in the URL Classification Cache. This
cache contains the results of previous categorizations that
have been made by WebHawk’s RCR. If the URL is present, the
category is returned to the application. If not, then all
methods of providing a categorization based upon the
outbound URL request have been exhausted. The next stage
involves analysing the inbound data that is returned as a
result of the URL request.
As the data is returned from the URL request, a check is
first made in the headers for tags such as ICRA or SafeSurf,
which may reveal the category of the URL. (‘Responsible’
adult sites such as Penthouse and Playboy declare their URL
category in this manner. Regrettably, many millions of adult
sites do not.) If tagging information is present,
it is used and the category of the URL is once again
returned to the application. If no tag information is
present, then the data is analysed by the RCR.
WebHawk’s RCR parses and analyses the markup language
returned from the web site. This determines the URL category
dynamically, in real time, and returns the category to the
application.
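The layered lookup described above can be sketched as a chain of fallbacks in which the cheaper checks run first. This is an illustration under stated assumptions, not WebHawk’s implementation: the function `categorize`, the dictionary-based databases, and the caller-supplied `fetch`, `read_tags` and `rcr_classify` callables are all hypothetical names introduced here. Only RCR results are written to the cache, following the statement that the cache holds previous RCR categorizations.

```python
from typing import Callable, Dict, Mapping, Optional, Tuple

Headers = Mapping[str, str]


def categorize(
    url: str,
    user_db: Mapping[str, str],
    dynamic_db: Mapping[str, str],
    cache: Dict[str, str],
    fetch: Callable[[str], Tuple[Headers, str]],
    read_tags: Callable[[Headers], Optional[str]],
    rcr_classify: Callable[[str], str],
) -> Tuple[str, str]:
    """Return (category, layer_name) for a URL.

    All parameters are hypothetical stand-ins for the layers in the text.
    """
    # Layers 1-3 need only the outbound URL, so they are tried in order.
    layers = (("user", user_db), ("dynamic", dynamic_db), ("cache", cache))
    for layer_name, table in layers:
        if url in table:
            return table[url], layer_name

    # Layers 4-5 need the inbound response: headers first, then markup.
    headers, markup = fetch(url)
    tagged = read_tags(headers)
    if tagged is not None:
        return tagged, "tag"

    category = rcr_classify(markup)
    cache[url] = category  # remember the RCR result for later requests
    return category, "rcr"


# Illustrative stubs; a real deployment would fetch and classify for real.
fetch_log = []


def stub_fetch(url: str) -> Tuple[Headers, str]:
    fetch_log.append(url)
    return {}, "<html></html>"


demo_cache: Dict[str, str] = {}
first = categorize("http://a.example/", {}, {}, demo_cache,
                   stub_fetch, lambda h: None, lambda m: "sport")
second = categorize("http://a.example/", {}, {}, demo_cache,
                    stub_fetch, lambda h: None, lambda m: "sport")
```

In this sketch the second request for the same URL is answered from the cache without fetching the page again, which is the point of keeping the URL-only layers ahead of the content-based ones.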
The RCR analysis of the data concerns itself with both
language-related and non-language-related markup language
parameters. When WebHawk’s RCR receives a page of streaming
data, it parses the data and creates what is called an RDV,
or Raw Data Vector. This is a 700-dimensional mathematical
representation of the data on the page.
Next, a learning algorithm reduces the RDV to a PDV, or
Processed Data Vector. This reduces the 700 dimensions to
about twenty-five feature sets. A feature set might
comprise, for example, a black background, a bright
foreground font, a number of links and the word ‘sex’ in a
meta tag. These features are then run through a clustering
mechanism based on a neural network trained by
backpropagation, which looks at all the features, groups
them and outputs a percentage likelihood that the web site
concerned fits one of the RCR categories, such as adult
content, dating, gambling or sport.
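The shape of this pipeline (a 700-dimensional RDV reduced to roughly twenty-five features, then scored by a small network) can be sketched as a single forward pass. Everything below is an assumption made for illustration: the linear reduction, the hidden layer size of 8, the tanh and softmax functions and the random placeholder weights are not taken from WebHawk, and the backpropagation training step is omitted entirely.

```python
import math
import random
from typing import Dict, List

RDV_DIM = 700  # Raw Data Vector size, as stated in the text
PDV_DIM = 25   # "about twenty-five" features in the Processed Data Vector
HIDDEN = 8     # hypothetical hidden layer size
CATEGORIES = ["adult content", "dating", "gambling", "sport"]

Matrix = List[List[float]]


def random_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    """Placeholder weights; a real system would learn these from labelled pages."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(m: Matrix, v: List[float]) -> List[float]:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def softmax(z: List[float]) -> List[float]:
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(z)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in z]
    total = sum(exps)
    return [e / total for e in exps]


def classify(rdv: List[float], reduce_w: Matrix,
             hidden_w: Matrix, out_w: Matrix) -> Dict[str, float]:
    """One pass: RDV -> PDV -> hidden layer -> percentage per category."""
    if len(rdv) != RDV_DIM:
        raise ValueError("expected a 700-dimensional RDV")
    pdv = matvec(reduce_w, rdv)  # 700 dimensions down to 25
    hidden = [math.tanh(h) for h in matvec(hidden_w, pdv)]
    probs = softmax(matvec(out_w, hidden))
    return {name: 100.0 * p for name, p in zip(CATEGORIES, probs)}


rng = random.Random(0)
reduce_w = random_matrix(PDV_DIM, RDV_DIM, rng)
hidden_w = random_matrix(HIDDEN, PDV_DIM, rng)
out_w = random_matrix(len(CATEGORIES), HIDDEN, rng)
rdv = [rng.random() for _ in range(RDV_DIM)]  # stand-in for a parsed page
scores = classify(rdv, reduce_w, hidden_w, out_w)
```

Because the weights here are random, the resulting percentages carry no meaning; the sketch only shows how a single pass over the vector can yield a likelihood per category.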
This approach is both accurate and fast, and the AI
techniques produce a result in one pass. Small keyword
dictionaries of only twenty to thirty words can be used
because much of the analysis is performed on the
non-language-dependent parameters. Together, the keyword
check and the ‘look and feel’ analysis make the RCR a
powerful tool for dynamic real time URL classification.
The multilayered filtering process described above provides
a rapid and accurate method of categorizing URLs.