Discovery of ads web hosts through traffic data analysis∗

V. Bacarella
University of Pisa, Pisa, Italy
[email protected]

F. Giannotti, M. Nanni
ISTI-CNR, Pisa, Italy
{f.giannotti, m.nanni}@isti.cnr.it

D. Pedreschi
University of Pisa, Pisa, Italy
[email protected]

∗ All authors are members of the Pisa KDD Lab: http://www-kdd.isti.cnr.it/. The present research is partially supported by “Fondazione Cassa di Risparmio di Pisa”, under the “WebDigger Project”, and by the Italian Ministry of Education, University and Research, under the ECD Project “Technologies and Services for Enhanced Content Delivery”.
ABSTRACT

One of the most pressing problems in web crawling – the most expensive task of any search engine, in terms of time and bandwidth consumption – is the detection of useless segments of the Internet. In some cases such segments are purposely created to deceive the crawling engine, while in others they simply do not contain any useful information. Currently, the typical approach to the problem consists in using a human-compiled blacklist of sites to avoid (e.g., advertising sites and web counters), but, due to the highly dynamic nature of the Internet, keeping such lists manually up-to-date is quite infeasible. In this work we present a web usage statistics-based solution to the problem, aimed at automatically – and, therefore, dynamically – building blacklists of sites that the users of a monitored web community consider (or appear to consider) useless or uninteresting. Our method performs a linear-time analysis of the traffic information which yields an abstraction of the linked web that can be incrementally updated, therefore allowing a streaming computation. The crawler can use the list produced in this way to prune out such sites, or to give them a low priority, before the (re-)spidering activity starts and, therefore, without analysing the content of crawled documents.

1. INTRODUCTION

The ever-increasing popularity of the Web, as well as its size, exacerbates the importance of effective search services, capable of helping people discover and select desired contents. The search engines of the current generation, despite considerable achievements, are clearly inadequate to this purpose, and their limitations are becoming more and more evident for general queries. One major reason for this weakness lies in the crawling task: it is difficult to drive crawlers towards interesting sites, and therefore crawlers gather huge amounts of uninteresting content, which negatively impacts time and bandwidth consumption and, more importantly, the possibility of high-quality query answering. The importance of this drawback is highlighted by the observation that a large portion of the web consists of uninteresting content. Indeed, there is a wide family of Internet site categories which are uninteresting from the point of view of a web crawler, since visiting them does not yield any information useful to the search engine users in the querying process. While the usefulness concept is quite subjective, and thus strongly dependent on the search engine users, there exist some site categories which are universally considered undesirable in the crawling step, in that they have no information content at all – at least from a common user viewpoint. We point out two major examples, which will be the main focus of this paper:

Advertising: hosts containing only pure advertising. They have a strong visibility from other sites, which usually link to them by means of banners or pop-up windows.

Counter services: hosts containing scripts (CGI or other mechanisms) which collect statistics on the accesses to web pages. Monitored pages usually contain links or scripts which autonomously invoke such services.

Recognizing and avoiding the crawling of these web sites would considerably improve the efficiency of the crawler, because of their pervasive presence in both commercial and amateur web sites.

In this paper, we propose an approach to discover uninteresting sites on the basis of traffic, or usage, information, i.e., users' requests registered in web log files, rather than by analyzing the contents or link structure of web pages. We present a web usage statistics-based solution to the problem, aimed at (i) building, on the basis of the stream of traffic information, an abstraction of the linked web, (ii) automatically – and, therefore, dynamically – computing blacklists of sites that the users of a monitored web community consider (or appear to consider) useless or uninteresting, and (iii) feeding the list produced in this way to the crawler, which can prune out such sites or give them a low priority before the (re-)spidering activity starts and, therefore, without fetching them and analysing their content. Remarkably, our method exhibits a linear time complexity (w.r.t. the traffic information stream), and the abstraction of the linked web can be incrementally updated, therefore allowing a streaming computation. This makes the technique applicable, in principle, even to massive proxy server logs.

The rest of the paper is organized as follows. In Section 2 a brief review of related work is provided. Section 3 then introduces a formal framework for web traffic data. Section 4 explains the pre-processing of the traffic dataset and the implementation details of our technique, while Section 5 exploits the framework to design a few simple heuristics for useless sites detection. Section 6 reports the results of some preliminary experiments on real data. Finally, in Section 7 some conclusions are drawn and some remarks on future work are provided.

2. RELATED WORK

Collecting information about web hosts – e.g., distinguishing different types of web sites – makes it possible to implement focused crawlers [2, 11, 9, 3], designed to gather only documents on specific topics or within sites of selected categories, thus reducing the amount of network traffic and downloads needed. Several examples are available in the literature: in [2] a method is presented which uses a pre-defined set of topics to selectively search relevant pages during the crawling process. [11], instead, combines text and link analysis to improve ranking and crawling algorithms. Finally, [1] notices that the spread of templatised pages has led to a situation where the graph structure of a site often lacks any correlation with the type of site it belongs to. A similar problem is addressed by Lempel et al. [5], who follow a structural statistics approach to solve the link farm problem, also known as search engine spamming. In all these approaches, only information on linkage and/or content is exploited, thus ignoring which web resources are typically used by visitors, and how. Within the field of search engines, other approaches can be found which make use of usage data extracted from web logs. Such data are generally exploited in link analysis-based techniques for ranking web pages. E.g., [7] presents an approach that uses web logs to compute the relevance of a web user to a given query, while [6] proposes to construct a traffic graph by mining users' access patterns, and then applies a modified PageRank algorithm to rank web pages.

3. USAGE-BASED CHARACTERISATION OF WEB SITES

Monitoring the user activity on the web, e.g., through a proxy server, makes it possible to extract information not only on general user behaviours and preferences, but also on the nature of the visited pages/sites. The general idea of our proposal relies on the intuition that pages and sites almost systematically avoided by users – in spite of the fact that the users are significantly exposed to web pages which contain links to such sites – are useless. On the other hand, pages which are excessively visited are probably invoked automatically by means of some kind of scripting mechanism, and have therefore not been voluntarily requested by the users. In more abstract terms, we want to exploit the semantic information implicitly given by users with their navigation, and we are particularly interested in locating families of web sites which are characterised by some extreme user behaviours, which can be effectively recognised with simple statistical means. In this section we present the formalisation of our approach, together with the definition of some parameters which will be used later, in Section 5, to design some heuristics.

3.1 Usage-induced link structure

The canonical view of the Web adopted in the search engine field is essentially based on its link structure, which is represented as a simple oriented graph, the web pages being its vertices and the links the (oriented) edges between them. In the context of traffic data, each transition from page A to page B in a user session corresponds to a link contained in page A which points to page B. The set of page transitions that can be collected by tracing the web activity of a community of users, then, essentially represents a sub-graph of the whole Web, each vertex corresponding to a page and each edge to a page transition. The same page transition usually occurs more than once, so each transition is associated with a frequency weight. This leads to the following definition:

Definition 1 (Traffic Graph). Let S = ⟨(1, r_1, d_1), ..., (N, r_N, d_N)⟩ be a sequence of N web requests monitored in a web community over a given time interval, each request being composed of a referrer r_i (i.e., the page where the request originated) and a destination d_i (i.e., the page requested). Then, we define the Traffic Graph of S (or simply Traffic Graph, when S is clear from the context) as a triple TG = (V, E, T), where:

  V = {r_1, d_1, ..., r_N, d_N}
  E = {(r_1, d_1), ..., (r_N, d_N)}
  T(p, q) = |{(n, p, q) ∈ S}|               for p, q ∈ V
  T(p) = Σ_{(q,p) ∈ E} T(q, p)              for p ∈ V

The vertices in V are called pages, while the edges in E are called transitions. Finally, the function T : V → N is called the Traffic of pages, while T : V × V → N is called the Traffic of transitions.

In other words, a traffic graph is a graph adorned with a traffic function T(), defined on both single pages and page transitions. T(p) represents the incoming traffic of page p, i.e., the number of collected HTTP requests which ask for page p, while T(p, q) is the traffic from p to q, i.e., the number of collected HTTP requests which ask for page q and have p as referrer.
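To make Definition 1 concrete, the following Python sketch (ours, not the authors' code; all names are illustrative) builds a traffic graph from a stream of (referrer, destination) pairs in a single pass, maintaining both the transition traffic T(p, q) and the incoming page traffic T(p):

```python
from collections import defaultdict

class TrafficGraph:
    """Sketch of the Traffic Graph of Definition 1 (illustrative, not the authors' code)."""

    def __init__(self):
        self.edge_traffic = defaultdict(int)  # T(p, q): requests with referrer p and destination q
        self.page_traffic = defaultdict(int)  # T(p): total incoming traffic of page p
        self.vertices = set()                 # V: every page seen as referrer or destination

    def add_request(self, referrer, destination):
        """Process one monitored request (r_i, d_i); a single pass over the log builds the graph."""
        self.vertices.update((referrer, destination))
        self.edge_traffic[(referrer, destination)] += 1
        self.page_traffic[destination] += 1

# Example: three requests, two of them along the same transition.
tg = TrafficGraph()
for r, d in [("A", "B"), ("A", "B"), ("C", "B")]:
    tg.add_request(r, d)
assert tg.edge_traffic[("A", "B")] == 2 and tg.page_traffic["B"] == 3
```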

3.2 Characterising web traffic

The transition graph defined in the previous section summarises the traffic load over the web, and some interesting patterns can be defined and retrieved on such a summary. In what follows, we provide one such pattern, which can represent the basis for defining some heuristics to detect uninteresting nodes. From a user perspective, the links connecting a normal web page to clearly uninteresting sites (e.g., links to advertising sites through banners) are not very appealing, and thus they are only very seldom followed. For this reason, a very simple approach to evaluating the usefulness of web pages could consist in analysing their traffic T(p) and marking as uninteresting the pages which have very low values. However, this solution suffers from some problems: (i) it fails to correctly model the presence of invasive advertising techniques, which autonomously generate additional page requests, yielding a high volume of traffic towards the corresponding advertising sites. As a result, content pages and such advertising sites would be indistinguishable. Moreover, (ii) even non-invasive advertising sites (as well as other families of uninteresting sites) can have a higher traffic than some types of interesting sites. One cause of this problem is that, while such interesting pages are linked by only a few other sites, advertising links appear almost everywhere. As a result, even though each exposure of an advertising link has a low probability of being followed (below 3%, according to recent marketing research [13]), when multiplied by the high number of exposures this low percentage easily gives rise to considerable traffic. A more sophisticated approach can be obtained by evaluating the success of web pages, i.e., how often the references to such pages are followed by users (even involuntarily). We observe that in the case of pop-up windows or counters, each exposure automatically causes a request for the linked resource, thus conveying all the traffic of the page which hosts the links towards the advertising site.

As a result, such advertising/counter links have a success percentage close to 100%, i.e., much higher than that of any average (interesting) content page. Non-invasive advertising techniques (e.g., simple banners) show a diametrically opposite behaviour: they have only a very low percentage of successful impressions, and thus only a small fraction of the incoming web traffic of the pages containing the advertising links is propagated to the advertising site. We can formalise this way of reasoning by defining the following parameter, which computes the relative incoming traffic of web pages:

Definition 2 (Relative Traffic). Given a traffic graph TG = (V, E, T), the relative traffic R(p) of a page p ∈ V is defined in the following way:

  R(p) = Φ_{(q,p) ∈ E'} R(q, p)              for p ∈ V

where

  R(q, p) = 1 if T(q, p) > T(q), and R(q, p) = T(q, p) / T(q) otherwise,

E' = {(q, p) ∈ E | T(q) > 0}, and Φ : 2^R → R is an aggregation operator over sets of real numbers.

We notice that the relative traffic of a transition, R(q, p), is essentially an estimated conditional probability of visiting page p given that q has been visited, and is similar to the weighted implicit links defined in [6] for different objectives. Typical instantiations of Φ, which will be considered in later sections, are the average, minimum and maximum operators. For each page p, the Φ aggregate is computed on the relative traffic of its incoming transitions. It is worth noting that, since web requests are monitored within a limited time window, some pages can have null incoming traffic. As a consequence, the relative traffic of transitions originating in these pages is meaningless, and therefore such transitions are excluded from the computation of R(p).

In Figure 1 a small example of traffic graph is shown, where edges and vertices are labelled with their corresponding value of the traffic function T().

[Figure 1: An example of Traffic Graph, with nodes q1, q2, q3 and p labelled with their page and transition traffic.]

According to Definition 2, transition (q2, p) is ignored, while the relative traffic of the two remaining transitions is the following: R(q1, p) = 1 and R(q3, p) = 5/6. Adopting the min aggregation operator, the relative traffic of page p will be R_min(p) = min{1, 5/6} = 5/6. The max operator will yield R_max(p) = 1. Finally, with the average operator we will have R_avg(p) ≈ 0.92. Notice that in some cases T(q, p) > T(q) (the outgoing traffic is higher than the incoming one). These situations, which are handled in Definition 2 by simply setting the relative traffic of the edge to 1, are due to the fact that the web requests considered are just a slice of the real network traffic, as mentioned above. The relative traffic parameter will be used in Section 5 as the core of a few heuristics aimed at detecting some families of uninteresting pages, including the advertising sites used so far as a leading example.

3.3 Searching for the best abstraction level

So far, both the traffic graph and the relative traffic have been defined on single web pages. However, in several cases reasoning on single pages can be too fine-grained an approach. Advertising services, for example, are usually implemented at the level of hosts (real and virtual), while the internal organization of their single pages can be confusing and dispersive (in terms of web traffic). Therefore, it is advisable to introduce mechanisms for abstracting pages into coarser-grained concepts, in order to reason and work on the right entities. In general, we give the following definition:

Definition 3. Given a set of pages V, we define an abstraction for V as any pair A = (Ψ, α) such that:

• Ψ is a set of elements, called abstract pages, and
• α : V → Ψ, called the abstraction operator.

In particular, in the rest of the paper we will use the instantiation given by Ψ = {hosts} and α(p) = “host of p”. Indeed, in [8] it is shown that the host level is an appropriate granularity for capturing web macro-structures at the content level, and we believe that it can be a reasonable abstraction level also for the user behaviour analysis we are interested in. For example, in this case we would have α(http://www.w3.org/MarkUp/) = http://www.w3.org = α(http://www.w3.org/News/). Other, more complex, instantiations can be defined and evaluated, e.g., considering the directory structure of URIs or the parameters passed to CGIs. From a traffic graph and an abstraction operator, then, we can obtain an abstraction of the traffic graph, in the following way:

Definition 4 (Abstract Traffic Graph). Given a traffic graph TG = (V, E, T) and an abstraction A = (Ψ, α) for V, we define the Abstract Traffic Graph TG' = (V', E', T') corresponding to A in the following way:

  V' = α(V)
  E' = {(α(a), α(b)) | (a, b) ∈ E}
  T'(a') = Σ_{α(a) = a'} T(a)                        for a' ∈ V'
  T'(a', b') = Σ_{α(a) = a', α(b) = b'} T(a, b)      for (a', b') ∈ E'

Essentially, nodes having the same abstraction collapse in the abstract traffic graph, and the consequences of this are propagated to the edges of the graph and to the traffic of nodes and transitions.
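As an illustration of Definitions 3 and 4 under the host-level instantiation α(p) = “host of p”, the following sketch (again ours, not from the paper; the host is extracted with urllib.parse) collapses a page-level traffic graph into its abstract counterpart by summing the traffic of pages and transitions that share the same abstraction:

```python
from collections import defaultdict
from urllib.parse import urlsplit

def host_of(page):
    """Host-level abstraction operator: alpha(p) = "host of p"."""
    parts = urlsplit(page)
    return f"{parts.scheme}://{parts.netloc}" if parts.netloc else page

def abstract_traffic_graph(edge_traffic, page_traffic, alpha=host_of):
    """Collapse a page-level traffic graph (Definition 1) into its abstract version
    (Definition 4): pages with the same abstraction are merged and their traffic summed."""
    abs_edges, abs_pages = defaultdict(int), defaultdict(int)
    for (p, q), t in edge_traffic.items():
        abs_edges[(alpha(p), alpha(q))] += t   # T'(a', b')
    for p, t in page_traffic.items():
        abs_pages[alpha(p)] += t               # T'(a')
    return abs_edges, abs_pages

# Both pages below collapse into the same abstract page, http://www.w3.org
assert host_of("http://www.w3.org/MarkUp/") == host_of("http://www.w3.org/News/")
```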

4. PRE-PROCESSING AND ANALYSIS OF PROXY-SERVER DATA

In this section we provide some implementation details for the computation of the traffic graph and the relative traffic parameter introduced in Section 3, together with some considerations on the time and space complexity of the analysis process.

4.1 Data acquisition into the Web Object Store

In this work, the usage data have been processed and analysed using a Web Object Store – i.e., a system for storing and handling several kinds of information related to the web, including usage, content and structure data – which is under development within the Enhanced Content Delivery (ECD) project. This is a three-year Italian national project focused on developing tools and technologies for delivering enhanced contents to final users. This entails identifying relevant material from various sources, transforming it, organizing it and delivering the most relevant material to interested users in a timely fashion. The implemented data model of the Web Object Store includes some Usage Data abstractions that represent a hierarchy of object aggregations:

HTTP Request: a generic HTTP request addressed to a Web server in order to retrieve a resource, for example an HTML page or its parts.

Page View: the set of HTTP requests which are necessary for the visual rendering of a Web page in a specific browser environment (HTML page, JPEG, multimedia, etc.).

Page views represent the basic elements of our computations and are extracted from the HTTP requests by applying some of the most common heuristics for web data preprocessing [4, 12]. These include, in particular, the construction of page views as sequences of HTTP requests beginning with a request for an HTML page and followed by a sequence of requests for elements of that web page (images, multimedia, etc.), enforcing a temporal upper-bound constraint between consecutive requests, as sketched below.
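A minimal sketch of this page-view construction heuristic could look as follows; it is an illustration under our own assumptions (the field names and the 10-second gap threshold are ours, not taken from the paper):

```python
from datetime import timedelta

MAX_GAP = timedelta(seconds=10)  # assumed temporal upper bound between consecutive requests

def build_page_views(requests):
    """Group a time-ordered list of requests (dicts with 'time', 'uri', 'is_html') into
    page views: each view starts with an HTML request and absorbs the following
    embedded-object requests until the next HTML request or a too-long gap."""
    page_views, current = [], None
    for req in requests:
        starts_new_view = (
            req["is_html"]
            or current is None
            or req["time"] - current[-1]["time"] > MAX_GAP
        )
        if starts_new_view:
            if current:
                page_views.append(current)
            current = [req] if req["is_html"] else None  # stray non-HTML requests are discarded
        else:
            current.append(req)
    if current:
        page_views.append(current)
    return page_views
```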

4.2 Data cleaning of proxy-level data

The web usage data considered in this paper were extracted from the packet-level data obtained from the traffic of a proxy server, by means of packet sniffing techniques [10]. The network tool ngrep allowed us to specify regular expressions to match against the data payloads of HTTP requests contained in TCP packets addressed to port 80 of Web servers. This continuous flow of raw information has been filtered by means of a data cleaning module, which extracts only the data relevant to our objectives and re-organizes them in a server-log-like form. As an example, applying the data cleaning task to an entry of the raw data stream, we can obtain the following record:

2003/07/21 09:18:09, 131.114.3.xxx, http://www.xfce.org/index.html, http://www.xfce.org/en/download.html

which contains, in order: the time-stamp of the request, the client IP, the URI of the requested resource and the URI of the referrer, i.e., the resource from which the request originated.
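For illustration only, a cleaned record in this comma-separated form could be parsed with a few lines of Python (the record type and field names are our own, not part of the system described above):

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class CleanedRequest:
    timestamp: datetime
    client_ip: str
    uri: str       # the requested resource
    referrer: str  # the resource from which the request originated

def parse_record(line):
    """Parse one cleaned proxy record of the form 'timestamp, client_ip, uri, referrer'."""
    ts, ip, uri, ref = [field.strip() for field in line.strip().rstrip(".").split(", ")]
    return CleanedRequest(datetime.strptime(ts, "%Y/%m/%d %H:%M:%S"), ip, uri, ref)

rec = parse_record("2003/07/21 09:18:09, 131.114.3.xxx, "
                   "http://www.xfce.org/index.html, http://www.xfce.org/en/download.html")
assert rec.referrer == "http://www.xfce.org/en/download.html"
```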

4.3 Implementation of the analysis

All the components of the Traffic Graph – i.e., vertices, edges and traffic functions – can be obtained through a single scan of the database, by incrementally extending the graph with the new nodes and edges met during the scan, and by dynamically updating the traffic function for both nodes and edges. This process can clearly be performed in linear time w.r.t. the size of the database, i.e., w.r.t. the number of page views recorded. Moreover, the space requirements are limited to the size of the traffic graph: while, in principle, it could grow up to the size of the whole Internet, in practice it is expected to be much smaller than the portion of the web stored by the most common search engines. In fact, the traffic graph contains only the segment of the web which is actually visited by the users of the monitored web community, thus excluding all portions of the web which such users consider uninteresting. Moreover, the higher the abstraction level of the pages, the more compact the traffic graph, since several different pages collapse into a single abstract page.

Finally, we notice that, if we simplify the page view construction heuristics (e.g., by associating each HTML page with a different page view and discarding all the remaining HTTP requests), the overall algorithm requires processing each request only once, thus allowing a one-pass streaming computation. Such a feature is almost mandatory in any realistic application, since web traffic is huge, even when limited to a web community, and it would therefore be infeasible to store the whole continuous stream of data and process it more than once.

The relative traffic of each node p of the graph can be obtained by computing the Φ aggregate over the relative traffic of the edges connecting any other node to p. This computation can be performed incrementally during the construction of the traffic graph, by updating a few ad hoc temporary data structures. As a result, this step of the process can also be performed in linear time (w.r.t. the size of the database of page views) in a streaming fashion, as sketched below.
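A minimal sketch of this step is given below; for simplicity it derives the aggregates with one final pass over the (much smaller) edge set rather than strictly during the scan, which leaves the overall cost linear (the paper's own ad hoc incremental structures are not detailed, so this is an assumption of ours):

```python
def all_relative_traffic(edge_traffic, page_traffic, aggregate):
    """Compute R(p) (Definition 2) for every page with a single pass over the edges
    of the (possibly abstract) traffic graph built during the scan of the log."""
    incoming = {}  # page -> relative traffic values of its incoming transitions
    for (q, p), t_qp in edge_traffic.items():
        t_q = page_traffic.get(q, 0)
        if t_q == 0:
            continue  # transitions from pages with null incoming traffic are excluded
        incoming.setdefault(p, []).append(min(1.0, t_qp / t_q))
    return {p: aggregate(values) for p, values in incoming.items()}
```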

5. HEURISTICS FOR USELESS SITES DETECTION

In Section 3 two interesting tools for characterizing (and thus locating) particular Internet segments were presented: the traffic function and the relative traffic parameter. In particular, we remarked that the latter intuitively seems to provide a more reliable model for some specific categories of hosts, such as invasive and non-invasive advertising sites. In this section we show, by means of a few statistics, how well that intuition is supported by real data in the case of high traffic advertising, and then, from the analysis of these results, we derive a heuristic for detecting such sites. Finally, the section concludes with a few remarks on heuristics for characterizing low traffic advertising.

5.1 Comparing T(p) and R(p)

In order to assess the value of the traffic function and of the relative traffic parameter as discriminants between content and high traffic advertising sites, we analyzed their distributions over a segment of real traffic. The analysis was performed on the dataset described in detail in Section 6, which covers the traffic of an academic community over a short temporal window of about 73 minutes. Moreover, advertising companies usually carry out their advertising campaigns by means of dedicated hosts, which makes the host level the most appropriate page abstraction, and so we performed our analysis setting Ψ = {hosts} and α(p) = “host of p”. Since we are interested in comparing the behaviour of the T() and R() functions over high traffic Internet segments, we restricted our analysis to the most visited hosts in our dataset. In particular, we manually labelled them either as content or as advertising/counters, and selected the 200 most visited hosts of each category.

In Figure 2 we plot the distribution of the traffic function T() for content hosts and for advertising/counters by means of two distinct curves. Analogously, Figure 3 depicts the distribution of the relative traffic R() for the same two categories of hosts.

[Figure 2: Distribution of Sites vs. Traffic (bins of size 10) – curves for content sites and advertising sites.]

[Figure 3: Distribution of Sites vs. Relative traffic (bins of size 0.1) – curves for content sites and advertising sites.]

As we can see, the T() function has a similar distribution on both the content hosts and the advertising/counters. Therefore, the two categories of hosts cannot be significantly characterized by means of the traffic function. On the contrary, the two curves for the R() function have very different behaviours. Indeed, content hosts are significantly predominant for low values of relative traffic, while advertising/counters prevail for high values. The relative traffic function, therefore, appears to be a good candidate criterion for the design of heuristics aimed at distinguishing the two types of hosts.

5.2 A Heuristic for high traffic ads/counters

In using the framework described in Section 3 to design any heuristic, the first step is the choice of a suitable abstraction level for the computation of the abstract traffic graph. As noticed in the previous sections, advertising campaigns and counter services are typically carried out by means of dedicated hosts. Therefore, the most appropriate granularity appears to be the host level. As shown in Figure 3, the presence of advertising hosts and counters is particularly dense at high values of the R() function. From this observation we draw the following hypothesis: the higher R(h) is for a host h, the higher the probability that h contains advertising or counters. Therefore, we can use the R() values to rank hosts, selecting the top N of them, with N being a parameter of the method. An additional peculiarity of the hosts we want to spot is their link popularity: in fact, advertising services are usually delivered as impressions on several different sites. As a consequence, from a web link structure perspective, several sites link to the same advertising host. Therefore, any reasonable candidate advertising/counter host should be linked by at least n ≥ 1 other hosts, n being another parameter of our heuristic. Such a constraint also greatly reduces the risk of misclassification due to sporadic cases of pages containing only one link (e.g., redirections to moved pages). The above observations are summarized in Algorithm 1, which extracts a set of potential advertising/counter hosts from a dataset of web requests.

Algorithm 1: AdvertisingHosts(S, n, N, Φ)
Input: a sequence S of web requests; two integers n and N; an aggregation operator Φ ∈ {min, max, avg}.
Output: a list of hosts.
  1. Build the Traffic Graph TG from S;
  2. Compute the Abstract Traffic Graph TG' = (V', E', T') from TG using the abstraction A = (Ψ, α), where Ψ = {hosts} and α(p) = “host of p”;
  3. Let H = {h ∈ V' | h is linked by at least n other hosts};
  4. For each h ∈ H, compute R(h);
  5. Let O = the list of all h ∈ H, sorted by R(h) in descending order;
  6. Return O[1 : N].

We notice that the first four steps of the algorithm, as already remarked in Section 4, are linear w.r.t. the number of processed requests, i.e., |S|. The last two steps extract the top N elements of H w.r.t. the R() value, which has an O(|H| log N) computational complexity. This cost is expected to be quite low, thanks to the high level of abstraction adopted, which yields a small |H|, and to the fact that typically N ≪ |H|.
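As a rough illustration, Algorithm 1 could be realised on top of the functions sketched in Sections 3 and 4 (TrafficGraph, abstract_traffic_graph, all_relative_traffic) as follows; the function name and the parameter handling are our own assumptions, not the authors' implementation:

```python
def advertising_hosts(edge_traffic, page_traffic, n, N, aggregate):
    """Sketch of Algorithm 1: abstract the traffic graph to the host level, keep hosts
    linked by at least n distinct other hosts, rank them by R(h), return the top N."""
    # Steps 1-2: host-level abstract traffic graph (abstract_traffic_graph, Section 3 sketch).
    abs_edges, abs_pages = abstract_traffic_graph(edge_traffic, page_traffic)

    # Step 3: candidate hosts linked by at least n distinct other hosts.
    in_links = {}
    for (hq, hp) in abs_edges:
        if hq != hp:
            in_links.setdefault(hp, set()).add(hq)
    candidates = {h for h, links in in_links.items() if len(links) >= n}

    # Steps 4-6: relative traffic per host (all_relative_traffic, Section 4 sketch), rank, cut.
    R = all_relative_traffic(abs_edges, abs_pages, aggregate)
    ranked = sorted((h for h in candidates if h in R), key=R.get, reverse=True)
    return ranked[:N]
```

With the instantiation used in the experiments of Section 6, a call could look like advertising_hosts(tg.edge_traffic, tg.page_traffic, n=5, N=50, aggregate=min), assuming tg is a traffic graph built as sketched in Section 3.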


5.3 Towards heuristics for low traffic ads

As mentioned in the previous sections, the R() function intuitively also provides the means for designing a good model for non-invasive, and therefore low-traffic, advertising. In fact, recent marketing statistics on the effectiveness of non-invasive advertising [13] clearly suggest that this category of sites should be characterized by a very low relative traffic, which makes them distinguishable from other categories. However, such sites produce little traffic, and so it is not possible to extract significant statistics about them without a large dataset covering an adequately large time window. Due to the limits of the dataset at our disposal, we must postpone any exhaustive analysis of the subject to future work.

6. EXPERIMENTS

In this section we report the empirical results obtained from some preliminary experiments performed on our available dataset, already used in Section 5 to compute some statistics. The data were collected from the proxy server SERRA, which handles the whole Internet traffic from the Pisa academic institutions to the rest of the (connected) world. At present, only a small segment of traffic was available, corresponding to a temporal window of about 73 minutes, from 9.17 a.m. to 10.30 a.m. of Monday, July 21st, 2003. We cleaned 1.4 GB of raw data into a total of 876,975 HTTP requests, grouped into 82,056 page views. In our experiments we instantiated Algorithm 1 in the following way. First of all, the parameter n, which represents the minimum number of hosts linking to the host under consideration, was empirically set to n = 5 after a few tuning experiments. Then, all three suggested choices for the Φ aggregation operator were explored, and several different values of the N parameter – i.e., the number of hosts returned by the algorithm – were tried, comparing the results. To evaluate the results of the experiments, we manually analyzed all the hosts returned by the algorithm and assigned each of them to one of the two categories mentioned throughout the paper: content sites and advertising/counter sites. That is, for the purpose of evaluating our heuristics, advertising sites, counters and statistics sites are grouped together into a single class.

  N      Φ = min   Φ = avg   Φ = max
  10      100%       90%       90%
  20      100%       95%       95%
  30      100%       90%       87%
  40      100%       92%       87%
  50       96%       92%       82%
  60       97%       93%       83%
  70       87%       90%       80%
  80       79%       86%       74%
  90       71%       78%       75%
  100      68%       72%       70%

Table 1: Precision of results for different Φ and N.

In Table 1 we report the results obtained by applying our heuristic with different instantiations of Φ and different values of N. For each set of hosts returned by the algorithm, we computed its precision as the percentage of hosts which really were advertising or counters. We notice that:

• As expected, the max aggregation operator yields in general a lower precision, since its optimistic nature tends to assign high relative traffic values too easily.

• The opposite choice, min, on the contrary, yields a very high precision, although it begins to degrade quickly for values of N greater than 60.

• The average operator, finally, represents a trade-off in terms of precision, and seems to yield the most stable results among the three choices explored in these experiments.

• Finally, in general, the precision falls off quite quickly, and good results can be obtained only for values of N up to around 70. The main reason for this, in our opinion, is the small size of the dataset used in the experiments, which makes the computed values of the R() function reliable only for a limited number of pages.
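For completeness, the precision values in Table 1 correspond to a computation of the following kind (a trivial sketch of ours; the content/advertising labels were assigned manually):

```python
def precision(returned_hosts, labelled_ads):
    """Fraction of the returned hosts that were manually labelled as advertising/counters."""
    hits = sum(1 for h in returned_hosts if h in labelled_ads)
    return hits / len(returned_hosts) if returned_hosts else 0.0

# e.g.: precision(advertising_hosts(tg.edge_traffic, tg.page_traffic, 5, 10, min), ads_labels)
```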

7. CONCLUSION AND FUTURE WORK

We have presented how the task of discovering uninteresting web sites, such as advertising sites, can be effectively based on traffic information. Our experiments, albeit preliminary and conducted on limited streams of usage data, show promising precision in detecting the intended sites. Clearly, precision and effectiveness are expected to improve monotonically as the system is used, since the traffic graph is incrementally grown through the progressive analysis of the continuous stream of usage data. On the basis of these results, we are encouraged to pursue this direction of characterizing web sites through their usage patterns, by further investigating the important problems that arise:

• a thorough characterization of the site/page categories which are usage-definable is needed;

• a systematic study of how the proposed method can be made efficiently incremental, to cope with massive, continuous streams of usage data;

• how usage analysis can be combined with content/structure analysis, web mining and graph mining techniques to yield better characterizations of web sites.


8. REFERENCES

[1] Z. Bar-Yossef and S. Rajagopalan. Template detection via data mining and its applications. In Proceedings of the 11th International WWW Conference, pages 580–591, 2002.
[2] S. Chakrabarti, M. van den Berg, and B. Dom. Focused crawling: a new approach to topic-specific Web resource discovery. Computer Networks (Amsterdam, Netherlands: 1999), 31(11–16):1623–1640, 1999.
[3] J. Cho, H. Garcia-Molina, and L. Page. Efficient crawling through URL ordering. In Proceedings of the 7th International WWW Conference, 1998.
[4] R. Cooley, B. Mobasher, and J. Srivastava. Data preparation for mining world wide web browsing patterns. Knowledge and Information Systems, 1(1):5–32, 1999.
[5] E. Amitay et al. The connectivity sonar: detecting site functionality by structural patterns. In Proceedings of the Fourteenth ACM Conference on Hypertext and Hypermedia, pages 38–47. ACM Press, 2003.
[6] G.-R. Xue et al. Implicit link analysis for small web search. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 56–63. ACM Press, 2003.
[7] J. Wang et al. Ranking user's relevance to a topic through link analysis on web logs. In Proceedings of the Fourth International Workshop on Web Information and Data Management. ACM Press, 2002.
[8] K. Bharat et al. Who links to whom: Mining linkage between web sites. In IEEE International Conference on Data Mining (ICDM '01), 2001.
[9] K. Stamatakis et al. Domain-specific web site identification: The CROSSMARC focused web crawler. In Proceedings of the Second International Workshop on Web Document Analysis (WDA 2003), 2003.
[10] A. Feldmann. Continuous online extraction of HTTP traces from packet traces (position paper). In W3C Web Characterization Group Workshop, 1998.
[11] F. Menczer. Lexical and semantic clustering by web links. Journal of the American Society for Information Science and Technology. To appear.
[12] M. Spiliopoulou and L. C. Faulstich. WUM: a Web Utilization Miner. In Workshop on the Web and Data Bases (WebDB98), pages 109–115, 1998.
[13] DoubleClick Q3 2003 Advertising Serving Trends. www.doubleclick.net, 2003.
