
Efficiency Analysis of Brokers in the Electronic Marketplace

Author: unknown. Posted: 2004-11-28.

Introduction

The Internet and the World Wide Web provide a global virtual marketplace without location or time constraints. The electronic market enabled by the Web's almost universal communication system is well suited to information-based products (e.g., news, software, financial services, ticketing services) and also to order retailing of some non-digital products such as books, CDs, flowers, travel, groceries, and PCs, among others. E-commerce companies usually publish electronic catalogs on the Internet that support lists of products and/or services, price information, and commercial transactions. As a consequence, the amount of available information and the number of potential customers on the Web are growing very rapidly [LG98].

Though useful information may exist somewhere, it is not always easy for users to find what they are looking for on the Web. Since the Web is large and growing exponentially, it is impractical to browse it exhaustively looking for products and services. Therefore, one of the biggest challenges faced by electronic customers is information overload, which hampers the growth of the online buying process. Although there are several different models for representing e-customer behavior, most of them share some basic steps [Bak97]: need identification, product search, merchant search, negotiation, purchase and delivery, product service and product evaluation. In order to boost e-commerce activities, tools and services are needed to help customers in each of these basic steps.


As a result, e-brokers have been developed to help users find information, products, and merchants. A broker is a party that mediates between buyers and sellers in a marketplace [SWB95]. E-brokers can search for products, retrieving information that helps a customer determine what to buy. E-brokers can also look for merchant-specific information (e.g., price) to help a customer decide whom to buy from. Basically, e-brokers can be viewed as search engines that specialize in specific topics. For example, a bargain broker searches the Web for the prices and characteristics of products, summarizes the results, and presents them to the user. In another example, a broker could search the e-catalogs of the many suppliers registered with it and try to match product specifications and negotiation requirements.


The Miner Family of Web Agents [Min] is a set of tools whose main objective is to help people find information on the Web. The main idea is to bring multiple search and information sources together in one place. The searching is performed by agents working in parallel, just like metasearchers [SE95,Dre], which use several search engines simultaneously, collecting answers and unifying them. The information may be the price of a book, a new musical release, freeware or shareware software, daily news, or any document available on the Web.

A portal is a site that brings together a variety of content and services in one area and attracts a large number of visitors. The idea is to become the single best starting place for as many users as possible. In Brazil, the largest Web site is UOL [Uni], which is shaping itself as a portal. UOL acts as both a content and a service provider, offering more than 53 Brazilian magazines, 21 international magazines, 59 Brazilian newspapers, and 31 international newspapers. UOL also offers several services, including hundreds of chat rooms that have topped 12,000 simultaneous users, more than 400 product sites, and RadarUOL, a search engine powered by Inktomi [Ink]. UOL has topped 12 million page views in one day, and it is one of the largest non-English content providers and the largest Portuguese-language content provider in the world. The Miner Family has a partnership deal with UOL and is one of the services offered at the UOL site. The Miner Family has topped 100 thousand page views in one day. This rich environment provided us with the data used in this paper.


The goal of the paper is threefold. First, we give an overview of the Miner Family architecture, implementation, and workload characteristics, and point out the differences from existing similar services (e.g., Express, Jango, and Junglee). Second, we present a quantitative study of the behavior of two large non-English e-brokers. Considering that the e-brokers are used by a large number of people who mainly speak Portuguese and live in Brazil, we discuss the influence of regional and cultural issues on e-commerce activities. Third, we present a case study that analyzes the efficiency of the results provided by e-brokers in the e-marketplace.

The paper is organized as follows. Section 2 presents the architecture of the Miner Family and discusses its design rationale, components, and overall workload. Section 3 characterizes the workload of two brokerage services of the Miner Family (BookMiner and CDMiner). Section 4 presents a click-through model based on data collected from the operation of the e-broker BookMiner. We present figures that indicate the level of activity of the e-broker and show a model of customer behavior. To conclude, Section 5 points out evidence of the influence of cultural issues (language in particular), brand, and regional factors on the quantitative results presented in the paper.


Architecture and Workload of the Miner Family

The Miner Family of Web Agents [Min] is both a searching utility and an electronic catalog that also provides brokerage services. The Miner Family was developed mainly for Portuguese language-based services. The search utility services provided by the Miner Family at the time the paper was written include: (1) MetaMiner, a metasearch engine that uses Brazilian and international search engines; (2) DoctorMiner, which searches for information on several sites containing medical and dental articles; (3) SoftMiner, which searches for software on freeware and shareware sites; (4) the just-released JavaMiner, which searches for technical information about the Java language; and (5) PeopleMiner, which searches for people on the Internet. The search engine service includes (6) NewsMiner, which collects news from Brazilian newspapers and makes it available daily to the Internet community. Brokerage services include: (7) BookMiner, which searches registered Brazilian and international bookstores for books that match the user's specification, and (8) CDMiner, which searches Brazilian and international musicstores for musical titles that match the user's preferences. Table 1 describes each member in terms of its target (e.g., search engines, stores) and its number of registered sites.

 

   
Table 1: Members of the Miner Family
Member        Target                                  #Sites
MetaMiner     search engines                          13
DoctorMiner   medical and odontological information   17
NewsMiner     newspapers                              13
BookMiner     bookstores                              16
CDMiner       musicstores                             13
SoftMiner     software                                10
PeopleMiner   people                                  13
JavaMiner     Java language information               7


The Miner Family was coded in Java and comprises about 23,000 lines of code that run on a Netscape Enterprise Server; the host platform is a SUN Ultra running Solaris 2.6. The code was implemented with an emphasis on reusability and ease of maintenance and is structured into four levels: (1) general library, (2) middleware (e-commerce, search utilities, and search engines), (3) agents, and (4) user interface. Figure 1 depicts these levels, which are explained in detail in the next paragraphs.


  
Figure 1: Structure of the Miner Family code

 


The general library contains several functions that are used by the upper levels, such as handlers (HTTP, cookies, tickets), query caching (for breaking results among pages), data fusion, and interface widgets. It corresponds to 25% of the Miner code. The functions and primitives for each of the types of services offered by the Family are implemented in the middleware level, and each of the three services comprises about 2,000 lines of code. The e-commerce code contains classes that abstract the characteristics of goods and interface with the stores that sell them. Similarly, the search utilities code contains functions that handle searches in each of the types of sites (software, people, and general) and the respective object classes. The search engines code implements procedures that follow the ethics of bots [Che96,Kos96], an information manager, a connection handler, and the bots' navigation control. The agents responsible for querying the various sites comprise 3,000 lines of code in total. Among other tasks, these classes store details about site handling, data filtering, and the structure of HTML data. Finally, the interface code (7,000 lines) implements all the HTML forms for the queries and the formatting of their results.

With this structure, implementing new Family members becomes trivial. A new search utility that queries ten different sites would require only about 500 new lines of code. As an example, the implementation of the newest member, JavaMiner, cost 16 man-hours and was made available less than a week after conception.

All members of the Miner Family work similarly and the main steps to answer a query are depicted in Figure 2. Each query task can be divided into five main steps, as follows: (1) a user submits a query; (2) the Miner server gets the query and dispatches its agents; (3) each agent queries its target engine, store, or site; (4) each agent receives and parses the query results; and (5) the server unifies, formats, and sends the results back to the user.
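The five steps above can be illustrated with a minimal sketch. It is written in Python for brevity, although the Miner Family itself was coded in Java, and it is not the authors' implementation. The function name `answer_query`, the `agents` mapping, and the `(title, url)` answer format are assumptions introduced only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def answer_query(query, agents):
    """Dispatch `query` to all agents in parallel and unify the answers.

    `agents` maps a site name to a callable that takes the query string and
    returns a list of (title, url) tuples. These callables are hypothetical
    stand-ins for the real per-site agents.
    """
    unified = {}  # url -> merged entry
    failed = []   # sites whose agent raised an error
    with ThreadPoolExecutor(max_workers=max(1, len(agents))) as pool:
        # Step 2: the server dispatches one task per agent.
        futures = [(site, pool.submit(agent, query))
                   for site, agent in agents.items()]
        for site, future in futures:
            try:
                # Steps 3-4: the agent queried its site and parsed the reply.
                answers = future.result()
            except Exception:
                failed.append(site)
                continue
            for title, url in answers:
                entry = unified.setdefault(
                    url, {"title": title, "url": url, "sites": []})
                entry["sites"].append(site)
    # Step 5: one merged list with a single entry per distinct URL.
    return list(unified.values()), failed
```

Results are collected in submission order, so the merged list is deterministic even though the agents run concurrently.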


  
Figure 2: Miner Family functionality

 

Workload Characterization of the Miner Family

This section presents a workload characterization of the Miner Family. We start the analysis by partitioning the overall workload according to the services provided by the Miner Family. Table 2 shows the data extracted from logs of a four-week period of usage of the Miner services. The daily average number of requests was 22,086. We divided the data into three categories: (1) request frequency, (2) request characteristics, and (3) hourly distribution. Request frequency represents the percentage of requests addressed to each service. We note that MetaMiner is the most popular service, receiving almost 90% of the total requests. Three other metrics were defined to further characterize the request workload: (1) words per query, (2) match ratio, and (3) answers per query. Words per query quantifies the complexity of the request, which averages about two words. For instance, 95% of the requests to CDMiner have fewer than four words.


  
Table 2: Overall Workload Statistics
Miner             Meta      Book      CD         Soft       News      Doctor
Queries (%)       89.15     2.60      2.65       2.34       1.89      1.37
Words/Query       1.98      2.05      1.87       1.55       1.66      1.69
Match Ratio (%)   93.64     75.65     79.53      88.00      55.60     95.81
Answers/Query     53.97     42.40     41.06      63.74      11.05     47.78
Peak Period       7am-9pm   7am-9pm   11am-7pm   8am-11pm   5am-5pm   8am-10pm
Peak Hour         1pm       1pm       2pm        1pm        7am       8pm
Peak Ratio        2.29      7.52      6.41       7.50       13.12     9.37


The match ratio represents the percentage of requests that returned at least one URL. A high match ratio can result from two different scenarios. The first involves services that have broad coverage (e.g., MetaMiner) and provide answers for most queries (although we cannot quantify how meaningful the answers are). The second involves services that are so specialized that the queries are very constrained (e.g., SoftMiner and DoctorMiner). Similar conclusions arise when we look at the average number of answers per query.
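As an illustration of the three request-characteristic metrics, the sketch below computes them from a simplified log. The log format, a list of `(query_text, number_of_answers)` pairs, and the function name are assumptions. The sketch also assumes that answers per query is averaged over all queries, including those with no answer, which the text does not state.

```python
def request_characteristics(log):
    """Return (words per query, match ratio in %, answers per query).

    `log` is a list of (query_text, number_of_answers) pairs; this layout
    is a hypothetical simplification of the real Miner logs.
    """
    if not log:
        raise ValueError("log must contain at least one request")
    n = len(log)
    # Words per query: whitespace-separated terms, averaged over requests.
    words_per_query = sum(len(query.split()) for query, _ in log) / n
    # Match ratio: share of requests that returned at least one answer.
    match_ratio = 100.0 * sum(1 for _, answers in log if answers > 0) / n
    # Answers per query: averaged over all requests (an assumption).
    answers_per_query = sum(answers for _, answers in log) / n
    return words_per_query, match_ratio, answers_per_query
```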


Regarding hourly distribution, we consider three characteristics of the workload: peak period, peak hour, and peak/average ratio. Peak period represents the hours during which the number of requests is higher than the daily average. As we can see in Table 2, this information uncovers an interesting characteristic of Miner users, who usually query information during working hours, probably using a non-modem connection. The peak hour is the time slot when the maximum number of requests was observed. In all cases but two, the peak hour falls during lunch time in Brazil. One of the exceptions is the NewsMiner service, whose peak is around 7:00am, when users log on to get the daily and breaking news. The other is DoctorMiner, whose peak hour is around 8:00pm, when health professionals are usually free to look for medical information. Finally, peak ratio measures the request rate at the peak hour over the average rate [MA98]. Specific services such as BookMiner and NewsMiner are burstier than generic search services like MetaMiner: their peaks are about 7 and 13 times their averages, respectively, while the MetaMiner peak ratio is only 2.29.
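The three hourly metrics can be sketched as follows. The input, a list of 24 request counts indexed by hour of day, and the function name are assumptions. The "daily average" is interpreted here as the mean number of requests per hour, which is one plausible reading of the definition.

```python
def hourly_profile(counts):
    """Return (peak period, peak hour, peak ratio) for one service.

    `counts` is a list of 24 request counts, where index h holds the
    number of requests observed during hour h (a hypothetical layout).
    """
    if len(counts) != 24:
        raise ValueError("expected one count per hour of the day")
    average = sum(counts) / 24
    if average == 0:
        raise ValueError("no requests recorded")
    # Peak hour: the slot with the maximum number of requests.
    peak_hour = max(range(24), key=lambda h: counts[h])
    # Peak period: the hours whose load exceeds the hourly average.
    peak_period = [h for h in range(24) if counts[h] > average]
    # Peak ratio: request rate at the peak hour over the average rate.
    peak_ratio = counts[peak_hour] / average
    return peak_period, peak_hour, peak_ratio
```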


Related Work

Several related services exist. Excite has a shopping guide for finding products and prices on the Web, called Product Finder and powered by Jango [Jan]. Junglee [Jun] has developed a technology that aggregates information and prices for merchandise sold on the Web, enabling consumers to compare and shop for online products. This technology is now used by Yahoo [Yaha,Yahb]. More recently, Infoseek announced Express [Exp], which uses many search engines to search for products.

Table 3 presents the main characteristics of the three technologies mentioned above and the Miner Family. The first row shows the number of bookstores used by Yahoo.Junglee, Infoseek.Express, and BookMiner. In the case of Yahoo.Junglee, the number was estimated from the queries submitted, as the service does not list the actual bookstores. Infoseek.Express does not search all five bookstores or musicstores in parallel, but one at a time, so we could not include it in the experiment whose results are shown in Table 4. Of the 16 bookstores listed in BookMiner, 8 are Brazilian. The second row presents the number of musicstores provided by Yahoo.Junglee and CDMiner; the value for Yahoo.Junglee is again estimated from the queries because the stores are not listed. Of the 13 musicstores listed in CDMiner, 5 are Brazilian. The third row presents the number of sites searched for software (freeware and shareware). Again, Infoseek.Express searches its 5 software sites one at a time. The fourth row presents the number of search engines and directories used by Infoseek.Express and MetaMiner. Of the 13 engines used by MetaMiner, 5 are Brazilian. The fifth row shows that only the Infoseek.Express search tools do not perform requests in parallel. Finally, the last row shows the tools that allow users to choose the sites to be queried.


 
Table 3: Characteristics of the search tools
Characteristics          Technologies
                         Junglee   Jango      Express      Miner
                         (Yahoo)   (Excite)   (Infoseek)
Bookstores               6†        -          5            16 (8 Brazilian)
Musicstores              4†        -          5            13 (5 Brazilian)
Software                 -         10         5            10
Metasearch engines       -         -          7            13 (5 Brazilian)
Parallel search          yes       yes        no           yes
Where to search option   no        no         yes          yes
† Estimated



Table 4 presents eight different queries submitted to Yahoo.Junglee, BookMiner, and CDMiner. The first five queries search for books. The first two are titles published in the US. The following three are authors: one American (the writer Tom Wolfe), one Portuguese (the poet Fernando Pessoa), and one Brazilian (the writer Jorge Amado). The next two queries search for CDs by one American artist (the jazz singer Ella Fitzgerald) and one Brazilian artist (the bossa nova singer João Gilberto). The last query searches for the soundtrack of the 1985 movie Subway, which was found in only one Brazilian musicstore at that time. The first two columns of the table present the number of answers returned by Junglee and Miner, respectively. The last column (Common) presents the number of answers that appeared in the results returned by both tools. The larger number of answers returned by the Miner Family in most queries comes from its larger number of registered sites. For queries involving Brazilian and Portuguese names, the differences are even larger because of the influence of language.


  
Table 4: Different types of queries submitted to Yahoo.Junglee and the equivalent Miner tools (BookMiner and CDMiner)
Queries                       Answers
                              Junglee   Miner   Common
Sphere (by title)             75        261     65
Jurassic Park (by title)      71        106     58
Tom Wolfe (by author)         77        46      40
Fernando Pessoa (by author)   30        160     27
Jorge Amado (by author)       39        225     35
Ella Fitzgerald (by artist)   42        161     20
Joao Gilberto (by artist)     28        76      11
Subway (by title)                       1

 



Workload of the Two E-Brokers

This section analyzes the workload of two brokerage services, namely BookMiner and CDMiner. The goal is to study the actual workload generated by customers searching for books and CDs in global and Brazilian electronic stores. The characterization is based on data collected from two logs, corresponding to a four-week period. The first log shows overall results of the broker activities, while the second one provides per-store information. IP addresses were masked in order to protect users' privacy. We merged the two logs based on time, date, and masked IP address. As a result, the merged logs provide the following information: date and time of the request, query keyword(s), type of query (title or author), request response time, overall number of titles or CDs returned to the user, response time for each store, and number of titles or CDs returned by each store.
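The merge step described above can be sketched as a join on the request timestamp and the masked IP address. The record layout and all field names (`timestamp`, `ip`, `store`, `response_time`, `titles`) are hypothetical, since the paper does not give the log format.

```python
from collections import defaultdict


def merge_logs(overall, per_store):
    """Attach per-store rows to the overall request they belong to.

    Both logs are lists of dicts. Rows are matched on the pair
    (timestamp, masked IP); all field names are illustrative assumptions.
    """
    by_key = defaultdict(list)
    for row in per_store:
        by_key[(row["timestamp"], row["ip"])].append({
            "store": row["store"],
            "response_time": row["response_time"],
            "titles": row["titles"],
        })
    merged = []
    for row in overall:
        record = dict(row)  # copy, so the input log is left untouched
        record["stores"] = by_key.get((row["timestamp"], row["ip"]), [])
        merged.append(record)
    return merged
```

A request with no matching per-store rows keeps an empty `stores` list rather than being dropped, so the overall request count is preserved.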


The broker workload is described by a graph called a Customer Preference Graph (CPG). This graph has one node for each service and one for each registered store of the broker. The transitions between the nodes represent the percentage of customers that followed a specific path, i.e., service, national domain, and store. Figure 3 shows the CPG for the BookMiner brokerage service. For each registered bookstore, we measured the click-through frequency, given a BookMiner response. The click-through determines which bookstore was chosen by the user. The percentage associated with each path of the BookMiner graph represents the click-through frequency. From the CPG of Figure 3, we note that 76% of the customers prefer Brazilian bookstores. Among the global bookstores, Amazon.com was chosen by most of the users (50%), followed by Barnes & Noble and BookStacks. Siciliano, a Brazilian bookstore, is responsible for one fourth of the click-throughs among the Brazilian bookstores, followed by Cultura, Booknet, and Loyola.
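The transition percentages of a CPG can be derived from the click-through log as sketched below. The function name and the inputs (a list of clicked store names and a store-to-domain mapping) are assumptions. Store percentages are computed relative to the store's own domain, which matches statements such as "50% among the global bookstores".

```python
from collections import Counter


def customer_preference_graph(clicks, domain_of):
    """Return (domain percentages, store percentages within each domain).

    `clicks` lists the store chosen in each click-through; `domain_of`
    maps every store to its domain label (e.g., "brazilian" or "global").
    Both inputs are hypothetical simplifications of the real logs.
    """
    store_counts = Counter(clicks)
    domain_counts = Counter(domain_of[store] for store in clicks)
    total = len(clicks)
    # First-level transitions: service -> national domain.
    domain_pct = {d: 100.0 * c / total for d, c in domain_counts.items()}
    # Second-level transitions: domain -> store.
    store_pct = {s: 100.0 * c / domain_counts[domain_of[s]]
                 for s, c in store_counts.items()}
    return domain_pct, store_pct
```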

On the other hand, the CPG of CDMiner (Figure 4) shows a different customer profile. The percentages of users that visit global and Brazilian stores are about the same. We conjecture that this behavior can be explained by the following observations: (1) international music has wider acceptance than international literature in Brazil, and (2) no Brazilian musicstore allows consumers to listen to tracks from CDs before buying. According to [Nie99], customers of stores that sell music CDs want more music samples, ease of use, and low prices. In addition to the fact that Brazilian musicstores do not offer samples, the tax on imported CDs may explain why customers visit international stores but do not buy the products. Books, in contrast, are not subject to import tax. Among the global musicstores, BlockBuster was chosen by 26.77% of the users, followed closely by CDNow and Amazon.com. This is remarkable, since CDNow is better known than BlockBuster as an electronic musicstore. We conjecture that two factors led shoppers to visit BlockBuster more frequently: (1) it does not return the prices of the CDs, which somehow forces customers to visit its site, and (2) the average response time of CDNow is four times that of BlockBuster, as we discuss later. CDStudio, a Brazilian musicstore, is responsible for almost one third of the click-throughs among the Brazilian musicstores, followed by Ferrs. Again we can observe how specialization affects user preferences: VanDamme, which sells only new age CDs, got only 1.21% of the click-throughs.

 

  
Figure 3: BookMiner Customer Behavior Graph

Figure 4: CDMiner Customer Behavior Graph

E-commerce service levels are usually assessed through response time and availability. For the purpose of our analysis, a server is considered available when it answers the request within the user-defined timeout (60 seconds by default). Elapsed request response time is the interval of time needed to receive a response from the server. Tables 5 and 6 show the availability and elapsed response time of the registered stores that are queried by BookMiner and CDMiner, respectively. We note that almost all stores exhibit a good level of availability no matter where they are located. On the other hand, the same tables show a high variance in response times. The average elapsed response time of national stores is lower than that of international stores. We conjecture that this phenomenon is a consequence of the heavy traffic on the international links between Brazil and the US. One remarkable exception is BlockBuster, which answers as fast as any Brazilian musicstore.
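The two service-level metrics can be computed per store as in the sketch below. The request format, `(store, elapsed_seconds)` pairs with `None` for requests that never returned, is an assumption, as is the choice to average response time over answered requests only. The 60-second default timeout comes from the text.

```python
def service_levels(requests, timeout=60.0):
    """Return {store: (availability in %, mean response time or None)}.

    `requests` is a list of (store, elapsed_seconds) pairs, with elapsed
    set to None when no answer arrived (a hypothetical log layout).
    """
    stats = {}
    for store, elapsed in requests:
        s = stats.setdefault(store, {"total": 0, "answered": 0, "time": 0.0})
        s["total"] += 1
        # A store is available for a request if it answers within the timeout.
        if elapsed is not None and elapsed <= timeout:
            s["answered"] += 1
            s["time"] += elapsed
    levels = {}
    for store, s in stats.items():
        availability = 100.0 * s["answered"] / s["total"]
        # Mean over answered requests only (an assumption of this sketch).
        mean_time = s["time"] / s["answered"] if s["answered"] else None
        levels[store] = (availability, mean_time)
    return levels
```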

It is worthwhile to look at the influence of those metrics on the measurements we obtained for the brokers. For example, the Siciliano bookstore does not exhibit a good service level. It has the lowest availability among the Brazilian bookstores (23.88% of the queries timed out, as shown in Table 5). However, Siciliano is the Brazilian bookstore that attracted the largest portion of the Brazilian customer community. This apparent contradiction can be explained by the influence of the "brand" on shoppers. Siciliano is a well-established company with many bookstores in the main cities of Brazil, which makes the company familiar to customers, even on the Internet.

 


 
Table 5: BookMiner Performance Results
Bookstore        Availability      Response      Book
                 (% of requests)   Time (sec.)   Hit Ratio
Barnes & Noble   95.55             25.4          18.5%
Bookstacks       84.75             8.1           22.0%
BookPool         99.50             10.4          4.7%
McGraw Hill      99.20             28.0          4.3%
O'Reilly         100.00            12.7          4.6%
Prentice Hall    100.00            7.1           7.2%
iBS              100.00            17.1          13.9%
Amazon           99.23             13.0          19.1%
Booknet          98.27             12.5          49.3%
Campus           100.00            2.0           7.2%
Cultura          100.00            14.3          33.6%
Siciliano        76.12             24.8          69.4%
Sodiler          100.00            11.4          38.4%
Tempo Real       100.00            12.5          11.5%
Loyola           100.00            8.9           56.0%
artepaubrasil    100.00            8.9           55.7%
BookMiner        100.0             48.5


We define another metric called "book hit ratio" (BHR), which represents the number of times that a bookstore suggests at least one title in response to a customer request over the total number of requests sent to the bookstore. Looking at Table 5, it is evident that Brazilian bookstores are more effective at finding the books requested by Brazilian customers in their selection. The BHR of Brazilian bookstores is higher than the BHR of the global bookstores. This fact stems from cultural factors such as English proficiency and local interests. Around 50% of Brazilian Internet users do not know English. Also, Brazilian bookstores have a much larger selection of books on topics that are part of Brazilian culture [ACR+98] than global bookstores do.

Regarding musicstores, we define a similar metric called "CD hit ratio" (CHR), which represents the percentage of requests for which a musicstore suggests at least one CD. Looking at Table 6, we note that all Brazilian musicstores but VanDamme (which specializes in new age) presented a higher CHR than international musicstores, which is explained by requests for Brazilian artists. However, notice that the CHRs are smaller than the BHRs for national stores, confirming the smaller influence of language issues on music preferences.

 

   
Table 6: CDMiner Performance Results
Musicstore       Availability      Response      CD
                 (% of requests)   Time (sec.)   Hit Ratio
Amazon           98.73             17.271        22.22%
AudioHouse       100.00            11.259        8.31%
BlockBuster      100.00            6.763         22.14%
CDUniverse       97.47             34.994        23.11%
CDNow            98.95             20.099        33.44%
MassMusic        94.86             41.356        26.26%
MusicBoulevard   97.27             18.550        16.52%
Ferrs            100.00            5.124         57.55%
PlanetMusic      100.00            4.850         31.89%
VanDamme         100.00            1.955         5.07%
CDStudio         100.00            6.824         44.73%
CDMiner          100.00            41.698


A final observation regards the percentage of click-throughs that turned into sales. We analyzed sales reports from three stores (Amazon, CDNow, and Booknet) and found that 8% of the click-throughs became book sales and 3% of the click-throughs turned into CD sales. This behavior has been observed before [Kra98]: consumers are less likely to buy CDs than books in electronic stores. Recently, [Nie99] presented similar results, stating that only 5% of the visits to e-commerce sites are to buy. These results show that brokers are more efficient than some advertisement mechanisms, such as banners (whose estimated click-through ratio is only 1%).

Case Study: Efficiency of a Non-English E-Broker

The efficiency of the results provided by a broker can be assessed by the percentage of customers that is driven to each of the registered stores. In Section 3, we saw that 8% of the BookMiner click-throughs turned into book sales. Now, we want to answer the following question: what motivated customers to shop at the stores pointed out by the brokers? Intuitively, we would expect the percentage of click-throughs for a given store to be proportional to its book hit ratio and availability, and inversely proportional to its response time. However, the data obtained from the logs tell a different story. We did a regression analysis on the data presented in Section 3 and found that average response time is not correlated with click-through. We then did other correlation tests and found that the number of click-throughs for each store is strongly influenced by factors such as book hit ratio, price, brand, and regional characteristics, represented by language, currency, logistics, and customs. We examined the logs from two days of BookMiner operation and assessed which factors influenced customer preference.
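The paper does not say which correlation test was used. As one common choice, the sketch below computes the Pearson correlation coefficient between two per-store series, for example average response time and click-through percentage. The function name is an assumption, and no result from the paper is reproduced here because the per-store click-through values are not listed in the text.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length, at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross products and sums of squares of the deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)
```

A coefficient near zero would be consistent with the finding that response time is not correlated with click-through, while values near 1 or -1 indicate a strong linear relation.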



Bookstore Selection on Click-through Percentages

The availability of a large variety of products is a key issue in the relationship between customers and companies on the Internet. Bookstore selection is directly related to the book hit ratio metric, which represents the number of times that a bookstore suggests at least one title in response to a customer request over the total number of requests sent to the store. Figure 5 shows a strong relation between the availability of titles and the click-through percentages. The larger the selection of a given store, the greater its click-through percentage.


Bookstore Selection quantifies the diversity of titles offered by a bookstore. It is calculated on a per-click basis: for every bookstore that offers the desired title, we add the inverse of the number of offering bookstores to its bookstore selection. The percentages shown in the graphs of Figure 5 are the relative weight of each bookstore with respect to the overall bookstore selection observed.
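The per-click calculation described above can be sketched as follows. The input format, one set of offering bookstores per click, and the function name are assumptions made for illustration.

```python
def bookstore_selection(clicks):
    """Return each bookstore's relative selection weight in percent.

    `clicks` holds, for every click-through, the set of bookstores that
    offered the desired title (a hypothetical representation of the logs).
    """
    weights = {}
    for offering in clicks:
        if not offering:
            continue  # no bookstore offered the title
        # Each offering bookstore receives 1 / (number of offering stores).
        share = 1.0 / len(offering)
        for store in offering:
            weights[store] = weights.get(store, 0.0) + share
    total = sum(weights.values())
    # Relative weight of each store over the overall selection observed.
    return {store: 100.0 * w / total for store, w in weights.items()}
```

For example, with one click on a title offered only by store A and another on a title offered by both A and B, A accumulates 1.5 and B accumulates 0.5, giving relative weights of 75% and 25%.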

 

  
Figure 5: Bookstore Selection Influence: National and International


Brand Influence on Click-through Percentages

Trust is a fundamental issue in the relationship between customers and online stores. Trust in the electronic market is often associated with the traditional concept of retail brand, which identifies the e-commerce company that is responsible for the customer relationship in an electronic transaction. In this case study, we measured the "brand" factor as the percentage of requests in which a bookstore's offer was clicked even though its price was neither the minimum among the offers nor within 10% of the minimum price. In these cases, we conjectured that the choice was driven by the bookstore brand. Figure 6 shows that in our data there is a correlation between brand and click-through percentage for both national and international bookstores. In the previous section we observed the importance of the brand factor when analyzing the number of click-throughs to the Siciliano bookstore.
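The brand measure defined above can be sketched as follows. The click format, `(clicked_store, {store: price})`, and the function name are assumptions. The sketch reports, for each store, the share of its own clicks that were brand-driven, which is one possible reading of the definition. The 10% tolerance comes from the text.

```python
def brand_factor(clicks, tolerance=0.10):
    """Return {store: percentage of its clicks attributed to brand}.

    `clicks` is a list of (clicked_store, offers) pairs, where `offers`
    maps every store that answered to its price for the title. This
    layout is a hypothetical simplification of the BookMiner logs.
    """
    totals = {}
    brand = {}
    for store, offers in clicks:
        totals[store] = totals.get(store, 0) + 1
        limit = min(offers.values()) * (1.0 + tolerance)
        # Brand-driven click: the chosen price is neither the minimum
        # nor within the tolerance of the minimum price.
        if offers[store] > limit:
            brand[store] = brand.get(store, 0) + 1
    return {s: 100.0 * brand.get(s, 0) / totals[s] for s in totals}
```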

 

  