OLAP applications perform more complex processing than typical relational database applications. Also, a single result can depend on every item of data in the database, which is very unlikely in an operational system, so the volume of processing can be very much higher. This means that it is more important – and more difficult – to get the architecture right.
You should not assume that the vendors are expert at this. We found no product that gets the architecture fully optimized for today’s hardware, and many vendors who were not even aware that they had not done so. Every element of what we believe to be ideal is available, but not yet in a single product. Even the products that have an architecture that is potentially optimal usually do little to make it easy for the application to be tuned.
We have classified the components of an OLAP application into five logically defined layers, shown in Figure 1. In most cases, some of these may be merged, so that fewer physical layers can be distinguished, but this is a good way to classify generalized applications; one could, of course, also show more layers for any particular solution, but it would not improve the analysis to do so. We have indicated that these layers have to communicate with each other through a communications process that is a potential bottleneck. The narrowness of this constriction will vary depending on the architecture. If two modules of the same program are running on the same box, then there is, effectively, no significant narrowing of the bottleneck. If two different programs (from different vendors) are communicating on the same computer, then there will be additional overheads and translations involved, which will constrict the flow of information. If this is done across a network, then the constriction will be much greater.
The diagram in Figure 1 also gives an indication of the volumes of data that must pass between the layers. Clearly, an architecture that places a tight bottleneck between layers that have large volumes of data passing between them is more likely to suffer from performance problems than one that places the tightest bottlenecks at places with small data traffic requirements.
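To make the point about bottleneck placement concrete, the following is a minimal sketch in Python. The layer names are shorthand for the five layers of Figure 1; the volume figures, the link costs and the `transfer_cost` function are illustrative assumptions, not measurements from any product.

```python
# Illustrative cost model only: every name and number below is an assumption.

# Relative volume of data crossing each boundary between adjacent layers
# (arbitrary units, largest near the stored data, smallest near the user).
BOUNDARY_VOLUME = {
    ("files", "dbms"): 1000,
    ("dbms", "bulk_calc"): 500,
    ("bulk_calc", "adhoc_calc"): 10,
    ("adhoc_calc", "gui"): 5,
}

# Relative cost per unit of data for each kind of link between two layers.
LINK_COST = {
    "same_program": 0.0,   # modules of one program on one machine (treated as free)
    "same_machine": 1.0,   # different programs on the same computer
    "network": 10.0,       # communication across a network
}


def transfer_cost(placement):
    """Sum volume * link cost over every layer boundary in a placement."""
    return sum(BOUNDARY_VOLUME[boundary] * LINK_COST[link]
               for boundary, link in placement.items())


# Network split placed where little data flows (between the two calculation layers).
network_near_user = {
    ("files", "dbms"): "same_program",
    ("dbms", "bulk_calc"): "same_program",
    ("bulk_calc", "adhoc_calc"): "network",
    ("adhoc_calc", "gui"): "same_program",
}

# Network split placed where the most data flows (between files and engine).
network_near_data = {
    ("files", "dbms"): "network",
    ("dbms", "bulk_calc"): "same_program",
    ("bulk_calc", "adhoc_calc"): "same_program",
    ("adhoc_calc", "gui"): "same_program",
}

print(transfer_cost(network_near_user))  # 100.0
print(transfer_cost(network_near_data))  # 10000.0
```

Under these assumed numbers, the same network link costs a hundred times more when it sits between the database files and the engine than when it sits between the two calculation layers. The ratio depends entirely on the invented volumes, but the ordering is the one the paragraph above argues for.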
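[Figure 1: The five logical layers of an OLAP application (database files, database management, bulk multidimensional calculations, ad hoc multidimensional calculations, GUI presentation) and the relative volumes of data passing between them.]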
Starting from the lowest level, the database files are, literally, the physical disk file or files holding the data structures and values. We are assuming that data is not physically stored elsewhere, though metadata is sometimes duplicated on client PCs for performance reasons. This does make the product easier to use, but downloading the latest metadata can take time if the connection is slow. In other words, the session may be slow to start up, but will deliver better interactivity later.
The database management layer may be a standard RDBMS or it may be a proprietary multidimensional database engine. The database management layer will, from time to time, have to access all the data in the files, so separating these two layers across a LAN is a recipe for trouble with large applications.
The bulk multidimensional calculations layer is where the large-scale consolidation and other complex calculations that apply across large parts of the database are performed. These calculations will usually be largely or entirely pre-defined and may be performed either in advance or on demand. Doing them in advance improves the run-time performance, but can consume large amounts of disk, as explained in the database explosion section. Because these bulk calculations can involve so much of the database, it is also highly desirable that they be done close to the data, without a network bottleneck in between. Some of these calculations may also be defined at run-time, without pre-programming: for example, a user might define a new ad hoc variance, then ask for the worst five examples from the whole database. To find these five cases might involve calculating and filtering tens of thousands of variances, so it is most efficient for the calculations and the sorting to be done near the data. This layer of an OLAP application is usually included within MOLAP and hybrid OLAP products, and in the case of MOLAP products, it should be tightly bound with the integrated database management layer, with no bottleneck between them.
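The "worst five variances" example can be sketched as follows. This is a hedged illustration in Python, not the method of any particular product: the row layout, the `make_rows` helper, the definition of variance as actual minus budget, and the generated values are all assumptions made for the sake of the example.

```python
import heapq


def make_rows(n):
    """Generate hypothetical fact rows of (product, actual, budget)."""
    return [(f"P{i:05d}", 100 + (i * 37) % 91, 120) for i in range(n)]


def worst_five_near_data(rows):
    """Compute a variance for every row, but keep only the five lowest.

    Run next to the data, this scans every row locally and returns
    just five (product, variance) pairs to the caller.
    """
    variances = ((name, actual - budget) for name, actual, budget in rows)
    return heapq.nsmallest(5, variances, key=lambda item: item[1])


rows = make_rows(20_000)
worst = worst_five_near_data(rows)

# Rows that would have to cross the network in each design.
shipped_if_computed_near_data = len(worst)   # only the five answers
shipped_if_computed_on_client = len(rows)    # every row, before any filtering

print(shipped_if_computed_near_data, shipped_if_computed_on_client)
```

The calculation itself is the same in both designs. What changes is how many rows must cross the bottleneck before the filtering and sorting can happen.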
The ad hoc multidimensional calculations layer performs the simpler calculations often done at reporting time, based on data that is already used in reports anyway (for example, calculating differences between two columns in a report, or a new subtotal of rows that are already used). This functionality is usually provided on the client PC, and may be part of the OLAP tool itself or performed in a spreadsheet or third-party client. If these calculations involve only data that would, in any case, be sent to the client, then they are best done on the client machine. This will reduce LAN traffic and, potentially, the server load as well. However, in some zero footprint Web solutions, such work has to be done at the server, which increases server load and network traffic.
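As a small illustration of the two report-time calculations mentioned above (a difference between two columns and a new subtotal of existing rows), here is a sketch in Python. The `report` contents and both function names are hypothetical; the point is only that neither calculation needs any data beyond what the client has already received.

```python
# Hypothetical report already delivered to the client:
# each row is (region, actual, budget).
report = [
    ("North", 120.0, 100.0),
    ("South", 80.0, 95.0),
    ("East", 60.0, 50.0),
]


def add_difference_column(rows):
    """Append actual - budget to each row, using only data already in the report."""
    return [(region, actual, budget, actual - budget)
            for region, actual, budget in rows]


def subtotal(rows, regions):
    """Return (actual, budget) totals for a chosen subset of rows in the report."""
    selected = [row for row in rows if row[0] in regions]
    return (sum(row[1] for row in selected), sum(row[2] for row in selected))


with_difference = add_difference_column(report)
north_east = subtotal(report, {"North", "East"})

print(with_difference)
print(north_east)
```

Neither function issues a request to the server, which is why doing this work on the client adds no LAN traffic.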
Finally, the GUI presentation layer is the human factors and data manipulation level (for example, the part of the product that allows ‘dicing and slicing’, color coding and charting). Again, this might be part of the OLAP product or some other connected product such as a spreadsheet, an EIS front-end or a Web browser using Java applets or other scripting methods. The thinnest client architectures split this layer between the browser and a mid-tier server. While this provides a clean architecture, it reduces functionality, performance and ease of use.
Although it may do little computational work, a responsive GUI environment can still consume significant processing resources. In order that users can generate queries quickly, it is essential that this layer understands the dimensional structures of the application. This enables it both to present the structures to the users for navigation and to generate efficient queries for processing by the lower levels.
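One way to picture a presentation layer that "understands the dimensional structures" is sketched below in Python. The `TIME_HIERARCHY` metadata, the member names, the `drill_down_query` function and the shape of the request it builds are all hypothetical; real products expose their own metadata and query interfaces.

```python
# Hypothetical dimension metadata held by the presentation layer:
# each member maps to its direct children in the hierarchy.
TIME_HIERARCHY = {
    "2023": ["2023-Q1", "2023-Q2", "2023-Q3", "2023-Q4"],
    "2023-Q1": ["2023-01", "2023-02", "2023-03"],
}


def drill_down_query(measure, member, hierarchy):
    """Build a request for only the children of the member the user expanded.

    Because the hierarchy is known locally, the layer can show the structure
    for navigation and ask the lower layers for exactly the cells it needs.
    """
    children = hierarchy.get(member, [])  # leaf members have no children
    return {"measure": measure, "members": children}


query = drill_down_query("Sales", "2023-Q1", TIME_HIERARCHY)
print(query)
```

Without the local metadata, the layer would have to request a broader slice and discard most of it, or make an extra round trip just to discover which members exist.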


