Jewels, Himalayas and Fireworks, Extending Methods for Visualizing n dimensional Clustering

tarix	24.06.2016
ölçüsü	1.87 Mb.

Jewels, Himalayas and Fireworks,

Extending Methods for Visualizing N Dimensional Clustering

W. Jockheck

Dept. of Computer Science

North Dakota State University

Fargo, North Dakota 58105

William.Jockheck@ndsu.nodak.edu

Dr. William Perrizo

Dept. of Computer Science

North Dakota State University

Fargo, North Dakota 58105

William.Perrizo@ndsu.nodak.edu

Abstract

Visualization of higher dimensional data is awkward and in some cases counter intuitive. Yet the human mind is the most sophisticated and effective pattern recognition device available for two-dimensional or three-dimensional patterns and two-dimensional patterns easily presented to humans (on computer screen or print out). This paper considers “pattern preserving” (cluster preserving) transformations of high dimensional data to two (or three) dimensional space, so that a quick visual scan can yield significant amounts of pattern related information. Over time several approaches have been presented that attempt to show the relationships between data that is in more than three dimensions. The work here explores methods of extending and overcoming some of the drawbacks of these systems, specifically the Jewel Diagram [4] and combinations and variations on Alfred Inselberg’s parallel coordinate system [1,2,3]. The lack of pattern preserving projections of n-dimensions into two dimensions is the root of the problem, but efforts to extend meaning and clarity continue. This paper specifies methods to improve the displays and the meaning conveyed by two dimensional graphics.

1. Background

Based on the need to represent complex data in a visual format several methods have been developed. If the number of independent variables exceeds three it becomes awkward to provide graphs useful for visualization. Attempting to do so usually results in some combination of graphs or the exclusion of some of the variables for the purpose of simplicity. The approaches in this paper are based on the Jewel Diagram [4] and extension of the Parallel Coordinate methods [1,2,3].

1.1 Parallel Coordinate Scheme

Figure 1: A Parallel Axis System

This method, from Inselberg and others, attempts plots with all the axes parallel to each other in a plane. This preserves some of geometric structure. It is, however, projected in such a fashion that most geometric intuition has to be relearned. Inselberg has used it to achieve impressive results but the intuition is not clear to a novice user.

1.2 Jewel Diagram

As originally proposed the Jewel Diagram projected the normalized variables onto the sides of an n-sided polygon. The projected points are then merged (to their average or center of mass) to form a single projection point at the center of the polygon that represents the n-dimensional projection of the data. If each point is shown in a different color, the diagram looks like a jewel, therefore the name.

Figure 2: Jewel Diagram with six dimensions with attributes for each sample connected. The central points are the n-dimensional projection of the data.

The advantage of the Jewel Diagram is that the sides of the polygons provide an indication of the distribution or clustering of the individual attributes of each tuple (sample) while the central points provide an n-dimensional projection of each tuple and hence a snapshot of the overall relationships between the tuples (samples).

The concept of these n-dimensional projections were picked up and added to a number of other high dimensional visualization systems. For clarity, these n-dimensional projections will be refereed to as representative projections. In each case the representative projections are two dimensional projections of the n-dimensional space represented by the attributes of each tuple.

There are serious problems in an absolute mathematical sense with this approach. As the number of dimensions increases, the loss of information in the projections increases. Points mapped to 2-d space are not a unique representation of the n-d space. While this loss of uniqueness is relatively small, sample coincidence may occur, using only the 2-d projection these cannot be detected.

For the jewel diagram the points mapped into 2-d space will converge to a single point (no distinguishable variation in values) as the number of sides of the polygon approaches a circle. This shrinking viewing space is due to the fact that the values are averaged. While it is possible to increase the diameter of the near circle to separate the point, when n is very large it is not possible to view the distribution of the individual attributes on the sides of the polygons in the same image as the center of mass points.

The sequence that the variables are placed on the polygon has significant impact. Note that (in figure 2) series 1 and 4 increase in opposite directions (180 degree offset) hence directly offsetting each other while 1 and 2 move in a similar (30 degree offset) direction. Also, changing the sequence of variable changes the diagram significantly (the method is not robust with respect to sequencing).

Exploration has not consistently produced useful patterns. It appears that the pattern is often lost due to the perturbations caused by the random (or uncorrelated) values in some variables which then move the points randomly in the two dimensional space.

1.3 Variations on the Parallel Coordinate Scheme

Based on the projected central point, attempts to use that mechanism were projected onto the parallel coordinate and a variation called the mountain or Himalayan diagram [5] is used, which changes the angles of parallel coordinate system to distribute the central projected points.

Adding the combined points to the parallel axis system provides a two dimensional visualization of a means of the normalized values. This becomes very busy as the number of attributes (A) increase, and (cluster) patterns can become hidden or lost.

Figure 3: Parallel axis with mean points added.

Below, in an effort to increase the separation a variation called the mountain or Himalayan diagram was used to spread the distribution of the combined point projection over two dimensions. Here the once parallel axis are connected and are rotated through ever increasing angles, providing a unique x component to each attribute.

Figure 4: Himalayan Diagram with projected points.

Several variations of the Himalayan Diagram have been evaluated with modification of the angles and selection of whether the normalized attribute values are placed on the axis directed upward or downward.

Problems with the inclusion of larger numbers of attributes lead to the combination of the parallel and Himalayan diagrams with the concept of the jewel diagram by arranging the diagram around a circle. This placed the line segments representing the parallel or Himalayan axis around the outside of a circle and posted the projection combined points in the center.

Using at the Himalayan diagram, we looked at instead of connecting the vectors and arranging the vectors as spokes of a wheel. Using the spokes with the Himalayan angles should result in the same mean points as the linear arrangement. Doing this allows the use of more of the circle. The following Himalayan diagram uses the circle.

Figure 5: Himalayan Diagram wrapped around a circle.

The four blue line segments, each twice as far about the circle as its predecessor are used to plot the normalized attribute values. The colored lines then connect the attributes from the same tuple or sample. The circle is started at 0 degrees so the first line segment is at approximately 30 degrees, since the circle is completely used, the last is at positive x axis. This is based on the fact that the Himalayan diagram doubles the angle as each new line segment is added.

By using the segment of each radius from length one to length 2, the center of the circle becomes available to plot the mean points. Unfortunately, since the last angle change must be greater than 180 degrees, the mean points will only fall in the top half of the circle.

The image shows three samples plotted in a sort of jewel diagram manner with the plot ending at the mean point for convenience in Excel.

Plotting uses the well-known iris data set usually attributed to R.A. Fisher [6]. Additional notation about the data set are discussed in section 2.3 Selection of attribute sequence and weighting.

Using ten samples from each class the results are as follows.

Figure 6: Himalayan Diagram with projected points.

This does not provide good separation of classes, as you zoom in on the mean points.

Figure 7: Zoom in on Himalayan Diagram projected points.

Looking just at the mean points, removing the lines we see this.

Figure 8: Himalayan Diagram projected points only.

This particular data set does not separate clusters well in this variation. Alteration of the sequence of attributes and scaling may alter the appearance and improve the performance. However, the underlying problem is that by projecting the data points into a lower dimensional space the points lose their uniqueness. Completely different tuples (samples) can plot to the same point in two dimensional space. The method relies largely on the vastness of possible combinations to hope that the coincidental mean points do not occur.

While interesting, all of these approaches continued to have issues with separation of the projected points because of the interaction of projections. That is the x component of one attribute moved the combined point in the opposite direction from the x component of another either completely or proportional to the angel of the line segment. And of course the same is true of the y component.

2. Attempts to improve separation.

This leads to several approaches to improve the separation of the projected points. Since each attribute is projected onto a different line segment, the angel of the line segment determines the x and y component of the projection.

The participation of each attribute value in the projected point is relatively simple. The line segments that make up the sides of the polygon or the axis of the parallel or Himalayan diagrams each correspond to one of the attributes or dimensions. The angle of each side of the polygon is quite simply 2 pi / n radians (360/n degrees). Similarly each axis of the other systems is derived by dividing the arc appropriately. The point on this axis has an x and y component of simply its angle (some value m between 1 and n times 2pi/n). Its x and y components are simply the sine and cosine of the normalized attribute values.

For simplicity this means the cluster of data points can be viewed quickly and easily in, e.g., Excel with a simple x,y plot. The resulting plot represents a projection of the n dimensions of the attributes into two dimensional space in the same sense that you might draw a three dimensional box by selecting an arbitrary vector to represent the z axis.

2.1 Sides of the Jewel

To avoid the counteracting vector problem, the jewel diagram is altered to use only polygons with odd numbers of sides, even if a side is not used. This modification is specifically selected to insure that the projection of each attribute is on a different angle.

Figure 9: Jewel Diagram showing a subset of the iris data on a polygon with an odd number of sides.

The diagram shows some separation at common points. For convenience, the three different classes of iris are shown in different colors but this is not a feature of the diagram, it is only added to show the separation does correspond to class. Showing just the mean points gives this.

Figure 10: Jewel Diagram projected points only.

2.2 Restrict to One Quadrant

To attempt to avoid the counter action of the attribute plots as the axis move about the circle, an approach to limit the plot only to the first quadrant was examined. In doing it this way all, projections have only a positive component and while each attribute contributes to the projection, the effects are less dramatic than a pull from the opposite side of the circle.

2.2.1 Quadrant-1 Himalayan Diagrams

This is the variation of jewel and Himalayan diagrams where the diagram is limited quadrant 1. This means that the 4 attributes are projected onto the four sides of a sixteen sided polygon that are in quadrant I. For the Himalayan diagram the four line segments are limited to the same quadrant.

Figure 11: Himalayan Diagram restricted to the first quadrant.

The Himalayan diagram becomes a little jumbled since the mean points fall in the space of the other lines. This can be fixed with an offset scaling but for clarity the points can be examined separately.

Figure 12: Himalayan Diagram restricted to the first quadrant, projected points only.

2.2.2 Parallel becomes Radial

When the parallel axis system is placed on a circle it becomes more of a radial axis system as the line segments are arrange around the out side of a circle. The difference from the Himalayan is only the uniform spacing instead of the doubling of the angles.

Figure 13: Radial axis Diagram restricted to the first quadrant.

Again the mean points are obscured by the lines. Selecting the mean points only provides the following:

Figure 14: Radial axis Diagram restricted to the first quadrant, projected points only.

This is a good separation of the yellow (diamond) class. Note that the red (square) and green (triangle) classes tend to intermingle. We must keep in mind, though, that the red and green classes of the full iris dataset are not very distinct from one another, and therefore we should not see the same separation as we see in the red-yellow or green-yellow cases.

2.2.3 Fireworks

To avoid the problem in the Himalayan limited to the first quadrant caused by the lines overlapping the projected points, a variation projecting the points out of the circle instead of into the circle was created.

Figure 15: Fireworks version of the radial axis diagram restricted to the first quadrant including its projected points projected.

This variation seems to maintain the information contained in the lines and still show the n dimensional projection of the data.

2.2.4 Jewel limited to the first quadrant.

In this variation, the numbers of side of the polygon have to be extended to at least 4n and only the sides falling in the first quadrant are used.

Figure 16: A jewel diagram limited to the first quadrant.

The jewel diagram becomes harder to view due to the overlap of lines. The issue is that as n becomes larger (and here the value of n is graphed on 4n) the viewing space and separation becomes smaller and smaller until eventually the polygon approximates a circle and the mean points converge to a single point.

Figure 17: Projection of points from jewel diagram limited to the first quadrant.

This variation because of the limitation to the first quadrant makes all projection vectors closer to parallel and the separation become along the an arc near the perimeter of the polygon.

2.3 Selection of attribute sequence and weighting.

Since the sequence of selection determines the portion of x and y in the projection, altering the sequence of the attributes will change the orientation of the resulting projected point. In addition, as noted by Inselberg, it will also alter the clarity of information that can be gleaned from the lines linking the sample values from axis to axis.

Figure 18: Jewel diagram with weigh altered to include sides that appear to have significant separation.

The combined points convey information about the combination of all four attributes. However, there are problems that arise with noise (meaningless attributes). In this example the fact that the sepal width is -0.4194 correlated to class tends to obscure the separation. The advantage is that having seen this along the sides of the polygon the observer can adjust the relative participation (weighting) of the sides to further separate the results. This process can be automated to simply locate the most highly correlated attributes and select an appropriate weighting.

Figure 19: Jewel diagram projected point separation using all attributes.

Comparing the result of all attributes to a weighted result clearer insight appears.

Figure 20: Jewel diagram projected point separation weighting the projection using attributes with high separation.

While the examples up to this point have relied on a subset of the Fisher iris data, when looking at just the projected points it is possible to include all of the data set.

Figure 21: Jewel diagram of Fishers full data set point projections using all attributes.

Then weighting the more separated attributes, the full set gives figure 22. While in all the versions, the yellow (diamond), Iris-setosa, is distinguishable from the other classes. However, the green (triangle), Iris-virginica and red (square) Iris-versicolor are entangled. While the extremes of the entangled area are clearly different by at least as much as the space between the yellow and red, the overlap creates confusion.

Figure 22: Jewel diagram projected points for the complete Fisher data set using weighted attributes for the projection.

If the classification was not shown, a user would probably identify the yellow points as different from the others. Further the long shape of the red/green group might imply that it is more than one group with similarities.

This overlap area would also identify samples that might merit reevaluation or review. However, not being an expert on iris should the question be raised? In fact an internet search failed to locate any questioning of the classification. So with a novice eye, look to these images of the three classes, all members of the Limniris subgenus of Iris, i.e. Iris. : Limniris is one of six subgenuses.

2.4 Effects of noise attributes

While the proposed techniques for viewing and projecting n-dimensional data work relatively well when highly correlated attributes are given predominance in the projection, the user may not have the luxury of perfect knowledge. The utility of the tool is to be able to gain insight without extensive analysis and without perfect knowledge.

The demonstration of this problem is to simply include a fake attribute or two with random numbers as values. Using the same data set of four attributes, two random value attributes were added. The results were troubling, but not unexpected. Adding 50% random data, then projecting six dimension down to two, can be expected to tax these methods severely.

Figure 23: Jewel diagram projected points for the complete Fisher data adding two fake attributes of random numbers.

The presence of the random values perturbs the result making any pattern obscured. This is further perturbed since the two random attributes were added as the first and second variable so the position of the classes reoriented in the x-y space. With the addition of the class marking there may be some desire to see a separation, but the inclusion of the random attributes has essentially obscured any boundary clarity.

Figure 24: Jewel diagram projected points for the complete Fisher data plus two fake random attributes and marking for class.

This argues strongly for the necessity to weight the selection and projection of attributes based on separation or correlation. This can be done either visually from the plot of the attributes in the full diagram (jewel, Himalayan or fireworks) or mathematically by a measure of entropy or information content. For simplicity a simple Chi square test for randomness may suffice to select those attributes to be more heavily weighted in the projection.

4. Conclusion and Future Directions

These techniques provide methods for viewing n-dimensional data in two dimensional space and hence to visually interpreting such items as clustering in data mining. These extensions improve on the current state-of-the-art. Like most tools it has its drawbacks. However, the method of displaying the information provides a method for human viewers to gain understanding and insight.

Despite some drawbacks this technique does provide a method to visualize n-dimensional clustering. As a result the authors intend to continue to use the technique to represent data mining results in n-dimensional space. Since the fireworks version of the tools seems to combine all the utility of the versions with the least drawbacks it is the candidate for incorporation into future work. Principle lines of interest now devolve on methods for selecting and weighting attributes to optimize the information conveyed by the image.

5. References

[1] Visual data mining with parallel coordinates Al Inselberg, COMPUTATION STAT 13: (1) 47-63 1998

[2] MULTIDIMENSIONAL LINES .1. REPRESENTATION. INSELBERG A, DIMSDALE B, SIAM J APPL MATH 54: (2) 559-577 APR 1994

[3] Parallel coordinate graphics using MATLAB, http://www.nbb.cornell.edu/neurobio/land/PROJECTS/Inselberg/

[4] Visualizing N Dimensional Clustering. Computers and Their Applications 2002: 144-147 Paul Juell, William Jockheck.

[5] Visualization of High-Dimensional Space, 2007 International Conference on Computers and Their Applications, M. Canton, W. Perrizo, Honolulu, March, 2007.

[6] "The use of multiple measurements in taxonomic problems" Fisher, R. A. Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to Mathematical Statistics" (John Wiley, NY, 1950)