Section 3 of Experimentation
Abstract
This section establishes the foundational taxonomy of data visualization chart types used to encode numeric values through specific geometric and positional mechanisms. It defines the four primary encoding structures—bar, line, scatter, and box plots—and details their common variations, including histograms, area charts, and dual-axis configurations. The technical contribution of this section lies in mapping the relationship between data variables and visual channels, such as length, position, and area, to facilitate the accurate interpretation of distributions, trends, and multivariate relationships within a dataset. Understanding these encodings is critical for selecting the appropriate visual representation that preserves the integrity of the measured groups and their constituent breakdowns.
Key Concepts
- Bar Chart Encoding: The bar chart is a primary visual tool in which numeric values are indicated by the length of rectangular bars, with each bar corresponding to a distinct measured group. This encoding relies on the pre-attentive processing of length to convey magnitude, allowing immediate comparison between categorical entities through spatial extension along an axis.
- Line Chart Continuity: The line chart is specifically designed to demonstrate changes in value across continuous measurements, such as temporal sequences or ordered intervals. Unlike discrete bar representations, this chart type connects specific value points with line segments to imply a trajectory, making it the standard for visualizing trends over a continuous domain.
- Scatter Plot Relationships: The scatter plot displays the values of two numeric variables using points positioned on two orthogonal axes, where one axis represents the first variable and the other represents the second. This encoding is versatile for demonstrating the nature of the relationship between the plotted variables, specifically indicating whether the correlation is strong or weak, positive or negative, or linear or non-linear.
- Box Plot Distribution: The box plot utilizes a combination of boxes and whiskers to summarize the distribution of values within specific measured groups. The precise positions of the box boundaries and the whisker ends are utilized to show the regions where the majority of the data lies, providing a compact summary of statistical spread without displaying every individual data point.
- Histogram Binning: A histogram represents a variation of the bar chart where the depicted groups are actually continuous numeric ranges rather than discrete categories. By pushing the bars together to eliminate gaps, this encoding demonstrates the frequency distribution of variables within the data, effectively converting continuous data into visible density regions across defined intervals.
- Stacked Bar Composition: The stacked bar chart modifies the standard bar by dividing each primary bar into multiple smaller segments based on values of a second grouping variable. This structure illustrates a relative breakdown of each group’s total into its constituent parts, allowing the viewer to observe how the whole is partitioned across the sub-categories within that specific group.
- Grouped Bar Comparison: In the grouped bar chart variation, sub-bars are placed side-by-side in clusters instead of being stacked vertically. This arrangement makes the primary group totals hard to compare, because no single mark shows a total, but it makes the sub-groups much easier to compare across the primary categories, since every sub-bar starts from the same baseline.
- Area Chart Accumulation: The area chart combines the line chart's idea of connecting value points with line segments with the bar chart's idea of a filled shape, by shading the region between the line and a baseline. When stacking is added, the chart shows not only how a total has changed over time but also how each component's contribution to that total has changed.
- Multi-Variable Bubble Plot: The bubble chart modifies the base scatter plot to show the relationship between three variables instead of two. While the base axes represent the first two variables, the third variable’s value determines the size of each point, effectively adding a quantitative dimension to the positional encoding of the standard scatter plot.
- Density Estimation: The density curve, or kernel density estimate, serves as an alternative to the histogram for showing distributions of data. Rather than collecting data points into fixed frequency bins, each data point contributes a small bump of density (a kernel) centered on its value, and the sum of these bumps forms a smooth curve that estimates the underlying probability distribution.
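The box plot entry above describes box boundaries and whisker ends without saying how they are computed. A minimal sketch follows, assuming the common Tukey convention in which the box spans the first to third quartile and each whisker ends at the most extreme data point within 1.5 interquartile ranges of the box. The function name `box_plot_summary`, the `whisker_factor` parameter, and the sample values are illustrative assumptions, not part of the source.

```python
import statistics

def box_plot_summary(values, whisker_factor=1.5):
    """Return the positions a box plot would draw for one measured group.

    whisker_factor=1.5 is the Tukey convention, assumed here; the text
    does not state which whisker rule is intended.
    """
    data = sorted(values)
    # Quartiles split the sorted data into four parts.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - whisker_factor * iqr
    high_fence = q3 + whisker_factor * iqr
    # Whiskers end at the most extreme data points still inside the fences.
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        "whisker_low": inside[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_high": inside[-1],
        "outliers": [v for v in data if v < low_fence or v > high_fence],
    }

# Hypothetical measurements for a single group.
summary = box_plot_summary([2, 4, 4, 5, 6, 7, 8, 9, 30])
```

Under this convention the value 30 falls outside the upper fence, so it would be drawn as an individual point rather than stretching the whisker.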
Key Equations and Algorithms
- Bar Length Mapping: $L_i = f(v_i)$ describes the mapping where the length $L_i$ of the bar at position $i$ is a function of the value $v_i$ measured for that group. This procedure ensures that the spatial extent of the visual element is directly proportional to the numeric magnitude of the data point it represents, that is, $L_i = k\,v_i$ for a fixed scale factor $k$.
- Histogram Binning Procedure: The algorithm for generating a histogram divides a continuous numeric range into contiguous, non-overlapping bins and counts the data points that fall into each bin. This procedure transforms continuous variable data into discrete frequency counts, which are then rendered as adjacent bars to form a unified distribution shape.
- Stacking Addition Logic: The construction of a stacked bar chart follows an additive logic where $H = \sum_{j} h_j$. Each bar's total height $H$ is the sum of the heights $h_j$ of its constituent sub-bars, ensuring that the visual height of the bar accurately reflects the aggregate sum of the component values.
- Bubble Size Scaling: In a bubble chart, the algorithm modifies the standard point by scaling its radius or area based on a third variable $z$. The size of each point is determined by the value of the third variable, allowing the visual encoding of magnitude in a third dimension while maintaining the $x$-$y$ positional relationship.
- Density Curve Integration: The procedure for generating a density curve places a small kernel of density at the location of each data point and sums the contributions. This differs from binning by not requiring discrete ranges, resulting in a continuous curve that represents the accumulated density rather than discrete frequency counts.
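The difference between the histogram binning procedure and the density curve procedure above can be sketched side by side. This is an illustrative sketch only: the equal-width bins, the Gaussian kernel, the bandwidth of 0.5, and the sample values are assumptions chosen for the example, and the names `histogram_counts` and `gaussian_kde` are hypothetical helpers rather than any library's API.

```python
import math

def histogram_counts(values, lo, hi, n_bins):
    """Count how many values fall into each of n_bins equal-width bins on [lo, hi]."""
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        if lo <= v <= hi:
            # The top edge is folded into the last bin so that hi itself is counted.
            index = min(int((v - lo) / width), n_bins - 1)
            counts[index] += 1
    return counts

def gaussian_kde(values, bandwidth):
    """Return a function x -> estimated density, built from one Gaussian bump per point."""
    norm = 1.0 / (len(values) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values)

    return density

# Hypothetical sample of a continuous variable.
sample = [1.0, 1.5, 2.0, 2.2, 2.4, 3.0, 4.5]
counts = histogram_counts(sample, lo=0.0, hi=5.0, n_bins=5)
density = gaussian_kde(sample, bandwidth=0.5)
```

The histogram result depends on where the bin edges fall, while the density curve depends on the bandwidth; in both cases a single tuning choice controls how smooth the displayed distribution looks.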
Key Claims and Findings
- Stacked vs. Grouped Utility: The stacked bar chart illustrates the relative breakdown of a group's whole into constituent parts while still allowing comparison of the primary group totals, since each bar's full height is the total. Its weakness is comparing sub-groups across bars, because only the bottom segment of each bar sits on a common baseline. Conversely, the grouped bar chart gives up easy comparison of totals but performs significantly better when the analytical goal is to compare sub-groups across categories.
- Dual-Axis Complexity: Dual-axis charts are constructed by overlaying two different charts that share a horizontal axis but utilize potentially different vertical axis scales for each component. To maintain clarity and reduce confusion regarding the different axis scales, it is a common design rule to use different base chart types, such as combining a bar chart with a line chart.
- Scatter Plot Versatility: The scatter plot is identified as the primary tool for verifying the functional relationship between two numeric variables, capable of revealing strong or weak, positive or negative, and linear or non-linear correlations through the spatial clustering of points.
- Distribution Alternatives: While histograms use frequency bins to show distributions, density curves are presented as a superior alternative when a smooth representation of the data’s underlying distribution is required, as they utilize the contribution of each individual data point rather than aggregated bins.
- Variable Expansion: Standard scatter plots are limited to two variables, but bubble charts effectively expand this capability by using point size to encode a third variable, thereby allowing for multivariate analysis in a two-dimensional plane.
- Geospatial Necessity: When values in a dataset correspond to actual geographic locations, it is often more informative to plot them on a map (a geospatial plot) than on an abstract chart, because the map supplies the spatial context for the data values.
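The stacked versus grouped trade-off described above follows from where each sub-bar is placed. The sketch below computes only the geometry for both layouts, with no plotting library; the function names, the `bar_width` value, and the `sales` table are hypothetical and exist only for illustration.

```python
def stacked_layout(table):
    """For each primary group, return (sub_group, bottom, top) segments on one bar."""
    layout = {}
    for group, parts in table.items():
        segments, running = [], 0
        for sub_group, value in parts.items():
            # Each segment starts where the previous one ended.
            segments.append((sub_group, running, running + value))
            running += value
        layout[group] = segments
    return layout

def grouped_layout(table, bar_width=0.25):
    """For each primary group, return (sub_group, x_position, height) bars in a cluster."""
    layout = {}
    for index, (group, parts) in enumerate(table.items()):
        bars = []
        for offset, (sub_group, value) in enumerate(parts.items()):
            # Every bar starts at zero; only the horizontal position shifts.
            bars.append((sub_group, index + offset * bar_width, value))
        layout[group] = bars
    return layout

# Hypothetical data: two primary groups, each split into two sub-groups.
sales = {"Q1": {"web": 30, "store": 50}, "Q2": {"web": 45, "store": 40}}
stacked = stacked_layout(sales)
grouped = grouped_layout(sales)
```

In the stacked layout the top of the last segment equals the group total (80 for Q1, 85 for Q2), but the "store" segments begin at different heights, which makes them harder to compare. In the grouped layout every bar shares the zero baseline, so sub-groups compare directly, but no single mark carries the total.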
Terminology
- Whisker Ends: The outer endpoints of the lines (whiskers) that extend from the box in a box plot. Together with the box boundaries, their positions indicate where the majority of the data lies, defining the range covered by the distribution summary.
- Kernel Density Estimate: The technical term for the density curve used as an alternative to histograms. It refers to the method where each data point contributes a small kernel of density, and the sum of these contributions becomes the final density curve.
- Continuous Measurements: Measurements taken along a continuous or ordered domain, most often time. This is the domain over which line charts are designed to show changes in value.
- Constituent Parts: The sub-components that make up a primary group total, which are visually represented by the subdivisions in a stacked bar chart to illustrate relative breakdowns.
- Sub-groups: Distinct categories within a primary group that are compared against one another, a comparison made significantly more effective using a grouped bar chart than a stacked configuration.
- Grid Cells: The individual units created in a heatmap by dividing two variables of interest into ranges or levels. The values inside these cells are colored based on magnitude, often with darker colors corresponding with higher values.
- Pipeline Flow: The process tracked by a funnel chart in business contexts. The chart visualizes the number of users or visitors making it to each stage of the process, indicated by the width of the funnel at each stage division.
- Geospatial Plots: A category of specialist charts used when values correspond to geographic locations. These plots allow for the actual plotting of data with some kind of map to provide spatial context.
- Baseline: The reference line from which shading is extended in an area chart or from which density is built in a violin plot. It serves as the zero-point reference for the visual area or distribution shape.
- Group Membership: The categorical classification of points in a bubble chart when the third variable is categorical. In these instances, points utilize different shapes or colors to indicate their specific group membership rather than relying solely on size.
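The grid-cell construction behind a heatmap, as defined in the Grid Cells entry above, can be sketched as a two-dimensional count. This is a minimal illustration under stated assumptions: cells are filled with point counts (a heatmap may instead show any per-cell value), ranges are half-open except the last, and the names `heatmap_grid` and `shade` and the sample points are hypothetical.

```python
def heatmap_grid(points, x_edges, y_edges):
    """Count points per grid cell; rows follow y ranges, columns follow x ranges."""

    def cell_index(value, edges):
        # Ranges are half-open [edge_i, edge_i+1), except the last,
        # which also includes its top edge.
        if value == edges[-1]:
            return len(edges) - 2
        for i in range(len(edges) - 1):
            if edges[i] <= value < edges[i + 1]:
                return i
        return None  # value lies outside the grid

    grid = [[0] * (len(x_edges) - 1) for _ in range(len(y_edges) - 1)]
    for x, y in points:
        col, row = cell_index(x, x_edges), cell_index(y, y_edges)
        if col is not None and row is not None:
            grid[row][col] += 1
    return grid

def shade(count, max_count):
    """Map a cell count to a lightness in [0, 1]; 0.0 is darkest, 1.0 is lightest."""
    return 1.0 if max_count == 0 else 1.0 - count / max_count

# Hypothetical observations of two numeric variables.
points = [(0.5, 0.5), (0.7, 0.2), (1.5, 0.5), (1.5, 1.5), (2.0, 2.0)]
grid = heatmap_grid(points, x_edges=[0, 1, 2], y_edges=[0, 1, 2])
```

The `shade` helper follows the convention noted above, where darker colors correspond to higher values; a real chart would map this lightness onto a color scale.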