Kernel density estimation of 100 normally distributed random numbers using different smoothing bandwidths.

In statistics, kernel density estimation (KDE) is a non-parametric way to estimate the probability density function of a random variable. Kernel density estimation is a fundamental data smoothing problem where inferences about the population are made, based on a finite data sample. In some fields such as signal processing and econometrics it is also termed the Parzen–Rosenblatt window method, after Emanuel Parzen and Murray Rosenblatt, who are usually credited with independently creating it in its current form.

Definition

Let (x1, x2, …, xn) be a univariate independent and identically distributed sample drawn from some distribution with an unknown density ƒ. We are interested in estimating the shape of this function ƒ. Its kernel density estimator is

$$\hat{f}_h(x) = \frac{1}{n}\sum_{i=1}^{n} K_h(x - x_i) = \frac{1}{nh}\sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right),$$

where K is the kernel (a non-negative function that integrates to one) and h > 0 is a smoothing parameter called the bandwidth. A kernel with subscript h is called the scaled kernel and is defined as K_h(x) = (1/h) K(x/h). Intuitively one wants to choose h as small as the data will allow; however, there is always a trade-off between the bias of the estimator and its variance. The choice of bandwidth is discussed in more detail below.
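
The definition translates almost line for line into code. The following is a minimal Python sketch, assuming a Gaussian kernel as the default; the function names and the three-point sample are illustrative only and do not come from any particular library.

```python
import math

def gaussian_kernel(u):
    """Standard normal density, used here as the kernel K."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def scaled_kernel(u, h, kernel=gaussian_kernel):
    """Scaled kernel K_h(u) = K(u / h) / h."""
    return kernel(u / h) / h

def kde(x, sample, h, kernel=gaussian_kernel):
    """Kernel density estimate at x: the average of K_h(x - x_i) over the sample."""
    if h <= 0:
        raise ValueError("bandwidth h must be positive")
    return sum(scaled_kernel(x - xi, h, kernel) for xi in sample) / len(sample)

# Hypothetical three-point sample, for illustration only.
sample = [-1.0, 0.0, 2.5]
print(kde(0.0, sample, h=1.0))
```

Because each scaled kernel integrates to one and the estimate is their average, the estimate itself integrates to one for any h > 0.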

A range of kernel functions are commonly used: uniform, triangular, biweight, triweight, Epanechnikov, normal, and others. The Epanechnikov kernel is optimal in a mean square error sense, though the loss of efficiency is small for the other kernels listed. Because of its convenient mathematical properties, the normal kernel is often used, which means K(x) = ϕ(x), where ϕ is the standard normal density function.
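
The claim that little efficiency is lost can be checked numerically. The sketch below assumes the usual textbook forms of the listed kernels and measures efficiency as the ratio of sqrt(m2(K)) R(K) for the Epanechnikov kernel to the same quantity for the kernel in question, where m2 is the second moment and R the integral of the squared kernel. That definition is an assumption brought in for illustration; the text above does not state one.

```python
import math

# Each entry: (kernel function, half-width of the interval used for integration).
KERNELS = {
    "uniform":      (lambda u: 0.5, 1.0),
    "triangular":   (lambda u: 1.0 - abs(u), 1.0),
    "Epanechnikov": (lambda u: 0.75 * (1.0 - u * u), 1.0),
    "biweight":     (lambda u: 15.0 / 16.0 * (1.0 - u * u) ** 2, 1.0),
    "triweight":    (lambda u: 35.0 / 32.0 * (1.0 - u * u) ** 3, 1.0),
    # The normal kernel has unbounded support; [-10, 10] holds essentially all its mass.
    "normal":       (lambda u: math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi), 10.0),
}

def integrate(f, a, b, steps=10000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    width = (b - a) / steps
    return width * sum(f(a + (k + 0.5) * width) for k in range(steps))

def kernel_constant(name):
    """sqrt(m2(K)) * R(K) for the named kernel."""
    kernel, half = KERNELS[name]
    m2 = integrate(lambda u: u * u * kernel(u), -half, half)
    roughness = integrate(lambda u: kernel(u) ** 2, -half, half)
    return math.sqrt(m2) * roughness

def efficiency(name):
    """Efficiency relative to the Epanechnikov kernel (1.0 means equally efficient)."""
    return kernel_constant("Epanechnikov") / kernel_constant(name)

for name, (kernel, half) in KERNELS.items():
    area = integrate(kernel, -half, half)
    print(f"{name:13s} integral={area:.4f} efficiency={efficiency(name):.3f}")
```

Under this definition every listed kernel stays above roughly 0.93, with the normal kernel near 0.95.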

The construction of a kernel density estimate finds interpretations in fields outside of density estimation. For example, in thermodynamics, this is equivalent to the amount of heat generated when heat kernels (the fundamental solution to the heat equation) are placed at each data point location xi. Similar methods are used to construct discrete Laplace operators on point clouds for manifold learning.

Example

Kernel density estimates are closely related to histograms, but can be endowed with properties such as smoothness or continuity by using a suitable kernel. To see this, we compare the construction of histogram and kernel density estimators, using these 6 data points:

Sample   1     2     3     4     5     6
Value   -2.1  -1.3  -0.4   1.9   5.1   6.2

For the histogram, first the horizontal axis is divided into sub-intervals or bins which cover the range of the data. In this case, we have 6 bins each of width 2. Whenever a data point falls inside a bin, we place a box of height 1/12 there, so that each of the 6 points contributes an area of 2 × 1/12 = 1/6. If more than one data point falls inside the same bin, we stack the boxes on top of each other.
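
The box stacking can be reproduced in a few lines of Python. The text does not say where the bins start, so the sketch assumes bin edges at -4, -2, 0, 2, 4, 6 and 8; any other alignment of six width-2 bins that covers the data would give the same box height of 1/12 per point.

```python
import math

data = [-2.1, -1.3, -0.4, 1.9, 5.1, 6.2]

def histogram_density(sample, left_edge, bin_width, n_bins):
    """Return the density height of each bin: count / (n * bin_width)."""
    counts = [0] * n_bins
    for x in sample:
        index = math.floor((x - left_edge) / bin_width)
        if 0 <= index < n_bins:
            counts[index] += 1
    return [count / (len(sample) * bin_width) for count in counts]

# Assumption: the six bins are [-4, -2), [-2, 0), [0, 2), [2, 4), [4, 6), [6, 8).
heights = histogram_density(data, left_edge=-4.0, bin_width=2.0, n_bins=6)
for k, height in enumerate(heights):
    print(f"[{-4 + 2 * k}, {-2 + 2 * k}): {height:.4f}")
```

With these edges the second bin holds two points and therefore has height 2/12, while the bin [2, 4) is empty.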

For the kernel density estimate, we place a normal kernel with variance 2.25 (indicated by the red dashed lines) on each of the data points xi. The kernels are summed to make the kernel density estimate (solid blue curve). The smoothness of the kernel density estimate is evident compared to the discreteness of the histogram, as kernel density estimates converge faster to the true underlying density for continuous random variables.
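
The same six points can be run through a kernel density estimate. The sketch below assumes each normal kernel is weighted by 1/n before summing, so that the total integrates to one; a variance of 2.25 corresponds to a bandwidth (kernel standard deviation) of 1.5.

```python
import math

data = [-2.1, -1.3, -0.4, 1.9, 5.1, 6.2]
h = math.sqrt(2.25)  # kernel standard deviation, i.e. the bandwidth (1.5)

def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and standard deviation."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def individual_kernels(x):
    """Contribution of each data point at x: one normal bump scaled by 1/n."""
    return [normal_pdf(x, xi, h) / len(data) for xi in data]

def kde(x):
    """Kernel density estimate at x: the sum of the individual contributions."""
    return sum(individual_kernels(x))

# Riemann sum over a grid wide enough to hold essentially all of the mass.
step = 0.05
grid = [-10.0 + step * k for k in range(521)]  # -10 to 16
area = step * sum(kde(x) for x in grid)
print(f"area under the estimate: {area:.4f}")
```

Unlike the histogram heights, the resulting curve varies smoothly with x and does not depend on where bin edges are placed.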

Comparison of the histogram (left) and kernel density estimate (right) constructed using the same data. The 6 individual kernels are the red dashed curves, the kernel density estimate the blue curves. The data points are the rug plot on the horizontal axis.

Bandwidth selection

Kernel density estimate (KDE) with different bandwidths of a random sample of 100 points from a standard normal distribution. Grey: true density (standard normal). Red: KDE with h=0.05. Black: KDE with h=0.337. Green: KDE with h=2.

The bandwidth of the kernel is a free parameter which exhibits a strong influence on the resulting estimate. To illustrate its effect, we take a simulated random sample from the standard normal distribution (plotted at the blue spikes in the rug plot on the horizontal axis). The grey curve is the true density (a normal density with mean 0 and variance 1). In comparison, the red curve is undersmoothed since it contains too many spurious data artifacts arising from using a bandwidth h = 0.05, which is too small. The green curve is oversmoothed since using the bandwidth h = 2 obscures much of the underlying structure. The black curve with a bandwidth of h = 0.337 is considered to be optimally smoothed since its density estimate is close to the true density.
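
One way to put numbers on "undersmoothed" and "oversmoothed" is to compute the integrated squared error of each estimate against the true density. The Python sketch below assumes a Gaussian kernel and draws its own sample with an arbitrary seed, so it is not the sample shown in the figure and the exact values will differ.

```python
import math
import random

def normal_pdf(x):
    """Standard normal density: both the kernel and the true density here."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def kde(x, sample, h):
    """Gaussian kernel density estimate at x with bandwidth h."""
    return sum(normal_pdf((x - xi) / h) for xi in sample) / (len(sample) * h)

def integrated_squared_error(sample, h, lo=-6.0, hi=6.0, steps=600):
    """Midpoint-rule approximation of the integral of (estimate - truth)^2 over [lo, hi]."""
    width = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        x = lo + (k + 0.5) * width
        total += (kde(x, sample, h) - normal_pdf(x)) ** 2
    return total * width

rng = random.Random(0)  # arbitrary seed, used only for reproducibility
sample = [rng.gauss(0.0, 1.0) for _ in range(100)]
ise = {h: integrated_squared_error(sample, h) for h in (0.05, 0.337, 2.0)}
for h, value in ise.items():
    print(f"h={h}: ISE={value:.4f}")
```

The error is dominated by noise at h = 0.05 and by smoothing bias at h = 2, with the intermediate bandwidth giving the smallest value of the three.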

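The most common optimality criterion used to select this parameter is the expected L2 risk function, also termed the mean integrated squared error:

$$\operatorname{MISE}(h) = \operatorname{E}\left[\int \left(\hat{f}_h(x) - f(x)\right)^2 dx\right].$$

Under weak assumptions on ƒ and K (ƒ is the, generally unknown, real density function), MISE(h) = AMISE(h) + o(1/(nh) + h^4), where o is the little o notation. The AMISE is the asymptotic MISE, which consists of the two leading terms

$$\operatorname{AMISE}(h) = \frac{R(K)}{nh} + \frac{1}{4}\, m_2(K)^2\, h^4\, R(f''),$$

where $R(g) = \int g(x)^2\,dx$ for a function g, $m_2(K) = \int x^2 K(x)\,dx$, and ƒ'' is the second derivative of ƒ. The minimum of this AMISE is the solution to this differential equation

$$\frac{\partial}{\partial h}\operatorname{AMISE}(h) = -\frac{R(K)}{nh^2} + m_2(K)^2\, h^3\, R(f'') = 0$$

or

$$h_{\mathrm{AMISE}} = \frac{R(K)^{1/5}}{m_2(K)^{2/5}\, R(f'')^{1/5}\, n^{1/5}}.$$
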
Neither the AMISE nor the h_AMISE formula can be used directly, since they involve the unknown density function ƒ or its second derivative ƒ'', so a variety of automatic, data-based methods have been developed for selecting the bandwidth. Many review studies have been carried out to compare their efficacies, with the general consensus that the plug-in selectors and cross validation selectors are the most useful over a wide range of data sets.
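
As an illustration of a data-based selector, the following Python sketch implements least-squares cross-validation for a Gaussian kernel. It is one simple member of the cross-validation family, not a method prescribed by the text. It picks the h that minimises an estimate of the integral of f_hat^2 minus twice the integral of f_hat times f, using the closed form for the first term that holds for Gaussian kernels and a leave-one-out average for the second. The candidate grid, sample size and seed are arbitrary choices.

```python
import math
import random

def normal_pdf(x, sd):
    """Density at x of a zero-mean normal with standard deviation sd."""
    return math.exp(-0.5 * (x / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def lscv_score(sample, h):
    """Least-squares cross-validation criterion for a Gaussian kernel.

    First term: integral of f_hat^2, which for Gaussian kernels equals the
    average over all pairs of a normal density with standard deviation h*sqrt(2).
    Second term: twice the average leave-one-out estimate at the data points.
    """
    n = len(sample)
    square_term = 0.0
    leave_one_out = 0.0
    for i, xi in enumerate(sample):
        for j, xj in enumerate(sample):
            square_term += normal_pdf(xi - xj, h * math.sqrt(2.0))
            if i != j:
                leave_one_out += normal_pdf(xi - xj, h)
    return square_term / n ** 2 - 2.0 * leave_one_out / (n * (n - 1))

def select_bandwidth(sample, candidates):
    """Return the candidate bandwidth with the smallest cross-validation score."""
    return min(candidates, key=lambda h: lscv_score(sample, h))

rng = random.Random(1)  # arbitrary seed, used only for reproducibility
sample = [rng.gauss(0.0, 1.0) for _ in range(100)]
candidates = [0.05 * k for k in range(1, 31)]  # 0.05, 0.10, ..., 1.50
print(select_bandwidth(sample, candidates))
```

The double loop costs O(n^2) per candidate, which is acceptable for a sketch but would need binning or a faster summation scheme for large samples.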

Substituting any bandwidth h which has the same asymptotic order $n^{-1/5}$ as h_AMISE into the AMISE gives that AMISE(h) = $O(n^{-4/5})$, where O is the big O notation. It can be shown that, under weak assumptions, there cannot exist a non-parametric estimator that converges at a faster rate than the kernel estimator. Note that the $n^{-4/5}$ rate is slower than the typical $n^{-1}$ convergence rate of parametric methods.
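
Both the minimiser and the rate can be checked numerically in a case where the density is known. The Python sketch below assumes a standard normal kernel and a standard normal true density, for which R(K) = 1/(2√π), m2(K) = 1 and R(ƒ'') = 3/(8√π). It uses the AMISE in the form R(K)/(nh) + (1/4) m2(K)^2 h^4 R(ƒ'') and its closed-form minimiser.

```python
import math

# Assumptions: standard normal kernel K and standard normal true density f.
R_K = 1.0 / (2.0 * math.sqrt(math.pi))    # R(K) = integral of K^2
M2_K = 1.0                                 # m2(K) = integral of x^2 K(x)
R_F2 = 3.0 / (8.0 * math.sqrt(math.pi))   # R(f'') = integral of (f'')^2

def amise(h, n):
    """AMISE(h) = R(K)/(n h) + (1/4) m2(K)^2 h^4 R(f'')."""
    return R_K / (n * h) + 0.25 * M2_K ** 2 * h ** 4 * R_F2

def h_amise(n):
    """Bandwidth that minimises amise(h, n); it shrinks like n^(-1/5)."""
    return (R_K / (M2_K ** 2 * R_F2 * n)) ** 0.2

for n in (100, 1_000, 10_000, 100_000):
    h = h_amise(n)
    # The last column stays constant across n, which is the O(n^(-4/5)) rate.
    print(n, round(h, 4), amise(h, n), amise(h, n) * n ** 0.8)
```

Replacing h_amise(n) by any fixed multiple of it changes the constant in the last column but not the fact that it is constant in n.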

If the bandwidth is not held fixed, but is varied depending upon the location of either the estimate (balloon estimator) or the samples (pointwise estimator), this produces a particularly powerful method termed adaptive or variable bandwidth kernel density estimation.
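
The two variants differ only in what the bandwidth is attached to. The Python sketch below uses a k-th-nearest-neighbour distance as the local bandwidth in both cases. That rule is an assumption made for illustration (the text does not specify how the bandwidth varies), and the six-point sample and k = 2 are likewise arbitrary.

```python
import math

def normal_pdf(u):
    """Standard normal density, used as the kernel."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kth_nearest_distance(point, sample, k):
    """Distance from point to its k-th nearest value in sample."""
    distance = sorted(abs(point - xi) for xi in sample)[k - 1]
    if distance <= 0:
        raise ValueError("local bandwidth is zero; increase k or remove duplicates")
    return distance

def balloon_kde(x, sample, k):
    """Balloon estimator: one bandwidth per evaluation location x."""
    h = kth_nearest_distance(x, sample, k)
    return sum(normal_pdf((x - xi) / h) for xi in sample) / (len(sample) * h)

def pointwise_kde(x, sample, k):
    """Pointwise estimator: one bandwidth per sample point x_i."""
    total = 0.0
    for i, xi in enumerate(sample):
        others = sample[:i] + sample[i + 1:]
        h_i = kth_nearest_distance(xi, others, k)
        total += normal_pdf((x - xi) / h_i) / h_i
    return total / len(sample)

# Hypothetical sample, for illustration only.
data = [-2.1, -1.3, -0.4, 1.9, 5.1, 6.2]
print(balloon_kde(0.0, data, k=2), pointwise_kde(0.0, data, k=2))
```

In this sketch the pointwise estimate is an average of properly normalised kernels and so integrates to one, whereas the balloon estimate changes its bandwidth with x and in general does not.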

Bandwidth selection for kernel density estimation of heavy-tailed distributions is said to be relatively difficult.

A rule-of-thumb bandwidth estimator

If Gaussian basis functions are used to approximate univariate data, and the underlying density being estimated is Gaussian, the optimal choice for h (that is, the bandwidth that minimises the mean integrated squared error) is

$$h = \left(\frac{4\hat{\sigma}^5}{3n}\right)^{1/5} \approx 1.06\,\hat{\sigma}\,n^{-1/5},$$

where $\hat{\sigma}$ is the standard deviation of the samples. This approximation is termed the normal distribution approximation, Gaussian approximation, or Silverman's (1986) rule of thumb. While this rule of thumb is easy to compute, it should be used with caution as it can yield wildly inaccurate estimates when the density is not close to being normal. For example, consider estimating the bimodal Gaussian mixture

$$f(x) = \frac{1}{2\sqrt{2\pi}}\, e^{-\frac{1}{2}(x-10)^2} + \frac{1}{2\sqrt{2\pi}}\, e^{-\frac{1}{2}(x+10)^2}$$

from a sample of 200 points. The figure below shows the true density and two kernel density estimates: one using the rule-of-thumb bandwidth, and the other using a solve-the-equation bandwidth. The estimate based on the rule-of-thumb bandwidth is significantly oversmoothed. The MATLAB script for this example uses kde.m and is given below.
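
The constant in the rule of thumb can be recovered from the AMISE-optimal bandwidth by taking both the kernel K and the density ƒ to be Gaussian. A short derivation, writing R(g) for the integral of g squared, m_2(K) for the second moment of K, and σ for the standard deviation of ƒ (in practice replaced by the sample standard deviation):

```latex
\[
\begin{aligned}
R(K) &= \frac{1}{2\sqrt{\pi}}, \qquad m_2(K) = 1, \qquad
R(f'') = \frac{3}{8\sqrt{\pi}\,\sigma^{5}}, \\[4pt]
h_{\mathrm{AMISE}}
  &= \left(\frac{R(K)}{m_2(K)^{2}\,R(f'')\,n}\right)^{1/5}
   = \left(\frac{1}{2\sqrt{\pi}}\cdot\frac{8\sqrt{\pi}\,\sigma^{5}}{3n}\right)^{1/5}
   = \left(\frac{4\sigma^{5}}{3n}\right)^{1/5}
   \approx 1.06\,\sigma\,n^{-1/5}.
\end{aligned}
\]
```

The derivation makes the source of the oversmoothing visible: for the bimodal mixture, σ is dominated by the separation between the two modes rather than by the width of either one.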

Comparison between rule of thumb and solve-the-equation bandwidth
% Data
randn('seed',1)                            % Used for reproducibility
data = [randn(100,1)-10; randn(100,1)+10]; % Two Normals mixed
% True
phi = @(x) exp(-.5*x.^2)/sqrt(2*pi);       % Normal Density
tpdf = @(x) phi(x+10)/2+phi(x-10)/2;       % True Density
% Kernel
h = std(data)*(4/3/numel(data))^(1/5);     % Bandwidth estimated by Silverman's Rule of Thumb
kernel = @(x) mean(phi((x-data)/h)/h);     % Kernel Density
kpdf = @(x) arrayfun(kernel,x);            % Elementwise application
% Plot
figure(2), clf, hold on
x = linspace(-25,+25,1000);                % Linear Space
plot(x,tpdf(x))                            % Plot True Density
plot(x,kpdf(x))                            % Plot Kernel Density with Silverman's Rule of Thumb
kde(data)                                  % Plot Kernel Density with Solve-the-Equation Bandwidth

# The same code in R
# Data
set.seed(1)
data <- c(rnorm(100, -10, 1), rnorm(100, 10, 1))
# True
phi <- function(x) exp(-.5 * x^2) / sqrt(2 * pi)
tpdf <- function(x) phi(x + 10) / 2 + phi(x - 10) / 2
# Kernel
h <- sd(data) * (4 / 3 / length(data))^(1 / 5)
Kernel2 <- function(x) mean(phi((x - data) / h) / h)
kpdf <- function(x) sapply(x, Kernel2)
# Plot
x <- seq(-25, 25, length.out = 1000)
plot(x, tpdf(x), type = "l", ylim = c(0, 0.23), col = "red")
par(new = TRUE)
plot(x, kpdf(x), type = "l", ylim = c(0, 0.23), xlab = "", ylab = "", axes = FALSE)
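
The degree of oversmoothing can also be read off numerically rather than from a plot. The Python sketch below repeats the rule-of-thumb part of the example with the standard library only and compares the estimate with the true density at the mode x = 10. The seed is arbitrary, and the random number generator differs from those of MATLAB and R, so the sample is not the same as in the scripts above.

```python
import math
import random
import statistics

def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def true_pdf(x):
    """Equal mixture of N(-10, 1) and N(10, 1)."""
    return 0.5 * phi(x + 10.0) + 0.5 * phi(x - 10.0)

rng = random.Random(1)  # arbitrary seed, used only for reproducibility
data = ([rng.gauss(-10.0, 1.0) for _ in range(100)]
        + [rng.gauss(10.0, 1.0) for _ in range(100)])

# Rule-of-thumb bandwidth: sample standard deviation times (4 / (3 n))^(1/5).
h = statistics.stdev(data) * (4.0 / (3.0 * len(data))) ** 0.2

def kde(x):
    """Gaussian kernel density estimate at x with the rule-of-thumb bandwidth."""
    return sum(phi((x - xi) / h) for xi in data) / (len(data) * h)

print(f"rule-of-thumb bandwidth: {h:.2f}")
print(f"true density at 10: {true_pdf(10.0):.3f}, estimate: {kde(10.0):.3f}")
```

The bandwidth comes out several times larger than the standard deviation of either mixture component, so the estimated peaks are much lower and wider than the true ones.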

Relation to the characteristic function density estimator

Given the sample (x1, x2, …, xn), it is natural to estimate the characteristic function $\varphi(t) = \operatorname{E}[e^{itX}]$ as

$$\hat{\varphi}(t) = \frac{1}{n}\sum_{j=1}^{n} e^{itx_j}.$$

Knowing the characteristic function, it is possible to find the corresponding probability density function through the Fourier transform formula. One difficulty with applying this inversion formula is that it leads to a diverging integral, since the estimate $\hat{\varphi}(t)$ is unreliable for large t's. To circumvent this problem, the estimator $\hat{\varphi}(t)$ is multiplied by a damping function ψ_h(t) = ψ(ht), which is equal to 1 at the origin and then falls to 0 at infinity. The "bandwidth parameter" h controls how fast we try to dampen the function $\hat{\varphi}(t)$. In particular, when h is small, ψ_h(t) will be approximately one for a large range of t's, which means that $\hat{\varphi}(t)$ remains practically unaltered in the most important region of t's.

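The most common choice for the function ψ is either the uniform function ψ(t) = 1{−1 ≤ t ≤ 1}, which effectively means truncating the interval of integration in the inversion formula to [−1/h, 1/h], or the Gaussian function $\psi(t) = e^{-\pi t^2}$. Once the function ψ has been chosen, the inversion formula may be applied, and the density estimator will be

$$\hat{f}(x) = \frac{1}{2\pi}\int_{-\infty}^{+\infty} \hat{\varphi}(t)\,\psi_h(t)\,e^{-itx}\,dt = \frac{1}{2\pi}\int_{-\infty}^{+\infty} \frac{1}{n}\sum_{j=1}^{n} e^{it(x_j - x)}\,\psi(ht)\,dt = \frac{1}{nh}\sum_{j=1}^{n} \frac{1}{2\pi}\int_{-\infty}^{+\infty} e^{-i(ht)\frac{x - x_j}{h}}\,\psi(ht)\,d(ht) = \frac{1}{nh}\sum_{j=1}^{n} K\!\left(\frac{x - x_j}{h}\right),$$

where K is the Fourier transform of the damping function ψ. Thus the kernel density estimator coincides with the characteristic function density estimator.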
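
The equivalence can be verified numerically for the Gaussian damping function. The Python sketch below assumes ψ(t) = exp(−πt²) and the Fourier convention K(u) = (1/2π) ∫ exp(−itu) ψ(t) dt, under which K is a normal density with variance 2π. It evaluates the damped inversion integral by the midpoint rule, using only the real part of the integrand since the imaginary part is odd in t and integrates to zero, and compares the result with the ordinary kernel estimate. The six-point sample is illustrative.

```python
import math

def ecf_real_part(t, sample, x):
    """Real part of phi_hat(t) * exp(-i t x): the mean of cos(t (x_j - x))."""
    return sum(math.cos(t * (xj - x)) for xj in sample) / len(sample)

def damping(t):
    """Gaussian damping function psi(t) = exp(-pi t^2)."""
    return math.exp(-math.pi * t * t)

def cf_density(x, sample, h, steps=4000):
    """(1 / 2 pi) * integral of phi_hat(t) psi(h t) exp(-i t x) dt, by the midpoint rule."""
    t_max = 4.0 / h  # psi(h t) is below 1e-21 beyond this point
    width = 2.0 * t_max / steps
    total = 0.0
    for k in range(steps):
        t = -t_max + (k + 0.5) * width
        total += ecf_real_part(t, sample, x) * damping(h * t)
    return total * width / (2.0 * math.pi)

def kernel(u):
    """Fourier transform of psi under the stated convention: N(0, 2 pi) density."""
    return math.exp(-u * u / (4.0 * math.pi)) / (2.0 * math.pi)

def kde(x, sample, h):
    """Ordinary kernel density estimate with the kernel defined above."""
    return sum(kernel((x - xj) / h) for xj in sample) / (len(sample) * h)

# Hypothetical sample, for illustration only.
data = [-2.1, -1.3, -0.4, 1.9, 5.1, 6.2]
for x in (-1.0, 0.5, 4.0):
    print(x, cf_density(x, data, h=1.0), kde(x, data, h=1.0))
```

The two columns agree to within the accuracy of the numerical integration.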

Statistical implementation

A non-exhaustive list of software implementations of kernel density estimators includes:

  • In Analytica release 4.4, the Smoothing option for PDF results uses KDE, and from expressions it is available via the built-in Pdf function.
  • In CrimeStat, kernel density estimation is implemented using five different kernel functions – normal, uniform, quartic, negative exponential, and triangular. Both single- and dual-kernel density estimate routines are available. Kernel density estimation is also used in interpolating a Head Bang routine, in estimating a two-dimensional Journey-to-crime density function, and in estimating a three-dimensional Bayesian Journey-to-crime estimate.
  • In ELKI, kernel density functions can be found in the package de.lmu.ifi.dbs.elki.math.statistics.kernelfunctions.
  • In ESRI products, kernel density mapping is managed out of the Spatial Analyst toolbox and uses the quartic (biweight) kernel.
  • In gnuplot, kernel density estimation is implemented by the smooth kdensity option. The datafile can contain a weight and bandwidth for each point, or the bandwidth can be set automatically according to "Silverman's rule of thumb" (see above).
  • In IGOR Pro, kernel density estimation is implemented by the StatsKDE operation (added in Igor Pro 7.00). Bandwidth can be user-specified or estimated by means of Silverman, Scott, or Bowman and Azzalini. Kernel types are: Epanechnikov, bi-weight, tri-weight, triangular, Gaussian and rectangular.
  • In JavaScript, the visualization package D3.js offers a KDE package in its science.stats package.
  • In JMP, the Distribution platform can be used to create univariate kernel density estimates, and the Fit Y by X platform can be used to create bivariate kernel density estimates.
  • In the NAG Library, kernel density estimation is implemented via the g10ba routine (available in both the Fortran and the C versions of the Library).
  • In Octave, kernel density estimation is implemented by the kernel_density option (econometrics package).
  • In SAS, proc kde can be used to estimate univariate and bivariate kernel densities.

This article uses material from the Wikipedia article Kernel density estimation, which is released under the Creative Commons Attribution-Share-Alike License 3.0.