Chapter 1 Density Estimation by Histogram

Click Start Over at the left bottom to start Back to Contents

1 Random variable and sample

The familiar random variable \(X\) is the two possible outcomes when we toss a coin. We take two values \(0\) or \(1\) to represent tail or head in the outcome. If the probability that it takes the value \(1\) is \(q\), we have the Bernoulli probability distribution

\[\begin{equation} P(X = x) = q^x (1-q)^{(1-x)} \label{eq:bernoulli} \end{equation}\]

If \(Y\) is the possible number of heads in a total of \(N\) tosses, the probability distribution is given by the Binomial distribution

\[\begin{equation} P(Y=y) = \begin{pmatrix} n\\ y \end{pmatrix} q^y (1-q)^{N-y} \label{eq:binomial} \end{equation}\]

The Binomial distribution extends the Bernoulli distribution from one toss to \(N\) tosses. the The mean and variance of the binomial variable \(Y\) is \(\mu = E(Y) = Nq\) and \(\sigma^2 = V(Y) = Nq(1-q)\), respectively.

A set of \(n\) identical and independent observations \(Y_1, Y_2, \cdots, Y_n\) is called a sample. The total set of observations is called population. Each element in a sample is called a sample point or an observation. In the event of tossing a coin once, a sample point is \(0\) or \(1\). In the event of tossing a coin \(N\) times, a sample point is a combination of \(0\) and \(1\), i.e. \(y_1 = 0, y_2 = 1, \cdots, y_N = 0\).

In a pulling experiment at a given condition including pulling rate, a population can be defined as infinite number of F-D curves under that condition. A sample consists of limited F-D curves collected by a technician under that condition. One of the F-D curves is called an observation or a sample point.

2 The mean squared error

It is usually to use a single estimate \(\hat{f}\) from a single sample to approximate a true value \(f\). The difference between the true value and the estimate \(\hat{f}\) is usually measured by the mean squared error:

\[\begin{equation} \begin{split} \mathbb{E}[(f-\hat{f})^2] & = (\mathbb{E}[\hat{f}] - f)^2 + \mathbb{V}[\hat{f}] \\ & = \mathcal{B}^2 + \mathcal{V} \end{split} \label{eq:mse} \end{equation}\]

where \(\mathcal{B} = \mathbb{E}[\hat{f}] - f\) is called the bias. If \(\mathcal{B} = 0\), the underlying estimator is called unbiased estimator. The relations among the true value, an estimate, and the mean of esimtates are schematically shown in Figure 1.

Figure 1: The schematic drawing of bias and variance.

3 Empirical cumulative distribution function

Given a sample \(x_1, x_2, \cdots, x_n\) with a common cumulative distribution function \(F\), for example the calculated work for a given pulling rate in the pulling experiment, the empirical cumulative distribution function is defined as

\[\begin{equation} \hat{F}_n(t) = \frac{number\, of\, elements\, \leq t}{n} \equiv \frac{1}{n}\sum_{i=1}^{n}\pmb{I}_{(-\infty, t)}(x_i) \label{eq:cdfd} \end{equation}\]

where \(\pmb{I}\) is the indicator function. An indicator function of a subset \(A\) of a set \(X\) is a binary function from \(X\) to \(\{0, 1\}\) defined as

\[ \pmb{I}_A(x) := \begin{cases} 1 & x\in A \\ 0 & x\notin A \end{cases} \]

To plot the empirical cumulative function, we first arrange the \(n\) sample points in ascending order to get \(x_{(1)}, x_{(2)}, \cdots, x_{(n)}\), allowing repeated numbers. Then the empirical cumulative function is a step function that jumps by \(1/n\) at each sample point: \((x_{(1)}, x_{(2)}, \cdots, x_{(n)}) \rightarrow (\frac{1-0.5}{n}, \frac{2-0.5}{n}, \cdots, \frac{n-0.5}{n})\).

Example. A sample consists of ten observations: \(176, 192, 214, 220, 205, 192, 201, 192, 183, 185\). The following chunk plot its empirical cumulative function.

x <- c(176, 192, 214, 220, 205, 192, 201, 192, 183, 185)
n <- length(x)
plot(sort(x), ((1:n) - 0.5)/n, type = 's', ylim = c(0, 1), xlab = 'x: observations', ylab = '', main = 'Empirical Cumulative Distribution')
# add the label on the y-axis
mtext(text = expression(hat(F)[n](x)), side = 2, line = 2.5)

4 Density function estimated from histogram

Since it is natural to treat a rescaled histogram as the estimation of a probability density function, we give a mathematical derivation on the rescaled histograms.

Figure 2: The schematic drawing of a histogram.

Given \(n\) sample points \(\pmb{x}\) and \(M\) bins shown in Figure 2, what is the maximum probability of observing \(n_i\) sample points in the \(i\)th bin? Assume that the probability of observing one sample point in the \(i\)th bin is \(c_i\), the problem is to find optimal \(c_i^*\) to maximize

\[\begin{equation} \mathcal{L}(\pmb{x} \mid \pmb{c}) = \prod_{i=1}^{M}p(n_i \mid c_i) = \prod_{i=1}^{M}c_i^{n_i} \label{eq:cost} \end{equation}\]

The optimization is subject to the density function constraint:

\[\begin{equation} \sum_{i=1}^{M}c_i w_i = 1 \label{eq:constraint} \end{equation}\]

where \(w_i\) is the width of the \(i\)th bin.

Using the method of Lagrange multipliers, the problem can be converted to find optimal \(c_i^*\) to minimize

\[\begin{equation} \begin{split} f(\pmb{c}) & = \log (\prod_{i=1}^{M}c_i^{n_i}) - \lambda (\sum_{i=1}^{M}c_i w_i - 1) \\ & = \sum_{i=1}^{M}n_i \log c_i - \lambda (\sum_{i=1}^{M}c_i w_i - 1) \end{split} \label{eq:lagrange} \end{equation}\]

Here we have taken the log of the cost function (\ref{eq:cost}) so we can deal with sums rather than products. Computing the derivatives and using the constraints, we can find that \(\lambda = n\) and \(c_i^*=\frac{n_i}{n w_i}\).

5 Error estimate

As shown in Section 3, the density of \(x\) in a bin \(b \equiv (x_0, x_0 +h]\) is

\[\begin{equation} \hat{f}_n(x) = \frac{1}{hn}\sum_{i=1}^{n}\pmb{I}_{(x_0, x_0 + h]}(x_i) \label{eq:density} \end{equation}\]

The sum in Equation (\ref{eq:density}) is actually the number of points in the bin \(b\). It is a bionomial random variable \(B \sim Binomial(n, p)\), where \(p\) is the probability of one point falling into the bin \(p = F(x_0 + h) - F(x_0)\). The Binomial distribution tells that the mean and variance of \(B\) is \(np\) and \(np(1-p)\), respectively. So the mean of \(\hat{f}_n(x)\) is

\[\begin{equation} \begin{split} \mathbb{E}\left[ \hat{f}_n(x) \right] & = \frac{1}{nh} \mathbb{E}[B] \\ & = \frac{n [F(x_0 + h) - F(x_0)]}{nh} \\ & = \frac{[F(x_0 + h) - F(x_0)]}{h} \\ & = f(x^*) \end{split} \label{eq:mean} \end{equation}\]

The last equality in Equation (\ref{eq:mean}) is due to the mean value theorem. It implies that if we use smaller and smaller bins as we get more data, the histogram density estimate is unbiased.

The variance of \(\hat{f}_n(x)\) is

\[\begin{equation} \begin{split} \mathbb{V}\left[ \hat{f}_n(x) \right] & = \frac{1}{n^2 h^2} \mathbb{V}\left[ B \right]\\ & = \frac{n[F(x_0 + h) - F(x_0)][1 - F(x_0 + h) + F(x_0)]}{n^2 h^2} \\ & = \frac{nhf(x^*)(1-hf(x^*))}{n^2 h^2} \\ & = \frac{f(x^*)}{nh} - \frac{f(x^*)^2}{n} \end{split} \label{eq:vari} \end{equation}\]

Therefore, the mean squared error (MSE) between the true and estimated density \(\hat{f}_n (x)\) is

\[\begin{equation} \begin{split} \mathbb{E}[(f(x)-\hat{f}_n (x))^2] & = (\mathbb{E}[\hat{f}_n (x)] - f(x))^2 + \mathbb{V}[\hat{f}_n(x)] \\ & = (f'(x^{**})(x^{*}-x))^2 + \frac{f(x^*)}{nh} - \frac{f(x^*)^2}{n} \\ & \leq (Lh)^2 + \frac{f(x^*)}{nh} - \frac{f(x^*)^2}{n} \end{split} \label{eq:msef} \end{equation}\]

where \(L = \underset{x\in [x_0, x_0 + h]}{\max}f'(x)\). Equation (\ref{eq:mse}) implies that the error depends on the choices of bin size \(h\) and the data size. The minimum error bound is reached when

\[\begin{equation} h = \left(\frac{f(x^{*})}{2L^2 n}\right)^{1/3} \label{eq:h} \end{equation}\]

Since \(L\) and \(f(x^{*})\) are unknown, it is an art to choose the right \(h\). The Scott’s rule (1979) sets \(h = 3.5 s n^{-1/3}\), where \(s\) is the sample standard deviation. The Freedman and Diaconis’s (1981) rule sets \(h = 2(IQ)n^{-1/3}\), where \(IQ\) is the sample interquartile range. In R, the default rule is the Sturges’ rule which set the number of bins as \(1 + \log_{2} n\).

6 References

Scott, D.W.(1979) On optimal and data-based histograms. Biometrika, 66, 605–610.

Freedman, D.andDiaconis, P.(1981) On this histogram as a density estimator:L2theory. Zeit. Wahr. ver. Geb.,57, 453–476.

Sturges, H.(1926) The choice of a class-interval.J. Amer. Statist. Assoc., 21, 65–66.

Myles Hollander, Douglas A. Wolfe, Eric Chicken. (2014) Nonparametric Statistical Methods, Third Edition. John Wiley & Sons, Inc.