
Statistical Estimation

Why do we need to estimate? In an ideal world, if one wanted to find out, say, the average height of the adult male population of England, one would measure every adult male in England. In other words, one would study the entire population.

In reality, however, this almost never happens, because the population in question numbers around 20 million. In practice, one would carefully design a sampling strategy to select a limited number of adult males from these 20 million or so.

The next step would be to use the height measurements from this sample to calculate the average height of the sampled adult males. The average height of this sample could then be used as an estimate of the average height of the entire adult male population of England.

Estimating population parameters is only the first step. Once the parameters have been estimated, the next step is to subject those estimates to hypothesis testing.

This is because an estimate may look numerically large without being statistically significant. This is discussed further in the hypothesis testing pages.

At this stage it is useful to be familiar with the terminology commonly used in statistics. The population is the entire group we are interested in and about which we would like to draw inferences. A sample is a subset of data drawn from the population.

Studying this sample helps us to draw inferences about the population. Estimation is the technique of using sample statistics (for example, the sample mean and sample variance) to estimate the corresponding population parameters (for example, the population mean and population variance). Sample statistics used in this way are called estimators of the population parameters.

Prior to studying the characteristics of the estimators, we shall first discuss the common sampling techniques.

A good sampling strategy helps us to draw useful inferences about the population (for example, pre-election surveys) while saving time and money. A poorly designed sampling strategy, however, can give misleading results, leading to wrong and expensive business decisions. There are several techniques available to sample populations.

Simple Random Sampling gives each member of the population an equal chance of being included in the sample. It is the best bias-free sampling technique when all members of the population can be identified and labelled. For example, national lottery draws are made using simple random sampling.

For very large populations, such as the eligible voters of India, this technique may be difficult to implement without powerful computers. In some cases even powerful computers do not help, for example when drawing a sample of unregistered voters in a country. Since no list of unregistered voters is likely to exist, it would be hard to give each of them an equal chance of selection.
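Simple random sampling from a labelled population can be sketched in a few lines of Python. This is only an illustration: the population of 1000 numbered members, the sample size of 50 and the seed are all assumptions made for the example.

```python
import random

# Hypothetical labelled population: member IDs 1..1000 (assumed for illustration).
population = list(range(1, 1001))

# A fixed seed makes the draw reproducible; drop it for a fresh draw each run.
rng = random.Random(42)

# sample() draws without replacement, so every member has the same
# chance of being selected and no member can appear twice.
srs = rng.sample(population, k=50)

print(sorted(srs)[:10])
```

The sketch only works because every member can be listed and labelled first, which is exactly the requirement described above.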

Stratified Random Sampling involves dividing the population into distinct subgroups and then selecting a simple random sample out of each subgroup. This works best if the total population can be divided into homogeneous sub-populations.

For example, to study the spending habits of adults in a country, the total adult population could be divided into working and non-working sub-populations.
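A minimal sketch of stratified random sampling for the working and non-working example might look as follows. The population data, the 10% sampling fraction and the choice to sample the same fraction from each stratum are assumptions made for illustration; the text above only requires a simple random sample from each subgroup.

```python
import random

rng = random.Random(0)  # fixed seed for reproducibility

# Hypothetical population of (person_id, status) pairs: 600 working, 300 non-working.
population = [(i, "working" if i % 3 else "non-working") for i in range(900)]

# Step 1: divide the population into distinct subgroups (strata).
strata = {}
for person_id, status in population:
    strata.setdefault(status, []).append(person_id)

# Step 2: take a simple random sample from each stratum.
# Assumption: the same 10% fraction is drawn from every stratum.
stratified_sample = {
    status: rng.sample(members, k=len(members) // 10)
    for status, members in strata.items()
}

for status, members in stratified_sample.items():
    print(status, len(members))
```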

Systematic Sampling is perhaps the easiest to carry out. In this technique one selects members of the population at a fixed interval. For example, to select 50 boxes out of 100 apple boxes, one would pick the starting box at random from the first two and then include every second box after it. This is an easy technique, but it may introduce bias if the ordering of the population has a pattern that coincides with the sampling interval.
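The apple box example can be sketched as below. The box labels 1 to 100 and the seed are assumptions; the random start is drawn from the first interval so that every box has a chance of being selected.

```python
import random

rng = random.Random(1)  # fixed seed for reproducibility

boxes = list(range(1, 101))         # 100 apple boxes labelled 1..100 (assumed labels)
sample_size = 50
step = len(boxes) // sample_size    # sampling interval: every 2nd box

# Random start within the first interval, then every `step`-th box after it.
start = rng.randrange(step)
systematic_sample = boxes[start::step]

print(systematic_sample[:5], len(systematic_sample))
```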

Cluster Sampling is useful when the population can be partitioned into groups (clusters) that are broadly similar to one another. In many situations the partitioning is a result of physical distance. For example, to study the life expectancy of farmers in a country, instead of treating individual farmers as members of the population, one could treat the homogeneous farming villages as the members of the population and sample whole villages.

Grab sampling is not really a technique. It takes advantage of whichever population members happen to be available at a given place and time. For example, an investigator collects a 10-member sample from a production line by picking the first 10 products. This type of sampling is not recommended, as it usually produces biased results.
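As a rough sketch of the farming village example, the code below treats villages as the sampling units and keeps every farmer in each chosen village. The village and farmer names, the counts and the decision to include all farmers from a selected village (one-stage cluster sampling) are assumptions made for illustration.

```python
import random

rng = random.Random(7)  # fixed seed for reproducibility

# Hypothetical data: 20 villages with 30 farmers each.
villages = {
    f"village_{v}": [f"farmer_{v}_{i}" for i in range(30)]
    for v in range(20)
}

# The villages, not the individual farmers, are the members being sampled.
chosen_villages = rng.sample(sorted(villages), k=4)

# Assumption: every farmer in a chosen village is included in the sample.
cluster_sample = [farmer for v in chosen_villages for farmer in villages[v]]

print(chosen_villages, len(cluster_sample))
```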

Confidence Interval: Once we have estimated one of the population parameters, say the population mean, using the sample mean, we might like to know how good our estimate of the population mean is.

We answer this by calculating the standard error, which is the standard deviation of the sample mean. Conceptually, this means repeating our sampling procedure several times to obtain, say, m samples, and then calculating the sample mean of each one, resulting in m sample means.

The standard deviation of the m sample means about the super mean (the mean of all sample means) is known as the standard error. In practice, however, we estimate the standard error from a single sample by dividing the sample standard deviation (S) by the square root of the number of observations (n) in the sample (that is, S/sqrt(n)).

The standard error is used to calculate a confidence interval for the estimate of the population mean. This is done by obtaining the t-value from a t-table for the required confidence level (usually 95%) and degrees of freedom (in this case, the sample size minus one) and multiplying it by S/sqrt(n). The result is the margin of error: the half-width of the confidence interval around the sample mean at the given confidence level.

We can illustrate this with an example. Consider a sample of size 4 (n) representing the concentration of impure elements in a chemical solution, with percentage values of 10, 23, 32 and 34. In this case the sample mean is 24.75 and the sample standard deviation (S) is 10.94. The t-value for a 95% confidence level and 3 (n-1) degrees of freedom is 3.18. Therefore the 95% confidence limit for the population mean estimate is
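The two routes to the standard error can be compared with a small simulation. This is a sketch under stated assumptions: the population is synthetic (normally distributed heights with mean 175 and standard deviation 7), and the values of n, m and the seed are arbitrary choices for the example.

```python
import random
import statistics

rng = random.Random(123)  # fixed seed for reproducibility

# Hypothetical population of 20,000 heights in cm (assumed normal, mean 175, SD 7).
population = [rng.gauss(175, 7) for _ in range(20_000)]

n = 25     # observations per sample
m = 2000   # number of repeated samples

# Route 1: repeat the sampling m times and take the standard deviation
# of the m sample means about their own mean (the "super mean").
sample_means = [statistics.mean(rng.sample(population, n)) for _ in range(m)]
se_from_repeats = statistics.stdev(sample_means)

# Route 2: estimate the standard error from one sample as S / sqrt(n).
one_sample = rng.sample(population, n)
se_from_one_sample = statistics.stdev(one_sample) / n ** 0.5

print(f"SD of {m} sample means: {se_from_repeats:.3f}")
print(f"S/sqrt(n), one sample:  {se_from_one_sample:.3f}")
```

The two figures should land in the same neighbourhood, although the single-sample estimate varies from sample to sample because S itself is an estimate.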

3.18*10.94/sqrt(4) = 17.39

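In other words, the 95% confidence interval for the population mean runs from 24.75-17.39 = 7.36 to 24.75+17.39 = 42.14. If we were to repeat this experiment over and over again, calculating an interval in the same way each time, then about 95% of those intervals could be expected to contain the true population mean.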
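The arithmetic of the worked example can be checked with a short script. The t-value of 3.18 is hard-coded as quoted in the text (read from a t-table for 95% confidence and 3 degrees of freedom) and is not computed here; the variable names are chosen for illustration.

```python
import math
import statistics

values = [10, 23, 32, 34]   # impurity concentrations (%) from the example
n = len(values)

sample_mean = statistics.mean(values)   # sample mean
s = statistics.stdev(values)            # sample standard deviation (n-1 divisor)

t_value = 3.18   # taken from the text: 95% confidence, n-1 = 3 degrees of freedom

standard_error = s / math.sqrt(n)       # S/sqrt(n)
margin = t_value * standard_error       # confidence limit around the mean

lower, upper = sample_mean - margin, sample_mean + margin

print(f"mean = {sample_mean:.2f}, S = {s:.2f}, margin = {margin:.2f}")
print(f"95% interval: {lower:.2f} to {upper:.2f}")
```

Rounded to two decimals, this reproduces the sample mean of 24.75, the standard deviation of 10.94 and the confidence limit of 17.39 given above.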
