1.3.2: Summarizing the data - Descriptive statistics - Biology

How do you summarize data?

Data is summarized in two main ways: summary calculations and summary visualizations.

Calculations: What types of measures are used?

To be able to interpret patterns in the data, raw data must first be manipulated and summarized into two categories of measurements: Measures of central tendency and Measures of variability. These two categories of measurements encapsulate the first step of scientific inquiry, descriptive statistics.

Measures of central tendency (center) – Provide information about how data cluster around some single middle value. There are two measures of center used most often in biological inquiry:

  • Mean (average) – Sum of all individual values divided by the total number of values in the sample/population. This is the most commonly used measure of center when the distribution is symmetrical, and it is sensitive to outliers.
  • Median – The middle value when the data set is ordered in sequential rank (highest to lowest). This is commonly used when data is skewed and is resistant to outliers.

Measures of variability (spread) – Describes how spread out or dispersed the data are. There are three main measures of spread used in biological inquiry; a short Python sketch computing them follows the list:

  • Range – Quantifies the distance between the largest and smallest data values.
  • Standard deviation – Quantifies the variation or dispersion from the average of a dataset. A low standard deviation indicates that the data tends to be very close to the mean; a high standard deviation indicates that the data points are spread out over a large range of values. This calculation is sensitive to outliers.
  • Standard error – Quantifies the variation in the means from multiple datasets, or across the sampling distribution of your original dataset.
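
To make these calculations concrete, here is a minimal Python sketch computing each of the measures above on a small made-up sample (the values are illustrative, not from a real study):

```python
import statistics
import math

data = [19, 20, 20, 21, 22, 22, 23]  # hypothetical sample values

mean = statistics.mean(data)          # measure of center, sensitive to outliers
median = statistics.median(data)      # measure of center, resistant to outliers
data_range = max(data) - min(data)    # simplest measure of spread
sd = statistics.stdev(data)           # sample standard deviation
se = sd / math.sqrt(len(data))        # standard error of the mean

print(mean, median, data_range, sd, se)
```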

Visualizing the data: How are tables and graphs used?

After all desired descriptive statistics are calculated, they are typically visually summarized into either a table or graph.

Tables:

A table is a set of data values arranged into columns and rows. Typically the columns encompass one broad data category, and the rows encompass another. Within each broad category there are subcategories that determine how many columns and rows the table consists of. Tables are used to both collect and summarize data. However, most of the time when tables are presented, they consist of summarized data, not raw data. Although tables allow summarized data to be presented in an orderly manner, most people prefer to translate tables into the more powerful data visualization tool, a graph.

Graphs:

A graph is a diagram showing the relation between variable quantities, typically two variables, each measured along one of a pair of axes at right angles. Graphs can look like a chart or a drawing. Most graphs use bars, lines, or parts of a circle to display data. Sometimes, however, graphs are overlaid on top of maps to also display geographical location, or are even animated to be interactive.

Major graph type categories:
  • Circle/Pie – A circular chart divided into slices to illustrate numerical proportion. In a pie chart, the arc length of each slice (and consequently its central angle and area), is proportional to the quantity it represents. While it is named for its resemblance to a pie which has been sliced, there are variations on the way it can be presented.
  • Line – A type of chart which displays information as a series of data points called 'markers' connected by straight line segments. It is a basic type of chart common in many fields. It is similar to a scatter plot except that the measurement points are ordered (typically by their x-axis value) and joined with straight line segments. A line chart is often used to visualize a trend in data over intervals of time – a time series – thus the line is often drawn chronologically.

Figure a: Examples of a circle/pie graph (a.) and a line graph (b.). Image created by Rachel Schleiger (CC-BY-NC).

  • Scatter plot – A graph in which the values of two variables are plotted along the horizontal and vertical axes, the pattern of the resulting points revealing any correlation present. The data are displayed as a collection of points, each having the value of one variable determining the position on the horizontal axis and the value of the other variable determining the position on the vertical axis.

Figure b: Example of a scatter plot. Image created by Rachel Schleiger (CC-BY-NC).

  • Bar – A chart or graph that presents categorical data with rectangular bars with heights or lengths proportional to the values that they represent. The bars can be plotted vertically or horizontally.
  • Histogram – An approximate representation of the distribution of numerical data. To construct a histogram, the first step is to "bin" (or "bucket") the range of values—that is, divide the entire range of values into a series of intervals—and then count how many values fall into each interval (this binning step is sketched in code after the figure below). The bins are usually specified as consecutive, non-overlapping intervals of a variable. The bins (intervals) must be adjacent (meaning there are no spaces between them like there are in bar graphs), and are often (but not required to be) of equal size. If the bins are of equal size, a rectangle is erected over each bin with height proportional to the frequency—the number of cases in the bin.

Figure c: Examples of a bar graph (a.) and a histogram (b.) using the same dataset. Image created by Rachel Schleiger (CC-BY-NC).
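
To make the "binning" step of histogram construction concrete, here is a minimal Python sketch (numpy is an assumption; it is not mentioned in this reading, and the values are illustrative only):

```python
import numpy as np

values = [2.1, 3.4, 3.9, 4.2, 4.8, 5.0, 6.7, 7.1, 7.3, 8.8]

# Divide the full range into 4 adjacent, non-overlapping, equal-width bins
# and count how many values fall into each one.
counts, bin_edges = np.histogram(values, bins=4)
print(counts)     # cases per bin (the heights of the histogram rectangles)
print(bin_edges)  # the bin boundaries
```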


Attribution

Rachel Schleiger (CC-BY-NC)



Measures of Central Tendency

In describing a set of data, we are interested in the numerical measures of center and spread. Recall in the cycle time example of Module 1.2, the center of the data appears to be a cycle time of 21 days. That is formally called the measure of central tendency of the data. There are two classic ways to measure the center of a distribution of data: the mean and the median.

Mean
The mean or average is the arithmetic measure of central tendency, and is simply the sum of all of the observations in a set of data divided by the total number of observations. So, the mean of seven cycle times with values 21, 23, 19, 22, 20, 22, 20 is:

Mean = (21 + 23 + 19 + 22 + 20 + 22 + 20) / 7 = 147 / 7 = 21 days

You will read in statistics texts that this is the formula for both the population mean (its symbol is the Greek letter mu) and the sample mean (its symbol is an x with a bar over it, or x-bar). Don't get alarmed - this is not a course with a lot of formulas, symbols and computations - we will let the computer do it. I just want to highlight the computations for the classic measures in descriptive statistics. You will also read in textbooks that measures based on a population are called parameters, and measures based on a sample are called statistics.

Median
The median is called the location measure of central tendency since it is the middle value of an ordered array of observations - 50% of the observations in a set of data fall below the median. Since we have to account for 100% of the data, 50% of the observations fall above the median. So, given the cycle times of 21, 23, 19, 22, 20, 22, 20, first create an ordered array (sorted list of the numbers):

19, 20, 20, 21, 22, 22, 23

The median is the ordered or ranked observation that appears at location = (n + 1)/2, where n is the number of observations in the set of data. For this ordered array, location = (7 + 1)/2 = 4, so the median is the 4th ordered observation, which is 21.

Do you note that the mean and the median are equal? They will be approximately equal (such as within 5% of each other) as long as the distribution of the set of data is somewhat symmetric without extreme values (signals or outliers) to one side or the other. This is an important point. Suppose we had an outlier or signal cycle time of 38 instead of 23. Let's calculate the mean and the median:

Mean = (19 + 20 + 20 + 21 + 22 + 22 + 38) / 7 = 162 / 7 = 23.14

Median = (7 + 1) / 2 = 4th ordered observation, which is 21

The mean was "pulled" to the outlier value of 38 - the mean is not resistant to extreme values, but the median is since it is a location and not arithmetic measure. Note that the mean is about 10% greater than the median: 21 is more representative of the center when the distribution is not symmetric. As an aside, economists generally use the median when reporting the measure of center for housing values since sets of housing data are often skewed by high or low extreme values. It would be misleading or even unethical to report the mean housing value as a measure of center when there are extreme values on the high side - the median is more representative of center in that case. Also note that when the mean > median, the distribution is skewed right - you saw a histogram picture of that in Worksheet 1.2.3 of Module 1.2. Similarly, when the mean < median, the distribution would be skewed left - you saw a histogram picture of that in Worksheet 1.2.4 of Module 1.2.

Before we go on, let me show you how to compute the median with an even set of data. Suppose we have the numbers 19, 20, 20, 21, 22, 22, 23, and 23. With n = 8, the median location is (8 + 1)/2 = 4.5, so the median is the average of the 4th and 5th ordered observations:

Median = (21 + 22) / 2 = 21.5

Mode
Sometimes you hear of the mode as a measure of central tendency. However, the mode is simply the most frequently appearing number in a data set; in the original set of seven cycle times there are actually two modes, 20 and 22. We will not be using the mode as a measure of central tendency, but only to help in the general examination of shape, as discussed in Module 1.2.

Pause and Reflect

The mean and the median are measures of center. The mean and median will be approximately equal for symmetric sets of data. The median is resistant to extreme values, whereas the mean is pulled towards them. When the mean > median, the distribution is skewed right by an extreme value(s) > mean. When the mean < median, the distribution is skewed left by an extreme value(s) < mean. Thus, for sets of data with extreme values, the median is more representative of the center, and is the preferred measure. Alternatively, the extreme values could be removed from the data for separate study, and a new, more accurate arithmetic mean can then be computed.

Range
Equally important to describing the center of a set of data with a numerical measure is describing the spread or variation in the data. There are three classic ways to measure variation. The simplest measure of spread is the range. In the data set 19, 20, 20, 21, 22, 22, 23:

Range = 23 - 19 = 4 days

So there is a spread of four days between the smallest and largest cycle times.

Standard Deviation
The range is not very rich - it ignores all but two numbers. If the distribution is approximately normal or a symmetric bell-shape, the classic measure of variation is the standard deviation. To set the stage for the standard deviation, recall that the mean of these numbers is 21. The number 19 varies from 21 by -2, the number 20 varies by -1, the number 23 by +2, and so forth. The standard deviation is simply a measure of the average of these deviations. To compute the standard deviation for a population of numbers 19, 20, 20, 21, 22, 22, 23, first find the mean (we already did that, it is 21). Now subtract the mean from each number, square the result (to get rid of plusses and minuses), sum the resulting squared deviations, and divide by the number of numbers. This is called the variance of the population. The standard deviation of the population (the symbol is the Greek letter sigma) is the square root of the variance:

Mean = (19 + 20 + 20 + 21 + 22 + 22 + 23) / 7 = 21

Sum of Squared Deviations = (19 - 21)² + (20 - 21)² + (20 - 21)² + (21 - 21)² + (22 - 21)² + (22 - 21)² + (23 - 21)² = 4 + 1 + 1 + 0 + 1 + 1 + 4 = 12

Variance = 12 / 7 = 1.714

Standard Deviation = Square Root of 1.714 = 1.3

The standard deviation for a sample (the symbol is the letter s) is similar to the above, but the variance is computed by dividing the sum of squared deviations by the number of observations minus one (n - 1). In Modules 1.4 and 1.5 we will adjust the sample size to n - 1 again, and refer to this as degrees of freedom. For large sample sizes, the impact of dividing by n - 1 versus dividing by n is negligible. For small samples, it makes a difference, so dividing by n - 1 gives a larger or more conservative estimate of the spread. If the above set of data is a sample, the variance and standard deviation are:

Sample Variance = 12 / 6 = 2.0

Sample Standard Deviation = Square Root of 2.0 = 1.41
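
The arithmetic above translates directly into a short Python sketch, using the cycle times from this module's example:

```python
data = [19, 20, 20, 21, 22, 22, 23]
n = len(data)
mean = sum(data) / n                     # 21
ss = sum((x - mean) ** 2 for x in data)  # sum of squared deviations = 12

pop_variance = ss / n                    # 12 / 7 = 1.714
pop_sd = pop_variance ** 0.5             # about 1.3

sample_variance = ss / (n - 1)           # 12 / 6 = 2.0
sample_sd = sample_variance ** 0.5       # about 1.41
```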

The standard deviation is interpreted with respect to the mean. For symmetric bell-shaped sets of data, the interpretation of the standard deviation is: approximately 95% of the observations fall within 2 standard deviations of the mean.

This can be combined with areas incorporating 1 and 3 standard deviations to completely describe a set of approximately symmetric, bell-shaped data:

68% of the observations fall within the mean ± 1 Std Dev

95% of the observations fall within the mean ± 2 Std Dev

99.7% (almost all, or nearly 100%) of the observations fall within the mean ± 3 Std Dev

This very important property is called the Empirical Rule - it applies to all symmetric bell shaped sets of data. We will apply it later when we get to a larger example.

Interquartile Range
The third measure of spread or variation is the interquartile range (IQR). To compute this measure, we need the 1st and 3rd quartiles (Q1 and Q3, respectively). Twenty-five percent of the ordered observations in a set of data are contained within each quartile, so 25% of the observations are below the first quartile, 50% of the observations are below the 2nd quartile (Q2, also called the median), and 75% of the observations are below the 3rd. The IQR is simply Q3 - Q1. Here are the computations, using the numbers 19, 20, 20, 21, 22, 22, 23.

Q1 = (n + 1)/4 ordered observation = 0.25 * (7 + 1) = 2nd ordered observation, which is the first 20 in the ordered array.

Q3 = 3*(n + 1)/4 ordered observation = 0.75 * (7 + 1) = 6th ordered observation, which is the second 22 in the ordered array.

IQR = Q3 - Q1 = 22 - 20 = 2 days

The middle 50% of the data falls within the IQR. Sometimes, analysts like to expand the location measures of spread by reporting the five number summary: the minimum, Q1, the median, Q3, and the maximum.
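
A quick Python sketch of the quartiles and five number summary (numpy's default percentile interpolation happens to match the (n + 1)/4 hand rule for this particular data set, though the two conventions can differ slightly in general):

```python
import numpy as np

data = [19, 20, 20, 21, 22, 22, 23]
q1, median, q3 = np.percentile(data, [25, 50, 75])    # 20, 21, 22
five_number = (min(data), q1, median, q3, max(data))  # (19, 20, 21, 22, 23)
iqr = q3 - q1                                         # 2 days
```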

Pause and Reflect

The standard deviation and the interquartile range are measures of spread or variation. The standard deviation is associated with the mean and is an accurate measure of spread for symmetric bell-shaped distributions. The IQR is generally associated with the median, and is an accurate measure of spread for non-symmetric distributions. With two numerical measures (mean and standard deviation or median and IQR) we gain much knowledge about a data set - even data sets of thousands of numbers.

Symmetry
There are some quick numerical measures that can be used to supplement frequency distribution charts to determine if the distribution is symmetric and bell-shaped or skewed. They are: the mean should be approximately equal to the median, and the interquartile range should be approximately 1.33 times the standard deviation.

We will try the above rules of thumb when we illustrate the Excel descriptive statistics with a larger sample later in this module.

Variation
The coefficient of variation (the standard deviation divided by the mean, expressed as a percentage) is a useful tool when comparing the variation of one distribution to that of another, especially if the distributions have different units of measurement.

Suppose another company experiences a mean cycle time of 21 days in their supply chain as well, but the standard deviation is 7 days. The coefficient of variation in this case is 7/21 = 33%, which reflects a process with much greater variability than one with a 6.67% coefficient of variation (a standard deviation of 1.4 days on the same mean).

Observations
Relative measures of observations are useful to give observations context within a data set. The percentile is a relative measure of location for a number in an ordered array. We already discussed examples of percentiles, since the first quartile is actually the 25th percentile - 25% of the data in an ordered array are below the 25th percentile. Perhaps you recall your percentile score on college entrance exams. If you scored in the 90th percentile on the SAT, 90 percent of the ordered scores were below yours. To find the 90th percentile, order the observations, then compute 90th percentile = 0.90*(n+1) to find the ordered observation of interest. If n = 780, then we would be looking for the 703rd number.

For distributions that are approximately symmetric and bell-shaped, the Z-Score or Z-Value is a powerful relative measure for observations. The Z-Score is simply the number of standard deviations an observation is from the mean. So when we find the Z-Score of a number, we standardize that number. What is the Z-Score for the number 19.6 in a distribution with a mean of 21 and standard deviation of 1.4?

Z = (19.6 - 21) / 1.4 = -1

We would say that the number 19.6 is 1 standard deviation from (to the left of) the mean of 21.

Z-Scores have associated probabilities. Do you recall that 68% of the observations in a symmetric bell-shaped distribution fall within ±1 standard deviation of the mean? That "1" standard deviation is the Z-Score! Note also the 68% - we can find the probability for getting any Z-Score as long as the set of data is normal or is approximately symmetric and bell-shaped. For example, what is the probability of getting an observation less than 19.6 days? We know that theoretically, the range of data from 19.6 to 22.4 days includes 68% of the data, and since we have to have 100% of the data accounted for, 100% - 68% or 32% of the data must include numbers below 19.6 and above 22.4. Since we assume these distributions are symmetric, 16% of the data is below 19.6 and 16% is above 22.4. So, for a symmetric bell-shaped distribution with mean of 21 and standard deviation of 1.4, the probability of observing a number less than 19.6 is 0.16 or 16%. The probability of observing a number greater than 19.6 is simply 100% - 16% or 84%.
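
Here is the same z-score arithmetic in Python; scipy's norm.cdf supplies the bell-curve probability (scipy is an assumption, since this module uses Excel for these lookups):

```python
from scipy.stats import norm

z = (19.6 - 21) / 1.4  # -1.0
p_below = norm.cdf(z)  # about 0.159, the 16% quoted above
p_above = 1 - p_below  # about 0.841
```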


Using Excel for the Computations

We will use the Data Analysis Tool and Statistical Functions to produce the descriptive statistics we have discussed thus far in Module 1.3.

The quickest way to generate a family of descriptive statistics is to use the Data Analysis Descriptive Statistics Tool. Let's use the data from Worksheet 1.2.1 in Module 1.2. First create a column of the cycle times as shown in that Worksheet (or if you saved that example, copy C1:C31 to a new area of your worksheet past the histogram, such as Column J). I placed the numbers in column J, starting with the title "Time" in Row 1, and the 30 cycle times in J2 to J31. The data doesn't have to be sorted to compute the descriptive statistics, but it can be. Now select Tools from the Standard Toolbar, then Data Analysis from the pulldown menu, then select Descriptive Statistics and follow the dialog box requests as you did in creating the histogram. For output options, select Summary Statistics and Confidence Level for the Mean, leaving the confidence level default at 95%. I also like to put the output close to my data and histogram if I created a histogram for the data. Remember that you only have to enter one cell location for the Output Range, such as L1, and that cell will define the upper left-hand corner of the range of cells needed for the output - just be sure that the worksheet is clear below and to the right of the cell you select. Note that you can put your output on a separate worksheet or even in a separate workbook if you like. You should get Worksheet 1.3.1 below:

We have discussed the Mean, Median, Mode, Standard Deviation, Sample Variance, Range, Minimum, Maximum, Sum and Count (number of observations). You may ignore Kurtosis (a mathematical measure of the concentration of data around the center compared to the tails of the distribution) and Skewness (a mathematical measure of the symmetry of the distribution). You may ignore Standard Error and Confidence Level for now, but we will use them in Module 1.4.

The Descriptive Statistics tool provides everything we have discussed then, except Q3 and Q1. To get these, we need to use the Function feature of Excel. As for all Excel Functions, position the cursor in a cell or select the cell in which you want the quartile, for example M31. Select Insert from the Standard Toolbar, then Function, then Statistical (in the left box of the Paste Function dialog screen), then Quartile (scroll down to find Quartile in the right box of the dialog screen), then follow the dialog screen by inserting the cell range for your data (e.g., L2:L30), and 1 for Q1. The resulting cell formula is =QUARTILE(L2:L30, 1). Note that no labels are allowed, so I started the cell range with L2 rather than L1. You should get the results in Worksheet 1.3.2. I repeated this process for the third quartile, Q3, which I placed in M32. I also added the formula =(M32 - M31) in cell M33 to get the interquartile range. I added the labels in cells L31, L32, and L33 for clarity since the function feature does not insert the title of the function.

Let's use the above descriptive statistics to completely describe the set of data.

Location Measures of Center and Spread
Fifty percent of the ordered observations fall below the median of 21 days. Twenty-five percent of the ordered observations fall below the first quartile of 19 days and 75% of the ordered observations fall below the third quartile of 22 days. The interquartile range is 3 days from 19 to 22, representing the middle 50% of the ordered observations. The five number summary is complete when we add the minimum value of 16 and maximum of 29 days.

Arithmetic Measures of Center and Spread
The mean is 21.07, or simply 21 days depending on the accuracy we need for reporting purposes. The standard deviation is 2.94 or, rounded, 3 days. The interpretation: most, or 95%, of the observations fall within the interval mean ± 2s = 21 ± (2 * 3), or 15 to 27 days. To use the Empirical Rule to completely summarize a set of data which is approximately symmetric and bell-shaped, and having a mean of 21 and standard deviation of 3:

68% of the observations are within 21 ± (1 * 3): 18 to 24 days.

95% of the observations are within 21 ± (2 * 3): 15 to 27 days.

100% of the observations are within 21 ± (3 * 3): 12 to 30 days.

From the above, we would not expect any observations below 12 days or above 30. Such observations would be outliers.

Percentile
To find the percentile using Excel, select Insert on the Standard Toolbar, then Function, then Statistical, then Percentile, and then respond to the dialogue box by entering the cell range for your numbers and the desired percentile. For example, to get the 25th percentile for this illustration, the resulting cell formula will be: =PERCENTILE(L2:L31,0.25). The percentile will be placed in your spreadsheet wherever you have the cursor (active cell).

The Z-Score
What is the Z-Score for the number 18?

Z = (18 - 21) / 3 = -1

So, 18 is one standard deviation to the left of the mean of 21. What is the Z-Score for the number 24?

Z = (24 - 21) / 3 = +1

The number 24 is one standard deviation to the right of the mean. To get the Z-Score from Excel, we use the Standardize function. Position the cursor or point and click in a cell in which you want to place the Z-Score, such as Q2. Select Insert from the Standard Toolbar, then Function, then Statistical, then Standardize and respond to the dialog box questions. You should replicate the cell formula =STANDARDIZE(18,21,3) which gives a Z-Score of -1. I actually prefer to use the cell address for the number 18 so that I can copy the standardize cell reference to standardize a whole row or column of numbers. This would be =STANDARDIZE(L6,21,3). I then add a title to the column, such as "Z-Scores." The fifth requirement in Project Assignment 1 is to standardize (generate Z-Scores) for your column of data.

Z-Scores are very handy ways of identifying outliers. Any Z-Score below -3 or above +3 would identify a number more than 3 standard deviations from the mean. The sixth requirement in Project Assignment 1 is to remove any outliers from your data, and re-compute the descriptive statistics. Outliers are quickly identified by Z-Scores. Removing an outlier from a real data set doesn't mean we can ignore it for analysis - it just means we should analyze it separately from the data.

A side note: to remove a data element from a column of numbers in Excel means to remove it, not type over the number with a zero. To remove a number, point and click to it, select Edit on the Standard Toolbar, then Clear from the pulldown menu, then select Contents. This creates a blank space where the number once resided. If you want to remove the number and the space, select Edit, then Delete from the pulldown menu, then check Shift cells up if you have a column of data, or Shift cells left if you have a row of data.

The Probability for the Z-Score
Excel has a statistical function that automatically computes the probabilities for Z-Scores. Let's find the probability of getting a cycle time below 18 days (remember, time is a continuous variable, so a number below 18 could be 17.9999). If you know the Z-Score for a number, the NORMSDIST function will give a cumulative probability for the area under a symmetric bell-shaped distribution up to that Z-Score. NORMSDIST stands for Standardized Normal Distribution. Position the cursor or point and click in a cell in which you want the probability. Select Insert, then Function, then Statistical, then NORMSDIST and simply enter the Z-Score of -1. You should get 0.158655 or 0.16 or 16%. The probability of obtaining an observation less than 18 days is 0.16 or 16%. What is the probability of obtaining an observation GREATER than 18? You guessed it: 100% - 16% = 84%.

The seventh item in Project Assignment 1 is to compute the probability of exceeding the 45th ordered (sorted) observation in your data set. If you have more or less than 50 observations, compute the process capability for the 5th from the last ordered observation.

There are some alternative Excel procedures and manipulations, but we have described the important basics. What if you want the probability of getting a cycle time below 18 days but do not know the Z-Score? Use the Excel NORMDIST function. Point and click to a cell where you want the probability, select Insert from the Standard Toolbar, then Function, then Statistical, then NORMDIST and respond to the dialog box questions by entering the number, mean, standard deviation and the word TRUE. The cell formula would look like this: =NORMDIST(18,21,3,TRUE).
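
For readers working outside Excel, the same cumulative probability can be sketched in Python with scipy (an assumption, as before), passing the raw value plus the mean and standard deviation just as NORMDIST does:

```python
from scipy.stats import norm

p = norm.cdf(18, loc=21, scale=3)  # about 0.1587, matching =NORMDIST(18,21,3,TRUE)
```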

The Normal Distribution
Before we close this set of notes with one more example, let me say a few words about the normal distribution. The mean, standard deviation, Z-Scores and normal probabilities or probabilities for Z-Scores all depend on the distribution being approximately symmetric bell-shaped or "Normal". The distributions will never be perfectly bell-shaped in real life, but they don't have to be - approximately normal is fine, especially if we have at least 30 observations. So, how do you know if the distribution is normal? Here are some general rules of thumb:

1). Check the histogram to see if it has a symmetric bell shape, without outliers (any observation with a Z-Score greater than +3 or less than -3).

2). The mean should be approximately equal to the median. For this example, 21.07 is approximately equal to 21.

3). The interquartile range should be close to 1.33 times the standard deviation. Here, the IQR of 3 is fairly close to 1.33 times 2.94 (3.9) although we would like this somewhat closer.

4) Check theoretical ranges of the empirical rule and compare them to the actual data. 68% of the data should be between 18 and 24. The actual count is 26 out of 30 or 80%.

95% of the data should be between 15 and 27. The actual count is 29 out of 30 or 96.7%.

100% of the data should be between 12 and 30. The actual count is 30 out of 30, or 100%.

Most of the rules of thumb work for this cycle time example. Let's look at an example where there is an outlier.


Putting it Together with One More Example

What if the 31st observation was 38? We already know that 100% of the observations should fall within the range of 12 to 30, so 38 is more than 3 standard deviations from the mean - do you remember what we called that - yes, an outlier or a signal.

Process Control Chart
Suppose you created a process control chart to monitor the cycle time process based on the original 30 observations. The mean of the chart is 21; the upper control limit for the process is the mean plus three standard deviations, or 21 + (3 * 3) = 30. The lower control limit is the mean minus three standard deviations, or 21 - (3 * 3) = 12. These are the upper and lower control limits for a process since we expect nearly 100% of the observations generated by the process to be within the area of the mean ± 3 Std Dev. The Upper Specification Limit of 24 is set by the boss or the customer, or whomever - it is not set by statistics. Here is the process control chart:

Note that the 31st observation is above the UCL and would justify investigation as a signal that the process has gone out of control. Let's confirm this by finding the probability of getting an observation of 38 or greater, with a process mean of 21 and standard deviation of 3. First, the Z-Score computation:

Z = (38 - 21) / 3 = 5.67

Now, find the probability using the NORMSDIST function of Excel. The cell formula is:

=NORMSDIST(5.67), which returns 0.99999999.

Recall, this is the cumulative probability of observing a number LESS than 38. So to get the probability of observing a number greater than 38, take 1.0 - 0.99999999 = 0.00000001. There is essentially a 0% probability of observing a 38 if the mean is 21 and the standard deviation is 3 - that's why it's called a signal or outlier.

Before leaving the process control chart, let's focus for a moment on the upper specification limit, set usually by the boss or customer. The USL is 24. Way back at the beginning of this note set, Module 1.1, we indicated that there is no way we can please the boss with an upper specification limit of 24. Now you should be able to say why that is so. With a mean of 21 and a standard deviation of 3, we know that 24 is +1 standard deviation from the mean (Z-Score = (24 - 21)/3 = +1). The probability of getting an observation above 24 is found by first getting the probability of a number less than 24:

=NORMSDIST(1), which returns 0.84.

This process is only capable of satisfying the customer 84% of the time - that is a measure of process capability. Now, to get the probability of finding a number above 24: 1.0 - 0.84 = 0.16. There is a 16% probability of getting an observation above 24 - that is why the customer will not be pleased. An upper specification limit one standard deviation above the mean is a 1 sigma process - GE's goal is to have SIX SIGMA processes - specification limits 6 standard deviations from the mean.

Descriptive Statistics
What if you were just starting out and collected 31 observations and had no prior knowledge of the mean, standard deviation, median, IQR or histogram? You would first generate the histogram, as shown in Worksheet 1.3.4 below.

By observation, we would say there is an outlier to the right of the range of the data, thus making a distribution that is skewed right. If we needed to report the center and spread of the skewed distribution, we would use... that's right, the median and interquartile range. Here are the descriptive statistics:


What is Descriptive Statistics?

Descriptive statistics involves summarizing and organizing the data so they can be easily understood. Descriptive statistics, unlike inferential statistics, seeks to describe the data, but does not attempt to make inferences from the sample to the whole population. Here, we typically describe the data in a sample. This generally means that descriptive statistics, unlike inferential statistics, is not developed on the basis of probability theory.


Content Preview

Let us take a look at an example. In 1985, the USDA commissioned a study of women's nutrition. Nutrient intake was measured for a random sample of 737 women aged 25-50 years. The following variables were measured: calcium, iron, protein, vitamin A, and vitamin C.

Using Technology

Using SAS

We will use a SAS program to carry out the calculations that we would like to see.

The lines of this program are saved in a simple text file with a .sas file extension. If you have SAS installed on the machine to which you have downloaded this file, opening it should launch SAS and open the program within the SAS application. Marking up a printout of the SAS program is also a good strategy for learning how the program is put together.

The video will walk you through the various parts of the code.

The first part of this SAS output (download below) is the results of the Means Procedure - proc means. Because the SAS output is usually a relatively long document, printing these pages of output and marking them up with notes is highly recommended, if not required!

Example: Nutrient Intake Data - Descriptive Statistics


Download the SAS Output file: nutrient2.lst

The first column of the Means Procedure table above gives the variable name. The second column reports the sample size. This is then followed by the sample means (third column) and the sample standard deviations (fourth column) for each variable. I have copied these values into the table below. I have also rounded these numbers a bit to make them easier to use for this example.

Using Minitab

Click on the graphic or the link below to walk through how to find descriptive statistics for the Women's Nutrition dataset in Minitab.

Video: Descriptive Statistics in Minitab

Analysis

Descriptive Statistics

A summary of the descriptive statistics is given here for ease of reference.

Variable Mean Standard Deviation
Calcium 624.0 mg 397.3 mg
Iron 11.1 mg 6.0 mg
Protein 65.8 g 30.6 g
Vitamin A 839.6 μg 1634.0 μg
Vitamin C 78.9 mg 73.6 mg

Notice that the standard deviations are large relative to their respective means, especially for vitamins A and C. This would indicate high variability among women in nutrient intake. However, whether the standard deviations are relatively large or not will depend on the context of the application. Skill in interpreting the statistical analysis depends very much on the researcher's subject matter knowledge.

The variance-covariance matrix is also copied into the matrix below.

\(S = \begin{pmatrix} 157829.4 & 940.1 & 6075.8 & 102411.1 & 6701.6 \\ 940.1 & 35.8 & 114.1 & 2383.2 & 137.7 \\ 6075.8 & 114.1 & 934.9 & 7330.1 & 477.2 \\ 102411.1 & 2383.2 & 7330.1 & 2668452.4 & 22063.3 \\ 6701.6 & 137.7 & 477.2 & 22063.3 & 5416.3 \end{pmatrix}\)

Interpretation

Because the covariance between calcium and iron, \(s_{12} = 940.1\), is positive, we see that calcium intake tends to increase with increasing iron intake. The strength of this positive association can only be judged by comparing \(s_{12}\) to the product of the sample standard deviations for calcium and iron. This comparison is most readily accomplished by looking at the sample correlation between the two variables.

  • The sample variances are given by the diagonal elements of S. For example, the variance of iron intake is \(s_2^2 = 35.8\ \text{mg}^2\).
  • The covariances are given by the off-diagonal elements of S. For example, the covariance between calcium and iron intake is \(s_{12} = 940.1\).
  • Note that the covariances are all positive, indicating that the daily intake of each nutrient increases with increased intake of the remaining nutrients.

Sample Correlations

The sample correlations are included in the table below.

Calcium Iron Protein Vit. A Vit. C
Calcium 1.000 0.395 0.500 0.158 0.229
Iron 0.395 1.000 0.623 0.244 0.313
Protein 0.500 0.623 1.000 0.147 0.212
Vit. A 0.158 0.244 0.147 1.000 0.184
Vit. C 0.229 0.313 0.212 0.184 1.000

Here we can see that the correlation between each variable and itself is equal to one (the diagonal elements), while the off-diagonal elements give the correlation between each pair of variables.

Generally, we look for the strongest correlations first. The results above suggest that protein, iron, and calcium are all positively associated. Each of these three nutrients intake increases with increasing values of the remaining two.

The coefficient of determination is another measure of association and is simply equal to the square of the correlation. For example, in this case, the coefficient of determination between protein and iron is ((0.623)^2) or about 0.388.

This says that about 39% of the variation in iron intake is explained by protein intake. Conversely, about 39% of the variation in protein intake is explained by iron intake. Both interpretations are equivalent.
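
A sketch of how covariance and correlation matrices like those above could be produced with pandas; the tiny DataFrame here is hypothetical stand-in data, not the USDA sample:

```python
import pandas as pd

nutrient = pd.DataFrame({
    "Calcium": [624, 500, 850, 410, 730],
    "Iron":    [11.1, 9.0, 14.2, 7.5, 12.0],
    "Protein": [65.8, 52.0, 80.1, 48.3, 70.2],
})

S = nutrient.cov()   # sample variance-covariance matrix
R = nutrient.corr()  # sample correlation matrix

r = R.loc["Protein", "Iron"]
r_squared = r ** 2   # coefficient of determination
```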


Content Preview

Frequency tables, pie charts, and bar charts can all be used to display data concerning one categorical (i.e., nominal- or ordinal-level) variable. Below are descriptions for each along with some examples. At the end of this lesson you will learn how to construct each of these using Minitab Express.

A frequency table contains the counts of how often each value occurs in the dataset. Some statistical software, such as Minitab Express, will use the term tally to describe a frequency table. Frequency tables are most commonly used with nominal- and ordinal-level variables, though they may also be used with interval- or ratio-level variables if there are a limited number of possible outcomes.

In addition to containing counts, some frequency tables may also include the percent of the dataset that falls into each category, and some may include cumulative values. A cumulative count is the number of cases in that category and all previous categories. A cumulative percent is the percent in that category and all previous categories. Cumulative counts and cumulative percentages should only be presented when the data are at least ordinal-level.

The first example is a frequency table displaying the counts and percentages for Penn State undergraduate student enrollment by campus. Because this is a nominal-level variable, cumulative values were not included.

Frequencies of Campus
Campus Count Percent
University Park 40,639 50.1%
Commonwealth Campuses 27,100 33.4%
PA College of Technology 4,981 6.1%
World Campus 8,360 10.3%
Total 81,080 100%

Penn State Fall 2019 Undergraduate Enrollments

The next example is a frequency table for an ordinal-level variable: class standing. Because ordinal-level variables have a meaningful order, we sometimes want to look at the cumulative counts or cumulative percents, which tell us the number or percent of cases at or below that level.

As an example, let's interpret the values in the "Sophomore" row. There are 22 sophomore students in this sample. There are 27 students who are sophomore or below (i.e., first-year or sophomore). In terms of percentages, 34.4% of students are sophomores and 42.2% of students are sophomores or below.

Frequencies of Class Standing
Class Standing Count Cumulative Count Percent Cumulative Percent
First-Year 5 5 7.8% 7.8%
Sophomore 22 27 34.4% 42.2%
Junior 17 44 26.6% 68.8%
Senior 20 64 31.3% 100.0%
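
A frequency table like this one can be sketched in pandas; the class-standing values below are a small hypothetical sample, not the data behind the table above:

```python
import pandas as pd

standing = pd.Series(["Sophomore", "Junior", "First-Year", "Senior", "Sophomore"])
order = ["First-Year", "Sophomore", "Junior", "Senior"]  # meaningful (ordinal) order

counts = standing.value_counts().reindex(order).fillna(0).astype(int)
table = pd.DataFrame({
    "Count": counts,
    "Cumulative Count": counts.cumsum(),
    "Percent": (counts / counts.sum() * 100).round(1),
    "Cumulative Percent": (counts.cumsum() / counts.sum() * 100).round(1),
})
print(table)
```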

A pie chart displays data concerning one categorical variable by partitioning a circle into "slices" that represent the proportion in each category. When constructing a pie chart, pay special attention to the colors being used to ensure that it is accessible to individuals with different types of colorblindness.

Figure: Pie chart of Penn State Fall 2019 undergraduate enrollments by campus: University Park (48.5%), Commonwealth Campuses (34.9%), PA College of Technology (6.5%), World Campus (10.1%).

A bar chart is a graph that can be used to display data concerning one nominal- or ordinal-level variable. The bars, which may be vertical or horizontal, symbolize the number of cases in each category. Note that the bars on a bar chart are separated by spaces; this communicates that the variable is categorical.

The first example below is a bar chart with vertical bars. The second example is a bar chart with horizontal bars. Both examples are displaying the same data. On both charts, the size of the bar represents the number of cases in that category.

Figures: Penn State Fall 2019 Undergraduate Enrollments, shown first as a vertical bar chart and then as a horizontal bar chart.

Pie charts tend to work best when there are only a few categories. If a variable has many categories, a pie chart may be difficult to read. In those cases, a frequency table or bar chart may be more appropriate. Each visual display has its own strengths and weaknesses. When first starting out, you may need to make a few different types of displays to determine which most clearly communicates your data.


Review Questions

Questions (a) to (c) refer to the following figure: NOTE: Connect the A points with a smooth curve to form distribution A, the B points with a smooth curve to form distribution B, and the C points with a smooth curve to form distribution C.

a. In the figure, which distribution's mean differs from the mean of the other two distributions?
b. In the figure, which distribution has the smallest standard deviation?
c. In the figure, is it likely that the mean of distribution A corresponds closely with the mode of distribution B?

Define the following term and give an example of its use. Your example should not be one given in class or in a handout.

Define the following term and give an example of its use. Your example should not be one given in class or in a handout.

Explain briefly how you would decide which of the following two events is the more unusual

In what sense is the mean of any distribution the "best guess" of the score of any single individual selected at random from the group?

The heights of a sample of ten people are:

67 73 70 60 67 66 68 71 70 67.

Which are the correct limits for the frequency table given below?

Ms. Sweetwater's biology class had a standard deviation of 2.4 on a standardized test, while Ms. Quincy's biology class had a standard deviation of 1.2 on the same test. What can be said about these two classes?

A large mass of data can best be summarized pictorially by means of

For a symmetric distribution, the mean and median are

Consider the following data:

The mean and median for this data are

A distribution of 6 scores has a median of 21. If the highest score increases 3 points, the median will become ___________

Consider the following data:

53, 61, 38, 65, 72, 58, 52, 63, 69, 74, 66

You are given that SUM(i = 1, 11)(Y(i) - YBAR)**2 = 1082 and SUM(i = 1, 11)(Y(i)) = 671.

i) Find YBAR, the sample mean. a) 67.1
b) 98.4
c) 108.2
d) 61.0
e) None of the above

ii) Find S**2, the sample variance.

If you are told a population has a mean of 25 and a variance of 0, what must you conclude?

True or False? If False, correct it.

The sample mean of the following sample

If the mean, median and mode of a distribution are 5, 6, 7 respectively, then the distribution is

Which of the following measures of central tendency tends to be most influenced by an extreme score?

In a frequency distribution of 250 scores, the mean is reported as 78 and the median as 65. One would expect this distribution to be

The measure of central tendency which is sensitive to extreme scores on the higher or lower end of a distribution is the

Which of the following is not a measure of central tendency?

In a group of 12 scores, the largest score is increased by 36 points. What effect will this have on the mean of the scores?

The quantity SUM(X - XBAR) is not used as a measure of dispersion because it is

In popular usage, the term average may refer to

The mean of the following data is:

Consider the following data:

The mean XBAR of the data above is

A sample of 5 persons with hypertension underwent a special blood-pressure-reducing treatment program which resulted in the following reductions in systolic blood pressure for these persons (i.e. the scores give SBP after treatment - SBP before treatment): -5, 10, 20, 5, 10. The mean of this sample is

If a teacher computes the mean for a set of test scores and then subtracts this mean from each score, the SUM of the resulting set of difference scores will equal

In a set of 10 scores the value 2 occurs three times, the value 4 occurs twice, 6 occurs twice, and 7 occurs three times. What is the mean of the scores?

The following data represent scores of 50 students in a calculus test.

a. For each of the samples listed below obtain

1. a mean
2. a variance, and
3. a standard deviation Each sample was randomly obtained from the production of the hot dog manufacturer listed.

Below are measurements of characteristics for two samples of interest, For each characteristic obtain

a. Mean
b. Variance
c. Standard deviation
d. Frequency plot

Two workers on the same job show the following results over a long period of time.

The sample variance of the following sample of five numbers 3,3,3,3,3 is

The sample variance of the following sample of five numbers 1,2,3,4,5 is

The variance, S(Y)**2, of the numbers 4, 6 is

Let us define a new statistic as the distance between 70th sample percentile and 30th sample percentile. This new statistic would give us information concerning

Which one of the following CANNOT be used as a measure of dispersion?

True or False? If False, explain why.

Which of the following relations is always correct?

If a constant were to be added to a set of scores, the standard deviation would

Increasing the frequencies in the tails of a distribution will

If the variance of a distribution is 9, the standard deviation is

The standard deviation of a group of scores is 10. If 5 were subtracted from each score, the standard deviation of the new scores would be

A sample of 5 persons with hypertension underwent a special blood- pressure-reducing treatment program which resulted in the following values giving reduction in systolic blood pressure for these persons (i.e. the scores give SBP after treatment - SBP before treatment): -5, 10, 20, 5, 10.

Suppose for a second sample of 5 persons, the sample mean is 10, and the sample variance is 25. Then which of the following statements about this second sample is not correct?

The following set of scores is obtained on a test, X:

4, 6, 8, 9, 11, 13, 16, 24, 24, 24, 26.

The teacher computes all of the descriptive indices of central tendency and variability on these data, then discovers that an error was made, and one of the 24's is actually a 17. Which of the following indices will be changed from the original computation?

Each year, during a period of seven years, Mrs. Smith gave birth to a child. The standard deviation of the ages (in whole years) of the 7 children of the family Smith is equal to

What is the standard deviation for the following set of scores:

Calculate the variance and standard deviation for the data given:

Listed below are two sample data sets, S and T. Which data set has the larger standard deviation?
(Hint: you can answer this question by inspecting the two data sets. But if you are not sure after inspection, calculate the standard deviation.)

True or False? If false, correct it.

True or False? If False, correct it.

True or False? If False, correct it.

True or false? If false, explain why.

Consider the following data:

53, 61, 38, 65, 72, 58, 52, 63, 69, 74, 66.

You are given that SUM(i = 1,11)([Y(i) - YBAR]**2) = 1082 and SUM(i = 1,11)(Y(i)) = 671. Find S**2, the unbiased estimator of the sample variance.

Consider the following data:

The variance, S**2, of this data is closest to:

The variance for the sample [47.1, 33.1, 26.1, 40.1, 54.1] is:

The variance of a group of 10 scores was 16. If 2 were subtracted from each score, the variance of the new scores would be:

The scores that have the greatest effect on the value of the variance are those

The sample variance is calculated as the average of the squared deviations of all the scores from the mean:

Why is the numerator squared?

True or False? If false correct it.

On a final examination, the following scores were earned:

Use these data to answer the following 4 questions.

A report states that a measurement is approximately normally distributed with mean 3.5 and variance 1. Further, it states that measurements were recorded for 7 measurement classes 0-1,1-2, etc.

18, 13, 2, 20, 8, 10, 5, 10, 6, 9, 10, 20, 2 15, 16, 16, 13, 10, 17, 10, 3, 2, 15, 8, 5

Suppose you are given a data set to analyze. The data consist of 1000 observations on one variable, the height of the subject being interviewed. Assume half the subjects are male and half are female. The sample mean height of the males is larger than the sample mean height of the females, while the sample standard deviation of the females is larger than the sample standard deviation of the males. Both histograms are approximately symmetric.

A friend of yours heard that you were taking statistics and has presented you with the following table from which he wants you to construct a histogram.

TABLE A The following table is a cross-tabulation of age and reading speed of 100 pupils. In Table A, what proportion of those whose reading speed was more than 33 were aged between 104 and 113 months?

Both a frequency diagram and a relative frequency diagram are drawn for the following data:

A list of the percentages of the total number of cases observed at each score value or each subinterval of scores is

If our lowest score were 40, and the highest score were 189, n=200, and we decide to group our scores into 15 class intervals for a frequency distribution, the width of each interval would be

A reading test with 50 possible points yields a bell-shaped distribution with scores ranging from 5 to 48 on a large sample of third graders. If the same test were administered to fifth graders, what would we expect the form of the frequency distribution to be?

A frequency distribution provides the following information:

True or False? If False, correct it.

A percentile score of 40 indicates that a person

If a person earned a score higher than 35 persons in his class of 50 students, what is his percentile score?

The scores on a midterm examination are presented below in decreasing order of magnitude. A score of 63 is approximately equivalent to a percentile rank of ________?

1) The frequency of 38 in the interval 48-56 means

a. 38 frequencies are at the upper real limit of the interval.
b. 38 frequencies are at the lower real limit of the interval.
c. 38 frequencies are spread out throughout the interval.
d. 38 frequencies are at the upper apparent limit of the interval.
e. 38 frequencies are at the lower apparent limit of the interval.

2) A cumulative percentage of 97.5 means that

a. 97.5 cases fall below a score of 74.
b. 97.5% of the cases fall below a score of 74.
c. 97.5% of the cases fall below a score of 65.5.
d. 97.5% of the cases fall below a midpoint of the interval 66-74.
e. 97.5% of the cases fall below a score of 74.5.

3) The score above which 35% of the cases are found is

Complete this sentence: "The kth percentile of a given distribution is . "

Edith G. obtained a score of 65 in a statistics test, placing her at the 78th percentile. If five points were added to each score in the distribution, her new score would be at the

If a given score is at the 30th percentile for reference group A and the 60th percentile for reference group B, which of the following is most likely true?

If 40% of a group obtain scores below 70, the percentile rank of the score is

The following data are the number of hours worked per week by seven State College students:

Half (50%) of the values in a distribution are

For items (i)-(iii) use the following graph.

(i) What is the estimated percent of 8th grade pupils whose arithmetic scores fall below the median score for grade 7?

a. 6
b. 12
c. 16
d. 24
e. It is impossible to estimate this percent from the ogives.

(ii) What would be the shape of the frequency distribution corresponding to the 8th grade ogive?

Suppose that the 60th percentile of a sample was 1468.3.

True or False? If false, correct it.

True or False? If false, correct it.

Frequency distributions are useful for ALL BUT which of the following objectives?

The mean of the population of ten scores:

78, 91, 91, 94, 74, 23, 63, 22, 78, 89 is 70.3, and the modes are 78 and 91.

The skewness of the population is

The distribution of entrance test scores of freshmen in a particular university has the following percentile scores. How may the distribution be described?

A graphical presentation may accomplish ALL BUT which of the following objectives?


Role of Statistics in Research

The role of statistics in research is to serve as a tool in analyzing and summarizing a large volume of raw data and coming up with conclusions about the tests being made. The study of statistics is classified into two main branches: descriptive statistics and inferential statistics. Inferential statistics are used for hypothesis testing and estimating the parameters of a population, while descriptive statistics is the way of summarizing and organizing sets of data to make them more easily understood by the audience they are meant for. It often describes information through patterns and graphs.

Descriptive statistics is the first and foremost step in data analysis, as it is difficult to analyze raw data in large volumes. Before you are able to go further in your research, you have to first gather and simplify your data sets.

There are two methods in descriptive statistics: the numerical method and the graphical method.


Descriptive Statistics with Python

There are a few ways to get descriptive statistics using Python. Below we will show how to get descriptive statistics using Pandas and Researchpy. First, let's import an example data set.

Pandas

Continuous variables

This method returns many useful descriptive statistics with a mix of measures of central tendency and measures of variability. This includes the number of non-missing observations; the mean; the standard deviation; the minimum value; the 25th, 50th (a.k.a. the median), and 75th percentiles; and the maximum value. It's missing some useful information that is typically desired regarding the mean, namely the standard error and the 95% confidence interval. No worries though: pairing this with Researchpy's summary_cont() method provides the descriptive statistic information that is wanted - this method will be shown later.
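
A minimal sketch of describe(), assuming a DataFrame df with a bp_before column like the blood-pressure example summarized later (the ten values here are made up):

```python
import pandas as pd

df = pd.DataFrame({"bp_before": [143, 163, 153, 153, 146, 150, 148, 188, 153, 158]})
print(df["bp_before"].describe())
# count, mean, std, min, 25%, 50% (the median), 75%, max
```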

Categorical variables

Using both the describe() and value_counts() methods is useful since they complement each other with the information returned. The describe() method says that "Female" occurs more than "Male", but one can see that is not the case since they both occur an equal amount.
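
A sketch of the pairing, using a hypothetical sex column with an equal split:

```python
import pandas as pd

df = pd.DataFrame({"sex": ["Female", "Male", "Female", "Male"]})
print(df["sex"].describe())      # count, unique, top, freq
print(df["sex"].value_counts())  # Female 2, Male 2 -- an equal split
```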

For more information about these methods, please see their official documentation page for describe() and value_counts().

Distribution measures
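
A sketch of the two pandas methods referenced below, on the same hypothetical column:

```python
import pandas as pd

df = pd.DataFrame({"bp_before": [143, 163, 153, 153, 146, 150, 148, 188, 153, 158]})
print(df["bp_before"].kurtosis())  # sample excess kurtosis
print(df["bp_before"].skew())      # sample skewness
```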

For more information on these methods, please see their official documentation page for kurtosis() and skew().

Researchpy

Continuous variables

Variable N Mean SD SE 95% Conf. Interval
0 bp_before 120.0 156.45 11.389845 1.039746 154.391199 158.508801

This method returns less overall information compared to the describe() method, but it does return more in-depth information regarding the mean. It returns the non-missing count, mean, standard deviation (SD), standard error (SE), and the 95% confidence interval.
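
A sketch of the call that produces a table like the one above (assumes researchpy is installed; the data are made up):

```python
import pandas as pd
import researchpy as rp

df = pd.DataFrame({"bp_before": [143, 163, 153, 153, 146, 150, 148, 188, 153, 158]})
print(rp.summary_cont(df["bp_before"]))
# Variable, N, Mean, SD, SE, and the 95% confidence interval bounds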

Categorical variables

The method returns the variable name, the non-missing count, and the percentage of each category of a variable. By default, the outcomes are sorted in descending order.

For more information about these methods, please see the official documentation for summary_cont() and summary_cat().


Dispersion

(Chapter 4 in Zar, 2010)

While the position of a distribution on the X-axis is a critical piece of information to convey, the relevance of that measure depends on how wide that distribution is, i.e., the amount of variation in that variable, especially when making comparisons between or among distributions. Measures of dispersion are indices of how spread out the observations are along the X-axis.

The simplest measure of dispersion is the range, which involves reporting the lowest and highest observation, or the difference between them. This measure is very sensitive to outliers, which are values that are unusually high or low relative to the other observations. While it is not difficult to find recommendations for excluding outliers from a set of data, unless it is clear that the observation is impossible, e.g., a human body temperature of 183 degrees C, or it is known that an error in measurement occurred, one should always be hesitant to remove such observations (see section 2.5 in chapter 2 of your text).

The reason that range is sensitive to outliers is that it relies on only 2 of your observations. Clearly a measure of dispersion that relied on all of your observations would be of more value, and better justify all the hard work that went into collecting those observations. Our newfound, and in-depth, understanding of central tendency suggests one possible measure: the average distance of the observations from the center of the distribution.

The distance of an observation from the sample mean can be calculated by subtracting the sample mean from the observation as follows:

y = X − X̄

This value, indicated by a lowercase y, is called a deviate. Intuitively then, the average distance would be the sum of the deviates, ∑y, divided by the number of observations, n. The problem with this can be illustrated by comparing quiz scores from 2 separate sections of a biology class; a numeric sketch with hypothetical scores appears below.

Because the sample mean is the mathematical center of the observations, the sum of the deviates will always (within rounding error) be equal to zero. The two distributions of quiz scores are clearly different, but the average deviations will provide no information about these differences.

The solution that we will apply is to square the deviates, making all of the differences positive. The notation that we will use for a squared deviate will be y², such that ∑y² will indicate the sum of the squared deviates. The sum of the squared deviates is generally referred to as the sum of squares, and is a value that will figure prominently in virtually all of the analyses that we will address, so make sure that you are familiar with how to calculate it, and what it represents.

Applying this to the quiz score data, we can see that the sum of squares (∑y²) better reflects the differences between the two distributions:
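A minimal numeric sketch (the quiz scores here are hypothetical, chosen so that both sections have a mean of 7) illustrates both points: the deviates always sum to zero, while the sums of squares separate the two distributions:

```python
import numpy as np

# Hypothetical quiz scores: same mean (7), different spreads
section_1 = np.array([6, 7, 7, 7, 8])
section_2 = np.array([3, 5, 7, 9, 11])

for scores in (section_1, section_2):
    y = scores - scores.mean()        # deviates
    print(y.sum(), (y ** 2).sum())    # sum of deviates, sum of squares

# Output: the sums of deviates are both 0.0, but the sums of squares
# are 2.0 and 40.0, reflecting the different spreads.
```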

Dividing the sum of the squared deviates by the number of observations (∑y²/n) will give us the average squared distance of the observations from the mean of the observations. While it should be intuitive that this is a good measure of the spread of the observations (apart from using squared distances, which we will address shortly), we cannot lose sight of the fact that the purpose of deriving this value from a sample is to estimate the same parameter for the statistical population. Thus, it is important to establish whether calculating this value as described will introduce a bias in the estimation of the same population parameter.

The average squared distance of the observations from the mean for a statistical population, i.e., calculated using every observation that exists, is a parameter that we call the population variance, and denote using the symbol σ². Unfortunately, using the same calculation on sample data produces a biased estimate of σ². The following distribution was produced by taking 1000 random samples from a statistical population with μ=10 and σ²=4, and calculating the average squared distance of the observations from the mean of the observations for each sample. For each sample, the population variance (σ²) was subtracted from the average squared distance of the observations from the sample mean ((∑y²/n) − σ²) to produce the values shown below, such that an estimate matching the population variance would result in a value of 0:

Note: These data were produced as the "pvd" object in this R program
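The experiment is easy to reproduce in Python; a sketch under stated assumptions (the sample size, n = 10 here, is not given in the text, and a normal population is assumed):

```python
import numpy as np

rng = np.random.default_rng(42)
mu, sigma2, n, reps = 10, 4, 10, 1000  # n and the normal population are assumptions

diffs = []
for _ in range(reps):
    sample = rng.normal(mu, np.sqrt(sigma2), n)
    ss = ((sample - sample.mean()) ** 2).sum()   # sum of squared deviates
    diffs.append(ss / n - sigma2)                # (SS/n) minus population variance

# The average difference is negative (about -sigma2/n, here roughly -0.4):
# dividing by n systematically underestimates the population variance.
print(np.mean(diffs))
```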

Question 2: In what direction is the bias demonstrated for the average squared distance of the observations from the sample mean as an estimate of σ²?

The distribution above suggests that a different calculation must be used to produce an unbiased estimate of σ² from sample data. In this instance the correction is a simple one, involving the use of n − 1 in the denominator instead of n. The resulting formula calculates a statistic we call the sample variance, denoted as s²:

s² = ∑y² / (n − 1)

In the following graph, the sample variance (s²) calculated from the same series of 1000 random draws has been plotted as a second series (SS/(n − 1)):

Note: The additional series was produced as the "svd" object in this R program

From this distribution, we can see that the correction for sample variance removes the bias from the estimate. Thus, we will use the sample variance (s²) as our best estimate of the population variance (σ²):

s² = ∑(X − X̄)² / (n − 1)

The only issue one may take with variance as an indication of the spread of the data is that its units are squared relative to the values of the observations and, therefore, the mean. The solution to this, as you might imagine, is a simple one: simply take the square root of the variance. This produces a value referred to as the standard deviation, which, for a sample, we denote as s, and for a population, we denote as σ. Obviously (at least I hope that it is obvious), the square root of a sample variance (calculated with n − 1 as the denominator) will produce a sample standard deviation (s), and the square root of a population variance (calculated using n as the denominator) will produce a population standard deviation (σ). Given that we will almost always be working with samples, we will use the sample standard deviation as our estimate of the population standard deviation:

s = √(∑y² / (n − 1))
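In NumPy, the ddof ("delta degrees of freedom") argument controls which denominator is used, which makes the sample/population distinction explicit; a minimal sketch with hypothetical data:

```python
import numpy as np

data = np.array([6, 7, 7, 7, 8])   # hypothetical observations

s2 = np.var(data, ddof=1)          # sample variance: SS / (n - 1)  -> 0.5
s = np.std(data, ddof=1)           # sample standard deviation      -> ~0.707

sigma2 = np.var(data, ddof=0)      # population variance: SS / n    -> 0.4
sigma = np.std(data, ddof=0)       # population standard deviation  -> ~0.632

print(s2, s, sigma2, sigma)
```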

Now let's practice calculating some descriptive statistics for some actual data. Download the Excel workbook for this week's exercise HERE.


Analyzing Qualitative Data

Qualitative data analysis works a little differently from quantitative data analysis, primarily because qualitative data is made up of words, observations, images, and even symbols. Deriving absolute meaning from such data is nearly impossible; hence, it is mostly used for exploratory research. While in quantitative research there is a clear distinction between the data preparation and data analysis stages, analysis for qualitative research often begins as soon as the data is available.

Data Preparation and Basic Data Analysis

Analysis and preparation happen in parallel and include the following steps:

  1. Getting familiar with the data: Since most qualitative data is just words, the researcher should start by reading the data several times to get familiar with it and start looking for basic observations or patterns. This also includes transcribing the data.
  2. Revisiting research objectives: Here, the researcher revisits the research objective and identifies the questions that can be answered through the collected data.
  3. Developing a framework: Also known as coding or indexing, here the researcher identifies broad ideas, concepts, behaviors, or phrases and assigns codes to them. For example, coding age, gender, socio-economic status, and even concepts such as the positive or negative response to a question. Coding is helpful in structuring and labeling the data.
  4. Identifying patterns and connections: Once the data is coded, the researcher can start identifying themes, looking for the most common responses to questions, identifying data or patterns that can answer research questions, and finding areas that can be explored further.

Qualitative Data Analysis Methods

Several methods are available to analyze qualitative data. The most commonly used data analysis methods are:

  • Content analysis: This is one of the most common methods used to analyze qualitative data. It is used to analyze documented information in the form of texts, media, or even physical items. Whether to use this method depends on the research questions. Content analysis is usually used to analyze responses from interviewees.
  • Narrative analysis: This method is used to analyze content from various sources, such as interviews of respondents, observations from the field, or surveys. It focuses on using the stories and experiences shared by people to answer the research questions.
  • Discourse analysis: Like narrative analysis, discourse analysis is used to analyze interactions with people. However, it focuses on analyzing the social context in which the communication between the researcher and the respondent occurred. Discourse analysis also looks at the respondent’s day-to-day environment and uses that information during analysis.
  • Grounded theory: This refers to using qualitative data to explain why a certain phenomenon happened. It does this by studying a variety of similar cases in different settings and using the data to derive causal explanations. Researchers may alter the explanations or create new ones as they study more cases until they arrive at an explanation that fits all cases.

These methods are the ones used most commonly. However, other data analysis methods, such as conversational analysis, are also available.

Data analysis is perhaps the most important component of research. Weak analysis produces inaccurate results that not only hamper the authenticity of the research but also make the findings unusable. It’s imperative to choose your data analysis methods carefully to ensure that your findings are insightful and actionable.