The Probability and Statistics domain has about 14 questions. These questions account for 14% of the entire exam.
This domain is divided into the following competencies:
- Analyzing and Representing Data
- Probability, Sampling, and Statistics
Let’s talk about some concepts that you will more than likely see on the test.
Measures of Central Tendency
Measures of central tendency describe the center, or typical value, of a data set. We will consider three of them below: the mean, the mode, and the median.
The mean of a data set is the average value of a data set. This can be found by adding together all of the values and dividing by the total number of values.
The mode of a data set is the value that occurs most frequently in the data set. A data set can have more than one mode if several values tie for the highest frequency.
The median of a data set is the middle value in the set. If there is an even number of data points, there will not be an exact middle. In this case, the median is found by taking the average of the two data points closest to the middle.
For example, suppose that the ages for a group of ten students were collected and are listed below:
9, 11, 13, 11, 8, 7, 13, 9, 9, 12
To find the mean of this data set, add together all of the values and divide by the total number of values: (9 + 11 + 13 + 11 + 8 + 7 + 13 + 9 + 9 + 12) / 10 = 102 / 10 = 10.2
To find the mode and median of a data set, it is helpful to reorder the set from lowest to highest.
7, 8, 9, 9, 9, 11, 11, 12, 13, 13
Now we can see that the mode of the data set is 9, since 9 occurs 3 times, which is more than any other data point.
Since the data set has an even number of values, there are two values in the middle: 9 and 11. To find the median, you must average 9 and 11; therefore, the median of the data set is 10.
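If you want to check these three results yourself, here is a minimal sketch using Python's built-in statistics module. The variable names are just illustrative choices, and the ages are the ten values from the example above.

```python
from statistics import mean, median, multimode

# Ages of the ten students from the example above
ages = [9, 11, 13, 11, 8, 7, 13, 9, 9, 12]

# Sorting is not required by these functions, but it makes the
# mode and the two middle values easy to see by eye.
ordered_ages = sorted(ages)

ages_mean = mean(ages)        # sum of the values divided by how many there are
ages_median = median(ages)    # average of the two middle values (even count)
ages_modes = multimode(ages)  # list of every value tied for most frequent

print(ordered_ages)
print("mean:", ages_mean)
print("median:", ages_median)
print("mode(s):", ages_modes)
```

Note that multimode returns a list, which matches the idea that a data set can have more than one mode.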
An outlier is a data point that lies far away from the rest of the data points in the set. For example, suppose that some daily high temperatures in the month of May for a particular area are given below in degrees Fahrenheit:
66, 56, 61, 45, 48, 52, 23, 66, 53, 58, 59
Reordering gives 23, 45, 48, 52, 53, 56, 58, 59, 61, 66, 66
The mean of this data is (66 + 56 + 61 + 45 + 48 + 52 + 23 + 66 + 53 + 58 + 59) / 11 = 587 / 11 ≈ 53.364
The mode of the data is 66 since this value occurs twice.
The median is 56, since that is the middle value.
However, the temperature of 23 degrees is an outlier, because it is at least 22 degrees lower than every other temperature in the set. Therefore, we can remove this outlier from the data and calculate the mean, mode, and median again to get a better description of the central tendency of this data set.
45, 48, 52, 53, 56, 58, 59, 61, 66, 66
After the temperature of 23 is discarded, the new mean is 564 / 10 = 56.4. The mode of the data is still the same in this case, since 66 is the only temperature that occurs more than once. Since there are now only ten data points, the median is the average of the two middle values, 56 and 58, which is 57 degrees.
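The before-and-after comparison can be sketched in a few lines of Python. The summarize helper is a hypothetical name made up for this illustration, and the outlier is removed by hand (by its value, 23), just as in the text. This is not an automatic outlier-detection rule.

```python
from statistics import mean, median, multimode

# Daily high temperatures from the example above (degrees Fahrenheit)
temps = [66, 56, 61, 45, 48, 52, 23, 66, 53, 58, 59]

def summarize(values):
    """Return the mean (rounded to 3 places), median, and mode(s) of values."""
    return {
        "mean": round(mean(values), 3),
        "median": median(values),
        "modes": multimode(values),
    }

# Hand-picked outlier from the text; dropped by value for this illustration
temps_without_outlier = [t for t in temps if t != 23]

with_outlier = summarize(temps)
without_outlier = summarize(temps_without_outlier)

print("with 23:   ", with_outlier)
print("without 23:", without_outlier)
```

Comparing the two summaries shows that removing one extreme value shifts the mean by about three degrees, moves the median by only one degree, and leaves the mode unchanged.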
A scatterplot is a graph made up of points in the x–y plane that shows the relationship between two variables, x and y.
If the points go up as x increases, then there is a positive correlation between the two variables. For example, there is a positive correlation between the temperature outside and ice cream sales, since as it gets hotter, ice cream sales increase.
If the points go down as x increases, then there is a negative correlation between the two variables. For example, there is usually a negative correlation between the number of absences a student has and their grade, since as the absences increase their grade usually decreases.
If the points on the scatterplot don’t follow any pattern as x increases, then there is no correlation between the variables. Here are some examples of scatterplots:
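As a numeric companion to these three cases, the sketch below computes the Pearson correlation coefficient, a standard measure of linear association that ranges from -1 to 1. All three data sets are made up for illustration (they are not taken from any plot in this guide), and pearson_r is a helper written here by hand rather than a library function.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data, invented for this illustration only
x = [1, 2, 3, 4, 5, 6]
rising = [2, 4, 5, 7, 9, 10]     # points go up as x increases
falling = [10, 9, 7, 5, 4, 2]    # points go down as x increases
scattered = [5, 2, 6, 6, 2, 5]   # no upward or downward trend

r_rising = pearson_r(x, rising)
r_falling = pearson_r(x, falling)
r_scattered = pearson_r(x, scattered)

print("rising:   ", round(r_rising, 3))
print("falling:  ", round(r_falling, 3))
print("scattered:", round(r_scattered, 3))
```

A value near 1 goes with a positive correlation, a value near -1 with a negative correlation, and a value near 0 with no linear correlation.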
By finding a line of best fit on a scatterplot, we can make predictions about data points that have not been observed yet.
For example, a line of best fit is shown for the scatterplot below:
Using this line, we can estimate that when the temperature is 21 degrees Celsius, ice cream sales are around $460.
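To see how such an estimate could be computed rather than read off a graph, here is a minimal sketch of a least-squares line of best fit. Least squares is one common way to fit a line, assumed here because the text does not say how its line was found. The temperature and sales numbers are hypothetical values invented for this illustration, so the prediction will not match the $460 estimate above. The function names fit_line and predict are also made up.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x  # the line passes through (mean_x, mean_y)
    return slope, intercept

def predict(slope, intercept, x):
    """Estimate y at a given x using the fitted line."""
    return slope * x + intercept

# Hypothetical data: temperature (degrees Celsius) and ice cream sales (dollars)
temps_c = [12, 14, 16, 18, 20, 22, 24]
sales = [200, 250, 310, 360, 400, 470, 520]

slope, intercept = fit_line(temps_c, sales)
estimate_at_21 = predict(slope, intercept, 21)

print("slope:", round(slope, 2))
print("intercept:", round(intercept, 2))
print("estimated sales at 21 degrees:", round(estimate_at_21, 2))
```

The positive slope reflects the positive correlation between temperature and sales, and the estimate at 21 degrees falls between the sales at 20 and 22 degrees, as you would expect from reading the line on a graph.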