Difference between revisions of "Diamonds Regression Lab"
Latest revision as of 23:34, 12 January 2020
Contents
- 1 Regression Analysis
- 1.1 Step 0: Data Acquisition
- 1.2 Step 1: Feature Engineering
- 1.3 Step 2: Regression
- 1.4 Step 3: Model Selection
- 1.5 Step 4: Data Cleaning: Outlier Elimination
- 1.6 Step 5: Model interpretation: Intercept
- 1.7 Step 6: Model interpretation: Slope
- 1.8 Step 7: Your tool in production
- 1.9 Step 8: Testing your tool
- 2 Write up
Regression Analysis
The size of a diamond can be described by its dimensions (x, y, and z) and by its weight (usually expressed in carats). We are going to investigate the relationship between these two descriptions, using a data set describing thousands of diamonds. Jewelers often measure these characteristics of diamonds when they set the diamond's price, but jewelers don't always make the measurements accurately. An understanding of the relationship between these variables, and what is atypical, might alert a jeweler or a buyer to the possibility that the measurements were taken incorrectly.
The weight of a diamond is its density times its volume. As you might imagine, the shape of a diamond matters. If diamonds were cut as cylinders, the relationship between weight and dimensions would be given by the geometry of the cut:
weight = density * pi/4 * x*y*z.
If diamonds were cut as right-circular-cones this relationship would be given by a different geometry:
weight = density * pi/12 * x*y*z.
In both cases, the weight is the density times a coefficient (a number) times the product of the dimensions. In both of these cases, the diamonds would be pretty ugly, but the point is the coefficient would be the same for diamonds of the same shape, and different for diamonds of different shapes. You can assume density is the same for all diamonds.
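As a quick check on the two formulas above, here is a minimal Python sketch that evaluates both. The density is the value quoted in Step 6 of this lab; the function names and the dimensions are made up for illustration.

```python
import math

DENSITY = 0.01755  # carats per cubic millimetre, the value quoted in Step 6

def cylinder_weight(x, y, z):
    """Weight in carats of a cylinder with cross-section x by y and height z (mm)."""
    return DENSITY * math.pi / 4 * x * y * z

def cone_weight(x, y, z):
    """Weight in carats of a right circular cone with base x by y and height z (mm)."""
    return DENSITY * math.pi / 12 * x * y * z

# Made-up dimensions in millimetres, chosen only for illustration.
x, y, z = 6.0, 6.0, 4.0
print(cylinder_weight(x, y, z), cone_weight(x, y, z))
```

For the same x, y, and z the cylinder weighs three times as much as the cone, which is exactly the ratio of the two coefficients (pi/4 versus pi/12).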
You are going to find this coefficient for the diamonds of a data set linked below, which presumably have a similar shape (the data code book says they are all "round cut"). Note the different diamonds in the data set may be cut slightly differently, with imperfections or peculiarities, so you should expect scatter in the data.
You want to find a formula that predicts weight in terms of x, y, and z (listed in a data set). Your formula would be very useful for a jeweler!
Step 0: Data Acquisition
Download the diamonds3K data. Once you have saved the file, you can load the data into StatCrunch. Do not open the data in Excel first, because then you lose the column headers. You should also view the codebook (http://ggplot2.tidyverse.org/reference/diamonds.html) for the larger data set from which our data were randomly sampled (so they would fit reliably into StatCrunch).
Step 1: Feature Engineering
We have three predictor variables x, y, and z. But for simple linear regression, we can have only one explanatory variable. Therefore, we want to create a new feature that combines the information from all these variables. As suggested above, a good feature might be
x*y*z
For comparison, we are also going to use the following feature in a separate model:
x+y+z
Go ahead and create these columns in your data set. Use Data --> Compute --> Expression in StatCrunch.
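If you want to see the same step outside StatCrunch, the sketch below builds both features in plain Python. The two rows are made-up stand-ins for the diamonds3K data, and the column labels simply mirror the default labels mentioned in Step 2.

```python
# Made-up rows standing in for the diamonds3K data (dimensions in mm).
diamonds = [
    {"carat": 0.25, "x": 4.0, "y": 4.0, "z": 2.5},
    {"carat": 0.70, "x": 5.5, "y": 5.5, "z": 3.5},
]

# Add one column per engineered feature to every row.
for d in diamonds:
    d["x*y*z"] = d["x"] * d["y"] * d["z"]
    d["x+y+z"] = d["x"] + d["y"] + d["z"]

for d in diamonds:
    print(d["carat"], d["x*y*z"], d["x+y+z"])
```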
Step 2: Regression
Use simple linear regression (Stat --> Regression --> Simple Linear) to compute b0 and b1 for the following linear models:
PRODUCT MODEL: Carat = b0 + b1 "x*y*z"
SUM MODEL: Carat = b0 + b1 "x+y+z"
Here "x*y*z" and "x+y+z" should be whatever you called the columns in the Feature Engineering Step. If you didn't specify labels, they will be labeled as shown here, by default. Record the slopes and intercepts for later reference. Keep a record of all your results to write up later.
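For reference, the b0 and b1 of a simple linear regression can be computed from the usual least-squares formulas. The sketch below is one way to write them in plain Python; the function name `fit_line` and the data are made up, and this is not a description of how StatCrunch does the calculation internally.

```python
def fit_line(xs, ys):
    """Return (b0, b1) for the least-squares line y = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx           # slope
    b0 = y_bar - b1 * x_bar  # intercept
    return b0, b1

# Made-up feature values (for example x*y*z) and carat weights.
feature = [40.0, 80.0, 120.0, 160.0]
carat = [0.26, 0.49, 0.76, 0.99]
b0, b1 = fit_line(feature, carat)
print(b0, b1)
```

You would call `fit_line` once with the x*y*z column and once with the x+y+z column to get the coefficients of the two models.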
Step 3: Model Selection
Make a judgement concerning which of the two models is more appropriate for simple linear regression. Remember the conditions listed in the book, and how you check for them. Don't worry about outliers yet; we will deal with them below. Be prepared to justify your judgement when you write up your lab. Be sure to check the residuals-versus-X-values plot as well as the scatter plot. Also compare R^2 for the two models. Again, keep a record of your results to write up later.
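The two quantities this step relies on, the residuals and R^2, are easy to compute once you have b0 and b1. The sketch below assumes the coefficients come from a least-squares fit; the function names and the numbers are made up for illustration.

```python
def residuals(xs, ys, b0, b1):
    """Observed minus predicted value for every data point."""
    return [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

def r_squared(xs, ys, b0, b1):
    """1 - SS_residual / SS_total for the line y = b0 + b1 * x."""
    y_bar = sum(ys) / len(ys)
    ss_res = sum(r ** 2 for r in residuals(xs, ys, b0, b1))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Made-up data and made-up coefficients, only to show the calls.
feature = [1.0, 2.0, 3.0, 4.0]
carat = [1.1, 1.9, 3.2, 3.8]
print(residuals(feature, carat, 0.0, 1.0))
print(r_squared(feature, carat, 0.0, 1.0))
```

Plotting the residuals against the feature values gives the residuals-versus-X plot; a model whose residuals show a curve or a fan shape is a worse candidate for simple linear regression even if its R^2 is high.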
Step 4: Data Cleaning: Outlier Elimination
You might have noticed outliers and remembered that one of the conditions for regression was the "no outlier condition." You can eliminate outliers, but only if there is a justifiable reason to do so. Consider an outlier. If the jeweler measured the dimensions wrong, and there is no way to tell what the dimensions should be, then yes you should eliminate the diamond from consideration. On the other hand, if the store actually sold a round cut diamond with said dimensions and said weight, you should not eliminate the diamond from consideration.
How can you tell if the dimensions are wrong? In StatCrunch, click on the most extreme outlier. It will turn pink on the scatter plot and on the other graphs you have created. Now plot three scatter plots showing (x, carat), (y, carat), (z, carat) and locate the pink outlier on the graphs. Could you change a single measurement of the four (i.e. x, y, z and carat) without affecting at least one of the others? What does this tell you about the validity of the measurements? It would be too tedious to check all of the other outliers (but if you were working for a client, that's what you would have to do) so we are going to assume that all other outliers tell the same story. Looks like the jeweler could have used your formula!
Now we want to eliminate all outliers at once. Redo the regression of the chosen model and save residuals (that's an option in the Regression dialog box). Now redo the regression again and specify "Where abs(Residuals) < 'Threshold'". Here 'Threshold' should be some number that you get to choose, but choose appropriately, to eliminate outliers with large (positive or negative) residuals that stand apart from the main part of the graph. Look at the residual plot to see where to draw the line. Compare the R^2 value with and without outliers. Compare your slopes and intercepts with and without outliers.
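The "fit, save residuals, refit where abs(Residuals) < Threshold" procedure can be sketched in plain Python as follows. The helper names, the data, and the threshold of 5.0 are all made up; in the lab you choose the threshold from your own residual plot.

```python
def fit_line(xs, ys):
    """Return (b0, b1) for the least-squares line y = b0 + b1 * x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

def drop_outliers(xs, ys, threshold):
    """Keep only the points whose residual from the first fit is below threshold in absolute value."""
    b0, b1 = fit_line(xs, ys)
    kept = [(x, y) for x, y in zip(xs, ys) if abs(y - (b0 + b1 * x)) < threshold]
    return [x for x, _ in kept], [y for _, y in kept]

# Made-up data: four points on y = 2x plus one obvious outlier at x = 3.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 16.0, 8.0, 10.0]

kept_xs, kept_ys = drop_outliers(xs, ys, threshold=5.0)
b0_clean, b1_clean = fit_line(kept_xs, kept_ys)
print(fit_line(xs, ys), (b0_clean, b1_clean))
```

In this toy example the single outlier shifts the intercept of the first fit away from zero, and the refit without it recovers the line through the remaining points.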
Step 5: Model interpretation: Intercept
What is the intercept of the final model (after outliers are removed)? What is the meaning of the intercept? Is your intercept zero? Should it be? In other words, how much would a round cut diamond weigh if the product of its dimensions was zero (i.e. one of its dimensions was 0)? What does the model predict for the weight of the diamond if x*y*z=0?
The intercept is not zero but we would expect it to be. Let's investigate.
Is the intercept small enough to be considered zero?
Just read this subsection.
True, because there is scatter in the data we would not expect the intercept to be exactly zero. But based on the scatter in the data, is the intercept small enough to be considered zero? StatCrunch gives you a resounding answer to this question, in the form of a number, even though you are not yet able to interpret this number (we will cover questions like this at the end of the semester). That said, hopefully the next paragraph will give you an idea (but don't worry if this doesn't make sense just yet). Just read.
StatCrunch provides a number, labeled "P-value", which is the probability of seeing an intercept at least as far from zero as the one we see, given the type of scatter that we see in the data, and provided the data come from a process where the intercept really is zero. If this probability is small (it is), we interpret it as evidence that the data do not come from a process where the intercept is zero.
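If you are curious where such a number could come from, the sketch below computes a test statistic for the intercept and a two-sided P-value. It is only a rough illustration: it uses a normal approximation in place of the t distribution (reasonable for thousands of diamonds, cruder for small samples), and the function name and data are made up. It is not meant to reproduce StatCrunch's output exactly.

```python
import math

def intercept_p_value(xs, ys):
    """Return (b0, t, p): intercept, its test statistic, and an approximate two-sided P-value."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    # Residual standard deviation (the "scatter" around the line).
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    s = math.sqrt(sse / (n - 2))
    # Standard error of the intercept.
    se_b0 = s * math.sqrt(1 / n + x_bar ** 2 / sxx)
    t = b0 / se_b0
    # Normal approximation to the two-sided tail probability (an assumption of this sketch).
    p = math.erfc(abs(t) / math.sqrt(2))
    return b0, t, p

# Made-up data: a line with intercept 5 and small alternating scatter.
xs = [float(i) for i in range(1, 21)]
ys = [5 + 2 * x + (0.1 if i % 2 == 0 else -0.1) for i, x in enumerate(xs)]
print(intercept_p_value(xs, ys))
```

The data must contain some scatter for this to work; with a perfect fit the standard error is zero and the statistic is undefined.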
Is the intercept nonzero because the assumptions of regression are violated?
Consider the residual plot after outliers are removed. What does that tell you about the model? Does this plot satisfy the "Does the plot thicken?" condition? Could this property of the data explain our result that the intercept isn't small enough to be considered zero?
How can we fix our model so that the intercept is as we expect?
From now on, when you write the model, drop the intercept from the model. Optional: In StatCrunch there is a way of recalculating the slope under the assumption that the intercept is zero. It involves choosing "Polynomial" Regression, choosing "Polynomial Order = 1" and checking "Without Intercept". The answer you get is almost exactly the same, so it's OK to just write the slope term you get with Simple Linear Regression, and don't write down the b0 intercept term.
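For the optional recalculation, the least-squares slope of a line forced through the origin has a simple closed form: the sum of x*y divided by the sum of x squared. The sketch below shows that formula in plain Python with made-up data; the function name is hypothetical and this is not a description of StatCrunch's "Without Intercept" implementation.

```python
def slope_through_origin(xs, ys):
    """Least-squares slope b1 for the model y = b1 * x (intercept fixed at zero)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

# Made-up feature values (x*y*z) and carat weights.
feature = [40.0, 80.0, 120.0, 160.0]
carat = [0.26, 0.49, 0.76, 0.99]
print(slope_through_origin(feature, carat))
```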
Step 6: Model interpretation: Slope
What is the slope of your model, b1? Solve the second equation, below, for "a" and compare the resulting value with the ones for cylinders (a=4) and right-circular-cones (a=12), above:
Weight = b1*x*y*z = Density * Volume_Slope*x*y*z
Volume_Slope = pi/a
Density = 0.01755 carats/mm^3
pi = 3.1415
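Combining the two equations gives b1 = Density * pi / a, so a = pi * Density / b1. The sketch below does that arithmetic with the constants given in this step; the function name and the slope value are hypothetical, so substitute the b1 from your own regression.

```python
PI = 3.1415        # value given in this lab
DENSITY = 0.01755  # carats per cubic millimetre, value given in this lab

def shape_constant(b1):
    """Solve b1 = DENSITY * PI / a for a."""
    return PI * DENSITY / b1

# Hypothetical slope, only to show the call; use the slope from your own model.
example_b1 = 0.0065
print(shape_constant(example_b1))
```

A result between 4 and 12 would place the round cut between the cylinder and the cone in terms of how much of the x*y*z box it fills.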
Step 7: Your tool in production
Going back to the problem of when to flag a diamond as possibly measured incorrectly, we are going to use your model (with intercept removed). We are going to measure x, y, z, and carat for the diamond, then compute a residual. If the residual is too high or too low we are going to flag the diamond. You choose what residuals are too high and too low based on your understanding of the data. Based on this choice, write a formula indicating which diamonds are flagged involving x, y, z, b1, and carat, and inequalities.
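One possible shape for such a rule, written as a small Python function, is sketched below. The slope b1 and the cutoffs low and high are hypothetical placeholders; the whole point of this step is that you pick them from your own model and residual plot.

```python
def is_flagged(x, y, z, carat, b1, low, high):
    """Flag a diamond whose residual (observed minus predicted carat) falls outside [low, high]."""
    residual = carat - b1 * x * y * z
    return residual < low or residual > high

# Hypothetical slope and cutoffs, chosen only for illustration.
B1, LOW, HIGH = 0.006, -0.1, 0.1

print(is_flagged(5.0, 5.0, 4.0, 0.61, B1, LOW, HIGH))  # residual near 0.01
print(is_flagged(5.0, 5.0, 4.0, 1.00, B1, LOW, HIGH))  # residual near 0.40
```

The cutoffs do not have to be symmetric: if your residual plot shows more spread on one side, low and high can differ in size.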
Step 8: Testing your tool
Compute the residual from Step 7 for each diamond in the data set. Use Data --> Compute --> Expression.
Count the number of diamonds in the data set that would be flagged as possibly measured wrong. Use Stat --> Summary Stats. Choose n (count) as the statistic to be computed, and set a Where expression based on the residual above so that only the flagged diamonds are included. What percentage of diamonds are possibly measured wrong?
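The count and the percentage can be sketched in plain Python as follows. The residuals and the cutoffs are made up; in the lab they come from your model and your choices in Step 7.

```python
def percent_flagged(residuals, low, high):
    """Return (count, percent) of residuals that fall outside the interval [low, high]."""
    flagged = [r for r in residuals if r < low or r > high]
    return len(flagged), 100.0 * len(flagged) / len(residuals)

# Made-up residuals (observed carat minus b1 * x*y*z) and made-up cutoffs.
residuals = [-0.30, -0.02, 0.00, 0.01, 0.03, 0.25, -0.01, 0.02]
count, percent = percent_flagged(residuals, -0.1, 0.1)
print(count, percent)
```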
Write up
Write a report that explains what you have done. Include a short description, with headings for each step above.