Difference between revisions of "Diamonds Regression Lab"

From Sean_Carver

Revision as of 12:40, 26 September 2019

Regression Analysis

The size of a diamond can be described by its dimensions (x, y, and z) and by its weight (usually expressed in carats). We are going to investigate the relationship between these two descriptions, using a data set describing thousands of diamonds.

The weight of a diamond is its volume times its density. As you might imagine, the shape of a diamond matters. If diamonds were cut as cylinders, the relationship between weight and dimensions would be:

weight = density*pi/4 * x*y*z.

If diamonds were cut as right circular cones, this relationship would be:

weight = density*pi/12 * x*y*z.

In both cases, the weight is a coefficient (a number) times the product of the dimensions. The coefficient on x*y*z is the same for all diamonds of the same shape.
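As a rough numerical illustration of the two coefficients above, the sketch below evaluates density*pi/4 and density*pi/12. It assumes the density of 0.01755 carats/mm^3 quoted in Step 6 of this lab; the function name `shape_coefficient` is made up for this example.

```python
import math

# Density of diamond in carats per mm^3 (the value quoted in Step 6 of this lab).
DENSITY = 0.01755

def shape_coefficient(a, density=DENSITY):
    """Coefficient on x*y*z for a shape whose volume is (pi / a) * x*y*z."""
    return density * math.pi / a

cylinder_coef = shape_coefficient(4)    # cylinder: volume = pi/4 * x*y*z
cone_coef = shape_coefficient(12)       # right circular cone: volume = pi/12 * x*y*z

print(f"cylinder coefficient: {cylinder_coef:.5f} carats/mm^3")
print(f"cone coefficient:     {cone_coef:.5f} carats/mm^3")
```

A real round cut diamond is neither a cylinder nor a cone, so these two numbers are only reference points for the coefficient you will estimate from the data.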

You are going to find this coefficient for the diamonds of this data set, which presumably have a similar shape (the codebook says "round cut"). Note that the shape may differ slightly from one diamond to the next, so you should expect scatter in the data.

You want to find a formula that predicts weight in terms of x, y, and z (listed in a data set). Your formula would be very useful for a jeweler! Measuring the dimensions of a diamond is sometimes prone to error, and if a round cut diamond doesn't closely follow the formula, that would suggest to the jeweler that they should check that the diamond really has a proper round cut and check that the measurements have been performed correctly.

Round Cut Diamond

Step 0: Data Acquisition

Download the diamonds3K data. Once you have saved the file, you can load the data into StatCrunch. Also view the codebook for a larger data set, from which our data were randomly sampled (so they would fit reliably into StatCrunch).

Step 1: Feature Engineering

We have three predictor variables x, y, and z. But for simple linear regression, we can have only one explanatory variable. Therefore, we want to create a new feature that combines the information from all these variables. As suggested above, a good feature might be

x*y*z

For comparison, we are also going to use the following feature in a separate model:

x+y+z

Go ahead and create these columns in your data set. Use Data --> Compute --> Expression in StatCrunch.
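The lab itself is done in StatCrunch, but the same two columns can be sketched in a few lines of Python. The rows below are made-up values for illustration only; they are not taken from diamonds3K.

```python
# Made-up example rows (NOT from diamonds3K); dimensions are in mm.
rows = [
    {"x": 4.0, "y": 4.0, "z": 2.5},
    {"x": 5.0, "y": 5.0, "z": 3.0},
    {"x": 6.5, "y": 6.5, "z": 4.0},
]

for row in rows:
    row["x*y*z"] = row["x"] * row["y"] * row["z"]   # product feature
    row["x+y+z"] = row["x"] + row["y"] + row["z"]   # sum feature

for row in rows:
    print(row)
```

The column labels "x*y*z" and "x+y+z" mirror the default labels mentioned in Step 2; any other names would work as well.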

Step 2: Regression

Use simple linear regression (Stat --> Regression --> Simple Linear) to compute b0 and b1 for the following linear models:

PRODUCT MODEL: Carat = b0 + b1 "x*y*z"
SUM MODEL: Carat = b0 + b1 "x+y+z"

Here "x*y*z" and "x+y+z" should be whatever you called the columns in the Feature Engineering Step. If you didn't specify labels, they will be labeled as shown here, by default. Record the slopes and intercepts for later reference. Keep a record of all your results to write up later.
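For readers who want to see what the Simple Linear dialog computes, here is a minimal sketch of ordinary least squares for one explanatory variable. The function name and the data values are assumptions made for this example; the numbers are not from diamonds3K.

```python
def simple_linear_regression(xs, ys):
    """Return (b0, b1, r_squared) for the least-squares line y = b0 + b1*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx                      # slope
    b0 = mean_y - b1 * mean_x           # intercept
    r_squared = sxy ** 2 / (sxx * syy)  # squared correlation coefficient
    return b0, b1, r_squared

# Made-up values for illustration only (NOT from diamonds3K).
xyz = [40.0, 75.0, 110.0, 169.0]     # hypothetical x*y*z column, in mm^3
carat = [0.25, 0.47, 0.68, 1.04]     # hypothetical carat column

b0, b1, r2 = simple_linear_regression(xyz, carat)
print(f"b0 = {b0:.4f}, b1 = {b1:.5f}, R^2 = {r2:.4f}")
```

Running the same function on the sum column instead of the product column gives the SUM MODEL.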

Step 3: Model Selection

Make a judgement concerning which of the two models is more appropriate for simple linear regression. Remember the conditions listed in the book, and how you check for them. Don't worry about outliers yet; we will deal with them below. Be sure to check the residuals-versus-X-values plot as well as the scatter plot. Also compare R^2 for the two models. Be prepared to justify your judgement when you write up your lab. Again, keep a record of your results to write up later.
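The residuals and R^2 used in this comparison can be sketched as follows. The helper names are hypothetical, and the formula R^2 = 1 - SSE/SST matches the regression R^2 only when b0 and b1 are the least-squares estimates for the data.

```python
def residuals(xs, ys, b0, b1):
    """Observed minus predicted value for each point under y = b0 + b1*x."""
    return [y - (b0 + b1 * x) for x, y in zip(xs, ys)]

def r_squared(ys, resid):
    """1 - SSE/SST, the fraction of variation in y explained by the line."""
    mean_y = sum(ys) / len(ys)
    sse = sum(e ** 2 for e in resid)
    sst = sum((y - mean_y) ** 2 for y in ys)
    return 1 - sse / sst

# Made-up points that lie exactly on y = 1 + 2x, so every residual is zero.
xs = [0.0, 1.0, 2.0]
ys = [1.0, 3.0, 5.0]
resid = residuals(xs, ys, b0=1.0, b1=2.0)
print(resid, r_squared(ys, resid))
```

Plotting the residuals against the x values, one plot per model, is what reveals curvature or fanning that a single R^2 number can hide.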

Step 4: Data Cleaning: Outlier Elimination

You might have noticed outliers and remembered that one of the conditions for regression was the "no outlier condition." You can eliminate outliers, but only if there is a justifiable reason to do so. Consider an outlier. If the jeweler measured the dimensions wrong, and there is no way to tell what the dimensions should be, then yes, you should eliminate the diamond from consideration. On the other hand, if the store actually sold a round cut diamond with said dimensions and said weight, you should not eliminate the diamond from consideration.

How can you tell if the dimensions are wrong? In StatCrunch, click on the most extreme outlier. It will turn pink on the scatter plot and on the other graphs you have created. Now plot three scatter plots showing (x, carat), (y, carat), and (z, carat), and locate the pink outlier on the graphs. Could you change a single measurement of the four without affecting at least one of the others? What does this tell you about the validity of the measurements? It would be too tedious to check all of the other outliers (but if you were working for a client, that's what you would have to do), so we are going to assume that all other outliers tell the same story. Looks like the jeweler could have used your formula!

Now we want to eliminate all outliers at once. Redo the regression of the chosen model and save residuals (that's an option in the Regression dialog box). Now redo the regression again and specify "Where abs(Residuals) < 'Threshold'". Here 'Threshold' should be some number that you get to choose, but choose appropriately, to eliminate outliers with large (positive or negative) residuals that stand apart from the main part of the graph. Look at the residual plot to see where to draw the line. Compare the R^2 value with and without outliers. Compare your slopes and intercepts with and without outliers.
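The fit, filter, and refit procedure described above can be sketched like this. The helper names, the data, and the threshold of 5 are all assumptions chosen for illustration; in the lab you pick the threshold from your own residual plot.

```python
def fit_line(xs, ys):
    """Least-squares intercept b0 and slope b1 for y = b0 + b1*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return mean_y - b1 * mean_x, b1

def drop_large_residuals(xs, ys, threshold):
    """Keep only the points whose residual satisfies abs(residual) < threshold."""
    b0, b1 = fit_line(xs, ys)
    kept = [(x, y) for x, y in zip(xs, ys) if abs(y - (b0 + b1 * x)) < threshold]
    return [x for x, _ in kept], [y for _, y in kept]

# Made-up data: y = 2x everywhere except one deliberately wrong point at x = 4.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [2.0, 4.0, 6.0, 20.0, 10.0, 12.0, 14.0]

kept_xs, kept_ys = drop_large_residuals(xs, ys, threshold=5.0)
print("before:", fit_line(xs, ys))
print("after: ", fit_line(kept_xs, kept_ys))
```

Note that the residuals come from the first fit, which the outliers themselves have influenced, so a threshold that is too tight can also discard good points.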

Step 5: Model interpretation: Intercept

What is the intercept of the final model (after outliers are removed)? Is your intercept zero? Why not? Consider the residual plot after outliers are removed. What does that tell you about the model? (The banding in the residual plot may not matter so much, but it can be explored by considering a histogram of carat with a narrow bin width, or a scatter plot of price and carat.)

From now on, drop the intercept from the model (set it equal to zero).
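One way to read "set the intercept to zero" is to refit the line through the origin, where the least-squares slope is sum(x*y) / sum(x*x). The sketch below shows that calculation under this interpretation; the function name and data are hypothetical, and the lab may instead intend that you simply ignore the fitted intercept.

```python
def slope_through_origin(xs, ys):
    """Least-squares slope b1 for the no-intercept model y = b1*x."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

# Made-up points lying exactly on y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
print(slope_through_origin(xs, ys))
```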

Step 6: Model interpretation: Slope

What is the slope of your model? Solve the second equation, below, for "a" and compare the resulting value with the ones for cylinders and right-circular-cones, above:

Weight = Coefficient * x*y*z
Coefficient = Density*pi/a
Density = 0.01755 carats/mm^3
pi = 3.1416 (approximately)
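Rearranging Coefficient = Density*pi/a gives a = Density*pi/Coefficient. The sketch below applies that rearrangement; the function name is made up, and the coefficients fed in are the cylinder and cone values from the introduction (used as a sanity check), not a fitted slope.

```python
import math

DENSITY = 0.01755  # carats per mm^3, as given above

def solve_for_a(coefficient, density=DENSITY):
    """Solve Coefficient = Density*pi/a for a."""
    return density * math.pi / coefficient

# Sanity check with the two reference shapes: these should recover 4 and 12.
cylinder_a = solve_for_a(DENSITY * math.pi / 4)
cone_a = solve_for_a(DENSITY * math.pi / 12)
print(cylinder_a, cone_a)
```

Substituting your own slope for the coefficient gives the value of a to compare against 4 (cylinder) and 12 (cone).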

Write up

Write a report that explains what you have done. Include a short description, with headings for each step above.