Statistics Case Study
This guide will assist you in completing the case studies for this semester. Please refer to it often to ensure that you are completing the case studies appropriately. Keep in mind that it is only a guide. In particular, the rubric specified in this guide is only being provided to give students an idea of how we have graded these assignments in the past. Students should NOT expect future rubrics to follow these grading guidelines.
This guide is separated into four sections: instructions, example case study assignment, example case study solutions/rubric, and examples of actual completed case studies and their scores.

Instructions:
It is imperative that students follow the instructions for the case studies exactly as they are given below. These instructions have been created to ensure the student’s files are read correctly by Carmen, that TA’s can read and grade the files quickly and efficiently, and to ensure academic integrity. You may receive points for following these directions. For instance, if you answer a question correctly but do not do so in full sentences, you may receive partial or no credit.
• Case study submissions MUST be submitted in PDF format. DO NOT upload your Excel file. IT WILL NOT BE GRADED!!! This is for your benefit. Carmen does not always read Word files correctly.
• Under no circumstances should you include the questions in your file with the answers. Just include your answers, given according to these instructions, and numbered accordingly.
• You do not have to type your case study, although doing so will create a more professional final product.
• Files should be uploaded to the Dropbox on Carmen by the due date. You may not email the Case Study to your TA – you must upload the file to the Dropbox in Carmen. (You can find the Dropbox under the ACTIVITIES section on Carmen, right next to the CONTENT button.)
• Please give your Case Study the following name: LastName_FirstName_CaseStudy#.
(Example: Jennifer Mann’s first case study would be labeled Mann_Jennifer_CaseStudy1).
• Make sure to include your name in the Case Study itself.
• Case studies are due BY 5:00 PM. This means they must be submitted BEFORE 5 p.m. Case studies submitted up to 1 hour late will receive an automatic 25% penalty. Case studies cannot be submitted more than 1 hour late. THERE WILL BE NO EXCEPTIONS TO THIS POLICY!!!
• You will receive an email informing you that you have successfully submitted a file to Carmen. Please keep this email in case there are any technical issues with your submission.
• You may resubmit your assignment as many times as you’d like without penalty. Only the last file that you submit will be kept in the Dropbox.
• It is the student’s responsibility to ensure that the file uploaded in the Dropbox is correct, complete, and accurate. Please make sure to check that the file you submit is complete and accurate before the due date has passed. You can access the file once it is submitted by going back to the Dropbox. TA’s will only grade the file that appears in the Dropbox. Resubmissions after the deadline will not be allowed for any reason.
• Answers should be given in complete sentences, use appropriate units, and check all conditions, when necessary. Remember, treat these case studies as if they are reports you are providing in a professional environment.
• When performing hypothesis tests, be sure to specify ALL parts of a hypothesis test, including both hypotheses, the test statistic, the p-value, the decision to reject or fail to reject the null hypothesis, and your final conclusion.
• When reporting confidence intervals, be sure to give your answer as an interval: (lower value, higher value).
• Remember that all confidence levels are 95% and all significance levels are 0.05 unless otherwise noted.
• You should work on this case study on your own with no help from others, including tutors in the tutoring lab, your TA, or your fellow classmates.
You may ask for clarification about a question, but you may not ask TA’s if your answers are correct or complete. Students who do not follow this policy will be reported to the Committee on Academic Misconduct.
• Remember, all of the Case Study grades will count toward your final grade in the course. No Case Study scores will be dropped, so plan accordingly.

Case Study Assignment Example
STAT 1430 – Case Study #1 – This is the actual assignment from Autumn 2015
The goal of this case study is to take what you’ve learned and apply it to a real world data set.
On Carmen, under the “Case Studies” section, you will find an Excel data set called “CEO_Sal”. Use this data set to address the following questions.
1. We don’t know how this data was collected. Explain briefly (2-3 sentences) why this is a problem when it comes to using this data set for inference.
2. Assume moving forward that we know that the data was produced using a valid simple random sampling method. Further assume that the data were collected on 60 CEO’s of publicly traded companies here in Columbus. Find the following pieces of information:
a. What is the mean age of the CEO’s? The standard deviation of ages? The median age? The IQR of the ages?
b. What is the mean salary of the CEO’s? The standard deviation of salary? The median salary? The IQR of the salary?
c. Find the statistics needed to make a boxplot for ages. (Do not make the boxplot.)
d. Create a histogram of CEO ages, selecting the number of bins that you feel are appropriate for the data. Briefly describe the histogram. Based on the graph, which do you feel is the more appropriate measure of central tendency – the mean or the median? Include the histogram in your report.
e. Create a histogram of the CEO salaries, selecting the number of bins that you feel are appropriate for the data. Based on the graph, which do you feel is the more appropriate measure of central tendency – the mean or the median? Include the histogram in your report.
3.
Briefly discuss any interesting results (2-3 sentences). These can be any details about the statistics you created above which stand out to you. Use the statistical knowledge you have gained in class to pick out specific details.

Case Study Solutions/Rubric Example
STAT 1430 – Case Study #1 – SOLUTIONS/RUBRIC
1. Not knowing how the data was collected means we have no way of knowing what kinds of bias may or may not be present. This means that any information we get from the data should not be considered conclusive. (In statistical terms, we can’t rely on the data for valid inference.)
(WORTH 3 POINTS)
• Just one sentence, +1 point
• 2-3 sentences, +1 point
• Mention bias or randomness in any way, +1 point
2. See answer for each part below:
a. The mean age of CEO’s in Columbus, Ohio for this data set was about 51.47 years. The CEO ages had a standard deviation of about 8.92 years. The median age of CEO’s in Columbus, Ohio for this data set was 50 years. The interquartile range of ages for CEO’s was 11.25 years.
(WORTH 2 POINTS)
• All four numbers (mean, SD, median, and IQR), +1
  o Missing any of the numbers, no point
• Included units for all the numbers, or indicated in some way that the units for each number are the same, +1
  o If they only list units once or twice, no point
b. The mean salary of CEO’s in Columbus, Ohio for this data set was about $404,169. The CEO salaries had a standard deviation of about $220,534. The median salary of CEO’s in Columbus, Ohio for this data set was $350,000. The interquartile range of salaries for CEO’s was $289,500.
Not graded
c. The youngest CEO (the minimum age in the data set) was 32 years old. 25% of the CEO’s (the first quartile) were 45.75 years of age or younger. 50% of the CEO’s (the median) were 50 years of age or younger. 75% of the CEO’s (the third quartile) were 57 years old or younger.
The oldest CEO (the maximum age in the data set) was 74 years old.
(WORTH 2 POINTS)
• All 5 numbers (min, Q1, med, Q3, max), +1
  o If they are missing any of the numbers, no point
• Included units for all the numbers, or indicated in some way that the units for each number are the same, +1
  o If they only list units once or twice, no point
d.
[Histogram: “Ages of CEO’s in Columbus, n=60”; x-axis: Ages, in years (35 to 75 in steps of 5); y-axis: Number of CEO’s (0 to 20)]
This histogram is approximately symmetric. It has a relatively small spread, since the data values are all clustered fairly close to the mean and median ages. Since the data are approximately symmetric, either the mean or the median can be used as a measure of central tendency. Most people, however, will use the mean in this case since most individuals can readily understand what a mean is.
** Note to students: If you used fewer bins, it is likely that you got a histogram that appeared skewed, like the histogram below. However, the descriptive statistics you found in part (a) above should have told you that the histogram should look symmetric. Furthermore, your histograms should have had appropriate titles which included the sample size. The axes should have been clearly labeled. Your histogram should not have had gaps between the bars.
Finally, there should have been no “More” bin or empty bins in your histogram.
Histogram with too few bins that looks skewed:
[Histogram: “Ages of CEO’s in Columbus, n=60”; x-axis: Ages, in years (40 to 80 in steps of 10); y-axis: Number of CEO’s (0 to 30)]
Histogram that Excel creates, including bins that Excel chose, without any modifications:
[Histogram: default Excel chart titled “Histogram”; x-axis: Bin (32, 38, 44, 50, 56, 62, 68, More); y-axis: Frequency (0 to 20)]
(WORTH 10 points)
The write up is worth 2 points:
– Mentioned shape in any way, +1 point
– Mentioned center or spread in any way, +1 point
Having a histogram is worth 8 points:
– No gaps, +1
– Appropriate title, +1 point (Histogram alone is NOT appropriate)
– Appropriate y-axis name, +1 point (Frequency alone is NOT appropriate)
– Appropriate x-axis name, +1 point (Bins is NOT appropriate)
– Sample size, +1 point (can be in the write up, too)
– Symmetric, +3 points
– OR
– Skewed, +2 points
– OR
– Default Excel histogram, without modifying the graph in any way, +1 point
e. Create a histogram of the CEO salaries, selecting the number of bins that you feel are appropriate for the data. Based on the graph, which do you feel is the more appropriate measure of central tendency – the mean or the median? Include the histogram in your report.
Based on the histograms, the median salary would be the more appropriate value to report, since the data is skewed.
There are several options of histograms below. The first has bin widths of $50,000; the second has bin widths of $100,000; the third has bin widths of $200,000; the last is the default histogram created in Excel when the user does not specify the bins. ** It is important to note that the data set was missing data for one of the CEO’s, and therefore the sample size for these histograms is smaller than the sample size for the ages. ** Each of the histograms looks right skewed, because there are a few higher salaries in the data set. The histograms also appear to be quite spread out.
For the measures of center, see part (b) above.
Notice that in the third histogram (the one with the fewest bins), while the distribution looks skewed, it does not show the large gap (i.e., the part of the histogram where there are no columns for salaries ranging from $900,000 – $1,100,000 in the first graph and for salaries ranging from $1,000,000 – $1,100,000 in the second). This is an important omission, and for that reason a histogram with so few columns is not the best to use to report on the distribution of salaries.
The first graph also shows that a fairly high number of CEO’s receive salaries between $500,000 and $550,000, as well as between $700,000 and $750,000, at least when you look at the surrounding columns. This is not as apparent in the second histogram.
[Histogram 1: “Salaries of CEO’s in Columbus, n=59”; x-axis: Annual Salary, in Thousands of Dollars (50 to 1150 in steps of 50); y-axis: Number of CEO’s (0 to 10)]
[Histogram 2: “Salaries of CEO’s in Columbus, n=59”; x-axis: Annual Salaries, in Thousands of Dollars (100 to 1200 in steps of 100); y-axis: Number of CEO’s (0 to 20)]
[Histogram 3: “Salaries of CEO’s in Columbus, n=59”; x-axis: Annual Salary, in Thousands of Dollars (200 to 1200 in steps of 200); y-axis: Number of CEO’s (0 to 35)]
[Histogram 4: default Excel chart titled “Histogram”; x-axis: Bin; y-axis: Frequency (0 to 25)]
Here the problem with the default options becomes apparent, as the decimal places that Excel chooses make the histogram difficult to understand.
Not graded
3. Briefly discuss any interesting results (2-3 sentences). These can be any details about the statistics you created above which stand out to you. Use the statistical knowledge you have gained in class to pick out specific details.
Answers here may vary.
One interesting element of the data is that the ages were fairly symmetric, which might be surprising since people typically tend to think that CEO’s are older individuals.
The minimum salary for the salary data set was much lower than some would expect, at $21,000/year.
The maximum salary was also very large, at $1,103,000/year. These outliers certainly affected the look of the histograms.
Another interesting facet of the data is that the standard deviation and interquartile range for the salaries were large, both of which indicate the large variation among the different salaries.
The missing data point is also of interest. It is reasonable to question how the results might change if the data for this 47 year old CEO could have been included.
(WORTH 3 PTS)
• Fewer than 2 sentences, +1.5 points
• At least 2-3 sentences, full credit

Case Study Student Examples
Below are examples of actual student responses submitted last semester, along with their corresponding grade.

Student #1 – Received 18/20
STAT 1430
Case Study #1
1. It is a problem that we don’t know how this data was collected because we don’t know for sure whether or not this was a random sample. We do not know if this sample is of men, women, or both and we also don’t know where, when, and why this data was collected. This is an obvious problem because without that information we have no reason for the data; it essentially tells us nothing but numbers listed as ages and numbers listed as salaries.
2.
a) The mean age of the CEO’s is 51.47
The SD of the ages is 8.92
The median age is 50
The IQR of the ages is 11.25
b) The mean salary is 404.17
The SD of the salaries is 220.53
The median salary is 350
The IQR of the salaries is 289.5
c) The minimum for ages is 32
The 1st quartile is 45.75
The 2nd quartile is 50
The 3rd quartile is 57
The maximum for ages is 74
d)
The histogram shows the ages of 60 CEO’s in the Columbus area. The ages range from 32 to 74 and are divided into 7 groups in increments of 7.
The most appropriate measure of central tendency for this graph would be the mean. The data is not affected by outliers and has a fairly low variability, both good when using the mean to interpret data.
The mode would not be appropriate, even though it provides a fairly similar result, because it does not take in to consideration the ages of younger or older CEO’s.
e)
The histogram shows the salaries of 60 CEO’s in the Columbus area. Salaries range from $21,000 to $1,221,000 and the data is divided into 9 groups with increments of 150.
[Student’s histogram for part d): “60 CEO Ages”; x-axis: Ages (32 to 74 in steps of 7); y-axis: # of CEO’s (0 to 25)]
[Student’s histogram for part e): “Salaries of 60 CEO’s”; x-axis: Salary (thousands of dollars) (21 to 1221 in steps of 150); y-axis: # of CEO’s (0 to 25)]
The mode would be a better measurement of central tendency for this graph because the mean becomes affected by the outliers. The mode will give us the most common salary of CEO’s and will give us a general idea of what the average would be without outliers.
3. I found it interesting that the ages of the CEO’s were fairly symmetrical and were also not skewed. While the salaries also had some symmetry, they were affected by outliers, skewing the data to the right. It would be interesting to see a scatterplot of this data to see if there is any correlation between the age of the CEO’s and their salaries. Taking a quick glance at the data would give you the thought that there is not much correlation considering the maximum salary of $1,221,000 belongs to a 57 year old CEO, which is closer to the mean of the ages.

Student #2 – Received 14/20
Statistics 1430
Case Study 1
September 14, 2015
1. The problem that arises when we do not know how the data was collected is it allows for a lot of variables that could change the data. This becomes a problem when interpreting the data because it could very well extremely skewed by the data collecting process.
2. A. The mean is 51.5, the standard deviation is 8.9, the median is 50, and the IQR is 11.3.
B. The mean is 404.2 thousands of dollars, the standard deviation is 220.53 thousands of dollars, the median is 350 thousands of dollars, and the IQR is 289.5 thousands of dollars.
C.
D.
This histogram is a fairly good bell curve representation with a slight skew right.
The best item to use would be the mean.
[Student’s histogram for part D: “CEO Ages”; x-axis: Ages (32 to 74 in steps of 3); y-axis: Frequency (0 to 12)]
E.
This histogram shows the CEO salaries and is an example of a skewed right histogram. The best describer is the median.
[Student’s histogram for part E: “CEO Salaries”; x-axis: Salary (21 to 1121 in steps of 100); y-axis: Frequency (0 to 15)]
3. I saw that the average age was a lower than I would have expected. As well as the difference between the highest and lowest salaries was substantially more than I thought.
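
Supplementary sketch (not part of any student response): the descriptive statistics requested in question 2 (mean, standard deviation, median, IQR, the five-number summary, and histogram bin counts) can also be checked outside Excel. The Python sketch below is one possible way to do so, under stated assumptions. The ages in it are made-up illustrative values, NOT the CEO_Sal data, and the function names are hypothetical. It assumes quartiles are computed by inclusive linear interpolation, which is consistent with fractional quartiles such as 45.75 in the solutions but is not confirmed by this guide, and it assumes the sample standard deviation (dividing by n - 1).

```python
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers.

    Assumption: quartiles use the "inclusive" linear-interpolation method.
    """
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return data[0], q1, median, q3, data[-1]


def describe(values):
    """Return the mean, sample SD, median, and IQR in a dictionary."""
    _, q1, median, q3, _ = five_number_summary(values)
    return {
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),  # sample SD (divides by n - 1)
        "median": median,
        "iqr": q3 - q1,
    }


def bin_counts(values, start, width, n_bins):
    """Count values in equal-width bins [start + i*width, start + (i+1)*width).

    Raises ValueError if a value falls outside the bins, so nothing is
    silently dropped or lumped into an overflow ("More") bin.
    """
    counts = [0] * n_bins
    for v in values:
        i = int((v - start) // width)
        if not 0 <= i < n_bins:
            raise ValueError(f"{v} falls outside the chosen bins")
        counts[i] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical ages, in years (NOT the CEO_Sal data).
    ages = [32, 41, 45, 46, 50, 50, 53, 57, 58, 74]
    print(five_number_summary(ages))
    print(describe(ages))
    print(bin_counts(ages, start=30, width=10, n_bins=5))
```

Changing `width` and `n_bins` in `bin_counts` shows how the choice of bins changes the apparent shape of a histogram, which is the point made in the note to students in part (d) of the solutions. Any numbers this sketch prints still need units and full sentences when reported in a case study.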