library(tidyverse) #load ggplot2 and the other tidyverse packages
## ── Attaching packages ─────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.0     ✓ purrr   0.3.3
## ✓ tibble  2.1.3     ✓ dplyr   0.8.4
## ✓ tidyr   1.0.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.4.0
## ── Conflicts ────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Data Origins

The data set was retrieved from Kaggle, published by a user named ‘Jakki’.

It contains the math, reading and writing marks of high school students in the United States of America, along with a range of variables: sex, race/ethnicity, parents' level of education, lunch type (standard or free/reduced) and whether they completed a test preparation course. The variable names are largely self-explanatory.

No other specifics have been stated regarding the data and its origins.

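df <- read.csv("StudentsPerformance.csv", header = TRUE, sep = ',', stringsAsFactors = TRUE) #load data; text columns are read as factors so summary() shows counts (needed in R 4.0 and later)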
head(df,5) #show first 5 rows of raw data
summary(df) #show summary of all raw data
##     gender    race.ethnicity     parental.level.of.education          lunch    
##  female:518   group A: 89    associate's degree:222          free/reduced:355  
##  male  :482   group B:190    bachelor's degree :118          standard    :645  
##               group C:319    high school       :196                            
##               group D:262    master's degree   : 59                            
##               group E:140    some college      :226                            
##                              some high school  :179                            
##  test.preparation.course   math.score     reading.score    writing.score   
##  completed:358           Min.   :  0.00   Min.   : 17.00   Min.   : 10.00  
##  none     :642           1st Qu.: 57.00   1st Qu.: 59.00   1st Qu.: 57.75  
##                          Median : 66.00   Median : 70.00   Median : 69.00  
##                          Mean   : 66.09   Mean   : 69.17   Mean   : 68.05  
##                          3rd Qu.: 77.00   3rd Qu.: 79.00   3rd Qu.: 79.00  
##                          Max.   :100.00   Max.   :100.00   Max.   :100.00

Research Question

I will be looking at the differences between these scores in terms of the sex of the student, particularly which students prepared for the tests and what the outcome of this was in terms of average score. The average score was used rather than the math, reading and writing scores separately, as it was not specified whether the preparation course related to a specific subject.

Data Preparation

namesOfColumns <- c("sex", "race_ethnicity", "parent_lvl_education", "lunch", "test_prep","math","reading","writing")
colnames(df) <- namesOfColumns #change column names
df$avg_score <- rowMeans(df[, c("math", "reading", "writing")]) #create column for average of math, reading and writing score
head(df, 5) #show first 5 rows of processed data
summary(df) #summarise processed data
##      sex      race_ethnicity         parent_lvl_education          lunch    
##  female:518   group A: 89    associate's degree:222       free/reduced:355  
##  male  :482   group B:190    bachelor's degree :118       standard    :645  
##               group C:319    high school       :196                         
##               group D:262    master's degree   : 59                         
##               group E:140    some college      :226                         
##                              some high school  :179                         
##      test_prep        math           reading          writing      
##  completed:358   Min.   :  0.00   Min.   : 17.00   Min.   : 10.00  
##  none     :642   1st Qu.: 57.00   1st Qu.: 59.00   1st Qu.: 57.75  
##                  Median : 66.00   Median : 70.00   Median : 69.00  
##                  Mean   : 66.09   Mean   : 69.17   Mean   : 68.05  
##                  3rd Qu.: 77.00   3rd Qu.: 79.00   3rd Qu.: 79.00  
##                  Max.   :100.00   Max.   :100.00   Max.   :100.00  
##    avg_score     
##  Min.   :  9.00  
##  1st Qu.: 58.33  
##  Median : 68.33  
##  Mean   : 67.77  
##  3rd Qu.: 77.67  
##  Max.   :100.00
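As a quick sanity check on the averaging step, the sketch below applies the same rowMeans() call to a tiny data frame and compares it with the mean worked out by hand. The data frame `toy` and its values are made up for illustration and are not taken from StudentsPerformance.csv.

```r
# Toy scores, invented for illustration only (not rows from the Kaggle file)
toy <- data.frame(
  math    = c(60, 45, 88),
  reading = c(66, 51, 92),
  writing = c(63, 48, 90)
)

# Same averaging step as in the data preparation, selecting columns by name
toy$avg_score <- rowMeans(toy[, c("math", "reading", "writing")])

# Work the average out by hand and confirm both approaches agree
manual <- (toy$math + toy$reading + toy$writing) / 3
stopifnot(isTRUE(all.equal(toy$avg_score, manual)))

print(toy)
```

Selecting the columns by name rather than by position means the calculation still works if the column order changes.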

Visualisation

ylab <- 'Average score across tests'
xlab <- 'Sex of student'
#boxplots of average score by sex, coloured by test preparation, with the raw points jittered on top
colbp <- ggplot(df, aes(x = sex, y = avg_score, colour = test_prep)) +
  xlab(xlab) +
  ylab(ylab) +
  geom_boxplot() +
  geom_jitter(position = position_jitter(0.2), size = 0.4) +
  ggtitle('Average test scores of female and male students by test preparation')
colbp
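One thing to note about this kind of plot: the boxes are dodged side by side within each sex, but position_jitter() scatters the points around the centre of each sex, so the points do not line up with their own box. Outliers are also drawn twice, once by the boxplot and once by the jitter layer. A possible variant, sketched here on a made-up data frame `toy` (hypothetical values, not the Kaggle data), uses position_jitterdodge() and hides the boxplot's own outlier markers.

```r
library(ggplot2)

# Toy data, invented for illustration only
toy <- data.frame(
  sex       = rep(c("female", "male"), each = 4),
  test_prep = rep(c("completed", "none"), times = 4),
  avg_score = c(78, 70, 82, 65, 74, 66, 71, 60)
)

p <- ggplot(toy, aes(x = sex, y = avg_score, colour = test_prep)) +
  # outlier.shape = NA hides the boxplot outliers so points are not drawn twice
  geom_boxplot(outlier.shape = NA) +
  # jitter the points and dodge them to sit over their own box
  geom_jitter(
    position = position_jitterdodge(jitter.width = 0.2, dodge.width = 0.75),
    size = 0.4
  ) +
  xlab("Sex of student") +
  ylab("Average score across tests")

p
```

The dodge width of 0.75 matches the default dodge used by geom_boxplot(), which is what keeps the points and boxes aligned.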

Summary

One issue with the data set is that it is vague in many ways, from its origin to the details of the data itself.

In terms of average scores, test preparation is handy, which is no surprise, and females and males are similar in their test preparation, in the case of this data set at least.

To further this analysis, other variables should be looked into, such as the effect (if there is one) of parental level of education.
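The plot shows this visually; one way the same comparison could be backed with numbers is sketched below, using group means and the share of each sex that completed the course. The data frame `toy` is made up for illustration, so its values say nothing about the real data set. The column names follow the renamed columns used above.

```r
# Toy data, invented for illustration only (not the Kaggle data)
toy <- data.frame(
  sex       = c("female", "female", "female", "female",
                "male", "male", "male", "male"),
  test_prep = c("completed", "completed", "none", "none",
                "completed", "completed", "none", "none"),
  avg_score = c(80, 76, 70, 66, 74, 70, 64, 60)
)

# Mean average score for each combination of sex and test preparation
group_means <- aggregate(avg_score ~ sex + test_prep, data = toy, FUN = mean)
print(group_means)

# Share of each sex that completed the course (each row sums to 1)
prep_rates <- prop.table(table(toy$sex, toy$test_prep), margin = 1)
print(prep_rates)
```

Running the same two calls on `df` instead of `toy` would give the figures for the actual students.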