library(tidyverse) #load ggplot2 and the other tidyverse packages
## ── Attaching packages ─────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.0 ✓ purrr 0.3.3
## ✓ tibble 2.1.3 ✓ dplyr 0.8.4
## ✓ tidyr 1.0.2 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.4.0
## ── Conflicts ────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
The data set was retrieved from Kaggle, where it was published by a user named ‘Jakki’.
It contains the math, reading and writing marks of high school students in the United States of America, along with a range of variables: sex, race/ethnicity, parental level of education, lunch type, and whether the student completed a test preparation course. The variable names are largely self-explanatory.
No other specifics have been stated regarding the data and its origins.
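# load data; stringsAsFactors = TRUE keeps text columns as factors (no longer the
# default from R 4.0 onwards), so summary() reports counts per level as shown below
df <- read.csv("StudentsPerformance.csv", header = TRUE, sep = ",", stringsAsFactors = TRUE)
head(df, 5) #show first 5 rows of raw data
summary(df) #show summary of all raw data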
##   gender     race.ethnicity   parental.level.of.education        lunch
## female:518   group A: 89      associate's degree:222        free/reduced:355
## male  :482   group B:190      bachelor's degree :118        standard    :645
##              group C:319      high school       :196
##              group D:262      master's degree   : 59
##              group E:140      some college      :226
##                               some high school  :179
## test.preparation.course     math.score     reading.score    writing.score
## completed:358             Min.   :  0.00   Min.   : 17.00   Min.   : 10.00
## none     :642             1st Qu.: 57.00   1st Qu.: 59.00   1st Qu.: 57.75
##                           Median : 66.00   Median : 70.00   Median : 69.00
##                           Mean   : 66.09   Mean   : 69.17   Mean   : 68.05
##                           3rd Qu.: 77.00   3rd Qu.: 79.00   3rd Qu.: 79.00
##                           Max.   :100.00   Max.   :100.00   Max.   :100.00
I will look at how these scores differ by the sex of the student and, in particular, by whether students completed the test preparation course, using the average score as the outcome. The average of the three scores is used rather than math, reading and writing separately, because the data set does not specify whether the preparation course targeted a particular subject.
namesOfColumns <- c("sex", "race_ethnicity", "parent_lvl_education", "lunch", "test_prep", "math", "reading", "writing")
colnames(df) <- namesOfColumns #change column names
#create column for the unweighted mean of the math, reading and writing scores
df$avg_score <- rowMeans(df[, c("math", "reading", "writing")])
head(df, 5) #show first 5 rows of processed data
summary(df) #summarise processed data
##    sex       race_ethnicity    parent_lvl_education         lunch
## female:518   group A: 89      associate's degree:222   free/reduced:355
## male  :482   group B:190      bachelor's degree :118   standard    :645
##              group C:319      high school       :196
##              group D:262      master's degree   : 59
##              group E:140      some college      :226
##                               some high school  :179
##   test_prep          math           reading          writing
## completed:358   Min.   :  0.00   Min.   : 17.00   Min.   : 10.00
## none     :642   1st Qu.: 57.00   1st Qu.: 59.00   1st Qu.: 57.75
##                 Median : 66.00   Median : 70.00   Median : 69.00
##                 Mean   : 66.09   Mean   : 69.17   Mean   : 68.05
##                 3rd Qu.: 77.00   3rd Qu.: 79.00   3rd Qu.: 79.00
##                 Max.   :100.00   Max.   :100.00   Max.   :100.00
##   avg_score
## Min.   :  9.00
## 1st Qu.: 58.33
## Median : 68.33
## Mean   : 67.77
## 3rd Qu.: 77.67
## Max.   :100.00
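y_label <- 'Average score across tests'
x_label <- 'Sex of student'
#boxplots of average score by sex, coloured by test preparation status
colbp <- ggplot(df, aes(x = sex, y = avg_score, colour = test_prep)) +
  geom_boxplot() +
  #dodge the jittered points so each lands over the box of its own test_prep group
  geom_jitter(position = position_jitterdodge(jitter.width = 0.2), size = 0.4) +
  xlab(x_label) + ylab(y_label) +
  ggtitle('Average test scores of female and male students by test preparation status')
colbp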
One issue with the data set is that it is vague in many respects, from its origin to the details of the variables themselves.
In terms of averaged scores, test preparation helps, which is no surprise, and females and males are similar in their preparation behaviour, at least in this data set.
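These two observations can also be checked numerically rather than read off the plot. The sketch below is one way to do that, assuming the processed column names `sex`, `test_prep` and `avg_score`. The `toy` data frame and the helper names `mean_score_by_group` and `prep_rate_by_sex` are hypothetical and only make the example self-contained; the toy values are invented and are not taken from the Kaggle file.

```r
# Hypothetical toy data with the same column names as the processed df.
# The values are made up for illustration; they are NOT from the Kaggle file.
toy <- data.frame(
  sex       = factor(c("female", "female", "female", "male", "male", "male")),
  test_prep = factor(c("completed", "none", "none", "completed", "none", "none")),
  avg_score = c(80, 70, 60, 75, 65, 55)
)

# Mean of avg_score for every sex x test_prep combination.
mean_score_by_group <- function(data) {
  aggregate(avg_score ~ sex + test_prep, data = data, FUN = mean)
}

# Share of each sex that did / did not complete the course
# (margin = 1 makes each row of the table sum to 1).
prep_rate_by_sex <- function(data) {
  prop.table(table(data$sex, data$test_prep), margin = 1)
}

group_means <- mean_score_by_group(toy)
prep_rates  <- prep_rate_by_sex(toy)

print(group_means)
print(round(prep_rates, 2))
```

On the real data, calling `mean_score_by_group(df)` and `prep_rate_by_sex(df)` after the processing step would give the group means and completion rates that correspond to the boxplot.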
To take this analysis further, other variables should be examined, such as the effect (if there is one) of parental level of education.
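A possible starting point for that follow-up is to rank the levels of `parent_lvl_education` by mean `avg_score`. This is a minimal sketch that assumes the processed column names used earlier. `toy_edu` and `mean_score_by_education` are hypothetical names, and the values are invented purely for illustration.

```r
# Hypothetical toy data; the values are made up and NOT from the Kaggle file.
toy_edu <- data.frame(
  parent_lvl_education = factor(c("high school", "high school",
                                  "some college", "some college",
                                  "master's degree", "master's degree")),
  avg_score = c(60, 64, 68, 70, 74, 78)
)

# Mean avg_score per education level, sorted from highest to lowest mean.
mean_score_by_education <- function(data) {
  out <- aggregate(avg_score ~ parent_lvl_education, data = data, FUN = mean)
  out[order(out$avg_score, decreasing = TRUE), ]
}

edu_means <- mean_score_by_education(toy_edu)
print(edu_means)
```

A table like this only describes differences between group means. It does not establish that parental education causes them.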