Fall23 Barry Grant Bioinformatics
DAR, A69026881
candy <- read.csv("~/Desktop/candy-data.csv", row.names=1)
dim(candy)
[1] 85 12
sum(candy$fruity)
[1] 38
Q1: There are 85 kinds of candy in the dataset.
Q2: 38 of them are fruity.
candy["Smarties", ]$winpercent
[1] 45.99583
candy["Kit Kat", ]$winpercent
[1] 76.7686
candy["Tootsie Roll Snack Bars", ]$winpercent
[1] 49.6535
Q3: The win percent value for Smarties (my favorite candy) is 46.00%.
Q4: The win percent value for Kit Kat is 76.77%.
Q5: The win percent value for Tootsie Roll Snack Bars is 49.65%.
#install.packages("skimr")
library(skimr)
skim(candy)
Data summary
Property | Value |
---|---|
Name | candy |
Number of rows | 85 |
Number of columns | 12 |
Column type frequency: numeric | 12 |
Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
chocolate | 0 | 1 | 0.44 | 0.50 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▆ |
fruity | 0 | 1 | 0.45 | 0.50 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▆ |
caramel | 0 | 1 | 0.16 | 0.37 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▂ |
peanutyalmondy | 0 | 1 | 0.16 | 0.37 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▂ |
nougat | 0 | 1 | 0.08 | 0.28 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
crispedricewafer | 0 | 1 | 0.08 | 0.28 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
hard | 0 | 1 | 0.18 | 0.38 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▂ |
bar | 0 | 1 | 0.25 | 0.43 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▂ |
pluribus | 0 | 1 | 0.52 | 0.50 | 0.00 | 0.00 | 1.00 | 1.00 | 1.00 | ▇▁▁▁▇ |
sugarpercent | 0 | 1 | 0.48 | 0.28 | 0.01 | 0.22 | 0.47 | 0.73 | 0.99 | ▇▇▇▇▆ |
pricepercent | 0 | 1 | 0.47 | 0.29 | 0.01 | 0.26 | 0.47 | 0.65 | 0.98 | ▇▇▇▇▆ |
winpercent | 0 | 1 | 50.32 | 14.71 | 22.45 | 39.14 | 47.83 | 59.86 | 84.18 | ▃▇▆▅▂ |
Q6: winpercent is on a different scale: it runs from 0 to 100, while every other variable runs from 0 to 1.
Q7: In the chocolate column, a zero means the candy does not contain chocolate and a one means it does.
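The scale difference in Q6 can be checked by looking at the range of every column. The sketch below uses a small made-up data frame (`toy` and its values are hypothetical, not the candy data) and rescales the 0-100 column to 0-1 by dividing by 100.

```r
# Hypothetical toy data: two 0-1 columns and one 0-100 column
toy <- data.frame(
  chocolate    = c(1, 0, 1, 0),
  sugarpercent = c(0.72, 0.05, 0.55, 0.31),
  winpercent   = c(84.2, 24.5, 81.6, 45.9)
)

# range() per column: row 1 is the minimum, row 2 the maximum
col_ranges <- sapply(toy, range)
print(col_ranges)

# Dividing by 100 puts winpercent on the same 0-1 scale as the others
toy$winpercent01 <- toy$winpercent / 100
print(range(toy$winpercent01))
```

Note that `prcomp()` with scaling turned on standardizes every column anyway, so this rescaling mostly matters for plots and for side-by-side summaries.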
hist(candy$winpercent)
Q9: The distribution of winpercent is not symmetrical.
Q10: The center of the distribution is below 50% (the median is 47.83, although the mean is 50.32).
choc.ind <- as.logical(candy$chocolate)
fruit.ind <- as.logical(candy$fruity)
choc.win <- candy[choc.ind, ]$winpercent
fruit.win <- candy[fruit.ind, ]$winpercent
mean(choc.win)
[1] 60.92153
mean(fruit.win)
[1] 44.11974
Q11: Yes. On average, chocolate candy scores about 61% compared to about 44% for fruity candy.
t.test(choc.win, fruit.win)
Welch Two Sample t-test
data: choc.win and fruit.win
t = 6.2582, df = 68.882, p-value = 2.871e-08
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
11.44563 22.15795
sample estimates:
mean of x mean of y
60.92153 44.11974
Q12: Yes, the difference is statistically significant: p = 2.871e-08, which is well below 0.05.
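To see what `t.test()` reports, the Welch statistic, degrees of freedom and p-value can be computed by hand. This is a minimal sketch on made-up vectors (`x` and `y` are hypothetical, not the candy win percentages); it relies only on the fact that `t.test()` defaults to the unequal-variance (Welch) test.

```r
# Hypothetical toy win percentages for two groups (not the candy data)
x <- c(60, 72, 55, 81, 66)
y <- c(40, 45, 52, 38, 47, 43)

# Per-group variance of the mean
vx <- var(x) / length(x)
vy <- var(y) / length(y)

# Welch's t statistic: difference in means over the unpooled standard error
t_manual <- (mean(x) - mean(y)) / sqrt(vx + vy)

# Welch-Satterthwaite approximation for the degrees of freedom
df_manual <- (vx + vy)^2 / (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))

# Two-sided p-value from the t distribution
p_manual <- 2 * pt(-abs(t_manual), df = df_manual)

# t.test() uses the Welch version by default (var.equal = FALSE)
res <- t.test(x, y)
print(c(manual = t_manual, t.test = unname(res$statistic)))
print(c(manual = p_manual, t.test = res$p.value))
```

The non-integer degrees of freedom in the output above (df = 68.882) come from this same approximation.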
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.3 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.4.4 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.0
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
candy %>% arrange(winpercent) %>% head(5)
chocolate fruity caramel peanutyalmondy nougat
Nik L Nip 0 1 0 0 0
Boston Baked Beans 0 0 0 1 0
Chiclets 0 1 0 0 0
Super Bubble 0 1 0 0 0
Jawbusters 0 1 0 0 0
crispedricewafer hard bar pluribus sugarpercent pricepercent
Nik L Nip 0 0 0 1 0.197 0.976
Boston Baked Beans 0 0 0 1 0.313 0.511
Chiclets 0 0 0 1 0.046 0.325
Super Bubble 0 0 0 0 0.162 0.116
Jawbusters 0 1 0 1 0.093 0.511
winpercent
Nik L Nip 22.44534
Boston Baked Beans 23.41782
Chiclets 24.52499
Super Bubble 27.30386
Jawbusters 28.12744
candy %>% arrange(desc(winpercent)) %>% head(5)
chocolate fruity caramel peanutyalmondy nougat
Reese's Peanut Butter cup 1 0 0 1 0
Reese's Miniatures 1 0 0 1 0
Twix 1 0 1 0 0
Kit Kat 1 0 0 0 0
Snickers 1 0 1 1 1
crispedricewafer hard bar pluribus sugarpercent
Reese's Peanut Butter cup 0 0 0 0 0.720
Reese's Miniatures 0 0 0 0 0.034
Twix 1 0 1 0 0.546
Kit Kat 1 0 1 0 0.313
Snickers 0 0 1 0 0.546
pricepercent winpercent
Reese's Peanut Butter cup 0.651 84.18029
Reese's Miniatures 0.279 81.86626
Twix 0.906 81.64291
Kit Kat 0.511 76.76860
Snickers 0.651 76.67378
Q13: The least liked types of candy: Nik L Nip, Boston Baked Beans, Chiclets, Super Bubble, and Jawbusters
Q14: The most liked types of candy: Reese’s Peanut Butter cup, Reese’s Miniatures, Twix, Kit Kat, and Snickers
ggplot(candy) +
aes(winpercent, rownames(candy)) +
geom_col()
ggplot(candy) +
aes(winpercent, reorder(rownames(candy), winpercent)) +
geom_col()
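# Start with black for every candy, then recolor by type.
# Later assignments overwrite earlier ones, so fruity wins over bar, and bar over chocolate.
my_cols <- rep("black", nrow(candy))
my_cols[as.logical(candy$chocolate)] <- "chocolate"
my_cols[as.logical(candy$bar)] <- "brown"
my_cols[as.logical(candy$fruity)] <- "pink"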
ggplot(candy) +
aes(winpercent, reorder(rownames(candy),winpercent)) +
geom_col(fill=my_cols)
Q17: The worst-ranked chocolate candy is Sixlets.
Q18: The best-ranked fruity candy is Starburst.
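Answers like these can also be pulled out in code rather than read off the bar plot. The sketch below uses only five rows copied from the printed output earlier in this report, so its result differs from the full-data answers (Sixlets and Starburst are not among these rows); the object names `sub`, `choc` and `fruit` are made up for the example.

```r
# Five rows copied from the arrange() output; the full dataset has 85
sub <- data.frame(
  chocolate  = c(1, 1, 0, 0, 0),
  fruity     = c(0, 0, 1, 1, 1),
  winpercent = c(84.18029, 76.76860, 22.44534, 24.52499, 27.30386),
  row.names  = c("Reese's Peanut Butter cup", "Kit Kat",
                 "Nik L Nip", "Chiclets", "Super Bubble")
)

# Split by type using the 0/1 indicator columns
choc  <- sub[as.logical(sub$chocolate), ]
fruit <- sub[as.logical(sub$fruity), ]

# Lowest-ranked chocolate and highest-ranked fruity candy in this subset
worst_choc <- rownames(choc)[which.min(choc$winpercent)]
best_fruit <- rownames(fruit)[which.max(fruit$winpercent)]
print(worst_choc)
print(best_fruit)
```

Running the same three steps on the full `candy` data frame should give the answers to Q17 and Q18 directly.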
library(ggrepel)
# How about a plot of price vs win
ggplot(candy) +
aes(winpercent, pricepercent, label=rownames(candy)) +
geom_point(col=my_cols) +
geom_text_repel(col=my_cols, size=3.3, max.overlaps = 15)
Warning: ggrepel: 1 unlabeled data points (too many overlaps). Consider
increasing max.overlaps
Q19: Tootsie Roll Midgies
ord <- order(candy$pricepercent, decreasing = TRUE)
head(candy[ord, c("pricepercent", "winpercent")], n = 5)
pricepercent winpercent
Nik L Nip 0.976 22.44534
Nestle Smarties 0.976 37.88719
Ring pop 0.965 35.29076
Hershey's Krackel 0.918 62.28448
Hershey's Milk Chocolate 0.918 56.49050
Q20: Five most expensive candies: Nik L Nip, Nestle Smarties, Ring Pop, Hershey’s Krackel, and Hershey’s Milk Chocolate. Nik L Nip is the least popular of these 5.
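The "least popular of the five" part of Q20 can be confirmed with `which.min()`. This sketch rebuilds the five rows printed above as a small data frame (`top5` is a name made up for the example).

```r
# The five most expensive candies, copied from the output above
top5 <- data.frame(
  pricepercent = c(0.976, 0.976, 0.965, 0.918, 0.918),
  winpercent   = c(22.44534, 37.88719, 35.29076, 62.28448, 56.49050),
  row.names    = c("Nik L Nip", "Nestle Smarties", "Ring pop",
                   "Hershey's Krackel", "Hershey's Milk Chocolate")
)

# which.min() returns the position of the smallest winpercent
least_popular <- rownames(top5)[which.min(top5$winpercent)]
print(least_popular)
```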
# Make a lollipop chart of pricepercent
ggplot(candy) +
aes(pricepercent, reorder(rownames(candy), pricepercent)) +
geom_segment(aes(yend = reorder(rownames(candy), pricepercent),
xend = 0), col="gray40") +
geom_point()
#install.packages("corrplot")
library(corrplot)
corrplot 0.92 loaded
cij <- cor(candy)
corrplot(cij)
Q22: fruity and chocolate are anti-correlated.
Q23: chocolate and winpercent are the most positively correlated.
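Reading the strongest pairs off a correlation plot by eye is error-prone, so here is one way to extract them from the matrix itself. It is a sketch on made-up data (`toy` and its columns `a` to `d` are hypothetical); the same steps should apply to `cor(candy)`.

```r
# Hypothetical toy data: b falls as a rises, c rises with a
toy <- data.frame(
  a = c(1, 2, 3, 4, 5),
  b = c(5, 4, 3, 2, 1.5),
  c = c(1.1, 1.9, 3.2, 3.9, 5.1),
  d = c(2, 1, 4, 3, 5)
)
cij_toy <- cor(toy)

# Blank out the diagonal and lower triangle so each pair is counted once
cij_toy[lower.tri(cij_toy, diag = TRUE)] <- NA

# Locate the most negative and most positive remaining entries
lo <- which(cij_toy == min(cij_toy, na.rm = TRUE), arr.ind = TRUE)
hi <- which(cij_toy == max(cij_toy, na.rm = TRUE), arr.ind = TRUE)

most_negative <- c(rownames(cij_toy)[lo[1, "row"]], colnames(cij_toy)[lo[1, "col"]])
most_positive <- c(rownames(cij_toy)[hi[1, "row"]], colnames(cij_toy)[hi[1, "col"]])
print(most_negative)
print(most_positive)
```

Masking the diagonal matters: every variable correlates perfectly with itself, so without it the "most positive pair" would always be a variable paired with itself.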
pca <- prcomp(candy, scale. = TRUE)
summary(pca)
Importance of components:
PC1 PC2 PC3 PC4 PC5 PC6 PC7
Standard deviation 2.0788 1.1378 1.1092 1.07533 0.9518 0.81923 0.81530
Proportion of Variance 0.3601 0.1079 0.1025 0.09636 0.0755 0.05593 0.05539
Cumulative Proportion 0.3601 0.4680 0.5705 0.66688 0.7424 0.79830 0.85369
PC8 PC9 PC10 PC11 PC12
Standard deviation 0.74530 0.67824 0.62349 0.43974 0.39760
Proportion of Variance 0.04629 0.03833 0.03239 0.01611 0.01317
Cumulative Proportion 0.89998 0.93832 0.97071 0.98683 1.00000
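The "Proportion of Variance" row is each squared standard deviation divided by the total of the squared standard deviations. As a worked check, the sketch below retypes the standard deviations from the summary above (so they carry the rounding of the printed output) and recomputes the first few proportions.

```r
# Standard deviations copied from the summary(pca) output above
sdev <- c(2.0788, 1.1378, 1.1092, 1.07533, 0.9518, 0.81923,
          0.81530, 0.74530, 0.67824, 0.62349, 0.43974, 0.39760)

# With scaled data the variances sum to the number of variables (12 here)
total_var <- sum(sdev^2)
print(total_var)

# Proportion and cumulative proportion of variance per component
var_explained <- sdev^2 / total_var
cum_explained <- cumsum(var_explained)
print(round(var_explained[1:3], 4))
print(round(cum_explained[1:3], 4))
```

Up to rounding, the results match the summary table: PC1 captures roughly 36% of the variance and the first two components together roughly 47%.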
plot(pca$x[,1], pca$x[,2])
plot(pca$x[,1:2], col=my_cols, pch=16)
#ggplot
my_data <- cbind(candy, pca$x[,1:3])
p <- ggplot(my_data) +
aes(x=PC1, y=PC2,
size=winpercent/100,
text=rownames(my_data),
label=rownames(my_data)) +
geom_point(col=my_cols)
p
library(ggrepel)
p + geom_text_repel(size=3.3, col=my_cols, max.overlaps = 7) +
theme(legend.position = "none") +
labs(title="Halloween Candy PCA Space",
subtitle="Colored by type: chocolate bar (dark brown), chocolate other (light brown), fruity (pink), other (black)",
caption="Data from 538")
Warning: ggrepel: 39 unlabeled data points (too many overlaps). Consider
increasing max.overlaps
#install.packages("plotly")
library(plotly)
Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':
last_plot
The following object is masked from 'package:stats':
filter
The following object is masked from 'package:graphics':
layout
#ggplotly(p)
par(mar=c(8,4,2,2))
barplot(pca$rotation[,1], las=2, ylab="PC1 Contribution")
Q24: The variables fruity, pluribus, and hard are picked up strongly by PC1 in the positive direction. This makes sense: fruity candies are often hard and tend to come as multiple pieces in a bag or box.
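Sorting the PC1 loadings makes it easier to see which variables sit at each end of the bar plot. This is a sketch on simulated data (`toy`, `pca_toy` and the columns `x`, `y`, `z` are made up for the example). The sign of a principal component is arbitrary, so only the grouping of variables by sign is meaningful, not which group happens to be positive.

```r
# Simulated toy data: y moves with x, z moves against x
set.seed(1)
toy <- data.frame(x = rnorm(30))
toy$y <- toy$x + rnorm(30, sd = 0.3)
toy$z <- -toy$x + rnorm(30, sd = 0.3)

pca_toy <- prcomp(toy, scale. = TRUE)

# Loadings on PC1, sorted from most positive to most negative
loadings_pc1 <- sort(pca_toy$rotation[, 1], decreasing = TRUE)
print(loadings_pc1)
```

Variables that vary together (here `x` and `y`) end up with the same sign, and variables that oppose them (`z`) get the opposite sign, which is the pattern Q24 describes for fruity, pluribus and hard versus the chocolate-related variables.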