Save the data file and the R script in the same directory, then set the
working directory to the source file location.
setwd(dirname(rstudioapi::getSourceEditorContext()$path))
Then, load the data.
solar_data <- read.csv("data/RES_STAT_Lab5_Data.csv")
head(solar_data) ## view the first part of the data frame.
## id user deci sex age kno1 kno2 kno3 kno4 kno5 kno6 kno7 kno8 inno att1 att2
## 1 1 regular residence 2 female 43 1 1 1 1 1 1 1 1 early majority 2 3
## 2 2 regular residence 2 female 51 1 1 1 1 1 1 1 1 laggards 4 4
## 3 3 regular residence 2 male 28 4 4 4 4 2 3 4 2 early adopters 2 5
## 4 4 regular residence 2 female 32 1 1 1 1 1 1 1 1 late majority 3 4
## 5 5 regular residence 2 male 38 4 4 4 3 3 3 3 4 innovators 4 4
## 6 6 regular residence 2 male 51 1 1 1 1 1 1 1 1 early adopters 5 5
## att3 att4 att5 att6 att7 att8 interest re_att1 re_att2 re_att3 re_att4 re_att5 re_att6
## 1 3 1 3 3 3 3 probably not 1 2 2 0 2 2
## 2 3 3 2 2 2 1 definitely not 3 3 2 2 1 1
## 3 5 3 4 5 4 4 likely to 1 4 4 2 3 4
## 4 4 4 5 5 4 3 probably will 2 3 3 3 4 4
## 5 4 3 5 5 5 3 likely to 3 3 3 2 4 4
## 6 3 2 5 5 5 4 definitely going to 4 4 2 1 4 4
## re_att7 re_att8 re_kno1 re_kno2 re_kno3 re_kno4 re_kno5 re_kno6 re_kno7 re_kno8 attitude knowledge
## 1 2 2 0 0 0 0 0 0 0 0 1.625 0.000
## 2 1 0 0 0 0 0 0 0 0 0 1.625 0.000
## 3 3 3 3 3 3 3 1 2 3 1 3.000 2.375
## 4 3 2 0 0 0 0 0 0 0 0 3.000 0.000
## 5 4 2 3 3 3 2 2 2 2 3 3.125 2.500
## 6 4 3 0 0 0 0 0 0 0 0 3.250 0.000
In this exercise, we will use the variable for knowledge about solar cell panels (knowledge). This variable was obtained by averaging the self-assessed knowledge scores from a number of …
For each X, its standardized score, z, can be calculated as \[z = \frac{X-\bar{X}}{s}\]
You can calculate the z-score step by step. First, the numerator
of the equation, \(X-\bar{X}\), is called
mean centering. That is, the mean of X is subtracted from
each value of X: X - mean(X).
The mean of X is now the “center” (0) of the distribution. Each \(X-\bar{X}\) represents how far that X is
from \(\bar{X}\), i.e., a deviation
score.
Then, the centered value is scaled by the SD of X:
(X - mean(X))/sd(X). The parentheses around the numerator are needed
because R evaluates division before subtraction. When a score is scaled
by the SD, a one-unit change in the z-score represents a one-SD change in
the raw score.
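The two steps above can be sketched on a small toy vector. The values in x are hypothetical and are not taken from the solar data set; the sketch only illustrates why the parentheses around the numerator matter.

```r
# Toy scores (hypothetical values, not from the solar data set)
x <- c(2, 4, 6, 8, 10)

centered <- x - mean(x)             # step 1: mean centering (deviation scores)
z_right  <- (x - mean(x)) / sd(x)   # step 2: centered values divided by the SD
z_wrong  <- x - mean(x) / sd(x)     # without parentheses, only mean(x) is divided by sd(x)

centered       # -4 -2  0  2  4
mean(z_right)  # 0, the new center of the distribution
sd(z_right)    # 1, up to rounding error
z_wrong        # not z-scores: the mean is not 0 and the SD is not 1
```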
To standardize a variable, calculate a z-score for
each row in that column. Let's store the calculated values in the
z_kno_m variable: z for z-scores, kno for knowledge,
and m for manual calculation.
solar_data$z_kno_m <- (solar_data$knowledge - mean(solar_data$knowledge))/sd(solar_data$knowledge)
head(solar_data$z_kno_m)
## [1] -0.837499 -0.837499 1.532611 -0.837499 1.657354 -0.837499
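scale() function
The scale() function centers each value by subtracting the
column mean. Then it scales the centered value by the column
standard deviation.
Let's create a variable named z_kno_f, where f stands for
function. Note that scale() returns a one-column matrix, which is why
the output below is printed with a [,1] column header.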
solar_data$z_kno_f <- scale(solar_data$knowledge)
head(solar_data$z_kno_f)
## [,1]
## [1,] -0.837499
## [2,] -0.837499
## [3,] 1.532611
## [4,] -0.837499
## [5,] 1.657354
## [6,] -0.837499
Are all values from the manual method equal to the values from
scale()?
all(solar_data$z_kno_m == solar_data$z_kno_f)
## [1] FALSE
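The exact comparison with == returns FALSE. This most likely reflects tiny floating-point differences between the two calculations, because the values printed above agree to six decimal places. For practical purposes, both methods lead to the same results.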
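Assuming the FALSE above comes from floating-point rounding (this cannot be confirmed without the data), a tolerance-based comparison with all.equal() is a common alternative to ==. The sketch below uses a hypothetical toy vector rather than the solar data.

```r
x <- c(0.3, 1.7, 2.9, 4.1, 8.6)   # hypothetical scores

z_manual <- (x - mean(x)) / sd(x)
z_scale  <- as.vector(scale(x))   # scale() returns a one-column matrix; drop it to a vector

# Compare within a small numerical tolerance rather than bit for bit
isTRUE(all.equal(z_manual, z_scale))

# Largest absolute difference between the two methods
max(abs(z_manual - z_scale))
```

The as.vector() call matters here: all.equal() also compares attributes, so a plain vector and a matrix are reported as different even when their values match.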
There are many methods to identify outliers. In this example, any values outside ±3 SD are considered outliers. Because z-scores are in a unit of SD, any z beyond ±3 will be marked as outliers.
outliers <- solar_data$z_kno_f > 3 | solar_data$z_kno_f < -3
solar_data[outliers, ]
## id user deci sex age kno1 kno2 kno3 kno4 kno5 kno6 kno7 kno8 inno att1 att2
## 167 211 regular residence 2 male 47 5 5 5 5 5 5 5 5 early adopters 4 4
## 264 332 regular residence 2 male 30 5 5 5 5 5 5 5 5 early adopters 5 5
## att3 att4 att5 att6 att7 att8 interest re_att1 re_att2 re_att3 re_att4 re_att5 re_att6 re_att7
## 167 5 4 5 5 5 4 likely to 3 3 4 3 4 4 4
## 264 5 5 5 5 5 5 probably will 4 4 4 4 4 4 4
## re_att8 re_kno1 re_kno2 re_kno3 re_kno4 re_kno5 re_kno6 re_kno7 re_kno8 attitude knowledge z_kno_m
## 167 3 4 4 4 4 4 4 4 4 3.5 4 3.154265
## 264 4 4 4 4 4 4 4 4 4 4.0 4 3.154265
## z_kno_f
## 167 3.154265
## 264 3.154265
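As a minimal sketch of the same ±3 SD rule on toy data, the condition z > 3 | z < -3 can also be written with abs(). The vector x is hypothetical; its last value is deliberately extreme.

```r
x <- c(rep(c(1, 2, 3), times = 6), 2, 15)   # hypothetical scores; 15 is extreme
z <- as.vector(scale(x))                    # z-scores as a plain vector

is_outlier <- abs(z) > 3    # equivalent to z > 3 | z < -3
which(is_outlier)           # position of the flagged value
x[is_outlier]               # the flagged raw score
```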
z_kno_f was calculated from the entire data set. This
would be appropriate if we were looking at a homogeneous group. However,
that is not the case here. Our data contain subgroups of innovators
(inno). Therefore, we should find outliers within
each group instead.
We will use ave() to calculate the z-scores by group.
ave() is normally used to calculate means for subsets of X.
However, you can set the argument FUN = to another
function, such as sd() or scale().
The first argument in ave() is the numeric object to be
used in the calculation (knowledge). The second argument is a grouping
variable (inno). The third argument, FUN =, is the
function to be applied to each factor level
(FUN = scale).
solar_data$z_kno_g <- ave(solar_data$knowledge, solar_data$inno, FUN = scale)
# wrong <- ave(solar_data$z_kno_f, solar_data$inno, FUN = scale)
head(solar_data$z_kno_g)
## [1] -0.9203839 -0.5926129 0.9015546 -0.7532102 1.4462915 -1.0470435
# head(wrong)
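To see what ave() returns, here is a small sketch with hypothetical scores and a hypothetical two-level grouping variable. The anonymous function wraps scale() in as.vector(), which is an illustrative choice to keep the result a plain numeric vector.

```r
score <- c(1, 2, 3, 10, 20, 30)            # hypothetical scores
group <- c("a", "a", "a", "b", "b", "b")   # hypothetical grouping variable

# Default FUN is mean: each value is replaced by the mean of its own group
ave(score, group)   # 2 2 2 20 20 20

# z-scores within each group; FUN must be passed by name
z_by_group <- ave(score, group, FUN = function(v) as.vector(scale(v)))
z_by_group          # -1 0 1 -1 0 1
```

Although the raw scores in group b are ten times larger than those in group a, the within-group z-scores are identical, because each value is compared only with the mean and SD of its own group.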
To check whether the code works correctly, we will calculate the mean and SD of the z-scores for each innovator group. The mean should be 0 and SD should be 1 for each group.
We will use tapply() to apply mean() to each
inno group. In tapply(), the first argument is
the input values. The second argument is a grouping variable (a factor).
The third argument is the function to apply.
tapply(solar_data$z_kno_g, solar_data$inno, mean)
## early adopters early majority innovators laggards late majority
## -8.778508e-17 -6.073015e-17 1.240837e-16 0.000000e+00 -2.670157e-17
# The values contain very small decimals. Let's round them to make them easier to read.
round(tapply(solar_data$z_kno_g, solar_data$inno, mean))
## early adopters early majority innovators laggards late majority
## 0 0 0 0 0
# Then calculate the SD
tapply(solar_data$z_kno_g, solar_data$inno, sd)
## early adopters early majority innovators laggards late majority
## 1 1 1 1 1
With the mean = 0 and SD = 1 for each innovator group, it seems that the z-score-by-group code works correctly.
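The same check can be sketched on toy data with an explicit tolerance instead of rounding to whole numbers. The scores, groups, and the 1e-12 threshold below are hypothetical choices for illustration.

```r
score <- c(1, 2, 4, 10, 25, 30, 5, 6, 9)    # hypothetical scores
group <- rep(c("a", "b", "c"), each = 3)    # hypothetical groups
z <- ave(score, group, FUN = function(v) as.vector(scale(v)))

group_means <- tapply(z, group, mean)
group_sds   <- tapply(z, group, sd)

round(group_means, digits = 10)    # keep 10 decimals instead of rounding to integers
all(abs(group_means) < 1e-12)      # TRUE when every group mean is numerically zero
all(abs(group_sds - 1) < 1e-12)    # TRUE when every group SD is numerically one
```

Rounding to a fixed number of decimals keeps the check strict: a group mean of 0.4 would round to 0 with the default digits = 0, but not with digits = 10.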
Now we find the outliers, that is, cases with z-scores beyond ±3. Because the z-scores were calculated within each innovator group, any values beyond ±3 are outliers for that group.
outliers_g <- solar_data$z_kno_g > 3 | solar_data$z_kno_g < -3
solar_data[outliers_g, ]
## id user deci sex age kno1 kno2 kno3 kno4 kno5 kno6 kno7 kno8 inno att1 att2 att3
## 115 136 small residence 2 female 55 3 3 4 3 3 3 4 3 laggards 4 4 5
## 166 210 small residence 2 male 37 4 4 4 4 4 4 4 4 laggards 3 3 3
## att4 att5 att6 att7 att8 interest re_att1 re_att2 re_att3 re_att4 re_att5 re_att6 re_att7 re_att8
## 115 4 4 5 5 5 probably not 3 3 4 3 3 4 4 4
## 166 3 3 3 3 3 likely to 2 2 2 2 2 2 2 2
## re_kno1 re_kno2 re_kno3 re_kno4 re_kno5 re_kno6 re_kno7 re_kno8 attitude knowledge z_kno_m z_kno_f
## 115 2 2 3 2 2 2 3 2 3.5 2.25 1.407868 1.407868
## 166 3 3 3 3 3 3 3 3 2.0 3.00 2.156324 2.156324
## z_kno_g
## 115 3.038717
## 166 4.249160
In this case, we will remove the outliers from the data by choosing the
rows that are NOT outliers, !outliers_g.
solar_data_new <- solar_data[!outliers_g, ] # choose rows that are not outliers and choose all columns.
nrow(solar_data) # Number of observations in the original data.
## [1] 304
nrow(solar_data_new) # The number of observations should be reduced by 2 cases.
## [1] 302
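The row selection with a negated logical vector can be sketched on a small hypothetical data frame. The names toy, is_out, and toy_new are illustrative only.

```r
toy <- data.frame(id = 1:5, z = c(-0.4, 3.2, 0.1, -3.5, 1.0))   # hypothetical data
is_out <- abs(toy$z) > 3      # TRUE for rows 2 and 4

toy_new <- toy[!is_out, ]     # rows where is_out is FALSE, all columns
toy_new$id                    # 1 3 5
sum(is_out)                   # 2 rows flagged
nrow(toy) - nrow(toy_new)     # 2 rows dropped
```

Comparing sum(is_out) with the drop in nrow() is a quick way to confirm that exactly the flagged rows were removed.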
We will use describe() from the psych package, together with by(),
to calculate means and SDs for each inno group.
The first argument of by() is the input data,
knowledge. The second argument is a grouping factor,
inno. The third argument is the function to apply,
describe.
library(psych)
by(solar_data_new$knowledge, solar_data_new$inno, psych::describe)
## solar_data_new$inno: early adopters
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 43 1.28 1.22 1.12 1.15 1.67 0 4 4 0.58 -0.78 0.19
## --------------------------------------------------------------------------------
## solar_data_new$inno: early majority
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 117 0.93 1.01 0.62 0.79 0.93 0 3.5 3.5 0.87 -0.36 0.09
## --------------------------------------------------------------------------------
## solar_data_new$inno: innovators
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 17 1.08 0.98 1.12 1.06 1.3 0 2.5 2.5 0.2 -1.66 0.24
## --------------------------------------------------------------------------------
## solar_data_new$inno: laggards
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 46 0.27 0.4 0 0.2 0 0 1.25 1.25 1.3 0.37 0.06
## --------------------------------------------------------------------------------
## solar_data_new$inno: late majority
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 79 0.71 0.94 0.25 0.54 0.37 0 3.5 3.5 1.37 0.85 0.11
# We will create an object from the code above for future use.
kno_table <- by(solar_data_new$knowledge, solar_data_new$inno, psych::describe)
The object kno_table, returned by by() and
describe(), is a list. We will access its mean values
to calculate effect sizes.
An effect size here is the standardized difference between two group means, \(\frac{\bar{X}_1-\bar{X}_2}{SD}\).
We will need the means of the two groups and an SD.
kno_table$innovators$mean # mean of the innovators group
## [1] 1.080882
kno_table$laggards$mean # mean of the laggards group
## [1] 0.2690217
# For SD we will use the SD from the whole data set.
sd_all <- sd(solar_data_new$knowledge)
sd_all
## [1] 0.9942528
Now we can calculate an effect size for each comparison.
effect_size1 <- (kno_table$innovators$mean - kno_table$laggards$mean)/sd_all
effect_size1
## [1] 0.8165535
effect_size2 <- (kno_table$innovators$mean - kno_table$`early adopters`$mean)/sd_all
effect_size2
## [1] -0.1964092
effect_size3 <- (kno_table$`early majority`$mean - kno_table$laggards$mean)/sd_all
effect_size3
## [1] 0.659985
effect_size4 <- (kno_table$`early majority`$mean - kno_table$`late majority`$mean)/sd_all
effect_size4
## [1] 0.2176035
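The repeated calculation above could be wrapped in a small helper. This is only a sketch on hypothetical data: effect_size() is not part of the lab, the group means come from tapply() instead of describe(), and the SD is taken from all scores, as in the text.

```r
# Hypothetical knowledge scores for three groups
score <- c(1, 2, 3, 3, 4, 5, 0, 1, 2)
group <- rep(c("innovators", "early adopters", "laggards"), each = 3)

group_means <- tapply(score, group, mean)   # one mean per group
sd_all <- sd(score)                         # SD of all scores

# Hypothetical helper: standardized difference between two group means
effect_size <- function(g1, g2) {
  (group_means[[g1]] - group_means[[g2]]) / sd_all
}

effect_size("innovators", "laggards")
effect_size("innovators", "early adopters")   # negative: the first group has the lower mean
```

The sign of the result depends on the order of the two groups, so a negative value only means that the first group named has the lower mean.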
Copyright © 2022 Kris Ariyabuddhiphongs