Save a data file and an R script in the same directory. Set working
directory to Source File Location.
setwd(dirname(rstudioapi::getSourceEditorContext()$path))
Then, load the data.
solar_data <- read.csv("data/RES_STAT_Lab5_Data.csv")
head(solar_data) ## view the first part of the data frame.
## id user deci sex age kno1 kno2 kno3 kno4 kno5 kno6 kno7 kno8 inno att1 att2
## 1 1 regular residence 2 female 43 1 1 1 1 1 1 1 1 early majority 2 3
## 2 2 regular residence 2 female 51 1 1 1 1 1 1 1 1 laggards 4 4
## 3 3 regular residence 2 male 28 4 4 4 4 2 3 4 2 early adopters 2 5
## 4 4 regular residence 2 female 32 1 1 1 1 1 1 1 1 late majority 3 4
## 5 5 regular residence 2 male 38 4 4 4 3 3 3 3 4 innovators 4 4
## 6 6 regular residence 2 male 51 1 1 1 1 1 1 1 1 early adopters 5 5
## att3 att4 att5 att6 att7 att8 interest re_att1 re_att2 re_att3 re_att4 re_att5 re_att6
## 1 3 1 3 3 3 3 probably not 1 2 2 0 2 2
## 2 3 3 2 2 2 1 definitely not 3 3 2 2 1 1
## 3 5 3 4 5 4 4 likely to 1 4 4 2 3 4
## 4 4 4 5 5 4 3 probably will 2 3 3 3 4 4
## 5 4 3 5 5 5 3 likely to 3 3 3 2 4 4
## 6 3 2 5 5 5 4 definitely going to 4 4 2 1 4 4
## re_att7 re_att8 re_kno1 re_kno2 re_kno3 re_kno4 re_kno5 re_kno6 re_kno7 re_kno8 attitude knowledge
## 1 2 2 0 0 0 0 0 0 0 0 1.625 0.000
## 2 1 0 0 0 0 0 0 0 0 0 1.625 0.000
## 3 3 3 3 3 3 3 1 2 3 1 3.000 2.375
## 4 3 2 0 0 0 0 0 0 0 0 3.000 0.000
## 5 4 2 3 3 3 2 2 2 2 3 3.125 2.500
## 6 4 3 0 0 0 0 0 0 0 0 3.250 0.000
ในแบบฝึกหัดนี้เราจะใช้ตัวแปรความรู้เกี่ยวกับแผงโซลาเซลล์ (knowledge) ตัวแปรนี้ได้มาจากการเฉลี่ยคะแนนการประเมินความรู้ของตนเองจำนวน
For each X, its standardized score, z, can be calculated as \[z = \frac{X-\bar{X}}{s}\]
You can calculate z-score step by step. First, the top part
of the equation \(X-\bar{X}\) is called
mean centering. That is, the value X is subtracted by
the mean of X, X - mean(X).
The mean of X is now the “center” (0) of the distribution. Each \(X-\bar{X}\) represents how far each X is
from the \(\bar{X}\), i.e., a deviation
score.
Then, the centered value is scaled to the SD of X,
X - mean(X)/sd(X). When the score is scaled to SD. It means
that one unit change in the z-score represents one SD change in
the raw score.
To standardize a variable, you will calculate z-scores for
each row in that column. Let put it the calculated values in
z_kno_m variable.
z for z-scores; kno for knowledge;
and m for manual calculation.
solar_data$z_kno_m <- (solar_data$knowledge - mean(solar_data$knowledge))/sd(solar_data$knowledge)
head(solar_data$z_kno_m)
## [1] -0.837499 -0.837499 1.532611 -0.837499 1.657354 -0.837499
scale() functionThe scale function will center the value by substracting the
column mean. Then it scales the centered value by the column
standard deviation.
Let create a variable name z_kno_f. f for
function.
solar_data$z_kno_f <- scale(solar_data$knowledge)
head(solar_data$z_kno_f)
## [,1]
## [1,] -0.837499
## [2,] -0.837499
## [3,] 1.532611
## [4,] -0.837499
## [5,] 1.657354
## [6,] -0.837499
Are all values in the manual method equal to values from
scale()?
all(solar_data$z_kno_m == solar_data$z_kno_f)
## [1] FALSE
Both methods lead to the same results.
There are many methods to identify outliers. In this example, any values outside ±3 SD are considered outliers. Because z-scores are in a unit of SD, any z beyond ±3 will be marked as outliers.
outliers <- solar_data$z_kno_f > 3 | solar_data$z_kno_f < -3
solar_data[outliers, ]
## id user deci sex age kno1 kno2 kno3 kno4 kno5 kno6 kno7 kno8 inno att1 att2
## 167 211 regular residence 2 male 47 5 5 5 5 5 5 5 5 early adopters 4 4
## 264 332 regular residence 2 male 30 5 5 5 5 5 5 5 5 early adopters 5 5
## att3 att4 att5 att6 att7 att8 interest re_att1 re_att2 re_att3 re_att4 re_att5 re_att6 re_att7
## 167 5 4 5 5 5 4 likely to 3 3 4 3 4 4 4
## 264 5 5 5 5 5 5 probably will 4 4 4 4 4 4 4
## re_att8 re_kno1 re_kno2 re_kno3 re_kno4 re_kno5 re_kno6 re_kno7 re_kno8 attitude knowledge z_kno_m
## 167 3 4 4 4 4 4 4 4 4 3.5 4 3.154265
## 264 4 4 4 4 4 4 4 4 4 4.0 4 3.154265
## z_kno_f
## 167 3.154265
## 264 3.154265
z_kno_f was calculated from the entire data set. This
would be appropriate if we are looking at a homogenous group. However,
it is not a case here. Our data contain subgroups of innovators
inno. Therefore, we should find outilers within
each group instead.
We will use ave() to calculate the z-score by group.
ave is normally used to calculate means for subsets of X.
However, you can specified the argument FUN = to other
functions, such as sd() or scale().
The first argument in ave() is a numeric object to be
calculated (knowledge). The second argument is a grouping
variable (inno). The third argument FUN = is a
function to by applied for each factor level
(FUN = scale).
solar_data$z_kno_g <- ave(solar_data$knowledge, solar_data$inno, FUN = scale)
# wrong <- ave(solar_data$z_kno_f, solar_data$inno, FUN = scale)
head(solar_data$z_kno_g)
## [1] -0.9203839 -0.5926129 0.9015546 -0.7532102 1.4462915 -1.0470435
# head(wrong)
To check whether the code works correctly, we will calculate the mean and SD of the z-scores for each innovator group. The mean should be 0 and SD should be 1 for each group.
We will use tapply to apply mean() to each
inno group. In tapply, the first argument is
an input value. The second argument is a grouping variable (a factor).
The third argument is a function to apply.
tapply(solar_data$z_kno_g, solar_data$inno, mean)
## early adopters early majority innovators laggards late majority
## -8.778508e-17 -6.073015e-17 1.240837e-16 0.000000e+00 -2.670157e-17
# The values contain very small decimals. Let's round them up to make them easier to read.
round(tapply(solar_data$z_kno_g, solar_data$inno, mean))
## early adopters early majority innovators laggards late majority
## 0 0 0 0 0
# Then calculate the SD
tapply(solar_data$z_kno_g, solar_data$inno, sd)
## early adopters early majority innovators laggards late majority
## 1 1 1 1 1
With the mean = 0 and SD = 1 for each innovator group, it seems that the z-score-by-group code works correctly.
Now we find outliers that have z-score beyond ±3. Because the z-scores were calculated for each innovator group, any values beyond ±3 are outliers for that group.
outliers_g <- solar_data$z_kno_g > 3 | solar_data$z_kno_g < -3
solar_data[outliers_g, ]
## id user deci sex age kno1 kno2 kno3 kno4 kno5 kno6 kno7 kno8 inno att1 att2 att3
## 115 136 small residence 2 female 55 3 3 4 3 3 3 4 3 laggards 4 4 5
## 166 210 small residence 2 male 37 4 4 4 4 4 4 4 4 laggards 3 3 3
## att4 att5 att6 att7 att8 interest re_att1 re_att2 re_att3 re_att4 re_att5 re_att6 re_att7 re_att8
## 115 4 4 5 5 5 probably not 3 3 4 3 3 4 4 4
## 166 3 3 3 3 3 likely to 2 2 2 2 2 2 2 2
## re_kno1 re_kno2 re_kno3 re_kno4 re_kno5 re_kno6 re_kno7 re_kno8 attitude knowledge z_kno_m z_kno_f
## 115 2 2 3 2 2 2 3 2 3.5 2.25 1.407868 1.407868
## 166 3 3 3 3 3 3 3 3 2.0 3.00 2.156324 2.156324
## z_kno_g
## 115 3.038717
## 166 4.249160
In this case, we will remove outliers from the data. We will choose
rows that are NOT outliers
!outliers_g.
solar_data_new <- solar_data[!outliers_g, ] # choose rows that are not outliers and choose all columns.
nrow(solar_data) # Number of observation in the original data.
## [1] 304
nrow(solar_data_new) # The number of observation should be reduced by 2 cases.
## [1] 302
We will use describe() from package psych
to calculate means and SDs by each inno group.
The first argument of by is an input data,
knowledge. The second argument is a grouping factor,
inno. The third argument is a function to apply,
describe.
library(psych)
by(solar_data_new$knowledge, solar_data_new$inno, psych::describe)
## solar_data_new$inno: early adopters
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 43 1.28 1.22 1.12 1.15 1.67 0 4 4 0.58 -0.78 0.19
## --------------------------------------------------------------------------------
## solar_data_new$inno: early majority
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 117 0.93 1.01 0.62 0.79 0.93 0 3.5 3.5 0.87 -0.36 0.09
## --------------------------------------------------------------------------------
## solar_data_new$inno: innovators
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 17 1.08 0.98 1.12 1.06 1.3 0 2.5 2.5 0.2 -1.66 0.24
## --------------------------------------------------------------------------------
## solar_data_new$inno: laggards
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 46 0.27 0.4 0 0.2 0 0 1.25 1.25 1.3 0.37 0.06
## --------------------------------------------------------------------------------
## solar_data_new$inno: late majority
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 79 0.71 0.94 0.25 0.54 0.37 0 3.5 3.5 1.37 0.85 0.11
# We will create an object from the code above for future use.
kno_table <- by(solar_data_new$knowledge, solar_data_new$inno, psych::describe)
The object kno_table from by() and
describe() is a list. We will access its mean and SD values
to calculate effect sizes.
An effect size is \(\frac{X_1-X_2}{SD}\).
We will need means of the two groups and an SD.
kno_table$innovators$mean # mean of the innovators group
## [1] 1.080882
kno_table$laggards$mean # mean of the laggards group
## [1] 0.2690217
# For SD we will use the SD from the whole data set.
sd_all <- sd(solar_data_new$knowledge)
sd_all
## [1] 0.9942528
Now we can calculate an effect size for each comparison.
effect_size1 <- (kno_table$innovators$mean - kno_table$laggards$mean)/sd_all
effect_size1
## [1] 0.8165535
effect_size2 <- (kno_table$innovators$mean - kno_table$`early adopters`$mean)/sd_all
effect_size2
## [1] -0.1964092
effect_size3 <- (kno_table$`early majority`$mean - kno_table$laggards$mean)/sd_all
effect_size3
## [1] 0.659985
effect_size4 <- (kno_table$`early majority`$mean - kno_table$`late majority`$mean)/sd_all
effect_size4
## [1] 0.2176035
Copyright © 2022 Kris Ariyabuddhiphongs