The advent of smart devices opens the door to many possibilities, among them monitoring people’s physical exercise with inexpensive devices such as Jawbone Up, Nike FuelBand and Fitbit. Indeed, users can assess the effectiveness of their activity from the feedback these devices provide. In this study, the quality of weight-lifting exercises (unilateral dumbbell biceps curl) is evaluated with a state-of-the-art random forest classifier. The setup we use to build our random forest yields a nearly perfect classification, with a very low generalization error rate.
The raw data sets are collected from four Razor inertial measurement units (IMUs)(1), each of which provides three-axis acceleration, gyroscope and magnetometer data sampled at a rate of 45 Hz. These IMUs are attached to the dumbbell and to the subject’s hand, arm and lumbar region. Each participant is asked to execute a series of unilateral dumbbell biceps curls in five different ways, labelled classes A to E.
Apart from class A, the other four classes correspond to common mistakes people make while executing this exercise.
The data described above come from the groupware technologies research and development group. The dataset is split into a training set and a testing set, which can be obtained here:
training data: [https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv]
testing data: [https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv]
To obtain these data sets, the following code snippets can be used:
fileUrl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
download.file(fileUrl, destfile="pml-training.csv", method="wget")
fileUrl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
download.file(fileUrl, destfile="pml-testing.csv", method="wget")
# Load the data from file (note: this assumes the datasets have already been downloaded)
dfTraining <- read.csv("pml-training.csv", stringsAsFactors=FALSE, header=TRUE)
dfTesting <- read.csv("pml-testing.csv", stringsAsFactors=FALSE, header=TRUE)
# logical mask selecting every column except the outcome
myVars <- !(names(dfTraining) %in% "classe")
dfTraining$classe <- as.factor(dfTraining$classe) # set the outcome as a factor variable
# coerce the remaining columns to numeric; non-numeric entries become NA (with a warning)
dfTraining[, myVars] <- lapply(dfTraining[, myVars], as.numeric)
dfTesting[, myVars] <- lapply(dfTesting[, myVars], as.numeric)
An exploratory data analysis reveals that the training and testing datasets have 19622 and 20 observations respectively, with 160 variables each. The variable names are listed in Appendix A1. Among those, some variables are not relevant for describing how the physical movement is executed: the row index X, user_name, the timestamp variables (raw_timestamp_part_1, raw_timestamp_part_2, cvtd_timestamp) and the window variables (new_window, num_window).
Therefore, they can be dropped. Also, the summary variables (i.e. those with the prefixes min, max, avg, stddev, var, kurtosis, skewness and amplitude) are derived features that were computed only when new_window is triggered, so most of the time their values are not defined (NA). They are thus of little use for modelling.
The following code is used to get rid of these unnecessary variables and obtain the tidy dataset.
toMatch <- c("kurtosis", "skewness", "max", "min", "amplitude", "var", "avg", "stddev",
"X", "timestamp", "user_name", "window")
toRemove <- grep(paste(toMatch,collapse="|"), names(dfTraining), value=TRUE)
dfTraining <- dfTraining[, !(names(dfTraining) %in% toRemove)]
dfTesting <- dfTesting[, !(names(dfTesting) %in% toRemove)]
The datasets resulting from this cleaning process are now reduced to 53 variables each, for both the training and testing sets.
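The claim that the summary variables are mostly undefined can be checked with a per-column NA fraction. The sketch below uses a small invented data frame (the values and the 0.5 cut-off are illustrative assumptions, not taken from the study); on the real data the same computation would be applied to dfTraining before the cleaning step.

```r
# Toy stand-in for the raw training data (values are invented for illustration)
toy <- data.frame(
  roll_belt     = c(1.41, 1.42, 1.45, 1.48),
  avg_roll_belt = c(NA, NA, NA, 1.44),   # summary feature, mostly undefined
  max_roll_belt = c(NA, NA, NA, 1.48)
)

# Fraction of missing values in each column
naFraction <- colMeans(is.na(toy))
naFraction

# Columns whose NA fraction exceeds an (assumed) cut-off of 0.5
mostlyNA <- names(naFraction)[naFraction > 0.5]
mostlyNA
```

In this toy example only the two summary columns are flagged, which mirrors the situation described for the real summary features.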
Before choosing and fitting a model to the training dataset, it can be worthwhile to preprocess the predictors. This is an important part of data preparation, since it can reduce the complexity of the algorithm being considered. In our context, the predictors are screened for near-zero variance, and any predictor that falls into this category is removed.
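library(caret)
# near-zero-variance screening of the predictors (every column except the outcome)
predictors <- setdiff(names(dfTraining), "classe")
nzv <- nearZeroVar(dfTraining[, predictors], saveMetrics=TRUE)
if ( any(nzv$nzv) ){
  # keep only the predictors that are NOT flagged, plus the outcome
  dfTraining <- dfTraining[, c(predictors[!nzv$nzv], "classe")]
}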
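To make the near-zero-variance screening more concrete, here is a simplified base-R sketch of the two metrics that caret's nearZeroVar reports (frequency ratio and percentage of unique values), using its default cut-offs of 95/5 and 10. The function name nzvMetrics and the toy vectors are hypothetical, and the handling of edge cases is deliberately cruder than in caret.

```r
# Simplified re-implementation of the metrics behind caret::nearZeroVar()
nzvMetrics <- function(x, freqCut = 95/5, uniqueCut = 10) {
  counts <- sort(table(x), decreasing = TRUE)
  # ratio of the most common value's count to the second most common one
  freqRatio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else Inf
  # percentage of distinct values relative to the number of observations
  percentUnique <- 100 * length(counts) / length(x)
  list(freqRatio = freqRatio,
       percentUnique = percentUnique,
       nzv = freqRatio > freqCut && percentUnique <= uniqueCut)
}

flat   <- c(rep(0, 99), 1)              # almost constant predictor
varied <- seq(0, 1, length.out = 100)   # every value is distinct

nzvMetrics(flat)$nzv     # TRUE: ratio 99 > 19 and only 2% unique values
nzvMetrics(varied)$nzv   # FALSE
```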
Other steps could include the standardization of the dataset. However, this step depends on the algorithm under consideration. As we will mention later, our model relies on the random forest algorithm, which bases its decision at each split (node) on a threshold applied to a single feature; the resulting partition is therefore invariant to monotonic transformations of the features. The standardization process is thus not necessary in this case.
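The invariance argument can be illustrated with a minimal base-R sketch of a single tree split. The helper names (gini, bestSplitLeft) and the toy data are hypothetical; the point is only that the best threshold split sends the same observations to the left child before and after a strictly increasing transformation.

```r
# Gini impurity of a vector of class labels
gini <- function(y) 1 - sum((table(y) / length(y))^2)

# Exhaustive search for the threshold on x that minimises the weighted Gini
# impurity of the two children; returns which observations go to the left child
bestSplitLeft <- function(x, y) {
  xs <- sort(unique(x))
  thresholds <- (head(xs, -1) + tail(xs, -1)) / 2   # midpoints between values
  cost <- sapply(thresholds, function(t) {
    left <- x <= t
    (sum(left) * gini(y[left]) + sum(!left) * gini(y[!left])) / length(y)
  })
  x <= thresholds[which.min(cost)]
}

# Invented data, for illustration only
x <- c(0.5, 1.2, 2.0, 3.5, 7.0, 9.0)
y <- factor(c("A", "A", "A", "B", "B", "B"))

# Same partition after standardization and after a log transform
identical(bestSplitLeft(x, y), bestSplitLeft(as.numeric(scale(x)), y))   # TRUE
identical(bestSplitLeft(x, y), bestSplitLeft(log(x), y))                 # TRUE
```

The threshold value itself changes with the transformation, but the way the observations are partitioned does not.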
Since the training dataset at our disposal is of medium size, we decide to split it into 60% (training) and 40% (validation) subsets for the prediction study. The following code snippet does that:
# split the training data set for cross-validation
tIndices <- createDataPartition(y=dfTraining$classe, p=0.6, list=FALSE)
dfTrainCv <- dfTraining[tIndices,]
dfTestCv <- dfTraining[-tIndices,]
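For a factor outcome, createDataPartition samples within each class so that the class proportions are preserved in both subsets. A rough base-R equivalent can be sketched as follows; the toy outcome, the seed and the variable names are assumptions made for illustration, and the rounding details may differ from caret's.

```r
set.seed(42)  # arbitrary seed, only to make the sketch reproducible

# Toy outcome with unbalanced classes (invented for illustration)
classe <- factor(rep(c("A", "B", "C"), times = c(50, 30, 20)))

# Sample 60% of the row indices within each class, then pool them
byClass  <- split(seq_along(classe), classe)
trainIdx <- sort(unlist(lapply(byClass, function(idx) {
  idx[sample.int(length(idx), size = round(0.6 * length(idx)))]
}), use.names = FALSE))

table(classe[trainIdx])    # 30, 18 and 12 observations
table(classe[-trainIdx])   # 20, 12 and 8 observations
```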
In our approach, the random forest algorithm is adopted for the classification because of its many useful properties: it has low variance (i.e. it generalizes well), it decorrelates the trees more effectively than plain bagging, it can estimate feature importances, and it usually provides satisfying results. The classifier is trained with the default settings of randomForest: 500 trees, each grown on a bootstrap sample of the training subset, with 7 variables tried at each split. This approach is implemented as follows:
library(randomForest)
set.seed(1248)
ptime <- system.time(modfitrf <- randomForest(classe ~ ., data=dfTrainCv, importance=TRUE, proximity=TRUE))
modfitrf
##
## Call:
## randomForest(formula = classe ~ ., data = dfTrainCv, importance = TRUE, proximity = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 7
##
## OOB estimate of error rate: 0.73%
## Confusion matrix:
## A B C D E class.error
## A 3347 1 0 0 0 0.0002986858
## B 13 2262 4 0 0 0.0074594120
## C 0 20 2024 10 0 0.0146056475
## D 0 0 26 1902 2 0.0145077720
## E 0 1 2 7 2155 0.0046189376
The fitted model has an estimated overall out-of-bag (OOB) error rate of 0.73%, which is quite good. In a random forest, the OOB error rate is an estimate of the out-of-sample error that is computed internally as the forest is built. Thus, we can anticipate that the out-of-sample error will be of the same order of magnitude as the OOB error.
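As a sanity check, the reported OOB figures can be recomputed by hand from the printed confusion matrix, whose rows hold the true classes. The sketch below only re-enters the numbers shown above; the object names are arbitrary.

```r
# OOB confusion matrix copied from the printed model (rows: true class)
oob <- matrix(c(3347,    1,    0,    0,    0,
                  13, 2262,    4,    0,    0,
                   0,   20, 2024,   10,    0,
                   0,    0,   26, 1902,    2,
                   0,    1,    2,    7, 2155),
              nrow = 5, byrow = TRUE,
              dimnames = list(LETTERS[1:5], LETTERS[1:5]))

# Overall OOB error: share of off-diagonal (misclassified) observations
oobError <- 1 - sum(diag(oob)) / sum(oob)
round(100 * oobError, 2)   # 0.73 (percent), i.e. 86 errors out of 11776

# Per-class error, matching the class.error column
classError <- 1 - diag(oob) / rowSums(oob)
round(classError, 4)

# Default number of variables tried at each split for classification,
# with the 52 predictors left after cleaning
floor(sqrt(52))            # 7
```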
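Although a cross-validation step is not needed for the random forest algorithm (the error is estimated internally from the OOB samples during the run)(2), we include this extra step anyway to verify the generalization ability of the fitted model on an external validation set. The prediction on the validation set is done as follows:
set.seed(13579)
predRf <- predict(modfitrf, dfTestCv)
# Note: confusionMatrix() takes (data = predictions, reference = truth). The true classes are
# passed first here, so the rows labelled "Prediction" in the output hold the true classes.
# The overall accuracy is unaffected by this ordering.
cm <- confusionMatrix(dfTestCv$classe, predRf); cm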
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 2230 1 0 0 1
## B 6 1512 0 0 0
## C 0 12 1356 0 0
## D 0 0 2 1284 0
## E 0 0 0 0 1442
##
## Overall Statistics
##
## Accuracy : 0.9972
## 95% CI : (0.9958, 0.9982)
## No Information Rate : 0.285
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9965
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9973 0.9915 0.9985 1.0000 0.9993
## Specificity 0.9996 0.9991 0.9982 0.9997 1.0000
## Pos Pred Value 0.9991 0.9960 0.9912 0.9984 1.0000
## Neg Pred Value 0.9989 0.9979 0.9997 1.0000 0.9998
## Prevalence 0.2850 0.1944 0.1731 0.1637 0.1839
## Detection Rate 0.2842 0.1927 0.1728 0.1637 0.1838
## Detection Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Balanced Accuracy 0.9985 0.9953 0.9983 0.9998 0.9997
The table of results shows that the fitted model has an accuracy of 99.72% (the out-of-sample error is thus 0.28%). This accuracy, measured on held-out data, corroborates the OOB error rate estimated earlier during training. The OOB error therefore appears to be a reliable estimate of the test set error, which in turn suggests that the model generalizes very well.
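The headline figures can be reproduced directly from the validation confusion matrix. This sketch simply re-enters the counts printed in the output and recomputes the accuracy, together with the quantities reported there as Sensitivity and Pos Pred Value; the object names are arbitrary.

```r
# Validation confusion matrix copied from the printed output
cmTable <- matrix(c(2230,    1,    0,    0,    1,
                       6, 1512,    0,    0,    0,
                       0,   12, 1356,    0,    0,
                       0,    0,    2, 1284,    0,
                       0,    0,    0,    0, 1442),
                  nrow = 5, byrow = TRUE,
                  dimnames = list(LETTERS[1:5], LETTERS[1:5]))

# Accuracy: correctly classified observations over all observations
accuracy <- sum(diag(cmTable)) / sum(cmTable)
round(accuracy, 4)        # 0.9972
round(1 - accuracy, 4)    # 0.0028

# Reported as "Sensitivity": diagonal over column totals
sensitivity <- diag(cmTable) / colSums(cmTable)
round(sensitivity, 4)

# Reported as "Pos Pred Value": diagonal over row totals
posPredValue <- diag(cmTable) / rowSums(cmTable)
round(posPredValue, 4)
```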
library(pROC)
predictions <- as.numeric(predict(modfitrf, dfTestCv, type='response'))
mcauc <- multiclass.roc(dfTestCv$classe, predictions, percent=TRUE); mcauc
##
## Call:
## multiclass.roc.default(response = dfTestCv$classe, predictor = predictions, percent = TRUE)
##
## Data: predictions with 5 levels of dfTestCv$classe: A, B, C, D, E.
## Multi-class area under the curve: 99.91%
Also, our model’s multi-class AUC of 99.91% (0.9991) is very close to 1, which is the AUC of a perfect classifier.
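To give some intuition for this number: according to its documentation, pROC's multiclass.roc follows the definition of Hand and Till, which averages AUCs computed over pairs of classes. The sketch below does not reproduce that procedure; it only illustrates, on invented scores and with a hypothetical helper pairwiseAuc, what the AUC of a single two-class comparison measures.

```r
# AUC of a two-class problem as the probability that a randomly chosen
# positive case receives a higher score than a randomly chosen negative one
# (ties count for one half)
pairwiseAuc <- function(scorePos, scoreNeg) {
  cmp <- outer(scorePos, scoreNeg, function(p, n) (p > n) + 0.5 * (p == n))
  mean(cmp)
}

# Invented scores, for illustration only
pos <- c(0.9, 0.8, 0.4)
neg <- c(0.7, 0.3, 0.2)
pairwiseAuc(pos, neg)   # 8 of the 9 pairs are correctly ordered: 0.889

# A classifier that separates the two classes perfectly reaches 1
pairwiseAuc(c(0.9, 0.8), c(0.2, 0.1))
```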
One interesting property of the random forest algorithm (as of CART algorithms in general) is that each predictor’s importance can be deduced from an impurity measure. Appendix A2 lists the variable importances in increasing order of the Gini importance (MeanDecreaseGini). This measure is the total decrease in node impurity produced by the splits on a given variable, averaged over all trees. Simply put, the higher its value, the more important the role this variable plays in the correct classification; the most important variables therefore appear at the bottom of the table. This is convenient in a feature selection context.
The 20 samples from the test dataset are classified as follows:
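The impurity decrease behind MeanDecreaseGini can be illustrated on a single split. The helpers giniImpurity and giniDecrease and the toy labels below are hypothetical; randomForest accumulates decreases of this kind over all the splits that use a variable.

```r
# Gini impurity of a set of class labels
giniImpurity <- function(y) {
  p <- table(y) / length(y)
  1 - sum(p^2)
}

# Impurity decrease obtained by splitting y into a left and a right group
giniDecrease <- function(y, left) {
  n <- length(y)
  giniImpurity(y) -
    (sum(left) / n) * giniImpurity(y[left]) -
    (sum(!left) / n) * giniImpurity(y[!left])
}

# Invented labels, for illustration only
y <- c("A", "A", "A", "B", "B", "B")

# A split that separates the classes perfectly removes all the impurity
giniDecrease(y, left = c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE))   # 0.5

# A nearly uninformative split barely reduces it
giniDecrease(y, left = c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE))   # about 0.056
```

A variable that repeatedly yields splits of the first kind accumulates a large total decrease, which is why larger values indicate more important variables.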
predtest <- predict(modfitrf, dfTesting);
predtest
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
(1) Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. “Qualitative Activity Recognition of Weight Lifting Exercises”, Proceedings of the 4th International Conference in Cooperation with SIGCHI (Augmented Human ’13). Stuttgart, Germany: ACM SIGCHI, 2013.
(2) Leo Breiman and Adele Cutler, Random Forests.
Appendix A1. Variable names of the training dataset:
## [1] "X" "user_name"
## [3] "raw_timestamp_part_1" "raw_timestamp_part_2"
## [5] "cvtd_timestamp" "new_window"
## [7] "num_window" "roll_belt"
## [9] "pitch_belt" "yaw_belt"
## [11] "total_accel_belt" "kurtosis_roll_belt"
## [13] "kurtosis_picth_belt" "kurtosis_yaw_belt"
## [15] "skewness_roll_belt" "skewness_roll_belt.1"
## [17] "skewness_yaw_belt" "max_roll_belt"
## [19] "max_picth_belt" "max_yaw_belt"
## [21] "min_roll_belt" "min_pitch_belt"
## [23] "min_yaw_belt" "amplitude_roll_belt"
## [25] "amplitude_pitch_belt" "amplitude_yaw_belt"
## [27] "var_total_accel_belt" "avg_roll_belt"
## [29] "stddev_roll_belt" "var_roll_belt"
## [31] "avg_pitch_belt" "stddev_pitch_belt"
## [33] "var_pitch_belt" "avg_yaw_belt"
## [35] "stddev_yaw_belt" "var_yaw_belt"
## [37] "gyros_belt_x" "gyros_belt_y"
## [39] "gyros_belt_z" "accel_belt_x"
## [41] "accel_belt_y" "accel_belt_z"
## [43] "magnet_belt_x" "magnet_belt_y"
## [45] "magnet_belt_z" "roll_arm"
## [47] "pitch_arm" "yaw_arm"
## [49] "total_accel_arm" "var_accel_arm"
## [51] "avg_roll_arm" "stddev_roll_arm"
## [53] "var_roll_arm" "avg_pitch_arm"
## [55] "stddev_pitch_arm" "var_pitch_arm"
## [57] "avg_yaw_arm" "stddev_yaw_arm"
## [59] "var_yaw_arm" "gyros_arm_x"
## [61] "gyros_arm_y" "gyros_arm_z"
## [63] "accel_arm_x" "accel_arm_y"
## [65] "accel_arm_z" "magnet_arm_x"
## [67] "magnet_arm_y" "magnet_arm_z"
## [69] "kurtosis_roll_arm" "kurtosis_picth_arm"
## [71] "kurtosis_yaw_arm" "skewness_roll_arm"
## [73] "skewness_pitch_arm" "skewness_yaw_arm"
## [75] "max_roll_arm" "max_picth_arm"
## [77] "max_yaw_arm" "min_roll_arm"
## [79] "min_pitch_arm" "min_yaw_arm"
## [81] "amplitude_roll_arm" "amplitude_pitch_arm"
## [83] "amplitude_yaw_arm" "roll_dumbbell"
## [85] "pitch_dumbbell" "yaw_dumbbell"
## [87] "kurtosis_roll_dumbbell" "kurtosis_picth_dumbbell"
## [89] "kurtosis_yaw_dumbbell" "skewness_roll_dumbbell"
## [91] "skewness_pitch_dumbbell" "skewness_yaw_dumbbell"
## [93] "max_roll_dumbbell" "max_picth_dumbbell"
## [95] "max_yaw_dumbbell" "min_roll_dumbbell"
## [97] "min_pitch_dumbbell" "min_yaw_dumbbell"
## [99] "amplitude_roll_dumbbell" "amplitude_pitch_dumbbell"
## [101] "amplitude_yaw_dumbbell" "total_accel_dumbbell"
## [103] "var_accel_dumbbell" "avg_roll_dumbbell"
## [105] "stddev_roll_dumbbell" "var_roll_dumbbell"
## [107] "avg_pitch_dumbbell" "stddev_pitch_dumbbell"
## [109] "var_pitch_dumbbell" "avg_yaw_dumbbell"
## [111] "stddev_yaw_dumbbell" "var_yaw_dumbbell"
## [113] "gyros_dumbbell_x" "gyros_dumbbell_y"
## [115] "gyros_dumbbell_z" "accel_dumbbell_x"
## [117] "accel_dumbbell_y" "accel_dumbbell_z"
## [119] "magnet_dumbbell_x" "magnet_dumbbell_y"
## [121] "magnet_dumbbell_z" "roll_forearm"
## [123] "pitch_forearm" "yaw_forearm"
## [125] "kurtosis_roll_forearm" "kurtosis_picth_forearm"
## [127] "kurtosis_yaw_forearm" "skewness_roll_forearm"
## [129] "skewness_pitch_forearm" "skewness_yaw_forearm"
## [131] "max_roll_forearm" "max_picth_forearm"
## [133] "max_yaw_forearm" "min_roll_forearm"
## [135] "min_pitch_forearm" "min_yaw_forearm"
## [137] "amplitude_roll_forearm" "amplitude_pitch_forearm"
## [139] "amplitude_yaw_forearm" "total_accel_forearm"
## [141] "var_accel_forearm" "avg_roll_forearm"
## [143] "stddev_roll_forearm" "var_roll_forearm"
## [145] "avg_pitch_forearm" "stddev_pitch_forearm"
## [147] "var_pitch_forearm" "avg_yaw_forearm"
## [149] "stddev_yaw_forearm" "var_yaw_forearm"
## [151] "gyros_forearm_x" "gyros_forearm_y"
## [153] "gyros_forearm_z" "accel_forearm_x"
## [155] "accel_forearm_y" "accel_forearm_z"
## [157] "magnet_forearm_x" "magnet_forearm_y"
## [159] "magnet_forearm_z" "classe"
Appendix A2. Variable importances, in increasing order of MeanDecreaseGini:

Variable | A | B | C | D | E | MeanDecreaseAccuracy | MeanDecreaseGini |
---|---|---|---|---|---|---|---|
gyros_arm_z | 0.0050225 | 0.0045273 | 0.0039092 | 0.0045855 | 0.0030702 | 0.0042983 | 36.99014 |
gyros_forearm_x | 0.0031082 | 0.0057950 | 0.0088756 | 0.0093058 | 0.0026574 | 0.0055650 | 45.40488 |
gyros_forearm_z | 0.0047288 | 0.0083247 | 0.0053113 | 0.0038730 | 0.0024903 | 0.0049754 | 50.19649 |
gyros_dumbbell_z | 0.0031446 | 0.0071945 | 0.0056385 | 0.0058126 | 0.0036431 | 0.0048957 | 52.41906 |
gyros_belt_x | 0.0112111 | 0.0091934 | 0.0196929 | 0.0121652 | 0.0034192 | 0.0110402 | 54.91703 |
total_accel_arm | 0.0077674 | 0.0084395 | 0.0109646 | 0.0117264 | 0.0057236 | 0.0087253 | 64.17706 |
gyros_belt_y | 0.0087265 | 0.0160730 | 0.0223532 | 0.0144325 | 0.0061878 | 0.0130035 | 65.64987 |
total_accel_forearm | 0.0157780 | 0.0078179 | 0.0104160 | 0.0086942 | 0.0050312 | 0.0101584 | 70.27534 |
gyros_forearm_y | 0.0072035 | 0.0126087 | 0.0076831 | 0.0097917 | 0.0044780 | 0.0082529 | 73.11439 |
accel_belt_x | 0.0183746 | 0.0234457 | 0.0301236 | 0.0200599 | 0.0075763 | 0.0196966 | 77.25316 |
accel_arm_z | 0.0171785 | 0.0106935 | 0.0172166 | 0.0102051 | 0.0061742 | 0.0127634 | 78.91468 |
gyros_dumbbell_x | 0.0048955 | 0.0137483 | 0.0187227 | 0.0122831 | 0.0078844 | 0.0107871 | 78.95510 |
gyros_arm_x | 0.0197860 | 0.0122563 | 0.0078804 | 0.0104655 | 0.0053238 | 0.0120687 | 81.20208 |
accel_forearm_y | 0.0183313 | 0.0136022 | 0.0223476 | 0.0155492 | 0.0077646 | 0.0157137 | 83.51441 |
gyros_arm_y | 0.0130845 | 0.0156529 | 0.0070519 | 0.0119388 | 0.0055157 | 0.0109534 | 84.64759 |
accel_belt_y | 0.0213945 | 0.0215537 | 0.0251514 | 0.0324871 | 0.0109188 | 0.0219798 | 84.81910 |
accel_arm_y | 0.0203185 | 0.0191830 | 0.0176791 | 0.0208022 | 0.0096976 | 0.0177665 | 97.78520 |
yaw_forearm | 0.0162530 | 0.0158276 | 0.0268242 | 0.0382422 | 0.0116470 | 0.0207784 | 100.43565 |
pitch_arm | 0.0163687 | 0.0203359 | 0.0216935 | 0.0209497 | 0.0115849 | 0.0179427 | 105.04392 |
pitch_dumbbell | 0.0217716 | 0.0292981 | 0.0388038 | 0.0318504 | 0.0195763 | 0.0274601 | 108.08058 |
magnet_arm_z | 0.0213520 | 0.0198561 | 0.0200372 | 0.0128993 | 0.0080734 | 0.0170020 | 112.38083 |
magnet_forearm_y | 0.0222520 | 0.0190485 | 0.0278879 | 0.0288940 | 0.0165891 | 0.0226576 | 128.51545 |
magnet_arm_y | 0.0272279 | 0.0330026 | 0.0370462 | 0.0501245 | 0.0217393 | 0.0328041 | 129.36346 |
magnet_forearm_x | 0.0268747 | 0.0158612 | 0.0209969 | 0.0332373 | 0.0192755 | 0.0233523 | 129.82700 |
accel_arm_x | 0.0304255 | 0.0234829 | 0.0256853 | 0.0430316 | 0.0161714 | 0.0276915 | 139.44320 |
total_accel_belt | 0.0173906 | 0.0233491 | 0.0257660 | 0.0250194 | 0.0221909 | 0.0221340 | 140.01929 |
yaw_arm | 0.0232137 | 0.0191143 | 0.0260349 | 0.0269577 | 0.0102085 | 0.0211328 | 140.96476 |
accel_forearm_z | 0.0245532 | 0.0272793 | 0.0517790 | 0.0412692 | 0.0227288 | 0.0322398 | 144.92912 |
magnet_belt_x | 0.0185672 | 0.0307866 | 0.0429687 | 0.0323963 | 0.0152365 | 0.0268295 | 147.90910 |
accel_dumbbell_x | 0.0290362 | 0.0338737 | 0.0492995 | 0.0416988 | 0.0256401 | 0.0349824 | 151.45015 |
yaw_dumbbell | 0.0299674 | 0.0356011 | 0.0608455 | 0.0413435 | 0.0270713 | 0.0377845 | 153.74850 |
magnet_arm_x | 0.0600883 | 0.0395969 | 0.0428598 | 0.0617101 | 0.0291868 | 0.0477098 | 156.31836 |
total_accel_dumbbell | 0.0338707 | 0.0309443 | 0.0394825 | 0.0570376 | 0.0239026 | 0.0362438 | 159.06478 |
gyros_dumbbell_y | 0.0266702 | 0.0306184 | 0.0462819 | 0.0287792 | 0.0155376 | 0.0291577 | 165.02473 |
magnet_forearm_z | 0.0290397 | 0.0249584 | 0.0370571 | 0.0411755 | 0.0201326 | 0.0299856 | 168.75157 |
gyros_belt_z | 0.0282199 | 0.0391090 | 0.0418926 | 0.0352375 | 0.0303498 | 0.0342689 | 184.31019 |
roll_arm | 0.0250423 | 0.0365796 | 0.0530015 | 0.0653112 | 0.0210765 | 0.0380024 | 186.91642 |
accel_forearm_x | 0.0409852 | 0.0392238 | 0.0433190 | 0.0761958 | 0.0310900 | 0.0449897 | 186.99584 |
accel_dumbbell_z | 0.0389369 | 0.0446580 | 0.0648370 | 0.0569297 | 0.0426827 | 0.0481886 | 195.99288 |
magnet_belt_y | 0.0318265 | 0.0485406 | 0.0512960 | 0.0589625 | 0.0399584 | 0.0443944 | 225.24378 |
accel_belt_z | 0.0298142 | 0.0358739 | 0.0525637 | 0.0437662 | 0.0366055 | 0.0384730 | 227.73361 |
magnet_belt_z | 0.0386145 | 0.0542071 | 0.0574914 | 0.0648343 | 0.0395518 | 0.0493911 | 235.03581 |
roll_dumbbell | 0.0394330 | 0.0738221 | 0.1036183 | 0.0767889 | 0.0438397 | 0.0642194 | 254.23782 |
accel_dumbbell_y | 0.0485603 | 0.0558788 | 0.1130781 | 0.0656890 | 0.0411422 | 0.0626798 | 259.88089 |
magnet_dumbbell_x | 0.1056454 | 0.0808095 | 0.1406698 | 0.1130797 | 0.0510258 | 0.0980953 | 290.86491 |
roll_forearm | 0.1337024 | 0.0861652 | 0.1752968 | 0.1164821 | 0.0628424 | 0.1158936 | 354.19407 |
magnet_dumbbell_y | 0.0979712 | 0.1071347 | 0.1599375 | 0.1435520 | 0.0555445 | 0.1102078 | 394.46940 |
pitch_belt | 0.0759085 | 0.1187436 | 0.0982781 | 0.1014234 | 0.0351515 | 0.0847816 | 399.76097 |
pitch_forearm | 0.0953326 | 0.0575115 | 0.0779447 | 0.1289853 | 0.0501202 | 0.0821714 | 472.08592 |
magnet_dumbbell_z | 0.1361086 | 0.0980454 | 0.1538077 | 0.1295085 | 0.0630891 | 0.1173236 | 482.43098 |
yaw_belt | 0.1162264 | 0.0974325 | 0.1232083 | 0.1594407 | 0.0660065 | 0.1116528 | 528.08876 |
roll_belt | 0.0816078 | 0.1014162 | 0.1215034 | 0.1352998 | 0.1965186 | 0.1222927 | 769.33329 |