Executive Summary

The advent of smart devices opens the door to many possibilities, among them monitoring people’s physical exercise with inexpensive devices such as the Jawbone Up, Nike FuelBand and Fitbit. Indeed, users can assess the effectiveness of their activity based on the feedback provided by these devices. In this study, the quality of weight lifting (unilateral dumbbell biceps curl) exercises is evaluated with the state-of-the-art random forest classifier. The setup we use to build our random forest yields results that are very close to a perfect classification, with a very low generalization error rate.

1. Description of the data

The raw data sets are collected from four Razor inertial measurement units (IMUs)(1), each of which provides three-axis acceleration, gyroscope and magnetometer data sampled at a rate of 45 Hz. These IMUs are attached to the dumbbell and to the subject’s hand, arm and lumbar belt. Each participant is asked to execute a series of unilateral dumbbell biceps curl exercises in five different ways:

  • Class A: exactly according to the specification
  • Class B: throwing the elbows to the front
  • Class C: lifting the dumbbell only halfway
  • Class D: lowering the dumbbell only halfway
  • Class E: throwing the hips to the front

Except for class A, the other four classes correspond to common mistakes people make while executing this exercise.

2. Getting the data

The data described above come from a research and development group working on groupware technologies. The dataset is split into a training set and a testing set. These can be obtained from here:

training data: [https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv]
testing data: [https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv]

To obtain these data sets, the following code snippets can be used:

# Note: method="wget" requires the wget binary; method="auto" is more portable
fileUrl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
download.file(fileUrl, destfile="pml-training.csv", method="wget")
fileUrl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
download.file(fileUrl, destfile="pml-testing.csv", method="wget")

3. Data Cleaning

# Load the data from file (Note: this assumes the datasets have already been downloaded)
dfTraining <- read.csv("pml-training.csv", stringsAsFactors=F, header=T)
dfTesting <- read.csv("pml-testing.csv", stringsAsFactors=F, header=T)

myVars <- !(names(dfTraining) %in% "classe")
dfTraining$classe <- as.factor(dfTraining$classe)     # set the outcome as a factor variable
# Coerce all predictor columns to numeric; non-numeric entries (e.g. "#DIV/0!")
# become NA, with warnings that can safely be ignored here
dfTraining[, myVars] <- apply(dfTraining[, myVars], 2, as.numeric)
dfTesting[, myVars] <- apply(dfTesting[, myVars], 2, as.numeric)

An exploratory data analysis reveals that the training and testing sets have, respectively, 19622 and 20 observations, with 160 variables each. The variable names are listed in Appendix A1. Among those, some variables are not relevant in describing the physical movement executions. They are: the row index X, the user_name, the timestamp variables and the window variables.

Therefore, they can be dropped. Also, the summary variables (i.e., those whose names contain min, max, avg, stddev, var, kurtosis, skewness or amplitude) are derived features that were computed only when a new_window is triggered; most of the time, their values are not defined (NA). They are thus of little use for the modeling.
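That the summary variables are mostly NA can be checked by measuring the fraction of missing values per column. The toy sketch below illustrates the screening logic on a two-column data frame whose names mimic the real ones; on the real data, `df` would be dfTraining:

```r
# Toy illustration of screening out mostly-NA columns; on the real data,
# `df` would be dfTraining (column names here only mimic the real ones).
df <- data.frame(
  roll_belt     = c(1.2, 1.3, 1.1, 1.4),   # raw sensor reading: complete
  avg_roll_belt = c(NA,  NA,  NA,  1.25)   # summary variable: mostly NA
)
naFrac <- colMeans(is.na(df))              # fraction of NAs per column
keep   <- names(df)[naFrac < 0.5]          # keep the mostly-observed columns
keep                                       # only "roll_belt" survives
```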

The following code is used to get rid of these unnecessary variables and obtain a tidy dataset.

# Drop the summary variables and bookkeeping columns by pattern matching on names
toMatch <- c("kurtosis", "skewness", "max", "min", "amplitude", "var", "avg", "stddev", 
             "X", "timestamp", "user_name", "window")
toRemove <- grep(paste(toMatch,collapse="|"), names(dfTraining), value=TRUE)
dfTraining <- dfTraining[, !(names(dfTraining) %in% toRemove)]
dfTesting <- dfTesting[, !(names(dfTesting) %in% toRemove)]

The datasets that result from this cleaning process are reduced to 53 variables, for both the training and testing sets.

4. Model Building

4.1. Preprocessing

Before choosing and fitting a model to the training dataset, it can be important to preprocess its predictors. This is an important part of the data preparation, since it can reduce the complexity of the algorithm being considered. In our context, the predictors are analyzed to detect near-zero variances, and any predictor falling in that category is removed.

myVars <- !(names(dfTraining) %in% "classe")
nzv <- nearZeroVar(dfTraining[, myVars], saveMetrics=TRUE)
if ( any(nzv$nzv) ){
      # drop the near-zero-variance predictors (keeping everything else, including classe)
      dropCols <- rownames(nzv)[nzv$nzv]
      dfTraining <- dfTraining[, !(names(dfTraining) %in% dropCols)]
}

Other steps could include standardization of the dataset. However, this step depends on the algorithm under consideration. As we will mention later, our modeling involves the random forest algorithm, which bases its decisions on individual features at each split (node); those decisions are therefore invariant under monotonic transformations of the features. The standardization process is thus not necessary in this case.
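This invariance is easy to verify: a split of the form x ≤ t depends only on the ordering of the feature values, and a strictly increasing transformation such as log (on positive data) preserves that ordering. A minimal base-R sketch:

```r
# A tree split (x <= t) depends only on the ordering of x, so a strictly
# increasing transform such as log produces the same partition of the
# observations, merely with a transformed threshold.
x <- c(0.5, 2, 7, 31, 120)
stopifnot(identical(order(x), order(log(x))))   # ordering is preserved
left_raw <- x <= 7                              # split at t = 7
left_log <- log(x) <= log(7)                    # equivalent split after log
stopifnot(identical(left_raw, left_log))        # identical left/right groups
```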

4.2. Data Slicing

Since the training dataset at our disposal is of medium size, we decide to split it into a 60% (training) - 40% (validation) proportion for the prediction study. The following code snippet accomplishes this:

# split the training data set for cross-validation (createDataPartition is from caret)
library(caret)
tIndices <- createDataPartition(y=dfTraining$classe, p=0.6, list=FALSE)
dfTrainCv <- dfTraining[tIndices,]
dfTestCv <- dfTraining[-tIndices,]

4.3. Train

In our approach, the random forest algorithm is adopted for the classification because of its numerous properties; in particular, it is robust to overfitting (i.e., it generalizes well), it efficiently decorrelates the individual trees (as compared to the bagging method), it can estimate feature importances, and it usually provides satisfying results. To train the classifier, each tree is grown on a bootstrap sample of the training data, and 500 such trees (the randomForest default) are aggregated. This approach is implemented as follows:

library(randomForest)
set.seed(1248)
ptime <- system.time(modfitrf <- randomForest(classe ~ ., data=dfTrainCv, importance=TRUE, proximity=TRUE))
modfitrf
## 
## Call:
##  randomForest(formula = classe ~ ., data = dfTrainCv, importance = TRUE,      proximity = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 7
## 
##         OOB estimate of  error rate: 0.73%
## Confusion matrix:
##      A    B    C    D    E  class.error
## A 3347    1    0    0    0 0.0002986858
## B   13 2262    4    0    0 0.0074594120
## C    0   20 2024   10    0 0.0146056475
## D    0    0   26 1902    2 0.0145077720
## E    0    1    2    7 2155 0.0046189376

The fitted model has an estimated overall out-of-bag (OOB) error rate of 0.73%, which is quite good. In a random forest, the OOB error rate is an estimate of the out-of-sample error, computed internally as the forest is built. Thus, we can anticipate that the out-of-sample error will be of the same order of magnitude as the OOB error.
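For reference, the running OOB estimate is stored in the fitted object’s err.rate matrix; the access pattern shown below on the built-in iris data applies identically to modfitrf (the iris fit is only illustrative, not part of this study):

```r
library(randomForest)

# Small illustrative fit on the built-in iris data; modfitrf exposes the
# same err.rate matrix (one row per tree, "OOB" column = running OOB error).
set.seed(1)
fit <- randomForest(Species ~ ., data = iris, ntree = 100)
oobCurve <- fit$err.rate[, "OOB"]   # running OOB error as trees are added
oobFinal <- oobCurve[fit$ntree]     # final OOB estimate, as reported by print(fit)
```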

4.4. Cross-validation and Result Analysis

Although a separate cross-validation step is not strictly needed for the random forest algorithm (the generalization error is estimated internally from the OOB samples during the run)(2), we nevertheless include this extra step to verify the generalization ability of the fitted model on an external validation set. The prediction on the validation set is done as follows:

set.seed(13579)
predRf <- predict(modfitrf, dfTestCv)

  • Confusion Matrix

cm <- confusionMatrix(dfTestCv$classe, predRf); cm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 2230    1    0    0    1
##          B    6 1512    0    0    0
##          C    0   12 1356    0    0
##          D    0    0    2 1284    0
##          E    0    0    0    0 1442
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9972          
##                  95% CI : (0.9958, 0.9982)
##     No Information Rate : 0.285           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9965          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9973   0.9915   0.9985   1.0000   0.9993
## Specificity            0.9996   0.9991   0.9982   0.9997   1.0000
## Pos Pred Value         0.9991   0.9960   0.9912   0.9984   1.0000
## Neg Pred Value         0.9989   0.9979   0.9997   1.0000   0.9998
## Prevalence             0.2850   0.1944   0.1731   0.1637   0.1839
## Detection Rate         0.2842   0.1927   0.1728   0.1637   0.1838
## Detection Prevalence   0.2845   0.1935   0.1744   0.1639   0.1838
## Balanced Accuracy      0.9985   0.9953   0.9983   0.9998   0.9997

The table of results shows that the fitted model has an accuracy of 99.72% (the out-of-sample error is then 0.28%). This accuracy, measured on a dataset unseen during training, corroborates the OOB error rate estimated earlier in the training step. The designed model thus has a nearly unbiased estimate of the test set error which, in turn, ensures a very good rate of generalization.
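The out-of-sample error quoted above is simply one minus the held-out accuracy. A base-R sketch of that computation on toy labels (on the real data, truth and pred would be dfTestCv$classe and predRf):

```r
# Out-of-sample error = misclassification rate on held-out data.
truth <- factor(c("A", "A", "B", "B", "C"))
pred  <- factor(c("A", "A", "B", "C", "C"), levels = levels(truth))
accuracy  <- mean(pred == truth)   # 4 of 5 correct -> 0.8
oos_error <- 1 - accuracy          # 0.2
table(pred, truth)                 # a base-R confusion matrix
```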

library(pROC)
# multiclass.roc expects a numeric predictor, so the predicted class labels are coded 1-5
predictions <- as.numeric(predict(modfitrf, dfTestCv, type='response'))
mcauc <- multiclass.roc(dfTestCv$classe, predictions, percent=TRUE); mcauc
## 
## Call:
## multiclass.roc.default(response = dfTestCv$classe, predictor = predictions,     percent = TRUE)
## 
## Data: predictions with 5 levels of dfTestCv$classe: A, B, C, D, E.
## Multi-class area under the curve: 99.91%

Also, our model’s multi-class AUC of 0.9991 is very close to 1, the AUC of a perfect classifier.

  • Variable importance

One interesting feature of the random forest algorithm (as with tree-based methods in general) is that each predictor’s importance can be deduced from an impurity measure. Appendix A2 lists the variable importances ordered by the mean decrease in Gini impurity (MeanDecreaseGini). This measure quantifies the average decrease in node impurity (i.e., in the risk of misclassification) contributed by splits on a given variable. Simply speaking, the higher its value, the more important the role this variable plays in the correct classification. This is convenient in a feature selection context.
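Assuming modfitrf was fitted with importance=TRUE (as above), the table in Appendix A2 can be reproduced and ranked with the importance() accessor; the sketch below demonstrates the same calls on the built-in iris data, which stands in here only for illustration:

```r
library(randomForest)

# Illustrative fit on iris; the same two calls apply to modfitrf.
set.seed(1)
fit <- randomForest(Species ~ ., data = iris, importance = TRUE, ntree = 100)
imp <- importance(fit)                                   # incl. MeanDecreaseGini
ranked <- imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), ]
head(ranked)        # most important predictors first
# varImpPlot(fit)   # graphical summary of the same information
```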

5. Prediction on the test set

The 20 samples from the test dataset are classified as follows:

predtest <- predict(modfitrf, dfTesting)
predtest
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 
##  B  A  B  A  A  E  D  B  A  A  B  C  B  A  E  E  A  B  B  B 
## Levels: A B C D E

References

  1. Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. “Qualitative Activity Recognition of Weight Lifting Exercises”, Proceedings of 4th International Conference in Cooperation with SIGCHI (Augmented Human ’13) . Stuttgart, Germany: ACM SIGCHI, 2013.

  2. Leo Breiman and Adele Cutler, “Random Forests”.

Appendices

A1. Variable names

##   [1] "X"                        "user_name"               
##   [3] "raw_timestamp_part_1"     "raw_timestamp_part_2"    
##   [5] "cvtd_timestamp"           "new_window"              
##   [7] "num_window"               "roll_belt"               
##   [9] "pitch_belt"               "yaw_belt"                
##  [11] "total_accel_belt"         "kurtosis_roll_belt"      
##  [13] "kurtosis_picth_belt"      "kurtosis_yaw_belt"       
##  [15] "skewness_roll_belt"       "skewness_roll_belt.1"    
##  [17] "skewness_yaw_belt"        "max_roll_belt"           
##  [19] "max_picth_belt"           "max_yaw_belt"            
##  [21] "min_roll_belt"            "min_pitch_belt"          
##  [23] "min_yaw_belt"             "amplitude_roll_belt"     
##  [25] "amplitude_pitch_belt"     "amplitude_yaw_belt"      
##  [27] "var_total_accel_belt"     "avg_roll_belt"           
##  [29] "stddev_roll_belt"         "var_roll_belt"           
##  [31] "avg_pitch_belt"           "stddev_pitch_belt"       
##  [33] "var_pitch_belt"           "avg_yaw_belt"            
##  [35] "stddev_yaw_belt"          "var_yaw_belt"            
##  [37] "gyros_belt_x"             "gyros_belt_y"            
##  [39] "gyros_belt_z"             "accel_belt_x"            
##  [41] "accel_belt_y"             "accel_belt_z"            
##  [43] "magnet_belt_x"            "magnet_belt_y"           
##  [45] "magnet_belt_z"            "roll_arm"                
##  [47] "pitch_arm"                "yaw_arm"                 
##  [49] "total_accel_arm"          "var_accel_arm"           
##  [51] "avg_roll_arm"             "stddev_roll_arm"         
##  [53] "var_roll_arm"             "avg_pitch_arm"           
##  [55] "stddev_pitch_arm"         "var_pitch_arm"           
##  [57] "avg_yaw_arm"              "stddev_yaw_arm"          
##  [59] "var_yaw_arm"              "gyros_arm_x"             
##  [61] "gyros_arm_y"              "gyros_arm_z"             
##  [63] "accel_arm_x"              "accel_arm_y"             
##  [65] "accel_arm_z"              "magnet_arm_x"            
##  [67] "magnet_arm_y"             "magnet_arm_z"            
##  [69] "kurtosis_roll_arm"        "kurtosis_picth_arm"      
##  [71] "kurtosis_yaw_arm"         "skewness_roll_arm"       
##  [73] "skewness_pitch_arm"       "skewness_yaw_arm"        
##  [75] "max_roll_arm"             "max_picth_arm"           
##  [77] "max_yaw_arm"              "min_roll_arm"            
##  [79] "min_pitch_arm"            "min_yaw_arm"             
##  [81] "amplitude_roll_arm"       "amplitude_pitch_arm"     
##  [83] "amplitude_yaw_arm"        "roll_dumbbell"           
##  [85] "pitch_dumbbell"           "yaw_dumbbell"            
##  [87] "kurtosis_roll_dumbbell"   "kurtosis_picth_dumbbell" 
##  [89] "kurtosis_yaw_dumbbell"    "skewness_roll_dumbbell"  
##  [91] "skewness_pitch_dumbbell"  "skewness_yaw_dumbbell"   
##  [93] "max_roll_dumbbell"        "max_picth_dumbbell"      
##  [95] "max_yaw_dumbbell"         "min_roll_dumbbell"       
##  [97] "min_pitch_dumbbell"       "min_yaw_dumbbell"        
##  [99] "amplitude_roll_dumbbell"  "amplitude_pitch_dumbbell"
## [101] "amplitude_yaw_dumbbell"   "total_accel_dumbbell"    
## [103] "var_accel_dumbbell"       "avg_roll_dumbbell"       
## [105] "stddev_roll_dumbbell"     "var_roll_dumbbell"       
## [107] "avg_pitch_dumbbell"       "stddev_pitch_dumbbell"   
## [109] "var_pitch_dumbbell"       "avg_yaw_dumbbell"        
## [111] "stddev_yaw_dumbbell"      "var_yaw_dumbbell"        
## [113] "gyros_dumbbell_x"         "gyros_dumbbell_y"        
## [115] "gyros_dumbbell_z"         "accel_dumbbell_x"        
## [117] "accel_dumbbell_y"         "accel_dumbbell_z"        
## [119] "magnet_dumbbell_x"        "magnet_dumbbell_y"       
## [121] "magnet_dumbbell_z"        "roll_forearm"            
## [123] "pitch_forearm"            "yaw_forearm"             
## [125] "kurtosis_roll_forearm"    "kurtosis_picth_forearm"  
## [127] "kurtosis_yaw_forearm"     "skewness_roll_forearm"   
## [129] "skewness_pitch_forearm"   "skewness_yaw_forearm"    
## [131] "max_roll_forearm"         "max_picth_forearm"       
## [133] "max_yaw_forearm"          "min_roll_forearm"        
## [135] "min_pitch_forearm"        "min_yaw_forearm"         
## [137] "amplitude_roll_forearm"   "amplitude_pitch_forearm" 
## [139] "amplitude_yaw_forearm"    "total_accel_forearm"     
## [141] "var_accel_forearm"        "avg_roll_forearm"        
## [143] "stddev_roll_forearm"      "var_roll_forearm"        
## [145] "avg_pitch_forearm"        "stddev_pitch_forearm"    
## [147] "var_pitch_forearm"        "avg_yaw_forearm"         
## [149] "stddev_yaw_forearm"       "var_yaw_forearm"         
## [151] "gyros_forearm_x"          "gyros_forearm_y"         
## [153] "gyros_forearm_z"          "accel_forearm_x"         
## [155] "accel_forearm_y"          "accel_forearm_z"         
## [157] "magnet_forearm_x"         "magnet_forearm_y"        
## [159] "magnet_forearm_z"         "classe"

A2. Variable Importance

A B C D E MeanDecreaseAccuracy MeanDecreaseGini
gyros_arm_z 0.0050225 0.0045273 0.0039092 0.0045855 0.0030702 0.0042983 36.99014
gyros_forearm_x 0.0031082 0.0057950 0.0088756 0.0093058 0.0026574 0.0055650 45.40488
gyros_forearm_z 0.0047288 0.0083247 0.0053113 0.0038730 0.0024903 0.0049754 50.19649
gyros_dumbbell_z 0.0031446 0.0071945 0.0056385 0.0058126 0.0036431 0.0048957 52.41906
gyros_belt_x 0.0112111 0.0091934 0.0196929 0.0121652 0.0034192 0.0110402 54.91703
total_accel_arm 0.0077674 0.0084395 0.0109646 0.0117264 0.0057236 0.0087253 64.17706
gyros_belt_y 0.0087265 0.0160730 0.0223532 0.0144325 0.0061878 0.0130035 65.64987
total_accel_forearm 0.0157780 0.0078179 0.0104160 0.0086942 0.0050312 0.0101584 70.27534
gyros_forearm_y 0.0072035 0.0126087 0.0076831 0.0097917 0.0044780 0.0082529 73.11439
accel_belt_x 0.0183746 0.0234457 0.0301236 0.0200599 0.0075763 0.0196966 77.25316
accel_arm_z 0.0171785 0.0106935 0.0172166 0.0102051 0.0061742 0.0127634 78.91468
gyros_dumbbell_x 0.0048955 0.0137483 0.0187227 0.0122831 0.0078844 0.0107871 78.95510
gyros_arm_x 0.0197860 0.0122563 0.0078804 0.0104655 0.0053238 0.0120687 81.20208
accel_forearm_y 0.0183313 0.0136022 0.0223476 0.0155492 0.0077646 0.0157137 83.51441
gyros_arm_y 0.0130845 0.0156529 0.0070519 0.0119388 0.0055157 0.0109534 84.64759
accel_belt_y 0.0213945 0.0215537 0.0251514 0.0324871 0.0109188 0.0219798 84.81910
accel_arm_y 0.0203185 0.0191830 0.0176791 0.0208022 0.0096976 0.0177665 97.78520
yaw_forearm 0.0162530 0.0158276 0.0268242 0.0382422 0.0116470 0.0207784 100.43565
pitch_arm 0.0163687 0.0203359 0.0216935 0.0209497 0.0115849 0.0179427 105.04392
pitch_dumbbell 0.0217716 0.0292981 0.0388038 0.0318504 0.0195763 0.0274601 108.08058
magnet_arm_z 0.0213520 0.0198561 0.0200372 0.0128993 0.0080734 0.0170020 112.38083
magnet_forearm_y 0.0222520 0.0190485 0.0278879 0.0288940 0.0165891 0.0226576 128.51545
magnet_arm_y 0.0272279 0.0330026 0.0370462 0.0501245 0.0217393 0.0328041 129.36346
magnet_forearm_x 0.0268747 0.0158612 0.0209969 0.0332373 0.0192755 0.0233523 129.82700
accel_arm_x 0.0304255 0.0234829 0.0256853 0.0430316 0.0161714 0.0276915 139.44320
total_accel_belt 0.0173906 0.0233491 0.0257660 0.0250194 0.0221909 0.0221340 140.01929
yaw_arm 0.0232137 0.0191143 0.0260349 0.0269577 0.0102085 0.0211328 140.96476
accel_forearm_z 0.0245532 0.0272793 0.0517790 0.0412692 0.0227288 0.0322398 144.92912
magnet_belt_x 0.0185672 0.0307866 0.0429687 0.0323963 0.0152365 0.0268295 147.90910
accel_dumbbell_x 0.0290362 0.0338737 0.0492995 0.0416988 0.0256401 0.0349824 151.45015
yaw_dumbbell 0.0299674 0.0356011 0.0608455 0.0413435 0.0270713 0.0377845 153.74850
magnet_arm_x 0.0600883 0.0395969 0.0428598 0.0617101 0.0291868 0.0477098 156.31836
total_accel_dumbbell 0.0338707 0.0309443 0.0394825 0.0570376 0.0239026 0.0362438 159.06478
gyros_dumbbell_y 0.0266702 0.0306184 0.0462819 0.0287792 0.0155376 0.0291577 165.02473
magnet_forearm_z 0.0290397 0.0249584 0.0370571 0.0411755 0.0201326 0.0299856 168.75157
gyros_belt_z 0.0282199 0.0391090 0.0418926 0.0352375 0.0303498 0.0342689 184.31019
roll_arm 0.0250423 0.0365796 0.0530015 0.0653112 0.0210765 0.0380024 186.91642
accel_forearm_x 0.0409852 0.0392238 0.0433190 0.0761958 0.0310900 0.0449897 186.99584
accel_dumbbell_z 0.0389369 0.0446580 0.0648370 0.0569297 0.0426827 0.0481886 195.99288
magnet_belt_y 0.0318265 0.0485406 0.0512960 0.0589625 0.0399584 0.0443944 225.24378
accel_belt_z 0.0298142 0.0358739 0.0525637 0.0437662 0.0366055 0.0384730 227.73361
magnet_belt_z 0.0386145 0.0542071 0.0574914 0.0648343 0.0395518 0.0493911 235.03581
roll_dumbbell 0.0394330 0.0738221 0.1036183 0.0767889 0.0438397 0.0642194 254.23782
accel_dumbbell_y 0.0485603 0.0558788 0.1130781 0.0656890 0.0411422 0.0626798 259.88089
magnet_dumbbell_x 0.1056454 0.0808095 0.1406698 0.1130797 0.0510258 0.0980953 290.86491
roll_forearm 0.1337024 0.0861652 0.1752968 0.1164821 0.0628424 0.1158936 354.19407
magnet_dumbbell_y 0.0979712 0.1071347 0.1599375 0.1435520 0.0555445 0.1102078 394.46940
pitch_belt 0.0759085 0.1187436 0.0982781 0.1014234 0.0351515 0.0847816 399.76097
pitch_forearm 0.0953326 0.0575115 0.0779447 0.1289853 0.0501202 0.0821714 472.08592
magnet_dumbbell_z 0.1361086 0.0980454 0.1538077 0.1295085 0.0630891 0.1173236 482.43098
yaw_belt 0.1162264 0.0974325 0.1232083 0.1594407 0.0660065 0.1116528 528.08876
roll_belt 0.0816078 0.1014162 0.1215034 0.1352998 0.1965186 0.1222927 769.33329