Executive Summary

The advent of smart devices opens the door to many possibilities, among them monitoring people’s physical exercise with inexpensive devices such as the Jawbone Up, Nike FuelBand and Fitbit. Indeed, users can assess the effectiveness of their activity based on the feedback provided by these devices. In this study, the quality of weight lifting (unilateral dumbbell biceps curl) exercises is evaluated with the state-of-the-art random forest classifier. The setup we use to build our random forest yields results that are very close to a perfect classification, with a very low generalization error rate.

1. Description of the data

The raw data sets are collected from four Razor inertial measurement units (IMUs)(1), each of which provides three-axis acceleration, gyroscope and magnetometer data sampled at a rate of 45 Hz. These IMUs are attached to the dumbbell and to the subject’s hand, arm and lumbar belt. Each participant is asked to execute a series of unilateral dumbbell biceps curl exercises in five different ways:

  • Class A: exactly according to the specification
  • Class B: throwing the elbows to the front
  • Class C: lifting the dumbbell only halfway
  • Class D: lowering the dumbbell only halfway
  • Class E: throwing the hips to the front

Except for class A, the other four classes correspond to common mistakes people make while executing this exercise.

2. Getting the data

The data described above come from a research and development group working on groupware technologies. The dataset is split into a training set and a testing set. These can be obtained from here:

training data: [https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv]
testing data: [https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv]

To obtain these data sets, the following code snippets can be used:

# Note: method="wget" requires the wget binary; method="auto" is more portable
fileUrl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
download.file(fileUrl, destfile="pml-training.csv", method="wget")
fileUrl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
download.file(fileUrl, destfile="pml-testing.csv", method="wget")

3. Data Cleaning

# Load the data from file (Note: this assumes the datasets have already been downloaded)
dfTraining <- read.csv("pml-training.csv", stringsAsFactors=F, header=T)
dfTesting <- read.csv("pml-testing.csv", stringsAsFactors=F, header=T)

myVars <- !(names(dfTraining) %in% "classe")
dfTraining$classe <- as.factor(dfTraining$classe)     # set the outcome as a factor variable
# Coerce all predictor columns to numeric; non-numeric entries (e.g. "#DIV/0!")
# become NA, with warnings that can safely be ignored here
dfTraining[, myVars] <- apply(dfTraining[, myVars], 2, as.numeric)
dfTesting[, myVars] <- apply(dfTesting[, myVars], 2, as.numeric)

An exploratory data analysis reveals that the training and testing sets have, respectively, 19622 and 20 observations, with 160 variables each. The variable names are listed in Appendix A1. Among those, some variables are not relevant in describing the physical movement executions. They are: the row index X, the user_name, the timestamp variables and the window variables.

Therefore, they can be dropped. Also, the summary variables (i.e., those whose names contain min, max, avg, stddev, var, kurtosis, skewness or amplitude) are derived features that were computed only when a new_window is triggered; most of the time, their values are not defined (NA). They are thus of little use for the modeling.
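That the summary variables are mostly NA can be checked by measuring the fraction of missing values per column. The toy sketch below illustrates the screening logic on a two-column data frame whose names mimic the real ones; on the real data, `df` would be dfTraining:

```r
# Toy illustration of screening out mostly-NA columns; on the real data,
# `df` would be dfTraining (column names here only mimic the real ones).
df <- data.frame(
  roll_belt     = c(1.2, 1.3, 1.1, 1.4),   # raw sensor reading: complete
  avg_roll_belt = c(NA,  NA,  NA,  1.25)   # summary variable: mostly NA
)
naFrac <- colMeans(is.na(df))              # fraction of NAs per column
keep   <- names(df)[naFrac < 0.5]          # keep the mostly-observed columns
keep                                       # only "roll_belt" survives
```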

The following code is used to get rid of these unnecessary variables and obtain a tidy dataset.

# Drop the summary variables and bookkeeping columns by pattern matching on names
toMatch <- c("kurtosis", "skewness", "max", "min", "amplitude", "var", "avg", "stddev", 
             "X", "timestamp", "user_name", "window")
toRemove <- grep(paste(toMatch,collapse="|"), names(dfTraining), value=TRUE)
dfTraining <- dfTraining[, !(names(dfTraining) %in% toRemove)]
dfTesting <- dfTesting[, !(names(dfTesting) %in% toRemove)]

The datasets that result from this cleaning process are reduced to 53 variables, for both the training and testing sets.

4. Model Building

4.1. Preprocessing

Before choosing and fitting a model to the training dataset, it can be important to preprocess its predictors. This is an important part of the data preparation, since it can reduce the complexity of the algorithm being considered. In our context, the predictors are analyzed to detect near-zero variances, and any predictor falling in that category is removed.

myVars <- !(names(dfTraining) %in% "classe")
nzv <- nearZeroVar(dfTraining[, myVars], saveMetrics=TRUE)
if ( any(nzv$nzv) ){
      # drop the near-zero-variance predictors (keeping everything else, including classe)
      dropCols <- rownames(nzv)[nzv$nzv]
      dfTraining <- dfTraining[, !(names(dfTraining) %in% dropCols)]
}

Other steps could include standardization of the dataset. However, this step depends on the algorithm under consideration. As we will mention later, our modeling involves the random forest algorithm, which bases its decisions on individual features at each split (node); those decisions are therefore invariant under monotonic transformations of the features. The standardization process is thus not necessary in this case.
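This invariance is easy to verify: a split of the form x ≤ t depends only on the ordering of the feature values, and a strictly increasing transformation such as log (on positive data) preserves that ordering. A minimal base-R sketch:

```r
# A tree split (x <= t) depends only on the ordering of x, so a strictly
# increasing transform such as log produces the same partition of the
# observations, merely with a transformed threshold.
x <- c(0.5, 2, 7, 31, 120)
stopifnot(identical(order(x), order(log(x))))   # ordering is preserved
left_raw <- x <= 7                              # split at t = 7
left_log <- log(x) <= log(7)                    # equivalent split after log
stopifnot(identical(left_raw, left_log))        # identical left/right groups
```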

4.2. Data Slicing

Since the training dataset at our disposal is of medium size, we decide to split it into a 60% (training) - 40% (validation) proportion for the prediction study. The following code snippet accomplishes this:

# split the training data set for cross-validation (createDataPartition is from caret)
library(caret)
tIndices <- createDataPartition(y=dfTraining$classe, p=0.6, list=FALSE)
dfTrainCv <- dfTraining[tIndices,]
dfTestCv <- dfTraining[-tIndices,]

4.3. Train

In our approach, the random forest algorithm is adopted for the classification because of its numerous properties; in particular, it is robust to overfitting (i.e., it generalizes well), it efficiently decorrelates the individual trees (as compared to the bagging method), it can estimate feature importances, and it usually provides satisfying results. To train the classifier, each tree is grown on a bootstrap sample of the training data, and 500 such trees (the randomForest default) are aggregated. This approach is implemented as follows:

library(randomForest)
set.seed(1248)
ptime <- system.time(modfitrf <- randomForest(classe ~ ., data=dfTrainCv, importance=TRUE, proximity=TRUE))
modfitrf
## 
## Call:
##  randomForest(formula = classe ~ ., data = dfTrainCv, importance = TRUE,      proximity = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 7
## 
##         OOB estimate of  error rate: 0.73%
## Confusion matrix:
##      A    B    C    D    E  class.error
## A 3347    1    0    0    0 0.0002986858
## B   13 2262    4    0    0 0.0074594120
## C    0   20 2024   10    0 0.0146056475
## D    0    0   26 1902    2 0.0145077720
## E    0    1    2    7 2155 0.0046189376

The fitted model has an estimated overall out-of-bag (OOB) error rate of 0.73%, which is quite good. In a random forest, the OOB error rate is an estimate of the out-of-sample error, computed internally as the forest is built. Thus, we can anticipate that the out-of-sample error will be of the same order of magnitude as the OOB error.
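For reference, the running OOB estimate is stored in the fitted object’s err.rate matrix; the access pattern shown below on the built-in iris data applies identically to modfitrf (the iris fit is only illustrative, not part of this study):

```r
library(randomForest)

# Small illustrative fit on the built-in iris data; modfitrf exposes the
# same err.rate matrix (one row per tree, "OOB" column = running OOB error).
set.seed(1)
fit <- randomForest(Species ~ ., data = iris, ntree = 100)
oobCurve <- fit$err.rate[, "OOB"]   # running OOB error as trees are added
oobFinal <- oobCurve[fit$ntree]     # final OOB estimate, as reported by print(fit)
```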

4.4. Cross-validation and Result Analysis

Although a separate cross-validation step is not strictly needed for the random forest algorithm (the generalization error is estimated internally from the OOB samples during the run)(2), we nevertheless include this extra step to verify the generalization ability of the fitted model on an external validation set. The prediction on the validation set is done as follows:

set.seed(13579)
predRf <- predict(modfitrf, dfTestCv)

  • Confusion Matrix

cm <- confusionMatrix(dfTestCv$classe, predRf); cm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 2230    1    0    0    1
##          B    6 1512    0    0    0
##          C    0   12 1356    0    0
##          D    0    0    2 1284    0
##          E    0    0    0    0 1442
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9972          
##                  95% CI : (0.9958, 0.9982)
##     No Information Rate : 0.285           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9965          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9973   0.9915   0.9985   1.0000   0.9993
## Specificity            0.9996   0.9991   0.9982   0.9997   1.0000
## Pos Pred Value         0.9991   0.9960   0.9912   0.9984   1.0000
## Neg Pred Value         0.9989   0.9979   0.9997   1.0000   0.9998
## Prevalence             0.2850   0.1944   0.1731   0.1637   0.1839
## Detection Rate         0.2842   0.1927   0.1728   0.1637   0.1838
## Detection Prevalence   0.2845   0.1935   0.1744   0.1639   0.1838
## Balanced Accuracy      0.9985   0.9953   0.9983   0.9998   0.9997

The table of results shows that the fitted model has an accuracy of 99.72% (the out-of-sample error is then 0.28%). This accuracy, measured on a dataset unseen during training, corroborates the OOB error rate estimated earlier in the training step. The designed model thus has a nearly unbiased estimate of the test set error which, in turn, ensures a very good rate of generalization.
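The out-of-sample error quoted above is simply one minus the held-out accuracy. A base-R sketch of that computation on toy labels (on the real data, truth and pred would be dfTestCv$classe and predRf):

```r
# Out-of-sample error = misclassification rate on held-out data.
truth <- factor(c("A", "A", "B", "B", "C"))
pred  <- factor(c("A", "A", "B", "C", "C"), levels = levels(truth))
accuracy  <- mean(pred == truth)   # 4 of 5 correct -> 0.8
oos_error <- 1 - accuracy          # 0.2
table(pred, truth)                 # a base-R confusion matrix
```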

library(pROC)
# multiclass.roc expects a numeric predictor, so the predicted class labels are coded 1-5
predictions <- as.numeric(predict(modfitrf, dfTestCv, type='response'))
mcauc <- multiclass.roc(dfTestCv$classe, predictions, percent=TRUE); mcauc
## 
## Call:
## multiclass.roc.default(response = dfTestCv$classe, predictor = predictions,     percent = TRUE)
## 
## Data: predictions with 5 levels of dfTestCv$classe: A, B, C, D, E.
## Multi-class area under the curve: 99.91%

Also, our model’s multi-class AUC of 0.9991 is very close to 1, the AUC of a perfect classifier.

  • Variable importance

One interesting feature of the random forest algorithm (as with tree-based methods in general) is that each predictor’s importance can be deduced from an impurity measure. Appendix A2 lists the variable importances ordered by the mean decrease in Gini impurity (MeanDecreaseGini). This measure quantifies the average decrease in node impurity (i.e., in the risk of misclassification) contributed by splits on a given variable. Simply speaking, the higher its value, the more important the role this variable plays in the correct classification. This is convenient in a feature selection context.
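Assuming modfitrf was fitted with importance=TRUE (as above), the table in Appendix A2 can be reproduced and ranked with the importance() accessor; the sketch below demonstrates the same calls on the built-in iris data, which stands in here only for illustration:

```r
library(randomForest)

# Illustrative fit on iris; the same two calls apply to modfitrf.
set.seed(1)
fit <- randomForest(Species ~ ., data = iris, importance = TRUE, ntree = 100)
imp <- importance(fit)                                   # incl. MeanDecreaseGini
ranked <- imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), ]
head(ranked)        # most important predictors first
# varImpPlot(fit)   # graphical summary of the same information
```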

5. Prediction on the test set

The 20 samples from the test dataset are classified as follows:

predtest <- predict(modfitrf, dfTesting)
predtest
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 
##  B  A  B  A  A  E  D  B  A  A  B  C  B  A  E  E  A  B  B  B 
## Levels: A B C D E

References

  1. Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. “Qualitative Activity Recognition of Weight Lifting Exercises”, Proceedings of 4th International Conference in Cooperation with SIGCHI (Augmented Human ’13) . Stuttgart, Germany: ACM SIGCHI, 2013.

  2. Leo Breiman and Adele Cutler, “Random Forests”.

Appendices

A1. Variable names

##   [1] "X"                        "user_name"               
##   [3] "raw_timestamp_part_1"     "raw_timestamp_part_2"    
##   [5] "cvtd_timestamp"           "new_window"              
##   [7] "num_window"               "roll_belt"               
##   [9] "pitch_belt"               "yaw_belt"                
##  [11] "total_accel_belt"         "kurtosis_roll_belt"      
##  [13] "kurtosis_picth_belt"      "kurtosis_yaw_belt"       
##  [15] "skewness_roll_belt"       "skewness_roll_belt.1"    
##  [17] "skewness_yaw_belt"        "max_roll_belt"           
##  [19] "max_picth_belt"           "max_yaw_belt"            
##  [21] "min_roll_belt"            "min_pitch_belt"          
##  [23] "min_yaw_belt"             "amplitude_roll_belt"     
##  [25] "amplitude_pitch_belt"     "amplitude_yaw_belt"      
##  [27] "var_total_accel_belt"     "avg_roll_belt"           
##  [29] "stddev_roll_belt"         "var_roll_belt"           
##  [31] "avg_pitch_belt"           "stddev_pitch_belt"       
##  [33] "var_pitch_belt"           "avg_yaw_belt"            
##  [35] "stddev_yaw_belt"          "var_yaw_belt"            
##  [37] "gyros_belt_x"             "gyros_belt_y"            
##  [39] "gyros_belt_z"             "accel_belt_x"            
##  [41] "accel_belt_y"             "accel_belt_z"            
##  [43] "magnet_belt_x"            "magnet_belt_y"           
##  [45] "magnet_belt_z"            "roll_arm"                
##  [47] "pitch_arm"                "yaw_arm"                 
##  [49] "total_accel_arm"          "var_accel_arm"           
##  [51] "avg_roll_arm"             "stddev_roll_arm"         
##  [53] "var_roll_arm"             "avg_pitch_arm"           
##  [55] "stddev_pitch_arm"         "var_pitch_arm"           
##  [57] "avg_yaw_arm"              "stddev_yaw_arm"          
##  [59] "var_yaw_arm"              "gyros_arm_x"             
##  [61] "gyros_arm_y"              "gyros_arm_z"             
##  [63] "accel_arm_x"              "accel_arm_y"             
##  [65] "accel_arm_z"              "magnet_arm_x"            
##  [67] "magnet_arm_y"             "magnet_arm_z"            
##  [69] "kurtosis_roll_arm"        "kurtosis_picth_arm"      
##  [71] "kurtosis_yaw_arm"         "skewness_roll_arm"       
##  [73] "skewness_pitch_arm"       "skewness_yaw_arm"        
##  [75] "max_roll_arm"             "max_picth_arm"           
##  [77] "max_yaw_arm"              "min_roll_arm"            
##  [79] "min_pitch_arm"            "min_yaw_arm"             
##  [81] "amplitude_roll_arm"       "amplitude_pitch_arm"     
##  [83] "amplitude_yaw_arm"        "roll_dumbbell"           
##  [85] "pitch_dumbbell"           "yaw_dumbbell"            
##  [87] "kurtosis_roll_dumbbell"   "kurtosis_picth_dumbbell" 
##  [89] "kurtosis_yaw_dumbbell"    "skewness_roll_dumbbell"  
##  [91] "skewness_pitch_dumbbell"  "skewness_yaw_dumbbell"   
##  [93] "max_roll_dumbbell"        "max_picth_dumbbell"      
##  [95] "max_yaw_dumbbell"         "min_roll_dumbbell"       
##  [97] "min_pitch_dumbbell"       "min_yaw_dumbbell"        
##  [99] "amplitude_roll_dumbbell"  "amplitude_pitch_dumbbell"
## [101] "amplitude_yaw_dumbbell"   "total_accel_dumbbell"    
## [103] "var_accel_dumbbell"       "avg_roll_dumbbell"       
## [105] "stddev_roll_dumbbell"     "var_roll_dumbbell"       
## [107] "avg_pitch_dumbbell"       "stddev_pitch_dumbbell"   
## [109] "var_pitch_dumbbell"       "avg_yaw_dumbbell"        
## [111] "stddev_yaw_dumbbell"      "var_yaw_dumbbell"        
## [113] "gyros_dumbbell_x"         "gyros_dumbbell_y"        
## [115] "gyros_dumbbell_z"         "accel_dumbbell_x"        
## [117] "accel_dumbbell_y"         "accel_dumbbell_z"        
## [119] "magnet_dumbbell_x"        "magnet_dumbbell_y"       
## [121] "magnet_dumbbell_z"        "roll_forearm"            
## [123] "pitch_forearm"            "yaw_forearm"             
## [125] "kurtosis_roll_forearm"    "kurtosis_picth_forearm"  
## [127] "kurtosis_yaw_forearm"     "skewness_roll_forearm"   
## [129] "skewness_pitch_forearm"   "skewness_yaw_forearm"    
## [131] "max_roll_forearm"         "max_picth_forearm"       
## [133] "max_yaw_forearm"          "min_roll_forearm"        
## [135] "min_pitch_forearm"        "min_yaw_forearm"         
## [137] "amplitude_roll_forearm"   "amplitude_pitch_forearm" 
## [139] "amplitude_yaw_forearm"    "total_accel_forearm"     
## [141] "var_accel_forearm"        "avg_roll_forearm"        
## [143] "stddev_roll_forearm"      "var_roll_forearm"        
## [145] "avg_pitch_forearm"        "stddev_pitch_forearm"    
## [147] "var_pitch_forearm"        "avg_yaw_forearm"         
## [149] "stddev_yaw_forearm"       "var_yaw_forearm"         
## [151] "gyros_forearm_x"          "gyros_forearm_y"         
## [153] "gyros_forearm_z"          "accel_forearm_x"         
## [155] "accel_forearm_y"          "accel_forearm_z"         
## [157] "magnet_forearm_x"         "magnet_forearm_y"        
## [159] "magnet_forearm_z"         "classe"

A2. Variable Importance

A B C D E MeanDecreaseAccuracy MeanDecreaseGini
gyros_arm_z 0.0050225 0.0045273 0.0039092 0.0045855 0.0030702 0.0042983 36.99014
gyros_forearm_x 0.0031082 0.0057950 0.0088756 0.0093058 0.0026574 0.0055650 45.40488
gyros_forearm_z 0.0047288 0.0083247 0.0053113 0.0038730 0.0024903 0.0049754 50.19649
gyros_dumbbell_z 0.0031446 0.0071945 0.0056385 0.0058126 0.0036431 0.0048957 52.41906
gyros_belt_x 0.0112111 0.0091934 0.0196929 0.0121652 0.0034192 0.0110402 54.91703
total_accel_arm 0.0077674 0.0084395 0.0109646 0.0117264 0.0057236 0.0087253 64.17706
gyros_belt_y 0.0087265 0.0160730 0.0223532 0.0144325 0.0061878 0.0130035 65.64987
total_accel_forearm 0.0157780 0.0078179 0.0104160 0.0086942 0.0050312 0.0101584 70.27534
gyros_forearm_y 0.0072035 0.0126087 0.0076831 0.0097917 0.0044780 0.0082529 73.11439
accel_belt_x 0.0183746 0.0234457 0.0301236 0.0200599 0.0075763 0.0196966 77.25316
accel_arm_z 0.0171785 0.0106935 0.0172166 0.0102051 0.0061742 0.0127634 78.91468
gyros_dumbbell_x 0.0048955 0.0137483 0.0187227 0.0122831 0.0078844 0.0107871 78.95510
gyros_arm_x 0.0197860 0.0122563 0.0078804 0.0104655 0.0053238 0.0120687 81.20208
accel_forearm_y 0.0183313 0.0136022 0.0223476 0.0155492 0.0077646 0.0157137 83.51441
gyros_arm_y 0.0130845 0.0156529 0.0070519 0.0119388 0.0055157 0.0109534 84.64759
accel_belt_y 0.0213945 0.0215537 0.0251514 0.0324871 0.0109188 0.0219798 84.81910
accel_arm_y 0.0203185 0.0191830 0.0176791 0.0208022 0.0096976 0.0177665 97.78520
yaw_forearm 0.0162530 0.0158276 0.0268242 0.0382422 0.0116470 0.0207784 100.43565
pitch_arm 0.0163687 0.0203359 0.0216935 0.0209497 0.0115849 0.0179427 105.04392
pitch_dumbbell 0.0217716 0.0292981 0.0388038 0.0318504 0.0195763 0.0274601 108.08058
magnet_arm_z 0.0213520 0.0198561 0.0200372 0.0128993 0.0080734 0.0170020 112.38083
magnet_forearm_y 0.0222520 0.0190485 0.0278879 0.0288940 0.0165891 0.0226576 128.51545
magnet_arm_y 0.0272279 0.0330026 0.0370462 0.0501245 0.0217393 0.0328041 129.36346
magnet_forearm_x 0.0268747 0.0158612 0.0209969 0.0332373 0.0192755 0.0233523 129.82700
accel_arm_x 0.0304255 0.0234829 0.0256853 0.0430316 0.0161714 0.0276915 139.44320
total_accel_belt 0.0173906 0.0233491 0.0257660 0.0250194 0.0221909 0.0221340 140.01929
yaw_arm 0.0232137 0.0191143 0.0260349 0.0269577 0.0102085 0.0211328 140.96476
accel_forearm_z 0.0245532 0.0272793 0.0517790 0.0412692 0.0227288 0.0322398 144.92912
magnet_belt_x 0.0185672 0.0307866 0.0429687 0.0323963 0.0152365 0.0268295 147.90910
accel_dumbbell_x 0.0290362 0.0338737 0.0492995 0.0416988 0.0256401 0.0349824 151.45015
yaw_dumbbell 0.0299674 0.0356011 0.0608455 0.0413435 0.0270713 0.0377845 153.74850
magnet_arm_x 0.0600883 0.0395969 0.0428598 0.0617101 0.0291868 0.0477098 156.31836
total_accel_dumbbell 0.0338707 0.0309443 0.0394825 0.0570376 0.0239026 0.0362438 159.06478
gyros_dumbbell_y 0.0266702 0.0306184 0.0462819 0.0287792 0.0155376 0.0291577 165.02473
magnet_forearm_z 0.0290397 0.0249584 0.0370571 0.0411755 0.0201326 0.0299856 168.75157
gyros_belt_z 0.0282199 0.0391090 0.0418926 0.0352375 0.0303498 0.0342689 184.31019
roll_arm 0.0250423 0.0365796 0.0530015 0.0653112 0.0210765 0.0380024 186.91642
accel_forearm_x 0.0409852 0.0392238 0.0433190 0.0761958 0.0310900 0.0449897 186.99584
accel_dumbbell_z 0.0389369 0.0446580 0.0648370 0.0569297 0.0426827 0.0481886 195.99288
magnet_belt_y 0.0318265 0.0485406 0.0512960 0.0589625 0.0399584 0.0443944 225.24378
accel_belt_z 0.0298142 0.0358739 0.0525637 0.0437662 0.0366055 0.0384730 227.73361
magnet_belt_z 0.0386145 0.0542071 0.0574914 0.0648343 0.0395518 0.0493911 235.03581
roll_dumbbell 0.0394330 0.0738221 0.1036183 0.0767889 0.0438397 0.0642194 254.23782
accel_dumbbell_y 0.0485603 0.0558788 0.1130781 0.0656890 0.0411422 0.0626798 259.88089
magnet_dumbbell_x 0.1056454 0.0808095 0.1406698 0.1130797 0.0510258 0.0980953 290.86491
roll_forearm 0.1337024 0.0861652 0.1752968 0.1164821 0.0628424 0.1158936 354.19407
magnet_dumbbell_y 0.0979712 0.1071347 0.1599375 0.1435520 0.0555445 0.1102078 394.46940
pitch_belt 0.0759085 0.1187436 0.0982781 0.1014234 0.0351515 0.0847816 399.76097
pitch_forearm 0.0953326 0.0575115 0.0779447 0.1289853 0.0501202 0.0821714 472.08592
magnet_dumbbell_z 0.1361086 0.0980454 0.1538077 0.1295085 0.0630891 0.1173236 482.43098
yaw_belt 0.1162264 0.0974325 0.1232083 0.1594407 0.0660065 0.1116528 528.08876
roll_belt 0.0816078 0.1014162 0.1215034 0.1352998 0.1965186 0.1222927 769.33329