GetCleanDataProject

The source code is in run_analysis.R. It processes data collected from the accelerometers of the Samsung Galaxy S smartphone.

A full description of the data is available at: http://archive.ics.uci.edu/ml/datasets/Human+Activity+Recognition+Using+Smartphones

The data can be downloaded from this link: https://d396qusza40orc.cloudfront.net/getdata%2Fprojectfiles%2FUCI%20HAR%20Dataset.zip

Note that the source code (run_analysis.R) must reside in the same working directory as the data folder (UCI HAR Dataset), because the script reads its input from paths relative to "./UCI HAR Dataset/".
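Before running the script, it can help to confirm that the working directory really contains the input files that the code reads. The following is a minimal sketch, not part of run_analysis.R; the helper name `missing_har_files` is hypothetical, and the file list is taken from the read.table() calls in the source code.

```r
# Hypothetical helper (not part of run_analysis.R): return the input files
# that the script reads but that cannot be found under `root`.
missing_har_files <- function(root = "UCI HAR Dataset") {
  needed <- c(
    "features.txt",
    "activity_labels.txt",
    file.path("test", c("subject_test.txt", "X_test.txt", "y_test.txt")),
    file.path("train", c("subject_train.txt", "X_train.txt", "y_train.txt"))
  )
  paths <- file.path(root, needed)
  paths[!file.exists(paths)]
}

# Check the default location relative to the current working directory.
missing <- missing_har_files()
if (length(missing) > 0) {
  message("Missing input files:\n", paste(missing, collapse = "\n"))
} else {
  message("All input files found; run_analysis.R can be run from here.")
}
```

If any files are reported missing, changing the working directory with setwd() to the folder that holds "UCI HAR Dataset" is usually enough.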

### Summary of files in the repo

There are 4 files in this repo:

- run_analysis.R - R source code to generate the tidy data from the Samsung dataset
- README.md - Provides general info and describes how the code works
- CodeBook.md - Provides info about the tidy dataset generated
- GetCleanDataProject_Step5.txt - The exported tidy dataset in text format

### Summary of Source Code

The code performs data cleaning and reshaping in 5 steps:

1. Forms a complete dataset for each of the TEST and TRAIN results, with feature names as field names and with subject ID, activity ID, and activity label as new columns.
2. Combines the TEST and TRAIN data using rbind().
3. Extracts the mean and standard deviation of each measurement. Variables whose labels contain the term "mean" or "std" are extracted.
4. Cleans the variable names to make them more readable: each "." is replaced with a space, extra spaces are removed, and "std" is expanded to "StandardDeviation".
5. Creates a second, independent tidy data set with the average of each variable for each activity and each subject. The dataset is split (split()) by the subject and activity variables, and colMeans() is then applied to all the mean and std features. The result is transformed to form the final tidy dataset. The word "Average" is added to the labels of the features to reflect the new variables.
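Steps 2 to 5 can be tried on a tiny table before running them on the full dataset. The sketch below is an illustration only: the values are invented, `make_toy` is a hypothetical helper, and make.names() stands in for the name conversion that read.table() applies to the feature names. Unlike the real script, it passes drop = TRUE to split() because the toy data does not contain every subject/activity combination.

```r
# Feature names as they look after read.table() has made them syntactic.
feature_names <- make.names(c("tBodyAcc-mean()-X", "tBodyAcc-std()-X",
                              "tBodyAcc-max()-X"))

# Hypothetical helper that builds a small table with invented values.
make_toy <- function(subject, activity, values) {
  out <- data.frame(subject, activity, values, values * 2, values * 3)
  colnames(out) <- c("Subject", "Activity", feature_names)
  out
}
toy_test  <- make_toy(c(1, 1), c("WALKING", "WALKING"), c(0.1, 0.3))
toy_train <- make_toy(c(2, 2), c("SITTING", "SITTING"), c(0.2, 0.6))

# Step 2: stack the rows of the two tables.
toy <- rbind(toy_test, toy_train)

# Step 3: keep Subject, Activity and the columns containing "mean" or "std".
keep <- grep("mean|std", colnames(toy))
toy_meanstd <- toy[, c(1, 2, keep)]

# Step 4: dots become spaces, extra spaces are removed, "std" is expanded.
clean <- gsub(".", " ", colnames(toy_meanstd), fixed = TRUE)
clean <- gsub("^ *|(?<= ) | *$", "", clean, perl = TRUE)
colnames(toy_meanstd) <- gsub("std", "StandardDeviation", clean, fixed = TRUE)

# Step 5: split by subject and activity, then average each feature column.
# drop = TRUE skips combinations that have no rows in this toy data.
groups <- split(toy_meanstd[, c(-1, -2)],
                list(toy_meanstd$Subject, toy_meanstd$Activity), drop = TRUE)
toy_avg <- t(sapply(groups, colMeans))

print(colnames(toy_meanstd))
print(toy_avg)
```

The row names of `toy_avg` have the form "subject.activity", which is why the real script later splits them on "." to recover the Subject and Activity columns.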

Source Code

#Form complete dataset for TEST results comprising feature names as field names, subject id as a new column and activity id and labels as new columns
datafeature <- read.table("./UCI HAR Dataset/features.txt", col.names = c("SN", "Feature"))
datasubjecttest <- read.table("./UCI HAR Dataset/test/subject_test.txt", 
                              col.names = "Subject")
datatestx <- read.table("./UCI HAR Dataset/test/X_test.txt", 
                              col.names = datafeature$Feature) #Read in main data and append feature names as field names
datatesty <- read.table("./UCI HAR Dataset/test/y_test.txt", 
                        col.names = "ActID")
tbtestxy <- cbind(datasubjecttest, datatestx, datatesty) #Form dataset without activity labels
dataactivitylabels <- read.table("./UCI HAR Dataset/activity_labels.txt", 
                                 col.names = c("ActID", "Activity"))
tbtest <- merge(tbtestxy, dataactivitylabels, by.x = "ActID", 
                by.y = "ActID", sort = FALSE) #Match to activity labels


#Form complete dataset for TRAIN results comprising feature names as field names, 
#subject id as a new column and activity id and labels as new columns
datasubjecttrain <- read.table("./UCI HAR Dataset/train/subject_train.txt", 
                              col.names = "Subject")
datatrainx <- read.table("./UCI HAR Dataset/train/X_train.txt", 
                        col.names = datafeature$Feature) #Read in main data and append feature names as field names
datatrainy <- read.table("./UCI HAR Dataset/train/y_train.txt", 
                        col.names = "ActID")
tbtrainxy <- cbind(datasubjecttrain, datatrainx, datatrainy) #Form dataset without activity labels
tbtrain <- merge(tbtrainxy, dataactivitylabels, by.x = "ActID", 
                by.y = "ActID", sort = FALSE) #Match to activity labels

##Combine the TEST and TRAIN data
tb <- rbind(tbtest, tbtrain)

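##Extract Mean and Standard Deviation of each measurement
##read.table() has already converted "mean()" and "std()" in the feature names to
##"mean.." and "std..", so match on the plain terms (this also keeps the meanFreq features)
indexmeanstd <- grep("mean|std", colnames(tb))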
indexsubject <- grep("Subject", colnames(tb), fixed = TRUE)
indexactivity <- grep("Activity", colnames(tb), fixed = TRUE)
tbmeanstd <- tb[, c(indexsubject, indexactivity, indexmeanstd)]

##Clean variable names by replacing the dots with a single space
temp <- gsub(".", " ", colnames(tbmeanstd), fixed = TRUE)
colnames(tbmeanstd) <-  gsub("^ *|(?<= ) | *$", "", temp, perl = TRUE) ##Remove leading, repeated and trailing spaces
colnames(tbmeanstd) <-  gsub("std", "StandardDeviation", colnames(tbmeanstd), fixed = TRUE)

##Creates a second, independent tidy data set with the average of each
##variable for each activity and each subject

listsplit <- split(tbmeanstd[,c(-1,-2)], list(tbmeanstd$Subject, tbmeanstd$Activity)) ##split by subject and activity
tbsplitmean <- t(sapply(listsplit, colMeans)) ## Results in a matrix with one row per subject/activity pair and features as columns
colnames(tbsplitmean) <-  paste("Average", colnames(tbsplitmean))
listtemp <- strsplit(rownames(tbsplitmean), split = "\\.") ##Append subject and activity as variables
tbtemp <- do.call(rbind.data.frame, listtemp)
colnames(tbtemp) <-  c("Subject", "Activity") 
tbsubjectactivity <- cbind(tbtemp, tbsplitmean)

write.table(tbsubjectactivity, "GetCleanDataProject_Step5.txt", row.names = FALSE)
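To check the exported file, it can be read back into R. The sketch below is a minimal illustration under stated assumptions: `demo` and `demo_file` are hypothetical stand-ins for the tidy table and for GetCleanDataProject_Step5.txt, and the values are invented. It shows that header = TRUE is needed (the file has no row-name column) and that check.names = FALSE keeps the spaces in the column names.

```r
# Small stand-in for the tidy table, with a column name that contains spaces.
demo <- data.frame(Subject = c("1", "2"),
                   Activity = c("WALKING", "SITTING"),
                   avg = c(0.2, 0.4))
colnames(demo)[3] <- "Average tBodyAcc mean X"

# Write it the same way the script writes the tidy dataset.
demo_file <- tempfile(fileext = ".txt")
write.table(demo, demo_file, row.names = FALSE)

# Read it back; check.names = FALSE stops R from turning the spaces into dots.
reread <- read.table(demo_file, header = TRUE, check.names = FALSE)
print(colnames(reread))
```

For the real output, the same read.table() call can be pointed at "GetCleanDataProject_Step5.txt" instead of `demo_file`.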
