Neural Network with R


Computer scientists have long been inspired by the human brain. The artificial neural network (ANN) is a computational system modeled after the connectivity of the human brain. A neural network does not process data in a linear fashion. Instead, information is processed collectively and in parallel throughout a network of nodes (the nodes, in this case, being neurons).

This simple experiment applies a neural network to a classification problem using R.

Step 1: Load the dataset

For this experiment, the Titanic dataset from Kaggle will be used. The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. The training dataset consists of 891 rows and the test dataset consists of 418 rows. Each row of the training dataset is labelled with whether the passenger survived (the Survived column, 1 for yes and 0 for no). The test dataset is unlabelled; it will be used for prediction, and the predictions will be submitted to Kaggle for evaluation.

train.data <- read.csv("train.csv")
test.data <- read.csv("test.csv")

Overview of the train.data:

str(train.data)
'data.frame':	891 obs. of  12 variables:
 $ PassengerId: int  1 2 3 4 5 6 7 8 9 10 ...
 $ Survived   : int  0 1 1 1 0 0 0 0 1 1 ...
 $ Pclass     : int  3 1 3 1 3 3 1 3 3 2 ...
 $ Name       : chr  "Braund, Mr. Owen Harris" "Cumings, Mrs. John Bradley (Florence Briggs Thayer)" "Heikkinen, Miss. Laina" "Futrelle, Mrs. Jacques Heath (Lily May Peel)" ...
 $ Sex        : chr  "male" "female" "female" "female" ...
 $ Age        : num  22 38 26 35 35 NA 54 2 27 14 ...
 $ SibSp      : int  1 1 0 1 0 0 0 3 0 1 ...
 $ Parch      : int  0 0 0 0 0 0 0 1 2 0 ...
 $ Ticket     : chr  "A/5 21171" "PC 17599" "STON/O2. 3101282" "113803" ...
 $ Fare       : num  7.25 71.28 7.92 53.1 8.05 ...
 $ Cabin      : chr  "" "C85" "" "C123" ...
 $ Embarked   : chr  "S" "C" "S" "S" ...

Plot density estimates of Age and Fare to get a better understanding of how these values are distributed. Age contains missing values, so na.rm = TRUE is needed.

plot(density(train.data$Age, na.rm = TRUE))
plot(density(train.data$Fare, na.rm = TRUE))
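Because Age contains NA values and Cabin contains empty strings, it can help to count the gaps per column before modelling. The following is a minimal sketch on a made-up data frame; `toy` and its values are hypothetical stand-ins for train.data, not the real Kaggle data.

```r
# Hypothetical toy data frame standing in for train.data (values are made up).
toy <- data.frame(
  Age   = c(22, 38, NA, 35, NA),
  Fare  = c(7.25, 71.28, 7.92, 53.10, 8.05),
  Cabin = c("", "C85", "", "C123", ""),
  stringsAsFactors = FALSE
)

# Count NA values per column.
na_counts <- colSums(is.na(toy))

# Count empty strings per column (relevant to character columns such as Cabin).
empty_counts <- sapply(toy, function(col) sum(col == "", na.rm = TRUE))

print(na_counts)
print(empty_counts)
```

The same two lines can be applied to train.data directly once it is loaded.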


Step 2: Split the train dataset into training and testing sets

In order to train and then evaluate the neural network, the train dataset will be divided into 80% for training and 20% for testing. The createDataPartition function from the caret package will be used for the split.

library(caret)
set.seed(123)  # arbitrary seed so that the random split is reproducible
inTrain <- createDataPartition(train.data$Survived, p = 0.8, list = FALSE)
training <- train.data[inTrain, ]
testing <- train.data[-inTrain, ]
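To make the idea of a stratified 80/20 split concrete, here is a minimal base R sketch that samples 80% of the rows within each outcome class. It is an illustration on made-up data, not a reproduction of what createDataPartition does internally; `toy`, `in_train` and the seed are assumptions chosen for the example.

```r
set.seed(42)  # arbitrary seed, chosen only for reproducibility

# Hypothetical stand-in for train.data: 100 rows with a 0/1 outcome.
toy <- data.frame(
  PassengerId = 1:100,
  Survived    = rep(c(0, 1), times = c(60, 40))
)

# Group the row indices by outcome class, then sample 80% within each class.
idx_by_class <- split(seq_len(nrow(toy)), toy$Survived)
in_train <- unlist(lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), size = round(0.8 * length(idx)))]
}))

toy_training <- toy[in_train, ]
toy_testing  <- toy[-in_train, ]

# Both subsets keep the 60/40 class ratio of the full data.
print(table(toy_training$Survived))
print(table(toy_testing$Survived))
```

Sampling within each class keeps the share of survivors roughly equal in both subsets, which makes the testing accuracy a fairer estimate.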

Step 3: Load the nnet package

For this experiment, the nnet package will be used. The nnet package is for feed-forward neural networks with a single hidden layer, and for multinomial log-linear models.

library(nnet)
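A feed-forward network with a single hidden layer can be sketched in a few lines of base R. The example below pushes one input vector through one logistic hidden unit and a two-class softmax output. All weights and input values are made up for illustration; they are not fitted values, and the code is not taken from the nnet source.

```r
# Logistic activation applied by the hidden unit.
sigmoid <- function(z) 1 / (1 + exp(-z))

# Softmax turns raw output scores into probabilities that sum to 1.
softmax <- function(z) {
  e <- exp(z - max(z))  # subtract the maximum for numerical stability
  e / sum(e)
}

# Hypothetical input and weights (illustrative only).
x  <- c(1, 0.3, 3)                           # e.g. encoded Sex, scaled Age, Pclass
W1 <- matrix(c(0.5, -1.2, -0.4), nrow = 1)   # 1 hidden unit x 3 inputs
b1 <- 0.1                                    # hidden-layer bias
W2 <- matrix(c(1.5, -1.5), nrow = 2)         # 2 outputs x 1 hidden unit
b2 <- c(-0.2, 0.2)                           # output-layer biases

hidden <- sigmoid(W1 %*% x + b1)                  # hidden activation (1 x 1 matrix)
output <- softmax(as.vector(W2 %*% hidden) + b2)  # class probabilities

print(output)
```

Training consists of adjusting W1, b1, W2 and b2 so that the output probabilities match the observed labels as closely as possible.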

For a softmax fit, nnet needs the target variable of the classification (i.e. Survived) as a two-column indicator matrix, one column for No and the other for Yes, with 1/0 entries as appropriate. Convert the Survived column by using the class.ind utility function that ships with nnet.

training$Surv = class.ind(training$Survived)
testing$Surv = class.ind(testing$Survived)
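To see what class.ind produces, here is a small sketch on a hypothetical label vector (the values in `survived` are made up):

```r
library(nnet)

# Hypothetical 0/1 labels, standing in for the Survived column.
survived <- c(0, 1, 1, 0, 1)

# One indicator column per distinct label; each row contains a single 1.
ind <- class.ind(survived)
print(ind)
```

The columns are named after the distinct labels ("0" and "1"), so the second column marks the survivors. This is why the later steps select column 2 of the predictions.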

Step 4: Train the Model

Fit a neural network for classification purposes:

fitnn = nnet(Surv~Sex+Age+Pclass, training, size=1, softmax=TRUE)
fitnn
summary(fitnn)
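As a possible alternative, nnet also accepts a factor response in the formula, in which case no indicator matrix is needed and predict can return class labels directly. The sketch below is self-contained and uses synthetic data; the data-generating rule, `size = 2`, `decay = 0.01` and `maxit = 200` are assumptions picked for illustration, not tuned values or the settings used in this experiment.

```r
library(nnet)
set.seed(1)  # arbitrary seed for reproducibility

# Synthetic stand-in for the training data (values are randomly generated).
n <- 200
toy <- data.frame(
  Sex    = factor(sample(c("female", "male"), n, replace = TRUE)),
  Age    = round(runif(n, 1, 70)),
  Pclass = sample(1:3, n, replace = TRUE)
)

# Made-up rule: women and higher classes are more likely to survive.
p <- plogis(1.5 * (toy$Sex == "female") - 0.5 * (toy$Pclass - 2))
toy$Survived <- factor(rbinom(n, 1, p), levels = c(0, 1), labels = c("No", "Yes"))

# Factor response: nnet fits a two-class model without class.ind or softmax.
fit <- nnet(Survived ~ Sex + Age + Pclass, data = toy,
            size = 2, decay = 0.01, maxit = 200, trace = FALSE)

# type = "class" returns the predicted label for each row.
pred_class <- predict(fit, toy, type = "class")
print(table(predicted = pred_class, actual = toy$Survived))
```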

Step 5: Evaluate the Model

Evaluate the overall performance of the neural network by tabulating its predictions on the testing data against the actual outcomes.

table(data.frame(predicted=predict(fitnn, testing)[,2] > 0.5,
actual=testing$Surv[,2]>0.5))

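In this evaluation, a predicted probability greater than 0.5 is labelled as “Survived” (TRUE). Passengers with a missing Age get an NA prediction and are therefore left out of the table.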

          actual
predicted FALSE TRUE
    FALSE    82   16
    TRUE      9   37
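Summary metrics can be derived from this table. The sketch below re-enters the four counts from the table above by hand and computes accuracy, precision and recall; the variable names are chosen for this example.

```r
# Confusion matrix copied from the table above (rows: predicted, columns: actual).
cm <- matrix(c(82, 9, 16, 37), nrow = 2,
             dimnames = list(predicted = c("FALSE", "TRUE"),
                             actual    = c("FALSE", "TRUE")))

# Share of all predictions that are correct.
accuracy <- sum(diag(cm)) / sum(cm)

# Share of predicted survivors who actually survived.
precision <- cm["TRUE", "TRUE"] / sum(cm["TRUE", ])

# Share of actual survivors that the model identified.
recall <- cm["TRUE", "TRUE"] / sum(cm[, "TRUE"])

print(round(c(accuracy = accuracy, precision = precision, recall = recall), 3))
```

On these counts the accuracy is 119/144, or about 82.6%, while recall is lower at 37/53, or about 69.8%: the model misses more survivors than it wrongly predicts.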

Step 6: Predict the test data

For submission to Kaggle online, use the predict function to predict the survivors for the test data.

*Note: The test data has been slightly modified by adding a “Survived” column as the second column, so that the first two columns (PassengerId and Survived) form the submission file.

predicted <- predict(fitnn, test.data)[, 2]
predicted[is.na(predicted)] <- 0  # no prediction (e.g. missing Age): treat as not survived
predicted <- ifelse(predicted > 0.5, 1, 0)  # a probability of exactly 0.5 also maps to 0
test.data$Survived<-predicted
test.data$Survived
write.csv(test.data[,1:2], "nnet-result1.csv", row.names = FALSE)

As in the evaluation step, a probability greater than 0.5 is marked as “Survived (1)”, while a probability of 0.5 or less is marked as “Not Survived (0)”. NA predictions, which occur for passengers whose age is missing, are also marked as “Not Survived (0)”.
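The thresholding rule can be checked on a small made-up vector before applying it to real predictions. In this sketch `prob` and `passenger_id` are hypothetical values, and sending a probability of exactly 0.5 to 0 is an assumption of the example.

```r
# Hypothetical predicted probabilities; NA stands for a passenger with missing Age.
prob <- c(0.91, 0.12, NA, 0.50, 0.73)
passenger_id <- 892:896  # hypothetical IDs

# 1 only when a prediction exists and exceeds 0.5; NA and values <= 0.5 become 0.
survived <- as.integer(!is.na(prob) & prob > 0.5)

# Two-column layout with PassengerId and Survived, as in the file written above.
submission <- data.frame(PassengerId = passenger_id, Survived = survived)
print(submission)
```

Combining the NA check and the threshold in one logical expression avoids the intermediate step of overwriting NA values.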

Step 7: Submit to Kaggle


Submission to Kaggle yielded an accuracy of 76.077%.