Titanic: Pandas Practice Project

We will use the “Titanic: Machine Learning from Disaster” data to practice data analysis in Python with pandas.

Data Description

Data Source:
The data we are using can be downloaded from Kaggle at https://www.kaggle.com/c/titanic/data.

We downloaded the train.csv file.

Variable Description:
The variable descriptions and special notes were copied from the data page linked above.

Variable Name   Variable Details
survival        Survival (0 = No; 1 = Yes)
pclass          Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
name            Name
sex             Sex
age             Age
sibsp           Number of Siblings/Spouses Aboard
parch           Number of Parents/Children Aboard
ticket          Ticket Number
fare            Passenger Fare
cabin           Cabin
embarked        Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)

Special Notes:

  • Pclass is a proxy for socio-economic status (SES) 1st ~ Upper; 2nd ~ Middle; 3rd ~ Lower
  • Age is in years; fractional if the age is less than one (1). If the age is estimated, it is in the form xx.5.
  • With respect to the family relation variables (i.e. sibsp and parch), some relations were ignored. The following are the definitions used for sibsp and parch.
  • Sibling: Brother, Sister, Stepbrother, or Stepsister of Passenger Aboard Titanic
  • Spouse: Husband or Wife of Passenger Aboard Titanic (Mistresses and Fiances Ignored)
  • Parent: Mother or Father of Passenger Aboard Titanic
  • Child: Son, Daughter, Stepson, or Stepdaughter of Passenger Aboard Titanic
  • Other family relatives excluded from this study include cousins, nephews/nieces, aunts/uncles, and in-laws. Some children traveled only with a nanny, therefore parch=0 for them. As well, some traveled with very close friends or neighbors in a village, however, the definitions do not support such relations.
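
The xx.5 convention for estimated ages noted above can be turned into a simple flag. The following is a minimal sketch on made-up ages, not on train.csv; the name is_estimated is our own, and treating every age of 1 or more that ends in .5 as estimated is an assumption based only on the note above.

```python
import pandas as pd

# Made-up ages for illustration only; not taken from train.csv.
ages = pd.Series([22.0, 28.5, 0.42, 40.5, 0.5, None], name="Age")

# Ages of 1 or more that end in .5 follow the "estimated" convention.
# Ages below 1 are fractional by design, so they are left unflagged.
# Missing ages compare as False and are left unflagged as well.
is_estimated = (ages >= 1) & (ages % 1 == 0.5)

print(is_estimated.tolist())
```

On the real data, the same expression could be applied to the Age column once the file has been loaded.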

Import Libraries

In [1]:
# imports libraries
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Data Extraction

In [2]:
# tdf will represent our DataFrame.
# Read train.csv and assign the result to tdf.
# .dropna(how='all') drops only the rows in which every column is NaN.
tdf = pd.read_csv('train.csv').dropna(how='all')
In [3]:
tdf.head(3) # display the top three rows
Out[3]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th… female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
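
The how='all' argument used when loading the data is easy to confuse with the default, how='any'. The sketch below uses a tiny made-up frame (not the Titanic data) to show the difference: how='all' removes a row only when every column is missing, while how='any' removes a row as soon as one value is missing.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame: row 1 has one missing value, row 2 is entirely missing.
demo = pd.DataFrame({
    "Age":   [22.0, np.nan, np.nan],
    "Cabin": ["C85", "E46", np.nan],
})

kept_all = demo.dropna(how="all")  # drops only row 2
kept_any = demo.dropna(how="any")  # drops rows 1 and 2

print(len(demo), len(kept_all), len(kept_any))
```

With how='all', rows that are only partly missing (such as passengers without a recorded Age or Cabin) stay in the DataFrame.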

Exploratory Data Analysis

First, let’s have a general view of our data. We are looking for missing values and correct data types for columns. The .info() method is the perfect tool for this purpose.

In [4]:
# The .info() method lists every column name with its count of non-null values
# and its data type (here integers, floats and objects). The last line shows
# the total memory consumption.
tdf.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 90.5+ KB

Things We Need to Do:

  1. The Survived, Pclass, Sex and Embarked columns hold a small set of repeated values, so they should be changed to a more suitable data type: category.
  2. The Age column contains missing data (714 non-null values out of 891). We will replace all NaN values with the mode (the most frequent age) of the values we do have.
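
The missing-value counts can also be read off directly instead of subtracting from the .info() output. This is a minimal sketch on a made-up frame; on the Titanic data, going by the .info() output above, the same calls should report 177 missing values for Age (891 − 714), 687 for Cabin and 2 for Embarked.

```python
import numpy as np
import pandas as pd

# Made-up frame for illustration only; not taken from train.csv.
demo = pd.DataFrame({
    "Age":   [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare":  [7.25, 71.2833, 7.925, 8.05],
})

# isnull() gives a True/False frame; summing counts the True values per column.
missing_counts = demo.isnull().sum()

# The mean of a boolean column is the share of missing values.
missing_share = demo.isnull().mean()

# Show only the columns that actually have missing values.
print(missing_counts[missing_counts > 0])
```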

Changing Columns to Correct Data Types

In [5]:
# The .astype() method allows us to change the data type of columns in a Data Frame.
tdf['Survived'] = tdf['Survived'].astype('category')
tdf['Pclass']   = tdf['Pclass'].astype('category')
tdf['Sex']      = tdf['Sex'].astype('category')
tdf['Embarked'] = tdf['Embarked'].astype('category')
In [6]:
# We use .info() to check our changes. The Survived, Pclass, Sex and Embarked columns
# have been changed to the category data type.
tdf.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null category
Pclass         891 non-null category
Name           891 non-null object
Sex            891 non-null category
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null category
dtypes: category(4), float64(2), int64(3), object(3)
memory usage: 66.2+ KB
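
The drop in reported memory usage after the conversion comes from how the category type stores data: each distinct label is kept once, and the rows hold small integer codes. The sketch below illustrates this on a made-up column shaped like Sex; the exact byte counts depend on the pandas version and platform, so only the comparison matters.

```python
import pandas as pd

# Made-up column with two repeated labels, similar in shape to Sex.
sex_object = pd.Series(["male", "female"] * 5000, dtype="object")
sex_category = sex_object.astype("category")

# deep=True also counts the memory used by the Python strings themselves.
bytes_object = sex_object.memory_usage(deep=True)
bytes_category = sex_category.memory_usage(deep=True)

print(bytes_object, bytes_category)
print(sex_category.cat.categories.tolist())
```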

Dealing with Missing Data (AKA NaN Values)

We have missing data in our Age column. We will replace those NaN values with the mode of the ages we do have, as follows.

In [7]:
# The .isnull() method returns a Series of boolean values for the Age column, where
# True marks a NaN value and False marks a value that is present.
tdf['Age'].isnull().head(30)
Out[7]:
0     False
1     False
2     False
3     False
4     False
5      True
6     False
7     False
8     False
9     False
10    False
11    False
12    False
13    False
14    False
15    False
16    False
17     True
18    False
19     True
20    False
21    False
22    False
23    False
24    False
25    False
26     True
27    False
28     True
29     True
Name: Age, dtype: bool
In [8]:
# age_isnull_bool is assigned the Series of boolean null flags for the Age column.
age_isnull_bool = tdf['Age'].isnull() 
In [9]:
# The following shows exactly which rows hold NaN values in Age.
# .loc replaces .ix, which has been removed from pandas.
tdf.loc[age_isnull_bool, 'Age']
Out[9]:
5     NaN
17    NaN
19    NaN
26    NaN
28    NaN
29    NaN
31    NaN
32    NaN
36    NaN
42    NaN
45    NaN
46    NaN
47    NaN
48    NaN
55    NaN
64    NaN
65    NaN
76    NaN
77    NaN
82    NaN
87    NaN
95    NaN
101   NaN
107   NaN
109   NaN
121   NaN
126   NaN
128   NaN
140   NaN
154   NaN
       ..
718   NaN
727   NaN
732   NaN
738   NaN
739   NaN
740   NaN
760   NaN
766   NaN
768   NaN
773   NaN
776   NaN
778   NaN
783   NaN
790   NaN
792   NaN
793   NaN
815   NaN
825   NaN
826   NaN
828   NaN
832   NaN
837   NaN
839   NaN
846   NaN
849   NaN
859   NaN
863   NaN
868   NaN
878   NaN
888   NaN
Name: Age, dtype: float64
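
Selecting rows with a boolean Series, as in the lookup above, is done with .loc in current pandas releases (.ix was deprecated in pandas 0.20 and removed in 1.0). The following is a minimal sketch on made-up data showing the same pattern, plus a quick way to count the flagged rows.

```python
import numpy as np
import pandas as pd

# Made-up ages for illustration only; not taken from train.csv.
demo = pd.DataFrame({"Age": [22.0, np.nan, 26.0, np.nan, 35.0]})

# Boolean mask: True where Age is missing.
mask = demo["Age"].isnull()

# .loc[rows, column] keeps only the rows where the mask is True.
missing_rows = demo.loc[mask, "Age"]

print(missing_rows.index.tolist())
print(int(mask.sum()))  # True counts as 1, so the sum is the number of NaNs
```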

Calculating the Mode (Most Frequent Age)

In [10]:
mode_age = tdf['Age'].mode()
mode_age
Out[10]:
0    24.0
dtype: float64
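
Note that .mode() returns a Series rather than a single number, because several values can tie for most frequent. The sketch below uses made-up numbers to show a tie; taking the first entry with .iloc[0] is one simple way to settle on a single value, and it picks the smallest of the tied values because the result is sorted.

```python
import pandas as pd

# Made-up ages in which 22 and 24 both appear twice.
tied = pd.Series([22, 22, 24, 24, 30])

modes = tied.mode()          # Series holding every tied value, sorted
first_mode = modes.iloc[0]   # first (smallest) of the tied values

print(modes.tolist())
print(first_mode)
```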

Filling NaN Values with the Mode

In [11]:
# Fill (replace) all NaN values in Age with the mode, mode_age.
# .iloc[0] takes the first value of the Series returned by .mode().
tdf['Age'] = tdf['Age'].fillna(mode_age.iloc[0])
In [12]:
# .info() now shows that the Age column contains no missing values.
tdf.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null category
Pclass         891 non-null category
Name           891 non-null object
Sex            891 non-null category
Age            891 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null category
dtypes: category(4), float64(2), int64(3), object(3)
memory usage: 66.2+ KB

Visual Analysis

In [13]:
# setting colors for the visualization palette

colors = [ '#ffff68', '#e74c3c', '#3498db','#FF69B4', '#34495e', '#2ecc71']
sns.set_palette(colors)

Who Were the Passengers of the Titanic?

Visualization of Passengers by Sex

In [14]:
# chart of passengers by sex
sns.countplot(x="Sex", data=tdf)
Out[14]:
<matplotlib.axes._subplots.AxesSubplot at 0x14122a8f860>
In [15]:
# Actual count of passengers by sex (male and female)
tdf['Sex'].value_counts()
Out[15]:
male      577
female    314
Name: Sex, dtype: int64
  • Actual numbers: male: 577, female: 314.
  • Conclusion: The majority of passengers were males.
In [16]:
# Compares the number of passengers in each class, shown separately for males and females.
sns.countplot(x='Sex', hue='Pclass', data=tdf);
In [17]:
tdf.groupby(['Sex', 'Pclass']).size()
Out[17]:
Sex     Pclass
female  1          94
        2          76
        3         144
male    1         122
        2         108
        3         347
dtype: int64
  • Conclusion: Among both female and male passengers, third class (lower class) was the largest group, followed by first and then second class.
  • Actual numbers: Displayed above by sex and class with the .groupby() method.
In [18]:
# Compares the number of males and females within each class as a bar chart.
sns.countplot(x='Pclass', hue='Sex', data=tdf);
In [19]:
tdf['Pclass'].value_counts().sort_index()
Out[19]:
1    216
2    184
3    491
Name: Pclass, dtype: int64

Conclusion: In every class (1st, 2nd and 3rd), males outnumbered females.
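
To put a number on how strongly males outnumbered females, the counts can be converted to shares. The sketch below rebuilds a small table from the counts printed in Out[17] rather than reading train.csv; the variable names are our own.

```python
import pandas as pd

# Counts copied from the groupby output above (Out[17]).
counts = pd.DataFrame(
    {"female": [94, 76, 144], "male": [122, 108, 347]},
    index=pd.Index([1, 2, 3], name="Pclass"),
)

# Share of males within each class: male count divided by the class total.
male_share = counts["male"] / counts.sum(axis=1)

print(male_share.round(3))
```

Males make up roughly 56% of 1st class, 59% of 2nd class and 71% of 3rd class. On the full DataFrame, pd.crosstab(tdf['Pclass'], tdf['Sex'], normalize='index') should give the same shares without rebuilding the table by hand.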

Creating a Function to Classify Minors and Adults

We will create a new column called Maturity, which classifies each passenger as a child or an adult. The function below labels children and adults by sex.

In [20]:
def minor_adult_function(passenger):
    age, sex = passenger
    if age <= 16:
        return sex + ' child'
    else:
        return sex + ' adult'
In [21]:
tdf['Maturity'] = tdf[ ['Age', 'Sex']].apply(minor_adult_function, axis = 1)
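
Applying a Python function row by row works well at this size, but the same labels can be built with whole-column operations. The following is a sketch of that alternative on made-up data, using np.where; it is not the approach used in this notebook, and the names suffix, via_apply and via_where are our own. A missing age fails the <= 16 test in both versions, so it is labeled as an adult.

```python
import numpy as np
import pandas as pd

# Made-up passengers for illustration only; Sex is categorical, as in tdf.
demo = pd.DataFrame({
    "Age": [22.0, 4.0, 16.0, 38.0, np.nan],
    "Sex": pd.Categorical(["male", "female", "male", "female", "male"]),
})

def minor_adult_function(passenger):
    age, sex = passenger
    if age <= 16:
        return sex + " child"
    return sex + " adult"

# Row-by-row version, as in the cell above.
via_apply = demo[["Age", "Sex"]].apply(minor_adult_function, axis=1)

# Column-wise version: pick the suffix for every row at once,
# then join it to the sex label converted to plain strings.
suffix = pd.Series(np.where(demo["Age"] <= 16, " child", " adult"), index=demo.index)
via_where = demo["Sex"].astype(str) + suffix

print(via_where.tolist())
```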
In [22]:
# Compares the number of male and female children and adults within each class as a bar chart.
sns.countplot(x='Pclass', hue='Maturity', data=tdf);
In [23]:
tdf['Maturity'].value_counts()
Out[23]:
male adult      526
female adult    265
male child       51
female child     49
Name: Maturity, dtype: int64

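Conclusion: There were a total of 100 children (51 male, 49 female) on the Titanic, assuming a child is someone aged 16 or below. The 177 passengers whose missing ages were filled with 24 are all counted as adults, so the true number of children may be higher.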

In [24]:
tdf.hist( column='Age', bins=60,  figsize=(8,4))
Out[24]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x000001412348D898>]], dtype=object)
In [25]:
tdf['Age'].value_counts().head()
Out[25]:
24.0    207
22.0     27
18.0     26
19.0     25
28.0     25
Name: Age, dtype: int64

Conclusion: The histogram and the value counts show 24 as by far the most common age on the Titanic. However, this result is not reliable: we filled 177 null values with 24, so only 30 of the 207 passengers recorded as 24 had that age in the original data.
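
One way to keep filled-in values from distorting a chart like this is to record which ages were missing before filling them. The sketch below shows the idea on made-up data; the column name AgeWasMissing is hypothetical and is not part of the notebook above.

```python
import numpy as np
import pandas as pd

# Made-up ages for illustration only; not taken from train.csv.
demo = pd.DataFrame({"Age": [24.0, 22.0, np.nan, 22.0, np.nan, np.nan, 30.0]})

# Record which ages were missing BEFORE filling them.
demo["AgeWasMissing"] = demo["Age"].isnull()

# Fill the gaps with the mode, as in the notebook.
mode_age = demo["Age"].mode().iloc[0]
demo["Age"] = demo["Age"].fillna(mode_age)

# Counts over all rows are inflated by the filled-in values...
count_filled = demo["Age"].value_counts().loc[mode_age]

# ...while counts over the originally observed rows are not.
observed = demo.loc[~demo["AgeWasMissing"], "Age"]
count_observed = observed.value_counts().loc[mode_age]

print(mode_age, count_filled, count_observed)
```

Plotting only the observed ages (for example, observed.hist(bins=60)) would then show the age distribution without the artificial spike at the mode.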