Data Analysis of Suicides in India from the Year 2001 to 2012: a full report on why, how, when, and where.
More than seven million people committed suicide in India from 2001 to 2012, and the number increases every year.
In this project, we will analyze why people commit suicide: the reasons, their professional profile, and their marital or social status.
The dataset comes from the Kaggle website and contains 237,519 rows and 7 columns.
First, let me explain the columns in the dataset, because they were a bit confusing when I first tried to understand them. I took a couple of samples from the dataset to work them out.
- State: Contains the name of each State and Union Territory in India
- Year: Contains every year from 2001 to 2012
- Type Code: The category that the Type value belongs to. For example, if Type describes a cause of suicide, Type Code contains Causes
- Type: The value within that category: why the people died, their educational qualification, their social status, or their professional profile
- Gender: Male or Female
- Age Group: The age group of the people counted in the row
- Total: The number of people who committed suicide for the given Type Code and Type
To understand the data, we are going to use a few Python libraries.
- opendatasets: a tool for downloading a Kaggle dataset with a single call
- NumPy: used for numerical computing on the dataset
- Pandas: the most popular tool for analyzing tabular data; it is built on NumPy
- Matplotlib: the basic tool for visualizing the data, useful for basic plotting
- Seaborn: a higher-level tool for visualizing data in a few easy steps; it is built on Matplotlib
There are several ways to download a dataset, but in this post I will use opendatasets.
You have to provide the dataset URL to download it.
It will ask for your Kaggle username and your Kaggle key (API key). You can get both from your Kaggle account once you have created one.
It will automatically download the dataset and save it to your computer or, if you are working on Google Colab, Binder, or Repl.it, to your cloud workspace.
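As a rough sketch of the download step: the helper name `download_kaggle_dataset` and the placeholder URL below are my own illustrations, not part of the original notebook, so swap in the real dataset URL from Kaggle before calling it.

```python
def download_kaggle_dataset(dataset_url):
    """Download a Kaggle dataset into the current directory using opendatasets."""
    # Refuse to run while the placeholder URL is still in place.
    if "<" in dataset_url or ">" in dataset_url:
        raise ValueError("Replace the placeholder with the real Kaggle dataset URL")
    # Imported here so this file can be loaded before opendatasets is installed.
    import opendatasets as od
    # od.download asks for the Kaggle username and key when it needs them.
    od.download(dataset_url)


# Hypothetical placeholder: copy the real URL from the dataset page on Kaggle.
DATASET_URL = "https://www.kaggle.com/<owner>/<dataset-name>"

# Uncomment once DATASET_URL points at the real dataset:
# download_kaggle_dataset(DATASET_URL)
```

The call is left commented out so that nothing is downloaded until you have filled in the URL and have your Kaggle credentials ready.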
Data Preparation & Cleaning:
This is the most painful and the most important step in the analysis process. In this step, we will look at an overview of the dataset, check whether there are any incorrect values or NaN values, and adjust the dataset for our next steps.
Let’s start by importing the libraries.
We can see the dataset by displaying the data frame.
As I said before, this dataset contains 7 columns and 237,519 rows.
If you have any doubt about the size of your dataset, you can check it with the shape attribute, which reports the same number of rows and columns.
We can also get the info of the data frame with the info() method: the list of columns, the number of non-null values, the type of data in each column (Dtype), and the memory usage.
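Here is a small sketch of those inspection steps. The sample rows are made up, and the column names `Type_code` and `Age_group` are my assumption about how the CSV spells them; with the real file you would pass its path to `pd.read_csv` instead.

```python
import io

import pandas as pd

# Made-up rows in the same layout as the dataset (the real file has 237,519 rows).
SAMPLE_CSV = """State,Year,Type_code,Type,Gender,Age_group,Total
Kerala,2001,Causes,Family Problems,Male,15-29,10
Kerala,2001,Causes,Family Problems,Female,15-29,7
Punjab,2002,Education_Status,Primary,Male,0-100+,4
Total(All India),2001,Causes,Family Problems,Male,15-29,17
"""

suicides_df = pd.read_csv(io.StringIO(SAMPLE_CSV))

print(suicides_df.head())  # first few rows
print(suicides_df.shape)   # (number of rows, number of columns)
suicides_df.info()         # columns, non-null counts, dtypes, memory usage
```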
Now let’s check whether the State column has any unexpected values, because the column values can be incorrect.
Listing the distinct values in the column shows three entries that are not states: Total(All India), Total(States), and Total(Uts). These represent the totals for all of India, for all states, and for all Union Territories.
These rows can be useful, but when we sum the Total column they are counted a second time, so the column gives inflated values.
So removing them will be good for our analysis. In case we need them later, let’s create a separate data frame for these three values and then remove them from our original data frame.
To create the new data frame, we use a boolean expression as a filter.
Our next step is to remove the rows that contain Total(All India), Total(States), or Total(Uts). To delete them we use the drop method.
Inside the drop method, I pass the index of the filtered rows to specify the rows that I want to delete. With inplace=True, drop removes the rows from the data frame itself and returns None.
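A minimal sketch of the filter-then-drop step on a tiny made-up frame (the variable names are mine, and the numbers are invented):

```python
import pandas as pd

suicides_df = pd.DataFrame({
    "State": ["Kerala", "Punjab", "Total(All India)", "Total(States)", "Total(Uts)"],
    "Total": [10, 4, 14, 14, 0],
})

total_labels = ["Total(All India)", "Total(States)", "Total(Uts)"]

# Boolean expression: True for the aggregate rows, False for real states.
is_total_row = suicides_df["State"].isin(total_labels)

# Keep a copy of the aggregate rows in case they are needed later.
totals_df = suicides_df[is_total_row].copy()

# Drop those rows by index; with inplace=True the call returns None.
suicides_df.drop(suicides_df[is_total_row].index, inplace=True)

print(suicides_df["Total"].sum())  # the sum no longer double counts
```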
If we want a little more understanding of the dataset, we can look at its summary statistics.
Working with Incorrect Data
This is also one of the important parts of data preparation and cleaning: if we use incorrect data, our results will be wrong, so fixing incorrect data matters.
In this dataset, I found two errors in the Type column. Let me show you.
value_counts() returns the number of occurrences of each unique value in the Type column. By default, it sorts the counts in descending order, which means the first element is the most frequently occurring one.
In the above output, you can see that Bankruptcy or Sudden change in Economic has 3850 values, while the nearly identical Bankruptcy or Sudden change in Economic Status has 350 values. The only difference is the word Status.
In the same way, Not having Children(Barrenness/Impotency has 3850 values and Not having Children (Barrenness/Impotency has 350 values. The only difference is a single whitespace between Children and (Barrenness.
We can change that with the replace method:
dataframe.replace(to_replace=existing value, value=new value, inplace=True or False)
You can check whether the value was replaced by calling the value_counts() method as before. You can check every other column for mistakes in the same way. I checked every column to make sure, and looking at some sample values gives extra confidence.
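The fix can be sketched like this. Which spelling to keep is a judgment call; here I assume the fuller labels are the ones to keep, which may differ from the choice made in the notebook.

```python
import pandas as pd

suicides_df = pd.DataFrame({"Type": [
    "Bankruptcy or Sudden change in Economic",
    "Bankruptcy or Sudden change in Economic Status",
    "Not having Children(Barrenness/Impotency",
    "Not having Children (Barrenness/Impotency",
    "Family Problems",
]})

# Map each inconsistent label to the spelling we want to keep (an assumption).
corrections = {
    "Bankruptcy or Sudden change in Economic":
        "Bankruptcy or Sudden change in Economic Status",
    "Not having Children(Barrenness/Impotency":
        "Not having Children (Barrenness/Impotency",
}

# replace with a dict swaps whole cell values that match a key exactly.
suicides_df["Type"] = suicides_df["Type"].replace(corrections)

type_counts = suicides_df["Type"].value_counts()
print(type_counts)
```

After the replacement, each pair of near-duplicate labels is counted as a single value.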
Exploratory Analysis and Visualization.
This is an important part of data analysis, because we have to present the data visually to the client. It is even more important than what we have done so far, because we are going to dive deep into the data using visualization.
Let’s start by importing the modules.
I found this explanation of %matplotlib inline on StackOverflow:
With this backend, the output of plotting commands is displayed inline within frontends like the Jupyter notebook, directly below the code cell that produced it. The resulting plots will then also be stored in the notebook document.
In a nutshell, if you use %matplotlib inline, your graphs are drawn inside your Jupyter Notebook; if you do not use this command, they may be shown in a pop-up window instead.
sns.set_style('darkgrid'): sets the plot theme to a grid on a slightly dark background.
matplotlib.rcParams['font.size'] = 14 : sets the default font size to 14.
matplotlib.rcParams['figure.figsize'] = (9, 5) : sets the default figure size to 9 by 5 inches, but I usually use (12, 8). You can change your figure size whenever you want.
matplotlib.rcParams['figure.facecolor'] = '#00000000' : sets the figure background to fully transparent (the last two hex digits are the alpha channel, here 00).
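Put together, the setup might look like the sketch below. The `matplotlib.use("Agg")` line is my addition so the snippet also runs as a plain script; in a Jupyter notebook you would use the `%matplotlib inline` magic instead.

```python
import matplotlib

matplotlib.use("Agg")  # for scripts only; in Jupyter use `%matplotlib inline`

import matplotlib.pyplot as plt
import seaborn as sns

sns.set_style("darkgrid")                              # grid on a dark background
matplotlib.rcParams["font.size"] = 14                  # default font size
matplotlib.rcParams["figure.figsize"] = (9, 5)         # width, height in inches
matplotlib.rcParams["figure.facecolor"] = "#00000000"  # transparent background
```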
Number of Suicides in Every State and Every Union Territory
Here, we have suicide counts for every state and every union territory, so let’s calculate the total number of suicides committed in each of them.
In the above cell, we use the groupby method. It splits the rows into groups that share a value and lets us run a computation on each group. For more, check out the pandas documentation. groupby returns a GroupBy object that holds the information about the groups.
Then we pass the aggregated result to pd.DataFrame(groupby_object_variable_name), which creates a data frame from it.
reset_index() turns the group labels back into a regular column and gives the data frame a numeric index. Then we sort the values by the Total column with ascending set to False, so the highest totals come first.
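A small sketch of that chain on made-up numbers (the variable names are illustrative, not taken from the notebook):

```python
import pandas as pd

suicides_df = pd.DataFrame({
    "State": ["Kerala", "Kerala", "Punjab", "Punjab", "Goa"],
    "Total": [10, 7, 4, 1, 2],
})

grouped = suicides_df.groupby("State")  # GroupBy object, nothing computed yet
state_totals = grouped["Total"].sum()   # Series with one total per state

state_df = (
    pd.DataFrame(state_totals)              # Series -> DataFrame
    .reset_index()                          # State becomes a regular column
    .sort_values("Total", ascending=False)  # highest totals first
)
print(state_df)
```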
From the output table we can see that Maharashtra, West Bengal, Tamil Nadu, Andhra Pradesh, Karnataka, and Kerala have the highest totals, but this is easier to see in a chart.
figsize is the width and height of the chart figure.
plt.title(title for your chart) sets the title at the top of the chart.
plt.xticks(rotation=your angle) rotates the labels on the x-axis. At 0 degrees the state names overlap each other, which looks ugly and makes them hard to read, so rotating them by 70 degrees works well.
sns.barplot(x='Column1', y='Column2', data=dataframe): in this line, we use barplot from the Seaborn library. x is the column header to plot on the x-axis, y is the column header to plot on the y-axis, and data is the data frame that holds both columns.
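For reference, a bar chart along those lines could be sketched as follows, using three made-up state totals. The `Agg` backend line is only there so the snippet runs outside a notebook.

```python
import matplotlib

matplotlib.use("Agg")  # for scripts only; not needed with `%matplotlib inline`

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up totals, already sorted from highest to lowest.
state_df = pd.DataFrame({"State": ["Kerala", "Punjab", "Goa"],
                         "Total": [17, 5, 2]})

plt.figure(figsize=(12, 8))
ax = sns.barplot(x="State", y="Total", data=state_df)
plt.title("Total suicides by state (made-up sample)")
plt.xticks(rotation=70)  # tilt the state names so they do not overlap
```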
Let’s look at which gender commits suicide the most across India. As before, we use groupby to aggregate the totals and pass the result to a pandas DataFrame to create a new data frame. reset_index gives it a numeric index, so I won’t explain that every time.
The table above shows that more men commit suicide than women. I think this may be because of work pressure, problems in the family, or debt.
For a small number of rows like this, a pie chart gives a clear look.
The pie chart gives a clearer view than the plain table; that’s the power of visualization. It shows that the male share is about 30 percentage points higher than the female share.
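A sketch of the gender breakdown and pie chart. The numbers are invented purely to show the mechanics and say nothing about the real shares; the `Share` column is my own addition.

```python
import matplotlib

matplotlib.use("Agg")  # for scripts only; not needed with `%matplotlib inline`

import matplotlib.pyplot as plt
import pandas as pd

suicides_df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male", "Female"],
    "Total": [30, 25, 30, 15],
})

gender_df = pd.DataFrame(suicides_df.groupby("Gender")["Total"].sum()).reset_index()

# Share of each gender in percent of the overall total.
gender_df["Share"] = 100 * gender_df["Total"] / gender_df["Total"].sum()

plt.figure(figsize=(6, 6))
plt.pie(gender_df["Total"], labels=gender_df["Gender"], autopct="%1.1f%%")
plt.title("Suicides by gender (made-up sample)")
print(gender_df)
```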
Total suicides by state and year
Let’s view the suicide totals for every year and every state. For this kind of use case, a heatmap gives a clear view. Before using a heatmap, we have to put our data frame in matrix form.
pivot(index=None, columns=None, values=None): Return reshaped DataFrame organized by given column values. For more, check the pandas documentation.
We now have the data frame in matrix form, but unless it is sorted from high to low (or low to high) the heatmap will not be clear, so we have to sort it. There is no single column to sort by yet, so let’s create a new column ‘sum’ that stores the total for each State and sort the data frame by it.
Once the data frame is sorted, we delete the ‘sum’ column because we don’t need it anymore. Our data frame will look like the one below.
Now our data frame has the suicide count for every year, one row per state, and it is ready for the heatmap.
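The reshaping can be sketched like this on a few made-up rows. Aggregating by State and Year first is my assumption about how the notebook avoids duplicate State/Year pairs before pivoting.

```python
import pandas as pd

suicides_df = pd.DataFrame({
    "State": ["Goa", "Goa", "Kerala", "Kerala", "Kerala", "Punjab", "Punjab"],
    "Year": [2001, 2002, 2001, 2001, 2002, 2001, 2002],
    "Total": [1, 2, 10, 5, 20, 3, 4],
})

# One row per (State, Year) pair, so pivot sees no duplicates.
state_year_df = suicides_df.groupby(["State", "Year"])["Total"].sum().reset_index()

# Matrix form: one row per state, one column per year.
matrix_df = state_year_df.pivot(index="State", columns="Year", values="Total")

# Temporary helper column used only for sorting, then removed again.
matrix_df["sum"] = matrix_df.sum(axis=1)
matrix_df = matrix_df.sort_values("sum", ascending=False).drop(columns="sum")
print(matrix_df)
```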
Creating a heatmap is very easy. First pass the data frame. If annot is True, the value of every cell is displayed in the heatmap. fmt is a string format code, linewidths sets the width of the lines between cells, and cmap is the color map for the heatmap; there are a lot of color maps for you to use. For more details, you can check the Seaborn docs.
The darkest color shows the highest number of suicides. Because we sorted the values, we can clearly see which states are the highest and the lowest for every year.
From the dark colors in the heatmap, we can see that the six states with the most suicide deaths are Maharashtra, West Bengal, Tamil Nadu, Andhra Pradesh, Karnataka, and Kerala.
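A heatmap call with those parameters might look like the sketch below. The matrix is made up, and the "Reds" color map is just one option that I picked for illustration.

```python
import matplotlib

matplotlib.use("Agg")  # for scripts only; not needed with `%matplotlib inline`

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up matrix: one row per state, one column per year, sorted by row total.
matrix_df = pd.DataFrame(
    {2001: [15, 3, 1], 2002: [20, 4, 2]},
    index=["Kerala", "Punjab", "Goa"],
)

plt.figure(figsize=(12, 8))
ax = sns.heatmap(
    matrix_df,
    annot=True,      # write the value inside every cell
    fmt="d",         # format the values as integers
    linewidths=0.5,  # thin lines between the cells
    cmap="Reds",     # darker red means a higher count
)
plt.title("Suicides by state and year (made-up sample)")
```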
What are the causes
There must be a cause behind every person who ends their life. Knowing these causes will be very useful for our analysis. Let’s see how many reasons are present in the dataset and what they are.
So there are 26 reasons. For now we ignore Other Causes because we don’t know what those reasons are.
Just like before, we filter the data frame to the rows whose Type Code is ‘Causes’. Then we apply the groupby function to the filtered data frame and pass the result to a pandas DataFrame.
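As a sketch of that filter and aggregation, with made-up rows and the column name `Type_code` assumed:

```python
import pandas as pd

suicides_df = pd.DataFrame({
    "Type_code": ["Causes", "Causes", "Causes", "Education_Status", "Causes"],
    "Type": ["Family Problems", "Other Causes", "Love Affairs", "Primary",
             "Family Problems"],
    "Total": [10, 50, 3, 4, 6],
})

# Keep only the rows that describe a cause.
causes_df = suicides_df[suicides_df["Type_code"] == "Causes"]
print(causes_df["Type"].nunique())  # how many distinct causes there are

# Leave out the catch-all category before ranking the causes.
known_causes_df = causes_df[causes_df["Type"] != "Other Causes"]

cause_totals = (
    pd.DataFrame(known_causes_df.groupby("Type")["Total"].sum())
    .reset_index()
    .sort_values("Total", ascending=False)
)
print(cause_totals)
```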
As shown in the barplot, the most common cause of suicide is family problems. The total number of suicides in India from 2001 to 2012 attributed to Family Problems is 341,952. That is far from a small number, and it suggests that many families in India face serious problems.
Let’s check the social status of the people who committed suicide in India from 2001 to 2012.
As we see in the barplot, married people commit suicide more than any other group, which starts to connect with the family problems we saw above. The total for married people is about two times higher than the total for never-married people.
Let’s see the professional profile of the people who committed suicide in India from 2001 to 2012.
Others comes first, probably because the occupation was not known or because there are too many job profiles, so the less common ones are grouped into one category.
There is a connection between the plots: the largest named group here is House Wife. They are married people, and they may have problems in their family.
The next highest is Farming/Agriculture Activity. Now this makes more sense than before.
This section shows how they died: the means used for suicide in India from 2001 to 2012. If we know this, we are closer to preventing suicide.
The most used means are By Hanging, By Consuming Insecticides, By Consuming Other Poison, By Fire/self-immolation, and By Drowning. We can clearly see that these have higher numbers than the others. If these could be prevented, suicide cases could decrease by half.
We also want to know whether educated or uneducated people commit suicide more. If we know this, we also know whether education helps prevent suicide.
People with Primary, Middle, No Education, and Matriculation/Secondary education commit suicide more than Graduate, Diploma, and PG Degree holders.
This suggests that graduates, diploma holders, and postgraduates learn that taking one's own life is of no use and that the best way to live is to face the problem.
Let’s see which age group commits suicide the most compared to the others, and look at the relationships to understand why.
As we see, most are from the age group of 15 to 29.
- This may be because of school and college. A 15-year-old has just finished the 10th class examinations, and it all starts with the 10th class result. Teachers and parents always say, “This result is your life. Study well; if you fail or get low marks, your life won’t be good.” We have to make students understand that a single sheet of paper doesn’t decide your life. Knowledge is everything, not the marksheet.
- Another reason for suicide is love affairs. It’s common to fall in love, but it is not the end of the world.
The second largest group is 30 to 44.
- This is likely because of family problems, marriage problems, and money problems. They need motivation and love from the right people.
Asking and Answering Questions
We have analyzed and visualized the data in multiple charts and plots, and now we have some understanding of the relationships between the columns. So let’s ask some common questions and answer them with detailed visualization and explanation.
Q1. We saw that married people commit suicide more than others, but in which year was this number the highest?
As the table and barplot both show, the number of suicides keeps increasing every year from start to end. I wonder whether it is still increasing now.
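One way to answer Q1 is sketched below on made-up rows. The labels `Social_Status` and `Married` are my assumptions about how the dataset spells them.

```python
import pandas as pd

suicides_df = pd.DataFrame({
    "Year": [2001, 2001, 2002, 2002, 2003, 2003],
    "Type_code": ["Social_Status"] * 5 + ["Causes"],
    "Type": ["Married", "Never Married", "Married", "Married", "Married",
             "Family Problems"],
    "Total": [10, 4, 6, 7, 20, 99],
})

# Rows that describe married people only.
is_married = (suicides_df["Type_code"] == "Social_Status") & (suicides_df["Type"] == "Married")
married_df = suicides_df[is_married]

# Total per year, then the year with the highest total.
married_by_year = married_df.groupby("Year")["Total"].sum()
peak_year = married_by_year.idxmax()
print(married_by_year)
print(peak_year)
```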
Q2. How many people committed suicide because of physical abuse? Compare male and female for every state from 2001 to 2012.
- Now we can see clearly that Madhya Pradesh has a very high number of suicides because of physical abuse.
- Madhya Pradesh, Maharashtra, Chhattisgarh, Gujarat, West Bengal, and Andhra Pradesh have high numbers for female victims of physical abuse.
- Uttar Pradesh and Tamil Nadu have nearly equal numbers for male and female victims.
- Punjab, Assam, and Karnataka have more male victims than female victims.
- The government of each state should understand why this is happening and prevent this kind of monstrous behavior.
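The state-by-gender comparison behind these points can be sketched as follows. The rows are made up, and the label "Physical Abuse" is a simplified stand-in for whatever the dataset actually calls this cause.

```python
import pandas as pd

suicides_df = pd.DataFrame({
    "State": ["Punjab", "Punjab", "Kerala", "Kerala", "Kerala", "Goa"],
    "Type": ["Physical Abuse"] * 5 + ["Family Problems"],
    "Gender": ["Male", "Female", "Male", "Female", "Female", "Male"],
    "Total": [5, 2, 1, 4, 3, 9],
})

abuse_df = suicides_df[suicides_df["Type"] == "Physical Abuse"]

# One row per state, one column per gender; missing pairs become 0.
by_state_gender = (
    abuse_df.groupby(["State", "Gender"])["Total"]
    .sum()
    .unstack("Gender", fill_value=0)
)

# Flag the states where female victims outnumber male victims.
by_state_gender["More female than male"] = (
    by_state_gender["Female"] > by_state_gender["Male"]
)
print(by_state_gender)
```

The same table can be passed to a grouped barplot; the comparison column just makes the pattern readable without a chart.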
Q3. Which age group commits suicide most because of physical abuse? Show it by gender, Male and Female.
- In the barplot above, females aged 15 to 29 are the most frequent victims.
- I consider this the biggest crime one can commit. It hurts the victim not only physically but also mentally.
- So please help protect women and men from this kind of abuse.
- I have heard that the government takes action in these situations, but that’s not enough; the punishment must be harsher than any other punishment in the world.
Q4. How many children committed suicide because of failure in examinations? Show it for each state and each gender.
- As we see in the huge barplot above, Tamil Nadu and many other states are affected.
- Teachers and parents must stop repeating that the examination is your life and that you have to pass it. Instead of these useless words, they should repeat ‘Knowledge is everything’ and help students discover their interests.
Q5. We know people commit suicide mostly because of Family Problems, but which gender are the victims? Show it for every year.
Now we have a clear view: men are the victims, and they commit suicide because of family problems more than women in every year.
- This may happen because of responsibilities.
- They may be unable to support their family financially, and that may drive them to it.
Inferences and Conclusion
The main goal of this analysis project was to analyze the suicide cases: why, who, how, when, and where these suicides happen. We now have most of the answers.
Let’s go over some of the important points that we learned from our dataset.
- More than seven million people committed suicide from 2001 to 2012.
- The total number of suicide victims increased in every state and in every year. There was no decrease in the numbers.
- The pie chart showed that males make up the larger share of suicides across India.
- We looked at the causes of suicide and found that Family Problems was the most common one, taking more than three hundred forty thousand lives. We also saw that most of the victims were married and male.
- We also saw how important education is in avoiding suicide: the totals for graduates, diploma holders, and postgraduate degree holders are much lower than for people with primary or no education. So getting an education is not a waste.
- We also looked at the suicide share of every age group and compared it with the Failure in Examination cause, which suggests that many people aged 15 to 29 commit suicide because of failure in school and college examinations. Our feedback to teachers and parents: ‘Don’t teach your child to follow the marks, teach them to follow the knowledge’.
- We also saw how many people commit suicide because of physical abuse, by gender, state, and age group, and found that female victims aged 15 to 44 are the largest group. Our feedback is that the punishment should be harsher than any other, because this abuse affects not only the person’s body and health but also their mental health.
References and Future Work
- For Data Analysis with Python Course — Jovian.ai
- For Numpy — Numpy Documentation
- For Pandas — Pandas Documentation
- Matplotlib — Matplotlib Documentation
- Seaborn — Seaborn Documentation
- DataSet — Kaggle
- Compare 2001 to 2012 with 2013 to 2021: what has changed, and has the suicide total decreased or is it still increasing?
- Explore every field and cause to understand why this happens and how we can prevent it in the future
- Work on women’s safety with regard to physical abuse and how to prevent it in the future
If you want to check my Project Jupyter Notebook, check in this Jovian.ml project notebook.
Analyzing Suicides in India from 2001 to 2012. This is my first project in data analysis, and I hope this post helps you.
If you have any thoughts or anything you want to say, you can leave a response to this post. Your responses are always welcome.