Murder Accountability Project - Part 1 Data Prep and Map
The first job I ever had out of college was as an autopsy tech assistant at the Fulton County Medical Examiner's Office. I first helped out as an intern at DeKalb County on death scene investigations, and later, after college, at Fulton County, helping the techs perform autopsies and conduct organ procurement. It was one of the best professional experiences of my life, giving me the opportunity to understand the relationship between law enforcement and medicine, specifically forensic pathology. So this dataset was genuinely interesting to me, and while the post below is pretty basic data science, I'll keep working on it throughout the year to see if I can generate some new insights.
The Data
This data is from Kaggle, available here: https://www.kaggle.com/murderaccountability/homicide-reports. A little overview about it:
The Murder Accountability Project is the most complete database of homicides in the United States currently available. This dataset includes murders from the FBI's Supplementary Homicide Report from 1976 to the present and Freedom of Information Act data on more than 22,000 homicides that were not reported to the Justice Department. This dataset includes the age, race, sex, ethnicity of victims and perpetrators, in addition to the relationship between the victim and perpetrator and weapon used.
Part 1 - Data Prep
import pandas as pd

data = pd.read_csv(".../database.csv")
print("There are", len(data), "records.")
This dataset has 638,454 records, consisting of 18 object variables and 6 integer variables, for a total of 24 variables.
print(data.dtypes.value_counts())
print(data.dtypes)
object 18
int64 6
dtype: int64
Record ID int64
Agency Code object
Agency Name object
Agency Type object
City object
State object
Year int64
Month object
Incident int64
Crime Type object
Crime Solved object
Victim Sex object
Victim Age int64
Victim Race object
Victim Ethnicity object
Perpetrator Sex object
Perpetrator Age object
Perpetrator Race object
Perpetrator Ethnicity object
Relationship object
Weapon object
Victim Count int64
Perpetrator Count int64
Record Source object
dtype: object
print(data.head())
How many unique values are there for each of those 18 object variables? Let's get the number of unique categories in each object variable using the .nunique() method.
print("Number unique Agency Name vars", data['Agency Name'].nunique())
Number unique Agency Name vars 9216
Number unique Agency Code vars 12003
Number unique Agency Type vars 7
Number unique City 1782
Number unique State 51
Number unique Month 12
Number unique Crime Type 2
Number unique Crime Solved 2
Number unique Victim Sex 3
Number unique Victim Ethnicity 3
Number unique Perpetrator Sex 3
Number unique Perpetrator Age 191
Number unique Perpetrator Race 5
Number unique Relationship 28
Number unique Weapon 16
Number unique Record Source 2
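Rather than repeating the print statement for each column, a short loop over the object columns produces the same counts (a minimal sketch; the label wording differs slightly from the output above):
# Print the number of unique values for every object (string) column
for col in data.select_dtypes(include=['object']).columns:
    print("Number unique", col, "vars", data[col].nunique())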
Let's look at Record ID 3. Here, the victim was a 30-year-old Native American/Alaska Native female killed in March 1980, and the perpetrator's demographics are completely unknown.
print(data.loc[2])
Record ID 3
Agency Code AK00101
Agency Name Anchorage
Agency Type Municipal Police
City Anchorage
State Alaska
Year 1980
Month March
Incident 2
Crime Type Murder or Manslaughter
Crime Solved No
Victim Sex Female
Victim Age 30
Victim Race Native American/Alaska Native
Victim Ethnicity Unknown
Perpetrator Sex Unknown
Perpetrator Age 0
Perpetrator Race Unknown
Perpetrator Ethnicity Unknown
Relationship Unknown
Weapon Unknown
Victim Count 0
Perpetrator Count 0
Record Source FBI
Name: 2, dtype: object
In a different case, Record ID 5000, the victim is a 21-year-old White male, killed in October 1980 in Fairfield, CT (Murder or Manslaughter) by a 26-year-old White male stranger who used a handgun.
print(data.loc[4999])
Record ID 5000
Agency Code CT00158
Agency Name Westport
Agency Type Municipal Police
City Fairfield
State Connecticut
Year 1980
Month October
Incident 1
Crime Type Murder or Manslaughter
Crime Solved Yes
Victim Sex Male
Victim Age 21
Victim Race White
Victim Ethnicity Unknown
Perpetrator Sex Male
Perpetrator Age 26
Perpetrator Race White
Perpetrator Ethnicity Unknown
Relationship Stranger
Weapon Handgun
Victim Count 0
Perpetrator Count 0
Record Source FBI
Name: 4999, dtype: object
Next, combine Year and Month into a single datetime. Convert the following to categorical: 'Agency Type', 'State', 'Month', 'Crime Type', 'Crime Solved', 'Victim Sex', 'Victim Race', 'Victim Ethnicity', 'Perpetrator Sex', 'Perpetrator Race', 'Perpetrator Ethnicity', 'Relationship', 'Weapon', and 'Record Source'. Finally, convert 'Perpetrator Age' to integer.
data2 = data.copy()
data2['Year'] = data2['Year'].astype(str)
data2['Month'] = data2['Month'].astype(str)
print(data2['Year'].unique())
print(data2['Month'].unique())
['1980' '1981' '1982' '1983' '1984' '1985' '1986' '1987' '1988' '1989'
'1990' '1991' '1992' '1993' '1994' '1995' '1996' '1997' '1998' '1999'
'2000' '2001' '2002' '2003' '2004' '2005' '2006' '2007' '2008' '2009'
'2010' '2011' '2012' '2013' '2014']
['January' 'March' 'April' 'May' 'June' 'July' 'August' 'December'
'November' 'February' 'October' 'September']
Convert the date into a single month-year
data2['Date'] = data2['Month'] + " " + data2['Year']
data2['Date_'] = pd.to_datetime(data2['Date'], format='%B %Y')
print(data2.dtypes)
Record ID int64
Agency Code object
Agency Name object
Agency Type object
City object
State object
Year object
Month object
Incident int64
Crime Type object
Crime Solved object
Victim Sex object
Victim Age int64
Victim Race object
Victim Ethnicity object
Perpetrator Sex object
Perpetrator Age object
Perpetrator Race object
Perpetrator Ethnicity object
Relationship object
Weapon object
Victim Count int64
Perpetrator Count int64
Record Source object
Date object
Date_ datetime64[ns]
dtype: object
Make sure it converted to month-year datetimes (the day defaults to the first of the month).
data2['Date_'].loc[:10]
0 1980-01-01
1 1980-03-01
2 1980-03-01
3 1980-04-01
4 1980-04-01
5 1980-05-01
6 1980-05-01
7 1980-06-01
8 1980-06-01
9 1980-06-01
10 1980-07-01
Name: Date_, dtype: datetime64[ns]
Convert variables into categories
def ascategory(cols, df):
    # Convert each listed column to the pandas 'category' dtype, in place
    for col in cols:
        df[col] = df[col].astype('category')
all_cols = ['Agency Type',
'State',
'Month',
'Crime Type',
'Crime Solved',
'Victim Sex', 'Victim Race',
'Victim Ethnicity',
'Perpetrator Sex',
'Perpetrator Race',
'Perpetrator Ethnicity',
'Relationship',
'Weapon',
'Record Source'
]
ascategory(all_cols, data2)
print(data2.dtypes)
Record ID int64
Agency Code object
Agency Name object
Agency Type category
City object
State category
Year object
Month category
Incident int64
Crime Type category
Crime Solved category
Victim Sex category
Victim Age int64
Victim Race category
Victim Ethnicity category
Perpetrator Sex category
Perpetrator Age object
Perpetrator Race category
Perpetrator Ethnicity category
Relationship category
Weapon category
Victim Count int64
Perpetrator Count int64
Record Source category
Date object
Date_ datetime64[ns]
dtype: object
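One practical benefit of the category conversion is memory. A rough, hedged check is to compare deep memory usage of the original and converted frames; note that data2 also carries the extra Date columns, so this is only an approximate comparison:
# Total in-memory size, in MB, before and after the dtype conversions
print("object dtypes:  ", data.memory_usage(deep=True).sum() / 1e6, "MB")
print("category dtypes:", data2.memory_usage(deep=True).sum() / 1e6, "MB")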
# 'Perpetrator Age' is stored as strings; blank entries (a single space) are recoded as '0'
data2['Perpetrator Age'] = data2['Perpetrator Age'].astype(str)
data2['Perpetrator Age'] = data2['Perpetrator Age'].str.replace(' ', '0')
data2['Perpetrator Age'].unique()
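The plan above calls for 'Perpetrator Age' to end up as an integer, but that final step isn't shown; a hedged completion, coercing any remaining non-numeric strings to NaN and filling them with 0, could look like this:
# Convert the cleaned strings to integers; errors='coerce' guards against stray non-numeric values
data2['Perpetrator Age'] = pd.to_numeric(data2['Perpetrator Age'], errors='coerce').fillna(0).astype(int)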
Histogram of all integer dtypes - Record ID isn't relevant.
# Plot histograms of the integer columns, dropping Record ID since it's just an identifier
data2.select_dtypes(include=['int64']).drop(columns=['Record ID']).hist(figsize=(12, 12))
import seaborn as sns

s1 = data2[['Record ID', 'Relationship', 'Weapon', 'Victim Age', 'Perpetrator Age', 'Crime Type', 'Crime Solved']].copy()
s1.head()
# Point plot of mean Victim Age per Relationship category, split by Crime Type
g = sns.factorplot(x='Relationship', y='Victim Age', hue='Crime Type', data=s1, kind="point", size=8, aspect=2)
g.set_xticklabels(rotation=30)
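Note that newer seaborn releases renamed factorplot; if the call above errors on your seaborn version (an assumption about your environment), the equivalent is:
# seaborn >= 0.9: factorplot became catplot, and the size argument became height
g = sns.catplot(x='Relationship', y='Victim Age', hue='Crime Type', data=s1, kind="point", height=8, aspect=2)
g.set_xticklabels(rotation=30)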
This shows that older people tend to be the victims of murder by family (In-Law, Mother, Family), whereas younger victims also tend to be killed by people close to them (Boyfriend/Girlfriend, Common-law Wife, Son, Stepdaughter, Stepson).
Part 2 - Mapping Murders in States
Let's display on a map the percentage that each Crime Type makes up of all crimes for each state, building the map with the Python package folium.
import folium
Prepare two datasets: read in the FIPS file and merge it with data2. Get the FIPS data from: https://www.census.gov/geo/reference/ansi_statetables.html
fips = pd.read_csv("/Users/catherineordun/Documents/data/fipscodes.csv")
fips.rename(index=str, columns={'Name':'State'}, inplace=True)
fips.head()
Now, merge with data2 to get the state abbreviations
map_data = pd.merge(data2, fips, on="State", how='left')
Get the total records for each state.
map_counts = pd.pivot_table(map_data,index=["USPS_State"],values=["Record ID"],aggfunc=[len])
# reshape: flatten the pivot table into a two-column frame of state totals
mc1 = map_counts.reset_index()
mc1.columns = mc1.columns.droplevel()
mc1.columns = ['State', 'Total']
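The next cell references map_crime, whose construction isn't shown; a plausible definition, mirroring map_counts but also grouping by Crime Type (inferred from the column names assigned below), would be:
# Assumed definition (not in the original): count records per state and Crime Type
map_crime = pd.pivot_table(map_data, index=["USPS_State", "Crime Type"], values=["Record ID"], aggfunc=[len])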
mc2 = map_crime.reset_index()
mc2.columns = mc2.columns.droplevel()
mc2.columns = ['State', 'Crime Type', 'Count']
mc2['crime'] = pd.factorize(mc2['Crime Type'])[0]
mc2.head()
crimecount = pd.merge(mc2, mc1, on='State')
crimecount['perc'] = crimecount['Count'] / crimecount['Total']
How does 'Murder or Manslaughter' compare to 'Manslaughter by Negligence' for each state, as a percentage of all records during this time frame?
sns.set(style="ticks")
g = sns.FacetGrid(crimecount, col="State", col_wrap=5, size=3.5)
g = g.map(sns.barplot, "crime", "perc")
You need to download the us-states.json file from here: https://github.com/python-visualization/folium/tree/master/examples/data
state_geo = r'/Users/catherineordun/Documents/data/us-states.json'
((crimecount.loc[(crimecount['crime']==1)]).sort_values(by='perc', ascending=False)).head(10)
Let Folium determine the scale
map_2 = folium.Map(location=[48, -102], zoom_start=3)
map_2.choropleth(geo_path=state_geo, data=(crimecount.loc[(crimecount['crime']==1)]),
columns=['State', 'perc'],
key_on='feature.id',
fill_color='YlGn', fill_opacity=0.7, line_opacity=0.2,
legend_name='perc')
States shaded by the percentage of their total records that are murders.
display(map_2)
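The map_2.choropleth(geo_path=...) call above uses an older folium API; on newer folium versions (an assumption about the installed release), the same map would be built with the folium.Choropleth class:
# Equivalent for newer folium: choropleth is its own class and geo_path became geo_data
folium.Choropleth(
    geo_data=state_geo,
    data=crimecount.loc[crimecount['crime'] == 1],
    columns=['State', 'perc'],
    key_on='feature.id',
    fill_color='YlGn', fill_opacity=0.7, line_opacity=0.2,
    legend_name='perc'
).add_to(map_2)
display(map_2)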