Exploring crime in Philadelphia

This is a large and intersting dataset and has data points stretching back over 10 years. Several explorations have pointed out that crime seems to be seasonal and I wanted to explore this with a time series. Assuming that seasonal trends might repeat themselves, I am exploring this using the forecast package and using linear regression to predict trends.

suppressPackageStartupMessages({
  library(data.table)
  library(forecast)
  library(knitr)
})

Data size and structure.

mydt <- fread("~/data/crime.csv", showProgress = FALSE)
mydata <- as.data.frame(mydt)
dim(mydata)
## [1] 2174774      14
str(mydata)
## 'data.frame':	2174774 obs. of  14 variables:
##  $ Dc_Dist           : int  18 14 25 35 9 17 23 77 35 23 ...
##  $ Psa               : chr  "3" "1" "J" "D" ...
##  $ Dispatch_Date_Time: POSIXct, format: "2009-10-02 14:24:00" "2009-05-10 00:55:00" ...
##  $ Dispatch_Date     : IDate, format: "2009-10-02" "2009-05-10" ...
##  $ Dispatch_Time     : chr  "14:24:00" "00:55:00" "15:40:00" "01:09:00" ...
##  $ Hour              : int  14 0 15 1 0 12 14 18 1 20 ...
##  $ Dc_Key            :integer64 200918067518 200914033994 200925083199 200935061008 200909030511 201517017705 200923006310 200977001770 ... 
##  $ Location_Block    : chr  "S 38TH ST  / MARKETUT ST" "8500 BLOCK MITCH" "6TH CAMBRIA" "5500 BLOCK N 5TH ST" ...
##  $ UCR_General       : int  800 2600 800 1500 2600 600 800 500 2600 2600 ...
##  $ Text_General_Code : chr  "Other Assaults" "All Other Offenses" "Other Assaults" "Weapon Violations" ...
##  $ Police_Districts  : int  NA NA NA 20 8 13 16 NA NA NA ...
##  $ Month             : chr  "2009-10" "2009-05" "2009-08" "2009-07" ...
##  $ Lon               : num  NA NA NA -75.1 -75.2 ...
##  $ Lat               : num  NA NA NA 40 40 ...

Extract the month as a new column

mydata$Dispatch_Date_Time <- as.POSIXct(mydata$Dispatch_Date_Time)
mydata$Month <- cut(mydata$Dispatch_Date_Time, breaks= "month")

Crimes by month

We can clearly see a downward trend in overall crime rates and also the fact that there seem to be seasonal peaks and declines.

bymo <- mydt[order(Month), .N, by=Month]
dts <- ts(bymo$N, start = c(2006,1), frequency = 12)
dts_decomp <- stl(dts, s.window = "period", robust = TRUE)
plot(dts,ylab="Total Crimes", main = "Monthly crimes with trend")
lines(dts_decomp$time.series[,2], col="tomato")

Seasonal component extracted from the timeseries.

How seasonal is the data?

This autocorrelation shows a very high correalation every 12 months.

Acf(dts, main = "ACF of crime")

Forecast with a linear model

The red line shows the model’s prediction against the actual numbers in black. The model seems quite close.

f_crime <- tslm(dts ~ trend + season, lambda = 0.5)
ff_crime <- forecast(f_crime, h = 12)
plot(ff_crime)
lines(fitted(ff_crime), col = "red")

Residuals from model

This shows the residuals - these are quite low.

  res <- residuals(ff_crime)
  plot(res, ylab="Residuals",xlab="Year", main = "Residuals") 
  summary(res)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -19.4127  -3.6751  -0.6624   0.0000   3.3571  17.2318

Predictions

If such a thing were possible - here are the predicted overall crime numbers.

ff_crime
##          Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
## Nov 2016       13236.78 12244.74 14267.46 11728.32 14836.48
## Dec 2016       12642.30 11673.23 13650.01 11169.13 14206.70
## Jan 2017       12455.98 11496.12 13454.31 10996.91 14005.90
## Feb 2017       11074.65 10170.69 12017.10  9701.46 12538.71
## Mar 2017       13602.72 12598.79 14645.12 12075.92 15220.37
## Apr 2017       14174.98 13149.75 15238.68 12615.45 15825.36
## May 2017       14987.78 13933.02 16081.02 13382.88 16683.54
## Jun 2017       14890.58 13839.31 15980.32 13291.04 16580.98
## Jul 2017       15276.38 14211.33 16379.91 13655.67 16987.95
## Aug 2017       15501.24 14428.24 16612.71 13868.30 17225.03
## Sep 2017       14087.42 13065.43 15147.90 12532.86 15732.85
## Oct 2017       14033.05 13013.06 15091.51 12481.58 15675.38

Next steps

  • This took into account overall numbers but breaking down by category/Code and district might reveal other patterns.