Update 10/21: I calculated some more conventional statistics and weighted the data to account for days of the week. See more here.
Disclaimer: I am not a statistician. This is just a little experiment I ran based
on the data that I could get my hands on. I’ve tried to give a comprehensive explanation
of my methodology, so if you notice any flaws, please let me know!
A recent discussion with a friend left me wondering how much truth there is to the
rumor that the police give more tickets towards the end of the month to meet quotas.
As a fan of the Steven Levitt / Stephen Dubner team behind Freakonomics, I decided to
find some data and see what it had to say.
The data I’m using is from the City of Baltimore’s Open Data Catalog. 
It contains information on almost two million traffic citations issued in the last decade.
Is the data comprehensive?
The first thing I wanted to know about the data is whether or not it was comprehensive.
It was obvious that certain earlier years weren’t complete; there is a single citation listed for
Unfortunately, I have not been able to deduce whether or not the set is complete. This brings
me to my big assumption:
The data is either complete, or a sufficiently representative sample of all tickets in 2009, 2010, and 2011.
I don’t like making that assumption, but I’m not writing a paper for journal publication here, so I’ll
just work with what I can get.
Does it accurately represent the question?
The question here is whether or not the police give out more citations towards the end of the month. This
data includes citations given by automated speed cameras, but I can remove those from the data set. I think
that the remaining citations will give an accurate record of manually issued citations.
Processing the data
There are several things that need to be corrected for before the raw data will show us what we want to know:
- The start and stop points of the data are a little blurry.
- The data includes citations issued by automated cameras.
- There aren’t as many 31sts, 30ths, or 29ths in a year as other days of the month.
We’ll start with a graph of all of the data in the set:
Limiting the data to a certain date range
Now we want to firm up the edges of the data set. I’m going to alter my processing script so that it only counts citations issued
Removing automated cameras from the data
Now we want to skip citations issued by automated cameras. To do this, I filtered out violation codes 32 and 33, i.e. fixed
and mobile speed cameras.
Only a small number of tickets were removed by filtering out speed cameras, so the graph is virtually unchanged.
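The two filtering steps above (date range and speed cameras) can be sketched roughly like this. This is only an illustration, not the actual processing script: the record layout (`issued`, `violation_code`) and the 2011 date range are assumptions made for the example; only the camera codes 32 and 33 come from the text.

```python
from collections import Counter
from datetime import date

# Violation codes named in the post: fixed and mobile speed cameras.
CAMERA_CODES = {32, 33}

def count_by_day_of_month(citations, start, end):
    """Count non-camera citations per day of month within [start, end].

    `citations` is an iterable of dicts with hypothetical keys
    "issued" (a datetime.date) and "violation_code" (an int).
    """
    counts = Counter()
    for c in citations:
        if c["violation_code"] in CAMERA_CODES:
            continue  # skip automated speed cameras
        if not (start <= c["issued"] <= end):
            continue  # outside the date range being studied
        counts[c["issued"].day] += 1
    return counts

# Made-up records, only to show the filtering behaviour.
sample = [
    {"issued": date(2011, 1, 31), "violation_code": 32},   # camera: dropped
    {"issued": date(2011, 1, 31), "violation_code": 11},   # kept
    {"issued": date(2011, 2, 28), "violation_code": 11},   # kept
    {"issued": date(2008, 12, 31), "violation_code": 11},  # out of range: dropped
]
counts = count_by_day_of_month(sample, date(2011, 1, 1), date(2011, 12, 31))
print(dict(counts))
```

Keeping the counts keyed by day of month means the later steps only need 31 numbers per year.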
Correcting for more frequent dates
In the update below, I also corrected for the effects of more frequent days of the week, so many of the charts have changed. Some points made in the analysis may now be moot, but I’m leaving this part in its original form. Make sure you read the update too!
Only seven months in a year have a 31st. To account for this variation among different days of the month, I used the following formula to figure out what an “even” distribution of citations would look like for a given day of the month:
(((appearances in year * 100) / # days in year ) * total citations) / 100 = expected # of citations
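As a sketch, the formula above could be written like this in Python. The function names and the total of 365,000 citations are made up for the example; the arithmetic follows the formula as written, including the “# days in year” term for leap years.

```python
import calendar

def appearances_in_year(day, year):
    """Number of months in `year` that contain the given day of the month."""
    return sum(
        1 for month in range(1, 13)
        if calendar.monthrange(year, month)[1] >= day
    )

def expected_citations(day, year, total_citations):
    """Expected citations for a day of month under an even distribution."""
    days_in_year = 366 if calendar.isleap(year) else 365
    share_percent = appearances_in_year(day, year) * 100 / days_in_year
    return share_percent * total_citations / 100

total = 365_000  # made-up total, chosen so the arithmetic is easy to follow
for day in (1, 29, 30, 31):
    print(day, round(expected_citations(day, 2011, total)))
```

In a non-leap year the 31st appears in seven months, so it gets 7/365 of the total, while the 1st through the 28th each get 12/365.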
Here’s what an even distribution of citations would look like:
Now, to compare the expected distribution with the actual distribution, here’s a chart of the difference between
the two for each day of the month:
When the bar is positive, more tickets were given than would be expected, and when it’s negative, fewer.
Here’s the same graph recalculated with the citations from 2010 and 2009:
Finally, I’ve averaged out the data from those three years. I first normalized the data for each year to account
for different total numbers of citations in each year with this formula:
(actual - expected) / total * 1000 = normalized number
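Here is a minimal sketch of that normalization, followed by a plain average across years. The function names and the sample numbers are invented for illustration, and averaging with a simple mean per day of month is an assumption about how the years were combined.

```python
def normalized_difference(actual, expected, total):
    """(actual - expected) / total * 1000, as in the formula above."""
    return (actual - expected) / total * 1000

def average_across_years(per_year):
    """Average normalized differences day by day.

    `per_year` is a list of dicts mapping day of month -> normalized value;
    every dict is assumed to contain the same days.
    """
    days = per_year[0].keys()
    return {d: sum(year[d] for year in per_year) / len(per_year) for d in days}

# Made-up numbers for a single day of the month in two years of different size.
year_a = {28: normalized_difference(actual=1300, expected=1200, total=100_000)}
year_b = {28: normalized_difference(actual=2700, expected=2400, total=200_000)}
averaged = average_across_years([year_a, year_b])
print(averaged)
```

Dividing by the yearly total is what keeps a busy year from dominating the average.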
This is the key graph here. It shows that the citation rates from the 10th to the 27th are, for the most part, lower than we’d expect with an even distribution, and that on the 28th, 29th, and 30th those rates jump up to a much higher level than we’d expect. Interestingly, it also shows that the rates drop to well below their expected levels on the 31st.
There are a few things about this graph that I find interesting:
- The rates jump on the 28th, 29th, and 30th.
- The rates dive back down for one day on the 31st.
- The first nine days have higher-than-expected rates.
- The period from the 10th to the 27th has lower-than-expected rates.
I can only guess what the cause of these patterns might be. Since we want to know about quotas, let’s see if we can explain things with that.
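- Jump on the 28th, 29th, and 30th: “I need to have X tickets written in 4 days!”
- Dive on the 31st: No good explanation
- High rates in the first nine days: “I don’t want to be rushing like that again this month.”
- Low rates from the 10th to the 27th: Falls behind after the fear from the last rush fades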
So what does this mean? Are the police giving out more tickets at the end of the month to meet quotas? Not exactly. It may look like a quota system would explain the patterns we see in the data, but you could probably come up with another explanation that seems to fit the data too. It’s also worth noting that a quota system wouldn’t really explain the drop on the 31st, and my quota-based explanation for the high rates in the first week of the month is kind of thin.
Well, it’s been fun playing economist, but I don’t know if I really have enough here to draw a conclusion. Do the police in Baltimore give out more tickets at the end of the month to meet quotas? It seems plausible, but I don’t know more than that.
Update 10/21: I’ve done a little studying, and I’d like to provide some more conventional statistics.
Here’s some data from each year:
||Mean Diff. From Expected
Z-score is a statistic used to normalize data. It is calculated with this formula:
Z-score = (data point - mean) / standard deviation
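A small sketch of the Z-score calculation using Python’s standard library. The post doesn’t say whether the population or sample standard deviation was used, so the choice of `pstdev` (population) here is an assumption, and the input values are made up.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Z-score of each value: (value - mean) / standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    return [(v - mu) / sigma for v in values]

# Made-up differences from expected; the mean is 5 and the std. dev. is 2.
diffs = [2, 4, 4, 4, 5, 5, 7, 9]
scores = z_scores(diffs)
print(scores)
```

Because each year is scaled by its own spread, Z-scores from different years can be averaged without one noisy year swamping the others.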
Here’s a graph of the average of Z-scores for the three years:
Several dates, such as the 6th and 28th, appear very different in this graph than in the other average.
Here we can directly compare the two. The graph from the old formula has been re-scaled, so the numbers along the y-axis are not accurate for that data.
Day of week
Several people suggested that I take days of the week into consideration, since it’s likely that the number of tickets given out on weekends is very different from those on weekdays, and that if any day of the month should fall on a certain day of the week more often than others, it would skew the results.
To adjust for this, I’m using a formula suggested by one of those readers:
Day Weight = Total on day / Total in year
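The day weight formula can be sketched as follows. The function name and the sample dates are illustrative only; the input is assumed to be one `datetime.date` per citation for a single year.

```python
from collections import Counter
from datetime import date

# Same order as date.weekday(): Monday is 0, Sunday is 6.
DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

def day_of_week_weights(issue_dates):
    """Share of a year's citations issued on each day of the week."""
    counts = Counter(d.weekday() for d in issue_dates)
    total = sum(counts.values())
    return {DAY_NAMES[i]: counts[i] / total for i in range(7)}

# Made-up dates: two Mondays, one Tuesday, one Saturday.
sample_dates = [date(2011, 1, 3), date(2011, 1, 3),
                date(2011, 1, 4), date(2011, 1, 8)]
weights = day_of_week_weights(sample_dates)
print(weights)
```

Multiplying each weight by 100 gives the percentages described next.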
Here are the weights for days of the week in each year expressed as percentages of the total number of tickets for that year. They’ve been rounded, so they may not add up to 100%:
It looks like there are far fewer tickets on weekends than weekdays in this data. How will I account for this?
Remember that chart that showed the expected tickets vs day of month? Here’s a little reminder:
Well I’m going to re-calculate the expected tickets for each day based on the distribution of those days among days of the week. Here’s the result:
Now I’m going to recalculate the differences based on these new expectations. Here’s the unadjusted chart, followed by our newly adjusted chart:
The first thing that I notice on this chart is that the previous drop that occurred on the 31st is now reversed. When we adjust for the effects of different days of the week, each of the last four days of the month is above what you would expect from an even distribution of tickets.
That doesn’t necessarily change my conclusion, but it does let us look a little closer at the true effects of day of the month on ticketing rates.
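The post doesn’t spell out exactly how the expected tickets were recalculated, so the following is only one plausible reading, not the method actually used: walk through every calendar date in the year, add that date’s day-of-week weight to its day of the month, and then split the yearly total in proportion. The function name and the weights are hypothetical.

```python
from datetime import date, timedelta

def adjusted_expected(year, total_citations, weekday_weights):
    """Expected citations per day of month, weighted by day of week.

    `weekday_weights` is a list of 7 weights in date.weekday() order
    (Monday first). This weighting scheme is an assumption.
    """
    share = [0.0] * 32  # indexed by day of month; index 0 is unused
    d = date(year, 1, 1)
    while d.year == year:
        share[d.day] += weekday_weights[d.weekday()]
        d += timedelta(days=1)
    norm = sum(share)
    return {day: total_citations * share[day] / norm for day in range(1, 32)}

# With equal weights this reduces to the original even distribution.
uniform = adjusted_expected(2011, 365_000, [1.0] * 7)

# Hypothetical weights with fewer tickets on weekends.
weekend_light = adjusted_expected(
    2011, 365_000, [0.18, 0.18, 0.18, 0.18, 0.18, 0.05, 0.05]
)
print(round(uniform[31]), round(weekend_light[31]))
```

A day of the month that happens to land on weekends more often in a given year ends up with a lower expected count, which is the effect the readers were pointing out.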
2 The earliest data point in the set is actually a red light violation from 1999, but it’s the only citation listed before 2002.
3 The data set is labeled “Parking Citations,” which is a little confusing because it seems to include
several moving violations, such as citations issued by automated speed cameras. In any case, I don’t see any reason
to assume that this would skew the data distribution among days of the month.
4 These charts were generated with this tool.
5 I used “# days in year” rather than specifying 365 to account for leap years.