Updated: May 19, 2020
If you do not know how to ask the right question, you discover nothing. - W. Edwards Deming
We often get people in our classes who are coming to data analysis for the first time. We love having them in class, but they are often intimidated by data and those who use it to make their arguments. For them, it's a black box, or a dark art.
Getting through that intimidation isn't easy, but over time we've come to understand something fundamental, both about this latent fear common in our culture and about data analysis in general: the first step isn't getting people comfortable with Excel, it's getting them to think about the problem first.
The Limits of Data
I see a lot of people coming to data, whether to a data set on their own or to a data analyst they're working with, and wanting to know "why":
"Why are unarmed black men being shot in our communities?"
"Why are our schools so bad?
"Why hasn't my street been fixed after I've made so many complaints?
To answer "why" questions, we have to not only see and account for all the factors that make a trend happen but also understand why those factors are occurring. This knowledge is often locked up in the hearts and minds of the people with the power and authority to influence those factors.
And peering into the hearts and minds of people is beyond anything data can do.
What data can do is tell us "how many", "where", "what kind", "when", and "to what effect." These are things we can see and count. They don't tell us the why, but using the outcome, we can infer the intent and guess at the why.
For example, if a public works agency is receiving a large number of complaints about a broken street lamp or pothole and not doing anything about them, then someone is making the decision that fixing the problem isn't worth their time or resources. Even if the problem is a technical one (no one is sharing the received complaints with the supervisors), there is a manifest choice being made not to have a reliable and consistent system of relaying public complaints within the agency.
We can never know why they are making their decision, only that they clearly have made it. In an open society, we have the opportunity, through our elected representatives, to have them account for that choice and make clear why they made it.
If the outcome of their choice doesn't align with the intent, then those with power and authority have the option of adjusting their intentions and actions to produce the proper outcomes. They can better prioritize tasks to meet the public need or fix their reporting system to use complaints to schedule crews.
I can come up with whatever version of "why" that I want to, but if the problem is fixed, none of that matters because the work is done.
If you tell me your intention is to reduce or even end the use of deadly force against unarmed civilians, then the outcome should be fewer officer-involved shootings of unarmed civilians. If this doesn't happen then your actions aren't in line with your stated intentions. Otherwise, the needle would move and change would occur. If action is taken and there isn't a change, then those probably weren't the effective actions to take (or they may need longer to have an effect). Right intentions require effective actions to achieve meaningful impact.
Crafting an Analytical Question
So we're left with the challenge of how to craft a question that can be answered with data. Taking the example of "why are police using deadly force against unarmed civilians, particularly people of color," we can start looking at the data to ask:
How many unarmed civilians were killed by police officers?
What is the distribution of race, ethnicity, age, and gender of the victims of officer-involved shootings?
What is the profile of those officers who've been involved in these incidents (age, race, ethnicity, gender, time on the force, past service history, etc.)?
We can also look at the factors involved in these incidents such as:
How were these incidents initiated (911 call, patrol stop, etc.)?
What time of day are these incidents likely to occur?
Where are these incidents occurring?
As you can see, this exercise doesn't require that we've ever looked at the available data. In fact, I believe this process can be harmed if we have looked at the data or, at the very least, are too familiar with it. We can easily get fixated on what's available and not see the other important pieces of information we may need in order to meaningfully analyze the issue.
This is called anchoring bias, and it's why we say that data analysis doesn't begin with the data but with the question to be answered.
Doing the Analysis
Only when we have the key questions identified do we want to go into the data, beginning with the simple task of counting what's in the data to answer our questions above.
Once we have those counts, we can start looking for patterns, such as:
Who is likely to be the victim of an officer-involved shooting?
What types of calls are more likely to result in an officer-involved shooting?
Where in the city is an officer-involved shooting more likely to happen?
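The counting step can be sketched in a few lines of Python. This is only an illustration: the records below are invented, and the field names (initiated_by, hour, district) are assumptions about what an incident data set might contain, not a description of any real one.

```python
from collections import Counter

# Invented records for illustration only; field names are assumptions.
incidents = [
    {"initiated_by": "911 call", "hour": 23, "district": "North"},
    {"initiated_by": "patrol stop", "hour": 2, "district": "North"},
    {"initiated_by": "911 call", "hour": 14, "district": "East"},
    {"initiated_by": "patrol stop", "hour": 22, "district": "North"},
]


def count_by(records, field):
    """Count how many records share each value of `field`."""
    return Counter(record[field] for record in records)


def share(counts):
    """Turn raw counts into shares of the total so categories can be compared."""
    total = sum(counts.values())
    return {key: n / total for key, n in counts.items()}


by_initiation = count_by(incidents, "initiated_by")  # "how many" per call type
by_district = count_by(incidents, "district")        # "where"
district_share = share(by_district)                  # each district's share of incidents
```

One caution on reading the result: a district's share of incidents is not yet a likelihood. To say which types of calls are "more likely" to end in a shooting, we'd also need to know how many calls of each type there were in total.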
At this point, we'll also have a clear idea of what's not in the data that we want to know. We'll see that we don't have important information on the victims or the officers involved. We won't know key things about the incident.
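One way to make those gaps concrete is to check the data against the list of things we said we wanted before we opened it. A rough sketch, with invented records and hypothetical field names:

```python
# Fields we decided we wanted before opening the data (hypothetical names).
wanted_fields = [
    "victim_age",
    "victim_gender",
    "officer_years_on_force",
    "initiated_by",
]

# Invented records; None marks a value the source did not record.
records = [
    {"victim_age": 34, "victim_gender": None, "initiated_by": "911 call"},
    {"victim_age": None, "victim_gender": None, "initiated_by": "patrol stop"},
    {"victim_age": 27, "victim_gender": "F", "initiated_by": "911 call"},
]


def missing_counts(rows, fields):
    """For each wanted field, count the rows where it is absent or empty."""
    return {
        field: sum(1 for row in rows if row.get(field) is None)
        for field in fields
    }


gaps = missing_counts(records, wanted_fields)
```

A field that is missing from every row, like officer_years_on_force here, is a signal that we'll need another data set or an estimate.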
Our choice then is to try to find the data by looking at other data sets or to come up with a good estimate using some reliable method. For example, it's possible to guess someone's gender based on their name, though this approach is much less reliable for guessing someone's race. When it comes to the profile of police officers, the salary and basic employment history of all city employees, including police officers, are part of the public record and often easily available. It can be a challenge to match this data to the incidents, but doing so gives some information on an officer's history with the police force.
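To give a sense of why matching is a challenge, here is a minimal sketch of joining officer names from incident reports to a public employee roster. The names, the hire_year field, and the normalize helper are all made up for illustration, and matching on names alone is an assumption that breaks down quickly in practice.

```python
def normalize(name):
    """Lowercase, collapse whitespace, and turn 'Last, First' into 'first last'."""
    name = " ".join(name.lower().split())
    if "," in name:
        last, first = [part.strip() for part in name.split(",", 1)]
        name = f"{first} {last}"
    return name


# Invented roster and incident names; no real people are represented.
roster = [
    {"name": "Rivera, Alex", "hire_year": 2011},
    {"name": "Sam  Chen", "hire_year": 2017},
]
incident_officers = ["Alex Rivera", "sam chen", "Jordan Lee"]

# Index the roster by normalized name so each lookup is a dictionary access.
roster_by_name = {normalize(row["name"]): row for row in roster}

matched = {}
unmatched = []
for officer in incident_officers:
    key = normalize(officer)
    if key in roster_by_name:
        matched[officer] = roster_by_name[key]["hire_year"]
    else:
        unmatched.append(officer)
```

Two employees with the same name would collide in this sketch, and nicknames or misspellings would land in the unmatched list, so both lists deserve a manual review before anyone draws conclusions from them.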
Looking for Trends
Once we understand these trends, we can move on to the factors that correlate with them.
We can use Census data about our communities to explore questions like:
Are these incidents more likely to happen in areas with more people of color living there?
Are these incidents more likely to happen in areas with higher levels of poverty?
The key in this analysis is to look not only at the communities impacted by these incidents but also at who lives in the communities where this ISN'T happening. If we say this is an issue of communities of color, then we should see it not being a problem in communities without large numbers of people of color AND see it being a problem in other communities with large numbers of people of color. If that isn't the case, then we need to explore other key factors that may be more important to understanding what is happening.
This doesn't mean that poverty or race and ethnicity aren't factors, but they may not be the only factors, and having the comparison will tell us how important a factor each of them is.
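The comparison can be sketched as a rate per 100,000 residents, computed for areas above and below some threshold on each factor. Everything here is an assumption made for illustration: the tract figures are invented and say nothing about any real community, and the field names and thresholds are hypothetical.

```python
# Invented tract-level figures for illustration only.
tracts = [
    {"tract": "A", "population": 4000, "share_poc": 0.70, "poverty_rate": 0.30, "incidents": 3},
    {"tract": "B", "population": 5000, "share_poc": 0.65, "poverty_rate": 0.08, "incidents": 0},
    {"tract": "C", "population": 6000, "share_poc": 0.20, "poverty_rate": 0.28, "incidents": 3},
    {"tract": "D", "population": 5000, "share_poc": 0.15, "poverty_rate": 0.06, "incidents": 0},
]


def rate_per_100k(rows):
    """Incidents per 100,000 residents across a group of tracts."""
    population = sum(row["population"] for row in rows)
    if population == 0:
        return None  # an empty group has no meaningful rate
    incidents = sum(row["incidents"] for row in rows)
    return 100_000 * incidents / population


def compare(rows, field, threshold):
    """Compare incident rates in tracts at or above a threshold with those below it."""
    above = [row for row in rows if row[field] >= threshold]
    below = [row for row in rows if row[field] < threshold]
    return {"above": rate_per_100k(above), "below": rate_per_100k(below)}


by_race = compare(tracts, "share_poc", 0.5)        # hypothetical 50% threshold
by_poverty = compare(tracts, "poverty_rate", 0.2)  # hypothetical 20% threshold
```

Using rates rather than raw counts matters because areas differ in population. The size of the gap between "above" and "below" for each factor is what gives a first, rough sense of how much that factor matters. With real data we'd want to look at the factors together rather than one at a time.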
We can also look at the activities of the police themselves by asking questions like:
Are these incidents more likely to happen in areas with lower or higher frequency of police patrols or other forms of police presence?
Does a police officer's background have an impact on whether they are likely to be involved in an officer-involved shooting?
This data can be harder to find, but as The Center for Data Science and Public Policy recently showed in their work with the Charlotte-Mecklenburg Police Department, the background and service history of an officer can be a major determinant of whether an officer is likely to be involved in an officer-involved shooting. This has implications not only for the community but the police force as well.
So to summarize, we start with the questions we want to answer, paying careful attention to whether those questions can be answered with data. If not, then we need to break down our questions into parts that can be answered with data, starting with the key words:
when (to understand time frame)
where (to understand location)
what (to understand specific categories of high, low, less, more, frequent, not frequent, representative, not representative, common, or infrequent)
how many (quantity)
how much (magnitude)
Then we go into the data, answering what we can, identifying what we can't answer, and then looking at what the answers mean by looking at trends and correlating those trends to other factors.
At the end of this process, we may not be any closer to understanding why, but we've at least gathered what we need to start holding decision makers accountable for the impact of their actions. We then have a way to ensure that whatever changes they implement are held up to scrutiny for whether they have a meaningful impact in the way we want them to. Otherwise we're left with just words and a continuation of the same problems without any reliable answers.