It may sometimes feel as though data journalism is inherently more objective than other types of reporting. Numbers can’t lie, right?
In fact, there are lots of ways of tricking your audience, or even yourself, when working with data. It needn’t even be malicious. Having spent the past year studying data journalism, I’ve had plenty of opportunities to discover first-hand that it’s all too easy to make mistakes that skew your results completely.
So without further ado, here are the four biggest problems I’ve encountered with bad data journalism over the past year.
1. A lack of context or proportion
Numbers are meaningless without some context. This rarely becomes more obvious than in news reports on spending, where this problem crops up on a regular basis.
“Taxpayers paying more than $1 billion for illegal immigrant children,” headlines yell out. “Benefits spending up £6.4 billion.” The figures sound outrageous, astronomical even. It’s tempting to want to splash on them. But public spending figures have a tendency to be, well, astronomical. Put them into context: break them down per person and you may find that they’re in fact totally reasonable.
What’s the lesson here? Proportions tell us more than absolute numbers, to be sure. But they’re not always the right way to go, either. Think about your data and how to represent it most faithfully.
Guardian data journalist James Ball recommended in a lecture that all data journalists put together some basic figures to avoid making stupid mistakes and have an easier time spotting what’s reasonable and what isn’t: How many people of working age are there in the UK? What’s the average salary? What’s the employment rate? Et cetera. Not a bad suggestion.
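To make the per-person idea concrete, here is a rough sketch in Python. The population figure is an assumed round number for illustration only, not an official statistic, and the names REFERENCE and per_person are made up for this example.

```python
# Reference figures to sanity-check headlines against.
# ASSUMPTION: 64 million is a round illustrative figure for the UK
# population, not an official statistic.
REFERENCE = {"uk_population": 64_000_000}


def per_person(total, population):
    """Break a headline total down into an amount per head."""
    return total / population


# The "Benefits spending up £6.4 billion" headline figure.
benefits_rise = 6.4e9

per_head = per_person(benefits_rise, REFERENCE["uk_population"])
per_head_per_week = per_head / 52

print("Per person per year: £%.2f" % per_head)
print("Per person per week: £%.2f" % per_head_per_week)
```

Under that assumption, the scary-sounding billions work out at around £100 per person per year. Whether that is a lot is still a judgment call, but it is a far easier number to reason about.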
2. Correlation does not equal causation
If you know one thing about statistics, it’s likely to be this. Correlation and causation are two very different things.
Yet newsrooms ignore this all the time. Just because two variables correlate, don’t automatically assume you’ve got a scoop. The relationship could equally be caused by some other, underlying variable. Or it could just be a total coincidence.
The relationship between Internet Explorer’s market share and the murder rate is a personal favourite. Check out Spurious Correlations for more (don’t blame me when you realise you’ve wasted an afternoon there, though!).
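If you want to see how easily this happens, here is a small Python sketch. Both series are invented numbers that simply drift upwards over time. They are not real data, and the names series_a and series_b are placeholders.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# INVENTED figures: two unrelated quantities that both happen to
# rise over the same eight years.
years = list(range(8))
series_a = [10, 12, 13, 15, 18, 19, 21, 24]
series_b = [5, 6, 6.5, 8, 9, 9.2, 11, 12]

r_a_b = pearson(series_a, series_b)
r_a_time = pearson(years, series_a)

print("a vs b:    %.3f" % r_a_b)
print("a vs time: %.3f" % r_a_time)
```

The two series correlate almost perfectly with each other, but each also correlates almost perfectly with time. The underlying variable here is just the passing years, which is exactly the kind of thing to rule out before calling it a story.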
3. Not knowing how to visualise it
Okay, this really deserves a post of its own. Or several. But for now, this will have to do.
You’ve done your data analysis, you’ve got a cracking story. But a poor visualisation may leave viewers confused. Or worse, misled.
Maybe you’re using line charts to show discrete data (don’t). Maybe you’re trying out some funky 3D pie charts (DON’T). Or maybe you’re just becoming part of that eternal debate on whether it’s ever, ever okay to truncate the y-axis.
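To see why the truncated y-axis causes such heated debate, here is a back-of-the-envelope sketch in Python. The two values and the axis starting point are made up, and apparent_ratio is a hypothetical helper written for this example.

```python
def apparent_ratio(small, large, axis_start=0.0):
    """How many times taller the larger bar looks when the
    y-axis starts at axis_start instead of zero."""
    return (large - axis_start) / (small - axis_start)


# MADE-UP values: two bars that differ by about two per cent.
honest = apparent_ratio(98, 100)         # axis starts at 0
truncated = apparent_ratio(98, 100, 97)  # axis starts at 97

print("Axis from 0:  larger bar looks %.2fx as tall" % honest)
print("Axis from 97: larger bar looks %.2fx as tall" % truncated)
```

With the axis starting at zero the bars look nearly identical. Start it at 97 and the same two-per-cent gap looks like a threefold difference.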
Data visualisation’s both an art and a science, and there are many potential pitfalls. Here are some good guides on how to avoid them:
- The Functional Art, by Alberto Cairo
- Data Visualization – Principles and Practice, by Alexandru Telea
4. Forgetting the narrative
This is the most important point, in my opinion:
Data journalism gives us the power to explore topics quantitatively. But it’s still journalism, which means it’s still storytelling. If you’re just tossing out a bag of random figures, you’re not doing your job properly. They’re just the starting point. Now, you need to guide your readers through the story. You need to make them understand why those figures are important and how.
As Tanveer Ali puts it in the Columbia Journalism Review:
“Numbers are a means of storytelling – not the story itself.”
Network analysis is a nifty area of data journalism that can show you how people are connected. This can be any type of connection really – Swedish data journalist Jens Finnäs mapped Eurovision voting data to see which countries vote for each other most often – but increasingly, it means looking at social networks like Facebook and Twitter.
Since I’m just starting to get the hang of network analysis myself, I thought I’d share what I’ve picked up so far with this tutorial.
You can visualise your own Facebook friends, or find out who the most influential users are on a certain hashtag. This last example is what I’m going to go through in this tutorial, using the Excel extension NodeXL and the network visualisation tool Gephi.
- Download NodeXL and Gephi.
- Open a blank NodeXL template through your Start menu.
- A network graph basically consists of nodes (in this case Twitter users) and the connections between them, which are called edges.
- We’re going to import nodes and edges for all the users who’ve tweeted using the #ddj (data journalism) hashtag. We could do this using something like R, but thanks to NodeXL it’s all got a whole lot less complicated.
- Click on Import and choose “Import from Twitter search network”.
- This is where you fill in what hashtag network you want to analyse. In theory, you can import the last 18,000 tweets on this hashtag, but in practice you’re unlikely to ever get quite that many, because of Twitter’s age-limits.
- Once you click on that OK button, sit back, relax and let NodeXL work its magic. In fact, maybe go make yourself a coffee. If there are a lot of tweets to import, it’s going to take a little while.
- This is what you should end up with:
Every vertex, or node, is a Twitter user, and NodeXL has logged all the ways in which it has connected with other vertices using the #ddj hashtag. The tool fetches plenty of useful information about the tweets and the tweeters – from number of followers and bios to whether something is a retweet. Have a peek through the columns.
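If it helps to see what nodes and edges look like as plain data, here is a minimal Python sketch. It is not how NodeXL works internally. The tweets and handles are invented, and treating every @mention as an edge is a simplifying assumption.

```python
import re

# INVENTED tweets as (author, text) pairs; the handles are hypothetical.
tweets = [
    ("alice", "Great #ddj tutorial by @bob"),
    ("bob", "Thanks @alice! #ddj"),
    ("carol", "RT @bob: new #ddj post, cc @dave"),
]

MENTION = re.compile(r"@(\w+)")


def build_edges(tweets):
    """Return one directed edge (author, mentioned user) per @mention."""
    edges = []
    for author, text in tweets:
        for mentioned in MENTION.findall(text):
            edges.append((author, mentioned.lower()))
    return edges


edges = build_edges(tweets)

# Every user that appears at either end of an edge is a node.
nodes = sorted({user for edge in edges for user in edge})

# Ignoring direction merges alice->bob and bob->alice into one link.
undirected = {frozenset(edge) for edge in edges}

print(nodes)
print(edges)
print(len(undirected), "undirected links")
```

Notice that alice mentioning bob and bob mentioning alice are two separate edges as long as direction matters, but collapse into a single link once it doesn't.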
- This is now ready to be exported to Gephi. Select Export as GraphML file.
- Now we’re getting into the really fun stuff. Fire up Gephi and click on Import Graph File in the top left.
- When asked in the pop-up window, choose a directed rather than undirected graph. The right choice depends on what type of network you’re analysing. For Twitter, which has asymmetrical follow relationships, directed makes more sense, but for Facebook, where a friendship always goes both ways, undirected would be better.
- Okay. We’ve now got a graph that looks like so much grey chaos.
- To fix that, we’re going to start by changing the layout. Choose Force Atlas 2, and set the scaling to something like 3.0. The scaling will affect how closely the nodes in your network are drawn to each other. Click on Run and see magic start to happen.
- Next step: On the right-hand side, you’ve got a number of settings. Click Run on Modularity and Eigenvector centrality. Eigenvector centrality is a pretty funny word, but it’s also a useful way of measuring how important a node is within its network. There are a number of different centrality measures, and which one you need will depend on your network. This post is a good introduction.
- We’re going to do two more things to get a more useful layout. First, colouring the nodes based on their groupings. To do this, click on Partition in the top left and choose Modularity as the partition parameter. Click Apply.
- Next, we want the nodes’ size to reflect how important they are to the network. Click on Ranking and select Size by Eigenvector centrality. I went for a minimum of 10 and maximum of 150.
- We’ve now got different sized nodes depending on how central they are – but they’re all jumbled up. Go back to the layout tab and select Prevent overlap. Run that again until the graph is more evenly spaced out.
- Labels: Click the T icon under the graph to show your nodes’ labels. Click on the A next to it and select Scale labels by node size.
- Your graph is practically done! Go to the Preview tab where you can fine-tune the actual appearance some more, if you want. Otherwise, just admire your creation. It’s ready to export!
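For the curious, the eigenvector centrality and node sizing from the steps above can be sketched in a few lines of Python. This is a simplified illustration, not Gephi's actual implementation: it assumes a tiny undirected toy network with made-up node names, uses plain power iteration, and scales sizes linearly between the minimum of 10 and maximum of 150 I picked earlier.

```python
# Toy undirected network; the node names are hypothetical.
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c"), ("d", "e")]

# Build a lookup of each node's neighbours.
neighbours = {}
for u, v in edges:
    neighbours.setdefault(u, set()).add(v)
    neighbours.setdefault(v, set()).add(u)


def eigenvector_centrality(neighbours, iterations=100):
    """Power iteration: a node scores highly if its neighbours do.

    Each node keeps its own score and adds its neighbours' scores,
    which keeps the iteration stable on a connected network.
    Scores are rescaled so that the top node gets 1.0.
    """
    scores = {node: 1.0 for node in neighbours}
    for _ in range(iterations):
        updated = {
            node: scores[node] + sum(scores[n] for n in neighbours[node])
            for node in neighbours
        }
        top = max(updated.values())
        scores = {node: value / top for node, value in updated.items()}
    return scores


def scale_sizes(scores, smallest=10.0, largest=150.0):
    """Map scores linearly onto node sizes between smallest and largest."""
    low, high = min(scores.values()), max(scores.values())
    span = (high - low) or 1.0  # avoid dividing by zero if all scores match
    return {
        node: smallest + (score - low) / span * (largest - smallest)
        for node, score in scores.items()
    }


scores = eigenvector_centrality(neighbours)
sizes = scale_sizes(scores)

for node in sorted(scores, key=scores.get, reverse=True):
    print("%s  centrality %.3f  size %.1f" % (node, scores[node], sizes[node]))
```

Node a comes out on top, not just because it has the most connections but because its neighbours are well connected too. Node e, hanging off the edge of the network, ends up with the smallest circle.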
Phew! That’s it, we’ve made a basic network graph of people tweeting about data journalism. Can you find yourself there?
Any questions or thoughts, let me know on @cguibourg or here.
It’s been rather quiet on the blogging front lately. I’m now more than half-way through my MA at City, and let’s just say the coursework is starting to pile up.
To that end, I’ve read books, gone through video tutorials, and even gone to a real live workshop. I’ve got to be honest: It’s been a challenge. If the first baby steps I took while learning to code were easier than I’d been expecting, I’m definitely deep in the Desert of Despair now.
[On a side note – every coding noob should read this blog post, with the perky title “Why learning to code is so damn hard”. It cheered me up no end.]
Back when I whizzed through my first HTML tutorial at Codecademy, I was picking up badges and gold stars and generally feeling pretty pleased with myself. I stupidly thought to myself: “I’ve totally got this coding thing covered.”
Now, every tutorial seems to bring up another slew of things that I need to understand. Things that I ought to understand. Things that I really, really don’t understand and that quite frankly make my head hurt. But hey, apparently that’s normal.
So it has been a little painful so far, but I’m now slowly starting to feel the possibilities that D3 could bring. I mean, sure, so far I’m happy if I manage to make a bar chart, so I may not have reached the Upswing of Awesome quite yet. But HYPOTHETICALLY. The possibilities are definitely there.
(You can have a peek at the interactive version of the chart here.)
Anyway, this is all a rather roundabout way of saying that I wrote a longer piece about the pros and cons of different D3 tutorials, posted on the Interhacktives blog. Have a read if you, like me, want to learn more about it.
| Term | What it does |
|---|---|
| # | Comments out a line of the script. |
| / | Divide. In Python 2, division with integers rounds down to the nearest integer. |
| % | Modulo. Gives the remainder of a division. |
| += | Shorthand for “plus itself”. E.g. x = x + y is the same as x += y. |
| <= | Less than or equal to. |
| >= | Greater than or equal to. |
| = | Assigns a value to a variable. NB! Not a comparator. |
| variable | A storage location holding some data, assigned some identifying name. |
| integer | Whole numbers. |
| float | Numbers with decimals. |
| string | A sequence of characters, e.g. text. |
| boolean | A variable with two possible values: True or False. |
| concatenation | Joining several strings together end to end. |
| formatters | Allow you to format strings with variables within them. |
| \ | Escapes certain characters in a string. |
| """ | Three quotes allow you to write strings that are as long, and span as many lines, as you want. |
| raw_input() | Prompts the user to input something. |
| int(raw_input()) | Converts user input to an integer. |
| import | Imports modules from other Python libraries to your script. This enables you to import just those bits that you need for your script, keeping it small and neat. |
| argv | List of command line arguments for a Python script. |
| open() | Opens an external file in your script. The default is read mode, but you can pass an optional second argument to open it in write or append mode. E.g. open(filename, 'w') opens in write mode. |
| read() | Reads the opened file. |
| close() | Closes a file. |
| write() | Writes a string to the file. Syntax: fileobject.write(str) |
| truncate() | Clears the file, partially or completely. An optional size argument truncates the file to that size. |
| seek() | Sets the file’s current position at the offset. The optional whence argument defaults to 0, meaning the offset is counted from the start of the file. |
| readline() | Reads one line of your file. |
| len() | Returns the length of an object. |
| functions | A named section of your script that performs a certain task when called. |
| def | Used to define new functions in Python. Syntax: def function_name(): |
| return | Returns a value from a function. |
| exists | Checks if a file exists. |
| pydoc | Python documentation. Access through Powershell via “python -m pydoc [method]”. |
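To see a handful of these in action together, here is a short sketch. One caveat: it is written for Python 3, where raw_input() has been renamed input() and / no longer rounds down (// does that instead). The file name notes.txt and the function describe are made up for the example.

```python
import os
import tempfile
from os.path import exists


def describe(total, parts):
    """Return a formatted string showing floor division and modulo."""
    # %d is a formatter: each one is filled in with a value from the tuple.
    return "%d / %d = %d remainder %d" % (
        total, parts, total // parts, total % parts)


count = 7
count += 3  # shorthand for count = count + 3

summary = describe(count, 4)
print(summary)

# Write to a scratch file in a temporary folder.
path = os.path.join(tempfile.mkdtemp(), "notes.txt")

target = open(path, "w")  # 'w' opens the file in write mode
target.write("first line\nsecond line\n")
target.close()

source = open(path)       # the default is read mode
first = source.readline() # reads one line
source.seek(0)            # jump back to the start of the file
everything = source.read()
source.close()

file_was_created = exists(path)
print(first, len(everything), file_was_created)
```

The seek(0) matters: after readline() the position sits at the start of the second line, so without it read() would only return the rest of the file.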
Good news, everyone!
I’ve just launched the news website Project Ada, together with my fellow Interhacktives Sam, Keila and Ashley. (Mainly I think I’m just on a roll, purchasing domain names left, right and centre, after discovering how incredibly easy it was when I created my own website.)
Project Ada covers women in technology, which handily combines two of my greatest interests: feminism and all things geeky and techy. We fannishly borrowed the name from the original geek goddess and first ever computer programmer, Ada Lovelace.
Tech industries have a massive gender problem, with the percentage of women in the field not just low but actually dropping. Which may have something to do with the tragically old-fashioned ideas about tech being somehow… un-feminine that are still being bandied about. Here’s Decoded’s Kathryn Parsons, who I interviewed for Project Ada last week:
“People still say to me ‘women’s brains don’t really work that way’. It happens every week. I won’t stop until I never hear that phrase again.”
We want to report on these issues, and also showcase the many role models in the industry that are obviously out there.
So why am I doing this?
What, apart from the fact that I get to report on an important topic for me?
I want to deepen my knowledge of this niche, so that I can become a better reporter on this particular beat, and get to know people in this field. And heck, what better way to learn than by just getting out there and doing it?
I want to be covering this beat anyway, so this is a perfect opportunity to practise and build contacts in this field, while showcasing some of the stories that I’m proud of.
Also, although I’ve worked as a web editor for a few years, this is actually my first experience of building a news site completely from scratch, and building a community alongside it. I look forward to experimenting with the content, the site and how to build an audience.
If nothing else, I’ve got high hopes that reporting for Project Ada will get me off my butt and into attending a lot of inspiring hackathons and events. Not a bad reason in itself.
Getting readers into the editorial process at the click of a button? This was the idea that caught the judges’ attention at the recent Build The News hackathon.
I took a moment to speak to the Wall Street Journal’s Elliot Bentley, one of the team members behind the winning entry, Crowdtip, about what makes their idea stand out.
“Often the comments can be a big cesspool of readers shouting at each other, or at the writer. It’s difficult for journalists to gather meaningful feedback from vast quantities of comments,” he said.
To solve that, Crowdtip works as a widget embedded into articles, allowing readers to vote on which direction the newsroom should take in future coverage. A vote is submitted at the click of a button.
Why go to a hackathon?
So, after a hectic 48 hours, does he think it was worth giving up a weekend to participate in a hackathon? Elliot Bentley, who’s been to one other hackathon before Build The News, is convinced of its value.
“What’s great is that you create a space to experiment in and come up with ideas, without worrying about day-to-day work. There are often a lot of practical reasons why these things couldn’t get done, so it’s nice to put these limitations to the side.”
Melding journalists and coders
Most newsrooms today are still treating journalists and developers as slightly separate species, with their workspaces physically and mentally separated from each other. Events like Build The News bring them closer together, Elliot Bentley said.
“These events create an understanding between these two groups – actually, I don’t like to think of them as two groups, personally. They start melding.”
Incidentally, he rather neatly embodies this idea himself, having moved from student media to front-end development. The Guardian once described the Wall Street Journal graphics editor as a “new breed of journalist-cum-coder”.
“I really like coding, and I really like web development. For me it was natural to develop these skills I was interested in,” he said, adding that we should all be aspiring to “technological literacy”.
“I wouldn’t say that everyone aspiring to be a journalist should learn to code. That would be ridiculous. But journalists should get a better understanding of what’s going on behind the scenes as computers become more and more essential.”