We’re now a few months into our ESRC-funded ‘Civil Society Data Partnership’ project, which is using open data to answer key questions about the voluntary sector. I’ve asked two researchers working on the project – Maria Pikoula here at NCVO and Charlie Rahal at the Third Sector Research Centre (University of Birmingham) – to give an overview of what we’re doing.
The project isn’t just about answering research questions; we’re also in the business of ‘data resource construction’. This involves more than bringing open data together: we’re also documenting the process and its pitfalls, and we hope to produce guides for others who use the same data.
Maria (NCVO) – data from 360Giving
Data collected for the UK Civil Society Almanac shows that grants are a major source of income for voluntary organisations, allowing them to pursue their charitable activities. Charities receive grants from government (although increasingly less so), other charities and the National Lottery.
Charity accounts give us a good overview of how much income charities receive from grants, but there is very little granularity in this data. Grant-making foundations and Lottery-backed funders receive hundreds of thousands of applications for their grants each year, and make awards ranging from a few thousand pounds to millions.
Until recently, it was impossible to collect and analyse data at the level of individual awards. Enter 360Giving: founded by Fran and William Perrin of the Indigo Trust, 360Giving is an organisation that encourages and helps grantmakers to publish their data by providing the necessary infrastructure. This includes developing and maintaining a data standard, as well as hosting the data on their website.
This is only the beginning of the effort by 360Giving, and so far only 24 funders have released some or all of their award data, amounting to a total of 260,000 unique awards. Some of this data doesn’t comply with the 360Giving standard and is missing important details, such as unique identifiers for the recipient organisations (for example their company or charity numbers). Nevertheless, it’s a fantastic first step, without which this Data Partnership would not be possible.
We’ve started processing this data and turning it into useful information for the sector. We’ll be looking at what makes grant recipients different to other charities, and how recipients fare in the years after their grants end.
Charlie (TSRC) – data from local authorities
One of the main goals of the Data Partnership is to open up and analyse the expenditure and grants data which local authorities (LAs) are required to publish under the Local Transparency Code (updated for 2015). While the majority of LAs do make this data publicly available to some extent, the first fundamental challenge lies in the variety of ways in which it is presented.
Despite the provision of a spreadsheet template by the Local Government Association (in addition to guidance on the mandatory and recommended information required by the Code), the data formats we have seen range from plain text to PDF, CSV and JSON files. These are published across a labyrinth of subdomains of LA websites, at varying frequencies and with varying coverage.
We overcame these issues by writing a set of scripts and functions to gather, parse, clean and merge all of the varied datasets which we could find on each of the LA websites. Of the 326 LAs, we believe that 304 both provide data and provide it in a way that is easily machine readable (with the amount of persuasion required for a machine to read it successfully depending on the LA in question!). 187 LAs currently make separate datasets available which specifically pertain to grants made to the voluntary, community and social enterprise (VCSE) sector.
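To give a flavour of the clean-and-merge step, here is a minimal sketch in Python rather than the project’s actual scripts. It assumes each LA’s file has already been read in as CSV text; the header variants in COLUMN_ALIASES and the function names parse_amount and harmonise_csv are hypothetical, invented purely for illustration.

```python
import csv
import io

# Hypothetical mapping from header variants (as different LAs might publish
# them) to one common schema. Real files would need a much longer list.
COLUMN_ALIASES = {
    "supplier": "recipient",
    "supplier name": "recipient",
    "beneficiary": "recipient",
    "amount": "amount",
    "net amount": "amount",
    "amount (£)": "amount",
    "date": "date",
    "payment date": "date",
}


def parse_amount(raw):
    """Turn a string such as '£1,250.00' into a float, or None if unparseable."""
    cleaned = raw.replace("£", "").replace(",", "").strip()
    try:
        return float(cleaned)
    except ValueError:
        return None


def harmonise_csv(text, authority):
    """Read one authority's CSV text and return rows keyed by the common schema."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        row = {"authority": authority}
        for header, value in record.items():
            if header is None:
                continue  # surplus cells with no matching header
            key = COLUMN_ALIASES.get(header.strip().lower())
            if key is None:
                continue  # column we do not recognise
            value = (value or "").strip()
            row[key] = parse_amount(value) if key == "amount" else value
        rows.append(row)
    return rows
```

Calling harmonise_csv once per file and concatenating the results gives a single list of transactions with consistent field names, whatever headers each LA chose. Formats such as PDF or plain text would need their own extraction step before reaching this point.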
In all, the first stage leaves us with about 40 million expenditure transactions and over 50,000 unique grants to analyse. The second challenge lies in the fact that once the data has been obtained, parsed, cleaned and merged, accounting standards do not mandate (or involve the oversight of) any specific level of accuracy.
This often makes it extremely difficult to identify the grant beneficiaries or expenditure suppliers. Our approach was to match our huge list of recipients against the Charity Commission and Companies House databases, using a variety of computational name-matching techniques, both exact and approximate, run with distributed and local methods.
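As an illustration of the general idea, the sketch below tries an exact match on a normalised name first and falls back to an approximate match using Python’s standard difflib module. It is not the project’s algorithm: the suffix list, the 0.85 cutoff, the function names and the register identifiers used in the example are all assumptions made for illustration.

```python
import difflib
import re

# Words dropped before comparison. This list is an assumption for the sketch.
LEGAL_SUFFIXES = {"ltd", "limited", "cic", "cio", "plc", "the"}


def normalise(name):
    """Lower-case a name, strip punctuation and drop common legal suffixes."""
    name = name.lower().replace("&", " and ")
    name = re.sub(r"[^a-z0-9 ]", " ", name)
    tokens = [t for t in name.split() if t not in LEGAL_SUFFIXES]
    return " ".join(tokens)


def match_name(raw, register, cutoff=0.85):
    """Match a recipient name against a register.

    `register` maps normalised names to identifiers (for example charity or
    company numbers). Returns (identifier, method), where method is 'exact',
    'approximate' or 'unmatched'.
    """
    key = normalise(raw)
    if key in register:
        return register[key], "exact"
    close = difflib.get_close_matches(key, list(register), n=1, cutoff=cutoff)
    if close:
        return register[close[0]], "approximate"
    return None, "unmatched"
```

For example, with a register built from "The Helping Hands Trust" and "Acme Community Centre Ltd" (both made-up entries), "HELPING HANDS TRUST LIMITED" matches exactly after normalisation, while the misspelt "Acme Comunity Centre" is picked up by the approximate step. Raising the cutoff reduces false matches at the cost of a lower match rate; at the scale of tens of millions of transactions, comparing every name against the whole register in this way would be slow, so blocking or indexing would be needed.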
Our current estimate is that we can match upwards of 65% of the transactions – a promising start! The next (imminent) step on this front is to rigorously debug the algorithms to ensure the matches are sufficiently accurate (while maintaining a good match rate), and then to merge these matches with other key pieces of information on each of the suppliers and recipients.
In the short to medium term, our goal is to incorporate additional datasets on the funding of the third sector through public sources, such as information on contracts (through LA websites) and Clinical Commissioning Groups (through the NHS), with the final intention of making our datasets freely available to the public through a data repository, along with the tools to analyse them.
More to come
I hope this gives you an overview of where we are so far. There’s more to come: in the new year we’ll be sharing the data and collaborating on it through events and more.