APC Aggregation: Data Model and Analytical Usage (by Richard Jones)

Starting from the spreadsheets developed in the Jisc Collections TCO project, which aimed to identify and collect key APC information, we have been working to create a generalised data model that gives us a coherent aggregation on which to build useful analytics.

These spreadsheets are focussed on an individual institution’s knowledge of the APCs they have paid or contributed to, the funds they paid from (e.g. RCUK, COAF), the funders of the work, and some basic bibliographic information. Inherently spreadsheets have limitations in the data they can represent – as a tabular format they struggle to represent repeatable entries (e.g., multiple authors per article) or hierarchical information (e.g. multiple funders each with grant codes).

In our work we aimed to produce a format that could handle such structures, and also to allow for a combination of ad hoc and formal identification of entities. For example, authors may be identified by name – an unreliable identifier – but they may also be identified by email address or ORCID; similarly institutions may have organisational identifiers in Ringgold or ISNI.

Furthermore we wanted to describe the data – wherever possible – with existing standards or vocabularies, and to that end relied heavily on the RIOXX profile and – where RIOXX did not already provide a term – the DCMI terms. Wherever fields did not appear in either of these places we have taken the liberty of inventing our own terms, on the assumption that standardisation for these does not already exist, since we are ourselves breaking new ground.
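To make the shape of such a record more concrete, here is a rough sketch of how repeatable authors, mixed identifiers and funders with grant codes might sit in one JSON-style document. It is not the project's actual schema: apart from jm:apc, every field name and value below is an assumption chosen for illustration, with the prefixes merely hinting at where a term might come from (RIOXX, DCMI, or our own).

```python
import json

# Illustrative record only: apart from "jm:apc", the field names are assumptions.
record = {
    "dc:title": "An example article",
    # An article may carry any number of identifiers.
    "dc:identifier": [
        {"type": "doi", "id": "10.1234/example"},
        {"type": "pmid", "id": "00000000"},
    ],
    # Repeatable entry: one object per author, each with optional formal identifiers.
    "rioxxterms:author": [
        {
            "name": "A. Researcher",
            "identifier": [
                {"type": "orcid", "id": "0000-0000-0000-0000"},
                {"type": "email", "id": "a.researcher@example.ac.uk"},
            ],
        },
        {"name": "B. Colleague", "identifier": []},
    ],
    # Hierarchical entry: each funder carries its own list of grant codes.
    "rioxxterms:project": [
        {"funder_name": "Funder One", "grant_codes": ["F1/001", "F1/002"]},
        {"funder_name": "Funder Two", "grant_codes": ["F2/100"]},
    ],
    # The APC payment itself, with the fund(s) it was paid from.
    "jm:apc": [
        {
            "organisation_name": "University of Example",
            "amount_gbp": 1500.0,
            "fund": [{"name": "RCUK"}],
        },
    ],
}


def author_identifiers(rec, id_type):
    """Return every author identifier of the given type found in the record."""
    return [
        ident["id"]
        for author in rec["rioxxterms:author"]
        for ident in author["identifier"]
        if ident["type"] == id_type
    ]


print(json.dumps(record, indent=2))
print(author_identifiers(record, "orcid"))
```

A spreadsheet row would need one column per author or per grant code to hold the same information, which is the limitation the nested structure avoids.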

A full representation of the data model at time of writing can be found here, and we won’t dissect it in detail in this post. The key things to note about it are:

  • An article may have an arbitrary number of identifiers
  • The model is agnostic as to whether the published work is actually an article – in theory other objects such as book chapters can equally well be described, should that become necessary
  • We have taken some liberties with the details of RIOXX in order to create a readable JSON formatted document structure, although the top-level elements conform fully to the profile
  • The central piece of the model is the APC data itself (jm:apc in the model), which in theory allows us to aggregate records from individual institutions for the same identifier, to gain a more complete picture of the payments (and funds from which they were paid) for a given article.
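The last point can be sketched in a few lines. The example below assumes, purely for illustration, that each institutional record exposes a single "doi" key alongside its jm:apc list and that amounts are held in an "amount_gbp" field; the real model allows several identifiers per article, so genuine matching would be more involved than this.

```python
from collections import defaultdict


def aggregate_by_identifier(records):
    """Collect the jm:apc entries of all records that share the same DOI."""
    merged = defaultdict(list)
    for record in records:
        merged[record["doi"]].extend(record["jm:apc"])
    return dict(merged)


def total_paid(apc_entries):
    """Sum the (hypothetical) amount_gbp field over a list of APC entries."""
    return sum(entry["amount_gbp"] for entry in apc_entries)


# Two institutions each report a contribution to the same article.
records = [
    {
        "doi": "10.1234/example",
        "jm:apc": [{"organisation_name": "University A", "amount_gbp": 1500.0}],
    },
    {
        "doi": "10.1234/example",
        "jm:apc": [{"organisation_name": "University B", "amount_gbp": 500.0}],
    },
    {
        "doi": "10.1234/other",
        "jm:apc": [{"organisation_name": "University A", "amount_gbp": 900.0}],
    },
]

aggregated = aggregate_by_identifier(records)
for doi, entries in aggregated.items():
    payers = [entry["organisation_name"] for entry in entries]
    print(doi, payers, total_paid(entries))
```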

In order to build an aggregation consisting of objects which conform to this data model, we have also designed a very basic REST API (basic being ideal here), the state of which at time of writing can be seen here. Again, we will not go into this in detail; suffice it to say that it supports the primary functions of a good content management API: Create, Retrieve, Update and Delete of records by the client (e.g. an institution’s APC management system, as being investigated by another part of the Jisc Monitor project).
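For readers unfamiliar with this style of API, the four operations might look roughly like the following from the client side. This is only a sketch: the base URL, the URL layout and the mapping of operations onto POST, GET, PUT and DELETE are assumptions based on common REST conventions, not a description of the API linked above. The functions only build the requests; nothing is sent.

```python
import json
from urllib.request import Request

# Hypothetical endpoint, used only to illustrate the shape of the calls.
BASE_URL = "https://example.org/api/v1/apc"
JSON_HEADERS = {"Content-Type": "application/json"}


def create_request(record):
    """Create: send a new record to the collection URL."""
    body = json.dumps(record).encode("utf-8")
    return Request(BASE_URL, data=body, headers=JSON_HEADERS, method="POST")


def retrieve_request(record_id):
    """Retrieve: fetch an existing record by its id."""
    return Request(f"{BASE_URL}/{record_id}", method="GET")


def update_request(record_id, record):
    """Update: replace an existing record with a new version."""
    body = json.dumps(record).encode("utf-8")
    return Request(
        f"{BASE_URL}/{record_id}", data=body, headers=JSON_HEADERS, method="PUT"
    )


def delete_request(record_id):
    """Delete: remove a record from the aggregation."""
    return Request(f"{BASE_URL}/{record_id}", method="DELETE")


for req in (
    create_request({"dc:title": "An example article"}),
    retrieve_request("abc123"),
    update_request("abc123", {"dc:title": "A corrected title"}),
    delete_request("abc123"),
):
    print(req.get_method(), req.full_url)
```

Against a real server, each request would be passed to urllib.request.urlopen, together with whatever authentication the service requires.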

At the moment, though, our main source of data is the TCO spreadsheets, and another principle on which we built the model was that it should always be possible to convert one of the spreadsheets unambiguously into a model object. This requires us to build and maintain a crosswalk between the tabular form of the data and a sub-set of the full model. This is documented in detail here.
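The idea of a crosswalk can be illustrated with a small sketch that maps one spreadsheet row onto a nested record. The column headings and the target field names (other than jm:apc) are invented for this example and should not be read as the documented crosswalk.

```python
import csv
import io

# A tiny stand-in for a TCO-style spreadsheet; the column names are hypothetical.
SAMPLE = """Institution,Article title,DOI,APC paid (GBP),Fund
University of Example,An example article,10.1234/example,1500.00,RCUK; COAF
University of Example,Another article,,800.00,COAF
"""


def row_to_record(row):
    """Convert one tabular row into a nested, model-style record."""
    funds = [name.strip() for name in row["Fund"].split(";") if name.strip()]
    record = {
        "dc:title": row["Article title"],
        "dc:identifier": [],
        "jm:apc": [
            {
                "organisation_name": row["Institution"],
                "amount_gbp": float(row["APC paid (GBP)"]),
                "fund": [{"name": name} for name in funds],
            }
        ],
    }
    # Only add an identifier when the spreadsheet cell is actually filled in.
    doi = row["DOI"].strip()
    if doi:
        record["dc:identifier"].append({"type": "doi", "id": doi})
    return record


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
converted = [row_to_record(row) for row in rows]
for record in converted:
    print(record)
```

Because each column maps onto exactly one place in the record, the same row always produces the same object, which is the sense in which the conversion is unambiguous.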

We have developed software which can take the TCO spreadsheets, convert them into the model format, and load them into the database which underpins the aggregation. This means that we can now effectively aggregate data from participating institutions, and so far we have just over 2000 records which are available for analysis. These are available on our test server, though be aware that this URL is unlikely to persist beyond the end of the project. The following thumbnail is a screenshot which illustrates the breakdown of institutions and publishers in the dataset:

[Screenshot: breakdown of institutions and publishers in the dataset]

We have worked with both Jisc Collections and individual institutions in capturing their requirements for reports based on an aggregation (i.e. reports which are predicated on the knowledge of APCs from more than one institution). Their requirements are recorded here. In summary:

Jisc Collections, as an organisation responsible for negotiating with publishers over subscription costs, is interested in:

  • Expenditure, Avg, High and Low per publisher
  • Expenditure on hybrid
  • Number of APCs paid vs Total Cost per institution
  • Avg APC against journal ranking
  • Correct licences applied?

Institutions, meanwhile, have some similar and some different kinds of interests:

  • Gold publications for which Green was possible
  • By Fund (paid from, e.g. COAF/RCUK) by Institution
  • Discount vs Non-Discount cost per publisher per institution
  • Number of APCs paid vs Total Cost per institution
  • Total Expenditure on individual journals
  • Avg APC against journal ranking

The next stage for us is to come up with some example reports for the most important of these. We have made a preliminary ranking of these points, with a view to implementing at least one from Jisc Collections and one from the institutions, and will be going on to work on the following:

  1. Expenditure, Avg, High and Low per publisher. You would be able to view these results for a given publisher across the whole data set (i.e. the whole sector), or to select an individual institution.
  2. Number of APCs paid vs Total Cost per institution. You would be able to specify some search constraints (TBC), and for those constraints view the total number of APCs and the total cost of those APCs, broken down by institution. This would probably be presented as a highlight report of the top results, plus a table of the complete result set.
  3. Gold publications for which Green was possible. Against the total number of APCs in the system, we would display the total number for which there was a green OA route.

These reports will form the deliverables from this second part of the project, and will be used in deciding where we go for the remainder of the project.
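As a rough indication of what the three reports involve, the sketch below computes them over a flat list of payments. The field names ("institution", "publisher", "amount_gbp", "green_possible") and the sample figures are invented for illustration; the real reports would run against the aggregation itself.

```python
from collections import defaultdict
from statistics import mean

# Invented sample data, one dict per APC payment.
payments = [
    {"institution": "University A", "publisher": "Publisher X",
     "amount_gbp": 1000.0, "green_possible": True},
    {"institution": "University A", "publisher": "Publisher X",
     "amount_gbp": 2000.0, "green_possible": False},
    {"institution": "University B", "publisher": "Publisher X",
     "amount_gbp": 1500.0, "green_possible": True},
    {"institution": "University B", "publisher": "Publisher Y",
     "amount_gbp": 500.0, "green_possible": False},
]


def publisher_report(rows, institution=None):
    """Report 1: total, average, high and low per publisher.

    With institution=None the whole data set is used; otherwise only
    that institution's payments are counted.
    """
    amounts = defaultdict(list)
    for row in rows:
        if institution is None or row["institution"] == institution:
            amounts[row["publisher"]].append(row["amount_gbp"])
    return {
        publisher: {
            "total": sum(values),
            "avg": mean(values),
            "high": max(values),
            "low": min(values),
        }
        for publisher, values in amounts.items()
    }


def institution_report(rows):
    """Report 2: number of APCs paid and total cost per institution."""
    report = defaultdict(lambda: {"count": 0, "total": 0.0})
    for row in rows:
        entry = report[row["institution"]]
        entry["count"] += 1
        entry["total"] += row["amount_gbp"]
    return dict(report)


def green_report(rows):
    """Report 3: total APCs, and how many had a green OA route available."""
    green = sum(1 for row in rows if row["green_possible"])
    return {"total_apcs": len(rows), "green_possible": green}


print(publisher_report(payments))
print(publisher_report(payments, institution="University A"))
print(institution_report(payments))
print(green_report(payments))
```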

Current versions of resources

Throughout this document we’ve provided links to resources as they stood at time of writing. Since these resources are in version control, they will change with time. To see the latest versions, use the following links:

3 thoughts on “APC Aggregation: Data Model and Analytical Usage (by Richard Jones)”

  1. Owen Stephens

    Hi Chris,

    Yes – this is a subset of the Jisc TCO data so far, as this is intended as a proof-of-concept that we have got the underlying data model correct, we can import data that fits the template used for the TCO work, and that we can generate useful reports from the result.

    Now we’ve got this working it should be trivial to add in new data from institutions and I’ll see if we can add the Sussex data – but please bear in mind that this is not meant to be a working system at the moment – just a proof-of-concept intended to explore the usefulness of such a system.

    Thanks again for taking the time to comment


  2. Chris Keene

    Hi Owen

    Thanks for the reply.
    Ah, thanks, that reassured me it was just a subset – I get paranoid when I think our data is missing from somewhere it should be.

    Of course, if you do plan to add more, then pleased if Sussex is on the list, but don’t go out of your way otherwise 🙂


