Publication & License Harvesting – Development Update from Richard Jones

We began this sprint by focussing on running the DOAJ article data through the service, since this seemed likely to present the most challenges and therefore to require the most time in this section of the project.

We made a partial clone of the Directory of Open Access Journals (DOAJ) data, containing 100,000 articles which have Digital Object Identifiers (DOIs); approximately 50% of the DOAJ articles do not have DOIs. This is a large enough subset of the 1.7 million articles in DOAJ that we can get a representative feel for the data; it also presents us with scalability challenges which we can work to overcome to ensure that the full dataset is ultimately processable.

It was also necessary for us to create a new client library for the service, geared specifically towards high-volume, long-running requests for data. Since the service downloads content and analyses that content, it can take a long time to process all the identifiers (in the order of many hours or even days), so the client library has to be robust enough to handle this kind of operation. The current version can be found here: (
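To illustrate what "robust enough" might involve, here is a minimal sketch in Python of sending identifiers in batches and retrying a failed batch with exponential backoff. It is not the actual client library: the names `batches`, `process_all` and `send_batch`, the batch size and the retry settings are all assumptions made for illustration, and `send_batch` stands in for whatever call submits identifiers to the service.

```python
import time
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def batches(identifiers: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield successive lists of at most `size` identifiers."""
    batch: List[str] = []
    for identifier in identifiers:
        batch.append(identifier)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def process_all(
    identifiers: Iterable[str],
    send_batch: Callable[[List[str]], Dict[str, str]],  # hypothetical service call
    batch_size: int = 100,      # assumed value, not taken from the project
    max_attempts: int = 5,      # assumed value
    base_delay: float = 1.0,    # seconds before the first retry (assumed)
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[Dict[str, str], List[str]]:
    """Send every batch, retrying failures; return (results, failed identifiers)."""
    results: Dict[str, str] = {}
    failed: List[str] = []
    for batch in batches(identifiers, batch_size):
        for attempt in range(max_attempts):
            try:
                results.update(send_batch(batch))
                break
            except Exception:
                if attempt == max_attempts - 1:
                    # Give up on this batch but carry on with the rest of the run.
                    failed.extend(batch)
                else:
                    # Wait 1x, 2x, 4x, ... the base delay before trying again.
                    sleep(base_delay * 2 ** attempt)
    return results, failed
```

Recording the identifiers that still fail after the last attempt, rather than aborting, means a run lasting hours or days does not have to restart from the beginning because of one bad batch.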

In an initial run of 5000 of these identifiers, our results showed that the licences of 50% of the articles could not be detected, 30% were CC BY, 15% were Free to Read, and the remaining 5% were other CC licences. This is only a small subset of the data, and therefore may not be representative of the bigger picture, but it gives us an early indication and points us towards some issues that we will need to resolve:

  • We need to look into ways of disambiguating the kinds of failure. There are three ways that a licence can fail to be found: because there was a technical problem with the publisher’s website, because the publisher does not provide licence text, or because the service does not know how to interpret that publisher’s page.
  • We need to ignore DOIs which resolve directly to PDFs, because the service cannot currently analyse PDFs for licence statements.
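The two points above could be sketched roughly as follows, in Python. This is only an illustration under stated assumptions, not how the service actually works: the `Outcome` names, the `classify` function and its inputs (an HTTP status code, a Content-Type header, a flag saying whether any parser recognises the publisher's page, and the detected licence) are all hypothetical.

```python
from enum import Enum
from typing import Optional


class Outcome(Enum):
    """Hypothetical result categories for one resolved DOI."""
    LICENCE_FOUND = "licence found"
    SKIPPED_PDF = "skipped: DOI resolves directly to a PDF"
    TECHNICAL_FAILURE = "technical problem with the publisher's website"
    NO_LICENCE_TEXT = "publisher does not provide licence text"
    PAGE_NOT_UNDERSTOOD = "publisher's page could not be interpreted"


def classify(
    status_code: Optional[int],   # None if the request never completed
    content_type: Optional[str],  # Content-Type header of the final response
    page_recognised: bool,        # assumed flag: some parser knows this publisher
    licence: Optional[str],       # detected licence, if any
) -> Outcome:
    """Assign exactly one outcome to a lookup, checking the cheapest signals first."""
    if status_code is None or status_code >= 400:
        return Outcome.TECHNICAL_FAILURE
    if (content_type or "").lower().startswith("application/pdf"):
        # PDFs cannot currently be analysed, so they are set aside rather than
        # counted as failures.
        return Outcome.SKIPPED_PDF
    if not page_recognised:
        return Outcome.PAGE_NOT_UNDERSTOOD
    if licence is None:
        return Outcome.NO_LICENCE_TEXT
    return Outcome.LICENCE_FOUND
```

For example, `classify(200, "application/pdf", False, None)` returns `Outcome.SKIPPED_PDF`, while `classify(503, "text/html", True, None)` returns `Outcome.TECHNICAL_FAILURE`. In practice, telling "no licence text" apart from "page not understood" is the hard part; the `page_recognised` flag simply assumes that distinction can be made.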

Our next priorities, then, include:

  1. Make changes to the service to improve failure detection and overall success rate
  2. Continue to improve the client library so that it can operate at the scale required by the project
  3. Set up a full testing environment containing a full clone of DOAJ articles and journals
  4. Make a first-pass attempt at pulling bibliographic metadata from either Entrez or EuropePMC

In addition to this technical work, the Jisc Monitor Technical Team has continued to run fortnightly webinars as part of the user engagement, focused on four use cases.  A recording of the session is available here: 6th August Jisc Monitor Webinar.  The next webinar is scheduled for the 20th of August from 10:00-11:00.

