How Long Does It Take to Text-Mine 55,000 Publisher Web Pages?: A Technical Update from the 3rd of September 2014 Jisc Monitor Webinar from Richard Jones

Running up to the end of the first quarter, we have been working on detecting the licences of open access articles from the Directory of Open Access Journals (DOAJ). There are 1.7 million articles in the DOAJ, and approximately 50% have DOIs (and so are processable), so the scale of the task is considerable. Even if we constrain the data set to journals published in the UK, there are around 160,000 articles we may want to process. This poses a significant technical challenge: to detect the licence, we need to resolve each DOI, download the content, and mine the text for information, all of which is time-consuming and error-prone.

We are using the Open Article Gauge (OAG) service to perform the heavy lifting in this process. It has been developed by Cottage Labs with funding from PLOS to solve the problem of determining what licence end-users of articles are actually given when reading the full text (as opposed to what licence the publisher asserts, in their terms and conditions, that an article will have). Detailed diagrams of its internal workflow and API are available from the website via the links attached.

Its approach is to receive requests for lists of DOIs (or PubMed IDs) and check its cache for known licences associated with those identifiers. If it finds nothing in its database for an identifier, it launches a process to download the content and detect the licence conditions. For the user of the API, this means that any request will result in a combination of licence information for some identifiers and instructions to check back later for others. The process which actually detects licences runs asynchronously from the web application and employs a queueing system: new identifiers are added to the end of the queue and processed when they reach the front. Once a licence has been detected (or an error received), the result is written to the cache, ready for when the user of the API checks back for an update.
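The cache-then-queue flow described above can be sketched roughly as follows. This is a minimal illustration, not OAG's actual code: the names `cache`, `queue`, `lookup` and `run_worker`, the shape of the response, and the example DOIs are all assumptions made for the example, and the worker is run in-process here rather than asynchronously.

```python
from collections import deque

# Hypothetical in-memory stand-ins for OAG's cache and job queue.
cache = {}          # identifier -> licence record (or recorded error)
queue = deque()     # identifiers waiting for licence detection


def lookup(identifiers):
    """Split a request into known results and identifiers still being processed."""
    results, processing = {}, []
    for ident in identifiers:
        if ident in cache:
            results[ident] = cache[ident]
        else:
            if ident not in queue:      # avoid queueing the same identifier twice
                queue.append(ident)
            processing.append(ident)    # the client should check back for these
    return {"results": results, "processing": processing}


def run_worker(detect):
    """Drain the queue front to back, writing each outcome to the cache."""
    while queue:
        ident = queue.popleft()
        try:
            cache[ident] = {"licence": detect(ident)}
        except Exception as exc:        # errors are cached too, so clients stop waiting
            cache[ident] = {"error": str(exc)}


# Usage: one identifier is already cached, the other has to be queued.
cache["10.1000/known"] = {"licence": "cc-by"}
first = lookup(["10.1000/known", "10.1000/new"])
run_worker(lambda ident: "cc-by-nc")    # stand-in for download-and-detect
second = lookup(["10.1000/new"])
```

The first call returns a licence for the cached identifier and asks the client to check back for the other; after the worker has run, the second call finds the result in the cache.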

The consequence of this approach – which is necessary for the scale of data involved – is that the client which uses the OAG API must also be relatively sophisticated, and understand how best to send and re-send identifiers and to interpret the results.  To this end, we have written some software which batches the DOIs from DOAJ up into “jobs” (each consisting of a few thousand identifiers), and has rules for each job on how regularly to check for updates and how long to try before giving up.  We then trigger those jobs at sensible intervals (say, every 15 minutes), and leave them to run.  In the first instance we ran 100,000 identifiers in 50 jobs, and the system ran for 4 days before we shut it down again, retrieving 55,000 licence statements in the process.  We have also designed the system to be robust to being shut down (an important consideration for systems which can run for so long), so if we were to start it again today, it would go back and look for the 45,000 identifiers it hasn’t yet got.
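A client of this kind might look something like the sketch below. It is only an illustration under stated assumptions: `make_jobs`, `run_job`, the `store` dictionary and `fake_lookup` are hypothetical names, the response shape is invented for the example, and a real client would call the OAG API over HTTP, persist `store` to disk, and wait minutes rather than not at all between attempts.

```python
def make_jobs(dois, job_size):
    """Split the DOI list into fixed-size jobs."""
    return [dois[i:i + job_size] for i in range(0, len(dois), job_size)]


def run_job(job, lookup, store, max_attempts, wait):
    """Poll until every DOI in the job has a result or the attempts run out.

    `store` holds results between runs, so a restarted job only asks for
    the identifiers it has not yet resolved.
    """
    for attempt in range(max_attempts):
        pending = [doi for doi in job if doi not in store]
        if not pending:
            return True
        store.update(lookup(pending)["results"])
        if attempt < max_attempts - 1:
            wait()                      # e.g. time.sleep(15 * 60) in real use
    return all(doi in store for doi in job)


# Usage with a fake service that only answers the second time it is asked.
seen = set()


def fake_lookup(dois):
    results = {d: {"licence": "cc-by"} for d in dois if d in seen}
    seen.update(dois)
    return {"results": results,
            "processing": [d for d in dois if d not in results]}


dois = [f"10.1000/{n}" for n in range(5)]
jobs = make_jobs(dois, job_size=2)
store = {}
done = [run_job(job, fake_lookup, store, max_attempts=3, wait=lambda: None)
        for job in jobs]
```

Because each job filters out identifiers already in `store` before asking again, stopping and restarting the run only re-requests the identifiers that are still missing.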

In the process of running this huge batch of identifiers through OAG, we identified a number of critical issues that will need to be resolved in order to make the system scale to Jisc Monitor requirements:

  • Performance: the computational intensity of the jobs means that the server OAG operates on has to work very hard to get results.  We were achieving something like 10,000 to 15,000 lookups per day.  At that rate, the nearly one million DOIs in DOAJ alone would take roughly two to three months to process, so the system needs to operate faster.
  • Error handling: we found several interesting edge and corner cases where OAG didn’t adequately handle errors in licence detection, meaning some lookup jobs never completed.
  • Unnecessary downloads: some DOIs resolve to PDFs, and at present OAG only looks at text files (e.g. HTML).  Since PDFs are large, they take extra time to download, and once they have been downloaded they are of no use.  We need to add code to OAG to prevent it even attempting the download in future, which should significantly improve performance.
  • Disambiguation between different kinds of failure: it is a legitimate outcome of a licence detection that OAG may fail to detect a licence, and it will record that failure.  But there are a variety of reasons that licence detection may fail, and it would be useful for Jisc Monitor to be able to disambiguate between them.
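As an illustration of the last two points, one possible approach is to inspect the HTTP Content-Type header before fetching the body, and to record a distinct outcome code for each kind of failure. The sketch below is an assumption about how this could be done, not a description of OAG's implementation: the `Outcome` values, `TEXT_TYPES`, `is_minable` and `classify` are all hypothetical.

```python
from enum import Enum


class Outcome(Enum):
    """Hypothetical outcome codes; OAG's own vocabulary may differ."""
    LICENCE_FOUND = "licence_found"
    NO_LICENCE_IN_TEXT = "no_licence_in_text"
    UNSUPPORTED_FORMAT = "unsupported_format"
    RESOLUTION_FAILED = "resolution_failed"


# Media types assumed to be worth downloading and mining.
TEXT_TYPES = ("text/html", "application/xhtml+xml", "text/plain")


def is_minable(content_type):
    """True if the Content-Type header names a text format worth downloading."""
    media_type = content_type.split(";")[0].strip().lower()
    return media_type in TEXT_TYPES


def classify(status, content_type, licence):
    """Map what is known about a lookup to a single outcome code."""
    if status is None or status >= 400:
        return Outcome.RESOLUTION_FAILED      # DOI did not resolve to a page
    if not is_minable(content_type or ""):
        return Outcome.UNSUPPORTED_FORMAT     # e.g. a PDF: skip the download
    if licence is None:
        return Outcome.NO_LICENCE_IN_TEXT     # page read, nothing recognised
    return Outcome.LICENCE_FOUND


# Usage
pdf_result = classify(200, "application/pdf", None)
html_result = classify(200, "text/html; charset=utf-8", "cc-by")
```

With codes like these, a PDF that was never downloaded can be told apart from an HTML page that was read but contained no recognisable licence statement.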

We have created a milestone in the OAG issue tracker and have started work to address these issues.  The goal is to be able to run all of the DOIs from UK publications through OAG, which we can begin as soon as we have a system which is up to the task!

There are a number of directions we could now go in, beyond the technical development planned for OAG:

  • Add more detection features to OAG: as OAG is essentially a large asynchronous job queue, which loads plugins to perform tasks on content, we have the option to add more tasks or more plugins to detect other kinds of things about the content
  • Compare detected licences with asserted licences: in many cases publishers advertise their licence policy, and with detected licences from OAG, we can compare and report on whether publishers are providing the promised service, or if they are not sufficiently advertising it on individual articles
  • Add more data sources for identifiers: DOAJ is a strong OA dataset, but it could be of value to consider other datasets of hybrid journals, such as those at JournalTOCs
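The comparison of detected and asserted licences mentioned above could be as simple as the following sketch. The function name, the three report labels, and the example data are hypothetical; in practice licence names would probably need normalising before they can be compared.

```python
def compare_licences(asserted, detected):
    """Report, per DOI, whether the detected licence agrees with the asserted one."""
    report = {}
    for doi, expected in asserted.items():
        found = detected.get(doi)
        if found is None:
            report[doi] = "not detected"    # licence not advertised on the article
        elif found == expected:
            report[doi] = "match"
        else:
            report[doi] = "mismatch"
    return report


# Usage with made-up example data.
asserted = {"10.1000/a": "cc-by", "10.1000/b": "cc-by", "10.1000/c": "cc-by"}
detected = {"10.1000/a": "cc-by", "10.1000/b": "cc-by-nc"}
report = compare_licences(asserted, detected)
```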

For the remainder of this sprint, though, we will be finishing up the work on pulling together a dataset from DOAJ, enhancing it with records from EuropePMC and running everything from the UK through OAG, to produce an enhanced dataset for Jisc Monitor.

In addition to this technical work, the Jisc Monitor Technical Team has continued to run fortnightly webinars as part of the user engagement, focused on four use cases.  A recording of the 3rd of September session is available here:  The next Jisc Monitor workshop is scheduled for the 19th of September in London from 10:00-16:00. Registration details are available at  Follow-on webinars will then take place every two weeks in October.  Details will be forthcoming, but contact Frank Manista if you have any questions.
