Repository Fringe Day Two LiveBlog

Welcome back to day two of Repository Fringe 2015! For a second day we will be sharing most of the talks via our liveblog here, or you can join the conversation at #rfringe15. We are also taking images around the event and encourage you to share your own images, blog posts, etc. Just use the hashtag and/or let us know where to find them and we’ll make sure we link to your coverage, pictures and comments.

This is a liveblog and that means there will be a few spelling errors and may be a few corrections required. We welcome your comments and, if you do have any corrections or additional links, we encourage you to post them here. 

Integration – at the heart of  – Claire Knowles, University of Edinburgh; Steve Mackey, Arkivum

Steve is leading this session, which has been billed as “storage” but is really about integration.

We are a company, which came out of the University of Southampton, and our flagship Arkivum100 service has a 100% data integrity guarantee. We sign contracts in the long term, for 25 years – most cloud services sign yearly contracts. But we also have a data escrow exit – so there is a copy on tape that enables you to retrieve your data after you have left. It uses all open source encryption which means it can be decrypted as long as you have the key.

So why use a service like Arkivum for keeping data alive for 25+ years? Well things change all the time. We add media all the time, more or less continually… We do monthly checks and maintenance updates but also annual data retrieval and integrity checks. There are companies, Sky would be an example, that have a continual technology process in place – three parallel systems – for their media storage in order to keep up with technology. There is a 3-5 year obsolescence of services, operating systems and software so we will be refreshing hardware, and doing software and hardware migrations.

The Arkivum appliance is a CIFS/NFS presentation which means it integrates easily with local file systems. There is also a robust REST API. There is simple administration of user permissions, storage allocations etc. We have a GUI for file ingest status but also recovery pre-staging and security. There is also an ingest process triggered by timeout, checksum, change, manifest – we are keen that if anything changes you are triggered to check and archive the data before you potentially lose or remove your local copy.

So the service starts with original datasets and files, we take a copy for ingest, via the Arkivum Gateway on the appliance, we encrypt and also decrypt to check the process. We do checksums at all stages. Once all is checked it is validated and sent to our archive on the Janet Network, and it is also archived to a second archive and to the escrow copy on tape.
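[Liveblogger's aside: to make "checksums at all stages" concrete, here is a minimal sketch of the general idea in Python. It is not Arkivum's implementation. The function names, the staging directory and the archive directories are hypothetical stand-ins, and the encryption and decryption step is left out entirely. It only shows a file being copied hop by hop, with the SHA-256 digest compared after each hop.]

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large files do not fill memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def ingest_with_checks(source: Path, staging_dir: Path, archive_dirs: list[Path]) -> str:
    """Copy `source` to a staging area, then to every archive location.

    The checksum is recomputed after each copy and compared with the
    checksum of the original file. Directory names are hypothetical;
    a real service would also encrypt, and would write to remote storage.
    """
    expected = sha256_of(source)

    staged = staging_dir / source.name
    shutil.copy2(source, staged)
    if sha256_of(staged) != expected:
        raise ValueError(f"checksum mismatch after staging {source.name}")

    for archive_dir in archive_dirs:
        target = archive_dir / source.name
        shutil.copy2(staged, target)
        if sha256_of(target) != expected:
            raise ValueError(f"checksum mismatch in {archive_dir}")

    # Only once every copy has been verified is the ingest considered valid.
    return expected
```

[The point of the design is that the local copy should only be removed after every hop has been verified against the original digest.]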


We sit as the data vault, the storage layer within the bigger system which includes data repository, data asset register, and CRIS. Robin Taylor will be talking more about that bigger ecosystem.

We tend to think of data as existing in two overlapping cycles – live data and archive data. We tend to focus much more on the archive side of things, which relates to funder expectations. But there is often less focus on live data being generated by researchers – and those may be just as valuable and in need of securing as that archive data.

In a recent Jisc Research Data Spring report the concept of RDM Workflows is discussed. See “A consortial approach to building an integrated RDM system – “small and specialist”” and that specifically talks about examples of workflows, including researcher centric workflows that lay out the process for the researcher to take in their research data management. Examples in the report include those created for Loughborough and Southampton.

Loughborough have a CRIS, they have DSpace, and they use FigShare for data dissemination. You can see interactions in terms of the data flow are very complex [slides will be shared but until then I can confirm this is a very complex picture] and the intention of the workflow and process of integration is to make that process simpler and more transparent for the researcher.

So, why integrate? Well we want those processes to be simpler and easier to encourage adoption and also lower the cost of institutional support to the research base. It’s one thing to have a tick box, it’s another to get researchers to actually use it. We also, having been involved multiple times, have experience in the process of rolling RDM out – our work with ULCC on CHEST particularly helped us explore and develop approaches to this. So, we are checking quality and consistency in RDM across the research base. We are deploying RDM as a community driven shared service so that smaller institutions can “join forces” to benefit from having access to common RDM infrastructure.

So, in terms of integrations we work with customers with DSpace and EPrints, with customers using FigShare, and moving a little away from the repository and towards live research data we are also doing work around Sharegate (based on archiving Sharepoint), iRODS and QStar; and with ExLibris Rosetta and archivematica. We really have yet to see real preservation work being done with research data management but it’s coming, and archivematica is an established tool for preservation in the cultural heritage and museums sector.

Q1) Do you have any metrics on the files you are storing?

A1) Yes, you can generate reports from the API, or can access them via the GUI. The QStar tool, an HSM tool, allows you to do a full survey of the environment that will crawl your system and let you know about file age and storage etc. And you can do a simulation of what will happen.

Q2) Can I ask about the integration with EPrints?

A2) We are currently developing a new plugin which is being driven by new requirements for much larger datasets going into EPrints and linking through. But the work we have previously done with ULCC is open source. The Plugins for EPrints are open source. Some patches created were designed by @mire so a different process but after those have been funded they are willing for those to be open source.

Q3) When repositories were set up there was a real drive for the biggest repository possible, being sure that everyone would want the most storage possible… But that is also expensive… And it can take a long time to see uptake. Is there anything you can say that is helpful for planning and practical advice about getting a service in place to start with? To achieve something practical at a more modest scale.

A3) If you use managed services you can use as little as you want. If you build your own you tend to be into a fixed capital sum… That’s a level of staffing that requires a certain scale. We start at a few terabytes…

Comment – Frank, ULCC) We have a few customers who go for the smallest possible set up for trial and error type approach. Most customers go for the complete solution, then reassess after 6 months or a year… Good deal in terms of price point.

A3) The work with Jisc has been about looking at what those requirements are. From CHEST it is clear that not all organizations want to set up at the same scale.

Unfortunately our next speaker, Pauline Ward from EDINA, is unwell. In place of her presentation, “Are your files too big? (for upload / download)”, we will be hearing from Robin Taylor.

Data Vault – Robin Taylor 

This is a collaborative project with University of Manchester, funded by JISC Research Data Spring.

Some time back we purchased a lot of kit and space for researchers, giving each their own allocation. But in the researcher’s workflow, data is generated and goes into a repository, yet they are not sure what data to keep and make available, what might be useful again, and what they may be mandated to retain. So we wanted a way of storing that data, and that needed to have some sort of interface to enable that.

Edinburgh and Manchester had common scenarios, common usages. We are both dealing with big volumes of data, hundreds of thousands or even millions of files. It is impossible to use web interface mechanisms for upload. So we need that archiving to happen in the background.

So, our solution has been to establish the Data Vault User Interface, that interacts with a Data Vault Broker/Policy Engine that interacts with the Data Archive, and the Broker then interacts with the active storage. But we didn’t want to build something so bespoke that it didn’t integrate with other systems – the RSpace lab notebooks for instance. And it may be that the archive might be Arkivum, or might be Amazon, or might be tape… So our mechanism abstracts that in a way, to create a simple way for researchers to archive their data in a standard “bag-it” type way.
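[Liveblogger's aside: a rough sketch of what "abstracting the archive" behind a broker could look like. This is not the Data Vault project's code. The class and function names are hypothetical, the only backend shown is a local directory standing in for tape, Arkivum or cloud storage, and the packaging is a simplified BagIt-style layout: a data/ directory, a manifest-sha256.txt and a bagit.txt declaration.]

```python
import hashlib
import shutil
from abc import ABC, abstractmethod
from pathlib import Path


class ArchiveBackend(ABC):
    """Hypothetical interface: anything that can take a packaged bag."""

    @abstractmethod
    def store(self, bag_dir: Path) -> str:
        """Store the bag and return an identifier for later retrieval."""


class LocalDirectoryBackend(ArchiveBackend):
    """Stand-in backend that copies the bag into a local directory."""

    def __init__(self, root: Path) -> None:
        self.root = root

    def store(self, bag_dir: Path) -> str:
        target = self.root / bag_dir.name
        shutil.copytree(bag_dir, target)
        return str(target)


def make_bag(source_dir: Path, bag_dir: Path) -> Path:
    """Package a directory in a simplified BagIt-style layout."""
    data_dir = bag_dir / "data"
    shutil.copytree(source_dir, data_dir)  # also creates bag_dir itself

    lines = []
    for path in sorted(p for p in data_dir.rglob("*") if p.is_file()):
        checksum = hashlib.sha256(path.read_bytes()).hexdigest()
        lines.append(f"{checksum}  {path.relative_to(bag_dir).as_posix()}")
    (bag_dir / "manifest-sha256.txt").write_text(
        "\n".join(lines) + "\n", encoding="utf-8"
    )
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    return bag_dir


def archive(source_dir: Path, work_dir: Path, backend: ArchiveBackend) -> str:
    """Broker step: package the data, then hand it to whichever backend."""
    bag = make_bag(source_dir, work_dir / source_dir.name)
    return backend.store(bag)
```

[Because the broker only ever calls store(), swapping one archive for another would mean writing a new backend class rather than changing the researcher-facing workflow.]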

But we should say that preservation is not something we have been looking at. Format migration isn’t realistic at this point. The scale at which we are receiving data doesn’t make that work practical. But, more important, we don’t know what the suitable time is, so instead we are storing that data for the required period of time and then seeing what happens.

Q1) You mentioned you are not looking at preservation at the moment?

A1) It’s not on our to do list at the moment. The obvious comparison would be with archivematica which enables data to be stored in more sustainable formats, because we don’t know what those formats will be… That company have a specific list of formats they can deal with… That’s not everything. But that’s not to say that down the line that won’t be something we want to look at, it’s just not what we are looking at at the moment. We are addressing researchers’ need to store data on a long term basis.

Q1) I ask because what accessibility means is important here.

A1) On the live data, which is published, there is more onus on the institution to ensure that is available. This holds back wider data collections.

Q2) Has anyone ever asked for their files back?

A2) There is discussion ongoing about what a backup is, what an archive is etc. Some use this as backup but that is not what this should be. This is about data that needs to be more secure but maybe doesn’t need to have instant access – which it may not be. We have people asking about archiving, but we haven’t had requests for data back. The other question is do we delete stuff that we have archived – the researchers are best placed to do that so we will find out in due course how that works.

Q3) Is there a limit on the live versus the archive storage?

A3) Yes, every researcher has a limited quantity of active storage, but a research group can also buy extra storage if needed.  But the more you work with quotas, the more complex this gets.

Comment) I would imagine that if you charge for something, the use might be more thoughtful.

A3) There is a school of thought that charging means that usage won’t be endless, it will be more thoughtful.

Repositories Unleashing Data! Who else could be using your data? – Graham Steel, ContentMine/Open Knowledge Scotland

Graham is wearing his “ContentMine: the right to read is the right to mine!” T-shirt for his talk…

As Graham said (via social media) in advance of his  talk, his slides are online.

I am going to briefly talk about Open Data… And I thought I would start with a wee review of when I was at Repository Fringe 2011 and I learned that not all content in repositories was open access, which I was shocked by! Things have gotten better apparently, but we are still talking about Gold versus Green even in Open Access.

In terms of sharing data and information generally many of you will know about PubMed.

A few years ago a blogger, diabetes researcher and regular tweeter, Jo Brodie, asked why PubMed didn’t have social media sharing buttons. I crowd-sourced opinion on the issue and sent the results off to David Lipman (who I know well) who is in overall charge of NCBI/PubMed. And David said “what’s social media?”.

It took about a year and three follow ups but PubMed Central added a Twitter button and by July 2014, sharing buttons were in place…

Information wants to be out there, but we have various ways in which we stop that – geographical and license restricted streaming of video for instance.

The late great Jean-Claude Bradley saw science as heading towards being led by machines… This slide is about 7 years old now but I sense matters have progressed since then!


But at times, it is still not that easy to access or mine data – some publishers charge £35 to mine each “free” article – a ridiculous cost for what should be a core function.

The Open Knowledge Foundation has been working with open data since 2004… [cue many daft pictures of Data from Star Trek: The Next Generation!].


We also have many open data repositories, figshare (which now has just under 2 million uploads), etc. Two weeks back I didn’t even realize many universities have data repositories but we also want Repositories Unleashing Data Everywhere [RUDE!] and we also have the new initiative, the Radical Librarians Collective…


Les Carr, University of Southampton


The Budapest Open Access Initiative kind of kicked us off about ten years ago. Down in Southampton, we’ve been very involved in Open Government Data and those have many common areas of concern about transparency, sharing value, etc.

And we now have which enables the sharing of data that has been collected by government. And at Southampton we have also been involved recently in understanding the data, the infrastructure, activities, equipment of academia by setting up That is a national aggregator that collects information from open data on every institution… So, if you need data on, e.g. on DNA and associated equipment, who to contact to use it etc.

This is made possible as institutions are trying to put together data on their own assets, made available as institutional open data in standard ways that can be automatically scraped. We make building info available openly, for instance, about energy uses, services available, cafes, etc. Why? Who will use this? Well this is the whole thing of other people knowing better than you what you should do with your data. So, students in our computer science department for instance, looked at building recommended route apps, e.g. between lectures. Also the cross with catering facilities – e.g. “nearest caffeine” apps! It sounds ridiculous but students really value that. And we can cross city bus data with timetables with UK Food Hygiene levels – so you can find where to get which bus to which pub to an event etc. And campus maps too!

Now we have a world of Open Platforms – we have the internet, the web, etc. But not Google – definitely not open. So… Why are closed systems bad? Well we need to move from Knowledge, to Comprehension, to Application, to Analysis, to Synthesis and to Evaluation. We have repositories at the bottom – that’s knowledge, but we are all running about worrying about REF2020 but that is about evaluation – who knows what, where is that thing, what difference does that make…

So to finish I thought I’d go to the Fringe website and this year it’s looking great – and quite like a repository! This year they include the tweets, the discussion, etc. all in one place. Repositories can learn from the Fringe. Loads of small companies desperate for attention, and a few small companies who aren’t bothered at all, they know they will find their audience.

Jisc on Repositories unleashing data – Daniela Duca, Jisc


I work in the research team at Jisc and we are trying to support universities in their core business and help make the research process more productive. And I will talk about two projects in this area: the UK Research Data Discovery Service and, second, Research Data Usage and Metrics.

The Research Data Discovery Service (RDDS) is about making data more discoverable. This is a project which is halfway through and is with UK Data Archive and the DCC. We want to move from a pilot to a service that makes research data more discoverable.

In Phase 1 we had the pilot to evaluate Research Data Australia, developed by ANDS, with contributions from UK Data Archive, Archeology data centre, and NERC data centres. In Phase 2 Jisc, with support from DCC and UKDA, is funding 9 more institutions to trial this service.

The second project, Research Data Usage and Metrics comes out of an interest in the spread of academic work, and in the effectiveness of data management systems and processes. We are trying to assess use and demand for metrics and we will develop a proof of concept tool using IRUS. We will be contributing to and drawing upon a wide range of international standards.

And, with that, we are dispersing into 5 super fast 17 minute breakout groups which we hope will add their comments/notes here – keep an eye on those tweets (#rfringe15) as well!

We will be back on the blog at 11.15 am after the breakouts, then coffee and a demo of DMAOnline – Hardy Schwamm, Lancaster University.

And we are back, with William Nixon (University of Glasgow) chairing, and he is updating our schedule for this afternoon which sees our afternoon coffee break shortened to 15 minutes.

Linking Data – Neil Chue Hong, Software Sustainability Institute; Rachael Kotarski, Project THOR

Neil and I will be talking about some work we have been doing on linking research outputs. I am based at the British Library working as part of a team working on research outputs.

Rachael: Research is represented by many outputs. Articles are some of the easier to recognise outputs but what about samples, data, objects emerging from research – they could be 100s of things… Data citation enables reproducibility – if you don’t have the right citation, and the right information, you can’t reproduce that work.

Citation also enables acknowledgement, for instance of historical data sets and longitudinal research over many years which is proving useful in unexpected ways.

Data citation does also raise authorship issues though. A one line citation with a link is not necessarily enough. So some of the work at DataCite and the British Library has been around linking data and research objects to authors and people, with use of ORCID alongside DOIs, URLs, etc. Linking a wide range of people and objects together.
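[Liveblogger's aside: a small sketch of what linking an output to a person might look like in code. The OutputLink record, the role value and the example DOI are hypothetical and are not DataCite's or ORCID's data model. The check digit calculation is the ISO 7064 MOD 11-2 scheme that ORCID documents for its identifiers, so a mistyped ORCID iD can be caught before a link is recorded.]

```python
from dataclasses import dataclass


def orcid_check_digit(base_digits: str) -> str:
    """Compute the final character of an ORCID iD from its first 15 digits
    using ISO 7064 MOD 11-2."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid: str) -> bool:
    """Check the shape and the check digit of an iD such as 0000-0002-1825-0097."""
    compact = orcid.replace("-", "")
    if len(compact) != 16:
        return False
    base = compact[:15]
    if not (base.isascii() and base.isdigit()):
        return False
    return orcid_check_digit(base) == compact[15].upper()


@dataclass(frozen=True)
class OutputLink:
    """Hypothetical record tying one research output to one person."""
    doi: str    # identifier of the dataset, software or article
    orcid: str  # identifier of the person
    role: str   # free-text role, e.g. "data collector"


def link_output(doi: str, orcid: str, role: str) -> OutputLink:
    """Create a link only if the ORCID iD passes its checksum."""
    if not is_valid_orcid(orcid):
        raise ValueError(f"not a valid ORCID iD: {orcid}")
    return OutputLink(doi=doi, orcid=orcid, role=role)
```

[For example, link_output("10.1234/example-dataset", "0000-0002-1825-0097", "data collector") would be accepted, where the DOI is made up and the iD is the sample one used in ORCID's own documentation.]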

THOR is a project, which started in June, on Technical and Human Infrastructure. This is more work on research objects, subject areas, funders, organisations… really broadening the scope of what should be combined and linked together here.

So the first area here is in research – and understanding the gaps there, and how those can be addressed. And bringing funders into that. And we are also looking at integration of services etc. So one thing we did in ODIN and are bringing into THOR is about connecting your ISNI identifier to ORCID, so that there is a relationship there, so that data stays up to date. And the next part of the work is on Outreach – work on bootcamps, webinars, etc. to enable you to feed into the research work as well. And, finally, we will be looking at Sustainability, looking at how what we are doing can be self-funded beyond the end of the project, through memberships of partner organisations: CERN, DataCite, ORCID, DRYAD, EMBL-EBI, ANDS, PLoS, Elsevier Labs, PANGAEA. This is an EU funded project but it has international scope and an international infrastructure.

So we want to hear about what the issues are for you. Talk to us, let us know.

Linking Software: citations, roles, references and more – Neil Chue Hong

Rachael gave you the overview, I’m going into some of the detail for software, my area. So we know that software is part of the research lifecycle. That lifecycle relies on the ability to attribute and credit things and that can go a bit wrong for software. That’s because our process is a little odd… We start research, we write software, we use software, we produce results, and we publish research papers. Now if we are good we may mention the software. We might release data or software after publication… rather than before…

A better process might be to start the research, to identify existing software, we might adapt or extend software, release software (maybe even publish a software paper), use software, produce results, might release data and publish a data paper, and then we publish the research paper. That’s great but also more complex. Right now we use software and data papers as proxies for sharing our process.

But software is not that simple, the boundaries can be blurry… Is it the workflow, the software that runs the workflow, the software that references the workflow, the software that supports the software that references the workflow, etc? What’s the useful part? Where should the DOI be for instance? It is currently at programme level but is that the right granularity? Should it be at algorithm level? At library level? Software has the concept of versioning – I’d love our research to be versioned rather than “final” but that’s a whole other talk! But the versioning concept indicates a change, it allows change… but again how do we decide on when that version occurs?

And software also has the problem of authorship – which authors have had what impact on each version of the software? Who has the largest contribution to the scientific results in a paper? So for a project I might make the most edits to a code repository – all about updating the license – so the biggest contribution but would the research community agree? Perhaps not. Now I used to give this talk and say “this is why software is nothing like data” but now I say “this is why software is exactly like data”!

So, there are different things happening now to link together these bits and pieces. GitHub, Zenodo, FigShare and institutional repositories have looked at “package level” one click deposit, with a citable DOI. There has been work around SWORD deposit which Stuart Lewis has been looking at too. So you can now archive software easily, but that’s the easy part – it’s the social side that needs dealing with. So, there is a brand new working group led by Force11 on Software Citation – do get involved.

And there are projects for making the roles of authors/contributors clearer: Project Credit is looking at Gold/Silver/Bronze levels. But Contributor Badges is looking at more granular recognition. And we also have work on code as a research object, and a project codemeta that is looking at defining minimal metadata.

So that brings us to the role of repositories and the Repository Fringe community. Imperial College for instance is looking at those standards and how to include them in repositories. And that leads me to my question to you: how does the repository community support this sort of linkage?

Q1 – William Nixon) Looking at principles of citation… how do you come up with those principles?

A1 – Neil) Force11 has come up with those citation principles and those are being shared with the community… But all communities are different. So it is easy to get high level agreement, but it is hard to get agreement at implementation details. So authorship changes over time, and changes version to version. So when we create principles for citation do we create all collectively and equally, or do we go the complex route of acknowledging specific individual contributions for a particular version. This causes huge debate and controversy in the open source community about who has the appropriate credit etc. For me, what do we need to deposit? Some information might be useful later in reward lifecycle…. But if I’m lead author will that be a priority here?

Q2 – Paul Walk) My internal hippie says that altruism and public good comes into open source software and I wonder if we are at risk of messing with that sordid research system…

A2 – Neil) I would rebut that: most open source contribution and development is not altruistic. It is people being rewarded in some way – because doing things open source gives them more back than working alone. I wouldn’t say altruism is the driving force or at least hasn’t been for some time. It’s already part of that research type system.

Comment) For me this is such a high level of where we are, you are talking about how we recognise contribution, citation etc. but just getting things deposited is the issue for me right now… I’d love to find out more about this but just convincing management to pay for ORCID IDs for all is an issue even…

A2 – Rachael) We do need to get work out about this; showing how researchers have done this and the value of it will help. It may not just be through institutions but through academic societies etc. as well.

A2 – Neil) And this is back to the social dimension and thinking about what will motivate people to deposit. And they may take notice of editors… Sharing software can positively impact citations and that will help. Releasing software in the image processing community for instance also shows citations increase – and that can be really motivating. And then there is the economic impact for universities – is there a way we can create studies to show positive reputation and economic impacts on the institution that will prove the benefit for them.

Q3) A simple question – there are many potential solutions for software data… but will we see any benefits from them until we see REF changing to value software and data to the same extent as other outputs?

A3 – Neil) I think we are seeing a change coming. It won’t be about software being valued as much as papers. It will be about credit for the right person so that they are valued. What I have seen in research council meetings is that they recognise that other outputs are important. But in a research project credit tends to go to the original writer of a new algorithm perhaps, not the developer who has made substantial changes. So where credit goes matters – the user, implementer, contributor, originator, etc? If I don’t think I will get suitable credit then where is the motivation for me to deposit my software?

EC Open Data Pilot, EUDAT, OpenAIRE, FOSTER and PASTEUR4OA – Martin Donnelly, Digital Curation Centre

I was challenged yesterday by Rachael and by Daniela at Jisc to do my presentation in the form of a poem…

There once was a man from Glasgee
Who studied data policy
In a project called FOSTER
Many long hours lost were
And now he’ll show some slides to ye…

So I will be talking about four European funded projects on research data management and open access that are all part of Horizon 2020. Many of you will be part of Horizon 2020 consortia, or will be supporting researchers who are. It is useful to remind ourselves of the context by which these came about…

Open Science is situated within a context of ever greater transparency, accessibility and accountability. It is both a bottom up issue: the OA concept was coined about 10 years back in Budapest and was led by the high energy physics community who wanted to be more open in sharing their work, and to do so more quickly. And it has also been driven from the top through government/funder support, increasing public and commercial engagement in research, to ensure better take up and use of research that has been invested in.

Policy wise in the UK the RCUK has seven Common Principles on Data Policy, and six of the RCUK funders require data management plans. That is fitting into wider international policy moves. Indeed if you thought the four year EPSRC embargo timeline was tight, South Africa just introduced a no more than 12 month requirement.

Open Access was a pilot in FP7, which ran from August 2008 until the end of FP7 in 2013. It covered parts of FP7, but it covers all of FP8/Horizon 2020 although that is a pilot process intended to be mainstream by FP9 or whatever it is known by. The EC sees real economic benefit to OA by supporting SMEs and NGOs that can’t afford subscriptions to the latest research. Alma Swan and colleagues have written on the opportunity costs, which provides useful context to the difference Open Access can make.

Any project with H2020 funding has to make any peer-reviewed journal article it publishes openly available, free of charge, via a repository – regardless of how they publish and whether green or gold OA.

H2020 also features an Open Research Data pilot – likely to be a requirement by FP9. It applies to data and metadata needed to validate scientific results which should be deposited in a dedicated data repository. Interestingly, whilst data management plans need to be created 6 months into the project, and towards the end, they don’t require them to be filed with the EU at the outset.

So, lastly, I want to talk about four projects funded by the EU.

Pasteur4OA aims to simplify OA mandates across the EU – so that funders don’t have conflicting policy issues. That means it is a complex technical and diplomatic process.

OpenAIRE aims to promote use and reuse of outputs from EU funded research.

EUDAT offers common data services through geographically distributed resilient network of 35 European organisations. Jisc and DCC are both working on this, integrating the DCC’s DMP Online tool into those services.

The FOSTER project is supporting different stakeholders, especially younger researchers, in adopting open access in the context of the European Research Area and making them aware of the H2020 requirements of them – with a big carrot and a small stick in a way. We want researchers to integrate open access principles and practice in their current research workflow – rather than asking them to change their way of working entirely. We are doing train the trainer type activities in this area and also facilitating adoption and reinforcement of OA policies within and beyond the EC. FOSTER is doing this work through various methods, including identifying existing content that can be reused, repackaged, etc.

Jisc Workshop on Research Data Management and Research at Risk Activities, and Shared Services – Rachel Bruce, Daniela Duca, Linda Naughton, Jisc

Rachel is leading this session…

This is really a discussion session but I will start by giving you a very quick overview of some of the work in Research at Risk as well. But this is a fluid session – we are happy to accommodate other topics that you might want to talk about. While we give you a quick overview do think about an RDM challenge topic you might want to take the chance to talk about.

So, in terms of Research at Risk this is a co-design challenge. This is a process we take forward in Jisc for research and development, or just the development end of the spectrum, but to address sector challenges. The challenge facing the sector here is about the fragmented approach to research data and infrastructure. Because of that we are probably not reaching all the goals we would wish to. Some of that relates quite closely to some of what David Prosser was saying yesterday about open access and the benefits of scale and shared services. So, we have been asked to address those issues in Research at Risk.

Within Research at Risk we have a range of activities, one of the biggest is about shared services, including in the preservation and curation gap. You have already heard about discovery and research data usage, also the Research Data Spring.

So, the challenges we want to discuss with you are:

  1. The Shared services for RDM – yesterday there was discussion around the SHERPA services for instance. (Rachel will lead this discussion)
  2. Journal research data policy registry (Linda will lead this session)
  3. Business case and funding for RDM – articulating the role of RDM (Daniela will lead this session)
  4. But also anything else you may want to discuss… (Varsha will lead this group discussion)

So, Shared Services… This is an architecture diagram we have put together to depict all of the key services to support a complete data management service, but also linking to national and international services. And I should credit Stuart Lewis at UoE and John Lewis (Sheffield?) who had done much of this mapping already. We have also undertaken a survey of repositories around potential needs of HEIs. Some responses were around a possible national data repository; there was a call for Jisc to work with funders on data storage requirements for them to provide a suitable discipline specific data storage mandate.

Linda: I will talk a bit about the Journal Research Data Policies Registry – you can find out more on our blog and website. We want to create a registry that allows us to turn back time to see what we can learn from OA practices. The aim is to develop best practice on journal policies between publishers and other stakeholders. We want to know what might make your life easier in terms of policies, and navigating research data policies. And that input into this early stage work would be very valuable.

Daniela: The business case and costings for RDM is at a very early stage but we are looking at an agreed set of guidance for the case for RDM and for costing information to support the business case in HEIs for research data management. This reflects the fact that currently approaches to funding RDM services and infrastructure vary hugely, and uncertainty remains… And I would like to talk to you about this.

Rachel: we thought we would have these discussions in groups and we will take notes on the discussions as they take place, and we will share this on our blog. We also want you to write down – on those big post it notes – the one main challenge that you think needs to be addressed which we will also take away.

So, the blog will be going quiet again for a while but we’ll try and tweet highlights from groups, and grab some images of these discussions. As Rachel has said there will also be notes going up on the Jisc Research at Risk blog after today capturing discussions… 

Cue a short pause for lunch, where there will also be a demo taking place from: DMPonline – Mary Donaldson and Mick Eadie, University of Glasgow.

Our first talk of this afternoon, introduced by William Nixon, is:

Unlocking Thesis Data – Stephen Grace, University of East London

This project is for several different audiences. For students it is about bridging to the norms of being a career researcher: visibility and citations. It helps them to understand the scholarly communication norms that are becoming the reality of the world. But this also benefits funders, researchers, etc.

We undertook a survey and found several institutions already assigning DOIs to theses, and others looking to do more in this area. We also undertook case studies in six institutions, to help us better understand what the processes actually are. So our case studies were for University of East London; University of Southampton; LSE; UAL; University of Bristol; and University of Leicester. Really interesting to see the systems in place.

We undertook test creation of thesis DOIs with University of East London and University of Glasgow, and University of Southampton undertook this via an XML upload so a slightly more complex process. In theory all of that was quite straightforward. We were grateful for the Jisc funding for that three month project, it didn’t get continuation funding but we are keen to understand how this can happen in more institutions and to explore other questions: for instance how does research data relate to the theses, what is its role, is it part of the thesis, a related object etc?

So questions we have are: What systems would you use and can they create/use persistent identifiers? Guidance on what could/should/must be deposited? One record or more? Opportunities for efficiencies?

On the issue of one record or more, a Thesis we deposited at UEL was a multimedia thesis, about film making and relating to making two documentary films – they were deposited under their own DOIs. Is that a good thing or a bad thing? Is that flexibility good?

Efficiencies could be possible around cataloguing theses – that can be a repeated process for the repository copy and for the library’s copy and those seem like they should be joined up processes.

We would love your questions and comments and you can find all project outputs.

Q1) What is the funder requirement on data being deposited with theses?

A1) If students are funded by research councils, they will have expectations regardless of whether the thesis is completed.

Q2) Have you had any feedback from the (completed) students whose work has been deposited on how they have found this?

A2) I have had feedback from the student who had deposited that work on documentary films. She said as a documentary film maker there are fewer and fewer ways to exhibit those documentary films. As a non commercial filmmaker seeing her work out there and available is important and this acts as an archive and as a measure of feedback that she appreciates.

Q3) On assigning ORCID IDs to students – I struggle to think of why that would be an issue?

A3) Theoretically there is no issue, we should be encouraging it.

Comment: Sometimes where there is a need to apply an embargo to a thesis because it contains content in which a publisher has copyright – it may be useful to have a DOI for the thesis and separate DOIs for the data, so that the data can be released prior to the thesis being released from embargo. [Many thanks to Philippa Stirlini for providing this edit via the comments (below)].
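The arrangement suggested in this comment, one DOI for the thesis and separate DOIs for its data, each with its own embargo, might be sketched as below. This is only an illustration under stated assumptions: the DOIs are made-up placeholders, and the `Record` fields and the `released_on` helper are hypothetical rather than any repository's actual data model. The `supplement_to` field loosely mirrors the `IsSupplementTo` relation type in the DataCite metadata schema.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Record:
    doi: str                               # placeholder DOI, not a real identifier
    title: str
    embargo_until: Optional[date] = None   # None means no embargo applies
    supplement_to: Optional[str] = None    # DOI of the thesis this data supports


def released_on(records, today):
    """Return the DOIs of records that are openly available on a given day."""
    return [r.doi for r in records
            if r.embargo_until is None or r.embargo_until <= today]


# Hypothetical example: the thesis is embargoed, its data is not.
thesis = Record("10.5072/example-thesis", "Example thesis",
                embargo_until=date(2017, 8, 1))
data = Record("10.5072/example-data", "Example thesis data",
              supplement_to=thesis.doi)

print(released_on([thesis, data], date(2015, 8, 4)))
```

Because each object carries its own identifier and embargo date, the data can be cited and released while the thesis itself stays closed.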

IRUS UK – Jo Alcock, IRUS UK

We are a national aggregation service for any UK Institutional Repositories which collects usage statistics. That includes raw download data from UK IRs for all item types within repositories. And it processes raw data into COUNTER compliant statistics. And that aggregation – of 87 IRs – enables you to get a different picture than just looking at your own repository.

IRUS-UK is funded by Jisc. Jisc project and service manage IRUS-UK and host it. Cranfield University undertake development and Evidence Base at Birmingham City University undertake user engagement and evaluation.

Behind the scenes IRUS-UK is a small piece of code that can be added to repository software and which employs the “Tracker Protocol”. We have plug-ins for EPrints, patches for DSpace, and implementation guidelines for Fedora. It gathers basic data for each download and sends it to the IRUS-UK server. The reports are Report 1 and Report 4 COUNTER compliant. We also have an API and SUSHI-like service.
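As a rough sketch of the idea of "gather basic data for each download and send it to a central server", the notification could be built as a URL with query parameters. Everything here is an assumption made for illustration: the endpoint, the parameter names, and the `build_tracker_url` helper are hypothetical and are not taken from the actual Tracker Protocol specification.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_tracker_url(tracker_base, item_id, file_url, requester_ip,
                      user_agent, requested_at):
    """Build a hypothetical 'download happened' notification URL."""
    params = {
        "item_id": item_id,            # identifier of the repository item
        "file_url": file_url,          # the file that was downloaded
        "requester_ip": requester_ip,  # who asked for it
        "user_agent": user_agent,      # lets the server screen out robots
        "requested_at": requested_at,  # when the download happened
    }
    return f"{tracker_base}?{urlencode(params)}"


url = build_tracker_url(
    "https://tracker.example.org/log",
    "oai:repository.example.ac.uk:1234",
    "https://repository.example.ac.uk/1234/1/thesis.pdf",
    "192.0.2.10",
    "ExampleBrowser/1.0",
    "2015-08-04T14:30:00Z",
)
print(parse_qs(urlsplit(url).query)["item_id"])
```

In a design like this the repository only reports raw events; filtering and counting happen centrally, so every participating repository is measured in the same way.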

At present we have around 400k items covered by IRUS-UK. There are a number of different reports – and lots of ways to filter the data. One thing we have changed this year is that we have combined some of these related reports, but we have added a screen that enables you to filter the information. Repository Report 1 enables you to look across all repositories by month – you can view or export as Excel or CSV.

As repositories you are probably more concerned with the Item Report 1 which enables you to see the number of successful item download requests by Month and Repository Identifier. You can look at Item Statistics both in tabular and graphical form. You can see, for instance, spikes in traffic that may warrant further investigation – a citation, a news article etc. Again you can export this data.
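The shape of a report like this, successful downloads per item per month per repository, can be illustrated with a few lines of counting. This is a simplified sketch with invented event data and a hypothetical `monthly_item_downloads` helper; real COUNTER processing applies further rules that are not modelled here.

```python
from collections import Counter
from datetime import datetime


def monthly_item_downloads(events):
    """Count successful downloads per (repository, item, 'YYYY-MM')."""
    counts = Counter()
    for repository_id, item_id, when, status in events:
        if status == 200:  # only successful requests are counted
            counts[(repository_id, item_id, when.strftime("%Y-%m"))] += 1
    return counts


# Invented raw events: (repository, item, timestamp, HTTP status)
events = [
    ("repo-a", "item-1", datetime(2015, 7, 30, 9, 0), 200),
    ("repo-a", "item-1", datetime(2015, 8, 1, 10, 0), 200),
    ("repo-a", "item-1", datetime(2015, 8, 2, 11, 0), 404),
    ("repo-b", "item-9", datetime(2015, 8, 3, 12, 0), 200),
]

for key, total in sorted(monthly_item_downloads(events).items()):
    print(key, total)
```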

You can also access IRUS-UK Item Statistics which enable you to get a (very colourful) view of how that work is being referenced – blogged, tweeted, cited, etc.

We also have a Journal Report 1 – that allows you to see anything downloaded from a given journal within the IRUS-UK community. You can view the articles, and see all of the repositories that article is in. So you can compare performance between repositories for instance.

We have also spent quite a lot of time looking at how people use IRUS-UK. We undertook a number of use cases around the provision of standards based, reliable repository statistics; reporting to institutional managers; reporting to researchers; benchmarking; and also for supporting advocacy. We have a number of people using IRUS-UK as a way to promote the repository, but also some encouraging competition through newsletters etc. And you can find out more about all of these use cases from a recent webinar that is available on our website.

So, what are the future priorities for IRUS? We want to increase the number of participating repositories in IRUS-UK. We want to implement the IRUS tracker for other repository and CRIS software. We want to expand views of data and reports in response to user requirements – for instance potentially alt metrics etc. We also want to include supplementary data and do more international engagement.

If you want to contact us our website is; email; tweet @IRUSNEWS.

Q1) Are the IRUS-UK statistics open?

A1) They are all available via a UK Federation login. There is no reason they could not technically be shared… We have a community advisory group that have recently raised this so it is under discussion.

Q2) How do data repositories fit in, especially for text mining and data dumps?

A2) We have already got one data repository in IRUS-UK but we will likely need different reporting to reflect the very different ways those are used.

Q3) If a data set has more than one file, is that multiple downloads?

A3) Yes.

Q3) Could that be fixed?

A3) Yes, we are looking at separate reporting for data repositories for just this sort of reason.
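To make the difference concrete, here is a small sketch contrasting file-level counting (every file is a download) with one possible dataset-level count. The grouping rule used, one count per dataset per requester per day, is purely an assumption for illustration and not how IRUS-UK reports or plans to report data downloads.

```python
from collections import Counter


def count_downloads(events):
    """Return (file_level, dataset_level) download counts per dataset."""
    file_level = Counter()
    seen = set()
    for dataset_id, file_name, requester, day in events:
        file_level[dataset_id] += 1          # every file request counts
        seen.add((dataset_id, requester, day))  # collapse files fetched together
    dataset_level = Counter(dataset_id for dataset_id, _, _ in seen)
    return file_level, dataset_level


# Invented events: (dataset, file, requester, day)
events = [
    ("ds1", "a.csv", "user-1", "2015-08-04"),
    ("ds1", "b.csv", "user-1", "2015-08-04"),
    ("ds1", "a.csv", "user-2", "2015-08-04"),
]

file_level, dataset_level = count_downloads(events)
print(file_level["ds1"], dataset_level["ds1"])
```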

Sadly Yvonne Howard, University of Southampton, is unable to join us today due to unforeseen circumstances so her session, Educational Resources, will not be going ahead. Also the Developer Challenge has not been active so we will not have the Developer Challenge Feedback session that Paul Walk was to lead. On which note we continue our rejigged schedule…

Recording impact of research on your repository (not impact factors but impact in REF sense!) – Mick Eadie & Rose-Marie Barbeau, University of Glasgow

Rose-Marie: Impact is my baby. I joined Glasgow specifically to address impact and the case studies. The main thing you need to know about the impact agenda is that all of our researchers are really stressed about it. Our operating landscape has changed, and all we have heard is that it will be worth even more in future REFs. So, we don’t “do” impact, but we are about ensuring our researchers are engaging with users and measuring and recording impact. So we are doing a lot of bridging work, around that breadcrumb trail that explains how your research made it into, e.g. a policy document…

So we have a picture on our wall that outlines that sort of impact path… showing the complexity and pathways around impact. And yet even this [complex] picture appears very simple, reality is far more complicated… When I talk to academics they find that path difficult: they know what they do, they know what they have to show… so I have to help them understand how they may have multiple impacts, and that impact might come by quite a circuitous route. So for instance a piece of archaeological work impacted policy, made Time Team, impacted the local community… Huge impact, extensive international news coverage… But this is the form for REF processes…

But my big message to researchers is that everything has changed: we need them to engage for impact and we take that work seriously. It’s easy to say you spoke to schools, to be part of the science festival. We want to capture what these academics are doing here professionally, things they may not think to show. And we want that visible on their public profile for example. And we want to know where to target support, where impact might emerge for the next REF.

So, I looked at other examples of how to capture evidence. Post REF a multitude of companies were offering solutions to universities struggling to adapt to the impact agenda. And the Jisc/Coventry-led project establishing some key principles for academic buy in – that it needed to be simple and very flexible – was very useful.

And so… Over to the library…

Mick: So Rose-Marie was looking for our help to capture some of this stuff. We thought EPrints might be useful to capture this stuff. It was already being used and our research admin staff were also quite familiar with the system, as are some of our academics. We also had experience of customising EPrints. And we have therefore added a workflow for Knowledge Exchange and Impact. We wanted this to be pretty simple – you can either share “activity” or “evidence”. There are a few other required fields, one of which is whether this should be a public record or not.

So, when an activity/evidence is added the lead academics can be included, as can any collaborating staff. The activity details follow the REF vocabulary. We include potential impact areas for instance… And we’d like for that record to be linked to other university systems. But we are still testing this with research admin staff.

We still have a few things to do… A Summary page; some reporting searching and browsing functionality – which should be quite easy; link to other university systems (staff profiles etc); and we would like to share this with the EPrints community.

Q1) What about copyright?

A1 – Rose-Marie) Some people do already upload articles etc. as they appear. The evidence repository is hidden away – to make life easier in preparing for the next REF – but the activity is shared more publicly. Evidence is

Q2 – Les) It’s great to hear someone talking about impact in a passionate and enthusiastic way! There is something really interesting in what you are doing and the intersection with preservation… In the last REF there was evidence lost that had been on the web. If you just have names and URLs, that won’t help you at the end of the day.

A2 – Rose-Marie) Yes, lack of institutional memory was the biggest issue in the last REF. I speak a lot to individuals and they are very concerned about that sort of data loss. So if we could persuade them to note things down it would jog memories and get them in that habit. If they note disappearing URLs that could be an issue, but also I will scan everything uploaded because I want to know what is going up there, to understand the pitfalls. And that lets me build on experience in the last REF. It’s a learning process. We also need to understand the size of storage we need – if everyone uploads every policy document, video etc. it will get big fast. But we do have a news service and our media team are aware of what we are doing, and we are trying to work with them. Chronological press listings from that media team aren’t the data structure we would hope for so we are working on this.

William) I think it is exciting! As well we don’t think it’s perfect – we just need to get started and then refine and develop that! Impact did much better than expected in the last REF, and if you can do that enthusiastically and engagingly that is really helpful.

A2 – Rose Marie) And if I can get this all onto one screen that would be brilliant. If anyone has any questions, we’d love to hear them!

Impact and Kolola – Will Fyson, University of Southampton

I work for EPrints Services but I also work for Kolola, a company I established with co-PhD students – and very much a company coming out of that last REF.

The original thinking was for a bottom up project thinking about 50 or 60 PhDs who needed to capture the work they were doing. We wanted to break down the gap between day to day research practice and the repository. The idea was to allow administrators to have a way to monitor and plan, but also to ensure that marketing and comms teams were aware of developments as well.

So, our front page presents a sort of wall of activity, and personal icons which shows those involved in the activity. These can include an image and clicking on a record takes you through to more information. And these records are generated by a form with “yes” or “no” statements to make it less confusing to capture what you have done. These aren’t too complex to answer and allow you to capture most things.

We also allow evidence to be collected, for instance outreach to a school. You can also capture how many people you have reached in this activity. We allow our community to define what sort of data should be collected for which sort of activity. And analytics allow you to view across an individual, or a group. That can be particularly useful for a large research group. You can also build a case study from this  work – useful for the REF as it allows you to build up that case study as you go.

In terms of depositing papers we can specify in the form that an EPrints deposit is required when certain types of impact activities are recorded – and highlight if that deposit has been missed. We can also export a Kolola activity to EPrints providing a link to the Kolola activity and any associated collections – so you can explore related works to a particular paper – which can be very useful.
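The "deposit required, and flag it if missing" rule described here can be pictured with a short sketch. The activity types, field names, and `missing_deposits` helper below are hypothetical and chosen only for illustration; they are not Kolola's or EPrints' actual schema or API.

```python
# Assumed set of activity types that should be backed by a repository deposit.
DEPOSIT_REQUIRED = {"journal article", "conference paper"}


def missing_deposits(activities):
    """Return titles of activities that need a deposit but have none linked."""
    return [
        activity["title"]
        for activity in activities
        if activity["type"] in DEPOSIT_REQUIRED and not activity.get("eprint_id")
    ]


# Invented activity records.
activities = [
    {"title": "Paper on sensors", "type": "journal article", "eprint_id": 1234},
    {"title": "Workshop paper", "type": "conference paper", "eprint_id": None},
    {"title": "School visit", "type": "outreach"},
]

print(missing_deposits(activities))
```

Keeping the rule as data (the set of types) rather than hard-coding it fits the point made in the talk that each community can define what should be collected for which sort of activity.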

We’ve tried to distribute a research infrastructure that is quite flexible and allow you to have different instances in an organisation that may be tailored to different needs of different departments or disciplines. But all backed up by the institutional repository.

Q1) Do you have any evidence of researchers gathering evidence as they go along?

A1) We have a few of these running along… And we do see people adding stuff, but occasionally researchers need prompting (or threatening!), for instance for foreign travel you have to be up to date logging activity in order to go! But we also saw an example of researchers getting an entry in a raffle for every activity recorded – and that meant a lot of information was captured very quickly!

(Graham Steel @McDawg taking over from Nicola Osborne for the remainder of the day)

Demo: RSpace – Richard Adams, Research Space


RSpace ELN presentation and demo. Getting data online as early as possible is a great idea. RSpace at the centre of user data management. Now time for a live demo (in a bit).

Lab notebooks can get lost for a number of reasons. Much better is an electronic lab book. All data is timestamped. Who made what changes etc. are logged. Let’s make it easy to use. Here’s the entry screen when you first log in.  You can search for anything and it’s very easy to use. It’s easy to create a new entry. We have a basic document into which you can write content with any text editor. You can drag and drop content in very simply. Once documents have been added they appear in the gallery. Work is saved continuously and timestamped.

We also have file stores for large images and sequencing files.


It’s very easy to configure. Each lab has its own file server. Going back to workspace, we’re keen to make it really easy to find stuff. Nothing is ever lost or forgotten in workspace. You can look at revision history. You can review what changes have been made.  Now looking at a lab’s group page. You can look at but not edit other user generated content. You can invite people to join your group and collaborate with other groups. You can set permission for individual users. One question that comes up often is about how to get data out of the system. Items are tagged and contain metadata making them easier to find. To share stuff, there are 3 formats for exporting content (ZIP, XML and PDF).

The community edition is free and uses Amazon web services. We’re trying to simplify RSpace as much as possible to make it really easy to use. We are just getting round to the formal launch of the product but have a number of customers already. It’s easy to link content from the likes of DropBox. You can share content with people that are not registered with an RSpace account. Thanks for your attention.

Q1) I do lots of work from a number of computers.

A1) We’re developing an API to integrate such content. Not available just yet.

Closing Remarks and presentation to winner of poster competition – Kevin Ashley, Digital Curation Centre

I’m Kevin Ashley from Digital Curation Centre here in Edinburgh. Paul Walk mentioned that we’ve done RFringe events for 7 years. In the end, we abandoned the developer challenge due to a lack of uptake this year. Do people still care about it? Kevin said there is a sense of disappointment. Do we move on or change the way we do it? Les says I’ve had a great time, it’s been one of the best events I’ve been to for quite some time. “This has been fantastic”. Thanks Paul for your input there said Kevin.

David Prosser’s opening Keynote was a great opening for the event. There were some negative and worrying thoughts in his talk. We are good at identifying problems but not solutions. We have the attention of Governmental department in terms of open access and open data. We should maximize this opportunity before it disappears.

Things that we talked about as experiments a few years ago have now become a reality. We’re making a lot of progress generally. Machine learning will be key, there is huge potential.

I see progress and change when I come to these events. Most in the audience had not been to RFringe before.

Prizes for the poster competition. The voting was quite tight. In third place LSHTM, Rory. Second place. Lancaster. First place, Robin Burgess and colleagues.

Thanks to all for organizing the event. Thanks for coming along. Thanks to Valerie McCutcheon for her contribution (gift handed over). Thanks to Lorna Brown for her help too. Go out and enjoy Edinburgh! (“and Glasgow” quipped William Nixon).


I am Digital Education Manager and Service Manager at EDINA, a role I share with my colleague Lorna Campbell. I was previously Social Media Officer for EDINA working across all projects and services. I am interested in the opportunities within teaching and learning for film, video, sound and all forms of multimedia, as well as social media, crowdsourcing and related new technologies.

One comment on “Repository Fringe Day Two LiveBlog
  1. says:

    Unlocking Thesis Data – Stephen Grace, University of East London
    [Q4/A4 was about embargoes but I’m afraid I missed it]
    The comment was, that sometimes where there is a need to apply an embargo to a thesis because it contains content in which a publisher has copyright – it may be useful to have a DOI for the thesis and separate DOIs for the data, so that the data can be released prior to the thesis being released from embargo.
