Research Data Management as a national service

The volume of data stored in research institutions is growing, and the rate at which it is growing is accelerating.

Modern research practices and equipment are generating huge amounts of data. In science there are many desktop devices that can create terabytes of data every day – they’re becoming more prolific, more compact and more affordable, and they’re popping up everywhere. This is great of course – more data allows the research community to answer bigger questions more quickly – but the flood of data is creating a management problem.

A lot of great work, driven mostly by university libraries, has been done to establish and encourage best practice in the management of data relating to published research, but this accounts for only a small percentage of all the data currently held in research institutions – the very tip of the iceberg. I’ll come back to published data in a moment, but for the time being I’d like to focus on the rest of the data – the bulk of the iceberg: live data (which is being collected, processed and analysed as part of current research and is not yet ready to be published), and ‘old’ data (which relates to completed or published research but has not itself been published or made accessible to the broader community).

The Gurdon Institute (a medium-sized research department within the University of Cambridge undertaking research into cancer biology and developmental biology) has a filestore that now accommodates over 200 million files, on approximately 3.5 petabytes of hardware. In common with most other institutions, we strive to embrace the principles of Open Science and Data Re-Use, and to ensure that our data is kept securely and remains accessible for at least ten years. It’s worth noting here that the typical life expectancy of storage hardware is around 5–7 years, so data retained for ten years will live on at least two generations of hardware – two rounds of infrastructure investment over the course of the data lifecycle.

Distribution of GB universities

There are hundreds of research institutions around the country. In the map above, the red dot shows the location of the Gurdon Institute – shared with two universities, the Sanger Institute, the MRC LMB, the Babraham Institute and many others. All of these institutions, all over the country, are building and managing their own storage systems, and within each of them individual departments are building and managing their own local storage systems too. At huge expense, all of these organisations are locked into cycles of buying, building and maintaining unique systems, despite sharing a common goal. There are no published standards or best-practice models that I’m aware of, and I dare say the quality of provision across institutions is inconsistent.

All of our research institutions are building separate but similar systems, to provide similar services to a single community.

Published data lives in a separate infrastructure all of its own – at another conference recently I heard someone suggest that there are tens of thousands of repositories around the world. I don’t know how many there are in the UK, but I think it’s a safe bet that most of the hundreds of institutions in the country have at least one repository running on local hardware, independent of their live data storage. And all of these repositories ALSO share a common goal – to curate and to disseminate published data and resources for the benefit of the entire research community. 

It seems completely nuts to me that so much expensive resource is being duplicated all over the country to provide exactly the same service to the same community – it’s hugely extravagant and wasteful. But that’s only part of the problem. Another problem lies in the traditional (or popular) methods of organising data within these filestores.

It’s in here somewhere…

One of our research group leaders came to speak to me a while ago and told me that she felt she was ‘losing control of her data’ – she knew the files she wanted to find were somewhere in our filestore, but she just couldn’t find them. There’s a lot of great advice and guidance available about using robust file naming conventions, and well-organised directory structures, so I started describing some strategies, but she stopped me and said: “I’m pretty sure we’re doing all that!”. I looked over her shoulder as she logged in to the filestore and, sure enough, I found that she is using naming conventions, and she does have a good directory structure – the problem is that the directory structure is now more than 20 levels deep, and their filenames have become super-complex, reflecting the number and the complexity of the systems that have created the data. It seems that filenames have changed from being uselessly simple ten years ago, to being unhelpfully cryptic today. And the process of trying to find data created by someone who left the lab years ago, by rummaging through directories and speculatively double-clicking on files, is becoming increasingly hopeless. This highlights the value of a good metadata-driven repository system of course, with its massively more effective search tools.
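
As a toy illustration of why metadata-driven search beats rummaging through a 20-level directory tree: once each file carries even a few structured fields, finding a departed colleague’s data becomes a query rather than an archaeology exercise. The record structure, names and paths below are invented purely for illustration – this is a sketch of the idea, not of any particular repository system.

```python
from dataclasses import dataclass

@dataclass
class FileRecord:
    """One catalogue entry: the file's location plus a few searchable metadata fields."""
    path: str           # where the file actually lives, however deep the directory tree
    owner: str          # who created it, even if they left the lab years ago
    experiment: str
    created_year: int

# A handful of invented records standing in for millions of real files.
catalogue = [
    FileRecord("projects/devbio/2016/chipseq/run07/out_final_v3.bam", "maria", "chip-seq", 2016),
    FileRecord("projects/devbio/2017/chipseq/run12/out_final_v4.bam", "maria", "chip-seq", 2017),
    FileRecord("imaging/2019/confocal/stack_0042.tif", "tom", "live-imaging", 2019),
]

# One metadata query replaces a speculative trawl through directories and filenames.
hits = [r.path for r in catalogue
        if r.owner == "maria" and r.experiment == "chip-seq" and r.created_year == 2016]
print(hits)  # ['projects/devbio/2016/chipseq/run07/out_final_v3.bam']
```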

So. Spending, effort and resources are being duplicated needlessly, and the systems being built don’t even serve their purpose very well. And given the deluge of data that’s now being generated, I just don’t think this approach is sustainable. We’re not quite at breaking point yet, but the trajectory is clear, and I think now is a good time to start thinking about a different approach. I’ve become interested in the idea that we should replace all of this – or at least as much of it as possible – with a single, joined-up, national infrastructure for research data management.

A single, national infrastructure for research data management

My proposal is that a single infrastructure (comprised of at least three very large data centres) should accommodate all research data, for all disciplines and institutions, throughout the entire data lifecycle – from creation to publication – irrespective of how researchers move around between institutions.

One of my responsibilities is to help research groups migrate their data out of the Gurdon Institute and into their new institutions, and it really is a painful task. It’s time-consuming and can be very complicated, because every institution that has its own infrastructure also has its own policies and capacities, which might not be compatible with the structure or the volume of the data that has to be moved. And it shouldn’t be necessary. In the joined-up infrastructure shown above, the data stays put while the people move around it.

Jane Smith would move from a university in Aberdeen to another in Manchester, and as soon as she steps into her new office, she could log into the same system, using the same credentials, and access the same data. The same benefit applies to operation and analysis, and collaboration and publication – there’s no need for the data to be moved in order to satisfy all of these functions – they can all be undertaken and managed remotely.

I would propose that the storage platform should be presented through a streamlined, lightweight repository interface using an ORCID login. It would automatically harvest metadata from the file, from the researcher’s ORCID profile, and from a minimal number of keywords required in the submission form – the aim being to make the initial submission of newly created, live data almost as easy as saving a file on the researcher’s own computer. Persistent identifiers would be created and attributed to each new dataset immediately, and a revision history started. Researchers would access their data using their ORCID credentials, while other stakeholders, collaborators, and eventually the whole community would access it via the persistent identifier – either shared by the researcher, published in a paper, or referenced in the research documentation.
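
To make that workflow concrete, here is a minimal sketch of what a deposit step might look like. Everything in it is an assumption for illustration only – the `deposit` function, the metadata fields and the `nrdi:` identifier scheme are invented, and no real ORCID or repository API is being called.

```python
import hashlib
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from pathlib import Path

@dataclass
class DatasetRecord:
    """One deposit: a persistent identifier plus a growing revision history."""
    pid: str                 # persistent identifier minted at first deposit (invented scheme)
    orcid: str               # depositor's ORCID iD
    keywords: list[str]      # the minimal keywords supplied in the submission form
    revisions: list[dict] = field(default_factory=list)

def deposit(filepath: str, orcid: str, keywords: list[str],
            existing: DatasetRecord | None = None) -> DatasetRecord:
    """Harvest basic metadata from the file, mint a PID on first deposit,
    and append to the revision history on every subsequent deposit."""
    path = Path(filepath)

    # Metadata harvested automatically from the file itself.
    revision = {
        "filename": path.name,
        "size_bytes": path.stat().st_size,
        "sha256": hashlib.sha256(path.read_bytes()).hexdigest(),
        "deposited_at": datetime.now(timezone.utc).isoformat(),
    }

    record = existing or DatasetRecord(
        pid=f"nrdi:{uuid.uuid4()}", orcid=orcid, keywords=keywords)
    record.revisions.append(revision)
    return record

# Usage (hypothetical): saving a new result file deposits it with a couple of keywords.
# record = deposit("results/blot_2024_03.tif", orcid="0000-0002-1825-0097",
#                  keywords=["western blot", "p53"])
# print(record.pid, len(record.revisions))
```

The point of the sketch is only that the researcher supplies almost nothing beyond the file and a few keywords; the identifier, checksums and revision history come for free.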

Every research discipline has different requirements and expectations of storage platforms and repository systems, and one size will not fit all. But this is where I think there are some interesting opportunities. My proposal is that we should build this as a basic, discipline-agnostic, universally-relevant storage platform – something that’s equally useful to all researchers in all disciplines – but design it in such a way that the developers and commercial partners can build and grow a library of interface skins that will provide specialist toolsets for those different research communities. I’m keen to stress here that this is not a solution for science disciplines only – this is a system for every researcher, in every faculty, in every institution – science and arts and humanities alike. All operating within their own familiar or specialist environments, but underneath those interfaces, all sharing a single, common infrastructure.
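
As a rough illustration of the ‘common core, specialist skins’ idea, the sketch below defines a minimal, discipline-agnostic storage interface and two hypothetical front ends layered on top of it. The class and method names are mine, invented for illustration – they don’t describe any existing product.

```python
from abc import ABC, abstractmethod

class CoreStore(ABC):
    """The discipline-agnostic platform: the only thing every skin depends on."""

    @abstractmethod
    def put(self, pid: str, payload: bytes, metadata: dict) -> None: ...

    @abstractmethod
    def get(self, pid: str) -> tuple[bytes, dict]: ...

class MicroscopySkin:
    """Hypothetical life-sciences front end: adds imaging-specific vocabulary,
    but stores everything through the same core interface."""
    def __init__(self, store: CoreStore):
        self.store = store

    def save_image(self, pid: str, image: bytes, channels: int, pixel_size_nm: float) -> None:
        self.store.put(pid, image, {"type": "microscopy",
                                    "channels": channels,
                                    "pixel_size_nm": pixel_size_nm})

class ManuscriptSkin:
    """Hypothetical humanities front end: a different vocabulary, the same core."""
    def __init__(self, store: CoreStore):
        self.store = store

    def save_transcription(self, pid: str, text: str, source_archive: str) -> None:
        self.store.put(pid, text.encode("utf-8"), {"type": "transcription",
                                                   "source_archive": source_archive})
```

The skins own the vocabulary and the specialist tools; the core owns the storage, the identifiers and the access control – which is what lets new capabilities be rolled out to everyone at once.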

And this highlights one of the great additional benefits of a shared infrastructure for storage – it creates a framework for rapid deployment of other innovations to the entire community – all disciplines, in all institutions, simultaneously.

The benefits of a single, joined-up approach are clear:

• Single platform for everyone
• Democratised access to resources
• Common experience, processes, quality and culture
• Easy cross-discipline collaboration
• Economy of scale
• Reduced local complexity and support burden
• Easy retrieval via persistent identifier, repository search or documentation search (see the sketch after this list)
• Reduction of duplication and movement of data
• Quick, easy and standardised publication of any dataset, irrespective of size
• Rapid deployment of innovation to the entire community
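
To illustrate the retrieval point above: with a single national platform, resolving a persistent identifier could be all a collaborator ever needs, wherever they (or the data) happen to be. The resolver URL and response fields below are invented for illustration only.

```python
import json
from urllib.request import urlopen

def resolve(pid: str) -> dict:
    """Resolve a persistent identifier to its dataset metadata.
    The resolver endpoint is hypothetical - a real service would publish its own."""
    with urlopen(f"https://resolver.example.org/{pid}") as response:
        return json.loads(response.read())

# A collaborator given only the identifier from a paper could then do:
# metadata = resolve("nrdi:1b9d6bcd-bbfd-4b2d-9b5d-ab8dfbbd4bed")
# print(metadata["filename"], metadata["deposited_at"])
```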

But the challenges – technological and cultural – are equally clear. I’m sure you can think of a hundred reasons why something like this would probably never work, but if there is a will, then all of these obstacles could be overcome by clever people. I’m guessing we can continue to adapt and stretch our current practices as best we can for maybe 10–15 years. But a system like this will not evolve out of current practices – I think it will need to be deliberately built by a bold and forward-thinking governing body, and that could take the best part of ten years to achieve.

There have been some high-level discussions in the past within the academic community about a national approach to RDM. I was chatting recently with a colleague from Cambridge, now retired, who chaired a national committee about 15 years ago to consider the potential benefits of a national data management service (without actually proposing any particular model). Their conclusion was that the idea was good, but ten years ahead of its time: there was insufficient interest or support within the community for it to gain traction. But here we are, 15 years later, and now there is a very active and energetic community that is deeply engaged with data management practices and policies. I think we have a real chance to build the conversation to a level that will start to influence research councils, funding bodies and other policymakers.

I’d love to hear your thoughts. And if you agree that our current approaches to research data management are unsustainable, then please feel free to share/discuss this idea in your own communities.

11 thoughts on “Research Data Management as a national service”

    1. Thanks Andy – this is JISC’s repository offering, for published data. It’s a great example of the work being done to drive forwards the Open Research agenda, but it doesn’t address the live data issue – the bulk of the iceberg!

      1. Hi Andy and Alastair (and Rory),

        As has been mentioned/implied, Jisc’s Open Research Hub could form the core of a national offering. It’s been built around open standards and has connectivity/interoperability at its heart. Whilst it’s true to say, Alastair, that it’s currently used for published data, that’s primarily because that was the most pressing use case when the Minimum Viable Product was being specified. It was always intended to be used in a wider context, and has been designed to be extensible with a view to being used with any type of data at any stage in the life cycle.

        The technology is there. We could start almost immediately. But how do we overcome the inertia in the current system(s), and the resistance to change? I think perhaps that the biggest problems to overcome in order to make this a reality are not technological. They’re ones relating to hearts and minds and sustainability… particularly the latter. It’s easy to say “it could be done” and “it should be done”. Less easy to say “we’ll pay for it”.

        PaulS

      2. Hi Paul, and thanks very much for this. It’s an interesting situation – I think the business case is easy to see, but I absolutely agree that the scale of the project makes it difficult to imagine. That’s one of the reasons why I thought the Research Notebooks idea might be helpful – it’s a much more modest prospect, but from a technical point of view it’s very similar and offers many of the same benefits and would be a great proof-of-concept. Crucially, it would require provision of a small amount of data storage, and it would establish relationships and a new business model with commercial partners who would develop bolt-on specialist functionality and interfaces.

    1. Hello Rory – hope you’re well! It’s great to broaden the conversation, and thank you very much for chipping in. I’m not *quite* as pessimistic as I perhaps seemed in the original post about the possibility of a paradigm shift. I’m also aware of the more evolutionary option of developing existing infrastructures into a collection of federated, connected, standards-controlled systems, but I think that would miss some of the key advantages – notably reducing the proliferation of identical systems and challenging the idea of personal or institutional ‘ownership’. I’m not sure we’ll ever get to where we should be through evolution – I think we need to build something new from scratch. However! The important thing right now is to propagate the message – that what we’re doing right now is not sustainable – to the widest audience possible, and hope that some of those people, who have a bigger-picture perspective, will start thinking about the future beyond five-year funding cycles. Thank you for helping to make that happen.

      Best wishes,
      Al

  1. Thanks, Al. These issues were debated and explored at a fantastic event that took place earlier this week: the Open Cloud Workshop 2020 https://massopen.cloud/events/2020-open-cloud-workshop/#Schedule. Several high-level themes stood out for me: (1) the importance of compute as an infrastructural element complementing storage; (2) the need for a strong application layer to broaden the appeal and make the service competitive with AWS, Azure and Google; and (3) the value-add that commercial partners – e.g. Red Hat – can bring to the design, implementation and delivery of the service.

    1. For a large community infrastructure designed to store all research data for all HE institutions, I think it’d be nuts to build around a 3rd party cloud service like AWS/MS/Google. Disengagement from those services would become too difficult to contemplate. But I am very interested in the commercial partner element and would be interested in your view – if the entire research community could have access to your product as a lightweight bolt-on interface to the basic platform, and could easily experiment with it, without any anxiety about disruption, disengagement or data quotas, that would be a huge advantage for you, right?

  2. Just to be clear, the open cloud being developed by the New England group of universities is not built around a 3rd party cloud service like AWS/MS/Google. On the contrary, it is an OPEN cloud, i.e. using open technologies, analogous to open source software. There are two related clouds. The original Mass Open Cloud is a test bed for experimentation with new cloud technologies. The newer New England Research Cloud, which is in the process of being launched, is intended to be a production cloud on which a layer of applications will be deployed, including early guinea pigs like Dataverse and RSpace. The purpose of this is exactly as you describe it, i.e. the entire research community can have access to these products as a lightweight bolt-on interface to the basic platform, and easily experiment with them, without any anxiety about disruption, disengagement or data quotas – and yes, it’s attractive as you say.

    Both the MOC and the NERC are built on top of a large data center (the MGHPCC) and a storage layer on top of that (the New England Storage Exchange).

    Scott Yockel of Harvard has created a useful graphic depicting this entire ecosystem. I don’t seem to be able to reproduce the graphic here, so I will send it to you in an email.
