Open Data in perspectives: an account of the Open Data Day 2020 Rio at the National Archives
A year ago I was at one of the two Open Data Day Rio de Janeiro events, which was organized by the Arquivo Nacional, the National Archives of the Brazilian federal government. The event was a day early, on Friday, considering that the Open Data Day is always on a Saturday, because it would work best for the institution to host the event on a work day.
I was invited to give a talk there about the Open Data Day itself: what it is, why is it important, and how have some previous ODD events been like. The other presentations at the event also showed other perspectives on open data. Otávio Neves, Director of Transparency at the Office of the Comptroller-General (CGU), showed the current direction of the national open data policy. Luana Sales, Coordinator-General of Access and Diffusion at the National Archives, talked about research data management and archival practices for data. Then, Patricia Henning, professor at Unirio, explained the FAIR principles of data management and the GOFAIR initiative. Finally, there was a table discussion on the release of the book on open research data.
These are my notes the event along with with a few remarks on how things have progressed since then. They are my own words and do not represent the views or even precisely what each speaker said at the event. For that, you are advised to watch he recording of the whole event (in Portuguese) which is available on Youtube. It is split between the morning and afternoon parts.
The event opened with an introduction from the National Archives Director Neide de Sordi, speaking remotely from Brasília, followed by a discussion along with Otávio Neves from the CGU and Vanessa Jorge from Oswaldo Cruz Foundation (Fiocruz).
Ms. de Sordi explained that the National Archives is responsible for the document management policies for the whole of the Brazilian federal government and it is time for the institution to support and put in practice open data. This was the first time the National Archives had hosted an ODD event. Vanessa Jorge told how Brazil was one of the founders of the Open Government Partnership and how the independently reviewed Action Plan process works.
The Open Data Day and its history in Brazil
My talk, whose slides are available on SlideShare, had two main topics: tips on how to organize an Open Data Day event (at the time, mostly in-person meetings), and the history of the Open Data Day in Brazil.
After a brief introduction on my own background with the open data movement and my involvement in bringing open data as a public policy to the Brazilian federal government, I showed the Open Data Day website, how anyone could translate the content to their own language if they have the time to contribute and, more importantly, how to register your event there.
Then I mentioned how on January 14th I had participated in a live discussion hosted by the Embaixadoras project from Open Knowledge Brazil (OKBR), in which we discussed tips on how to organize an Open Data Day event. We had so many good ideas that soon after OKBR did compile them into an ODD event organization guide.
The Open Data Day is a collection of decentralized events around the world, all happening on the same day or at least close to the official date, on the theme of open data. Activities can range anywhere from overall discussions about open data to hands-on experience to do something useful with them. Getting together people interested in data can bring interesting discussions, ideas and collaborative products. It’s clear that the synergy of getting together and the exchange of ideas brings results that are larger than the sum of their parts.
Then I went on to tell as much as I could find out about the history of the Open Data Day events in Brazil and interesting things that have been build in those. However, that and some advice on how to organize events will be shown in detail on another blog post of mine.
I also went on to illustrate some organizations that support the Open Data Day: the Open Knowledge Foundation, the sponsors of the mini-grant scheme and other organizations that host the local decentralized events (such as the National Archives, in this case).
Registering public domain documents on Wikimedia Commons
After my talk, Neide de Sordi asked Diego (I do not know his last name, which was not mentioned at the event) from the press service of the National Archives to show a project that uploads many historical and public domain documents of National Archives to Wikimedia Commons. That has helped Wikipedia editors in finding primary sources to cite in Wikipedia articles.
The National Archives not only uploaded a great number of digitized historical documents, but also promoted a contest with prizes for the editors who contribute the most to add such images to illustrate and to serve as source of information for Wikipedia articles.
The state of the Open Data Policy
The next talk was by Otávio Neves of the Office of the Comproller-General (CGU), who is responsible for the Brazilian federal government open data policy since July of the year before, when Decree no. 9,903/2019 reassigned it from the Ministry of Economy to the CGU.
Mr. Neves situated open data in context of the Open Government Partnership (OGP), which in Brazil is coordinated by the CGU. He mentioned that open data has some characteristics in common with open government and that he intended to explore those synergies to make progress in both at the same time. He remembered how Brazil’s Access to Information Law did establish in Article 8th requirements for open data in the active transparency activities.
He then went on to explain the history of the open data policy, something I was a part of. The creation of the National Infrastructure for Open Data (INDA) in 2012, that works not necessarily as a data infrastructure itself, but a tool for management and fostering the many public administration organizations to open up data. To know more about the history of INDA’s committee, please see my blog post where I tell about it in detail. He also stressed out the importance of the creation of the open data portal as a central point where citizens can find and use data from many governmental organizations.
Next, he talked about the concept of open government, being a culture where government works in collaboration with society in a transparent manner. Brazil was one of the founders of the OGP in 2012. He cited as an example how Waze used to share data with authorities to manage traffic. The other way around is when government releases open data that is then used by the private sector to create solutions and services. He mentioned there are many studies indicating that open data does have an aggregate positive effect in the economy.
Then, in 2016, Decree no. 8,777 reinforced the open data policy by mandating that every federal government institution create their own plans detailing which datasets they would open in the span of the next two years. The Ministry of Planning did the management and capacity building, while the role of the CGU was to monitor each one of them for compliance with the bylaws. Finally, in 2019, Decree no. 9,903 placed the whole of the open data policy under the banner of the CGU, along with the open government and transparency which were already there.
He compared the different ways to make information available. Open data is very flexible as people can reuse it in many different ways. But it’s not easily accessible for everyone, as many people are not able to handle them, cross-reference data, build visualizations, etc., for a lack of technical skills. Classical transparency, on the other hand, is ready for non-technical people to see and understand right away. According to him, these people would rather the government builds the data visualizations directly and show it to them. So different target audiences would prefer it one way over the other.
Some areas the CGU intends to develop the open data policy are capacity building for civil servants, building networks of people, like promoting workshops to connect people e.g. with data analysis skills to people with knowledge about public policy.
Next, some examples were shown of civil society organizations that used access to information requests, open data and collaborative practices to build useful applications, cases of data driven journalism, etc.
Finally, Mr. Neves showed the priority areas that the CGU intended to work with the open data policy for the next two years. One year later, I am yet to see significant progress in any one of them. But, to be fair, many of those ideas were present in form of the Action Plan approved in a Steering Committee meeting in February 2021 – even though that committee has not yet been formalized after being disbanded by Decree no. 9,759/2019, Article 5. See the blog post about the Committee I wrote last December for more on that. Also, as far as I know, to this day the Action Plan approved in February has not been made available to the public yet.
Update (2021-03-09): Open Knowledge Brazil has succeeded, through an access to information law request, in making available the contents of the Action Plan for 2021-2022, approved in February. You can also read their review of the plan (the article is in Portuguese).
Research data management and data archival
The afternoon presentations started with Luana Sales, from the National Archives. She explained the concept of open science, having reproducibility as one of its pillars. Reproducibility is the possibility to reproduce a scientific experiment under the same conditions present in the original research, so that they can be peer reviewed. That includes not only transparency for the methodology used in research, but also availability of the computer code and datasets used. Access to the results as academic papers is also essential for reproducibility, so it is important to have open access to publications and not have them locked up behind paywalls from the scientific publishers.
She spoke of the benefits of the collaboration between researchers using social networks, employing a greater exchange of ideas.
But the sharing of open research data must be mediated by management practices. For example, making sure that personal data is not inadvertently disclosed along with it. According to Ms. Sales, these data management practices has a lot in common with archival practices: archiving, curating, preservation and continued access.
Identifying which data is to be shared as open research data may be challenge in itself, as the data that is relevant is dependent on context. Data that is useful in a context might also be or might not be in another context. For instance, many ship logs that were previously used in a historical research context have been in climatology research, as they contain clues about how the weather and ocean winds were centuries ago.
She showed a definition for open research data that she and Prof. Sayão have proposed. Then, she showed a proposed taxonomy of types of research data and classifications according to the data origin, like observational data from an experiment, for instance. She discussed which elements of the research need to be preserved for reproducibility, e.g., workflows, models, the instruments used, research notebooks, etc. Data must be accompanied by the documentation of the whole chain of data provenance. All that matters for the next researcher to be able to trust the data and use it in new research.
Ms. Sales advocated for the advantages of sharing openly lab notebooks used in research. Since nowadays they are electronic, they can receive directly data from instruments. Sharing helps to audit and peer review their contents. When determining what to preserve, however, the probability of that particular piece being useful in the future is also something to take into consideration, even though sometimes it can be difficult to ascertain.
So, in order to preserve research data for future reuse, we need not only methodology but also infrastructure and data management practices. We need specialized professionals, such as data archivists. Ms. Sales described some of the activities that a data archivist would need to do. One of the challenges organizations and individuals face today when dealing with data, is to sort out personally identifiable or confidential data, in order to avoid disclosing those inappropriately. Those professionals could help facing that challenge.
The FAIR principles
Patricia Henning started by stating that she studied in the Netherlands under professor Luiz Bonino, who is one of the original proponents of the FAIR principles. She explained that what these principles advocate for are nothing new to archivists and librarians. It’s just that they have all been laid out in a list that is easier to follow. They are very much connected with the necessary practices exposed by Luana Sales in the previous presentation.
She exposed the common problems with data that are a barrier to reuse. According to Bonino, 80% of the data is lost forever because they are not properly archived. Also, 80% of the time is spent cleaning up data, leaving only 20% for research and innovation. Only a small portion of research that is publicly funded is reproducible with the research data deposited into data archives.
In order to tackle those problems, the proponents of the FAIR elaborated in 2014 a set of 15 principles so that finding and reusing data would be as easy as to find pages and consume information by using a search engine. FAIR stand for Findable, Accessible, Interoperable and Reusable data.
The FAIR principles became more widely known in 2016 once an article about it was published on Nature magazine. Since then, the G20, G7 and the European Commission have all embraced the principles. Ms. Henning showed many examples of policy documents in Europe related to research funding and data management, all embracing FAIR, including as a requirement in order to receive public funds for research. More recently, in February 2021, I read the Data Strategy for the German Federal government, which also states the goal of following the FAIR principles. That can be seen as evidence that support for FAIR is today as strong as it’s ever been.
So, what are the FAIR principles? Patricia Henning noted that most of them apply to both data and metadata (which, if you are unfamiliar with the term, is data that describe the data). Under findable, FAIR states the following principles (with some of Ms. Henning’s remarks):
- F1. (Meta)data are assigned a globally unique and persistent identifier
- you don’t need to use an expensive scheme like DOI, there are free alternatives to assign unique identifiers to both data and metadata
- F2. Data are described with rich metadata (defined by R1 below)
- F3. Metadata clearly and explicitly include the identifier of the data
- that may seem obvious, but must be guaranteed to be present
- F4. (Meta)data are registered or indexed in a searchable resource
- a data repository may fill that role
Under accessible, the principles are:
- A1. (Meta)data are retrievable by their identifier using a standardised communications protocol
- A1.1 The protocol is open, free, and universally implementable
- A1.2 The protocol allows for an authentication and authorisation
procedure, where necessary
- for example, to write data and metadata to the repository, but also in cases when data access restrictions (e.g. personal data) apply
- A2. Metadata are accessible, even when the data are no longer available
- I1. (Meta)data use a formal, accessible, shared, and broadly
applicable language for knowledge representation
- this is the probably the most costly and challenging of the principles to implement, as many people, even in Europe, do not understand what it means – to use semantic web data standards such as RDF
- I2. (Meta)data use vocabularies that follow FAIR principles
- just like the data and metadata that use the vocabularies, and may also be a challenge
- I3. (Meta)data include qualified references to other (meta)data
- interlinking (meta)data (see Linked Data) may also prove to be challenging and costly
Finally, for reusable:
- R1. (Meta)data are richly described with a plurality of accurate and
- R1.1. (Meta)data are released with a clear and accessible data usage
- under what conditions the data may be used
- R1.2. (Meta)data are associated with detailed provenance
- how was the data obtained, existing versions, etc.
- R1.3. (Meta)data meet domain-relevant community standards
- R1.1. (Meta)data are released with a clear and accessible data usage license
Patricia Henning notes that FAIR data is not the same as open data, as the principles can be applied as much to data that has access restrictions as it can in data that is publicly accessible. She also showed the Open Definition for open data, the 5 stars of Linked Open Data proposed by Tim Berners-Lee. and the set of Creative Commons Licenses.
Finally, she showed the book Turning FAIR into Reality, published by the European Commission, which is a guide on how to implement the FAIR principles. She also presented the GOFAIR initiative, also present in Brazil and led by Luana Sales, which aims to help institutions implement FAIR. She cited Fundação Oswaldo Cruz (Fiocruz) as an example of Brazilian institution that aims to use and support FAIR in health data by elaborating a data management plan.
Release of the book on open data for science publishers
At the end of the event, the book “Topics on open data for science publishers” was released. There was a table discussion with Luana Sales, Michelli Costa, from the University of Brasília, and Sigmar de Mello, president of the Brazilian Association of Science Publishers.
Luana Sales explained that the release of this book is also a milestone from Brazil’s 4th Action Plan in the Open Government Partnership. Michelli Costa stated that one cannot hope to do open science without first explaining some of those issues to science publishers, who the book tries to reach. Sigmar de Mello stated that a survey with science publishers found out that only 20% of them would support an open science alternative in their publications. So, advocacy work like this book is very much needed.
A few of units of the book were handed out to participants in a giveaway after the discussion table.
The ODD 2021 at the National Archives
This year, once again the National Archives organized an Open Data Day event one day early, so as to fall on a working day – Friday, March 5th. Because of the pandemic, this time the event took place online.
The programme had a presentation from John Sheridan, Digital Director of the UK National Archives. He was followed by a talk from Caroline Burle, from the Web Technologies Study Center in Brazil (Ceweb.br).