Hanze Library Guides: Discovering Existing Research Data

Introduction

"Open data is data that can be freely used, re-used and redistributed by anyone - subject only, at most, to the requirement to attribute and sharealike".

What to expect from this Library Guide?

Both within and outside Hanze UAS, a vast number of open, readily available datasets cover a variety of disciplines. You may have questions about finding and using such datasets, as well as about making your own data openly available. This library guide discusses the first topic; the second is covered in a separate library guide called Publishing and sharing.

Nowadays, research institutions, funders, and publishers require researchers to publish their data as openly as possible. That means others can access and use your data, and you can access and use the data of other researchers. It is therefore important to know where to find existing datasets and how to evaluate their quality and usefulness for your own research. There are various reasons for using existing datasets, for instance:

Time and Cost Efficiency: Collecting new data can be time-consuming and expensive. Using existing datasets saves both time and resources, allowing researchers to focus on analysis and interpretation.
Meta-Analysis: Combining multiple existing datasets through meta-analysis can yield more comprehensive and statistically significant results.
Exploratory Research: Existing datasets can serve as a starting point for exploratory research, allowing researchers to identify interesting patterns and generate new hypotheses.
Study Design: Examining existing data may help you optimise your own study design, for example by replicating an earlier study, using different tools and measurements, or following recommendations in the original article.

This library guide covers the following topics:

A glossary to familiarize yourself with concepts related to Research Data Management
The data management lifecycle
How do I find datasets?
How do I evaluate datasets?
How do I cite existing datasets?

For whom is this library guide useful?

The library guide ‘Discovering existing research data’ is useful for students and researchers of Hanze UAS who are starting to explore open data. The guide is not comprehensive, but it gives you a first impression of how to search for open data.

Explanation of concepts

Glossary
The National Coordination Point Research Data Management (LCRDM) is a national network of experts in the field of research data management (RDM).

Data Management Lifecycle

In order to use and assess existing datasets, it is important to understand the data management lifecycle. Funding agencies require a data management plan as part of their grants, and both funding agencies and publishers often require authors to share, at minimum, the data necessary to understand and assess a manuscript’s conclusions. However, proper data management also benefits researchers themselves, as it helps them by:

Improving the efficiency and organization of their research project.
Making their research process more comprehensible to themselves and others.
Creating high-quality data, which will likely increase their research’s impact.
Ensuring data can be accessed and reused by themselves and others for future projects.
Preventing data loss and consequences of data loss.

The first step to collecting high-quality data is data planning. A research question will guide the data collection plan. For funded research, researchers are required to write a data management plan (DMP). Such a plan describes the kind of data that will be collected, as well as how the data will be handled during and after the project. Therefore, it’s important to think about:

What kind of data do I need?
What am I going to do with my data? What do I want to measure? Why am I collecting it?
How will I collect the data? How will it be entered?
What are ethical considerations that might complicate data collection?

In essence, everyone on a research team is responsible for proper data management, but in practice, everyone means no one if responsibility isn’t assigned. Thus, it’s good practice to assign one person on the research team to be responsible for adherence to naming conventions, a minimum level of documentation, version control, and backing up data.

The second step is the collection of the raw data. Again, there are some best practices to keep in mind here:

Variables should be clearly defined so that everyone understands and collects the data in the same way. Cleaning your data can take a large amount of time if data is not collected in a consistent way.
Data collection and entering can be more consistent by using forms with limited options, especially where it concerns categorical data. Don’t forget to code missing data as well. You can store this in codebooks and data dictionaries. These also make it easier to understand and interpret the data for secondary and tertiary parties.
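
To illustrate what a data dictionary with coded missing values could look like, here is a minimal Python sketch. The variable names, the allowed categories, and the missing-data code -99 are all hypothetical and chosen only for this example; your own codebook will look different.

```python
# Hypothetical codebook: each variable lists its meaning, its allowed
# codes, and the code used for missing data (-99 is an assumption).
CODEBOOK = {
    "age_group": {
        "description": "Age group of the respondent",
        "allowed": {1: "child", 2: "teenager", 3: "adult"},
        "missing": -99,
    },
    "bilingual": {
        "description": "Respondent speaks two or more languages",
        "allowed": {0: "no", 1: "yes"},
        "missing": -99,
    },
}


def validate_record(record, codebook=CODEBOOK):
    """Return a list of problems found in one data record."""
    problems = []
    for variable, spec in codebook.items():
        if variable not in record:
            problems.append(f"{variable}: not recorded")
            continue
        value = record[variable]
        # A value is fine if it is the missing-data code or an allowed code.
        if value != spec["missing"] and value not in spec["allowed"]:
            problems.append(f"{variable}: unexpected value {value!r}")
    return problems


good = {"age_group": 3, "bilingual": -99}  # missing value coded explicitly
bad = {"age_group": 7}                     # invalid code, one variable absent

print(validate_record(good))
print(validate_record(bad))
```

Checking records against the codebook while the data is being entered keeps inconsistencies from piling up until the cleaning stage.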

During both the pre-processing and the processing stage, raw data is cleaned up and organized ahead of the analysis phase. During the processing phase, raw data is diligently checked for errors in order to remove or fix redundant, incomplete, or incorrect data. Processing data makes it fit for analysis and covers:

Data cleaning – remove or fix incorrect, corrupt, incorrectly formatted, duplicate, and incomplete data within a dataset.
Data wrangling – remove errors and combine complex datasets to make them more accessible and easier to analyse.
Data formatting – define the structure of the data within a database or filesystem.
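
The cleaning step described above can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the rows are small dictionaries with text values, the field names are made up, and "incomplete" means that a required field is empty. Real datasets usually need more careful rules.

```python
def clean_rows(rows, required_fields):
    """Tidy text values, then drop incomplete and duplicate rows."""
    seen = set()
    cleaned = []
    for row in rows:
        # Formatting: strip stray whitespace from text values.
        tidy = {k: v.strip() if isinstance(v, str) else v for k, v in row.items()}
        # Incomplete: skip rows where a required field is absent or empty.
        if any(tidy.get(field) in (None, "") for field in required_fields):
            continue
        # Duplicates: skip rows identical to one already kept
        # (assumes all values are hashable, e.g. text or numbers).
        key = tuple(sorted(tidy.items()))
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(tidy)
    return cleaned


# Hypothetical raw data with a formatting issue, a duplicate and a gap.
raw = [
    {"id": "1", "country": " NL "},
    {"id": "1", "country": "NL"},
    {"id": "2", "country": ""},
    {"id": "3", "country": "DE"},
]

print(clean_rows(raw, required_fields=["id", "country"]))
```

Whether an incomplete row should be dropped, as it is here, or repaired depends on your research question, so record the choice in your documentation.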

Data analysis covers all the computational and statistical techniques for analysing data in order to gain knowledge or insight, build classifiers or predictors, or infer causality. Examples of this are:

Descriptive statistics
Visualizations
Statistical inference
Data mining
Machine Learning, i.e. the algorithms and methods that underlie artificial intelligence (AI)

Be sure to store the raw, processed, and final data safely and securely. Final data are deposited in data repositories, which preserve the data and make them available for further research.

In the publishing phase, you give access to your research data. This consists of publishing your data, obtaining a DOI for it, and linking it to your article or other research output. Data can be published openly or with restrictions (for example, access via a request form).

There are many reasons to share your data, for instance:

Increases Visibility and Citations: Data sharing often leads to increased visibility of research, potentially resulting in more citations and recognition for the original authors.
Enhances Reproducibility: Transparent data sharing allows others to replicate and verify research findings, increasing the reliability and credibility of scientific results.
Accelerates Scientific Progress: Sharing research data enables other researchers to build upon existing work, accelerating the pace of scientific discovery and innovation.
Meets Funding Requirements: Many funding agencies and journals now require or encourage researchers to share their data as part of the research process.
Promotes Collaboration: Shared data fosters collaboration among researchers, allowing for cross-disciplinary and cross-institutional partnerships that can lead to more comprehensive and impactful research.

In the final phase, your data is ready to be reused by others for further research.

How do I find existing datasets?

There are various resources for finding existing research data.

Before you start your search, it's important to determine the scope. The scope of your research refers to the different dimensions of your research, for example:

the populational dimension (people, adults, children, teenagers, migrants)
the temporal dimension (recent developments, economic history)
the geographical dimension (the Netherlands, the EU, low income countries, etc.)
the type of subjects (large firms or small startups?)

Setting the scope determines to an important extent what sources are available. Sometimes your research question specifically points to a certain scope: it is difficult to research the 1997 Asian financial crisis using only EU data from 2010.

The best place to start looking for data sources is the scientific literature you use to formulate your methodology. You can also check the Hanze library for data sources. Other good places to start your orientation are Google and Re3data.org.

If you're already past the orientation phase and are looking for specific datasets, have a look at data journals and at search engines such as OpenAIRE and DataCite.

If you're looking for large-scale research data, larger (inter)national institutions may provide them.

As research institutions, funders, and publishers nowadays often require researchers to supply their data, the data are often referenced in the publications themselves. This can be done in various ways, although the most common way is to supply a DOI link to the data in the so-called supplementary materials of an article. It can be useful to look at the supplementary materials of the articles you use as your background literature to see whether these data may be useful for your own study.

A DOI, or Digital Object Identifier, is a string of numbers, letters and symbols used to permanently identify an article or document and link to it on the web. A DOI will help your reader easily locate a document from your citation. Think of it like a Social Security number for the article you’re citing — it will always refer to that article, and only that one. While a web address (URL) might change, the DOI will never change.*
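
In practice, a DOI becomes a clickable link by placing it behind the resolver address https://doi.org/. The Python sketch below shows this for DOIs written in different ways. The pattern it uses is a simplification for illustration and not a complete DOI validator, and the DOI in the example is made up.

```python
import re

# Simplified shape of a DOI: "10.", a registrant number, "/", a suffix.
# This is an assumption for illustration, not the full DOI syntax.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")

# Ways a DOI is commonly written in reference lists.
PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def doi_to_url(doi):
    """Turn a DOI (bare, 'doi:'-prefixed or already a link) into a resolver URL."""
    doi = doi.strip()
    for prefix in PREFIXES:
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):].strip()
            break
    if not DOI_PATTERN.match(doi):
        raise ValueError(f"Does not look like a DOI: {doi!r}")
    return "https://doi.org/" + doi


# Hypothetical DOI, used only to show the transformation.
print(doi_to_url("doi:10.1234/example-dataset"))
```

Storing the DOI rather than the landing-page URL of a repository keeps your references stable if the repository reorganises its website.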

Some data may even be published in a specific data journal, which is a peer-reviewed collection of datasets. Data journals are often topic-specific and are an invaluable tool when looking for existing datasets. Unfortunately, not many are available yet, so there might not be a data journal for your area of study. If one does exist, it is often easiest to find it via a Google search. Some examples of data journals are listed below**:

At Hanze UAS, we recommend publishing your data in DataverseNL, as well as registering your dataset in the Hanze research information portal (PURE).

*Source: https://library.uic.edu/help/article/1966/what-is-a-doi-and-how-do-i-use-them-in-citations

**Source: https://libguides.vu.nl/rdm/data-publication

Some licensed data sources are provided by the Hanze library. To access them, follow these steps:

Go to Hanzemediatheek - A-Z databanken
At the top of the page, navigate to Database Types and set it to 'statistics'

This will give you a selection of databases containing peer-reviewed data as well as access to these data through the Hanze library.

It can be useful to use a (data) search engine to look for datasets. Google can be a starting point to get an overview or idea of the existing datasets in your field of expertise. It is important to know how to formulate a search query in order to get usable results.

For instance, if we want to find datasets on bilingualism, searching for only 'bilingualism' will likely not yield the results we want. This is why we use Boolean operators (AND and OR). For instance, we may use a query such as 'bilingualism database OR registry OR archive OR dataset OR statistic*' to get a usable result.
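
If you search for several topics, it can help to build such queries in a consistent way. The Python sketch below combines a topic with a list of OR-ed terms. Writing the AND explicitly and adding parentheses are assumptions made here for clarity: not every search engine interprets operators, parentheses, or the truncation symbol * in the same way, so check the help pages of the search engine you use.

```python
def build_query(topic, data_terms):
    """Combine a topic with OR-ed words that signal the presence of data."""
    # Put multi-word topics in quotation marks so they are searched as a phrase.
    if " " in topic:
        topic = f'"{topic}"'
    alternatives = " OR ".join(data_terms)
    return f"{topic} AND ({alternatives})"


# The data-related terms from the example query in this guide.
DATA_TERMS = ["database", "registry", "archive", "dataset", "statistic*"]

query = build_query("bilingualism", DATA_TERMS)
print(query)
```

Keeping the list of data-related terms in one place makes it easy to repeat the same search for other topics and to document the search strategy you used.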

As such, Google can be used to find existing data or at least the locations of existing data. However, it is not ideal, as there will likely be many hits, but not many indicators of scientific quality, relevance or accessibility.

Another service offered by Google is Google Dataset Search, where you can search for a more specific topic and get a reference to either the supplementary material of an article or a published dataset.

Other data search engines include:

Data repositories are platforms where data can be accessed and archived. Many institutions, among which the Hanze UAS, use DataverseNL to archive datasets. Other platforms include:

These data repositories can be found through Re3data. Re3data is a registry of research data repositories and it provides access to many different kinds of materials. It doesn't allow for the use of Boolean operators (yet), but when you search for a topic, it offers a variety of filtering options on the left side of your screen.

Re3data also indicates the quality and reliability of each result with icons in the upper right corner of the entry.

Aside from using the filters, Re3data also allows users to browse by subject, country, or content type. It is not a search engine for very specific datasets, but it does provide an overview of repositories.

Large (government) organisations that often collect large datasets tend to share these data via their own platform. Sometimes, these data are then also made available to the public. Examples are:

How do I evaluate a dataset?

If you've found a dataset you might want to use for your own research, it is important to evaluate it. When evaluating an existing dataset, there are several important considerations:

Data Quality: Assess the quality of the dataset. Check if the data is consistent, accurate, and complete. Identify missing values, outliers, and inconsistencies that could affect the validity of your results.
Relevance: Determine if the dataset is truly relevant to your research question and objectives. Ensure that the variables and data in the dataset align with the aspects you intend to investigate.
Source and Origin: Understand where the dataset comes from and who the source is. If possible, seek information about how the dataset was collected, the sampling methodology, and any biases that may come into play.
Ethics and Privacy: Ensure that the dataset has been obtained and used in accordance with ethical standards and privacy regulations. If the data contains personally identifiable information, make sure you comply with data protection requirements.
Data Preprocessing: Examine the preprocessing steps that have been applied to the dataset. Understand any transformations, normalizations, or filtrations that have been performed and consider how these steps might impact your analysis.
Representativeness: Evaluate whether the dataset is representative of the population or phenomenon you are studying. If there is selection bias, it could lead to distorted results.
Availability and Access: Check if the dataset is available for use and if there are any restrictions on its use, such as copyright or licensing terms. Many existing datasets are shared under Creative Commons licenses, which specify how the data can be used, shared, and attributed. These licenses provide a clear framework for researchers to understand the permissions and restrictions associated with the dataset, ensuring proper and ethical use.
Data Formats: Review the data format and ensure you have the necessary tools and skills to work with it. This can range from structured data (e.g., Excel, CSV) to unstructured data (e.g., text, images).
Documentation: Look for documentation about the dataset, such as metadata, variable definitions, codebooks, and any issues others have encountered when using the dataset. Metadata describe basic characteristics of the data, such as who created the data and what the data file contains. Metadata make it easier for you and others to identify and reuse data correctly at a later moment. You can provide basic human-readable or advanced machine-readable metadata. In both cases there are standards that you can choose to use.
Capabilities and Limitations: Understand the potential analytical capabilities and limitations of the dataset. Be realistic about what you can achieve with the available data.
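
A first check of data quality, such as counting missing values and spotting outliers, can be automated. The Python sketch below does this for a small CSV file using only the standard library. The column names, the missing-value codes ("" and "NA"), and the outlier rule (values more than 1.5 times the interquartile range outside the quartiles) are assumptions for this example. Always read the documentation of the dataset to find out how missing values are actually coded.

```python
import csv
import io
import statistics


def quality_report(csv_text, missing_codes=("", "NA")):
    """Count missing values per column and list numeric outliers."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        return {}
    report = {}
    for column in rows[0]:
        values = [row[column] for row in rows]
        present = [v for v in values
                   if v is not None and v.strip() not in missing_codes]
        # Keep only the values that can be read as numbers.
        numbers = []
        for v in present:
            try:
                numbers.append(float(v))
            except ValueError:
                pass
        outliers = []
        if len(numbers) >= 4:
            q1, _, q3 = statistics.quantiles(numbers, n=4)
            spread = q3 - q1
            low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
            outliers = [x for x in numbers if x < low or x > high]
        report[column] = {
            "missing": len(values) - len(present),
            "outliers": outliers,
        }
    return report


# Hypothetical dataset with gaps and one suspicious score.
SAMPLE = """respondent,score,country
1,10,NL
2,12,NL
3,11,
4,13,DE
5,12,NA
6,95,DE
7,,NL
"""

report = quality_report(SAMPLE)
for column, findings in report.items():
    print(column, findings)
```

An outlier flagged this way is not necessarily an error. It is a prompt to look at the documentation or to contact the data owner before you decide how to handle the value.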

How do I cite a dataset?

Thinking before acting is often good advice, especially when you have an assignment to do, and even more so when collecting data. Search for suitable datasets that you can use and evaluate each dataset on the key elements. Prepare yourself, organise your research results, and document where you obtained the dataset.

In general, the citations of a dataset follow all the same rules as the citations of other types of research output. Below is an overview of the most common types of citations.

APA

Author, A. (Year). Title of the data set (Version number) [Data set]. Publisher Name. DOI.

MLA

Author. Title of dataset. Publisher name, Date of publication (format DD Month YYYY), Location. DOI/URL of data

Chicago

Author. Title of dataset. Place: Publisher, Year. URL/DOI.

ACS

Author1; Author2; et al. Title of dataset, ver. ##. Publisher, Published date (format Month Date, Year). DOI/URL

AMA

Author1, Author2. Title of dataset. Name of Website. Published date (format Month Date, Year). Updated date (format Month Date, Year). Accessed date (format Month Date, Year). DOI/URL

CSE

Author1, Author2, et al. Dataset title. Publisher Location City (Location State): Publisher; Year [accessed date (format YYYY MMM DD)]. URL/DOI.

IEEE

Author1, Author2, et al., “Title of dataset,” Source, Publication date (Mon. DD, YYYY). [Online]. Available: URL/DOI
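
If you cite several datasets, filling in such a pattern by hand is error-prone. As an illustration, the Python sketch below fills in the APA pattern listed above. The function name and all example values (author, title, DOI) are hypothetical. For real citations, check the result against the official style guide or use a reference manager.

```python
def format_apa_dataset(author, year, title, publisher, doi, version=None):
    """Fill in the APA pattern for a dataset citation."""
    # The version is optional; leave it out when the dataset has none.
    version_part = f" (Version {version})" if version else ""
    # No full stop after the DOI, so the link stays usable.
    return f"{author} ({year}). {title}{version_part} [Data set]. {publisher}. {doi}"


# Hypothetical dataset, used only to show the resulting layout.
citation = format_apa_dataset(
    author="Jansen, A.",
    year=2023,
    title="Example survey data",
    publisher="DataverseNL",
    doi="https://doi.org/10.1234/example-dataset",
    version="1.0",
)
print(citation)
```

The other styles listed above can be handled the same way by changing the order of the elements and the punctuation between them.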

Open Data Handbook

Guides, case studies and resources for government & civil society on the "what, why & how" of open data. Published by the Open Knowledge Foundation.
