What can you expect from the library guide?
Both within and outside Hanze UAS, a vast number of open, readily available datasets cover a variety of disciplines. You may have questions about finding and using such datasets, as well as about making your own data openly available. Finding and using datasets is covered in this library guide; making your own data available is covered in a separate library guide called Publishing and sharing.
Nowadays, research institutions, funders, and publishers require researchers to publish their data as openly as possible. That means others can access and use your data, and you can access and use the data of other researchers. It is therefore important to know where to find existing datasets and how to evaluate their quality and usefulness for your own research. There are various reasons for using existing datasets, for instance:
This library guide covers the following topics:
For whom is this library guide useful?
The library guide ‘discovering existing research data’ is useful for students and researchers of Hanze UAS and can serve as a starting point for exploring open data. The guide is not comprehensive, but it gives you a first insight into searching for open data.
To use and assess existing datasets, it is important to understand the data management lifecycle. Funding agencies require a data management plan as part of their grants, and both funding agencies and publishers often require authors to share, at minimum, the data necessary to understand and assess a manuscript’s conclusions. Proper data management also benefits researchers themselves, as it helps them by:
The first step to collecting high-quality data is data planning. A research question will guide the data collection plan. For funded research, researchers are required to write a data management plan (DMP). Such a plan describes the kind of data that will be collected, as well as how the data will be handled during and after the project. Therefore, it’s important to think about:
In essence, everyone on a research team is responsible for proper data management, but in practice, everyone means no one if responsibility isn’t assigned. Thus, it’s good practice to assign one person on the research team to be responsible for adherence to naming conventions, a minimum level of documentation, version control, and backing up data.
The second step is the collection of the raw data. Again, there are some best practices to keep in mind here:
During both the pre-processing and the processing stage, raw data is cleaned up and organized ahead of the analysis phase. During the processing phase, raw data is carefully checked for errors in order to remove or correct redundant, incomplete, or incorrect data. Processing data makes it fit for analysis and covers:
Data analysis covers all the computational and statistical techniques used to analyse data in order to gain knowledge or insight, build classifiers or predictors, or infer causality. Examples of this are:
Be sure to store the raw, processed, and final data safely and securely. Final data are deposited in data repositories, which manage the research data so that they remain available for further research.
During this phase, you give access to your research data. This consists of publishing your data, obtaining a DOI for it, and relating it to your article or other research output. Data can be published openly or with restrictions (such as a request form).
There are many reasons to share your data, for instance:
Your data is ready to be reused by others for further research.
There are various resources for finding existing research data.
Before you start your search, it's important to determine the scope. The scope of your research refers to the different dimensions of your research, for example:
Setting the scope determines to an important extent what sources are available. Sometimes your research question specifically points to a certain scope: it is difficult to research the 1997 Asian financial crisis using only EU data from 2010.
The best place to start looking for data sources is in the scientific literature you use to formulate your methodology. You can also check the Hanze library for data sources. Other good places to start your orientation towards finding existing datasets are Google and Re3data.org.
If you're already past the orientation phase and are looking for specific datasets, it is worth looking at data journals and at search engines such as OpenAIRE and DataCite.
If you're looking for large scale research data, larger (inter)national institutions may provide them.
As research institutions, funders, and publishers often require researchers to supply their data nowadays, the data are often referenced in the publications. This can be done in various ways, although the most common way is by supplying a DOI link to the data in the so-called supplementary materials of an article. It can be useful to look at the supplementary materials of the articles you use as your background literature to see whether these data may be useful to your own study.
A DOI, or Digital Object Identifier, is a string of numbers, letters and symbols used to permanently identify an article or document and link to it on the web. A DOI will help your reader easily locate a document from your citation. Think of it like a Social Security number for the article you’re citing — it will always refer to that article, and only that one. While a web address (URL) might change, the DOI will never change.*
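As a small illustration of how a DOI links to a document on the web, the sketch below turns a DOI written in a few common notations into a link on the doi.org resolver. The function name `doi_to_url` and the DOI `10.1234/example` are made up for this example; the list of prefixes it strips is an assumption about how DOIs are typically pasted, not an exhaustive rule.

```python
def doi_to_url(doi: str) -> str:
    """Turn a DOI in a common notation into a https://doi.org link."""
    doi = doi.strip()
    # Strip prefixes that are often copied along with the identifier
    # (an illustrative, non-exhaustive list).
    prefixes = (
        "https://doi.org/",
        "http://doi.org/",
        "https://dx.doi.org/",
        "http://dx.doi.org/",
        "doi:",
    )
    for prefix in prefixes:
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
            break
    # Every DOI starts with the directory indicator "10."
    if not doi.startswith("10."):
        raise ValueError(f"not a DOI: {doi!r}")
    return "https://doi.org/" + doi


# Hypothetical DOI, used only to show the notation.
print(doi_to_url("doi:10.1234/example"))
```

Whichever notation appears in a reference list, the resulting link points to the same record, which is what makes a DOI more stable than an ordinary web address.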
Some data may even be published in a specific data journal, which is a peer-reviewed collection of datasets. Data journals are often topic-specific and are an invaluable tool when looking for existing datasets. Unfortunately, not many are available yet, so there might not be a data journal for your area of study. If one does exist, it is often easiest to find it via a Google search. Some examples of data journals are listed below**:
Some licensed data sources are provided by the Hanze library. To access them, follow these steps:
This will give you a selection of databases containing peer-reviewed data as well as access to these data through the Hanze library.
It can be useful to use a (data) search engine to look for datasets. Google can be a starting point to get an overview or idea of the existing datasets in your field of expertise. It is important to know how to formulate a search query in order to get usable results.
For instance, if we want to find datasets on bilingualism, searching for only 'bilingualism' will likely not yield the results we want. This is why we use Boolean operators (AND and OR). We may use a query such as 'bilingualism database OR registry OR archive OR dataset OR statistic*' to get a usable result.
As such, Google can be used to find existing data or at least the locations of existing data. However, it is not ideal, as there will likely be many hits, but not many indicators of scientific quality, relevance or accessibility.
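The query pattern described above (a topic followed by OR-separated words for kinds of data sources) can be assembled programmatically, which helps when you want to repeat the same search for several topics. This is only a sketch: `build_query` is a hypothetical helper, and quoting multi-word phrases is an assumption about how the search engine you use treats phrases.

```python
def build_query(topic, data_terms):
    """Combine a topic with OR-separated terms for kinds of data sources."""

    def quote(term):
        # Wrap multi-word phrases in quotes so they are searched as a unit.
        return f'"{term}"' if " " in term else term

    return f"{quote(topic)} " + " OR ".join(quote(t) for t in data_terms)


data_terms = ["database", "registry", "archive", "dataset", "statistic*"]
print(build_query("bilingualism", data_terms))
# prints: bilingualism database OR registry OR archive OR dataset OR statistic*
```

Keeping the list of data terms in one place makes it easy to reuse the same set of synonyms for every topic in your scope.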
Another service offered by Google is Google Dataset Search, where you can search for a more specific topic and get a reference to either the supplementary material of an article or a published dataset.
Other data search engines include:
Data repositories are platforms where data can be accessed and archived. Many institutions, among which the Hanze UAS, use DataverseNL to archive datasets. Other platforms include:
These data repositories can be found and searched through Re3data. Re3data is a registry of research data repositories and provides access to many different kinds of materials. It doesn't allow for the use of Boolean operators (yet), but when you search for a topic, a variety of filtering options is available on the left side of your screen.
After you select a result of your query, icons in the upper right corner provide information about its quality and reliability.
Aside from using the filters, Re3data also allows users to browse by subject, country, or content type. It is not a search engine for highly specific datasets, but it does provide an overview of repositories.
Large (government) organisations that often collect large datasets tend to share these data via their own platform. Sometimes, these data are then also made available to the public. Examples are:
If you've found a dataset you might want to use for your own research, it is important to evaluate it. When evaluating an existing dataset, there are several important considerations:
Data Quality: Assess the quality of the dataset. Check if the data is consistent, accurate, and complete. Identify missing values, outliers, and inconsistencies that could affect the validity of your results.
Relevance: Determine if the dataset is truly relevant to your research question and objectives. Ensure that the variables and data in the dataset align with the aspects you intend to investigate.
Source and Origin: Understand where the dataset comes from and who the source is. If possible, seek information about how the dataset was collected, the sampling methodology, and any biases that may come into play.
Ethics and Privacy: Ensure that the dataset has been obtained and used in accordance with ethical standards and privacy regulations. If the data contains personally identifiable information, make sure you comply with data protection requirements.
Data Preprocessing: Examine the preprocessing steps that have been applied to the dataset. Understand any transformations, normalizations, or filtrations that have been performed and consider how these steps might impact your analysis.
Representativeness: Evaluate whether the dataset is representative of the population or phenomenon you are studying. If there is selection bias, it could lead to distorted results.
Availability and Access: Check if the dataset is available for use and if there are any restrictions on its use, such as copyright or licensing terms. Many existing datasets are shared under Creative Commons licenses, which specify how the data can be used, shared, and attributed. These licenses provide a clear framework for researchers to understand the permissions and restrictions associated with the dataset, ensuring proper and ethical use.
Data Formats: Review the data format and ensure you have the necessary tools and skills to work with it. This can range from structured data (e.g., Excel, CSV) to unstructured data (e.g., text, images).
Documentation: Look for documentation about the dataset, such as metadata, variable definitions, codebooks, and any issues others have encountered when using the dataset. Metadata describe basic characteristics of the data, such as who created the data and what the data file contains. Metadata make it easier for you and others to identify and reuse data correctly at a later moment. You can provide basic human-readable or advanced machine-readable metadata. In both cases there are standards that you can choose to use.
Capabilities and Limitations: Understand the potential analytical capabilities and limitations of the dataset. Be realistic about what you can achieve with the available data.
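The data quality consideration above (missing values and outliers) can be checked with a few lines of code before you commit to a dataset. The sketch below uses only the Python standard library; the function name `quality_report`, the 1.5 × IQR outlier rule, and the example ages are illustrative choices, not a prescribed method.

```python
import statistics


def quality_report(values):
    """Count missing entries and flag outliers with the 1.5 * IQR rule."""
    present = [v for v in values if v is not None]
    missing = len(values) - len(present)
    # Quartiles of the non-missing values.
    q1, _, q3 = statistics.quantiles(present, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outliers = [v for v in present if v < low or v > high]
    return {"missing": missing, "outliers": outliers}


# Made-up ages; None marks a missing value and 240 is a likely typing error.
ages = [23, 25, 27, None, 24, 26, 25, 240, None, 22]
print(quality_report(ages))
# prints: {'missing': 2, 'outliers': [240]}
```

A flagged value is not necessarily wrong; it is a prompt to consult the documentation of the dataset before deciding whether to keep, correct, or exclude it.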
Thinking before acting is often good advice, especially when you have to do an assignment, and even more when collecting data. Search for suitable datasets that you can use and evaluate the dataset on key elements. Prepare yourself and organise your research results and document where you got hold of the dataset.
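One lightweight way to document where you got hold of a dataset is to save a small note next to the data file. The sketch below writes such a note as JSON; the field names and all example values (title, repository, DOI, licence, date) are hypothetical and do not follow any specific metadata standard.

```python
import json
from datetime import date


def provenance_note(title, source, identifier, license_name, retrieved):
    """Build a small record of where and when a dataset was obtained."""
    return {
        "title": title,
        "source": source,
        "identifier": identifier,
        "license": license_name,
        "retrieved": retrieved.isoformat(),
    }


# All values below are placeholders for illustration.
note = provenance_note(
    title="Example dataset",
    source="Example repository",
    identifier="https://doi.org/10.1234/example",
    license_name="CC BY 4.0",
    retrieved=date(2024, 1, 15),
)
print(json.dumps(note, indent=2))
```

Recording the identifier and retrieval date at download time makes it much easier to cite the dataset correctly later on.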
In general, the citations of a dataset follow all the same rules as the citations of other types of research output. Below is an overview of the most common types of citations.
Author, A. (Year). Title of the data set (Version number) [Data set]. Publisher Name. DOI.
Author. Title of dataset. Publisher name, Date of publication (format DD Month YYYY), Location. DOI/URL of data.
Author. Title of dataset. Place: Publisher, Year. URL/DOI.
Author1; Author2; et al. Title of dataset, ver. ##. Publisher, Published date (format Month Date, Year). DOI/URL
Author1, Author2. Title of dataset. Name of Website. Published date (format Month Date, Year). Updated date (format Month Date, Year). Accessed date (format Month Date, Year). DOI/URL
Author1, Author2, et al. Dataset title. Publisher Location City (Location State): Publisher; Year [accessed date (format YYYY MMM DD)]. URL/DOI.
Author1, Author2, et al., “Title of dataset,” Source, Publication date (Mon. DD, YYYY). [Online]. Available: URL/DOI
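If you cite several datasets, filling in a template by hand becomes error-prone. As a sketch, the function below fills in the first template in the list above (Author, A. (Year). Title of the data set (Version number) [Data set]. Publisher Name. DOI.). The function name and the example values are hypothetical; always check the result against the style guide required for your assignment.

```python
def format_dataset_citation(author, year, title, publisher, doi, version=None):
    """Fill in the 'Author, A. (Year). Title ... [Data set].' template."""
    # The version is optional; leave it out when the dataset has none.
    version_part = f" (Version {version})" if version else ""
    return f"{author} ({year}). {title}{version_part} [Data set]. {publisher}. {doi}"


# Placeholder values that mirror the template wording.
print(
    format_dataset_citation(
        author="Author, A.",
        year=2024,
        title="Title of the data set",
        publisher="Publisher Name",
        doi="https://doi.org/10.1234/example",
        version="1.0",
    )
)
# prints: Author, A. (2024). Title of the data set (Version 1.0) [Data set]. Publisher Name. https://doi.org/10.1234/example
```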