Author:
Gabriela Urm

Data Management

When planning research, carefully consider and document how data will be collected and processed during the project, who has access to the data and who is responsible for them, and what will happen to the data after the project ends.

To do all this, create a data management plan and follow it throughout the project. A useful tool for this is DMPonline.

You can read more about open data and data management plans in the open course "Research Data Management and Publishing" prepared by the UT Library.

DATA COLLECTION
  • collect the data yourself
  • (re)use your own previously collected data
  • use public open data (Estonian Open Government Data Portal)
  • (re)use data collected by others (re3data)
  • purchase the data
  • keep in mind:
    • which version of data you reuse or purchase
    • what if the author of the data uploads a new version
    • store the version used and the vendor documentation on your server
    • check copyrights, licenses, restrictions (access, reuse)
    • check machine readability and interoperability with the planned information system
  • data types (experiment, observation data, survey data, video files, etc.)
  • how new data integrates with existing data
  • which data deserve long-term preservation
  • if some datasets are subject to copyright or intellectual property rights, show that you have permission to use the data 
  • estimate the data volume at the end of the project; it affects several aspects:
    • preservation
    • access
    • backup
    • data exchange
    • hardware and software
    • technical support
  • name the existing standard procedures and methods
  • are there any data standards available
  • how to ensure data quality (availability, integrity, confidentiality)
  • how do you handle errors (input errors, problematic values)
  • use open source software when possible
  • open source software keeps hardware and software costs low
  • interoperable with other open source software
  • the software is developed and supported by a large community (higher quality, security and modernization; unfortunately, limited documentation and support)
  • the software should make it possible to repeat the data analyses carried out
  • documentation when new software is created
  • provide technical support for tailored software
  • version control system Git
  • cloud-based code hosting service GitHub
  • open source software licenses
  • be systematic and consistent
  • naming files: simple, logical, without abbreviations or with standard abbreviations (countries, languages, units of measurement, methods)
  • abbreviations in one language throughout
  • file organization (options: project name, time, place, collector, material type, format, version)
  • folder structure should be hierarchical, simple, logical, short
  • copying files to multiple locations is not a good practice; store in one location, create shortcuts
  • metadata (who is responsible for adding metadata)
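The file-naming advice above can be sketched in Python; the convention, project code, place abbreviation, and dates below are hypothetical examples, not a prescribed standard:

```python
from datetime import date
from pathlib import Path

def data_file_name(project: str, place: str, collected: date,
                   material: str, version: int, ext: str) -> str:
    """Build a simple, logical file name from standard parts.

    Hypothetical convention: project_place_date_material_vNN.ext
    Abbreviations should come from one language and one standard list.
    """
    return (f"{project}_{place}_{collected:%Y%m%d}_"
            f"{material}_v{version:02d}.{ext}")

name = data_file_name("soilproj", "EE-TAR", date(2023, 5, 17), "survey", 1, "csv")
print(name)  # soilproj_EE-TAR_20230517_survey_v01.csv

# Hierarchical, short folder structure: one master location, no copies.
path = Path("soilproj") / "2023" / "survey" / name
print(path)
```

Encoding the convention in one small function keeps names consistent across a team and makes the version number explicit in every file.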
DOCUMENTATION AND METADATA
  • use this guide for data documentation:  
    • Siiri Fuchs, & Mari Elisa Kuusniemi. (2018, December 4). Making a research project understandable - Guide for data documentation (Version 1.2). Zenodo. DOI: https://doi.org/10.5281/zenodo.1914401  
  • README text file is included with the data files and should contain as much information as possible about the data files to allow others to understand the data    
    • create one README.txt file for each dataset
    • always name it as README.txt or README.md (Markdown), not readme, ABOUT, etc.
  • README.txt file should contain the following information:  
    • title of the dataset
    • dataset overview (abstract)
    • file structure and relationships between files
    • methods of data collection
    • software and versions used
    • standards 
    • specific information about data (units of measurement, explanations of abbreviations and codes, etc.)
    • possibilities and limitations of data reuse
    • contact information for the uploader of the dataset
    • Guidelines for creating a README file
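The README checklist above can be turned into a small script that generates a README.txt skeleton. The template wording, field values, and the write_readme helper are illustrative, not part of any official UT template:

```python
from pathlib import Path

# Field names mirror the checklist above; the values filled in below
# are placeholders for a hypothetical dataset.
README_TEMPLATE = """\
Title of the dataset: {title}

Dataset overview (abstract):
{abstract}

File structure and relationships between files:
{files}

Methods of data collection: {methods}
Software and versions used: {software}
Standards: {standards}
Specific information (units, abbreviations, codes): {specifics}
Possibilities and limitations of reuse: {reuse}
Contact information: {contact}
"""

def write_readme(folder: Path, **fields: str) -> Path:
    """Create one README file per dataset, named exactly README.txt."""
    path = folder / "README.txt"
    path.write_text(README_TEMPLATE.format(**fields), encoding="utf-8")
    return path

out = write_readme(
    Path("."),
    title="Example survey dataset",
    abstract="Placeholder abstract.",
    files="data.csv - raw responses; codebook.txt - variable codes",
    methods="online survey",
    software="R 4.3.1",
    standards="ISO 8601 dates",
    specifics="temperatures in degrees Celsius",
    reuse="CC-BY 4.0, no known limitations",
    contact="dataowner@example.org",
)
print(out.read_text(encoding="utf-8"))
```

Keeping the template in one place means every dataset's README covers the same fields, which is exactly what reusers need to understand the data.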
ETHICS AND LEGAL COMPLIANCE
  • describe here whether the project collects personal data and how it is processed in accordance with the General Data Protection Regulation and the Estonian Personal Data Protection Act
  • who owns the data (personal and proprietary rights). Data always has an owner, even if it is open data
  • how data is licensed
  • Creative Commons

Excerpts from the intellectual property rights instructions prepared by UT lawyer Reet Adamsoo. These excerpts are recommended for use in data management plans:

  • The data belong to the University of Tartu. Persons employed under the grant assign the proprietary rights to the results of the research (including the data) performed under the grant agreement to the University, either through the Employment Contract (academic employees) or through another written document (Act of Assignment of the Intellectual Property Rights)
  • Data will be disclosed under the Creative Commons license CC-BY 4.0
  • A third party whose data have been used for creating the results of the grant may set restrictions on the use of the data. In that case, those restrictions must be observed when the data are licensed, i.e. the University can license the data only within the scope of rights it has received from the third party
  • If the University or a third party whose data have been used for creating the results of the grant wants to submit a patent or utility model application, publication of the data must be postponed until the application has been submitted
STORAGE AND BACKUP
  • goal is to maintain data quality: 
    • availability and accessibility
    • integrity (correctness, completeness and timeliness)
    • confidentiality (only available to authorized persons or systems, key management, storage of log files) 
  • preservation: cloud environments, central servers, sensitive data servers, computer hard disk, external hard disk, mobile devices 
  • files containing personal data may not be stored in cloud environments whose legal headquarters are outside the European Union (e.g. Dropbox, Google)
  • backup: creating a copy of the current state of the data and/or programs so that, after a security incident, they can be restored to a known good state
    • how often backups are made, how many copies, whether the work process is automated  
    • maintaining and backing up the master file
    • rule 3-2-1 (store your data in 3 copies on 2 different storage devices, of which 1 is kept off-site)
    • who is responsible, especially for mobile devices 
  • carry out a risk analysis: what if....  
    • IT systems are down
    • power outages, water and fire accidents
    • the device is lost or stolen
    • malware is discovered in devices
    • a team member leaves or dies, etc.
  • risk weighing (probability and losses) 
  • risk assessment: threats and their likelihood, weaknesses, measures  
  • information security standard ISO/IEC 27001
  • UT Helpdesk
  • UT Cybersecurity
  • Data storage and backup at the UT
  • who is responsible?
  • management of access rights (the same for everyone, contractual rights, temporary rights for temporary staff)
  • storing log files
  • pseudonymization, encryption, key management
  • data exchange, personal data, third countries
  • organizational and physical security: training of a new employee, possible problems with the outgoing workers, internal rules of procedure, fire safety, locking the doors
  • who is responsible for information security
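One practical way to verify integrity across backup copies (the 3-2-1 rule above) is a checksum manifest stored alongside each copy. This Python sketch, with hypothetical file names, is an illustration, not a UT-mandated procedure:

```python
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def make_manifest(folder: Path) -> dict[str, str]:
    """Map each file's relative name to its checksum.

    Store the manifest with every backup copy; after a restore,
    recompute and compare to detect silent corruption.
    """
    return {str(p.relative_to(folder)): sha256_of(p)
            for p in sorted(folder.rglob("*")) if p.is_file()}

# Demo with a throwaway folder and hypothetical data:
demo = Path("demo_backup")
demo.mkdir(exist_ok=True)
(demo / "data.csv").write_text("id,value\n1,42\n", encoding="utf-8")
manifest = make_manifest(demo)
print(manifest)
```

Comparing manifests from two copies is then a simple dictionary equality check, which can be automated as part of the backup process.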
LONG-TERM PRESERVATION OF DATA
  • what data has long-term value? Preserving and sharing it for reuse
  • preparing data for sharing, FAIR data
  • repository selection
  • the data have a permanent identifier DOI
  • metadata is in the DataCite registry
  • use standard metadata such as Dublin Core, or other standards
  • machine-readable metadata
  • data and relevant metadata are in separate files but linked
  • keywords and subject terms
  • version management 
  • choose the repository where the data is stored
  • which data is open access, i.e. open data
  • which data will remain closed and for what reason
  • metadata must be open even when the data is not open (exceptions like rare species location)
  • technical metadata: required software (version), instrument specifications, software tools 
  • are there any encrypted data  
  • authentication, whom to ask for access rights 
  • is it necessary to create a user account that is linked to certain conditions
  • mainly the task of the repository
  • what data and metadata standards, controlled vocabularies and taxonomies are used
  • description of data types: if not standard, how interoperability is ensured
  • linking to other data, metadata, and specifications  
  • correct reference to the datasets used
  • always provide a recommended citation format with your dataset
  • data exchange standards
  • partly a task of the repository
  • add README.txt file
  • is it raw, cleaned or processed data
  • embargo period, grounds
  • licenses 
  • citing: DataCite Citation Formatter
  • standard metadata, which (domain) standards are used
  • provenance of the data (who, where, what, when, published)
  • which software version is used
  • how long is the data available for re-use
  • data quality assurance (availability, integrity, confidentiality)
  • suggestions who might need this data (in README.txt) 
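A minimal machine-readable metadata record using Dublin Core element names might look like the following Python sketch; all values (title, DOI, creator, keywords) are placeholders, and the choice of JSON is one option among several serializations:

```python
import json

# Dublin Core element names; every value below is a hypothetical
# placeholder, including the DOI.
record = {
    "dc:title": "Example survey dataset",
    "dc:creator": "Researcher, Example",
    "dc:identifier": "https://doi.org/10.xxxx/example",
    "dc:date": "2023-05-17",
    "dc:subject": ["soil", "survey"],  # keywords / subject terms
    "dc:description": "Placeholder abstract.",
    "dc:rights": "CC-BY 4.0",
    "dc:format": "text/csv",
}

# Keep metadata in its own file, linked to the data by the identifier.
metadata_json = json.dumps(record, indent=2)
print(metadata_json)
```

Keeping the record in a separate, machine-readable file (linked to the data through the persistent identifier) is what makes the metadata harvestable even when the data itself stays closed.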
DATA SHARING
  • is the data shared in a repository, or as a supplementary data of an article, or as a separate data article in a data journal
  • in which repository is the data stored
  • who might find this data useful
  • how do you share your data (open data, or you have to ask for data)
  • when do you share (at once, after publication of the article, after embargo period)
  • is the data linked to a publication
  • link to your ORCID account
  • which data is open access, open data
  • which data will remain closed and for what reason
  • any encrypted data
  • authentication, who gives access rights and concludes contracts
  • contact details of the data owner (think about the long term!)
RESPONSIBILITIES AND RESOURCES
  • by positions 
    • principal investigator (PI): Data Management Policy, DMP, contracts, costs, training
    • researchers: follow and improve DMP, data management, problem solving
    • data manager: training, consulting, information security, backup, hardware and software
    • laboratory assistant, support staff: according to their tasks  
  • by workflow 
    • who is responsible for data collection, documentation, metadata, data security, etc. 
  • example  
  • costs are mainly related to manpower, hardware and software
  • guides, training, lawyer and/or DPO consultation, translation service
  • APC
  • data collection: purchase of data, transcription of recorded interviews
  • digitization and OCR: hardware and software, manpower
  • software development or software purchase, user licenses
  • hardware: computers, servers, instruments, field work equipment
  • data analysis: hardware and software, outsourced services
  • data storage and backup: predictable data volume, rule 3-2-1
  • long-term storage of data: preparation for sharing (formatting), anonymisation
  • data storage in a repository
  • partner meetings, conferences
  • project data manager
  • rule of thumb: about 5% of the project budget