Chapter 4 Data management
Closely related to organization is data management. Many graduate research assistants have data management responsibilities in addition to data analysis. Oracle defines data management as the “…practice of collecting, keeping, and using data securely, efficiently, and cost-effectively.” Based on these notes from the University of Mass., here are the general components of data management:
Write a data management plan
Specify a explicit set of operating procedures
Create a data dictionary
Determine a data storage location
Develop an analysis plan
Archive all of your procedures
I expand on each of these items below:
4.0.1 Write a data management plan
As soon as you receive a new project, you should establish your organization system (refer back to 2). After this, your next step should be to spend some time contemplating a data management plan. This plan should describe how you will keep track of all of the data so that you can explain each step of your analysis process.
If collected data comes to you (especially if it comes via an Excel spreadsheet), you should immediately make a copy of this data. You should never edit the original data you receive - do not do any data cleaning or re-formatting. Instead, do all of your work on the copy, so that you can always refer back to the original data.
If you are planning a study with yet-to-be- collected data, spend some time thinking about how the data will get to you.
Will the data come from electronic records? If so, save the SQL query/other code used to extract the records.
Will the data come from a chart review (e.g., a doctor will manually collect data from patient records)? If so, I recommend making a template for the doctor (or whoever is doing the chart review) to fill in.
Think about each person who will interact with the data, and have a flow-chart in your mind that traces exactly how this data will come to you. This will help you when you write the “Methods” section of your paper/poster, and will keep you from making mistakes in your analysis downstream.
4.0.2 Specify a explicit set of operating procedures
If you are in a situation where multiple people will handle the data (as in the chart review scenario), you should go beyond ‘having a flow chart in your mind’ – in cases like these, write down who will do what with the data at each step of data collection. Send this written document to everyone on the team, and make sure everyone agrees/is on the same page.
If you are in the chart review scenario, instruct the person doing chart review on how to collect data that is in a consistent form. Remember that the computer will recognize these responses as four different values: “Yes”, “yes”, “y”, ✅.
4.0.3 Create a data dictionary
Create a spreadsheet with all the variables in the study listed in one column, and their definitions listed in the next column. As you write your scripts for analysis, you may even want to add a third column that lists the name you used in the script for that variable. The US Dept. of Agriculture offers some best practices on organizing a data dictionary.
4.0.4 Determine a data storage location
Where you keep your data matters. This is of particular importance when working with data protected by HIPPA (e.g., health records data). As a general principle, don’t store data on your personal computer. Store it either online (e.g., a secure OneDrive folder, a GoogleDrive folder, an AWS location) or in an external drive (e.g., a company or university-owned secure shared drive).
4.0.5 Develop an analysis plan
At the outset of a project, outline the data analysis tools you think you will use based on the kind of data you expect to receive. Begin with the end in mind. Ask your collaborators these questions:
What is your research question?
What kind of interpretations/generalizations you want to be able to make at the end of the project?
4.0.6 Archive all of your procedures
Document what you have done for data cleaning and analysis. This documentation can include:
Well-organized, well-commented code (well-commented does not necessarily mean more comments; in fact comments should not explain what the code is doing).
Stick to your file structure, using informative file names.
Save a copy of every file you send out for others to read. Even as you update versions of a report, save one copy of each version you have shared.
4.1 U. Iowa data sharing policy
This document has the University of Iowa official documentation on data security guidance. I recommend any U. Iowa GRA to become very familiar with this document; the principles may also be helpful for those at other institutions.