“Pre-Data” and Raw Data: Setting Up and Executing Raw Data Collection and Curation Infrastructure for Different Project Types
Defining “Pre-Data” and Raw Data
“Pre-data” here refers to the foundational work and infrastructure that prepare your project to receive data. This work is critical to producing high-quality datasets. It also helps you plan data collection by making you think through each piece of data you need to collect and why you are collecting it. “Pre-data” work facilitates the receipt and organization of raw data. Data is the information that drives a project and is used to solve a problem, answer a question, or achieve a result.
Raw data is data in its original form: unfiltered and unprocessed, with potential errors, outliers, and duplicates. Without “pre-data” infrastructure, there is no structured repository for raw data. The quality of your pre-data infrastructure depends on how well you have thought through your project and the types of information that data collection will generate.
Examples of pre-data infrastructure are:
- A digital form for collecting information about samples or sampling sites
- A hard copy data sheet for recording information about soil samples or soil morphology in the field
- An image repository or image organizing system (such as the native “Photos” app on iOS)
- Laboratory notebooks or spreadsheets for recording data generated in the lab
- Literature review protocols
- Literature review tools or apps
- Data platforms such as GEMS
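As an illustration of the first example in this list, a digital form for sample information can be sketched as a small record definition with a few checks that run before a record is saved. This is a minimal sketch, not a prescribed lab standard: the field names (sample_id, depth_top_cm, and so on), the validation rules, and the CSV layout are all assumptions chosen for illustration, and you should replace them with the information your own project needs to collect.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields
from datetime import date


@dataclass
class SampleRecord:
    """One row of a hypothetical digital sample form (field names are illustrative)."""
    sample_id: str
    site_name: str
    collected_on: date
    latitude: float          # decimal degrees
    longitude: float         # decimal degrees
    depth_top_cm: float
    depth_bottom_cm: float
    collector: str
    notes: str = ""


def validate(record: SampleRecord) -> list[str]:
    """Return a list of problems; an empty list means the record passed these checks."""
    problems = []
    if not record.sample_id.strip():
        problems.append("sample_id is empty")
    if not -90 <= record.latitude <= 90:
        problems.append("latitude out of range")
    if not -180 <= record.longitude <= 180:
        problems.append("longitude out of range")
    if record.depth_top_cm < 0 or record.depth_bottom_cm <= record.depth_top_cm:
        problems.append("depth interval is invalid")
    return problems


def to_csv(records: list[SampleRecord]) -> str:
    """Serialize records to CSV text with one column per form field."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(SampleRecord)])
    writer.writeheader()
    for record in records:
        row = asdict(record)
        row["collected_on"] = record.collected_on.isoformat()  # store dates as YYYY-MM-DD
        writer.writerow(row)
    return buffer.getvalue()


# Hypothetical example record; the values are made up for illustration only.
example = SampleRecord(
    sample_id="S-001", site_name="Example Site", collected_on=date(2024, 5, 1),
    latitude=44.98, longitude=-93.18, depth_top_cm=0, depth_bottom_cm=15,
    collector="A. Researcher",
)
```

Writing the form down this explicitly forces the question of why each field is collected, and the checks catch obvious entry mistakes at the moment of recording rather than during later curation.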
Project Types
Projects come in many shapes and forms. For the purposes of this book, we will use four main project categories, which should fit most projects in our lab. These project types refer to where the major portion of the raw data comes from:
- Field projects
- Laboratory projects
- Review projects
- Data synthesis projects
A project may contain components of several of these categories, but generally speaking, a single category should best fit your project and can serve as a template for setting up your pre-data infrastructure. You may also combine tools from multiple categories as appropriate for your project.
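The rule of thumb above, that a project's type follows from where the major portion of its raw data comes from, can be sketched in a few lines. This is only an illustration: the function name primary_project_type and the idea of expressing each source as a numeric share are assumptions made for the example, not a lab procedure.

```python
from enum import Enum


class ProjectType(Enum):
    """The four project categories used in this book."""
    FIELD = "field"
    LABORATORY = "laboratory"
    REVIEW = "review"
    DATA_SYNTHESIS = "data synthesis"


def primary_project_type(raw_data_shares: dict[ProjectType, float]) -> ProjectType:
    """Return the category that contributes the largest share of raw data.

    raw_data_shares maps each category to a rough, self-estimated share of the
    project's raw data (a hypothetical input; any consistent unit works).
    """
    if not raw_data_shares:
        raise ValueError("at least one raw data source is required")
    return max(raw_data_shares, key=raw_data_shares.get)


# Hypothetical project: mostly fieldwork, with some lab analysis and a short review.
shares = {
    ProjectType.FIELD: 0.6,
    ProjectType.LABORATORY: 0.3,
    ProjectType.REVIEW: 0.1,
}
template = primary_project_type(shares)
```

Here the field category would serve as the template, while tools from the laboratory and review categories can still be borrowed for the smaller components.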
Field Projects
I’ll define field projects here as projects that require fieldwork to collect novel physical samples or morphological data. These projects require travel of some sort to collect raw data. Raw data for these projects typically consists of physical samples and the morphological data recorded in the field.
Laboratory Projects
The raw data for laboratory projects consists of results from laboratory analyses of existing soil samples or of samples consolidated from collaborators.
Review Projects
The raw data for review projects is the literature itself (peer-reviewed, gray, or unpublished).
Data Synthesis Projects
Data synthesis projects consolidate published and unpublished data from other scientists and collaborators.
Additional Resources
- USGS::Data Management - Quality by Design: Recommended Practices
- USGS::Data Management - Domain Management: Recommended Practices