Introduction
This workflow is based on an open approach to science. The driving philosophy behind this approach is that every part of your project should be open and transparent to the greatest extent possible - from the very beginning of the project. For example, you may be collecting data on private land that cannot be released to the public until the private partner has agreed and the project is published. However, your advisor, your project collaborators, and the partners who are stakeholders in your project certainly have a vested interest in understanding what you’ve collected, where you are with data processing and analysis, and accessing the data in a form that is easy for them to use. Even though the data may not yet be fully public at this stage, it is still open science in the sense that you are “showing your work”. A major philosophy behind this workflow is that showing your work is important, and it fosters collaboration and high-quality, reproducible science. “Showing your work” in an open and reproducible way at every project stage also allows others in our lab group or program to learn from and build on your data and code.
Understanding this workflow will take some time. It will likely require you to change the way that you work, sometimes in ways that might seem annoying at the outset. For example, instead of setting up your own spreadsheet on your local computer for the physical samples you collect, this workflow will ask you to build a Survey123 app through ArcGIS Online for all of your physical sample data logging in the field. It will then ask you to write a small R script that takes that Survey123 data and enters it into a master spreadsheet that we keep for all physical samples taken on all projects in our lab. This assigns every one of your physical samples a unique ID number that differentiates it from all of the physical samples taken across all of the projects in our lab, and it links each physical sample to a project and a purpose. This helps avoid the many vaguely labelled Ziploc bags and miscellaneous samples that tend to accumulate, creates a referenceable physical sample archive early in the project, and ensures that your physical samples are available for you and others to use well into the future. That’s just one example.
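To make this concrete, here is a minimal sketch of what that kind of script could look like. Everything in it - the file names, column names, project code, and ID format - is a placeholder for illustration, not the lab’s actual standard.

```r
# Hypothetical sketch: append a Survey123 CSV export to the lab's master
# physical-sample spreadsheet, assigning each sample a unique lab-wide ID.
# File paths, column names, and the ID scheme are placeholders.

survey_export <- read.csv("survey123_export_2024-06-01.csv",
                          stringsAsFactors = FALSE)   # field entries
master        <- read.csv("master_physical_samples.csv",
                          stringsAsFactors = FALSE)   # lab-wide registry

# Build sequential IDs that continue from the last entry in the master sheet.
next_id <- nrow(master) + seq_len(nrow(survey_export))
survey_export$sample_id <- sprintf("LAB-%05d", next_id)   # placeholder ID format
survey_export$project   <- "my_project_code"              # placeholder project code

# Keep only the columns the master sheet expects, then append and save.
shared_cols <- intersect(names(master), names(survey_export))
master <- rbind(master[, shared_cols], survey_export[, shared_cols])
write.csv(master, "master_physical_samples.csv", row.names = FALSE)
```

A short script like this takes minutes to adapt per project, and it is what keeps every bag of soil traceable to a project and a purpose.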
Many other portions of this workflow may ask you to learn a skill that you don’t yet have. Those skills may be related to data management, data architecture, understanding and ensuring that metadata is properly curated, and, certainly, coding. You can learn all of these skills. Yes, you! You are fully capable, even if you have never touched a computer before.
Not only are these skills important for this workflow, but this workflow implements some of the most cutting-edge concepts in data science today. The skills that you develop by implementing this workflow can be listed on your résumé and are highly marketable, sought-after skill sets in virtually any career field today. In short, I’m asking for your full buy-in. It is always tempting to cut corners in the name of speed, but I’m asking you not to do that, and I’m asking you to join me in making the pledge to do things in a standardized way that will lead to greater efficiency and work saved in the future.
Workflow Stages
The sections of this book each outline a single component of an integrated workflow that attempts at every turn to use well-documented, open source tools. The major components of this workflow are:
Project Setup
This includes deciding what constitutes a project (it seems like it should be an easy thing, but it’s actually somewhat difficult), and how to use a template to generate, before you even begin, a file and folder structure for your project that is deeply integrated across platforms. It also includes understanding what “type” of project yours is, as this dictates how you will set the project up and collect and curate data.
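As a purely illustrative sketch of what generating a structure from a template can look like in practice, here are a few lines of R that stamp out a set of folders and a README for a new project. The folder names are examples only, not the lab template itself.

```r
# Illustrative sketch only: one way to create a standard folder structure
# for a new project. The folder names are examples, not the official template.

project_dirs <- c("data/raw", "data/munged", "code", "docs/metadata",
                  "figures", "manuscript")

create_project <- function(root, dirs = project_dirs) {
  for (d in dirs) {
    dir.create(file.path(root, d), recursive = TRUE, showWarnings = FALSE)
  }
  # A README written at setup time is the first piece of project metadata.
  writeLines(c("# Project title", "",
               "Purpose, scope, and data locations go here."),
             file.path(root, "README.md"))
}

create_project("my_new_project")
```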
Once you have your project defined and set up along with the folder structure across platforms, the next step is to begin producing documentation for your project. This is called project metadata, and it is extremely important. Your project metadata gives you and others a general idea of the scope of the project, the motivation behind it, where it is going, when it is happening, and where the final results and data will live.
Not all projects that you begin will reach a final stage; however, all projects should have a project structure and metadata, even if these are nearly empty. In fact, setting up a project with even a trivial amount of metadata and data structure makes it easier to pick that project up again in the future. This includes creating a badge or sticker that defines and brands each project. Although this might seem completely trivial, it actually helps you to view your project as an entity that is ready to be discussed and shared - even in its infancy - by others, by your collaborators, or by your peers. It’s a fun way to stake your claim on a piece of ground that you are working on. There’s a whole section in this book on how to design and create a project sticker.
“Pre”-Data
Before you start collecting data, and before you get to the point of looking at the raw data you have collected, it’s important to set up pre-processes that standardize your data input and make your data more easily retrievable in the future. This pre-data workflow includes setting up a Survey123 app for data collection and designing data sheets for field projects, and setting up meta-analysis structures or data forms for non-field projects.
Raw Data Collection
In this stage, you generate your raw data. As you collect your data, you will use your pre-data structure and tools to ensure that all of your raw data is captured, archived, and organized as you go. This ensures the highest quality data and also serves as an insurance policy against data loss - no more unlabeled bags, photos, or data sheets and notebooks that don’t contain the relevant information. Your pre-data steps and pre-data structure will ensure that this doesn’t happen.
Data Munging
In the data munging stage you conduct QA/QC and further curate your raw data in a transparent and reproducible manner, and make decisions about how to best process your raw data to feed into the data analysis stage.
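To illustrate what transparent and reproducible can mean at this stage, here is a hypothetical R sketch that flags questionable values instead of deleting them; the file paths, column names, and thresholds are invented for the example.

```r
# Hypothetical QA/QC sketch: flag, rather than silently delete, values that
# fall outside a plausible range, so every decision about the raw data stays
# visible and reversible. Paths, column names, and thresholds are placeholders.

raw <- read.csv("data/raw/soil_cores.csv", stringsAsFactors = FALSE)

raw$qc_flag <- ifelse(raw$bulk_density < 0.5 | raw$bulk_density > 2.2,
                      "check: bulk density outside plausible range",
                      "ok")

# Write the flagged (munged) version to a separate folder; never overwrite raw data.
write.csv(raw, "data/munged/soil_cores_qaqc.csv", row.names = FALSE)
```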
Data Analysis and Interpretation
The data analysis stage is probably the longest stage, and in fact it may start before you even have all of your data. It involves taking your pre-processed (“munged”) data and analyzing it in a transparent and reproducible way. This means that you will need to learn how to code. In my lab group we primarily use R. R is extremely flexible and widely used in soil science. Other scientific fields prefer Python, and you may have a project where a different programming language makes more sense. That’s OK. But if you don’t know where to begin or what to use, I would suggest R. The data analysis step ends when you have produced tables or figures that allow you to begin interpreting your data and your results. Note that your tables and figures may not be final for a very long time - you will likely be engaged in cycles of data analysis, writing, and further data analysis as you refine your interpretations. But it’s important to start somewhere, and if you document your data analysis in a reproducible and transparent way, you won’t duplicate effort down the road and you will learn from all of your dead ends, particularly when it comes to coding.
Setting poorly written or poorly managed code aside for even a week or two can cost you many hours when you come back to it. The goal is to have everything so well documented that you can hop back into a project and within 30 minutes be moving along from the place where you left off. This involves documenting many stages of your data analysis, which can sometimes seem to slow you down while you are doing it. However, I can confidently say that any time spent documenting and logging during data analysis is repaid by an order of magnitude when you inevitably go back into your scripts and your data for the 20th, 30th, or 40th time and try to make sense of it all. Do it right the first time, and even your dead ends will remain well-documented learning experiences that allow you to move forward on a different path with your analysis.
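One low-effort way to do this is to keep a running analysis log that you append to at the end of every work session. The sketch below is just one possible approach; the file name and entry format are suggestions, not a lab standard.

```r
# Minimal sketch of a session logger: records the date, a short note about
# what you did and decided, and the session info needed to reproduce it.
# The log file name and format are suggestions only.

log_analysis <- function(note, logfile = "analysis_log.md") {
  entry <- c(
    paste0("## ", Sys.Date()),
    note,
    "",
    "Session info:",
    paste0("    ", capture.output(sessionInfo())),  # indented = code block in Markdown
    ""
  )
  cat(entry, file = logfile, sep = "\n", append = TRUE)
}

log_analysis("Re-ran bulk density QA/QC; flagged two duplicate cores (see script 02).")
```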
Writing
This section is about both the tools that we use for writing and the philosophy of writing itself. Writing can be one of the most challenging parts of the job. It doesn’t have to be. We know a great deal about the philosophy of writing, and there are many tools to improve our writing and productivity. We can all learn to be better writers, we can all be more productive writers, and this section aims to help you understand the philosophy of getting there. In a more practical sense, I hope to give you some tools that allow you to write in a more reproducible and transparent way and reduce writing and work stress. Part of that means being more efficient at writing down and logging everything.
This can be accomplished through the use of a personal knowledge management (PKM) system. My systems of choice are Obsidian (digital) and a Bullet Journal (analog). In addition to a PKM, getting into the habit of documenting writing notes and ensuring the long-term viability of the notes that you create is extremely important. It means you will write faster, because you will never start with a blank page again. You will always start writing from a structured outline, notes, thoughts, or other material generated in the past and consolidated in your PKM - a trusted system where you don’t have to worry about losing thoughts or progress, and which is available to you whenever you choose to start working on the writing for your project.
The reality is that writing is not a discrete stage of a project that starts after data analysis. In this workflow, writing begins with project definitions and project metadata and doesn’t stop until the project or manuscript is submitted. Writing is continuous - you will revise things 20 or 30 times. Coding is also writing - in fact, the best examples of well-written code are readable not just by machines but also by humans.
Most science is collaborative, and nowadays writing will likely occur with collaborators - at a minimum with me as your advisor, and often with a group of scientists on a small project or a large group of people working on a large project. It’s important to use tools for version control, tools for real-time collaboration, and tools for rendering your documents so that you don’t have to spend time formatting. Formatting can take forever, but with modern tools like Markdown and publishing tools like Quarto, we can, from the same plain-text document, create beautiful websites that can be referenced and cited by DOI, create journal article submissions in PDF form that include all tables and figures, and house that same document in our PKM system, which keeps it searchable and usable to build on in other projects in the future.
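For example, assuming Quarto and the quarto R package are installed (plus a LaTeX distribution for PDF output), rendering the same plain-text source to multiple outputs is a couple of lines of R; the file name below is a placeholder.

```r
# Sketch: render one Quarto source document to two output formats.
library(quarto)

quarto_render("manuscript.qmd", output_format = "html")  # shareable web page
quarto_render("manuscript.qmd", output_format = "pdf")   # journal-style PDF
```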
NOTE: these discrete project stages overlap significantly (except, for example, project definition and project management). Writing and data analysis often take place at the same time because we “write to learn”.
TODO: how to take notes (strategies - smart notes), revisit and improve your metadata, writing philosophies and writing to learn.
NOTE: keep a data analysis log - keep a writing log
Long-Term Data Archiving and Storage
GitHub is the main repository for code and data from projects in this lab, but any data generated, including spatial data, should also go into a major database hosted or sponsored by a stable entity. Long-term archiving also includes packaging up your entire project structure, which by now should be standardized based on this lab workflow, so that anybody in our lab, or anybody reading this documentation, knows exactly what each folder means, where to find certain files, and how to run the code, even 10 or 20 years into the future. This workflow is designed to be as future-proof as possible.
Note that although these major components are listed in chronological order, they will not necessarily occur as distinct stages in your project. For example, data analysis may begin with preliminary data and continue as you are writing drafts of manuscripts, AND writing occurs across all of these components - writing actually begins with project setup, as you will see. If you have a project idea, you can start writing! This means you never have to start with a blank page.
To execute this standardized workflow as described, it will be necessary for you to be familiar with and utilize the following tools:
- ArcGIS Online: Survey123 (Proprietary - Free account available through UMN)
- Obsidian (Open source)
- Zotero (Open source)
- QGIS (Open source)
- R/RStudio (Open source)
- GitHub/GitHub Desktop (Proprietary - Free account available to anyone)
- Google Drive (Proprietary - Free account available to anyone, unlimited storage available through UMN)
- InkScape (Open source)
- HackMD (Proprietary - Free account available to anyone)
- note to potentially include social professional sites like ResearchGate and Google Scholar + ORCID + X?
These are not a randomly selected group of tools - they are chosen at every turn to allow deep integration and (as much as possible) free and open source usage. Not all of these tools are open source (ArcGIS Online, GitHub, Google Drive, HackMD), but with the exception of ArcGIS Online, all of them have a free base tier that will be sufficient to execute this workflow. Additionally, because of UMN’s enterprise license, ArcGIS Online and its tools are available to you for free if you are in our graduate or undergraduate programs.
These tools will require you to be fluent in the following languages: