Data Munging (also called data wrangling)

Data munging, as defined here, is the process of turning raw data into a processed form that is usable in analysis. It consists of format standardization and Quality Assurance and Quality Control (QAQC).

Other references for what is included in data munging:

Astera::What is data munging and why is it important?

NOTE: could metadata writing also be considered data enrichment?

Alteryx::Data Munging

NOTE: Add unique IDs! How should IDs be structured for sample vs. horizon vs. pedon vs. site?

Geeks for Geeks::Data Munging in R Programming

Samantha Oliver - USGS Water Data::Beyond Basic R - Data Munging

NOTE: Gap-filling, imputation, etc.

TJ Murphy::Reproducible Data Munging in R

Daniel Dauber::R for Non-Programmers - Data Wrangling NOTE: this is VERY good

NOTE: Some data munging can be done at import time through the arguments of read.csv: https://www.geeksforgeeks.org/read-contents-of-a-csv-file-in-r-programming-read-csv-function/
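As a minimal sketch of the read.csv note above, the call below handles several munging steps at read time. The inline CSV text, the column names, and the missing-value codes ("-999", "NA", and the empty string) are all hypothetical; substitute whatever the raw file actually uses.

```r
# Hypothetical raw data, written inline so the example is self-contained.
raw_text <- "site id,depth_cm,texture
A1,10,loam
A2,-999,clay
A3,25,"

dat <- read.csv(
  text = raw_text,
  header = TRUE,                     # first row holds the column names
  na.strings = c("", "NA", "-999"),  # assumed codes to read as missing
  colClasses = c("character", "numeric", "character"),  # one class per column
  stringsAsFactors = FALSE,          # keep text as character, not factor
  check.names = TRUE,                # make names syntactically valid ("site id" becomes "site.id")
  strip.white = TRUE                 # trim whitespace around unquoted fields
)

str(dat)
```

Setting colClasses explicitly means a stray non-numeric value in a numeric column raises an error at import instead of silently turning the whole column into text, which may or may not be the behavior you want for a first look at a messy file.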

List of questions:

Is the file in a text format (separated values, preferably csv)?

Does the first row contain column names?

Are there special characters, spaces, or other weirdness in the column names? If so, rename the columns.

Are there missing values in the dataset? If so, decide if/how they should be dealt with.

Are the data types (column classes) specified for each column?

Factors vs characters and
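The questions above can be worked through in base R roughly as follows. This is only a sketch: the data frame, its column names, the "n/a" missing-value code, and the choice of which column becomes a factor are hypothetical, and the naming rule (lower case with underscores) is one possible convention rather than a lab standard.

```r
# Hypothetical data frame standing in for a freshly imported file.
dat <- data.frame(
  `Site ID` = c("A1", "A2", "A3"),
  `Depth (cm)` = c("10", "n/a", "25"),
  Texture = c("loam", "clay", "loam"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# 1. Column names: lower-case, replace runs of non-alphanumeric characters
#    with "_", then trim leading and trailing underscores.
clean <- tolower(names(dat))
clean <- gsub("[^a-z0-9]+", "_", clean)
clean <- gsub("^_+|_+$", "", clean)
names(dat) <- clean

# 2. Missing values: recode the assumed text code "n/a" to NA,
#    then count missing values per column.
dat[dat == "n/a"] <- NA
na_counts <- colSums(is.na(dat))

# 3. Data types: depth can be converted to numeric once the text code is gone.
dat$depth_cm <- as.numeric(dat$depth_cm)

# 4. Factor vs character (assumed rule of thumb): a small, fixed set of
#    categories becomes a factor; free-form labels such as IDs stay character.
dat$texture <- factor(dat$texture)

str(dat)
```

Recoding missing-value codes before converting types avoids the "NAs introduced by coercion" warning, so any warning that does appear points to a value that was not anticipated.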

Documenting Data Munging - Standards

Because R is the main language in our lab, the default data munging documentation standard is an R project or a series of R scripts. These should be stored in the src folder of your project directory and curated according to our lab [R code and project style guidelines]#todo.

Data munging documentation should also include metadata for this stage, which could form part of a project Quarto book published on GitHub Pages. This metadata should document every decision made in the data munging process, including why and how it was made. Many of these decisions are too detailed for a standard readme file or a project metadata file. A Quarto book, with its sections, chapters, and space for R code blocks, would be an excellent way to document them.

Required Tools - Data Munging
  • R/RStudio
  • Google Drive