IPO

Input -> Processing -> Output


Reproducible academic workflow protocol

IPO (Input-Processing-Output) Protocol 0.1 for Reproducible Research

To download the protocol folder structure click here.


This is a project folder template largely based on the TIER protocol (Teaching Integrity in Empirical Research). TIER “promotes the integration of principles and practices related to transparency and replicability in the research training of social scientists.” (more information at https://www.projecttier.org/).

Reproducibility in this kind of protocol is implemented by generating a self-contained set of files, organized in a project structure that can be shared and executed by anyone. In other words, it must have all you need to run and re-run the analysis.

The IPO protocol follows the rationale of TIER, with some innovations:

  • it aims for a model that is easy to memorize and that mirrors the analysis workflow (Input-Processing-Output=IPO), where processing refers to preparing and analyzing data.

  • it adds an “Input” folder with a broader scope: it contains the original “Data” folder but also other possible inputs, such as external images and bibliography files.

  • the Data folder is also simplified, now including only an “Original” and a “Processed” subfolder.

  • it changes the documentation files from .txt to .md (Markdown). Markdown is a plain-text markup language that can then be converted to other formats such as pdf and/or html (for instance, when using R / Rmarkdown). At their core, however, they are simple text files with a different extension.

Files & Folder Structure


├── input: external information such as images, documents, bib files
│   ├── data:
│   │   ├── proc     : processed data files
│   │   ├── original : original data files & metadata when available
│   ├── images
│   ├── bib: bibliography files
│   ├── prereg: pre-registration files when available
│
├── processing:
│   - preparation.Rmd
│   - analysis.Rmd
│
├── output: tables, graphs, logs, anything produced by the code
│   ├── graphs
│   ├── tables
│
- README.md : the general introductory file
- paper.md / paper.html / paper.pdf: the paper
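
As a minimal sketch, the structure above could also be created from R instead of being downloaded. The helper name create_ipo and the temporary location are assumptions made for illustration; they are not part of the protocol.

```r
# Hypothetical helper (not part of the protocol): builds the IPO skeleton
create_ipo <- function(root) {
  folders <- c(
    "input/data/original",
    "input/data/proc",
    "input/images",
    "input/bib",
    "input/prereg",
    "processing",
    "output/graphs",
    "output/tables"
  )
  for (folder in folders) {
    dir.create(file.path(root, folder), recursive = TRUE, showWarnings = FALSE)
  }
  # Empty placeholder files; files that already exist are left untouched
  files <- file.path(root, c("README.md", "paper.md",
                             "processing/preparation.Rmd",
                             "processing/analysis.Rmd"))
  file.create(files[!file.exists(files)])
  invisible(file.path(root, folders))
}

# Assumed location: a temporary directory, so the sketch is safe to run anywhere
project_root <- file.path(tempdir(), "ipo-project")
create_ipo(project_root)
list.files(project_root, recursive = TRUE, include.dirs = TRUE)
```

Pointing create_ipo at a real project path instead of tempdir() would produce a permanent folder.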

Basic principles

  • order: work with someone in mind who is not familiar with the project and who should be able to understand and reproduce it without further instructions. Or think about yourself in 5 years: will you be able to understand and reproduce this?

  • comment the scripts: record the reasons for any decisions briefly

  • the preparation script should begin by loading the original data and nothing else, and should end by saving the data in the processed data folder.
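
The last principle can be sketched as follows. Everything specific here is an assumption for illustration: the made-up data, the file name data_proc.rds, the use of saveRDS, and the temporary project folder. The call to setwd only mimics a script being run from the processing folder so that the sketch is runnable on its own; a real preparation.Rmd would not need it.

```r
# Stand-in project in a temporary directory so the sketch runs anywhere
root <- file.path(tempdir(), "ipo-prep-demo")
dir.create(file.path(root, "input/data/original"), recursive = TRUE,
           showWarnings = FALSE)
dir.create(file.path(root, "input/data/proc"), recursive = TRUE,
           showWarnings = FALSE)
dir.create(file.path(root, "processing"), showWarnings = FALSE)

# Made-up original data (illustration only)
write.csv(data.frame(id = 1:4, age = c(25, NA, 41, 33)),
          file.path(root, "input/data/original/data.csv"),
          row.names = FALSE)

# Mimic running from the processing folder (demo only)
old_wd <- setwd(file.path(root, "processing"))

# 1. Begin by loading the original data and nothing else
original <- read.csv("../input/data/original/data.csv")

# 2. Prepare the data; dropping rows with a missing age is an arbitrary
#    example decision, recorded here as the principles suggest
processed <- original[!is.na(original$age), ]

# 3. End by saving the result in the processed data folder
saveRDS(processed, "../input/data/proc/data_proc.rds")

setwd(old_wd)  # restore the previous working directory
```

The analysis script would then start from the processed file, for example with readRDS, and never touch the original data.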

Notes

  • as the R/Rmarkdown environment allows combining text, analysis & outputs, the corresponding folders (processing & output) may seem redundant. This is probably the case for brief analyses and short reports. Nevertheless, when working on a paper/thesis/group project it is advisable to separate inputs, analysis and outputs for the sake of order and reproducibility, even when working with Rmarkdown.

  • to make the work easier in RStudio you might want to turn the project folder into an R project. The working directory will then automatically point to the project root, and other RStudio functions are activated as well. For this, just go to File->New Project->Existing Directory and point to the project folder. This creates a file with the .Rproj extension that, when clicked, opens RStudio with the project folder as root. This avoids setting machine-specific working directories (the R command setwd), which does not facilitate reproducibility.

  • besides having one project folder where all information is contained, the technical key for working within this structure is to save and load files located in different places through relative paths (RP), which allows connecting different files within the same project folder. For instance, for loading a data file from the original folder from the preparation.Rmd script:

data <- read.csv("../input/data/original/data.csv")

The ../ characters mean “one level up” in the folder structure. In this case, taking as a reference a script within the processing folder, we need to go up to the project or root folder, and from there go down to input/data/original/data.csv
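
How such a relative path resolves can be checked with a small sketch. The temporary project folder is an assumption for illustration; normalizePath and file.path are base R functions.

```r
# Stand-in project in a temporary directory (illustration only)
root <- file.path(tempdir(), "ipo-paths-demo")
dir.create(file.path(root, "processing"), recursive = TRUE,
           showWarnings = FALSE)
dir.create(file.path(root, "input/data/original"), recursive = TRUE,
           showWarnings = FALSE)

# Folder where the script lives, and the relative path written in the script
script_dir <- file.path(root, "processing")
relative <- "../input/data/original"

# Resolve ".." starting from the script's folder
resolved <- normalizePath(file.path(script_dir, relative))

# The same folder, reached directly from the project root
expected <- normalizePath(file.path(root, "input/data/original"))

identical(resolved, expected)
```

Because the path is expressed relative to the script, it keeps working when the whole project folder is moved or shared.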

  • for saving tables and any other text output produced in R, use the function sink. For instance, for a stargazer descriptive table of data, from a script within the processing folder:
library(stargazer)
sink("../output/tables/table1.txt")
stargazer(data, type = "text")
sink()

The rationale is to state the file into which everything printed next is saved (or "sunk"), and then to stop saving with sink()

Then, to call the output from the paper.Rmd file (in the project root):

<div><object data="output/tables/table1.txt"></object></div>
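
The sink pattern does not depend on stargazer. A minimal sketch with base R only, using summary() on the built-in cars data set as a stand-in table and a temporary folder as an assumed output location:

```r
# Assumed output location: a temporary folder, so the sketch runs anywhere
tables_dir <- file.path(tempdir(), "output", "tables")
dir.create(tables_dir, recursive = TRUE, showWarnings = FALSE)
table_file <- file.path(tables_dir, "table1.txt")

sink(table_file)       # from here on, printed output goes to the file
print(summary(cars))   # cars ships with R; stand-in for a real table
sink()                 # stop diverting output

# Read the file back to confirm what was saved
saved <- readLines(table_file)
```

Forgetting the closing sink() leaves later console output going to the file, so the two calls are best kept close together.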

  • for saving graphs, after producing and viewing the graph (here from a script within the processing folder):
dev.copy(png, "../output/graphs/graph1.png", width = 600, height = 600); dev.off()

Then, to call the graph from the paper.Rmd file:

![](output/graphs/graph1.png)
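
dev.copy needs a graph that is already displayed on an open device. An alternative sketch that does not depend on this opens the png device first, draws, and then closes it. The built-in cars data set and the temporary output folder are assumptions for illustration.

```r
# Assumed output location: a temporary folder, so the sketch runs anywhere
graphs_dir <- file.path(tempdir(), "output", "graphs")
dir.create(graphs_dir, recursive = TRUE, showWarnings = FALSE)
graph_file <- file.path(graphs_dir, "graph1.png")

png(graph_file, width = 600, height = 600)  # open the file device first
plot(cars$speed, cars$dist,
     xlab = "Speed", ylab = "Stopping distance")
dev.off()                                   # closing the device writes the file
```

With this variant the graph is not shown on screen; it is written straight to the file.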

Further work:

  • IPO-RGit: maintaining the same basic structure, this version is an upgrade that takes advantage of all the reproducibility, collaboration and publication tools offered by the Rmarkdown/GitHub work environments. An example of this work-in-progress implementation can be seen here.

  • Spanish translation
