Course introduction

Teachers: Bas van Gestel en Zoltán Bochdanovits-ten Hove

When you program, you seldom program in isolation. Usually, you are writing analyses, pieces of software or creating visualizations within a collaborative (research) project. Adopting a robust and reproducible workflow is key for managing your own work and that of others. Using a version control system, managing your data and other files in a structured way, deploying your software or analysis pipeline in a robust manner are all good practices you need to apply in your work as a bioinformatician or data analyst, translator or scientist.

0.1 Osiris

Before you do anything else, make sure you sign up for the exam AND the second chance!

0.2 Purpose of this website

This website is a {bookdown} project that forms the reader and learning content for this course

0.3 General course aim and content

In this course we will focus on the tools that are available to adopt such good practices in your own work. Using examples and many exercises and assignments we will help you get to know the tools and WORKFLOWS that are available to you. The focus here will be on the tools that are freely and from open source available. We chose to limit the course tools to the use of open source applications exclusively. This will greatly facilitate the application and use of these tools also outside the course context. Furthermore, open source software is one of the key prerequisites for Open Science. We will start the course by explaining what Open Science is and how it (to our opinion) will lead to better science. We will use this framework to expand our knowledge on how to apply good practices from the software development industry to scientific applications and from there we will gradually start adding tools to your data science toolbox.

The course is build on these fundamental concepts and technologies.

  1. Open Science
  2. File management and formatting following the Guerrilla Analytics Principles
  3. Version control and Project management, using Github.com
  4. Literate Programming using R flavored Markdown (R Markdown)
  5. Relational databases to improve data provenance, computational speed and traceability of observations

This is also (globally) the contents of the course.

0.4 Software, Laptop and RStudio server

In order for you to be able to start applying what you have learned during the course, we require you to install the software needed for this course on your own computer. This means you will get some extra practice with installing packages and dependencies, and you have your laptop installed and ready for your internship next school year.

We do also still offer an account on the RStudio Server that you used during the previous DSFB-2 course (“DAUR”). The server can be found here https://datascience.hu.nl/rstudio/tlsc-dsfb16v-20-21/. If you do not have access to your account any more, please inform your teachers. The list of required software can be found here 0.13.

0.5 Seeking interaction and help

During this course we really invite you to seek interaction on the course contents, the exercises and assignments with fellow course participants and the instructors. To facilitate collaboration and also the possibility for you to build a strong “class of 2024” network, we scheduled 5 hour sessions at the HU. We will of course include breaks. Your teachers may not always be present during the whole session, but quite frequently we will. We will start every class with a plenary meeting.

Note that you will have to work on Workflows outside of class hours.

Presence in class is in principle mandatory! We would have a very hard time teaching you how to collaborate on DSFB projects when you or your classmates aren’t there. You will be graded partly on class participation.

0.6 Exercises and portfolio

0.6.1 Normal exercises

Within the lesson you will encounter a number of exercises.

Please answer all the assignments and exercises in an RMarkdown document. You learned how to make one in DAUR1. Save it somewhere in your course project. Seriously, you may need code from one lesson in one of the following lessons.

0.6.2 Portfolio

At the end of every lesson (and sometimes in the middle), there is a larger portfolio assignment. You will work on this assignment for the rest of the day, after finishing the smaller exercises in the lesson. Save all portfolio assignments in separate Rmd files! Please create a separate RStudio project to complete all the portfolio assignments.

The collection of all the portfolio assignments will form the main part of your portfolio for this course. The portfolio is also the exam for this course. The details are described here. In short: You will have to publish a bookdown project to github pages in your github account. This page will function as a portfolio where you will post all the assignments and output products* of the course and should be a professional looking representation of your DSFB skills at the end of this semester. Bookdown and github will be covered within this course.You will need to pass all portfolio assignments in order to pass the course!. See here for more info.

  • As your work on the projecticum is probably your first real data science / bioinformatics project, your portfolio is the logical place to display it. Therefore, the work you do for the projecticum will also be part of your potfolio. However, the projecticum is only graded for the projecticum.

0.7 Study hours

Please also note that this is a 5 EC course, so count on about 140 hours of work in total. Part of this workload will be spread over the semester, to work on your portfolio. But we scheduled these lessons in such a way that you will be able to finish most lessons in the first 3 weeks of the semester. Use this time to you advantage! You will need the skills from this course within the projecticum.

0.8 Passing or failing the course

This course is about the workflows that are available to adopt good practices in data science for biology. This is a field in which you rarely work alone. Participating and contributing to a data science community is an important skill to practice! Active participation is therefore conditional for getting a grade on your portfolio. So, to summarise: to successfully complete this course, you will need to pass these checks:

  1. Add all required assignments and materials to your online portfolio
  2. Provide the teachers access to your online portfolio before deadline TBA, 23:59h.
  3. Meet the portfolio and assignment minimal requirements
  4. Pass the live partial assignments in class (Two skills are hard to show in a portfolio. We will do live partial assessments for these skills in class. Discussed live at first class meeting)
  5. Participate actively in class
  6. Pass the live assessment that will be held to grade your portfolio.

For more detailed information on the requirements and rules for passing this course see the page on the portfolio.

0.9 A future proof course

No matter where you will go after this course, if you stay in the field of bioinformatics or data science, you will encounter and use many of the concepts that are dealt with in this course, in your daily work. Organizing your work in a WORKFLOW will reduce confusion, time spend debugging or tracing mistakes and will improve the robustness and reproducibility of your science. It will also require you to spend time to learn about these concepts. We believe it is well worth the investment, because it will pay off, not only for you personally, but for science in general. We have done our best to make this course enjoyable and educational at the same time. We sincerely hope you will have fun learning about this important (yet under-appreciated) aspect of data analysis and we hope you will take along many of the lessons learned to your future job. If you can, try to inspire others to adopt a ‘reproducible research’ mindset. If anything, at least you will have a nice portfolio to show off to your (possible) future boss.

Of course, we have carefully put together this reader to include example code, inspiring exercises and fulfilling assignments to be added to your portfolio. We have used many external resources to generate the content of this reader. We tried to pay proper credit to all of those resources. If you feel we have made a mistake somewhere, or did not pay proper respect to a source used, please let us know by reaching out to us via email or generate a pull request in the github repo, linked to this project (https://github.com/DataScienceILC/tlsc-dsfb26v-20_workflows). Please notify us of any typo’s and other mistakes you find.

0.10 Get info

The easiest way to get information about the authors is by using the {desc} package from within the reader bookdown project. Run the code below to get email-address information.

#install.packages("desc")
library(desc)
desc <- description$new()
desc$get_authors()

If you would like to cite our work, run:

devtools::install_github(https://github.com/DataScienceILC/tlsc-dsfb26v-20_workflows)
citation("workflows")

0.11 The {bookdown} package

This website was created using the {bookdown} (Xie 2019) package written by Yihui Xie. The package can be downloaded from CRAN. For more information see the documentation.

The {bookdown} package can be installed from CRAN or Github:

install.packages("bookdown")
# or the development version
# devtools::install_github("rstudio/bookdown")

We will use it extensively in this course.

0.12 License

Part of this course if written by us, part of it is inspired by different sources. Please cite the original source if you use anything we also borrowed. Please cite us if you use our work.

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

CC BY-NC-SA 4.0

References

Xie, Yihui. 2019. Bookdown: Authoring Books and Technical Documents with r Markdown. https://CRAN.R-project.org/package=bookdown.