By Robin Brooks
The pioneering archaeology work of Ben Marwick, data science fellow and member of the eScience Institute Steering Committee, has resulted in several high-profile publications and garnered lots of media attention. His paper “Human occupation of northern Australia by 65,000 years ago” was published in Nature, and “sets a new minimum age for the arrival of humans in Australia”. This paper is covered in more depth on UW Today, where you can also find a video on the research.
Marwick is an advocate of open and reproducible science; he also co-authored two recent pieces in The Conversation: “Buried tools and pigments tell a new history of humans in Australia for 65,000 years” and “Here’s the three-pronged approach we’re using in our own research to tackle the reproducibility issue”.
Additional press can be found in The New York Times, CBS News, ABC News (Australia), Science, Smithsonian magazine, Hakai magazine, and Silicon Republic.
I sent Marwick a few questions by email on the process his team used to develop the research. (This content has been lightly edited for length and clarity.)
Brooks: Did you find support for your reproducibility principles in working with the Nature team?
Marwick: Yes, we all agreed that it was a priority to work in a way that made it easy for us to reproduce our own work. We knew this would be a big and complex project, so we took steps to be sure we could trust our results. It wasn’t easy: although everyone valued the principles, we had different levels of familiarity with reproducible research within the team, so it took a little more time and effort to ensure our work was as reproducible as possible. However, we strongly feel that this investment in reproducibility was worth it. It’s a fundamental principle of good science that other researchers should be able to find enough information in our publication to reproduce what they see in that publication.
Brooks: Could you describe some of the benefits of making archaeological research reproducible?
Marwick: Working in a big group, it’s important to be organised with storing data, and keeping track of different versions. Without a strategy for these simple tasks, chaos and confusion can take over. This can lead to errors in the analysis. The principles and tools of reproducible research helped us a lot to stay organised and efficient in our work.
Our results are relevant to major issues in human evolution, so it’s important that we are certain of the correctness of our analysis. By writing code using the R programming language to analyze our data, we created a detailed record of all the decisions made during the analysis. This makes it easy for us to look back over our work and double-check our assumptions. Without the code to document our analyses, we would likely forget many of the decisions we made, making it hard to verify them. This could result in mistakes or bad choices going undiscovered and affecting our results.
Another advantage of using code is that we can re-run our analyses very easily. We just run the code and the computer regenerates the output. This is very different from what most other archaeologists are doing, which is laboriously pointing and clicking in Excel and other spreadsheet programs. Re-running the analysis is important for investigating the consequences of the decisions we make during data analysis. If we had worked entirely in Excel, it would have been very difficult to re-run our analysis, so we probably would not have bothered to explore alternatives beyond the first result. But because we used code, we could easily check how our decisions affected our analysis.
Beyond the benefits to our group, we also see benefits to other archaeologists, and to science generally. Other archaeologists can look deeply into our work and verify for themselves that we’ve made good choices in our analysis. They can take our data and code and re-analyse it themselves using different assumptions to see if our findings are robust. None of this is possible in the traditional way that archaeologists publish short papers to announce their findings; there simply isn’t enough detail. But because we made a commitment to work reproducibly, other archaeologists have a chance to really check for themselves that we’ve got reliable results.
Another benefit is that other archaeologists can easily take our data and combine it with their own to perform new analyses. They can also take our code and use it with their data, to explore new methods. Our commitment to sharing our research materials speeds the impact that our work has on the field. It makes it easier for other researchers to use our results.
For science generally, our study will be a pioneering example to researchers in other fields of how to do reproducible research. Researchers in other fields will look to our paper to see the tools we used and the basic details of how we organised our data to make it reproducible. This will save a lot of time and trial-and-error for other researchers who want to improve the reproducibility of their research. Our paper helps to make reproducible research easier and more normal in science broadly. And this will be better for everyone, including the public, who are more likely to trust research that is reproducible, rather than one-off results that might be anomalous.
Brooks: Was your research open throughout the project?
Marwick: Until publication we were mostly closed with our research, and worked in private. This is because of our agreement with the Mirarr Traditional Owners of the archaeological site. We wanted to be sure that they were the first to know our results, because the site is part of their heritage. This is a common situation in the social and medical sciences when working with data that has privacy and ethical implications. Now that the paper is published, we have also released our data and code, and our work is open. We made available files that are usually kept private by most researchers. So it took a little bit of extra effort to ensure that other people could make use of these files.
Brooks: Was your work and collaboration on this project influenced by your connection as a data science fellow with the eScience Institute?
Marwick: Yes, very strongly. As a member of the Reproducibility and Open Science Working Group at eScience I’ve gotten to know faculty on campus in other departments who are highly skilled in making their research more reproducible. I’ve learned a lot from their presentations to the Working Group, and from reading their papers. I get tremendous inspiration from these faculty, who are doing pioneering and world-class work to improve the reproducibility of their research. That community is a huge source of motivation for me; it’s important to know that there are other faculty who highly value reproducible research and open science practices. It’s only because we have the eScience Institute that I’ve gotten to know these researchers, and it’s transformed how I work and what I think is good science. I’m now busy sharing these insights with my students and colleagues in archaeology and related fields to improve the transparency and integrity of research generally.
Brooks: Do you have any additional thoughts or insight you’d like to share?
Marwick: There is a growing awareness of the importance of reproducible research in science generally. The journal our article appears in, Nature, places a high priority on reproducibility, and has published many editorials and commentaries to encourage researchers to lift their game. I was recently a co-author on one of these in Nature Neuroscience, calling for researchers to share their analysis code. Our paper answers these calls with a bold commitment to reproducibility. We hope it will be a useful example for other researchers to follow and take inspiration from. It’s very exciting to be at the forefront of that and to be an agent for change in my discipline.
Some researchers feel that it’s not worth the extra time to make their research reproducible, transparent and open. They may feel that it takes time away from making big discoveries that could get published in prestigious journals. Our paper shows that it is possible to put in the time and effort needed to do reproducible research and get the result published in a high-impact journal.