Data Management Lab Meeting

For lab meeting next week you should read through these papers. I suggest reading them in order (you can skim Michener et al. and Strasser et al.; both are detail-rich, but for lab meeting Borer et al. and White et al. will be the most useful).

  1. Bruna, E. M. 2010. Journals can advance tropical biology and conservation by requiring data archiving. Biotropica 42(4): 399–401: Bruna_2010_Biotropica_Editorial
  2. Borer, E. T., E. W. Seabloom, M. B. Jones, and M. Schildhauer. 2009. Some simple guidelines for effective data management. Bulletin of the Ecological Society of America: 205–214: Borer_etal_2009_BullESA
  3. White, E. P., E. Baldridge, Z. T. Brym, K. J. Locey, D. J. McGlinn, and S. R. Supp. 2013. Nine simple ways to make it easier to (re)use your data. Ideas in Ecology and Evolution 6(2): 1–10: White_etal_2013_IEE
  4. Whitlock, M. C. 2011. Data archiving in ecology and evolution: best practices. Trends in Ecology & Evolution 26(2): 61–65: Whitlock_2011_TREE
  5. Strasser, C. A., R. Cook, W. Michener, and A. Budden. 2012. Primer on Data Management: What You Always Wanted to Know. A DataONE publication, available via the California Digital Library: Strasser_etal_DataManagement_2011_DataOne (can also be found on DataONE's Best Practices for Data Management website)
  6. Michener, W. K., J. W. Brunt, J. Helly, T. B. Kirchner, and S. G. Stafford. 1997. Non-geospatial metadata for the ecological sciences. Ecological Applications 7: 330–342: Michener_etal_1997_EcolApplications
  7. Sample metadata and data files: Bruna, E. M., T. J. Izzo, B. D. Inouye, M. Uriarte, and H. Vasconcelos. 2011. Data from: Asymmetric dispersal and colonization success of Amazonian plant-ant queens. PLoS ONE. Dryad Digital Repository. doi:10.5061/dryad.h6t7g
  8. DropBox

(For fun you might also read this blog post by Carly Strasser about training ecology undergrads in data management; strictly optional, however.)

Here is the context and some of the issues to consider regarding data management and archiving.

A major goal of science is reproducibility. What does it take to reproduce someone’s analyses?

  1. Our papers integrate different types of complex data:
    1. climate records, field measurements, diversity lists, experiments, observations, GPS points
    2. on these we layer analyses, statistical models, computer code, etc. THESE ARE ALSO DATA
  2. What happens to these data and analyses once the paper is published? See Fig. 1 of Michener et al. (1997).
  3. Who cares? Why is it important? A: CNPq, NSF, other researchers, future generations
  4. The right path: data management -> data use -> data sharing (archiving) -> data reuse
  5. How can data be reused by others?
    1. Validation of results
    2. Meta analyses
    3. New questions
    4. Increases in citation rates of papers
    5. New opportunities for teaching
    6. Reduces data loss
  6. First key step – data collection and organization
    1. Decide on a naming scheme, create a key, make it unique for each sample
    2. Standardize! Be consistent within columns: only text, only numbers, only dates. Use consistent codes, formats, etc.
    3. Reduce the chance of errors by using Excel drop-down (validation) lists to constrain choices during data entry
    4. Identify missing data with a code
    5. Create tables with codes, site data, etc.
    6. Excel is great for data analysis, terrible for data archiving
    7. Relational databases – MySQL and others are free; essential for large and complex datasets, useful for all others
    8. Use descriptive file names (organism_site_year_whatmeasured). Remember to document your date format and how values were recorded (see the first R sketch below the outline)
  7. Quality control
    1. minimize manual data entry – cuts down on mistakes
    2. use double entry or spot check records
    3. use a database, document changes
    4. after data entry, look for outliers and anomalous values and run statistical summaries (see the quality-control R sketch below the outline)
  8. Metadata – must know who created the data, what the dataset contains, when it was created, where it was collected, how it was developed, and why.
    1. Metadata basics: Michener et al. (1997), Borer et al. (2009)
    2. Metadata standards: EML and others. http://knb.ecoinformatics.org/software/eml/
    3. Can use programs like Morpho to create EML metadata (a bare-bones plain-text alternative is sketched below the outline)
  9. Analysis
    1. Keep raw data raw; use scripts to do any manipulation. Save the scripts with the data and annotate them well (see the raw-to-derived R sketch below the outline).
    2. Workflows: how you get from raw data to the final product of research (flow charts)
    3. R/SAS scripts – code is great; well-documented code is easier to review, remember, share, and repeat
    4. Workflows enable reproducibility (can someone independently validate your findings), transparency (can others understand how you arrived at your results), executability (can others re-run or re-use your analyses)
  10.  Data stewardship and reuse
    1. 20 year rule – metadata and accompanying data should be written for a user 20 years into the future
    2. Use stable, non-proprietary formats (csv, txt, tiff)
    3. Create backup copies
    4. Periodically test your ability to restore the information (see the checksum R sketch below the outline)
    5. Store data in a repository: Institutional archive or discipline/specialty archive
  11. Data citation – allows readers to find data products, promotes reproducibility, and provides a better measure of research impact
  12. Data management plan – what is it? Why do it?
  13. E-notebooks and online science (notebook, ORNL eNote, Evernote, Google Docs, blogs, wikis, theLabNotebook.com, Notebookmaker)
  14. Databib – a list of data repositories
  15. Importance of DOIs: precise identification, credit to data producers and publishers, links from the literature to the data, research metrics for datasets
  16. DataONE.org – tutorials, a database of best practices and tools, a primer on data management, investigator toolkit
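
To make item 6 concrete, here is a minimal R sketch of reading a standardized data file. The file name, column names, allowed site codes, and the missing-data code are all hypothetical; the point is one type per column, one date format, and one agreed code for missing values.

```r
# Descriptive, machine-readable file name: organism_site_year_whatmeasured
infile <- "heliconia_manaus_2010_growth.csv"

# Columns (in order): plant_id, site, date, height_cm -- one type per column,
# one agreed-upon code ("NA") for missing values
growth <- read.csv(infile,
                   na.strings = "NA",
                   colClasses = c("character", "character", "character", "numeric"),
                   stringsAsFactors = FALSE)

# Dates entered as text in one format (YYYY-MM-DD) and converted explicitly
growth$date <- as.Date(growth$date, format = "%Y-%m-%d")

# Categorical codes constrained to an agreed list (like an Excel drop-down list)
growth$site <- factor(growth$site, levels = c("BDFFP", "DUCKE", "ZF2"))

str(growth)  # confirm every column has the type you expect
```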
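A quality-control sketch for item 7, using the same hypothetical growth table; the plausible-range limits are invented for illustration and would come from your knowledge of the organism.

```r
# Quick statistical summary of every column (means, ranges, NA counts)
summary(growth)

# Flag values outside a plausible range for follow-up against the field sheets
suspect <- subset(growth, height_cm < 0 | height_cm > 500)
print(suspect)

# Check that only the agreed site codes were actually entered
table(growth$site, useNA = "ifany")

# Visual scan for outliers
boxplot(height_cm ~ site, data = growth, ylab = "Height (cm)")
```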
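For item 8, full EML (e.g., built with Morpho) is the real target; this sketch just writes a bare-bones plain-text metadata record covering the who/what/when/where/how/why checklist. All field values are hypothetical.

```r
# Minimal plain-text metadata record stored alongside the data file
metadata <- c(
  "Title:        Heliconia growth measurements, central Amazonia, 2010",
  "Creator:      Your Name, Your Institution (who)",
  "Contents:     Annual height growth of marked plants (what)",
  "Dates:        2010-01-15 to 2010-12-20 (when)",
  "Location:     Forest reserves near Manaus, Brazil (where)",
  "Methods:      Heights measured to the nearest cm with a meter stick (how)",
  "Purpose:      Demographic monitoring (why)",
  "Missing data: coded as NA",
  "Columns:      plant_id (text), site (text), date (YYYY-MM-DD), height_cm (numeric)"
)
writeLines(metadata, "heliconia_manaus_2010_growth_METADATA.txt")
```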
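For item 9, a sketch of a scripted raw-to-derived workflow. The directory layout and cleaning steps are hypothetical; the point is that the raw file is never edited by hand and every change is documented in code.

```r
# Read the untouched raw file
raw <- read.csv("data_raw/heliconia_manaus_2010_growth.csv",
                na.strings = "NA", stringsAsFactors = FALSE)

# All cleaning and derived variables happen in code, documented right here
clean <- raw[!is.na(raw$height_cm), ]      # drop records with no measurement
clean$height_m <- clean$height_cm / 100    # derived variable, units in the name

# Write the derived product to a separate folder; the raw file stays raw
dir.create("data_derived", showWarnings = FALSE)
write.csv(clean, "data_derived/heliconia_manaus_2010_growth_clean.csv",
          row.names = FALSE)
```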
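Finally, for item 10, a sketch of the "periodically test your ability to restore" idea using checksums. tools::md5sum() is part of base R; the file paths are hypothetical, and in practice the backup would live on a different drive or server.

```r
library(tools)  # md5sum()

archived <- "data_derived/heliconia_manaus_2010_growth_clean.csv"
backup   <- "backup/heliconia_manaus_2010_growth_clean.csv"

# Make the backup copy
dir.create("backup", showWarnings = FALSE)
file.copy(archived, backup, overwrite = TRUE)

# Periodic test: the checksums match only if the copy restores byte-for-byte
identical(unname(md5sum(archived)), unname(md5sum(backup)))
```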

Image: Numbers by Andy Maguire (CC BY 2.0)
