Data Management Lab Meeting

For lab meeting next week you should read through these papers. I suggest reading them in order (you can skim Michener et al. and Strasser et al.; both are detail-rich, but for lab meeting Borer et al. and White et al. will be the most useful).

  1. Bruna, E. M. 2010. Journals can advance tropical biology and conservation by requiring data archiving. Biotropica 42(4): 399–401: Bruna_2010_Biotropica_Editorial
  2. Borer, E. T., E. W. Seabloom, M. B. Jones, and M. Schildhauer. 2009. Some simple guidelines for effective data management. Bulletin of the Ecological Society of America: 205–214: Borer_etal_2009_BullESA
  3. White, E. P., E. Baldridge, Z. T. Brym, K. J. Locey, D. J. McGlinn, and S. R. Supp. 2013. Nine simple ways to make it easier to (re)use your data. Ideas in Ecology and Evolution 6(2): 1–10: White_etal_2013_IEE
  4. Whitlock, M. C. 2011. Data archiving in ecology and evolution: best practices. Trends in Ecology & Evolution 26(2): 61–65: Whitlock_2011_TREE
  5. Strasser, C. A., R. Cook, W. Michener, and A. Budden. 2012. Primer on Data Management: What You Always Wanted to Know. A DataONE publication, available via the California Digital Library: Strasser_etal_DataManagement_2011_DataOne (can also be found on DataONE's Best Practices for Data Management website)
  6. Michener, W. K., J. W. Brunt, J. Helly, T. B. Kirchner, and S. G. Stafford. 1997. Non-geospatial metadata for the ecological sciences. Ecological Applications 7: 330–342: Michener_etal_1997_EcolApplications
  7. Sample metadata and data files: Bruna, E. M., T. J. Izzo, B. D. Inouye, M. Uriarte, and H. Vasconcelos. 2011. Data from: Asymmetric dispersal and colonization success of Amazonian plant-ant queens. PLoS ONE. Dryad Digital Repository. doi:10.5061/dryad.h6t7g
  8. DropBox

(For fun you might also read this blog post by Carly Strasser about training ecology undergrads in data management; strictly optional, however.)

Here is the context and some of the issues to consider regarding data management and archiving.

A major goal of science is reproducibility. What does it take to reproduce someone’s analyses?

  1. Our papers integrate different types of complex data:
    1. climate records, field measurements, diversity lists, experiments, observations, GPS points
    2. on these we layer analyses, statistical models, computer code, etc. THESE ARE ALSO DATA
  2. What happens to these data and analyses once the paper is published? See Fig. 1 of Michener et al. (1997).
  3. Who cares? Why is it important? A: CNPq, NSF, other researchers, future generations
  4. The right path: data management -> data use -> data sharing (archiving) -> data reuse
  5. How can data be reused by others?
    1. Validation of results
    2. Meta analyses
    3. New questions
    4. Increases in citation rates of papers
    5. New opportunities for teaching
    6. Reduces data loss
  6. First key step – data collection and organization
    1. Decide on a naming scheme, create a key, make it unique for each sample
    2. Standardize! Be consistent within columns: only text, only numbers, only dates. Use consistent codes, formats, etc.
    3. Reduce the chance of errors by using Excel drop-down (validation) lists to constrain choices during data entry
    4. Identify missing data with a code
    5. Create tables with codes, site data, etc.
    6. Excel is great for data analysis, terrible for data archiving
    7. Relational databases – MySQL and others are free; essential for large and complex datasets, useful for all others
    8. Use descriptive file names (organism_site_year_whatmeasured). Remember to document your date format and how values were recorded (see the first R sketch below the outline)
  7. Quality control
    1. minimize manual data entry – cuts down on mistakes
    2. use double entry or spot check records
    3. use a database, document changes
    4. after data entry, look for outliers and anomalous values and run statistical summaries (see the quality-control R sketch below the outline)
  8. Metadata – must know who created the data, what the dataset contains, when it was created, where it was collected, how it was developed, and why.
    1. Metadata basics: Michener et al. (1997), Borer et al. (2009)
    2. Metadata standards: EML and others. http://knb.ecoinformatics.org/software/eml/
    3. Can use programs like Morpho to create EML metadata (a bare-bones plain-text alternative is sketched below the outline)
  9. Analysis
    1. Keep raw data raw; use scripts to do any manipulation. Save the scripts with the data and annotate them well (see the raw-to-derived R sketch below the outline).
    2. Workflows: how you get from raw data to the final product of research (flow charts)
    3. R/SAS scripts – code is great; well-documented code is easier to review, remember, share, and repeat
    4. Workflows enable reproducibility (can someone independently validate your findings), transparency (can others understand how you arrived at your results), executability (can others re-run or re-use your analyses)
  10.  Data stewardship and reuse
    1. 20 year rule – metadata and accompanying data should be written for a user 20 years into the future
    2. Use stable, non-proprietary formats (csv, txt, tiff)
    3. Create backup copies
    4. Periodically test your ability to restore the information (see the checksum R sketch below the outline)
    5. Store data in a repository: Institutional archive or discipline/specialty archive
  11. Data citation – allows readers to find data products, promotes reproducibility, and provides a better measure of research impact
  12. Data management plan – what is it? Why do it?
  13. E-notebooks and online science (notebook, ORNL eNote, Evernote, Google Docs, blogs, wikis, theLabNotebook.com, Notebookmaker)
  14. Databib – a list of data repositories
  15. Importance of DOIs: precise identification, credit to data producers and publishers, links from the literature to the data, research metrics for datasets
  16. DataONE.org – tutorials, a database of best practices and tools, a primer on data management, investigator toolkit
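
To make item 6 concrete, here is a minimal R sketch of reading a standardized data file. The file name, column names, allowed site codes, and the missing-data code are all hypothetical; the point is one type per column, one date format, and one agreed code for missing values.

```r
# Descriptive, machine-readable file name: organism_site_year_whatmeasured
infile <- "heliconia_manaus_2010_growth.csv"

# Columns (in order): plant_id, site, date, height_cm -- one type per column,
# one agreed-upon code ("NA") for missing values
growth <- read.csv(infile,
                   na.strings = "NA",
                   colClasses = c("character", "character", "character", "numeric"),
                   stringsAsFactors = FALSE)

# Dates entered as text in one format (YYYY-MM-DD) and converted explicitly
growth$date <- as.Date(growth$date, format = "%Y-%m-%d")

# Categorical codes constrained to an agreed list (like an Excel drop-down list)
growth$site <- factor(growth$site, levels = c("BDFFP", "DUCKE", "ZF2"))

str(growth)  # confirm every column has the type you expect
```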
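A quality-control sketch for item 7, using the same hypothetical growth table; the plausible-range limits are invented for illustration and would come from your knowledge of the organism.

```r
# Quick statistical summary of every column (means, ranges, NA counts)
summary(growth)

# Flag values outside a plausible range for follow-up against the field sheets
suspect <- subset(growth, height_cm < 0 | height_cm > 500)
print(suspect)

# Check that only the agreed site codes were actually entered
table(growth$site, useNA = "ifany")

# Visual scan for outliers
boxplot(height_cm ~ site, data = growth, ylab = "Height (cm)")
```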
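For item 8, full EML (e.g., built with Morpho) is the real target; this sketch just writes a bare-bones plain-text metadata record covering the who/what/when/where/how/why checklist. All field values are hypothetical.

```r
# Minimal plain-text metadata record stored alongside the data file
metadata <- c(
  "Title:        Heliconia growth measurements, central Amazonia, 2010",
  "Creator:      Your Name, Your Institution (who)",
  "Contents:     Annual height growth of marked plants (what)",
  "Dates:        2010-01-15 to 2010-12-20 (when)",
  "Location:     Forest reserves near Manaus, Brazil (where)",
  "Methods:      Heights measured to the nearest cm with a meter stick (how)",
  "Purpose:      Demographic monitoring (why)",
  "Missing data: coded as NA",
  "Columns:      plant_id (text), site (text), date (YYYY-MM-DD), height_cm (numeric)"
)
writeLines(metadata, "heliconia_manaus_2010_growth_METADATA.txt")
```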
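For item 9, a sketch of a scripted raw-to-derived workflow. The directory layout and cleaning steps are hypothetical; the point is that the raw file is never edited by hand and every change is documented in code.

```r
# Read the untouched raw file
raw <- read.csv("data_raw/heliconia_manaus_2010_growth.csv",
                na.strings = "NA", stringsAsFactors = FALSE)

# All cleaning and derived variables happen in code, documented right here
clean <- raw[!is.na(raw$height_cm), ]      # drop records with no measurement
clean$height_m <- clean$height_cm / 100    # derived variable, units in the name

# Write the derived product to a separate folder; the raw file stays raw
dir.create("data_derived", showWarnings = FALSE)
write.csv(clean, "data_derived/heliconia_manaus_2010_growth_clean.csv",
          row.names = FALSE)
```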
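Finally, for item 10, a sketch of the "periodically test your ability to restore" idea using checksums. tools::md5sum() is part of base R; the file paths are hypothetical, and in practice the backup would live on a different drive or server.

```r
library(tools)  # md5sum()

archived <- "data_derived/heliconia_manaus_2010_growth_clean.csv"
backup   <- "backup/heliconia_manaus_2010_growth_clean.csv"

# Make the backup copy
dir.create("backup", showWarnings = FALSE)
file.copy(archived, backup, overwrite = TRUE)

# Periodic test: the checksums match only if the copy restores byte-for-byte
identical(unname(md5sum(archived)), unname(md5sum(backup)))
```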

Image: Numbers by Andy Maguire (CC BY 2.0)
