Exploring GBS data and some file conversions

I’ve been working on organizing my scripts to analyze GBS data for my dissertation chapter on the phylogeography of the Olympia oyster. Still a lot of work to do, but here’s a couple of notebooks. All notebooks can be found at my Github under 2016_Notebooks and the scripts referenced in them can be found in my Github’s Scripts repository.

GBS_File_Conversions.ipynb: I’m planning on using this notebook to describe various file conversion scripts and code snippets I’ve written to work with. May end up splitting them into separate notebooks, but we’ll see. Currently, it describes how to add population information to the Structure (.str) output from ipyrad/pyrad, whether that be integers (required by the actual Structure program), or strings (useful for programs like Adegenet in R). It also describes how to take a .geno or .str file and split it up into many .geno files for looking at pairwise Fst and nucleotide diversity in R.

Exploring_GBS_Data_With_R.ipynb: This notebook has snippets of code used to explore pairwise Fst and nucleotide diversity (pi). Will eventually expand to include other R-based analysis methods.

 

A taste of data analysis

Playing around with some data on protoshell length taken by Ryann (PSRF intern) from the salinity experiment.

In R:

 > plot(Length~Population, data = sal30df)
sal30_length

Boxplot of protoshell length in 7 day old Olympia oyster larvae from 3 populations in Puget Sound. n = 20 per population

Did an anova:

> aov.sal30 <- aov(Length~Population, data=sal30df)
> summary(aov.sal30)

Df Sum Sq Mean Sq F value Pr(>F)
Population 2 9281 4641 14.81 6.61e-06 ***
Residuals 57 17861 313

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Indicates that the means of protoshell length are significantly different (yay!).

> pop <- sal30$Population
> len <- sal30$Length
> ptest <- pairwise.t.test(len, pop, p.adjust = "bonferroni")

Pairwise comparisons using t tests with pooled SD

data: len and pop

Fidalgo Bay Hood Canal
Hood Canal 0.0913             –
Oyster Bay 0.0069          3.9e-06

P value adjustment method: bonferroni

Very significant p-value between Hood Canal and Oyster Bay.

Friday 9/11/15

Thought I’d be switching my online notebook over to the Open Notebook Science network, but when trying to write this post I found that I could not add a new category. Will try to figure that out, but until then will keep posting here.

Took my samples and reagents from the freezer in my lab on campus to the Pritzker DNA Lab at the Field Museum, where I do the bulk of my molecular work. Spent the day alternating between writing my Doctoral Dissertation Improvement Grant (DDIG) and some lab work.

Common Garden Experiment

  • Extracted DNA from 14 broodstock samples using the EZNA Mollusc kit (protocol here). Left to digest for 2.5 hours.
    1. SS2_6
    2. SS2_7
    3. SS2_8
    4. HC1_6
    5. HC1_7
    6. HC1_8
    7. HC1_9
    8. HC1_10
    9. NF1_3
    10. NF1_4
    11. NF1_5
    12. NF1_6
    13. NF1_7
    14. NF1_8
  • Ran out a gel from the DNA extractions of larvae samples done on 8/11/12 at Manchester.
    • Population Tank Family Size Date Storage Est. # Date extracted
      1 Hood Canal LC >100 7/13/2015 75/95 EtOH 8/11/2015
      2 Fidalgo Bay NA LC 100 7/13/2015 75% EtOH 8/11/2015
      3 South Sound NA LC 100 7/13/2015 RNALater 8/11/2015
      4 Hood Canal NA HC2 100 7/17/2015 RNALater 8/11/2015
      5 Hood Canal NA LC 100 7/17/2015 RNALater 8/11/2015
      6 South Sound NA LC 100 7/17/2015 RNALater 8/11/2015
      7 Hood Canal NA LC 100 7/20/2015 RNALater 8/11/2015
      8 Fidalgo Bay NA LC 100 7/20/2015 RNALater 8/11/2015
      9 South Sound NA LC 100 7/20/2015 RNALater 8/11/2015
      10 Hood Canal HC_Tank1_160 NA 160 7/20/2015 RNALater 8/11/2015
      11 Fidalgo Bay NF_Tank1_new NA 160 7/20/2015 RNALater 8/11/2015
      12 South Sound SS_Tank1_new NA 160 7/15/2015 RNALater 8/11/2015
      13 Hood Canal HC_Tank1_new NA 160 7/24/2015 RNALater 8/11/2015
      14 Fidalgo Bay NF_Tank1_new NA 160 7/24/2015 RNALater 8/11/2015
      15 South Sound SS_Tank1_new NA 160 7/24/2015 RNALater 8/11/2015
      16 Hood Canal HC_Tank1_new NA 160 7/27/2015 RNALater 8/11/2015
      17 Fidalgo Bay NF_Tank1_new NA 160 7/27/2015 RNALater 8/11/2015
      18 South Sound SS_Tank1_new NA 160 7/27/2015 RNALater 8/11/2015
      19 Hood Canal HC_Tank2_160 224 8/3/2015 RNALater 8/11/2015
      20 Fidalgo Bay NF_Tank2_160 224 8/3/2015 RNALater 8/11/2015
      21 South Sound SS_Tank2_160 224 8/3/2015 RNALater 8/11/2015
      22 Hood Canal HC_Tank2_160 >224 8/7/2015 RNALater 350 8/11/2015
      23 Fidalgo Bay NF_Tank2_160 >224 8/7/2015 RNALater 504 8/11/2015
      24 Oyster Bay SS_Tank2_160 >224 8/7/2015 RNALater 641 8/11/2015
    • The computer hooked up to the UV camera wouldn’t recognize a USB drive, so I just have a phone picture of a printed out picture for now. 12 and 13 are cut off, but they did not show up on gel. For gels of DNA extracts, I give a ranking from 1 (did not work at all) to 5 (bright band of high molecular weight DNA). These are put of the Sample Master Sheet.
      • gel_9_11_15
        1. 3.5
        2. 4
        3. 4
        4. 1 (rerun)
        5. 4
        6. 4.5
        7. 3.5
        8. 3 (low)
        9. 5
        10. 4
        11. 1 (rerun)
        12. 1 (rerun)
        13. 1 (rerun)
        14. 1 (rerun)
        15. 5
        16. 3 (degrad)
        17. 2.5 (degrad)
        18. 2 (degrad)
        19. 5
        20. 1 (rerun)
        21. 4.5
        22. 3 (degrad)
        23. 2.5 (low)
        24. 5
  • Playing around with data!
    • Looked at larvae release from each population across time. Did this partially out of burning curiosity and because it would be nice to have some preliminary data to put in my DDIG proposal.
    • First, I had to edit the Google Sheet to include zeros for days when a population released no larvae and to have total counts across families. Then in R:
    •  > larvae = read.csv("Larval counts - Day 1 (1).csv", header = TRUE) 
      > names(larvae)

      [1] “Date” “Population” “Family” “Tank.added.to” “Volume.of.tripour..mL.” “Vol.of.drop.counts” “Ethanol.used.”
      [8] “Live.Count.1” “Live.Count.2” “Live.Count.3” “Live.Count.4” “Total.Live.Larvae” “X…” “Notes”
      [15] “Dead.count..1” “Dead..2” “Dead..3” “Dead..4” “Total.Dead” “Total.Larvae” “Total.by.date”

 > pop_Total_Date_na <- na.omit(pop_Total_Date)
> ggplot(data=pop_Total_Date_na, aes(x=Date, y=Total.by.date, \
group=Population, colour=Population)) + geom_line() + geom_point()

Day1Larvae_date

  • Interestingly, the South Sound population produced more larvae earlier. This mirrors the reciprocal transplant experiment, where SS oysters reached their maximum percentage of brooding females sooner at two of the 4 sites.

Oly Population Structure

  • In addition to making libraries for samples from the common garden, I need to start getting DNA ready for one more sequencing run for my project looking at rangewide population structure in Olympia oysters. Did 10 extractions concurrently with the common garden extractions.
  1. WA1_16
  2. WA1_14
  3. BC1_19
  4. BC1_18
  5. CA6_16
  6. CA6_17
  7. CA6_18
  8. CA7_13
  9. OR3_7
  10. OR3_20