University of Florida
Bioinformatics


Lines Analysis

1. Mick Popp from the University of Florida sent twenty-six quantified image files on 12/13-14/04. The files are:

US22502637_251279810021_S01_A01.txt
US22502637_251279810022_S01_A01.txt
US22502637_251279810028_S01_A01.txt
US22502637_251279810029_S01_A01.txt
US22502637_251279810030_S01_A01.txt
US22502637_251279810031_S01_A01.txt
US22502637_251279810032_S01_A01.txt
US22502637_251279810033_S01_A01.txt
US22502637_251279810034_S01_A01.txt
US22502637_251279810035_S01_A01.txt
US22502637_251279810036_S01_A01.txt
US22502637_251279810001_S01_A01.txt
US22502637_251279810004_S01_A01.txt
US22502637_251279810005_S01_A01.txt
US22502637_251279810006_S01_A01.txt
US22502637_251279810007_S01_A01.txt
US22502637_251279810009_S01_A01.txt
US22502637_251279810010_S01_A01.txt
US22502637_251279810011_S01_A01.txt
US22502637_251279810012_S01_A01.txt
US22502637_251279810013_S01_A01.txt
US22502637_251279810014_S01_A01.txt
US22502637_251279810015_S01_A01.txt
US22502637_251279810016_S01_A01.txt
US22502637_251279810017_S01_A01.txt
US22502637_251279810018_S01_A01.txt

2. Damion Junk adapted the above files slightly to be compatible with Karthik's stacking program without changing the actual quantified results. The altered files are saved under the same names with “_jdfix” appended and are stored in the UFL_data_column_modified_to_run_in_karthiks_program directory.

Using the information provided in Nuzhybcodes.xls, Karthik’s data packaging programs created side-by-side and stacked files saved as tabsep_26files_sbs.txt and tabsep_26files_stacked.txt, respectively, using file_list_for_karthiks_program.csv to add the slide, dye, treatment, and rep.

3. The SAS program makedata.sas takes the stacked data supplied by Damion and imports it into SAS. The slide numbers were translated from the automatic numbering system used by the stacking program to the numbers supplied by Sergey Nuzhdin in Nuzhybcodes.xls. The information in the design file Purdue_McIntyre-001a.csv was merged into the quantified results by row and col. Our negative controls were identified and flagged with a 1 in the our_neg_con_flag column; all other probes have a 0 in that column. The variables sex, line, rep, and sex_line_rep (the concatenation of the previous three variables) were added according to Nuzhybcodes.xls. The annotation information was merged into the stacked data by sequence to add the probeuid column. The resulting file was called data_stacked_anno.sas7bdat.

4. The program macro_find_off_anne.sas finds which probes are effectively off and removes them from the data set. This is accomplished by finding the 90th percentile of our negative controls per slide per dye. If a probe was less than the 90th percentile for at least 50% of the replicates of a line, then the probe was considered to be off for that line. The percent that a probe is off for a particular line is saved in the column percent_off_line, where line is the specific line. If the probe is off for all of the lines, then that probe is considered to always be off and is assigned a value of 1 in the gene_off column. Otherwise, the probe is given a value of 0 in the gene_off column. Probes that are found to be off, including our negative controls, are removed from the data set, saved in off_list_anne.sas7bdat, and exported to off_list_anne.csv. The data for the probes that are found to be on are saved in anova_nooff_anne.sas7bdat.
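The off-probe rule in step 4 can be sketched in Python as follows. This is an illustration only, not the code in macro_find_off_anne.sas. The field names (slide, dye, line, probeuid, signal, neg_con), the linear-interpolation percentile, and the reading of "at least 50% of the replicates" as a fraction of the observations for a probe within a line are all assumptions; the percentile definition used by SAS may give slightly different cutoffs.

```python
import math
from collections import defaultdict


def percentile(values, pct):
    """Percentile by linear interpolation (one of several common definitions)."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("no values supplied")
    rank = (len(ordered) - 1) * pct / 100.0
    lo, hi = math.floor(rank), math.ceil(rank)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def find_off_probes(rows, pct=90.0, min_fraction=0.5):
    """rows: dicts with slide, dye, line, probeuid, signal, neg_con (0 or 1)."""
    # 1. Cutoff per slide/dye, taken from the negative controls only.
    #    (A slide/dye without controls would raise KeyError in step 2.)
    controls = defaultdict(list)
    for r in rows:
        if r["neg_con"] == 1:
            controls[(r["slide"], r["dye"])].append(r["signal"])
    cutoff = {key: percentile(vals, pct) for key, vals in controls.items()}

    # 2. Per probe and line, count observations below their slide/dye cutoff.
    below = defaultdict(int)
    total = defaultdict(int)
    for r in rows:
        if r["neg_con"] == 1:
            continue
        key = (r["probeuid"], r["line"])
        total[key] += 1
        if r["signal"] < cutoff[(r["slide"], r["dye"])]:
            below[key] += 1

    # 3. Fraction off per line; gene_off is 1 only if the probe is off in every line.
    percent_off = defaultdict(dict)
    for (probe, line), n in total.items():
        percent_off[probe][line] = below[(probe, line)] / n
    gene_off = {
        probe: int(all(frac >= min_fraction for frac in by_line.values()))
        for probe, by_line in percent_off.items()
    }
    return percent_off, gene_off
```

Probes with gene_off equal to 1 would then go to the off list and the remainder to the "on" data set.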

5. The program normalize_all.sas normalized the data in anova_nooff_anne.sas7bdat. The data were normalized in several ways:

Bgsubsignal_quartile - the quartile of the bgsubsignal of the particular slide/dye

Log_bgsubsignal - the natural log transform of bgsubsignal

Sqrt_bgsubsignal - the square root of bgsubsignal

Log10_bgsubsignal - the log base 10 transform of bgsubsignal

Bgsubsignal_med - bgsubsignal divided by the median of its respective slide/dye combination.

Log_bgsubsignal_med - the natural log transform of bgsubsignal_med

Bgsubsignal_rank - the rank normalized bgsubsignal performed per slide per dye

The results are saved as anova_normalizations.sas7bdat
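As a rough illustration of the transforms listed in step 5, the sketch below computes the log, square root, log10, and median-scaled columns in Python. It is not the code in normalize_all.sas. The input field names are assumptions, bgsubsignal is assumed to be strictly positive, and the quartile and rank columns are left out because their exact definitions are not stated here.

```python
import math
from collections import defaultdict
from statistics import median


def add_normalizations(rows):
    """rows: dicts with slide, dye, bgsubsignal (assumed > 0). Returns new dicts."""
    # Median of bgsubsignal within each slide/dye combination.
    by_group = defaultdict(list)
    for r in rows:
        by_group[(r["slide"], r["dye"])].append(r["bgsubsignal"])
    group_median = {key: median(vals) for key, vals in by_group.items()}

    out = []
    for r in rows:
        x = r["bgsubsignal"]
        scaled = x / group_median[(r["slide"], r["dye"])]
        out.append({
            **r,
            "log_bgsubsignal": math.log(x),        # natural log
            "sqrt_bgsubsignal": math.sqrt(x),
            "log10_bgsubsignal": math.log10(x),
            "bgsubsignal_med": scaled,             # divided by slide/dye median
            "log_bgsubsignal_med": math.log(scaled),
        })
    return out
```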

6. The program looking_for_transforms.sas was used to determine which normalization and/or transformation technique would be best for the data in anova_nooff_anne.sas7bdat. It was determined that the log base 10 was the optimum transformation and that line 78 may be more variable than the others.

7. The mean of each of the lines for every probe was taken based upon log10_bgsubsignal in expression_means_lines.sas and saved as exp_means_line.sas7bdat. The mean of each of the lines by sex for every probe was taken based upon log10_bgsubsignal in expression_means.sas. The means of each sex were ranked for each probe and listed in order from least to greatest in the male_means_order and female_means_order columns, and the results were saved as line_means_order.sas7bdat.
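The per-sex means and their ordering in step 7 could be computed along these lines. This is a sketch under assumptions, not the code in expression_means.sas: the field names are hypothetical, and writing the ordered line names as a space-separated string is a guess at how the male_means_order and female_means_order columns are laid out.

```python
from collections import defaultdict
from statistics import mean


def means_order(rows, value="log10_bgsubsignal"):
    """rows: dicts with probeuid, sex, line, and the value column."""
    # Collect replicate values for each probe/sex/line cell.
    cells = defaultdict(list)
    for r in rows:
        cells[(r["probeuid"], r["sex"], r["line"])].append(r[value])

    # Mean per line within each probe/sex.
    line_means = defaultdict(dict)
    for (probe, sex, line), vals in cells.items():
        line_means[(probe, sex)][line] = mean(vals)

    # Line names listed from the smallest mean to the largest.
    order = {
        key: " ".join(sorted(by_line, key=by_line.get))
        for key, by_line in line_means.items()
    }
    return line_means, order
```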

8. The ANOVA was performed on anova_normalizations.sas7bdat in log10_anova.sas on the following model:

Y_{ijk} = μ + d_i + l_j + s_k + (sl)_{jk} + ε_{ijk}

where Y = log10_bgsubsignal

μ = overall mean of the normalized values for that probeuid

d = dye and i = Cy3, Cy5

l = line and j = ore, 2b3, 09, 12, 15, 38, 70, 78

s = sex and k = male, female

(sl) = interaction effects of sex and line

ε = error

The effects and interactions from the above model were saved and flagged with a 1 if any of the p-values were less than 0.05. Otherwise, they were flagged as 0. The names are as follows:

| Effect or interaction | Name with all lines | Name without line 78 |
|---|---|---|
| Dye | dye_log10_flag, pdye_log10 | dye_log10_no78_flag, pdye_log10_no78 |
| Line | line_log10_flag, pline_log10 | line_log10_no78_flag, pline_log10_no78 |
| Line by sex interaction | linebysex_log10_flag, plinebysex_log10 | linebysex_log10_no78_flag, plinebysex_log10_no78 |
| Sex | sex_log10_flag, psex_log10 | sex_log10_no78_flag, psex_log10_no78 |

Several tests were run on the residuals. The results were output as follows:

| Test or statistic | Name with all lines | Name without line 78 |
|---|---|---|
| Mean | mean_log10_bgsubsignal | mean_log10_bgsub_no78 |
| Median | median_log10_bgsubsignal | median_log10_bgsub_no78 |
| Sign statistic | msign_log10_bgsubsignal | msign_log10_bgsub_no78 |
| Test statistic for normality | normal_log10_bgsubsignal | normal_log10_bgsub_no78 |
| Flag for normality test (0 if p > 0.05, 1 if p ≤ 0.05) | norm_flag_log10_bgsub | norm_flag_log10_bgsub_no78 |
| Probability of a greater absolute value for the sign statistic | probm_log10_bgsubsignal | probm_log10_bgsub_no78 |
| Probability value for the test of normality | probn_log10_bgsubsignal | probn_log10_bgsub_no78 |
| Probability value for the signed rank test | probs_log10_bgsubsignal | probs_log10_bgsub_no78 |
| Probability value for the Student's t test | probt_log10_bgsubsignal | probt_log10_bgsub_no78 |
| Statistic for the Student's t test | t_log10_bgsubsignal | t_log10_bgsub_no78 |
| Signed rank statistic | signrank_log10_bgsubsignal | signrank_log10_bgsub_no78 |
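To make a few of the residual statistics concrete, the sketch below computes the mean, median, Student's t statistic, sign statistic, and sign-test probability for one probe's residuals. It assumes the column prefixes (msign, probm, t) follow the definitions used by SAS PROC UNIVARIATE, where the sign statistic is M = (n+ - n-)/2; this log does not state which procedure produced them. The normality test, signed rank test, and t-test probability are omitted because they need distribution functions beyond a short example.

```python
import math
from statistics import mean, median, stdev


def residual_summary(residuals):
    """Summary statistics for testing whether residuals are centred on zero."""
    n = len(residuals)
    avg = mean(residuals)
    # Student's t for H0: mean = 0 (needs n >= 2 and non-constant residuals).
    t_stat = avg / (stdev(residuals) / math.sqrt(n))

    # Sign statistic: zeros are dropped, M = (n_pos - n_neg) / 2.
    n_pos = sum(1 for r in residuals if r > 0)
    n_neg = sum(1 for r in residuals if r < 0)
    n_used = n_pos + n_neg
    msign = (n_pos - n_neg) / 2

    # Two-sided binomial probability of a sign statistic at least this extreme,
    # capped at 1 for the balanced case.
    tail = sum(math.comb(n_used, j) for j in range(min(n_pos, n_neg) + 1))
    probm = min(1.0, 2 * tail / 2 ** n_used)

    return {"mean": avg, "median": median(residuals),
            "t": t_stat, "msign": msign, "probm": probm}
```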

Contrasts were run to test specific lines against each other. The p-values of the contrast results were saved as follows:

| Contrast | Name with all lines | Name without line 78 |
|---|---|---|
| Line ore vs. line 2b3 | Contrast_orevs2b3 | Contrast_orevs2b3_no78 |
| Parents vs. offspring | Contrast_parvsoff | N/A |
| Line 09 vs. line 2b3 | Contrast_09vs2b3 | Contrast_09vs2b3_no78 |
| Line 12 vs. line 2b3 | Contrast_12vs2b3 | Contrast_12vs2b3_no78 |
| Line 15 vs. line 2b3 | Contrast_15vs2b3 | Contrast_15vs2b3_no78 |
| Line 38 vs. line 2b3 | Contrast_38vs2b3 | Contrast_38vs2b3_no78 |
| Line 70 vs. line 2b3 | Contrast_70vs2b3 | Contrast_70vs2b3_no78 |
| Line 78 vs. line 2b3 | Contrast_78vs2b3 | N/A |
| Line 09 vs. line ore | Contrast_09vsore | Contrast_09vsore_no78 |
| Line 12 vs. line ore | Contrast_12vsore | Contrast_12vsore_no78 |
| Line 15 vs. line ore | Contrast_15vsore | Contrast_15vsore_no78 |
| Line 38 vs. line ore | Contrast_38vsore | Contrast_38vsore_no78 |
| Line 70 vs. line ore | Contrast_70vsore | Contrast_70vsore_no78 |
| Line 78 vs. line ore | Contrast_78vsore | N/A |

Flags for the contrasts above were added. If the p-value for the contrast met a flat threshold of 0.05, then the values were flagged and saved under the same name as their contrast with “_flag” appended to the end of it. For instance, the flag values for contrast_orevs2b3 are labeled as contrast_orevs2b3_flag and contrast_orevs2b3_no78 as contrast_orevs2b3_flag_no78. The effects and interactions, results of the residual tests, and the contrasts as well as all of their respective flags were merged into a single file and saved as sig_norm_flags.sas7bdat.
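The flag naming conventions for the effects and the contrasts can be illustrated with a small Python helper. It is a sketch, not code from log10_anova.sas: the function names are hypothetical, the comparison is assumed to be strict (p < 0.05), and missing p-values are assumed to stay unflagged.

```python
def flag_name(p_column):
    """Derive the flag column name from a p-value column name."""
    name = p_column.lower()
    if name.startswith("contrast_"):
        # Contrasts: "_flag" goes before a trailing "_no78".
        if name.endswith("_no78"):
            return name[: -len("_no78")] + "_flag_no78"
        return name + "_flag"
    if name.startswith("p"):
        # Effects: drop the leading "p" and append "_flag".
        return name[1:] + "_flag"
    raise ValueError(f"not a recognised p-value column: {p_column}")


def add_flags(record, p_columns, alpha=0.05):
    """Return a copy of record with a 0/1 flag for each listed p-value column."""
    flagged = dict(record)
    for column in p_columns:
        p = record.get(column)
        if p is None:  # missing p-value: leave the flag out
            continue
        flagged[flag_name(column)] = 1 if p < alpha else 0
    return flagged
```

For example, add_flags({"probeuid": 1, "pline_log10": 0.01}, ["pline_log10"]) adds line_log10_flag = 1 to the record.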

9. The program extremes_influential.sas checks if there is any correlation between extreme bgsubsignal values and normality problems and Cook’s D.

10. The program check_missings.sas checks results_all_geno2_all0123.sas7bdat for any correlation between missing bgsubsignal values and normality problems; no such correlation was found.

11. The FDRs were calculated for three different threshold levels for the data in sig_norm_flags.sas7bdat by macro_fdr_orig.sas. If a probe meets the 0.05 threshold, it is designated as red in the flag column; if it meets the 0.2 threshold, orange; if it meets the 0.5 threshold, yellow. If a probe fails to meet any of those thresholds, it is designated as tan. The flag columns are labeled as follows:

| p-value | Name of FDR column |
|---|---|
| Pline_log10 | Fdr_pline_log10 |
| Pline_log10_no78 | Fdr_pline_log10_no78 |
| Psex_log10 | Fdr_psex_log10 |
| Psex_log10_no78 | Fdr_psex_log10_no78 |
| Pdye_log10 | Fdr_pdye_log10 |
| Pdye_log10_no78 | Fdr_pdye_log10_no78 |
| Plinebysex_log10 | Fdr_plinebysex_log10 |
| Plinebysex_log10_no78 | Fdr_plinebysex_log10_no78 |
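The colour coding in step 11 might look like the following in Python. The FDR procedure used by macro_fdr_orig.sas is not named in this log, so the Benjamini-Hochberg adjustment shown here is an assumption chosen for illustration, as is treating "meets the threshold" as less than or equal to.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values (assumed method, see note above)."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


def fdr_colour(q):
    """Map an FDR-adjusted value to the colour flags used in the results."""
    if q <= 0.05:
        return "red"
    if q <= 0.2:
        return "orange"
    if q <= 0.5:
        return "yellow"
    return "tan"
```

Each p-value column in the table would be adjusted separately across probes, and the colour stored in the matching Fdr_ column.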

12. The expression means, ANOVA results and flags, the FDR flags, and the annotation information were merged into a single file in results_all_log10.sas. The files sig_norm_flags.sas7bdat, expression_means.sas7bdat, results_fdr.sas7bdat, and anno.sas7bdat were merged together by probeuid, saved as results_all_log10.sas7bdat, and exported to results_all_log10.csv.

13. The results in results_all_log10.sas7bdat for the probes that Larry Harshman provided are subset in ladder_subset.sas and saved as ladder.sas7bdat.

 

---

Parental Lines Analysis

The parental lines analysis uses steps 1-6 from the lines analysis.

14. The program log10_anova_parents.sas subsets the parental lines, ore and 2b3, from W:\anne\data\SAS data\anova_normalizations.sas7bdat. The ANOVA was performed according to the following model:

Y_{ijk} = μ + d_i + l_j + s_k + (sl)_{jk} + (dl)_{ij} + ε_{ijk}

where Y = log10_bgsubsignal

μ = overall mean of the normalized values for that probeuid

d = dye and i = Cy3, Cy5

l = line and j = ore, 2b3

s = sex and k = male, female

(sl) = interaction effects of sex and line

(dl) = interaction effects of dye and line

ε = error

The effects and interactions from the above model were saved and flagged with a 1 if any of the p-values were less than 0.05. Otherwise, they were flagged as 0. The names are as follows:

| Effect or interaction | Name with all lines | Name without line 78 |
|---|---|---|
| Dye | dye_log10_flag, pdye_log10 | dye_log10_no78_flag, pdye_log10_no78 |
| Line | line_log10_flag, pline_log10 | line_log10_no78_flag, pline_log10_no78 |
| Line by sex interaction | linebysex_log10_flag, plinebysex_log10 | linebysex_log10_no78_flag, plinebysex_log10_no78 |
| Line by dye interaction | dyebyline_log10_flag, pdyebyline_log10 | dyebyline_log10_no78_flag, pdyebyline_log10_no78 |
| Sex | sex_log10_flag, psex_log10 | sex_log10_no78_flag, psex_log10_no78 |

Several tests were run on the residuals. The results were output as follows:

| Test or statistic | Name with all lines | Name without line 78 |
|---|---|---|
| Mean | mean_log10_bgsubsignal | mean_log10_bgsub_no78 |
| Median | median_log10_bgsubsignal | median_log10_bgsub_no78 |
| Sign statistic | msign_log10_bgsubsignal | msign_log10_bgsub_no78 |
| Test statistic for normality | normal_log10_bgsubsignal | normal_log10_bgsub_no78 |
| Flag for normality test (0 if p > 0.05, 1 if p ≤ 0.05) | norm_flag_log10_bgsub | norm_flag_log10_bgsub_no78 |
| Probability of a greater absolute value for the sign statistic | probm_log10_bgsubsignal | probm_log10_bgsub_no78 |
| Probability value for the test of normality | probn_log10_bgsubsignal | probn_log10_bgsub_no78 |
| Probability value for the signed rank test | probs_log10_bgsubsignal | probs_log10_bgsub_no78 |
| Probability value for the Student's t test | probt_log10_bgsubsignal | probt_log10_bgsub_no78 |
| Statistic for the Student's t test | t_log10_bgsubsignal | t_log10_bgsub_no78 |
| Signed rank statistic | signrank_log10_bgsubsignal | signrank_log10_bgsub_no78 |

The effects, interactions, tests, and statistics above were saved as sig_norm_flags_parents.sas7bdat.

15. The FDRs were calculated by macro_fdr_orig_parents.sas on sig_norm_flags_parents.sas7bdat. If a probe meets the 0.05 threshold, it is designated as red in the flag column; if it meets the 0.2 threshold, orange; if it meets the 0.5 threshold, yellow. If a probe fails to meet any of those thresholds, it is designated as tan. The flag columns are labeled as follows:

| p-value | Name of FDR column |
|---|---|
| Pline_log10 | Fdr_pline_log10 |
| Psex_log10 | Fdr_psex_log10 |
| Pdye_log10 | Fdr_pdye_log10 |
| Plinebysex_log10 | Fdr_plinebysex_log10 |
| Pdyebyline_log10 | Fdr_pdyebyline_log10 |

The results were saved as results_fdr_parents.sas7bdat.

16. The expression means, ANOVA results and flags, the FDR flags, and the annotation information were merged into a single file in results_all_parents.sas. The files sig_norm_flags_parents.sas7bdat, expression_means.sas7bdat, results_fdr_parents.sas7bdat, and anno.sas7bdat were merged together by probeuid, saved as results_all_parents.sas7bdat, and exported to results_all_parents.csv.

 

---

Genotype2 Analysis

The genotype2 analysis uses steps 1-6 from the lines analysis.

17. The program troubleshoot_means.v4.sas merges the data set anova_normalizations.sas7bdat with data from genotypes.corrected4.csv, the file supplied to us by Anne Genissel. The program transforms the 8 line columns from genotypes.corrected4.csv into a single stacked column called genotype. If genotype is equal to “7” or “99”, then genotype1 is given the missing value “.”. Otherwise, genotype1 is equal to genotype. Genotype2 weights the offspring lines, 09, 12, 15, 38, 70, and 78, differently than the parental lines, ore and 2b3. This is done by adding 2 to the genotype1 value for the offspring lines. Genotype2 values for the parental lines are equal to their respective genotype1 values. Missing values for genotype1 remain missing values in genotype2. The means for the four non-missing values of genotype2, 0, 1, 2, and 3, for both sexes are calculated in separate columns and saved as follows:

Mean_0f → mean of the reps when genotype2 = 0 for the females within a probeuid

Mean_0m → mean of the reps when genotype2 = 0 for the males within a probeuid

Mean_1f → mean of the reps when genotype2 = 1 for the females within a probeuid

Mean_1m → mean of the reps when genotype2 = 1 for the males within a probeuid

Mean_2f → mean of the reps when genotype2 = 2 for the females within a probeuid

Mean_2m → mean of the reps when genotype2 = 2 for the males within a probeuid

Mean_3f → mean of the reps when genotype2 = 3 for the females within a probeuid

Mean_3m → mean of the reps when genotype2 = 3 for the males within a probeuid

The means were merged back into the full data set along with the genotype, genotype1, and genotype2 indicator variables and saved as norm_hope_means_v4.sas7bdat.
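The genotype recoding and class means in step 17 can be sketched as follows. This is an illustration, not the code in troubleshoot_means.v4.sas. It assumes that non-missing genotype codes are the integers 0 and 1, that sex is coded as "female" or "male", and that the means are taken over log10_bgsubsignal; the field names are hypothetical.

```python
from collections import defaultdict
from statistics import mean

MISSING_CODES = {"7", "99"}
OFFSPRING_LINES = {"09", "12", "15", "38", "70", "78"}


def genotype1(genotype):
    """Codes 7 and 99 become missing (None); anything else is kept as an int."""
    code = str(genotype).strip()
    return None if code in MISSING_CODES else int(code)


def genotype2(line, g1):
    """Offspring lines are shifted by 2; parental lines keep genotype1."""
    if g1 is None:
        return None
    return g1 + 2 if line in OFFSPRING_LINES else g1


def class_means(rows, value="log10_bgsubsignal"):
    """Per probe, the mean of the reps for each genotype2 class and sex."""
    cells = defaultdict(list)
    for r in rows:
        g2 = genotype2(r["line"], genotype1(r["genotype"]))
        if g2 is not None:
            cells[(r["probeuid"], g2, r["sex"])].append(r[value])
    means = defaultdict(dict)
    for (probe, g2, sex), vals in cells.items():
        # Column names follow the pattern mean_0f, mean_0m, ..., mean_3m.
        means[probe][f"mean_{g2}{sex[0]}"] = mean(vals)
    return means
```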

18. The program subset_means_v4.sas subsets data for probes that are missing or not missing different classes of genotype2 from the data norm_hope_means_v4.sas7bdat.

| Classes of genotype2 present | Number of probes | File name |
|---|---|---|
| 0123 | 6562 | geno2_good_means0123.sas7bdat |
| 012 | 4032 | geno2_good_means012.sas7bdat |
| 013 | 705 | geno2_good_means013.sas7bdat |
| 023 | 0 | geno2_good_means023.sas7bdat |
| 123 | 0 | geno2_good_means123.sas7bdat |
| 01 | 0 | geno2_good_means01.sas7bdat |
| 02 | 0 | geno2_good_means02.sas7bdat |
| 03 | 0 | geno2_good_means03.sas7bdat |
| 12 | 0 | geno2_good_means12.sas7bdat |
| 13 | 0 | geno2_good_means13.sas7bdat |
| 23 | 0 | geno2_good_means23.sas7bdat |
| 0 | 0 | geno2_good_means0.sas7bdat |
| 1 | 0 | geno2_good_means1.sas7bdat |
| 2 | 0 | geno2_good_means2.sas7bdat |
| 3 | 0 | geno2_good_means3.sas7bdat |
| none | 816 | geno2_good_nomeans.sas7bdat |
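The grouping in the table above amounts to labelling each probe by the set of genotype2 classes for which it has data. A minimal Python sketch, assuming each row carries a probeuid and a genotype2 value (None when missing), is shown below; it is not the code in subset_means_v4.sas.

```python
from collections import Counter, defaultdict


def class_pattern(genotype2_values):
    """Label such as '0123', '013', or 'none' for the classes that are present."""
    present = sorted({g for g in genotype2_values if g is not None})
    return "".join(str(g) for g in present) or "none"


def count_patterns(rows):
    """Number of probes per class pattern."""
    per_probe = defaultdict(list)
    for r in rows:
        per_probe[r["probeuid"]].append(r["genotype2"])
    return Counter(class_pattern(vals) for vals in per_probe.values())
```

The resulting counts correspond to the "Number of probes" column, and each label to the suffix of the matching geno2_good_means file name.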

The means of the different classes within a sex were ranked. The classes were then listed in rank order in the columns female_means_order and male_means_order. These two columns were combined with the means calculated in the previous step and saved as means_order_v4.sas7bdat and exported to means_order_v4.csv.

19. The ANOVA was performed on norm_hope_means_v4.sas7bdat in genotype2_anova_class0123_v4.sas on the following model:

Y_{ijk} = μ + d_i + l_j + s_k + (sl)_{jk} + ε_{ijk}

where Y = log10_bgsubsignal

μ = overall mean of the normalized values for that probeuid

d = dye and i = Cy3, Cy5

l = genotype2 and j = 0, 1, 2, 3

s = sex and k = male, female

(sl) = interaction effects of sex and genotype2

ε = error

The effects and interactions from the above model were saved and flagged with a 1 if any of the p-values were less than 0.05. Otherwise, they were flagged as 0. The names are as follows:

| Effect or interaction | Name with all lines | Name without line 78 |
|---|---|---|
| Dye | dye_log10_flag, pdye_log10 | dye_log10_no78_flag, pdye_log10_no78 |
| Genotype2 | genotype2_log10_flag, pgenotype2_log10 | genotype2_log10_no78_flag, pgenotype2_log10_no78 |
| Genotype2 by sex interaction | genotype2bysex_log10_flag, pgenotype2bysex_log10 | genotype2bysex_log10_no78_flag, pgenotype2bysex_log10_no78 |
| Sex | sex_log10_flag, psex_log10 | sex_log10_no78_flag, psex_log10_no78 |

Several tests were run on the residuals. The results were output as follows:

| Test or statistic | Name with all lines | Name without line 78 |
|---|---|---|
| Mean | mean_log10_bgsubsignal | mean_log10_bgsub_no78 |
| Median | median_log10_bgsubsignal | median_log10_bgsub_no78 |
| Sign statistic | msign_log10_bgsubsignal | msign_log10_bgsub_no78 |
| Test statistic for normality | normal_log10_bgsubsignal | normal_log10_bgsub_no78 |
| Flag for normality test (0 if p > 0.05, 1 if p ≤ 0.05) | norm_flag_log10_bgsub | norm_flag_log10_bgsub_no78 |
| Probability of a greater absolute value for the sign statistic | probm_log10_bgsubsignal | probm_log10_bgsub_no78 |
| Probability value for the test of normality | probn_log10_bgsubsignal | probn_log10_bgsub_no78 |
| Probability value for the signed rank test | probs_log10_bgsubsignal | probs_log10_bgsub_no78 |
| Probability value for the Student's t test | probt_log10_bgsubsignal | probt_log10_bgsub_no78 |
| Statistic for the Student's t test | t_log10_bgsubsignal | t_log10_bgsub_no78 |
| Signed rank statistic | signrank_log10_bgsubsignal | signrank_log10_bgsub_no78 |

Contrasts were run to test specific genotype2 classes against each other. The p-values of the contrast results were saved as follows:

| Contrast | Name with all lines | Name without line 78 |
|---|---|---|
| Line ore vs. line 2b3 | Contrast_parents | Contrast_parents_no78 |
| Offspring 0 vs. 1 | Contrast_offspring | Contrast_offspring_no78 |
| 0 vs. 1 | Contrast_0vs1 | Contrast_0vs1_no78 |
| Parents 0 vs. offspring 0 | Contrast_p0vso0 | Contrast_p0vso0_no78 |
| Parents 1 vs. offspring 1 | Contrast_p1vso1 | Contrast_p1vso1_no78 |

Flags for the contrasts above were added. If the p-value for the contrast met a flat threshold of 0.05, then the values were flagged and saved under the same name as their contrast with “_flag” appended to the end of it. For instance, the flag values for contrast_orevs2b3 are labeled as contrast_orevs2b3_flag and contrast_orevs2b3_no78 as contrast_orevs2b3_flag_no78. The effects and interactions, results of the residual tests, appropriate means, and the contrasts as well as all of their respective flags were merged into a single file and saved as sig_norm_flags_geno2_0123_v4.sas7bdat.

20. The FDRs were calculated by macro_fdr_geno2.sas on sig_norm_flags_geno2_0123_v4.sas7bdat. If a probe meets the 0.05 threshold, it is designated as red in the flag column; if it meets the 0.2 threshold, orange; if it meets the 0.5 threshold, yellow. If a probe fails to meet any of those thresholds, it is designated as tan. The flag columns are labeled as follows:

| p-value | Name of FDR column |
|---|---|
| Pgenotype2_log10 | Fdr_pgenotype2_log10 |
| Pgenotype2_log10_no78 | Fdr_pgenotype2_log10_no78 |

The results were saved as geno2_fdr_0123.sas7bdat.

21. The expression means, ANOVA results and flags, the FDR flags, and the annotation information were merged into a single file in results_all_parents.sas. The files sig_norm_flags_geno2_0123_v4.sas7bdat, means_order_v4.sas7bdat, results_fdr_geno2_0123.sas7bdat, and anno.sas7bdat were merged together by probeuid, saved as results_all_geno2_all0123.sas7bdat, and exported to results_all_geno2_all0123.csv.