I am new to GSEA, so please excuse me if these are basic questions. I am currently analyzing WES data, and for that I compare my gene lists to different custom backgrounds. When I looked at the results of the Functional Annotation Chart, I noticed something in the Count, LT, PH and PT columns that I cannot explain.
It seems that a gene in my gene list is only counted in the "Count" column if it also appears in the background. Thus, the more background genes I include, the bigger the "Count" gets and the more significant the enrichment is. Is this intentional? And if yes, why? I thought that the background shouldn't play a role in determining which genes of my gene list are in a particular pathway or GO term.
Furthermore, I noticed that the PT (population total) and LT (List total) are different for each Term. I also have no idea why this is the case, since the number of genes in my gene list and my background shouldn't change.
If somebody could help me understand these observations, I'd be very grateful!
DAVID does not perform GSEA. It uses a modified Fisher exact test for enrichment analysis, which means that you should upload the genes of interest from your experiment (e.g. differentially expressed genes that meet a fold-change threshold) and choose a background that represents the pool from which your gene list could have been selected.
The genes in your list must come from the set of background genes. Otherwise, they are not a part of the analysis. A good example is a gene expression microarray. You only have the possibility of selecting the genes for which there are probes on your microarray.
PT (population total) and LT (list total) are specific to the category (e.g. KEGG_PATHWAY). Each is the number of genes (in the background or in the gene list, respectively) that are annotated to at least one term in the category.
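To make the two points above concrete, here is a minimal toy sketch of how Count, LT, PH and PT would come out if they are computed as described in this answer. The gene names, the two terms and the function name chart_counts are made up for illustration; this is not DAVID's actual code.

```python
# Toy illustration only: gene names and annotations are made up.
category = {
    "term_A": {"G1", "G2", "G3"},
    "term_B": {"G3", "G4"},
}
background = {"G1", "G2", "G3", "G4", "G5", "G6"}
gene_list = {"G1", "G3", "G5", "G7"}  # G7 is not in the background


def chart_counts(gene_list, background, category, term):
    """Return (Count, List Total, Pop Hits, Pop Total) for one term."""
    # Genes annotated to at least one term in this category.
    annotated = set().union(*category.values())
    # List genes outside the background are not part of the analysis.
    usable_list = gene_list & background
    pop_total = len(background & annotated)
    list_total = len(usable_list & annotated)
    pop_hits = len(background & category[term])
    count = len(usable_list & category[term])
    return count, list_total, pop_hits, pop_total


print(chart_counts(gene_list, background, category, "term_A"))
# → (2, 2, 3, 4)
```

With the smaller background {"G1", "G5"} the same call gives (1, 1, 1, 1): G3 drops out of the Count because it is no longer in the background, which is the behaviour described in the question.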
Thank you very much! This was extremely helpful! I guess in the Functional annotation chart "Count" is equivalent to "LH"?
In that case, if I am not mistaken the DAVID tool is not suitable for the type of analysis I want to carry out.
Specifically, I have a list of mutated genes for many different biopsies, and my idea was to analyze whether certain genes are mutated more often than other genes. For example, I wanted to compare biopsy 1 vs. biopsy 2 and see whether more genes in the TNF signaling pathway are mutated in one of them. Can you recommend a tool with which I can carry out this kind of analysis?
The following explanation has been given, based on the instructions from the above DAVID link. Consider the following result from a chart record.
Example record:
Category: GAD_DISEASE
Term: metabolic syndrome
Count: 6
%: 16.66666667
PValue: 5.58E-05
Genes: HSD11B1, FADS2, ADIPOQ, APOE, F3, PPARGC1A
List Total: 34
Pop Hits: 165
Pop Total: 12971
Fold Enrichment: 13.87272727
Bonferroni: 0.058244592
Benjamini: 0.034299056
FDR: 0.034044044
Category - The annotation source (database) that the term comes from. Here the category is GAD_DISEASE.
Count - Genes involved in individual terms. The Count column in the result chart gives the number of genes from the input list that have been annotated to that specific term. Here the count is 6. This means that, out of 36 total genes (from my list), 6 genes are annotated to the "metabolic syndrome" term.
% = (Involved genes / Total genes) * 100
Involved genes = 6 (value under Count)
Total genes = 36 (from my study)
% = (6/36) * 100 = 16.66666667 (the value displayed under the % column)
Using this formula, one can check the values under the % column of the chart. Note that the denominator is the number of genes submitted (36), not the List Total of 34.
P-Value = EASE Score, the modified Fisher exact p-value. The smaller the value, the more enriched the term. The p-value shown for a term inside an annotation cluster is exactly the same as the p-value (Fisher exact/EASE Score) shown for that term in the regular Chart Report.
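For anyone who wants to see roughly what "modified Fisher exact" means, here is a minimal sketch. It assumes the usual description of the EASE score, namely a one-sided Fisher exact (hypergeometric) test in which one gene is removed from the list hits before testing. The function name hypergeom_upper_tail is my own, and DAVID's exact table construction may differ, so the result should be close to, but not necessarily identical to, the 5.58E-05 in the example record.

```python
from math import comb


def hypergeom_upper_tail(k, pop_total, pop_hits, list_total):
    """P(X >= k) when list_total genes are drawn without replacement from
    pop_total genes, of which pop_hits carry the annotation."""
    k = max(k, 0)
    upper = min(pop_hits, list_total)
    # Count the draws with at least k annotated genes (exact integers).
    favourable = sum(
        comb(pop_hits, i) * comb(pop_total - pop_hits, list_total - i)
        for i in range(k, upper + 1)
    )
    return favourable / comb(pop_total, list_total)


# Values from the example record.
LH, LT, PH, PT = 6, 34, 165, 12971

fisher_p = hypergeom_upper_tail(LH, PT, PH, LT)    # plain one-sided test
ease_p = hypergeom_upper_tail(LH - 1, PT, PH, LT)  # assumed EASE-style: LH - 1

print(f"one-sided Fisher p: {fisher_p:.3g}")
print(f"EASE-style p:       {ease_p:.3g}")
```

Removing one hit always makes the p-value larger, so the EASE-style value is the more conservative of the two. This penalises terms that are supported by only a handful of genes.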
Genes = Official gene symbols of the involved genes. The number given under the "Count" column matches the number of genes listed in the "Genes" column. For example, a Count of 6 means that 6 genes are annotated to the term "metabolic syndrome", and the names of those 6 genes are displayed under the Genes column.
General definitions:
Population Total (PT) = Total genes in the background (here the human genome).
Population Hits (PH) = Total number of background genes involved in the specific pathway / database term.
List Total (LT) = Total genes in the gene list (user's input gene list).
List Hits (LH) = Total number of genes from the gene list (user's input gene list) involved in the specific pathway / database term.
As noted in the earlier answer, PT and LT only count genes annotated to at least one term in the category, which is why LT is 34 here rather than 36.
From the example table:
List Total (LT) = 34 (value under List Total)
Population Hits (PH) = 165 (value under Pop Hits)
Population Total (PT) = 12971 (value under Pop Total)
List Hits (LH) = 6 (value under Count)
With "Homo sapiens" as the background (12,971 genes total; Population Total (PT)), 165 genes are involved in metabolic syndrome (Population Hits (PH)). In the given gene list, six genes (List Hits (LH)) out of 34 total genes (List Total (LT)) belong to metabolic syndrome. We then ask whether 6/34 is more than would be expected by random chance, compared to the "Homo sapiens" background of 165/12971.
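The Fold Enrichment and % columns of the example record can be checked by hand from these four numbers. A small sketch (the function names are my own, not DAVID's):

```python
def fold_enrichment(list_hits, list_total, pop_hits, pop_total):
    """Ratio of the proportion in the list to the proportion in the background."""
    return (list_hits / list_total) / (pop_hits / pop_total)


def percent_of_list(count, genes_submitted):
    """Value of the % column: involved genes over all submitted genes."""
    return count / genes_submitted * 100


fe = fold_enrichment(6, 34, 165, 12971)
pct = percent_of_list(6, 36)

print(round(fe, 8))   # 13.87272727
print(round(pct, 8))  # 16.66666667
```

Both values match the example record: the list proportion 6/34 is about 13.9 times the background proportion 165/12971.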
The DAVID tool is used when the gene list consists of "Differentially Expressed Genes" (DEGs). The tool tests which annotation terms are enriched among those DEGs and displays the results as a chart based on modified Fisher exact p-values. The smaller the p-value, the more strongly the term is enriched in the gene list, i.e. the more statistically significant the enrichment.