Expert Insights | How to Analyze CRISPR Library Data to Find Targets
CRISPR screening is a high-throughput gene screening method based on the CRISPR/Cas9 system. A cell pool is first constructed by transducing cells with an sgRNA library, the cells are then selected under specific conditions so that cells with the phenotype of interest are enriched, and NGS and bioinformatics analysis are finally used to identify phenotype-related target genes. Many researchers already understand the principles and procedures of CRISPR library screening but still have questions about analyzing the screening results and identifying target genes. This article walks through the CRISPR library analysis process step by step and answers the most common questions.
The raw sequencing files (raw reads) obtained from NGS contain some low-quality reads and reads with adapter sequences. To ensure the quality of the analysis, the raw reads are filtered to obtain clean reads. The quality of the sequencing data is then assessed using Q20 and Q30, the percentages of bases with a Phred quality score of at least 20 and at least 30, respectively. Typically, the sequencing data are considered qualified if Q20 > 90% and Q30 > 85% (Figure 1). Values below these thresholds indicate low sequencing quality and a high error rate, and re-sequencing is required.
Figure 1: Sequencing Data Quality Assessment
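The Q20/Q30 check can be sketched in a few lines of Python. This is a minimal illustration rather than the pipeline used for Figure 1: it assumes Phred+33 encoded FASTQ quality strings, and the function name `q_fractions` and the toy quality strings are hypothetical.

```python
def q_fractions(quality_strings, offset=33):
    """Return (Q20, Q30): fractions of bases with Phred score >= 20 and >= 30."""
    total = q20 = q30 = 0
    for qual in quality_strings:
        for ch in qual:
            score = ord(ch) - offset  # Phred+33 encoding is an assumption
            total += 1
            if score >= 20:
                q20 += 1
            if score >= 30:
                q30 += 1
    if total == 0:
        raise ValueError("no bases to assess")
    return q20 / total, q30 / total


# Toy quality strings: 'I' encodes Q40, '5' encodes Q20, '#' encodes Q2.
quals = ["IIII", "II5#"]
q20, q30 = q_fractions(quals)

# Thresholds taken from the text above.
qualified = q20 > 0.90 and q30 > 0.85
print(f"Q20={q20:.1%} Q30={q30:.1%} qualified={qualified}")
```

In practice these statistics are usually reported by the read-filtering tool itself; the sketch only shows what the two numbers mean.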
Because of factors such as sgRNA library quality and mutations introduced during NGS library construction and sequencing, some clean reads cannot be matched to any sgRNA in the library. The clean reads are therefore aligned to the sgRNA library, and only the reads that match are kept as valid data (mapped reads). To ensure the accuracy and reliability of the results, the mean sequencing depth of the mapped reads should then be evaluated; a depth of more than 300x is recommended (sequencing depth = mapped reads / number of sgRNAs).
Figure 2: sgRNA Sequencing Depth Analysis
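The mapping and depth calculation can be sketched as follows. The sketch assumes exact matching of a fixed 20-nt window of each read against the library, which is a simplification; the sequences, sgRNA names, and function names are hypothetical toy values.

```python
def count_sgrnas(reads, library, start=0, length=20):
    """Count reads whose sgRNA window exactly matches a library sequence.

    `library` maps sgRNA sequence -> sgRNA id. Reads without a match are
    treated as unmapped and ignored.
    """
    counts = {sgrna_id: 0 for sgrna_id in library.values()}
    for read in reads:
        spacer = read[start:start + length]
        sgrna_id = library.get(spacer)
        if sgrna_id is not None:
            counts[sgrna_id] += 1
    return counts


def mean_depth(counts):
    """Sequencing depth = mapped reads / number of sgRNAs."""
    return sum(counts.values()) / len(counts)


# Toy library of two sgRNAs and five reads (one read carries a mismatch).
library = {
    "ACGTACGTACGTACGTACGT": "sgA",
    "TTTTGGGGCCCCAAAATTTT": "sgB",
}
reads = [
    "ACGTACGTACGTACGTACGT",
    "ACGTACGTACGTACGTACGT",
    "ACGTACGTACGTACGTACGT",
    "TTTTGGGGCCCCAAAATTTT",
    "ACGTACGTACGTACGTACGA",  # does not match any sgRNA
]
counts = count_sgrnas(reads, library)
depth = mean_depth(counts)
print(counts, f"mean depth = {depth:.1f}x")
```

With real data the resulting depth would be compared against the recommended 300x.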
For CRISPR library screening results, the RRA (Robust Rank Aggregation) algorithm in the MAGeCK software [1,2] is typically used to compare sgRNAs between the experimental and control groups and identify differential genes. As a rank aggregation algorithm, RRA scores and ranks each gene. The lower the RRA score, the higher the gene ranks and the more likely it is to be a target gene. Both positive and negative selection are analyzed: positive screening results identify genes whose sgRNAs are significantly enriched in the experimental group, while negative screening results identify genes whose sgRNAs are significantly depleted in the experimental group.
Figure 3: Analysis Results of RRA Algorithm
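The ranking idea behind RRA can be illustrated with a simplified sketch. It assumes each sgRNA has already been converted to a normalized rank between 0 and 1 (its rank divided by the total number of sgRNAs) and computes the rho score described by Kolde et al. [1] as the minimum order-statistic probability under a uniform null. This is not MAGeCK's own implementation, which uses a modified RRA and derives p-values by permutation; the function names are hypothetical.

```python
from math import comb


def prob_at_least_k(n, k, r):
    """P(at least k of n uniform(0, 1) ranks are <= r)."""
    return sum(comb(n, j) * r**j * (1 - r) ** (n - j) for j in range(k, n + 1))


def rho_score(normalized_ranks):
    """Simplified RRA rho score for one gene (lower = more consistent ranking)."""
    ranks = sorted(normalized_ranks)
    if not ranks or ranks[0] < 0 or ranks[-1] > 1:
        raise ValueError("normalized ranks must lie in [0, 1]")
    n = len(ranks)
    # For the k-th smallest rank, ask how surprising it is to see k ranks
    # that small among n random ones, then keep the most surprising case.
    return min(prob_at_least_k(n, k, r) for k, r in enumerate(ranks, start=1))


# Four sgRNAs that all rank near the top versus four scattered sgRNAs.
rho_hit = rho_score([0.01, 0.02, 0.03, 0.04])
rho_null = rho_score([0.2, 0.5, 0.7, 0.9])
print(f"consistent gene: {rho_hit:.2e}, scattered gene: {rho_null:.4f}")
```

A gene whose sgRNAs all sit near the top of the list receives a much smaller score than a gene whose sgRNAs are scattered, which is why a lower RRA score corresponds to a higher rank.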
The identified target genes are further subjected to GSEA enrichment analysis (Figure 4) and GO enrichment analysis (Figure 5) to reveal the signaling pathways targeted by the enriched or depleted genes.
Figure 4: GSEA Enrichment Analysis
Figure 5: GO Enrichment Analysis
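One common way to test whether a GO term is over-represented among the selected genes is a one-sided hypergeometric test. The sketch below illustrates that idea only; whether the analysis behind Figure 5 uses exactly this test is an assumption, GSEA is not covered, and the function name and toy numbers are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(hits_in_term, n_hits, term_size, n_background):
    """P(X >= hits_in_term) when n_hits genes are drawn without replacement
    from n_background genes, term_size of which are annotated to the term."""
    upper = min(n_hits, term_size)
    total = comb(n_background, n_hits)
    tail = sum(
        comb(term_size, k) * comb(n_background - term_size, n_hits - k)
        for k in range(hits_in_term, upper + 1)
    )
    return tail / total


# Toy numbers: 20 background genes, 5 annotated to the term,
# 4 candidate genes selected, 3 of which carry the annotation.
p_enriched = hypergeom_enrichment_p(3, 4, 5, 20)
print(f"enrichment p-value = {p_enriched:.4f}")
```

Real analyses test many terms at once, so the resulting p-values would also need multiple-testing correction.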
As a large-scale gene screening method, CRISPR libraries inevitably produce some false-positive results. Therefore, during the target gene screening process, it is recommended to select multiple genes as candidate genes and verify them through downstream experiments.
Figure 6: RRA algorithm ranking screens target gene Cop1 [3]
As mentioned earlier, CRISPR library screening results are usually analyzed using the RRA algorithm. The higher a gene ranks, the more likely it is to be a target gene. If no single target gene stands out, the top 20 or 30 genes can be selected as candidates and verified through downstream gene knockout or overexpression experiments. For example, Wang et al. identified the target gene Cop1 through RRA algorithm ranking [3].
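Selecting the top-ranked genes from a gene-level summary table could look like the sketch below. The tab-delimited layout and the column names (`id`, `pos|score`, `pos|rank`) are assumptions modelled on MAGeCK's gene summary output and should be checked against the actual file; the gene names and values are toy data.

```python
import csv
import io

# Toy stand-in for a tab-delimited gene summary table (assumed layout).
SUMMARY = """id\tpos|score\tpos|rank
GeneA\t0.20\t3
GeneB\t0.0001\t1
GeneC\t0.90\t4
GeneD\t0.003\t2
"""


def top_candidates(handle, n, rank_col="pos|rank"):
    """Return the ids of the n best-ranked genes (rank 1 = strongest)."""
    rows = list(csv.DictReader(handle, delimiter="\t"))
    rows.sort(key=lambda row: int(row[rank_col]))
    return [row["id"] for row in rows[:n]]


# With real data, n would be 20 or 30 as suggested above.
candidates = top_candidates(io.StringIO(SUMMARY), n=2)
print(candidates)
```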
First, note that FDR, q-value, and adjusted p-value are used interchangeably here. The p-value is the probability of observing a difference between the experimental and control groups at least as large as the one measured for a gene if there were in fact no real difference. FDR is the false discovery rate, i.e., the expected proportion of false discoveries among all findings. Simply put, a p-value < 0.05 means that a difference this large would arise by chance less than 5% of the time for a gene with no real effect, and selecting genes at FDR < 0.05 means that fewer than 5% of the selected genes are expected to be false positives.
Typically, genes that pass FDR < 0.05 are more likely to be true target genes. However, because a library targets a very large number of genes, a single gene's p-value usually has to be extremely small, on the order of 10^-6 to 10^-7 for a genome-wide library, to reach FDR < 0.05. Screening on FDR alone therefore often misses many true positive genes. For this reason, most library screens use the p-value rather than FDR to shortlist target genes.
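The effect of multiple-testing correction can be seen with a small sketch of the Benjamini-Hochberg procedure, one common way to compute adjusted p-values. Whether a given pipeline uses exactly this procedure is an assumption, and the 20,000-gene background of evenly spread p-values is a toy construction, not real screening data.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Toy library of 20,000 genes: one gene of interest plus a background of
# evenly spread p-values that mimics genes with no real effect.
m = 20000
background = [i / m for i in range(2, m + 1)]

fdr_weak = benjamini_hochberg([1e-4] + background)[0]
fdr_strong = benjamini_hochberg([1e-6] + background)[0]
print(f"p=1e-4 -> FDR={fdr_weak:.3f}; p=1e-6 -> FDR={fdr_strong:.3f}")
```

In this toy setting a gene with p = 1e-4 is nowhere near FDR < 0.05, while p = 1e-6 passes. The exact cutoff in a real screen depends on the number of genes and on how many other genes have small p-values.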
LFC (log2 fold change) is the base-2 logarithm of the ratio of sgRNA abundance, measured as normalized read counts, between the experimental and control groups. LFC = 1 means that the sgRNAs targeting a gene are twice as abundant in the experimental group as in the control group, LFC = 2 means four times as abundant, and so on. Negative values indicate depletion: LFC = -2 means that the abundance in the experimental group is one quarter of that in the control group.
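As a worked example, LFC can be computed from read counts as shown below. The sketch assumes the counts are already normalized between samples and adds a pseudocount of 1 to avoid division by zero; both the pseudocount and the function name are illustrative choices, not something the analysis above prescribes.

```python
from math import log2


def log2_fold_change(treatment_count, control_count, pseudocount=1.0):
    """log2 of the treatment/control abundance ratio, with a pseudocount."""
    return log2((treatment_count + pseudocount) / (control_count + pseudocount))


# Counts chosen so that the ratios (after the pseudocount) are exactly 2, 4 and 1/4.
lfc_doubled = log2_fold_change(199, 99)    # 200 / 100 = 2
lfc_fourfold = log2_fold_change(399, 99)   # 400 / 100 = 4
lfc_depleted = log2_fold_change(24, 99)    # 25 / 100 = 1/4
print(lfc_doubled, lfc_fourfold, lfc_depleted)
```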
In addition to the ranking method described above, researchers can also combine p-value and LFC to screen for potential target genes. For example, Deng et al. identified the target gene CDC7 using the conditions p < 0.01 and LFC ≤ -2 [4].
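A combined p-value and LFC filter of this kind can be sketched as follows. The record layout, field names, and gene entries are hypothetical toy values; only the two cutoffs come from the example in the text.

```python
def select_depleted(genes, p_cutoff=0.01, lfc_cutoff=-2.0):
    """Return ids of genes with p < p_cutoff and LFC <= lfc_cutoff."""
    return [
        gene["id"]
        for gene in genes
        if gene["p_value"] < p_cutoff and gene["lfc"] <= lfc_cutoff
    ]


# Toy gene-level results (not real screening data).
genes = [
    {"id": "GeneA", "p_value": 0.001, "lfc": -2.5},  # passes both cutoffs
    {"id": "GeneB", "p_value": 0.001, "lfc": -1.0},  # effect too small
    {"id": "GeneC", "p_value": 0.200, "lfc": -3.0},  # not significant
    {"id": "GeneD", "p_value": 0.009, "lfc": -2.0},  # passes (LFC equals cutoff)
]
selected = select_depleted(genes)
print(selected)
```

For a positive screen the LFC condition would be reversed, for example LFC ≥ 2.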
Figure 7: Screening Target Gene CDC7 [4] by Combining p-value and LFC
Conclusion
We hope that today's introduction to the CRISPR library analysis process will help eliminate some of the doubts you may have regarding the analysis of screening results and the identification of target genes. If you have any further questions during the actual operation, feel free to communicate with us at any time.
References
[1] Kolde R, Laur S, Adler P, Vilo J. Robust rank aggregation for gene list integration and meta-analysis. Bioinformatics. 2012 Feb 15;28(4):573-80.
[2] Li W, Xu H, Xiao T, Cong L, Love MI, Zhang F, Irizarry RA, Liu JS, Brown M, Liu XS. MAGeCK enables robust identification of essential genes from genome-scale CRISPR/Cas9 knockout screens. Genome Biol. 2014;15(12):554.
[3] Wang X, Tokheim C, Gu SS, Wang B, Tang Q, Li Y, Traugh N, Zeng Z, Zhang Y, Li Z, Zhang B, Fu J, Xiao T, Li W, Meyer CA, Chu J, Jiang P, Cejas P, Lim K, Long H, Brown M, Liu XS. In vivo CRISPR screens identify the E3 ligase Cop1 as a modulator of macrophage infiltration and cancer immunotherapy target. Cell. 2021 Oct 14;184(21):5357-5374.e22.
[4] Deng L, Yang L, Zhu S, Li M, Wang Y, Cao X, Wang Q, Guo L. Identifying CDC7 as a synergistic target of chemotherapy in resistant small-cell lung cancer via CRISPR/Cas9 screening. Cell Death Discov. 2023 Feb 2;9(1):40.