-
Dear Developer,

I am currently using CAFE5 to analyze gene family evolution and have encountered high failure rates for many gene families when running the model with k=2. I would greatly appreciate your guidance on how to address this problem. Relevant data files can be obtained: Here are the details of my analysis:
I have a few questions:
Any suggestions on troubleshooting this issue or optimizing CAFE5 for my dataset would be incredibly helpful. I'd be happy to provide additional details, such as the input files or full log, if needed.

Thank you for your time and support!

Best regards,
-
First of all, make sure you understand the meaning of the K parameter. See #184 for an explanation. With k=2, CAFE will attempt to approximate a gamma curve using one low and one high value. The failures you're seeing indicate that the high value that was chosen failed due to saturation (the rate of evolution was too great to be evaluated). CAFE will assume that the calculated values for that particular family lie near the lower value instead.

You could play with the optimizer settings, but we've never identified any changes to the optimizer that make a significant difference in the final results. It might be interesting to hold lambda constant and see how alpha gets optimized with a K of 4 or 5, but again I'd expect that the results would be about the same.

In sum, I think your analysis is fine. If you want to try to improve it, removing families with high variance will often help.
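To make the "one low and one high value" idea concrete, here is a minimal sketch of a textbook equal-probability discretization of a mean-1 gamma distribution into K rate categories. This is an illustration only: the thread does not confirm that CAFE5 discretizes the gamma curve this way, and the function names (`reg_lower_gamma`, `gamma_quantile`, `discrete_gamma_rates`) are hypothetical, not part of CAFE5.

```python
import math


def reg_lower_gamma(a, x):
    """Regularized lower incomplete gamma P(a, x), computed from its power series."""
    if x <= 0.0:
        return 0.0
    term = 1.0 / a
    total = term
    n = 0
    while term > 1e-16 * total:
        n += 1
        term *= x / (a + n)
        total += term
    return total * math.exp(a * math.log(x) - x - math.lgamma(a))


def gamma_quantile(p, alpha):
    """Quantile of a Gamma(shape=alpha, rate=alpha) variable (mean 1), by bisection."""
    lo, hi = 0.0, 1.0
    while reg_lower_gamma(alpha, alpha * hi) < p:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if reg_lower_gamma(alpha, alpha * mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def discrete_gamma_rates(alpha, k):
    """Mean rate multiplier within each of k equal-probability gamma categories."""
    cuts = [0.0] + [gamma_quantile(i / k, alpha) for i in range(1, k)]
    # For a mean-1 gamma, E[X; X <= c] = P(alpha + 1, alpha * c).
    partial_means = [reg_lower_gamma(alpha + 1.0, alpha * c) for c in cuts] + [1.0]
    return [k * (partial_means[i + 1] - partial_means[i]) for i in range(k)]


# Assumed alpha values, chosen only to show how the two multipliers move.
for alpha in (0.5, 1.0, 2.0):
    print(alpha, [round(r, 3) for r in discrete_gamma_rates(alpha, 2)])
```

Under this assumed scheme the multipliers always average to 1, so a smaller alpha pushes the high category further above 1 while the low category shrinks toward 0. The effective rate for a category would then be lambda times its multiplier, which is one way a single high category could end up in the saturated regime described above.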
-
Subject: Follow-Up on High Proportion of Unconverged Families in CAFE5 Analysis

Dear Ben,

Thank you for your prompt and insightful response to my query regarding the high proportion of unconverged gene families in my CAFE5 analysis. Your explanation of the k parameter and the role of saturation due to high evolutionary rates has been immensely helpful in understanding the issue.

As per your suggestion, I have conducted runs with k=4 and k=5, keeping other parameters consistent. As you anticipated, the proportion of unconverged families remained approximately 60% (consistent with my previous runs using k=2 and k=3), confirming that increasing model complexity does not mitigate the saturation issue. The log files are available here: https://drive.google.com/drive/folders/1f0hmznlqgDAUlOAzFnv-ZXe572KZDPRn?usp=sharing

I have two questions I'd like to seek your advice on:

Question 1: I attempted to improve convergence by filtering out gene families with copy numbers exceeding 100 in any single species, but this did not significantly reduce the failure rate, suggesting that high variance or other data characteristics persist. However, I noticed that unconverged families do not always exhibit large copy number differences across species (e.g., for one family, the maximum copy number is 83 and the minimum is 8; for OG0000024: 75, 47, 8, 34, 83, 31, 18, 39, 56, 23, 26). I am puzzled why families without apparent high-variance copy numbers still lead to saturation and convergence failures. Could you provide insight into this?

Question 2: Given that approximately 8893 out of 15911 families are unconverged, I am concerned that removing all these families could result in significant information loss, potentially excluding biologically meaningful signals. Your reassurance that the analysis is "fine" is encouraging, but I'd like to confirm my understanding: can I conclude that unconverged families, caused by saturation from high evolutionary rates, do not substantially impact the final inferences of gene family expansion and contraction? If so, is it appropriate to select the run with the lowest -lnL (highest likelihood) for downstream analyses, provided I validate the biological consistency of the results?

Thank you again for your guidance. I greatly appreciate your time and expertise and look forward to any further clarification you can provide.

Best regards,
Jiahe
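As a quick check on the spread in the OG0000024 counts quoted above, the following sketch computes simple per-family statistics and applies the per-species cap of 100 described in Question 1. The helper names (`summarize`, `passes_cap`) are hypothetical, and the choice of sample variance as the spread measure is an assumption for illustration, not a CAFE5 criterion.

```python
import statistics

# Copy numbers quoted in the thread for OG0000024, one value per species.
og0000024 = [75, 47, 8, 34, 83, 31, 18, 39, 56, 23, 26]


def summarize(counts):
    """Return simple spread statistics for one family's per-species counts."""
    return {
        "min": min(counts),
        "max": max(counts),
        "range": max(counts) - min(counts),
        "mean": statistics.mean(counts),
        "variance": statistics.variance(counts),  # sample variance (n - 1)
    }


def passes_cap(counts, cap=100):
    """True if no species exceeds the cap, mirroring the >100 filter described above."""
    return max(counts) <= cap


summary = summarize(og0000024)
print(summary)
print("passes cap of 100:", passes_cap(og0000024))
```

For these counts the family passes the 100-copy cap even though the range is 75 around a mean of 40, so a per-species cap by itself does not screen for spread across species. Whether that spread is what drives the failure for this family is not established in the thread.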