Subject: Assistance with CAFE5: High Failure Rates for Gene Families with k=2 #229

Jiahe-Sun · 2025-04-16T09:44:48Z

Dear Developer,

I am currently using CAFE5 to analyze gene family evolution and have encountered an issue with high failure rates for many gene families when running the model with k=2. I would greatly appreciate your guidance on how to address this problem.

The relevant data files are available here:
https://drive.google.com/drive/folders/1f0hmznlqgDAUlOAzFnv-ZXe572KZDPRn?usp=sharing

Here are the details of my analysis:

Input Data: My dataset includes 15,819 gene families, filtered to 8,889 families present at the root. The species tree is provided in Newick format with branch lengths.

Model Settings: I ran CAFE5 with the following command:

cafe5 -i gene_family_filter.txt -t tree.txt -p -k 2 -o k2p

Output Summary: The optimization completed 58 iterations, yielding a likelihood of -lnL=107587.81832123 with parameters lambda=0.0015195674385245 and alpha=0.67584765310652. The log reports "160 values were attempted (0% rejected)," indicating a stable optimization process.
Issue: Over 6,000 families (e.g., OG0000024) are flagged with "61 failures" and failure rates >20%, suggesting they did not converge. This affects approximately 67.5% of the families, which limits the reliability of the results.

I have a few questions:

1. What might cause such a high proportion of gene families to fail convergence with k=2? Could it be related to small family sizes, noisy data, or issues with the species tree?
2. The log indicates "61 failures" for each non-converged family. Does this number reflect a specific limit or condition in CAFE5’s optimization process?
3. Would you recommend adjusting parameters (e.g., increasing iterations, modifying Nelder-Mead settings, or setting initial lambda/alpha values) to improve convergence? I’ve tried k=1 (no failures but higher -lnL=123904.95640715) and k=3 (similar failure rates, -lnL=106905.16948757).
4. Are there best practices for preprocessing data to reduce failures, such as filtering families with low copy numbers or high variance?

Any suggestions on troubleshooting this issue or optimizing CAFE5 for my dataset would be incredibly helpful. I’d be happy to provide additional details, such as the input files or full log, if needed.

Thank you for your time and support!

Best regards,
Jiahe

benfulton · 2025-04-16T18:36:54Z

Maintainer

First of all, make sure you understand the meaning of the K parameter. See #184 for an explanation. With k=2, CAFE will attempt to approximate a gamma curve using one low and one high value. The failures you're seeing indicate that the high value that was chosen failed due to saturation (the rate of evolution was too great to be evaluated). CAFE will assume that the calculated values for that particular family lie near the lower value instead.
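To make the "one low and one high value" idea concrete, here is a minimal sketch that approximates a mean-1 gamma distribution with k equal-probability rate categories, using the lambda and alpha reported in the question. It is only an illustration: the Monte Carlo binning and the function name `gamma_category_multipliers` are assumptions for this example, and CAFE5's own discretization may differ.

```python
import random


def gamma_category_multipliers(alpha, k, n_samples=200_000, seed=0):
    """Approximate the mean rate multiplier of k equal-probability gamma categories.

    Draws from Gamma(shape=alpha, scale=1/alpha), which has mean 1, sorts the
    draws, splits them into k equal-sized bins, and averages each bin.
    This is a Monte Carlo stand-in, not CAFE5's actual procedure.
    """
    rng = random.Random(seed)
    draws = sorted(rng.gammavariate(alpha, 1.0 / alpha) for _ in range(n_samples))
    bin_size = n_samples // k
    multipliers = []
    for i in range(k):
        start = i * bin_size
        # The last bin absorbs any remainder when n_samples is not divisible by k.
        end = n_samples if i == k - 1 else start + bin_size
        chunk = draws[start:end]
        multipliers.append(sum(chunk) / len(chunk))
    return multipliers


# Values reported in the question above.
LAMBDA = 0.0015195674385245
ALPHA = 0.67584765310652

multipliers = gamma_category_multipliers(ALPHA, k=2)
category_lambdas = [LAMBDA * m for m in multipliers]
for m, lam in zip(multipliers, category_lambdas):
    print(f"multiplier={m:.3f}  lambda={lam:.6f}")
```

With alpha below 1 the distribution is strongly skewed, so the high category sits well above the fitted lambda. Under this sketch's assumptions, that is consistent with the high value being the one that saturates.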

You could play with the optimizer settings, but we've never identified any changes to the optimizer that make a significant difference in the final results. It might be interesting to hold lambda constant and see how alpha gets optimized with a K of 4 or 5, but again I'd expect that the results would be about the same.

In sum, I think your analysis is fine. If you want to try to improve it, removing families with high variance will often help.
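One way to try the variance filter is sketched below. It assumes a tab-delimited count table with a header row, a description column, a family ID column, and then one integer count per species. The function name `filter_families`, the species labels `sp1` to `sp11`, the thresholds, and the family `OG_EXAMPLE` are all hypothetical. The OG0000024 counts are the ones quoted in this thread.

```python
import statistics


def filter_families(lines, max_copies=100, max_variance=None):
    """Split a tab-delimited count table into kept and dropped rows.

    Assumed layout: a header row, then rows of description, family ID,
    and one integer count per species.
    """
    header, *rows = [ln.rstrip("\n") for ln in lines if ln.strip()]
    kept, dropped = [], []
    for row in rows:
        fields = row.split("\t")
        counts = [int(c) for c in fields[2:]]
        too_large = max(counts) > max_copies
        too_variable = (
            max_variance is not None
            and statistics.pvariance(counts) > max_variance
        )
        (dropped if too_large or too_variable else kept).append(row)
    return header, kept, dropped


species = [f"sp{i}" for i in range(1, 12)]  # hypothetical species labels
table = [
    "\t".join(["Desc", "Family ID", *species]),
    "\t".join(["(null)", "OG0000024", *"75 47 8 34 83 31 18 39 56 23 26".split()]),
    "\t".join(["(null)", "OG_EXAMPLE", *"150 2 1 3 2 1 2 4 1 2 3".split()]),  # made-up family
]

# A copy-number cutoff alone keeps OG0000024 (its maximum is 83).
_, kept_by_size, dropped_by_size = filter_families(table, max_copies=100)

# Adding a variance cutoff (400 is an arbitrary illustrative value) drops it.
_, kept_by_var, dropped_by_var = filter_families(table, max_copies=100, max_variance=400)
```

The variance threshold is a judgment call. It is worth comparing results at a few cutoffs rather than treating any single value as correct.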

Jiahe-Sun · 2025-04-18T02:00:31Z

Author

Subject: Follow-Up on High Proportion of Unconverged Families in CAFE5 Analysis

Dear Ben,

Thank you for your prompt and insightful response to my query regarding the high proportion of unconverged gene families in my CAFE5 analysis. Your explanation of the k parameter and the role of saturation due to high evolutionary rates has been immensely helpful in understanding the issue.

As per your suggestion, I have conducted runs with k=4 and k=5, keeping other parameters consistent. As you anticipated, the proportion of unconverged families remained approximately 60% (consistent with my previous runs using k=2 and k=3), confirming that increasing model complexity does not mitigate the saturation issue. The log files are available here: https://drive.google.com/drive/folders/1f0hmznlqgDAUlOAzFnv-ZXe572KZDPRn?usp=sharing

I have two questions I’d like to seek your advice on:

Question 1: I attempted to improve convergence by filtering out gene families with copy numbers exceeding 100 in any single species, but this did not significantly reduce the failure rate, suggesting that high variance or other data characteristics persist. However, I noticed that unconverged families do not always exhibit large copy number differences across species (e.g., for one family, the maximum copy number is 83 and the minimum is 8; for OG0000024: 75, 47, 8, 34, 83, 31, 18, 39, 56, 23, 26). I am puzzled why families without apparent high-variance copy numbers still lead to saturation and convergence failures. Could you provide insight into this?

Question 2: Given that approximately 8893 out of 15911 families are unconverged, I am concerned that removing all these families could result in significant information loss, potentially excluding biologically meaningful signals. Your reassurance that the analysis is “fine” is encouraging, but I’d like to confirm my understanding: can I conclude that unconverged families, caused by saturation from high evolutionary rates, do not substantially impact the final inferences of gene family expansion and contraction? If so, is it appropriate to select the run with the lowest -lnL (highest likelihood) for downstream analyses, provided I validate the biological consistency of the results?

Thank you again for your guidance. I greatly appreciate your time and expertise and look forward to any further clarification you can provide.

Best regards,

Jiahe

benfulton Apr 18, 2025
Maintainer

The failure rate message does not indicate that the family did not converge (and I'm not really even sure what you mean by that). As CAFE evaluates a given lambda value and gamma distribution, it assigns probabilities to various points along that distribution. If a probability cannot be calculated, it will simply be discarded and the most likely probability from the remaining points will be assumed. A failure will be more related to the particular gamma distribution than the variance in the family. If no final value can be calculated for all of the families, you should consider dropping the higher-variance families.
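The behaviour described above can be pictured with a toy example: evaluate each point, discard any whose probability could not be computed, and keep the most likely of the rest. This is not CAFE5's source code. The function name `best_surviving_category`, the use of `None` to mark a failed evaluation, and the numbers are all assumptions made for illustration.

```python
import math


def best_surviving_category(log_likelihoods):
    """Return (index, value, failure_rate), ignoring points that failed.

    A failed evaluation is represented here by None or a non-finite number.
    """
    surviving = [
        (i, ll)
        for i, ll in enumerate(log_likelihoods)
        if ll is not None and math.isfinite(ll)
    ]
    failure_rate = 1 - len(surviving) / len(log_likelihoods)
    if not surviving:
        # Nothing could be calculated for this family.
        return None, None, failure_rate
    index, value = max(surviving, key=lambda pair: pair[1])
    return index, value, failure_rate


# Hypothetical k=2 family: the low point evaluates, the high point saturates.
index, value, rate = best_surviving_category([-12.4, None])
print(index, value, rate)
```

In this picture a family with a reported failure still gets a usable value from the surviving point. Only the case where every point fails leaves the family without a result.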

Jiahe-Sun Apr 19, 2025
Author

Thank you for helping me understand CAFE5 better!
