How can the effectiveness of DeepResearch be evaluated, and are there any relevant evaluation methods and metrics?