Nice work! I'd like to do some research based on your code. Could you provide more details about training, for example the text encoder (tokenizer, max_token_length)?
I've set the parameters as you specified in Tab. 8 of your paper, but I get really bad results. I suspect this may be due to the text-encoder setup: the default CLIP text encoder uses a vocabulary of ~48,000 tokens and a max length of 77, which does not seem well suited to medical-image text. It would help the community reproduce your results if you could provide more details about the training process. Thank you very much!
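To illustrate the concern, here is a minimal sketch of what a 77-token cap does to a longer report. The sample report text is made up, and the whitespace split is only a crude stand-in for CLIP's BPE tokenizer (real BPE token counts are usually higher than word counts, so truncation would be even more aggressive):

```python
MAX_TOKEN_LENGTH = 77  # CLIP's default text context length

# A hypothetical radiology report, longer than a typical image caption.
report = (
    "FINDINGS: The cardiomediastinal silhouette is within normal limits. "
    "There is a 1.2 cm spiculated nodule in the right upper lobe, "
    "suspicious for primary pulmonary malignancy. No pleural effusion "
    "or pneumothorax. Degenerative changes of the thoracic spine. "
    "The lungs are otherwise clear. Heart size is normal. The aorta is "
    "tortuous and calcified. Visualized osseous structures demonstrate "
    "no acute abnormality. There is mild bibasilar atelectasis without "
    "focal consolidation. IMPRESSION: Right upper lobe nodule; CT chest "
    "recommended for further characterization. Comparison with prior "
    "imaging advised."
)

# Whitespace split as a stand-in for the real tokenizer.
tokens = report.split()

# Anything past the context limit is silently truncated, so the tail of
# the report (here, part of the IMPRESSION) never reaches the encoder.
kept = tokens[:MAX_TOKEN_LENGTH]
dropped = tokens[MAX_TOKEN_LENGTH:]

print(f"total words: {len(tokens)}")
print(f"kept: {len(kept)}, dropped: {len(dropped)}")
```

With the real CLIP BPE tokenizer (e.g. Hugging Face `CLIPTokenizer` with `truncation=True, max_length=77`), rare medical terms are split into several subword pieces, so even more of the report would be cut off.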