Commit b35420c

Let the LLM use its own default generate params when --llm_temperature and --llm_max_tokens are 0.
1 parent 6dc8898 commit b35420c

File tree

7 files changed: +123 -27 lines changed

CHANGLOG.md

Lines changed: 13 additions & 0 deletions

@@ -0,0 +1,13 @@
+### NEW
+
+1. Add Mini-CPM V2.6 support.
+2. Add Florence-2 support.
+
+### CHANGE
+
+1. The GUI now uses Gradio 5.
+2. The LLM now uses its own default generate params when `--llm_temperature` and `--llm_max_tokens` are 0.
+
+### BUG FIX
+
+1. Fix minor bugs.

README.md

Lines changed: 30 additions & 6 deletions

@@ -2,8 +2,8 @@
 
 A Python base cli tool for caption images
 with [WD series](https://huggingface.co/SmilingWolf), [joy-caption-pre-alpha](https://huggingface.co/spaces/fancyfeast/joy-caption-pre-alpha), [LLama3.2 Vision Instruct](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct),
-[Qwen2 VL Instruct](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct)
-and [Mini-CPM V2.6](https://huggingface.co/openbmb/MiniCPM-V-2_6) models.
+[Qwen2 VL Instruct](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct), [Mini-CPM V2.6](https://huggingface.co/openbmb/MiniCPM-V-2_6)
+and [Florence-2](https://huggingface.co/microsoft/Florence-2-large) models.
 
 ## Introduce
 
@@ -12,6 +12,9 @@ This tool can make a caption with danbooru style tags or a nature language descr
 
 ### New Changes:
 
+#### 2024.10.13: Add Florence-2 support. The LLM now uses its own default generate params when
+`--llm_temperature` and `--llm_max_tokens` are 0.
+
 #### 2024.10.11: GUI using Gradio 5 now. Add Mini-CPM V2.6 Support.
 
 #### 2024.10.09: Build in wheel, now you install this repo from pypi.
@@ -41,7 +44,9 @@ wd-llm-caption-gui
 
 <img alt="DEMO_her.jpg" src="DEMO/DEMO_her.jpg" width="300" height="400"/>
 
-#### WD Caption
+### Standalone Inference
+
+#### WD Tags
 
 Use wd-eva02-large-tagger-v3
 
@@ -87,6 +92,16 @@ Default Mini-CPM V2.6 7B, no quantization
 The image depicts a humanoid robot with a human-like appearance, standing on a balcony railing at night. The robot has a sleek, white and black body with visible mechanical joints and components, suggesting advanced technology. Its pose is confident, with one hand resting on the railing and the other hanging by its side. The robot has long, straight, platinum blonde hair that falls over its shoulders. The background features a cityscape with illuminated buildings and a prominent tower, suggesting an urban setting. The lighting is dramatic, highlighting the robot against the darker backdrop of the night sky. The overall atmosphere is one of futuristic sophistication.
 ```
 
+#### Florence 2 large
+
+Default Florence 2 large, no quantization
+
+```text
+The image is a promotional poster for an AIGC work by DukeG. It features a young woman with long blonde hair, standing on a rooftop with a city skyline in the background. She is wearing a futuristic-looking outfit with a white and black color scheme. The outfit has a high neckline and long sleeves, and the woman is posing with one hand on her hip and the other resting on the railing. The text on the poster reads "Publish on 2024.07.30" and "Generated by Stable Diffusion" with the text "Tuned by Adobe Photoshop".
+```
+
+### WD+LLM Inference
+
 #### Joy Caption with WD
 
 Use wd-eva02-large-tagger-v3 and LLama3.1 8B, no quantization.
@@ -180,6 +195,15 @@ place).
 |:-------------:|:------------------------------------------------------------:|:--------------------------------------------------------------------:|
 | MiniCPM-V-2_6 | [Hugging Face](https://huggingface.co/openbmb/MiniCPM-V-2_6) | [ModelScope](https://www.modelscope.cn/models/OpenBMB/MiniCPM-V-2_6) |
 
+### Florence-2 models
+
+| Model               | Hugging Face Link                                                     | ModelScope Link                                                                   |
+|:-------------------:|:---------------------------------------------------------------------:|:----------------------------------------------------------------------------------:|
+| Florence-2-large    | [Hugging Face](https://huggingface.co/microsoft/Florence-2-large)    | [ModelScope](https://www.modelscope.cn/models/AI-ModelScope/Florence-2-large)    |
+| Florence-2-base     | [Hugging Face](https://huggingface.co/microsoft/Florence-2-base)     | [ModelScope](https://www.modelscope.cn/models/AI-ModelScope/Florence-2-base)     |
+| Florence-2-large-ft | [Hugging Face](https://huggingface.co/microsoft/Florence-2-large-ft) | [ModelScope](https://www.modelscope.cn/models/AI-ModelScope/Florence-2-large-ft) |
+| Florence-2-base-ft  | [Hugging Face](https://huggingface.co/microsoft/Florence-2-base-ft)  | [ModelScope](https://www.modelscope.cn/models/AI-ModelScope/Florence-2-base-ft)  |
+
 ## Installation
 
 Python 3.10 works fine.
@@ -437,7 +461,7 @@ e.g., `character_name_(series)` will be expanded to `character_name, series`.
 
 `--llm_choice`
 
-select llm models[`joy`, `llama`, `qwen`, `minicpm`], default is `llama`.
+select llm models [`joy`, `llama`, `qwen`, `minicpm`, `florence`], default is `llama`.
 
 `--llm_config`
 
@@ -481,11 +505,11 @@ user prompt for caption.
 
 `--llm_temperature`
 
-temperature for joy LLM model, default is `0.5`.
+temperature for the LLM model, default is `0`, which means the LLM uses its own default value.
 
 `--llm_max_tokens`
 
-max tokens for joy LLM model output, default is `300`.
+max tokens for the LLM model output, default is `0`, which means the LLM uses its own default value.
 
 </details>
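The two flags above share one convention: `0` is a sentinel meaning "not set". A minimal sketch of that semantic, using a hypothetical `resolve_or_default` helper that is not part of this repo:

```python
# Hypothetical helper illustrating the documented `0` sentinel; not repo code.
def resolve_or_default(cli_value: float, model_default: float) -> float:
    """Return the model's own default when the CLI value is the 0 sentinel."""
    return model_default if cli_value == 0 else cli_value

# --llm_temperature 0 falls back to whatever the chosen LLM defaults to,
# while any explicit non-zero value passes through unchanged.
assert resolve_or_default(0, 0.5) == 0.5
assert resolve_or_default(0.9, 0.5) == 0.9
```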

VERSION

Lines changed: 1 addition & 1 deletion

@@ -1 +1 @@
-v0.1.2-alpha
+v0.1.3-alpha

pyproject.toml

Lines changed: 3 additions & 3 deletions

@@ -50,13 +50,13 @@ dynamic = ["version"]
 authors = [
     { name = "DukeG", email = "fireicewolf@gmail.com" },
 ]
-description = "A Python base cli tool for caption images with WD series, Joy-caption-pre-alpha, meta Llama 3.2 Vision Instruct, Qwen2 VL Instruct and Mini-CPM V2.6 models."
+description = "A Python base cli tool for caption images with WD series, Joy-caption-pre-alpha, meta Llama 3.2 Vision Instruct, Qwen2 VL Instruct, Mini-CPM V2.6 and Florence-2 models."
 readme = "README.md"
-keywords = ["image-caption", "WD", "Llama 3.2 Vision Instruct", "Qwen2 VL Instruct", "Mini-CPM V2.6", "Joy Caption Alpha"]
+keywords = ["Image Caption", "WD", "Llama 3.2 Vision Instruct", "Joy Caption Alpha", "Qwen2 VL Instruct", "Mini-CPM V2.6", "Florence-2"]
 license = { file = 'LICENSE' }
 requires-python = ">=3.10"
 classifiers = [
-    "Development Status :: 5 - Production/Stable",
+    "Development Status :: 3 - Alpha",
     "Intended Audience :: Developers",
     "Intended Audience :: Science/Research",
     "License :: OSI Approved :: Apache Software License",

wd_llm_caption/caption.py

Lines changed: 7 additions & 5 deletions

@@ -377,14 +377,16 @@ def run_inference(
                 pbar.set_description('Processing with Qwen model...')
             elif self.use_minicpm:
                 pbar.set_description('Processing with Mini-CPM model...')
+            elif self.use_florence:
+                pbar.set_description('Processing with Florence model...')
             self.my_llm.inference()
             pbar.update(1)
 
         pbar.close()
     else:
         if self.use_wd:
             self.my_tagger.inference()
-        elif self.use_joy or self.use_llama or self.use_qwen:
+        elif self.use_joy or self.use_llama or self.use_qwen or self.use_minicpm or self.use_florence:
             self.my_llm.inference()
 
     total_inference_time = time.monotonic() - start_inference_time
@@ -695,14 +697,14 @@ def setup_args() -> argparse.Namespace:
     llm_args.add_argument(
         '--llm_temperature',
         type=float,
-        default=0.5,
-        help='temperature for LLM model, default is `0.5`.'
+        default=0,
+        help='temperature for LLM model, default is `0`, which means the LLM uses its own default value.'
     )
     llm_args.add_argument(
         '--llm_max_tokens',
         type=int,
-        default=300,
-        help='max tokens for LLM model output, default is `300`.'
+        default=0,
+        help='max tokens for LLM model output, default is `0`, which means the LLM uses its own default value.'
     )
 
     gradio_args = args.add_argument_group("Gradio dummy args, no effects")

wd_llm_caption/gui.py

Lines changed: 5 additions & 3 deletions

@@ -41,6 +41,8 @@ def gui_setup_args():
     parser.add_argument('--inbrowser', action='store_true', help="auto open in browser")
     parser.add_argument('--log_level', type=str, choices=['DEBUG', 'INFO', 'WARNING', 'ERROR', 'CRITICAL'],
                         default='INFO', help="set log level, default is `INFO`")
+    parser.add_argument('--models_save_path', type=str, default=caption.DEFAULT_MODELS_SAVE_PATH,
+                        help='path to save models, default is `models`.')
 
     return parser.parse_args()
 
@@ -165,9 +167,9 @@ def llm_choice_visibility(caption_method_radio):
                                      value=caption.DEFAULT_USER_PROMPT_WITH_WD)
 
         llm_temperature = gr.Slider(label="temperature for LLM model",
-                                    minimum=0.1, maximum=1.0, value=0.5, step=0.1)
+                                    minimum=0, maximum=1.0, value=0, step=0.1)
         llm_max_tokens = gr.Slider(label="max token for LLM model",
-                                   minimum=1, maximum=1024, value=300, step=1)
+                                   minimum=0, maximum=2048, value=0, step=1)
 
         with gr.Group():
             gr.Markdown("<center>Common Settings</center>")
@@ -422,6 +424,7 @@ def caption_models_load(
         os.environ["HF_TOKEN"] = str(huggingface_token_value)
 
     get_gradio_args = gui_setup_args()
+    args.models_save_path = str(get_gradio_args.models_save_path)
    args.log_level = str(get_gradio_args.log_level)
     args.caption_method = str(caption_method_value).lower()
     args.llm_choice = str(llm_choice_value).lower()
@@ -450,7 +453,6 @@ def caption_models_load(
         CAPTION_FN.set_logger(args)
 
     caption_init = CAPTION_FN
-
     args.wd_force_use_cpu = bool(wd_force_use_cpu_value)
 
     args.llm_use_cpu = bool(llm_use_cpu_value)
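For reference, a standalone Gradio snippet mirroring the two sliders above, where `value=0` stands for "use the model's own default". This is a sketch under that assumption, not the repo's full GUI:

```python
import gradio as gr

# Sliders now start at the 0 sentinel instead of forcing a concrete value.
with gr.Blocks() as demo:
    llm_temperature = gr.Slider(label="temperature for LLM model",
                                minimum=0, maximum=1.0, value=0, step=0.1)
    llm_max_tokens = gr.Slider(label="max token for LLM model",
                               minimum=0, maximum=2048, value=0, step=1)

# demo.launch()  # uncomment to serve the two sliders locally
```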

wd_llm_caption/utils/inference.py

Lines changed: 64 additions & 9 deletions

@@ -298,8 +298,8 @@ def get_caption(
             image: Image.Image,
             system_prompt: str,
             user_prompt: str,
-            temperature: float = 0.5,
-            max_new_tokens: int = 300,
+            temperature: float = 0,
+            max_new_tokens: int = 0,
     ) -> str:
         # Import torch
         try:
@@ -355,8 +355,16 @@ def get_caption(
             ], dim=1).to(device)
             attention_mask = torch.ones_like(input_ids)
             # Generate caption
-            self.logger.debug(f'LLM temperature is {temperature}')
-            self.logger.debug(f'LLM max_new_tokens is {max_new_tokens}')
+            if temperature == 0:
+                temperature = 0.5
+                self.logger.warning(f'LLM temperature not set, using default value {temperature}')
+            else:
+                self.logger.debug(f'LLM temperature is {temperature}')
+            if max_new_tokens == 0:
+                max_new_tokens = 300
+                self.logger.warning(f'LLM max_new_tokens not set, using default value {max_new_tokens}')
+            else:
+                self.logger.debug(f'LLM max_new_tokens is {max_new_tokens}')
             generate_ids = self.llm.generate(input_ids,
                                              inputs_embeds=inputs_embeds,
                                              attention_mask=attention_mask,
@@ -378,9 +386,37 @@ def get_caption(
             self.logger.debug(f'Using system prompt:{system_prompt}')
             self.logger.debug(f'Using user prompt:{user_prompt}')
             messages = [{'role': 'user', 'content': [image, f'{user_prompt}']}]
+            if temperature == 0 and max_new_tokens == 0:
+                max_new_tokens = 2048
+                self.logger.warning(f'LLM temperature and max_new_tokens not set, only '
+                                    f'using default max_new_tokens value {max_new_tokens}')
+                params = {
+                    'num_beams': 3,
+                    'repetition_penalty': 1.2,
+                    'max_new_tokens': max_new_tokens
+                }
+            else:
+                if temperature == 0:
+                    temperature = 0.7
+                    self.logger.warning(f'LLM temperature not set, using default value {temperature}')
+                else:
+                    self.logger.debug(f'LLM temperature is {temperature}')
+                if max_new_tokens == 0:
+                    max_new_tokens = 2048
+                    self.logger.warning(f'LLM max_new_tokens not set, using default value {max_new_tokens}')
+                else:
+                    self.logger.debug(f'LLM max_new_tokens is {max_new_tokens}')
+                params = {
+                    'top_p': 0.8,
+                    'top_k': 100,
+                    'temperature': temperature,
+                    'repetition_penalty': 1.05,
+                    'max_new_tokens': max_new_tokens
+                }
+            params['max_inp_length'] = 4352
             content = self.llm.chat(image=image, msgs=messages, tokenizer=self.llm_tokenizer,
                                     system_prompt=system_prompt if system_prompt else None,
-                                    sampling=False, stream=False)
+                                    sampling=False, stream=False, **params)
         elif self.models_type == "florence":
             self.logger.warning(f"Florence models don't support system prompt or user prompt!")
             self.logger.warning(f"Florence models don't support temperature or max tokens!")
@@ -433,10 +469,29 @@ def run_inference(task_prompt, text_input=None):
             # Generate caption
             self.logger.debug(f'LLM temperature is {temperature}')
             self.logger.debug(f'LLM max_new_tokens is {max_new_tokens}')
-            output = self.llm.generate(**inputs,
-                                       max_new_tokens=max_new_tokens,
-                                       do_sample=True, top_k=10,
-                                       temperature=temperature)
+            if temperature == 0 and max_new_tokens == 0:
+                max_new_tokens = 300
+                self.logger.warning(f'LLM temperature and max_new_tokens not set, only '
+                                    f'using default max_new_tokens value {max_new_tokens}')
+                params = {}
+            else:
+                if temperature == 0:
+                    temperature = 0.5
+                    self.logger.warning(f'LLM temperature not set, using default value {temperature}')
+                else:
+                    self.logger.debug(f'LLM temperature is {temperature}')
+                if max_new_tokens == 0:
+                    max_new_tokens = 300
+                    self.logger.warning(f'LLM max_new_tokens not set, using default value {max_new_tokens}')
+                else:
+                    self.logger.debug(f'LLM max_new_tokens is {max_new_tokens}')
+                params = {
+                    'do_sample': True,
+                    'top_k': 10,
+                    'temperature': temperature,
+                }
+
+            output = self.llm.generate(**inputs, max_new_tokens=max_new_tokens, **params)
             content = self.llm_processor.decode(output[0][inputs["input_ids"].shape[-1]:],
                                                 skip_special_tokens=True, clean_up_tokenization_spaces=True)
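The zero-sentinel fallback above is repeated once per model branch, each with its own defaults (0.5/300 for the Llama-style branch, 0.7/2048 for Mini-CPM). One possible consolidation, sketched with a hypothetical helper that is not part of this commit:

```python
import logging

logger = logging.getLogger(__name__)

# Hypothetical refactor: resolve both sentinel values in one place,
# parameterized by each model branch's own defaults.
def resolve_generate_params(temperature: float, max_new_tokens: int,
                            default_temperature: float,
                            default_max_new_tokens: int) -> tuple[float, int]:
    """Treat 0 as 'not set' and substitute the model's own defaults."""
    if temperature == 0:
        temperature = default_temperature
        logger.warning(f'LLM temperature not set, using default value {temperature}')
    if max_new_tokens == 0:
        max_new_tokens = default_max_new_tokens
        logger.warning(f'LLM max_new_tokens not set, using default value {max_new_tokens}')
    return temperature, max_new_tokens

# e.g. the Llama-style branch would call:
# temperature, max_new_tokens = resolve_generate_params(temperature, max_new_tokens, 0.5, 300)
```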
