
Commit 7ba93ae

Merge branch 'huggingface:main' into cvt
2 parents e69b906 + 84cb225 commit 7ba93ae

File tree: 10 files changed (+246, -73 lines)


README.md

Lines changed: 7 additions & 0 deletions

@@ -26,6 +26,12 @@
 * The Hugging Face Hub (https://huggingface.co/timm) is now the primary source for `timm` weights. Model cards include link to papers, original source, license.
 * Previous 0.6.x can be cloned from [0.6.x](https://github.com/rwightman/pytorch-image-models/tree/0.6.x) branch or installed via pip with version.

+### May 14, 2024
+* Support loading PaliGemma jax weights into SigLIP ViT models with average pooling.
+* Add Hiera models from Meta (https://github.com/facebookresearch/hiera).
+* Add `normalize=` flag for transforms, return non-normalized torch.Tensor with original dtype (for `chug`)
+* Version 1.0.3 release
+
 ### May 11, 2024
 * `Searching for Better ViT Baselines (For the GPU Poor)` weights and vit variants released. Exploring model shapes between Tiny and Base.

@@ -42,6 +48,7 @@
 | [vit_medium_patch16_reg4_gap_256.sbb_in1k](https://huggingface.co/timm/vit_medium_patch16_reg4_gap_256.sbb_in1k) | 83.47 | 96.622 | 38.88 | 256 |
 | [vit_medium_patch16_reg1_gap_256.sbb_in1k](https://huggingface.co/timm/vit_medium_patch16_reg1_gap_256.sbb_in1k) | 83.462 | 96.548 | 38.88 | 256 |
 | [vit_little_patch16_reg4_gap_256.sbb_in1k](https://huggingface.co/timm/vit_little_patch16_reg4_gap_256.sbb_in1k) | 82.514 | 96.262 | 22.52 | 256 |
+| [vit_wee_patch16_reg1_gap_256.sbb_in1k](https://huggingface.co/timm/vit_wee_patch16_reg1_gap_256.sbb_in1k) | 80.256 | 95.360 | 13.42 | 256 |
 | [vit_pwee_patch16_reg1_gap_256.sbb_in1k](https://huggingface.co/timm/vit_pwee_patch16_reg1_gap_256.sbb_in1k) | 80.072 | 95.136 | 15.25 | 256 |
 | [vit_mediumd_patch16_reg4_gap_256.sbb_in12k](https://huggingface.co/timm/vit_mediumd_patch16_reg4_gap_256.sbb_in12k) | N/A | N/A | 64.11 | 256 |
 | [vit_betwixt_patch16_reg4_gap_256.sbb_in12k](https://huggingface.co/timm/vit_betwixt_patch16_reg4_gap_256.sbb_in12k) | N/A | N/A | 60.4 | 256 |
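
A quick sketch of the `normalize=` flag mentioned in the May 14 notes; this assumes the flag is exposed through `timm.data.create_transform`, and the model name is only an example:

```python
import timm
from timm.data import resolve_data_config, create_transform

model = timm.create_model('vit_base_patch16_siglip_224', pretrained=False)
cfg = resolve_data_config({}, model=model)

# Assumption: normalize=False skips the mean/std Normalize step so the
# transform returns an unnormalized torch.Tensor (handy for pipelines such
# as `chug` that apply their own normalization later).
transform = create_transform(**cfg, normalize=False)
```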

hfdocs/source/feature_extraction.mdx

Lines changed: 3 additions & 3 deletions

@@ -192,9 +192,9 @@ There are two additional creation arguments impacting the output features.

 #### Output index selection

-The `out_indices` argument is supported by all models, but not all models have the same index to feature stride mapping. Look at the code or check feature_info to compare. The out indices generally correspond to the `C(i+1)th` feature level (a `2^(i+1)` reduction). For most convnet models, index 0 is the stride 2 features, and index 4 is stride 32. For many ViT or ViT-Conv hybrids there may be many to all features maps of the same shape, or a combination of hierarchical and non-hieararchical feature maps. It is best to look at the `feature_info` attribute to see the number of features, their corresponding channel count and reduction level.
+The `out_indices` argument is supported by all models, but not all models have the same index to feature stride mapping. Look at the code or check feature_info to compare. The out indices generally correspond to the `C(i+1)th` feature level (a `2^(i+1)` reduction). For most convnet models, index 0 is the stride 2 features, and index 4 is stride 32. For many ViT or ViT-Conv hybrids there may be many to all features maps of the same shape, or a combination of hierarchical and non-hierarchical feature maps. It is best to look at the `feature_info` attribute to see the number of features, their corresponding channel count and reduction level.

-`out_indices` supports negative indexing, this makes it easy to get the last, penunltimate, etc feature map. `out_indices=(-2,)` would return the penultimate feature map for any model.
+`out_indices` supports negative indexing, this makes it easy to get the last, penultimate, etc feature map. `out_indices=(-2,)` would return the penultimate feature map for any model.

 #### Output stride (feature map dilation)
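
A small sketch of the negative-index behaviour described above; the model name is arbitrary, and `feature_info` is the attribute the text recommends inspecting:

```python
import torch
import timm

# features_only wraps the model for feature extraction; out_indices=(-2,)
# selects the penultimate feature map regardless of how many levels exist.
model = timm.create_model('resnet50', pretrained=False, features_only=True, out_indices=(-2,))
print(model.feature_info.channels())   # channel count of the selected map
print(model.feature_info.reduction())  # its reduction (stride) factor

feats = model(torch.randn(1, 3, 224, 224))
print([f.shape for f in feats])        # a list containing one tensor
```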

@@ -228,7 +228,7 @@ Accompanying the `forward_intermediates` function is a `prune_intermediate_layer

 An `indices` argument is used for both `forward_intermediates()` and `prune_intermediate_layers()` to select the features to return or layers to remove. As with the `out_indices` for `features_only` API, `indices` is model specific and selects which intermediates are returned.

-In non-hierarchical block based models such as ViT the indices correspond to the blocks, in models with hierarchical stages they usually correspond to the output of the stem + each hierarhical stage. Both positive (from the start), and negative (relative to the end) indexing works, and `None` is used to return all intermediates.
+In non-hierarchical block based models such as ViT the indices correspond to the blocks, in models with hierarchical stages they usually correspond to the output of the stem + each hierarchical stage. Both positive (from the start), and negative (relative to the end) indexing works, and `None` is used to return all intermediates.

 The `prune_intermediate_layers()` call returns an indices variable, as negative indices must be converted to absolute (positive) indices when the model is trimmed.
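
A rough companion sketch for the `indices` argument on `forward_intermediates()` / `prune_intermediate_layers()`; the ViT variant and the specific indices are arbitrary choices:

```python
import torch
import timm

model = timm.create_model('vit_base_patch16_224', pretrained=False)
x = torch.randn(1, 3, 224, 224)

# Negative indices count back from the last block; indices=None would
# return every intermediate.
final, intermediates = model.forward_intermediates(x, indices=(-2, -1))
print(final.shape, [t.shape for t in intermediates])

# Pruning trims the unused trailing blocks and returns the kept indices
# converted to absolute (positive) values.
kept = model.prune_intermediate_layers(indices=(-2, -1), prune_head=True)
print(kept)
```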

hfdocs/source/installation.mdx

Lines changed: 1 addition & 1 deletion

@@ -28,7 +28,7 @@ You should install `timm` in a [virtual environment](https://docs.python.org/3/l
 # Deactivate the virtual environment
 source .env/bin/deactivate
 ```
-`
+
 Once you've created your virtual environment, you can install `timm` in it.

 ## Using pip

timm/layers/patch_dropout.py

Lines changed: 1 addition & 1 deletion

@@ -6,7 +6,7 @@

 class PatchDropout(nn.Module):
     """
-    https://arxiv.org/abs/2212.00794
+    https://arxiv.org/abs/2212.00794 and https://arxiv.org/pdf/2208.07220
     """
     return_indices: torch.jit.Final[bool]
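
For orientation, a minimal usage sketch of the class this docstring belongs to; the token counts and drop probability here are made up:

```python
import torch
from timm.layers import PatchDropout

drop = PatchDropout(prob=0.25, num_prefix_tokens=1)  # keep the class token
drop.train()  # token dropout is only active in training mode

tokens = torch.randn(2, 197, 768)  # (batch, 1 cls + 196 patch tokens, dim)
out = drop(tokens)
print(out.shape)  # fewer patch tokens than went in
```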

timm/models/_builder.py

Lines changed: 8 additions & 2 deletions

@@ -10,7 +10,8 @@
 from timm.models._features import FeatureListNet, FeatureDictNet, FeatureHookNet, FeatureGetterNet
 from timm.models._features_fx import FeatureGraphNet
 from timm.models._helpers import load_state_dict
-from timm.models._hub import has_hf_hub, download_cached_file, check_cached_file, load_state_dict_from_hf
+from timm.models._hub import has_hf_hub, download_cached_file, check_cached_file, load_state_dict_from_hf,\
+    load_custom_from_hf
 from timm.models._manipulate import adapt_input_conv
 from timm.models._pretrained import PretrainedCfg
 from timm.models._prune import adapt_model_from_file

@@ -185,7 +186,12 @@ def load_pretrained(
     elif load_from == 'hf-hub':
         _logger.info(f'Loading pretrained weights from Hugging Face hub ({pretrained_loc})')
         if isinstance(pretrained_loc, (list, tuple)):
-            state_dict = load_state_dict_from_hf(*pretrained_loc)
+            custom_load = pretrained_cfg.get('custom_load', False)
+            if isinstance(custom_load, str) and custom_load == 'hf':
+                load_custom_from_hf(*pretrained_loc, model)
+                return
+            else:
+                state_dict = load_state_dict_from_hf(*pretrained_loc)
         else:
             state_dict = load_state_dict_from_hf(pretrained_loc)
     else:
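
In short, the branch added above checks the pretrained config for `custom_load == 'hf'` and, when set, hands loading over to the model's own loader instead of fetching a plain state dict. A hypothetical config illustrating the dispatch (the hub id and filename are placeholders, not real artifacts):

```python
# Hypothetical pretrained_cfg entry; only the 'custom_load' key drives the new branch.
pretrained_cfg = {
    'hf_hub_id': 'some-org/some-jax-export',  # placeholder hub repo
    'hf_hub_filename': 'model.npz',           # placeholder non-PyTorch checkpoint
    'custom_load': 'hf',
}

custom_load = pretrained_cfg.get('custom_load', False)
if isinstance(custom_load, str) and custom_load == 'hf':
    print('-> load_custom_from_hf(hub_id, filename, model), i.e. model.load_pretrained(file)')
else:
    print('-> load_state_dict_from_hf(...), then the state_dict is loaded as usual')
```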

timm/models/_hub.py

Lines changed: 7 additions & 0 deletions

@@ -190,6 +190,13 @@ def load_state_dict_from_hf(model_id: str, filename: str = HF_WEIGHTS_NAME):
     return torch.load(cached_file, map_location='cpu')


+def load_custom_from_hf(model_id: str, filename: str, model: torch.nn.Module):
+    assert has_hf_hub(True)
+    hf_model_id, hf_revision = hf_split(model_id)
+    cached_file = hf_hub_download(hf_model_id, filename=filename, revision=hf_revision)
+    return model.load_pretrained(cached_file)
+
+
 def save_config_for_hf(
         model,
         config_path: str,
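
A hedged usage sketch for the new helper: it assumes the target model implements its own `load_pretrained()` (timm's VisionTransformer does, for non-PyTorch checkpoints), and the hub id / filename are placeholders rather than real artifacts:

```python
import timm
from timm.models._hub import load_custom_from_hf

model = timm.create_model('vit_so400m_patch14_siglip_224', pretrained=False)

# Downloads <filename> from the hub repo and hands the local path to
# model.load_pretrained(); nothing goes through torch.load here.
load_custom_from_hf('some-org/some-paligemma-export', 'model.npz', model)
```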

timm/models/mobilenetv3.py

Lines changed: 0 additions & 37 deletions

@@ -622,43 +622,6 @@ def _gen_lcnet(variant: str, channel_multiplier: float = 1.0, pretrained: bool =
     return model


-def _gen_lcnet(variant: str, channel_multiplier: float = 1.0, pretrained: bool = False, **kwargs):
-    """ LCNet
-    Essentially a MobileNet-V3 crossed with a MobileNet-V1
-
-    Paper: `PP-LCNet: A Lightweight CPU Convolutional Neural Network` - https://arxiv.org/abs/2109.15099
-
-    Args:
-        channel_multiplier: multiplier to number of channels per layer.
-    """
-    arch_def = [
-        # stage 0, 112x112 in
-        ['dsa_r1_k3_s1_c32'],
-        # stage 1, 112x112 in
-        ['dsa_r2_k3_s2_c64'],
-        # stage 2, 56x56 in
-        ['dsa_r2_k3_s2_c128'],
-        # stage 3, 28x28 in
-        ['dsa_r1_k3_s2_c256', 'dsa_r1_k5_s1_c256'],
-        # stage 4, 14x14in
-        ['dsa_r4_k5_s1_c256'],
-        # stage 5, 14x14in
-        ['dsa_r2_k5_s2_c512_se0.25'],
-        # 7x7
-    ]
-    model_kwargs = dict(
-        block_args=decode_arch_def(arch_def),
-        stem_size=16,
-        round_chs_fn=partial(round_channels, multiplier=channel_multiplier),
-        norm_layer=partial(nn.BatchNorm2d, **resolve_bn_args(kwargs)),
-        act_layer=resolve_act_layer(kwargs, 'hard_swish'),
-        se_layer=partial(SqueezeExcite, gate_layer='hard_sigmoid', force_act_layer=nn.ReLU),
-        num_features=1280,
-        **kwargs,
-    )
-    model = _create_mnv3(variant, pretrained, **model_kwargs)
-    return model
-

 def _cfg(url: str = '', **kwargs):
     return {
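
The hunk above drops an accidental duplicate of `_gen_lcnet`; the surviving definition still backs the LCNet variants, e.g. (model name taken from the existing `lcnet_*` family):

```python
import timm

# Builds through the remaining _gen_lcnet definition.
model = timm.create_model('lcnet_100', pretrained=False)
print(sum(p.numel() for p in model.parameters()))
```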
