4th place solution (Detection Part) for the BYU - Locating Bacterial Flagellar Motors 2025 competition.

Throughout the competition there were numerous missile strikes, bombings, and other acts of war that took the lives of many innocent people in Ukraine.
Rockets from russia hit within a few kilometers of my home in Odesa. Each day, Kaggle users from Ukraine face the chance of not waking up. Just keep this in mind while you read this solution writeup.

I would like to thank the Armed Forces of Ukraine, the Security Service of Ukraine, Defence Intelligence of Ukraine, and the State Emergency Service of Ukraine for providing safety and security to participate in this great competition, complete this work, and help science, technology, and business not to stop but to move forward.

This writeup describes the detection & blending part of our solution for the BYU - Locating Bacterial Flagellar Motors 2025 competition. Some details are omitted, as my solution is heavily based on our solution to another 3D detection challenge (the CryoET competition), but I will try to cover the most important parts of the solution.

I'd like to thank my teammate @christofhenkel for his great performance and collaboration during the competition. It's a pleasure to work with you, Christof! I will not spoil anything here, but Christof's solution is very out of the box and unique, so I highly recommend checking out his solution writeup as well: 4th place: Simple ResNet18 classification.

TLDR: The detection part of the solution is a hybrid 2.5D (2D encoder / 3D decoder) detection model. There are 4 checkpoints (4 folds) in total. Given an input volume of shape [D, H, W], the model first reduces the depth of the input volume by a factor of 4 using nn.Conv3d with stride 4 ([D//4, H, W]). Then we pass the reduced feature maps through a 2D encoder (maxxvit_rmlp_small_rw_256.sw_in1k) and take the last feature map ([D//4, H//32, W//32]). Next, we pass it through a 3D CNN decoder to reduce the feature map further to [D//32, H//32, W//32]. Finally, we pass this feature map to the object detection head, which predicts a logits map and an offsets map. The model is trained with a custom loss function that mimics the PP-YOLO loss function with a few modifications. We accelerate model inference with NVIDIA TensorRT to achieve a 200% speedup compared to the eager PyTorch runtime and leverage two T4 GPUs to run predictions in parallel.
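For readers who prefer code, here is a minimal sketch of the whole pipeline described above. It is not the exact training code: the module names, decoder channel counts, and the simplified head are assumptions made for illustration; the real building blocks are shown in the sections below.

import timm
import torch
from einops import rearrange
from torch import nn


class Hybrid25DDetector(nn.Module):
    """Sketch of the 3D stem -> 2D encoder -> 3D decoder -> detection head pipeline."""

    def __init__(self, num_classes: int = 1):
        super().__init__()
        # Stem: reduce depth by 4, keep H and W.
        self.initial_conv = nn.Sequential(
            nn.Conv3d(1, 8, kernel_size=(7, 3, 3), stride=(4, 1, 1), padding=(3, 1, 1), bias=False),
            nn.InstanceNorm3d(8),
        )
        # 2D encoder applied slice-wise; its last feature map has stride 32 in H and W.
        # pretrained=False only keeps this sketch download-free; the real model uses pretrained weights.
        self.backbone = timm.create_model(
            "maxxvit_rmlp_small_rw_256.sw_in1k", pretrained=False, features_only=True, in_chans=8
        )
        in_ch = self.backbone.feature_info.channels()[-1]
        # 3D decoder: three blocks with stride 2 in depth only take D/4 down to D/32.
        blocks = []
        for out_ch in (256, 192, 128):  # illustrative channel choices
            blocks.append(
                nn.Sequential(
                    nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1, stride=(2, 1, 1), bias=False),
                    nn.InstanceNorm3d(out_ch),
                    nn.SiLU(inplace=True),
                )
            )
            in_ch = out_ch
        self.neck = nn.Sequential(*blocks)
        # Detection head (simplified): per-cell class logits and XYZ offsets.
        self.cls_head = nn.Conv3d(in_ch, num_classes, kernel_size=1)
        self.offset_head = nn.Conv3d(in_ch, 3, kernel_size=1)

    def forward(self, x: torch.Tensor):
        x = self.initial_conv(x)                      # (B, 8, D/4, H, W)
        b, c, d, h, w = x.shape
        x = rearrange(x, "b c d h w -> (b d) c h w")
        feat2d = self.backbone(x)[-1]                 # (B*D/4, C, H/32, W/32)
        feat3d = rearrange(feat2d, "(b d) c h w -> b c d h w", d=d)
        feat3d = self.neck(feat3d)                    # (B, C', D/32, H/32, W/32)
        return self.cls_head(feat3d), self.offset_head(feat3d)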

Introduction

I initially started with a 3D SegResNet detection model taken from our CryoET challenge solution. The only change I made was to use an output stride of 4 instead of 2. This approach was OK: without external data I was able to achieve 0.831 on the public LB. However, I was not satisfied with the results and wanted to try something different.

Next, I implemented a hybrid 2.5D (2D encoder / 3D decoder) model, also with an output stride of 4. This model used a 2D encoder (maxxvit_rmlp_small_rw_256.sw_in1k) and a 3D decoder that reduces the depth of the input volume by a factor of 4. At this point I also added the external data shared by @brendanartley (kudos for sharing it with the community!). This boosted the public LB score to 0.846. From that point on I dropped the idea of a fully 3D CNN and focused on 2.5D models, as they were much faster to train and to run inference with.

Modeling approach

My final approach uses a hybrid 3D-2D-3D architecture with a 2D encoder in the middle. The model itself is trained for a 3D object detection task, with a training objective matching our CryoET challenge solution. In a nutshell, the model predicts a class map of shape [B, C, D/32, H/32, W/32] and an offsets map of shape [B, 3, D/32, H/32, W/32], where B is the batch size, C is the number of classes (1 in our case), and D, H, W are the depth, height, and width of the input volume respectively. An initial 3D convolution with stride 4 reduces the depth of the input volume by a factor of 4, which allows us to use a 2D encoder to extract features from the input volume.

Stem

The stem of the model is a 3D convolution with kernel size 7x3x3 and stride 4 along the depth axis. It serves the purpose of reducing the depth of the input volume and extracting initial representations from the input data. The stem is followed by an instance normalization layer to normalize the feature maps. The use of a larger kernel in the depth dimension and 8 output channels prevents information loss and extracts meaningful features from the input volume.

self.initial_conv = nn.Sequential(
    nn.Conv3d(
        in_channels=1,
        out_channels=8,
        kernel_size=(7, 3, 3),
        stride=(4, 1, 1),
        padding=(3, 1, 1),
        bias=False,
    ),
    nn.InstanceNorm3d(8),
)
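A quick, illustrative shape check of what this stem does to a training-sized crop (the 128x256x256 crop size comes from the Training section below):

import torch
from torch import nn

# Same stem as above: stride 4 in depth, stride 1 spatially.
stem = nn.Sequential(
    nn.Conv3d(1, 8, kernel_size=(7, 3, 3), stride=(4, 1, 1), padding=(3, 1, 1), bias=False),
    nn.InstanceNorm3d(8),
)
x = torch.randn(1, 1, 128, 256, 256)   # (B, C, D, H, W) training crop
print(stem(x).shape)                   # torch.Size([1, 8, 32, 256, 256]) -> depth reduced to D/4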

2D Encoder

For the 2D encoder, I used maxxvit_rmlp_small_rw_256.sw_in1k from the timm library. The choice of this specific encoder was motivated by the presence of ViT blocks in the architecture, which are known to be effective at capturing long-range dependencies in the data. My assumption was that this is important for the task of detecting bacterial flagellar motors, as these structures are small but occur at specific locations relative to the bacterial cell body, and having a transformer-based architecture in the middle of the model should help to capture these dependencies.

self.backbone = timm.create_model(
    model_name="maxxvit_rmlp_small_rw_256.sw_in1k",
    pretrained=True,
    features_only=True,
    in_chans=8,
)

def forward_25d_backbone(self, x: torch.Tensor):
    """Process 3D input through 2D backbone.

    Args:
        x: Input tensor of shape (B, C, D, H, W)

    Returns:
        Feature tensor of shape (B, C_last, D // d_stride, H // 32, W // 32),
        built from the last feature map of the 2D backbone
    """
    # Apply initial 3D convolution to reduce D dimension
    x = self.initial_conv(x)  # Now D dimension is reduced by d_stride
    B, C, D, H, W = x.shape
    x = einops.rearrange(x, "b c d h w -> (b d) c h w")
    features = self.backbone(x)
    features_3d = einops.rearrange(features[-1], "(b d) c h w -> b c d h w", d=D)
    return features_3d

3D Decoder

After obtaining the feature maps from the 2D encoder, I pass them through a 3D decoder to reduce the depth of the feature maps by a further factor of 8, obtaining final feature maps of shape [B, C, D/32, H/32, W/32]. In terms of architecture, the decoder consists of blocks of Conv3d + normalization + activation, where the first 3x3x3 Conv3d layer in each block has a stride of 2 in the depth dimension.

The goal of the decoder is to incorporate the spatial information from the 2D encoder into a consistent 3D representation that can be used for object detection.

in_channels = backbone_channels[-1]
out_channels = None

self.neck = nn.Sequential()
for out_channels in config.decoder_channels:
    self.neck.append(
        nn.Sequential(
            nn.Conv3d(in_channels, out_channels, kernel_size=1, bias=False),
            nn.Conv3d(
                out_channels,
                out_channels,
                kernel_size=(3, 3, 3),
                padding=1,
                stride=(2, 1, 1),
                bias=False,
                groups=config.decoder_num_groups,
            ),
            get_norm_layer(config.decoder_norm_type, out_channels, config.decoder_num_groups),
            get_activation(config.decoder_activation, inplace=True),
            nn.Dropout3d(config.decoder_dropout),
            nn.Conv3d(
                out_channels, out_channels, kernel_size=(3, 3, 3), padding=1, bias=False, groups=config.decoder_num_groups
            ),
            get_norm_layer(config.decoder_norm_type, out_channels, config.decoder_num_groups),
            get_activation(config.decoder_activation, inplace=True),
        )
    )
    in_channels = out_channels
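As an illustrative shape check (with simplified norm/activation, no grouping or dropout, and hypothetical channel counts), three depth-stride-2 blocks take the encoder output from D/4 down to D/32:

import torch
from torch import nn

decoder_channels = (256, 192, 128)          # hypothetical config values
in_ch, blocks = 768, []                     # 768 = assumed last-stage width of the 2D encoder
for out_ch in decoder_channels:
    blocks.append(nn.Sequential(
        nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1, stride=(2, 1, 1), bias=False),
        nn.InstanceNorm3d(out_ch),
        nn.SiLU(inplace=True),
    ))
    in_ch = out_ch
neck = nn.Sequential(*blocks)

x = torch.randn(1, 768, 32, 8, 8)           # (B, C, D/4, H/32, W/32) for a 128x256x256 crop
print(neck(x).shape)                        # torch.Size([1, 128, 4, 8, 8]) -> depth D/32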

3D Object Detection Head

The final part of the model is the 3D object detection head, which predicts the class logits map and the offsets map.

import torch
from torch import nn
from typing import Literal

# Note: get_norm_layer and get_activation are helper factories defined elsewhere in the repo.
class ObjectDetectionHead(nn.Module):
    def __init__(
        self,
        in_channels: int,
        num_classes: int,
        stride: int,
        head_kernel_size: int = 3,
        intermediate_channels: int = 64,
        offset_intermediate_channels: int = 32,
        norm_type: Literal["instance", "batch", "group"] = "instance",
        num_groups: int = 32,
        activation: str = "silu",
    ):
        super().__init__()

        def make_conv_block(in_ch: int, out_ch: int) -> nn.Sequential:
            return nn.Sequential(
                nn.Conv3d(in_ch, out_ch, kernel_size=head_kernel_size, padding=head_kernel_size // 2),
                get_activation(activation, inplace=True),
                get_norm_layer(norm_type, out_ch, num_groups),
                nn.Conv3d(out_ch, out_ch, kernel_size=head_kernel_size, padding=head_kernel_size // 2),
                get_activation(activation, inplace=True),
                get_norm_layer(norm_type, out_ch, num_groups),
            )

        self.stride = stride

        self.cls_stem = make_conv_block(in_channels, intermediate_channels)
        self.cls_head = nn.Conv3d(intermediate_channels, num_classes, kernel_size=1, padding=0)

        self.offset_stem = make_conv_block(in_channels, offset_intermediate_channels)
        self.offset_head = nn.Conv3d(offset_intermediate_channels, 3, kernel_size=1, padding=0)

        torch.nn.init.zeros_(self.offset_head.weight)
        torch.nn.init.constant_(self.offset_head.bias, 0)


    def forward(self, features):
        logits = self.cls_head(self.cls_stem(features))
        # tanh bounds each offset to +-stride voxels around the cell center
        offsets = self.offset_head(self.offset_stem(features)).tanh() * self.stride
        return logits, offsets
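For completeness, here is a hedged sketch of how the per-voxel logits and offsets maps can be decoded back into motor coordinates in the input volume. The actual post-processing in the solution may differ (for example, in how candidate peaks are extracted); the function name and the threshold value are illustrative.

import torch


def decode_detections(logits: torch.Tensor, offsets: torch.Tensor, stride: int = 32, threshold: float = 0.5):
    """Convert a (1, D', H', W') logits map and a (3, D', H', W') offsets map to coordinates.

    Each output cell corresponds to a stride x stride x stride block of the input volume;
    the offsets refine the cell-center position in voxels of the input resolution.
    """
    probs = logits.sigmoid()[0]                       # (D', H', W')
    keep = probs > threshold
    z, y, x = torch.nonzero(keep, as_tuple=True)      # grid coordinates of candidate cells
    centers = torch.stack([z, y, x], dim=1).float() * stride + stride / 2
    refined = centers + offsets[:, z, y, x].T         # add predicted sub-cell offsets (in voxels)
    scores = probs[z, y, x]
    return refined, scores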

Training

I used a 4-fold stratified group split. Stratification was done by voxel size, while external data was additionally grouped by dataset id to prevent data leakage. For validation I used only tomos with 0 or 1 motor instances.
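A minimal sketch of how such a split can be set up with scikit-learn's StratifiedGroupKFold; the input file and the column names (voxel_spacing, group_id) are assumptions for illustration, not the exact ones used in the repo.

import pandas as pd
from sklearn.model_selection import StratifiedGroupKFold

# df is assumed to have one row per tomogram with (illustrative) columns:
#   voxel_spacing : float, used for stratification after binning
#   group_id      : dataset id for external data, tomo id otherwise (prevents leakage)
df = pd.read_csv("train_folds_input.csv")            # hypothetical file
df["strata"] = pd.cut(df["voxel_spacing"], bins=5, labels=False)

skf = StratifiedGroupKFold(n_splits=4, shuffle=True, random_state=42)
df["fold"] = -1
for fold, (_, valid_idx) in enumerate(skf.split(df, y=df["strata"], groups=df["group_id"])):
    df.loc[df.index[valid_idx], "fold"] = fold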

Each training epoch used a fixed number of random crops per study (4) and a fixed number of random crops around each motor instance (8). For data augmentation I used:

  • Random flips along the X, Y, and Z axes.
  • Random rotations around the Z axis (±180 degrees).
  • Random brightness, contrast & gamma alterations.
  • Slight rotations around the X and Y axes (±10 degrees).
  • Heavy scale jitter to cover all resolutions in the 8-20Å range (with the mode around 13Å).
  • Additional anisotropic scale jitter of ±10% along each axis.
  • Mixup with 0.5 probability. I found that in this specific competition mixup sped up training a lot: with mixup I reached within 25 epochs the same accuracy that otherwise required at least 50 epochs of training (a minimal mixup sketch follows this list).
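Here is a minimal sketch of one possible mixup variant for volumes and dense class-target maps (blend the inputs, take the elementwise max of the heatmaps so motors from both samples stay positive). This is not necessarily the exact variant used in training, and offset-target handling is omitted.

import torch


def mixup_volumes(volumes: torch.Tensor, cls_targets: torch.Tensor, alpha: float = 1.0, p: float = 0.5):
    """Mixup for a batch of volumes (B, 1, D, H, W) and dense class-target maps (B, C, D', H', W')."""
    if torch.rand(1).item() > p:
        return volumes, cls_targets
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(volumes.size(0), device=volumes.device)
    mixed = lam * volumes + (1.0 - lam) * volumes[perm]
    targets = torch.maximum(cls_targets, cls_targets[perm])  # union-style target blending
    return mixed, targets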

During training, I used 0.5x input scale resolution, and my input volumes were of size 128x256x256 px.

Validation

For validation, I used a sliding-window approach with the same window size and 0.5 overlap between tiles. Individual tile predictions were accumulated into the final class map and offsets map, and the F2 score was computed on these maps. After each epoch, I computed the threshold that maximizes the F2 score on the validation set. I saved the top-5 models of each training experiment and later averaged their weights, which almost always increased the F2 score.
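A hedged sketch of the threshold search: sweep candidate thresholds and keep the one that maximizes F2 (which weights recall four times as much as precision). The input arrays and matching logic are assumptions; the real metric also involves distance-based matching of predictions to ground-truth motors.

import numpy as np


def find_best_f2_threshold(scores: np.ndarray, is_true_positive: np.ndarray, num_gt: int):
    """scores: confidence of each predicted motor; is_true_positive: whether it matches a GT motor."""
    best_thr, best_f2 = 0.0, -1.0
    for thr in np.unique(scores):
        kept = scores >= thr
        tp = int(np.sum(kept & is_true_positive))
        fp = int(np.sum(kept & ~is_true_positive))
        fn = num_gt - tp
        f2 = 5 * tp / (5 * tp + 4 * fn + fp) if tp else 0.0   # F_beta with beta=2
        if f2 > best_f2:
            best_thr, best_f2 = thr, f2
    return best_thr, best_f2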

As many users noticed, the local validation was overly optimistic and did not reflect the true performance of the model on the public LB. At first, I thought it was because the LB contains tomos at higher resolution than the released training data. However, after adding external data to the training set, I found that the local validation score was still much higher than the public LB score.

Only two days before the end of the competition did I find that the reason for this discrepancy is that the public/private LB contains approximately 50/50 positive/negative studies, while the local validation set contains far more positive studies. After splitting each tomo volume in half along the W axis ([D, H, W] -> [D, H, :W/2] and [D, H, W/2:]) and computing predictions for each half separately, I was able to get a much more realistic local validation score, one closer to the public LB score: V16 Validation on halves
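The splitting itself is trivial; a minimal sketch, assuming volumes stored as NumPy arrays of shape (D, H, W):

import numpy as np


def split_volume_in_halves(volume: np.ndarray):
    """Return the two halves of a (D, H, W) volume along the W axis,
    so that motor-free halves enter the validation pool as negatives."""
    w = volume.shape[-1]
    return volume[..., : w // 2], volume[..., w // 2:]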

It was too late to change the training strategy, but I was able to use this knowledge to select the best models for the final ensemble and play with blending.

Blending

Christof's and my solutions are very different in nature and in how the raw model predictions look. The only practical way to combine them was to take our prediction CSVs and blend them at that level.

The validation on halves helped us find the best blending method, which I called "Winner takes all with double Otsu".

I will explain the Winner takes all approach first and then explain the double Otsu part.

Winner takes all

  1. Compute normalized ranks of each solution's scores.
  2. For each prediction, take the solution with the higher rank and use its coordinates and normalized rank instead of the raw score.
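A minimal sketch of this rank-based blend, assuming both prediction tables share the same tomo order and a tomo_id/x/y/z/score schema (the column names are illustrative):

import numpy as np
import pandas as pd


def winner_takes_all(pred_a: pd.DataFrame, pred_b: pd.DataFrame) -> pd.DataFrame:
    """Rank-based winner-takes-all blend of two aligned per-tomo prediction tables."""
    rank_a = pred_a["score"].rank(pct=True).to_numpy()   # normalized ranks in (0, 1]
    rank_b = pred_b["score"].rank(pct=True).to_numpy()
    use_a = rank_a >= rank_b

    blended = pred_b.copy()
    blended.loc[use_a, ["x", "y", "z"]] = pred_a.loc[use_a, ["x", "y", "z"]].to_numpy()
    blended["score"] = np.where(use_a, rank_a, rank_b)   # keep the winner's normalized rank
    return blended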

Double Otsu

"Winner takes all with double Otsu" method extends the Winner takes all approach with an additional step supress negative predictions.

  1. We take the raw scores from both solutions and compute an Otsu threshold for each solution separately.
  2. We binarize each solution's score array using its computed threshold. This gives us two binary masks.
  3. We use these masks to find the most likely negative predictions (where both masks are 0). If both models agree (in terms of the binary masks) that a prediction is negative, we assign it a score of 0 and set its coordinates to -1.
  4. The remaining predictions are processed using the Winner takes all approach.
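A hedged sketch of the full blend, reusing the winner_takes_all() function from the previous sketch and scikit-image's Otsu thresholding (same illustrative schema as above):

from skimage.filters import threshold_otsu


def double_otsu_blend(pred_a, pred_b):
    # Per-solution Otsu thresholds on the raw scores.
    thr_a = threshold_otsu(pred_a["score"].to_numpy())
    thr_b = threshold_otsu(pred_b["score"].to_numpy())

    # A prediction is treated as negative only when BOTH solutions fall below their own threshold.
    both_negative = (pred_a["score"].to_numpy() < thr_a) & (pred_b["score"].to_numpy() < thr_b)

    blended = winner_takes_all(pred_a, pred_b)   # defined in the previous sketch
    blended.loc[both_negative, ["x", "y", "z"]] = -1
    blended.loc[both_negative, "score"] = 0.0
    return blended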
| Method                         | Score (percentile)   |
|:-------------------------------|:---------------------|
| EK Predictions                 | 0.9115 (54.2%)       |
| CH Predictions                 | 0.9413 (54.7%)       |
| ---                            | ---                  |
| Winner Takes All (Rank-based)  | 0.9408 (55.7%)       |
| Double Otsu Blend              | 0.9421 (54.2%)       |

As you can see, the double Otsu blend increased the score a little bit compared to the Winner takes all approach.

Other methods, such as simple averaging, rank-averaging, weighted coordinate blending, re-weighting based on IoU, and others, did not work well for this competition.

That is what we used for the final submission.
