
Semantic and instance segmentation are fundamental tasks in computer vision. Semantic segmentation assigns a class label to every pixel, dividing the image into semantically meaningful regions. Instance segmentation goes further: it must also distinguish individual objects of the same class, outlining each one at the pixel level and assigning it its own segmentation mask.
Fully Convolutional Networks (FCNs) remain foundational for semantic segmentation and are widely applied in organ segmentation, tumor detection, vascular segmentation, cell and nucleus segmentation, lesion detection, and other medical or natural image domains.
Below is a structured overview of influential architectures.
I. AUTOENCODER-BASED ARCHITECTURES
1. Deconvolutional (Transposed Convolution) Networks – These networks apply an encoder–decoder approach:
– Encoder: convolutional layers transform the input image into feature maps.
– Decoder: upsampling and deconvolution layers reconstruct the features into a segmentation map.
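To make the idea concrete, here is a minimal, hypothetical encoder–decoder in PyTorch (the TinyDeconvNet name and layer sizes are illustrative, not from any specific paper): strided convolutions shrink the input, and transposed convolutions expand it back to full resolution.

```python
import torch
import torch.nn as nn

class TinyDeconvNet(nn.Module):
    """Toy encoder-decoder: downsample with strided convs,
    upsample with transposed convs (deconvolutions)."""
    def __init__(self, num_classes: int = 2):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1),   # H -> H/2
            nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),  # H/2 -> H/4
            nn.ReLU(inplace=True),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, kernel_size=2, stride=2),    # H/4 -> H/2
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(32, num_classes, kernel_size=2, stride=2),  # H/2 -> H
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

logits = TinyDeconvNet()(torch.randn(1, 3, 128, 128))  # -> (1, 2, 128, 128)
```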
2. SegNet also uses an encoder–decoder structure, but its decoder upsamples feature maps using the max-pooling indices recorded by the encoder. Because only these indices need to be stored, rather than full encoder feature maps, spatial detail is recovered at low memory and compute cost.
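The index-passing trick is directly supported in PyTorch; a small sketch with arbitrary shapes:

```python
import torch
import torch.nn as nn

# SegNet-style unpooling: MaxPool2d records the argmax indices during
# encoding, and MaxUnpool2d reuses them to place activations back at
# their original spatial positions during decoding.
pool = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(kernel_size=2, stride=2)

x = torch.randn(1, 64, 32, 32)
pooled, indices = pool(x)            # (1, 64, 16, 16) + saved argmax positions
restored = unpool(pooled, indices)   # (1, 64, 32, 32), sparse but aligned
```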
3. U-Net is one of the most influential models in medical image segmentation. It connects its encoder (a contracting CNN path) and decoder (upsampling + convolution) with skip connections at matching resolutions, so spatial detail lost during downsampling can be recovered.
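A minimal sketch of one such skip connection in isolation (not the full network; shapes and channel counts are illustrative):

```python
import torch
import torch.nn as nn

# One U-Net skip: the decoder upsamples, concatenates the matching encoder
# feature map along the channel axis, and fuses the result with a convolution.
up = nn.ConvTranspose2d(128, 64, kernel_size=2, stride=2)
fuse = nn.Conv2d(64 + 64, 64, kernel_size=3, padding=1)

encoder_feat = torch.randn(1, 64, 64, 64)   # saved during downsampling
decoder_feat = torch.randn(1, 128, 32, 32)  # coming up from the bottleneck

merged = torch.cat([up(decoder_feat), encoder_feat], dim=1)  # (1, 128, 64, 64)
out = fuse(merged)                                           # (1, 64, 64, 64)
```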
4. U-Net++: An advanced variant of U-Net, U-Net++ introduces nested and dense skip pathways. These improve feature fusion between encoder and decoder stages, yielding more precise segmentation results. It has been especially effective in medical imaging tasks where fine-grained boundaries matter.
5. LinkNet improves on SegNet by linking each encoder block directly to the corresponding decoder block, merging features by element-wise addition rather than concatenation. With an encoder typically based on ResNet, it achieves real-time performance while preserving fine detail.
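For contrast with U-Net's concatenation, a LinkNet-style link is a plain element-wise addition (again just a sketch with arbitrary shapes):

```python
import torch

# LinkNet-style link: the decoder stage adds the corresponding encoder
# feature map element-wise, so no extra channels are created.
encoder_feat = torch.randn(1, 64, 64, 64)
decoder_feat = torch.randn(1, 64, 64, 64)  # already upsampled to match
linked = decoder_feat + encoder_feat       # same shape as either input
```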
6. V-Net – Designed specifically for volumetric (3D) medical image segmentation, V-Net extends the encoder–decoder paradigm to 3D data. It replaces pooling with strided convolutions and incorporates residual connections, making it well suited to segmenting organs or tumors in CT/MRI scans.
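A V-Net-flavoured building block might look like the following sketch (the Down3D name and channel sizes are made up for illustration):

```python
import torch
import torch.nn as nn

# Sketch of a V-Net-style 3D stage: a strided 3D convolution replaces
# pooling for downsampling, and a residual connection wraps the
# convolutional stage.
class Down3D(nn.Module):
    def __init__(self, ch: int):
        super().__init__()
        self.down = nn.Conv3d(ch, 2 * ch, kernel_size=2, stride=2)  # halves D, H, W
        self.conv = nn.Conv3d(2 * ch, 2 * ch, kernel_size=3, padding=1)

    def forward(self, x):
        x = self.down(x)
        return torch.relu(self.conv(x) + x)  # residual connection

vol = torch.randn(1, 16, 32, 64, 64)         # (batch, ch, depth, H, W)
out = Down3D(16)(vol)                        # -> (1, 32, 16, 32, 32)
```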
II. REGION-BASED ARCHITECTURES
1. Mask R-CNN – Extends Faster R-CNN with a parallel branch for pixel-level mask prediction. A Region Proposal Network (RPN) generates candidate object regions, and RoIAlign extracts their features without quantization, which is what makes precise masks possible. Mask R-CNN remains one of the most popular instance segmentation frameworks.
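Mask R-CNN is available off the shelf in torchvision; the snippet below assumes torchvision >= 0.13, where the weights argument replaced pretrained:

```python
import torch
import torchvision

# Pretrained Mask R-CNN with a ResNet-50 FPN backbone.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = torch.rand(3, 480, 640)              # a single RGB image in [0, 1]
with torch.no_grad():
    pred = model([image])[0]                 # the model takes a list of images

# pred["boxes"], pred["labels"], pred["scores"], pred["masks"] hold the
# per-instance detections; masks are soft (N, 1, H, W) probability maps.
print(pred["masks"].shape)
```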
2. MaskLab – Builds upon Faster R-CNN but introduces a direction prediction branch. This additional component helps distinguish between multiple objects of the same category, improving instance segmentation accuracy.
III. CONTEXT-ENHANCED ARCHITECTURES
1. DeepLab – A family of models that leverage atrous (dilated) convolutions to capture multi-scale context without losing resolution. DeepLab also introduces atrous spatial pyramid pooling (ASPP), allowing the network to aggregate contextual information at multiple receptive fields. It has become a benchmark for robust semantic segmentation across natural image datasets.
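A stripped-down ASPP head can be sketched as parallel dilated convolutions over the same input (MiniASPP and its rates are illustrative, not DeepLab's exact configuration):

```python
import torch
import torch.nn as nn

# ASPP-style head: parallel 3x3 convolutions with different dilation rates
# see different receptive fields over the same feature map; their outputs
# are concatenated and fused by a 1x1 convolution.
class MiniASPP(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, rates=(1, 6, 12)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=r, dilation=r)
            for r in rates
        )
        self.project = nn.Conv2d(out_ch * len(rates), out_ch, kernel_size=1)

    def forward(self, x):
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))

feat = torch.randn(1, 256, 32, 32)
out = MiniASPP(256, 128)(feat)  # -> (1, 128, 32, 32), resolution unchanged
```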
2. PSPNet (Pyramid Scene Parsing Network) addresses the challenge of global context by introducing a pyramid pooling module. This module captures features at different spatial scales, improving segmentation performance for scenes with varying object sizes.
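The pyramid pooling idea fits in a few lines (bin sizes 1/2/3/6 follow the PSPNet paper; the helper itself is a simplified sketch):

```python
import torch
import torch.nn.functional as F

# Pyramid pooling: the feature map is average-pooled to several grid sizes,
# each pooled map is upsampled back, and everything is concatenated with
# the original features to mix local detail with global context.
def pyramid_pool(x, bins=(1, 2, 3, 6)):
    h, w = x.shape[-2:]
    pooled = [
        F.interpolate(F.adaptive_avg_pool2d(x, b), size=(h, w),
                      mode="bilinear", align_corners=False)
        for b in bins
    ]
    return torch.cat([x] + pooled, dim=1)

feat = torch.randn(1, 64, 48, 48)
out = pyramid_pool(feat)  # -> (1, 64 * 5, 48, 48)
```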
3. Path Aggregation Network (PANet) enhances the Feature Pyramid Network (FPN) with an extra bottom-up pathway that lets low-level detail flow upward through the pyramid. Fusing feature maps from multiple levels improves both localization and semantic understanding, which is why PANet performs strongly on object detection and segmentation tasks.
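A sketch of the bottom-up augmentation, assuming three FPN levels with 256 channels each (a real PANet uses a separate convolution per step; one is reused here for brevity):

```python
import torch
import torch.nn as nn

# Bottom-up path: starting from the finest FPN level, each level is
# downsampled with a strided convolution and added to the next FPN output,
# letting low-level detail climb the pyramid.
down = nn.Conv2d(256, 256, kernel_size=3, stride=2, padding=1)

p2 = torch.randn(1, 256, 64, 64)   # finest FPN level
p3 = torch.randn(1, 256, 32, 32)
p4 = torch.randn(1, 256, 16, 16)

n2 = p2
n3 = down(n2) + p3                 # (1, 256, 32, 32)
n4 = down(n3) + p4                 # (1, 256, 16, 16)
```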
IV. ATTENTION-DRIVEN ARCHITECTURES
1. Context Attention Modules – Attention mechanisms have been integrated into segmentation models to capture richer context:
(i) Multistage Context Refinement Network: introduces a context attention module with two components:
– Context feature extraction: captures both local and global context.
– Context feature refinement: filters out redundant information for cleaner representations.
The module is placed in the skip connections between encoder and decoder.
(ii) Covariance-Based Attention: Models dependencies between local and global features using either spatial covariance attention or channel covariance attention, improving context modeling.
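One plausible reading of channel covariance attention, sketched below: a C×C similarity matrix computed over spatial positions re-weights the channels. This follows the general pattern of channel attention modules, not any single paper's exact formulation:

```python
import torch

# Channel covariance attention (sketch): channel-by-channel similarities
# (a C x C Gram/covariance matrix over spatial positions) are normalised
# with softmax and used to reweight the channels themselves.
def channel_covariance_attention(x):
    b, c, h, w = x.shape
    flat = x.view(b, c, h * w)                       # (B, C, HW)
    cov = torch.bmm(flat, flat.transpose(1, 2))      # (B, C, C) similarities
    attn = torch.softmax(cov, dim=-1)
    out = torch.bmm(attn, flat).view(b, c, h, w)     # reweighted channels
    return out + x                                   # residual, as is typical

feat = torch.randn(2, 64, 16, 16)
print(channel_covariance_attention(feat).shape)      # torch.Size([2, 64, 16, 16])
```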
V. TRANSFORMER-BASED ARCHITECTURES
1. Vision Transformers (ViT) – Transformers have also made their way into segmentation:
(i) Standard ViT excels at capturing global relationships but struggles with local spatial features.
(ii) Global Context ViT enhances this by explicitly addressing local–global balance.
Additional modules, such as a transformer-scale gate, further improve multi-scale feature representation, boosting segmentation accuracy.
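The core of the ViT approach fits in a few lines: a strided convolution produces patch tokens, and multi-head self-attention mixes them globally (layer sizes here are illustrative; assumes PyTorch >= 1.9 for batch_first):

```python
import torch
import torch.nn as nn

# Minimal ViT-style front end: a strided convolution turns the image into a
# sequence of patch embeddings, and multi-head self-attention then relates
# every patch to every other patch, giving the global context that plain
# CNNs lack.
patch_embed = nn.Conv2d(3, 192, kernel_size=16, stride=16)   # 16x16 patches
attn = nn.MultiheadAttention(embed_dim=192, num_heads=3, batch_first=True)

img = torch.randn(1, 3, 224, 224)
tokens = patch_embed(img).flatten(2).transpose(1, 2)         # (1, 196, 192)
out, _ = attn(tokens, tokens, tokens)                        # global mixing
```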
Wrapping Up
The field of image segmentation has evolved rapidly, from encoder–decoder models like U-Net to attention-driven and transformer-based architectures. Each innovation — whether preserving fine-grained detail, refining context, or integrating global attention — pushes performance closer to human-level perception. As research continues, hybrid models combining CNNs and transformers are likely to dominate future segmentation benchmarks.
