
Computer Vision is the scientific subfield of AI concerned with developing algorithms to extract meaningful information from raw images, videos, and sensor data. This community is home to the academics and engineers both advancing and applying this interdisciplinary field, with backgrounds in computer science, machine learning, robotics, mathematics, and more.

We welcome everyone from published researchers to beginners!

Content which benefits the community (news, technical articles, and discussions) is valued over content which benefits only the individual (technical questions, help buying/selling, rants, etc.).

If you want an answer to a question, please post a legible, complete question that includes enough detail for us to help you properly!

Related Subreddits

Computer Vision Discord group

Computer Vision Slack group


86,391 Subscribers


CNN Blocks used for building modern CV algorithms

Hello everyone, I'm looking to better understand the use of convolutional blocks such as YOLOv8's bottleneck, SPPF, and C2f blocks. How are they constructed? Why are they built that way, and what led the creators to choose that particular order of layers?

I think I have a solid understanding of what each CNN layer does (Conv, Pooling, FC), but the concept of these blocks eludes me.

Any explanations or resources are appreciated.
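These blocks are best understood as small, reusable graphs of the layers you already know. Below is a minimal PyTorch sketch of a YOLOv8-style Bottleneck and C2f block; it is simplified from the Ultralytics implementation, so names and defaults here are illustrative rather than the exact library code:

```python
import torch
import torch.nn as nn

class ConvBNSiLU(nn.Module):
    """The basic YOLOv8 unit: Conv2d -> BatchNorm -> SiLU."""
    def __init__(self, c_in, c_out, k=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, 1, k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class Bottleneck(nn.Module):
    """Two 3x3 convs with a residual shortcut (cheap depth, stable gradients)."""
    def __init__(self, c):
        super().__init__()
        self.cv1 = ConvBNSiLU(c, c, 3)
        self.cv2 = ConvBNSiLU(c, c, 3)

    def forward(self, x):
        return x + self.cv2(self.cv1(x))

class C2f(nn.Module):
    """Split features in two, run n bottlenecks on one half,
    concatenate every intermediate result, then fuse with a 1x1 conv."""
    def __init__(self, c_in, c_out, n=1):
        super().__init__()
        self.c = c_out // 2
        self.cv1 = ConvBNSiLU(c_in, 2 * self.c, 1)
        self.cv2 = ConvBNSiLU((2 + n) * self.c, c_out, 1)
        self.m = nn.ModuleList(Bottleneck(self.c) for _ in range(n))

    def forward(self, x):
        y = list(self.cv1(x).chunk(2, dim=1))
        for m in self.m:
            y.append(m(y[-1]))                 # each bottleneck feeds the next
        return self.cv2(torch.cat(y, dim=1))   # dense, DenseNet-like feature reuse

x = torch.randn(1, 64, 32, 32)
out = C2f(64, 128, n=2)(x)
```

The design intuition: the residual shortcut keeps gradients flowing through deep stacks, and concatenating every intermediate output gives the fusing 1x1 conv access to features at several effective depths at low cost.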

18:08 UTC


Recommendations for calibration targets

I'm doing radial lens distortion correction on some fisheye lenses and am looking for a reasonably priced checkerboard target. It doesn't necessarily need to be a flat aluminum plate, but that would be ideal.

Where do you guys buy your calibration targets?

1 Comment
16:43 UTC


Is there a Python lib that can combine different-sized patches into multiple images?

Let's say I have 2000 patches, all of different sizes, and all smaller than 200x200 pixels. I need a routine that will combine the patches into multiple larger images. E.g. if I specify an output image size of 640x480, I want the routine to create multiple 640x480 images containing all the patches, creating as many of the larger images as is needed to contain all the patches.

edit: I managed to find an interesting solution - 2D bin packing, as used by the texture packers in 2D game development.
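For anyone landing here later, a toy version of that idea fits in a few lines of plain Python. This is a naive greedy "shelf" packer, not the smarter MaxRects/guillotine algorithms real texture packers use, and it assumes every patch fits inside one output image:

```python
def shelf_pack(patches, bin_w=640, bin_h=480):
    """Greedy shelf packing: sort patches tallest-first, fill shelves
    left-to-right, and open a new bin when the current one is full.
    Returns a list of bins; each bin is a list of (x, y, w, h) placements.
    Assumes each patch fits inside a single bin."""
    patches = sorted(patches, key=lambda wh: -wh[1])
    bins = [[]]
    x = y = shelf_h = 0
    for w, h in patches:
        if x + w > bin_w:            # shelf full: start a new shelf
            x, y = 0, y + shelf_h
            shelf_h = 0
        if y + h > bin_h:            # bin full: start a new bin
            bins.append([])
            x = y = shelf_h = 0
        bins[-1].append((x, y, w, h))
        x += w
        shelf_h = max(shelf_h, h)
    return bins

placements = shelf_pack([(120, 80), (200, 150), (60, 60), (300, 200)])
```

Pasting the patches is then just copying each one into its (x, y) slot of a blank canvas, creating a new canvas per bin.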

16:26 UTC


grid splits up annotations

Hello, I'm using a model that detects plants. When I run the model, it looks like there is a grid that splits the annotations up into 2 or 4 pieces; these should be one annotation. Does anyone know how to get rid of this problem? The pictures are in 4K.

15:19 UTC


Exploring Diffusion Models for Furniture Generation in Virtual Staging - Seeking Advice!

I'm interested in using diffusion models for generating furniture in virtual staging projects. I'm hoping to create realistic and diverse furniture images that can be placed within empty room photos. However, I'm facing some challenges and would appreciate any insights or guidance you might have.

Here's my current understanding:

  • Diffusion models offer exciting possibilities for generating images.
  • One challenge is automatically creating a segmentation mask that determines which areas to inpaint.
  • Challenges include fine-tuning for realistic furniture, realistic placement, and controlling style/detail.

I'm interested in:

  • Exploring existing tools or research on this specific application of diffusion models.
  • Learning about techniques for fine-tuning, positioning, and style control.
  • Understanding the feasibility and limitations of this approach.

Questions for the community:

  • Has anyone attempted using diffusion models for furniture generation in virtual staging?
  • Are there specific models or datasets recommended for this task?
  • What resources (research papers, tutorials, tools) would you recommend exploring?
  • Any general advice or suggestions for tackling this challenge?

I'm open to learning and appreciate any assistance you can provide!

I've attached a sample figure of how I've generated segmentation mask to find out the area to in paint.

I've used object detection to detect doors and windows, subtracted those areas from a "base mask", and used the result as my area to inpaint. But this method is simplistic. I've also used Runway ML's diffusion model, but it doesn't understand the layout of the furniture properly.
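The box-subtraction step described above can be sketched in a few lines of NumPy (the function name and box format here are hypothetical, just to illustrate the idea):

```python
import numpy as np

def inpaint_mask(height, width, boxes):
    """Start from a full 'base mask' of the room, then zero out detected
    regions (doors, windows) given as (x1, y1, x2, y2) pixel boxes.
    The remaining 1s are the area handed to the inpainting model."""
    mask = np.ones((height, width), dtype=np.uint8)
    for x1, y1, x2, y2 in boxes:
        mask[y1:y2, x1:x2] = 0
    return mask

m = inpaint_mask(480, 640, [(100, 50, 200, 300)])
```

As the post says, this is simplistic: the mask knows nothing about floor planes or room layout, which is where depth- or segmentation-conditioned approaches (ControlNet-style) usually come in.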


15:12 UTC


Resume Review

Could you please review my resume and point out the weaknesses?

Page 1

Page 2

10:37 UTC


Is it possible to add features in Ultralytics Hub application?

Is it possible to add features to the Ultralytics Hub application, e.g. making the phone vibrate when an object is detected within a certain radius? And is it possible to integrate it with Google Maps so that it can compute the shortest path? Thank you.

10:26 UTC


Unet model result

I saw someone convert a binary mask vector to a (128, 128, 2) matrix using tensorflow.keras.utils.to_categorical(mask, num_classes=2), use a softmax at the model's output for a binary semantic segmentation problem, and get about 98% accuracy in only 5 epochs. With the same data, I used the 128x128x1 masks directly with a sigmoid output. When I train my U-Net, I get good results in 50 epochs. How can I improve my model to get results like this person's in 5 epochs? Any suggestions?

Here is that person's model:

class UNet(Model):
    """
    Implementation of the U-Net model for image segmentation.

    The class provides methods to create the input tensor, apply convolutional
    and deconvolutional layers, perform pooling and merging operations, and
    define the U-Net model architecture.

    Args:
        img_height (int): Height of the input tensor.
        img_width (int): Width of the input tensor.

    Returns:
        U-Net model

    Example:
        unet = UNet(img_height=256, img_width=256)
        model = unet.model()
    """

    def __init__(self, img_height, img_width):
        super(UNet, self).__init__()
        self.img_height = img_height
        self.img_width = img_width

    def conv_block(self, tensor, nfilters, size=3, padding='same', initializer="he_normal"):
        x = Conv2D(filters=nfilters, kernel_size=(size, size), padding=padding, kernel_initializer=initializer)(tensor)
        x = BatchNormalization()(x)
        x = Activation("relu")(x)
        x = Conv2D(filters=nfilters, kernel_size=(size, size), padding=padding, kernel_initializer=initializer)(x)
        x = BatchNormalization()(x)
        x = Activation("relu")(x)
        return x

    def deconv_block(self, tensor, residual, nfilters, size=3, padding='same', strides=(2, 2)):
        y = Conv2DTranspose(nfilters, kernel_size=(size, size), strides=strides, padding=padding)(tensor)
        y = concatenate([y, residual], axis=3)
        y = self.conv_block(y, nfilters)
        return y

    def model(self, nclasses=2, filters=32):
        # down
        input_layer = Input(shape=(self.img_height, self.img_width, 3), name='image_input')
        conv1 = self.conv_block(input_layer, nfilters=filters)
        conv1_out = MaxPooling2D(pool_size=(2, 2))(conv1)
        conv2 = self.conv_block(conv1_out, nfilters=filters*2)
        conv2_out = MaxPooling2D(pool_size=(2, 2))(conv2)
        conv3 = self.conv_block(conv2_out, nfilters=filters*4)
        conv3_out = MaxPooling2D(pool_size=(2, 2))(conv3)
        conv4 = self.conv_block(conv3_out, nfilters=filters*8)
        conv4_out = MaxPooling2D(pool_size=(2, 2))(conv4)
        conv4_out = Dropout(0.5)(conv4_out)
        conv5 = self.conv_block(conv4_out, nfilters=filters*16)
        conv5 = Dropout(0.5)(conv5)
        # up
        deconv6 = self.deconv_block(conv5, residual=conv4, nfilters=filters*8)
        deconv6 = Dropout(0.5)(deconv6)
        deconv7 = self.deconv_block(deconv6, residual=conv3, nfilters=filters*4)
        deconv7 = Dropout(0.5)(deconv7)
        deconv8 = self.deconv_block(deconv7, residual=conv2, nfilters=filters*2)
        deconv9 = self.deconv_block(deconv8, residual=conv1, nfilters=filters)
        # output
        output_layer = Conv2D(filters=nclasses, kernel_size=(1, 1))(deconv9)
        output_layer = BatchNormalization()(output_layer)
        output_layer = Activation('softmax')(output_layer)
        model = Model(inputs=input_layer, outputs=output_layer, name='Unet')
        return model

And Here is my model:

# Building Unet by dividing encoder and decoder into blocks

from keras.models import Model
from keras.layers import Input, Conv2D, MaxPooling2D, UpSampling2D, concatenate, Conv2DTranspose, BatchNormalization, Dropout, Lambda
from keras.optimizers import Adam
from keras.layers import Activation, MaxPool2D, Concatenate

def conv_block(input, num_filters):
    x = Conv2D(num_filters, 3, padding="same")(input)
    x = BatchNormalization()(x)  # Not in the original network.
    x = Activation("relu")(x)
    x = Conv2D(num_filters, 3, padding="same")(x)
    x = BatchNormalization()(x)  # Not in the original network
    x = Activation("relu")(x)
    return x

# Encoder block: conv block followed by max pooling
def encoder_block(input, num_filters):
    x = conv_block(input, num_filters)
    p = MaxPool2D((2, 2))(x)
    return x, p

# Decoder block
# skip_features gets input from the encoder for concatenation
def decoder_block(input, skip_features, num_filters):
    x = Conv2DTranspose(num_filters, (2, 2), strides=2, padding="same")(input)
    x = Concatenate()([x, skip_features])
    x = conv_block(x, num_filters)
    return x

# Build Unet using the blocks
def build_unet(input_shape, n_classes, filters=32):
    droprate = 0.25
    inputs = Input(input_shape)
    s1, p1 = encoder_block(inputs, 64)
    s2, p2 = encoder_block(p1, 128)
    # p2 = Dropout(droprate)(p2)
    s3, p3 = encoder_block(p2, 256)
    # p3 = Dropout(droprate)(p3)
    s4, p4 = encoder_block(p3, 512)
    # p4 = Dropout(droprate)(p4)
    b1 = conv_block(p4, 1024)  # Bridge
    d1 = decoder_block(b1, s4, 512)
    # d1 = Dropout(droprate)(d1)
    d2 = decoder_block(d1, s3, 256)
    # d2 = Dropout(droprate)(d2)
    d3 = decoder_block(d2, s2, 128)
    # d3 = Dropout(droprate)(d3)
    d4 = decoder_block(d3, s1, 64)
    if n_classes == 1:  # Binary
        activation = 'sigmoid'
    else:
        activation = 'softmax'
    outputs = Conv2D(n_classes, 1, activation=activation, padding="same")(d4)  # Change the activation based on n_classes
    # outputs = BatchNormalization()(outputs)
    # outputs = Activation(activation)(outputs)
    model = Model(inputs, outputs, name="U-Net")
    return model
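One thing worth noting before chasing the 5-epoch result: for two classes, a softmax over a (128, 128, 2) output and a sigmoid over a (128, 128, 1) output are mathematically the same model, as the small check below shows. The gap is more likely explained by the extra dropout, the different filter counts, or how "accuracy" is computed on imbalanced masks:

```python
import numpy as np

# For two classes, the softmax probability of class 1 equals sigmoid(z1 - z0):
# e^z1 / (e^z0 + e^z1) = 1 / (1 + e^(z0 - z1))
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax_class1(z0, z1):
    e0, e1 = np.exp(z0), np.exp(z1)
    return e1 / (e0 + e1)

z0, z1 = 0.3, 1.7                      # arbitrary per-pixel logits
p_softmax = softmax_class1(z0, z1)
p_sigmoid = sigmoid(z1 - z0)
```

So the choice of output head cannot by itself explain a 5-epoch vs 50-epoch difference; pixel accuracy on mostly-background masks can also look like 98% while the segmentation is poor, so comparing IoU/Dice would be more informative.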

10:15 UTC


Extract glyph, font, and color accurately

However I configure Tesseract and other tools, they don't even extract the glyphs right, whereas Google Lens gets it right. I want a tool, any tool, that will accurately extract the glyph, font, and color (along with the necessary configuration).

09:57 UTC


How to create/handle .model and .marker files for CylinderTag

Hello, I want to use the CylinderTag code to track cylindrical objects, but I can't figure out how to create the required .marker or .model files. The marker generator of CylinderTag creates .bmp files that I don't know how to convert into .marker files.


Can anyone tell me if there are any other solutions for tracking cylindrical objects, or how to create the required .marker and .model files?

09:57 UTC


Does anyone know how Hawk-Eye works?

How do they calibrate the cameras and get the extrinsic parameters? What type of cameras can be used for moving-ball detection? And how do they synchronise the cameras? I am new to 3D geometry and multi-view geometry. My guess is that they obtain the intrinsic parameters with a checkerboard and the extrinsic parameters via PnP using ~20 fixed points, but when I tried recreating that for my own understanding, I couldn't get good accuracy. Please enlighten me.
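For the extrinsic part of the question, the textbook recipe can be checked end-to-end on synthetic data before debugging real measurements. The sketch below estimates the full 3x4 projection matrix from 3D-2D correspondences with the Direct Linear Transform, using only NumPy; with real, noisy points you would instead feed your intrinsics and the fixed court points to something like cv2.solvePnP (ideally with RANSAC):

```python
import numpy as np

def dlt_projection(world_pts, img_pts):
    """Estimate the 3x4 projection matrix P from >= 6 non-degenerate
    3D <-> 2D correspondences via the Direct Linear Transform.
    A NumPy-only sketch; with noisy real points, prefer cv2.solvePnP."""
    A = []
    for (X, Y, Z), (u, v) in zip(world_pts, img_pts):
        A.append([X, Y, Z, 1, 0, 0, 0, 0, -u * X, -u * Y, -u * Z, -u])
        A.append([0, 0, 0, 0, X, Y, Z, 1, -v * X, -v * Y, -v * Z, -v])
    _, _, Vt = np.linalg.svd(np.asarray(A, dtype=float))
    return Vt[-1].reshape(3, 4)   # null-space vector = P up to scale

# Synthetic check: project known 3D points with a known camera, then recover P.
K = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
Rt = np.hstack([np.eye(3), np.array([[0.1], [-0.2], [5.0]])])
P_true = K @ Rt
world = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1],
                  [1, 1, 0], [1, 0, 1], [0, 1, 1], [2, 1, 1]], dtype=float)
h = np.hstack([world, np.ones((8, 1))])
proj = (P_true @ h.T).T
img = proj[:, :2] / proj[:, 2:]
P_est = dlt_projection(world, img)
reproj = (P_est @ h.T).T
err = np.abs(reproj[:, :2] / reproj[:, 2:] - img).max()
```

In practice, poor accuracy usually traces back to skipping coordinate normalization (Hartley normalization) before the DLT, using nearly coplanar reference points, or not refining with a nonlinear minimization of reprojection error afterwards.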

1 Comment
09:49 UTC


What are some foundational papers in CV that every newcomer should read?

My thoughts: "Attention is All You Need" by Ashish Vaswani et al. (2017): This paper introduced the Transformer architecture, which revolutionized natural language processing and has also impacted CV tasks like image captioning and object detection.

"DETR: End-to-End Object Detection with Transformers" by Nicolas Carion et al. (2020): This paper proposed DETR, a Transformer-based model that achieved state-of-the-art performance in object detection without relying on traditional hand-crafted features.

"Diffusion Models Beat Real-to-Real Image Generation" by Aditya Ramesh et al. (2021): This paper presented diffusion models, a novel approach to image generation that has achieved impressive results in tasks like generating realistic images from text descriptions.

04:21 UTC


Looking for a job to start my career in Computer vision and image processing. NEED HELP AND GUIDANCE.

Hey everyone,

I am a master's student and I am looking for a job/internship in the field of computer vision and image processing. I am open to Internships/ Contract roles/Full-time roles. I am flexible about relocation anywhere in US. I am looking for a position where I can start my career and learn from the people who are working with the current and latest technologies. If any of you are hiring please let me know, I would love to chat about what I could bring to the table. Any leads would be really appreciated.

Thank you everyone.

02:15 UTC


Tiny nerf questions

In volume rendering, what is the difference between transmittance for every sample point (alpha as defined in the paper) versus cumulative transmittance for every sample point (weights as defined in the paper)?
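In the paper's notation, alpha_i is the opacity of sample i alone, while the cumulative transmittance T_i is the probability the ray has survived all earlier samples; the compositing weight is their product. A tiny NumPy sketch for one ray (the density values are made up):

```python
import numpy as np

# alpha_i = 1 - exp(-sigma_i * delta_i)  -> opacity of sample i by itself
# T_i     = prod_{j < i} (1 - alpha_j)   -> probability the ray reaches sample i
# w_i     = T_i * alpha_i                -> contribution of sample i to the pixel
sigma = np.array([0.1, 0.5, 2.0, 4.0])   # densities along one ray
delta = np.full(4, 0.25)                 # spacing between samples
alpha = 1.0 - np.exp(-sigma * delta)
T = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
weights = T * alpha                      # used to blend the sample colors
```

So alpha is purely local, while T (and hence the weights) encodes everything in front of the sample: a point behind dense geometry gets near-zero weight even if its own alpha is high.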

19:50 UTC


Hobby project - a mobile phone app that runs constantly, observes the sky for birds, and records their trajectories.

Looking to build a project that observes the sky from a stationary phone lying flat, looking "up" at the sky. Any moving object should be recognised and tracked, with its direction and speed recorded. I'm also wondering whether a clip-on fisheye lens would help improve the field of view. Any advice on the phone model is welcome as well; or, if a phone is a really poor choice due to performance, a suggestion for an alternative platform.

Please share your thoughts on which toolkit and libraries are best to use. There are two use cases: the first is not time-sensitive (just logging); the other needs to make a decision ASAP (for example, send a text or activate an alarm).

1 Comment
18:49 UTC


Powerboxes 0.2.2 is here 😎

Recently I shared in this community a small lib that I'm working on. Since then, it has evolved quite a bit, most notably with support for rotated-box metrics 🎉 New metrics were also introduced.

As a reminder (or an introduction): powerboxes is a Python lib written in Rust for fast bounding-box manipulation.

Feedback is welcome! Thanks for your time :)

17:34 UTC


Facial Similarity Script

I've got a Python script working fairly well. It uses InsightFace and the buffalo_l model. It takes a reference image and compares it to a folder of thousands of images, aligning them and then giving each a similarity score. However, given that it only uses 5 landmarks, I believe it could be more accurate. I imagine it's just taking these few distances into account, and none of the other information it could glean from comparing more landmarks.

As a result I looked into the insightface/alignment/coordinate_reg/image_infer.py example. It uses the landmark_2d_106 model. And after some trial and error, I got it working correctly and drawing 106 landmarks on my images.

But now I don't know how to utilize these landmarks. face_align.norm_crop() seems to only work with 5 landmarks. Ultimately, I'm trying to align and score similarity as accurately as possible, taking as many facial features into account as possible. How can I do this, given that I'm not very adept at coding and have struggled even to get this far? Maybe the answer isn't InsightFace but a different library? Maybe something already exists and I'm reinventing the wheel?

Thank you!
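One standard way to use the full 106-point set is to estimate a similarity transform between two landmark sets (the Umeyama/Procrustes method) and then compare the aligned shapes. The NumPy sketch below shows the alignment step on synthetic points, independent of any InsightFace API; the function name is our own:

```python
import numpy as np

def procrustes_align(src, dst):
    """Align landmark set src (N, 2) onto dst (N, 2) with a similarity
    transform (rotation + uniform scale + translation), Umeyama-style."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    s, d = src - mu_s, dst - mu_d
    U, S, Vt = np.linalg.svd(d.T @ s)      # cross-covariance of centered sets
    R = U @ Vt
    if np.linalg.det(R) < 0:               # guard against reflections
        U[:, -1] *= -1
        R = U @ Vt
    scale = S.sum() / (s ** 2).sum()
    return (scale * (R @ s.T)).T + mu_d

# Synthetic check: recover a known rotation/scale/translation exactly.
rng = np.random.default_rng(0)
src = rng.normal(size=(6, 2))
theta = 0.5
R_true = np.array([[np.cos(theta), -np.sin(theta)],
                   [np.sin(theta),  np.cos(theta)]])
dst = 2.0 * src @ R_true.T + np.array([3.0, -1.0])
aligned = procrustes_align(src, dst)
```

After alignment, the residual per-landmark distances (or an embedding computed on the aligned crop) give a similarity score that uses all 106 points rather than 5.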

16:39 UTC


CLIP Embeddings for Document Classification?

I am evaluating CLIP embeddings for a specific use case which is classifying pages within a financial document, specifically those pages corresponding to the distinct financial statements (Balance Sheet, etc.). I am not using regex because the financial statement text varies significantly across companies. Additionally, a document can have up to 100+ pages, while only 3-6 might correspond to the financial statements. Could I train CLIP on a custom dataset to identify those pages within the documents?

I have also contemplated other methodologies such as a vision language model, however, that obviously comes with heavier computation costs. Let me know if this is a feasible use case for CLIP, or if there is a more efficient solution I could look into!

16:04 UTC


How can C++ be best utilized when integrating a PyTorch model into a product?

I'm seeking a computer vision job and am practicing skills for the product development stage, where many sources suggest C++ is essential.

How can C++ be best utilized when integrating a PyTorch model into a product? I've discovered LibTorch, TorchScript, and C++ + ONNX as approaches. Are there others? Which would be most beneficial to practice?

Additionally, I'm curious about the specific capabilities sought by real companies in this field.

Thank you for reading!

15:00 UTC


Computer vision engineer

Hello everyone, I am a senior computer vision engineer in Egypt and have worked with multiple startups to build solutions like AutoML (no/low-code computer vision), auto-annotation pipelines, medical imaging, etc.

I am wondering about

  • the possibility of working in Germany without getting a master's degree in AI, since per my search that seemed to be the easiest way to get a job there;

  • also, what are the best other countries to work in as a computer vision engineer?


1 Comment
14:38 UTC


How to do cloth-body collisions handling when the body can move?

Hey everyone, I'm currently working on a cloth simulation project and trying to implement cloth-body collision handling. The body can move, so the cloth has to follow when the body moves.

Does anyone have any experience or advice on how to do this? Any resources or insights would be greatly appreciated! Thanks in advance for your help!

1 Comment
13:53 UTC


If I want to perform segmentation as post-processing for a classification problem, which method is the most effective?

So I’m basically I’m building a classification model that takes in an image of a fruit and outputs whether it’s fresh and rotten. I want to perform segmentation in post-processing and output the degree of rottenness if rotten. I’ve been toying with several segmentation techniques but the results haven’t been good enough. It keeps mistaking shadows as part of the rot among other problems and it looks as most segmentation techniques aren’t up to par. I’m thinking of trying unet but I don’t have annotated masks.

12:28 UTC


"Generative Models: What do they know? Do they know things? Let's find out!". Quote from paper: "Our findings reveal that all types of the generative models we study contain rich information about scene intrinsics [normals, depth, albedo, and shading] that can be easily extracted using LoRA."

10:13 UTC


Research Topic Ideas

What are some topics I can pursue related to semantic segmentation? What are the latest research trends for this computer vision task?

08:36 UTC


Understanding and Implementing “Defeating line-noise CAPTCHAs with multiple quadratic snakes”

I am trying to learn the underlying maths discussed in this paper, and I'm absolutely enthralled with the results of their experiments.

I would like to implement what is shown in the paper in order to learn how it works, but I'm having trouble understanding the proofs and concepts that support their methodology. Can anyone point me in the right direction? The paper is linked below:


08:34 UTC


Experimenting with 3D reconstruction

Hey, I'm looking into methods for 3D reconstruction, and I was wondering if any of you had suggestions for algorithms I could poke around with for dense reconstruction. I was thinking maybe NeRF, or RealityCapture?

01:59 UTC


Computer vision remote positions

Where can computer vision engineers or developers find global remote positions?

21:07 UTC


Could someone please help figure out this line of code for KITTI odometry dataset handler?

I'm referring to the code from slambook2's ch13 dataset.cpp on GitHub. In particular, the dataset gives the projection matrix, which the author decomposes to determine the intrinsics, baseline, etc. But for some reason, he also multiplies the intrinsic matrix by 0.5 after computing it, and for the life of me, I can't figure out why.

The relevant section is here:

Mat33 K;
K << projection_data[0], projection_data[1], projection_data[2],
     projection_data[4], projection_data[5], projection_data[6],
     projection_data[8], projection_data[9], projection_data[10];
Vec3 t;
t << projection_data[3], projection_data[7], projection_data[11];
t = K.inverse() * t;
K = K * 0.5;  // <-------- This right here!

Data looks like this:
P0: 7.188560000000e+02 0.000000000000e+00 6.071928000000e+02 0.000000000000e+00 0.000000000000e+00 7.188560000000e+02 1.852157000000e+02 0.000000000000e+00 0.000000000000e+00 0.000000000000e+00 1.000000000000e+00 0.000000000000e+00
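Decomposing that P0 row in NumPy shows what the code is doing up to the mysterious line; a likely explanation for the final scaling (worth confirming against the repo) is that slambook2 resizes the input frames to half resolution before tracking, so fx, fy, cx, cy must be halved to stay consistent with the smaller images:

```python
import numpy as np

# P0 from the calib file, reshaped to 3x4: P = K [R | t], with R = I here.
projection_data = [718.856, 0.0, 607.1928, 0.0,
                   0.0, 718.856, 185.2157, 0.0,
                   0.0, 0.0, 1.0, 0.0]
P = np.array(projection_data).reshape(3, 4)
K = P[:, :3]                       # intrinsic matrix
t = np.linalg.inv(K) @ P[:, 3]     # baseline translation (zero for camera 0)
K_half = K * 0.5                   # the puzzling step: intrinsics for half-size images
```

Note that scaling the whole matrix also scales the bottom-row 1, but the repo only reads K(0,0), K(1,1), K(0,2), K(1,2) out of the scaled matrix afterwards, so only the focal lengths and principal point matter.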

20:27 UTC


Seat Occupancy Detection

I am in trouble and really need your help!

I am looking for a ready-to-use seat occupancy detection model. I need it very soon, so your help is highly appreciated!

20:25 UTC
