/r/computervision
Computer Vision is the scientific subfield of AI concerned with developing algorithms to extract meaningful information from raw images, videos, and sensor data. This community is home to the academics and engineers both advancing and applying this interdisciplinary field, with backgrounds in computer science, machine learning, robotics, mathematics, and more.
We welcome everyone from published researchers to beginners!
Content which benefits the community (news, technical articles, and discussions) is valued over content which benefits only the individual (technical questions, help buying/selling, rants, etc.).
If you want an answer to a query, please post a legible, complete question with enough detail that we can actually help you!
I have a task that involves processing receipts: recognizing the country, retailer, products, and prices. From what I’ve read, this falls under the umbrella of Intelligent Document Processing (IDP).
Are there any flexible, ready-made solutions for this that won’t hit a complexity wall? Or is it still a case of having to build the entire pipeline from scratch?
Would love to hear your recommendations or experiences! Thanks!
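For context on what a minimal, rule-based starting point can look like before reaching for a full IDP product: the sketch below just OCRs the receipt with Tesseract and pulls out price-like lines with a regex. The file name and the heuristics are assumptions; real receipts will need per-country/per-retailer rules or a learned model on top.

import re
import pytesseract
from PIL import Image

def parse_receipt(path):
    # OCR the receipt image into plain text (assumes Tesseract is installed)
    text = pytesseract.image_to_string(Image.open(path))
    lines = [l.strip() for l in text.splitlines() if l.strip()]
    # Very rough heuristic: a line ending in a price is treated as a product line
    price_pattern = re.compile(r"(.+?)\s+(\d+[.,]\d{2})\s*$")
    items = []
    for line in lines:
        m = price_pattern.match(line)
        if m:
            items.append({"product": m.group(1), "price": m.group(2)})
    # The retailer name is often the first non-empty line; country needs extra logic
    retailer = lines[0] if lines else None
    return {"retailer": retailer, "items": items}

print(parse_receipt("receipt.jpg"))  # hypothetical input file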
Hi everyone,
I'm looking to train a custom face detection model using a dataset with images and XML annotation files. I plan to use PyTorch for training and save the model as a .pth file. I also want to run the trained model in real time on a webcam feed (e.g., with OpenCV).
Can anyone recommend comprehensive tutorials or resources that cover both training the detector on the annotated dataset and running it in real time on a webcam?
I’d really appreciate any guidance or links to detailed tutorials on these steps!
Thanks in advance!
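For the webcam half of the question, the inference loop is usually the easy part. A minimal sketch, assuming the .pth file holds weights for a torchvision Faster R-CNN fine-tuned with a single "face" class (adjust the architecture and file name to whatever you actually train):

import cv2
import torch
import torchvision

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Architecture must match the one used for training (assumption: Faster R-CNN, 2 classes)
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights=None, num_classes=2)
model.load_state_dict(torch.load("face_detector.pth", map_location=device))
model.to(device).eval()

cap = cv2.VideoCapture(0)  # default webcam
while True:
    ok, frame = cap.read()
    if not ok:
        break
    # BGR -> RGB, HWC -> CHW, [0, 255] -> [0, 1]
    tensor = torch.from_numpy(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)).permute(2, 0, 1).float() / 255.0
    with torch.no_grad():
        pred = model([tensor.to(device)])[0]
    for box, score in zip(pred["boxes"], pred["scores"]):
        if score > 0.7:
            x1, y1, x2, y2 = map(int, box.tolist())
            cv2.rectangle(frame, (x1, y1), (x2, y2), (0, 255, 0), 2)
    cv2.imshow("faces", frame)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
cap.release()
cv2.destroyAllWindows()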
Has anyone worked on a project where data quality was the main hindrance? How did you get through it?
What was the most unusual or unexpected computer vision project you’ve been involved in? Here are two from my experience:
What about you?
Hey guys, I'm trying to make a simple recommender that suggests a furniture item from a set of furniture for a particular spot in a room. I tried cosine similarity between the spot and the furniture, but it doesn't work as expected. What other approaches can I try?
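In case it's the representation rather than the metric that's failing: cosine similarity over raw pixels rarely works, but over deep features it often does. A minimal sketch, assuming you have crops of the empty spot and of each candidate item (file names are placeholders):

import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

preprocess = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()  # use the 2048-d pooled features as the embedding
backbone.eval()

@torch.no_grad()
def embed(path):
    x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    return torch.nn.functional.normalize(backbone(x), dim=1)

spot = embed("spot_crop.jpg")  # hypothetical crop of the empty spot
candidates = {name: embed(name) for name in ["sofa.jpg", "chair.jpg", "table.jpg"]}
scores = {name: float(spot @ emb.T) for name, emb in candidates.items()}  # cosine similarity
print(max(scores, key=scores.get), scores)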
I'm running into an issue with a CV project where we're able to read HD ProRes files at 200-300 fps, but moving to UHD takes us down to ~20-30 fps. This is with heavy multithreading on an M1 Ultra Mac Studio, fully maxing out the processor with OpenCV. I'd expect roughly 1/4 the performance, but 1/10 seems excessive.
Has anyone worked out a way to use the hardware Media Engine for ProRes acceleration? The same machine can read and write UHD ProRes in Resolve at over 300 fps. Processing the actual CV tasks is extremely fast (200+ fps) after the read, but the read itself is very slow. I tried ffmpeg-python, and it was comparable to multithreaded cv2.
If anyone has found a library that can use the Mac's Media Engine to get a NumPy array of the uncompressed frames, I'd be eternally grateful.
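One thing that may be worth benchmarking is piping raw frames from the ffmpeg CLI instead of going through OpenCV's reader. A sketch is below; whether the -hwaccel videotoolbox flag actually engages the Media Engine for ProRes decode depends on your ffmpeg build and macOS version, so treat it as best-effort rather than a guaranteed fix.

import subprocess
import numpy as np

def read_frames(path, width, height):
    # Ask ffmpeg to decode (hardware if available) and emit raw RGB frames on stdout
    cmd = [
        "ffmpeg", "-hwaccel", "videotoolbox",
        "-i", path,
        "-f", "rawvideo", "-pix_fmt", "rgb24", "-",
    ]
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.DEVNULL,
                            bufsize=width * height * 3 * 4)
    frame_bytes = width * height * 3
    while True:
        buf = proc.stdout.read(frame_bytes)
        if len(buf) < frame_bytes:
            break
        yield np.frombuffer(buf, dtype=np.uint8).reshape(height, width, 3)
    proc.wait()

for frame in read_frames("clip_uhd.mov", 3840, 2160):  # hypothetical file
    pass  # run the CV processing here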
I am trying Watershed and K-means with various pre-processing steps, but the segmentation is not up to the mark. For edge detection I used multi-scale edge detection with CLAHE and Canny, but the results are very bad.
The images are not very crowded, but not very simple either.
(This is just for my learning purposes, exploring with traditional CV segmentation methods)
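For anyone following along, here is the standard OpenCV marker-based watershed recipe with CLAHE added up front, roughly in the spirit of what the post describes; the thresholds and kernel sizes are assumptions that need tuning per image set.

import cv2
import numpy as np

img = cv2.imread("sample.jpg")  # hypothetical input
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
gray = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8)).apply(gray)

# Binarize and clean up small noise
_, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
opening = cv2.morphologyEx(binary, cv2.MORPH_OPEN, np.ones((3, 3), np.uint8), iterations=2)

# Sure background / sure foreground / unknown region
sure_bg = cv2.dilate(opening, np.ones((3, 3), np.uint8), iterations=3)
dist = cv2.distanceTransform(opening, cv2.DIST_L2, 5)
_, sure_fg = cv2.threshold(dist, 0.5 * dist.max(), 255, 0)
sure_fg = sure_fg.astype(np.uint8)
unknown = cv2.subtract(sure_bg, sure_fg)

# Markers: label foreground components, mark the unknown band as 0, run watershed
_, markers = cv2.connectedComponents(sure_fg)
markers = markers + 1
markers[unknown == 255] = 0
markers = cv2.watershed(img, markers)
img[markers == -1] = [0, 0, 255]  # watershed boundaries in red
cv2.imwrite("segmented.jpg", img)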
Hi, everyone. Hope you're doing well. I'm from Argentina and I'm doing some digital image processing work applied to football. I want to extract information about the distance a player travelled during a match, and for that I recorded some videos of a football match. The camera couldn't capture the whole pitch, even with a fisheye lens, so I had to record two videos of the match.
The main problem is that the two images warp a lot, which makes the post-processing very difficult.
I have two videos, from the left and right sides of a football pitch, and I want to join them. I used this stitching implementation:
https://github.com/OpenStitching/stitching
I load two images from the same instant of the match, run the feature detector, and then match the features from the two images. The next step is warping the images, and here is the main problem: the images bend a lot in order to match the features.
The final result is like this:
I would like to know if there's something better, in terms of less warping, that I could use. I found a page that joins the two images with less distortion:
I'm still not sure how they achieve that; any help would be very much appreciated! Here is the code for the homography approach I described above:
https://github.com/agusrol/homography_football/blob/main/homography_final.ipynb
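For readers, the pipeline described above boils down to something like the plain OpenCV sketch below (SIFT matching, RANSAC homography, warp). A single planar homography over such a wide field of view will always stretch the far side; projecting both views onto a cylinder or sphere before compositing, as most panorama tools do, is the usual way to reduce the bending. File names are placeholders.

import cv2
import numpy as np

left = cv2.imread("left_frame.jpg")    # hypothetical frame grabs
right = cv2.imread("right_frame.jpg")

sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(cv2.cvtColor(left, cv2.COLOR_BGR2GRAY), None)
kp2, des2 = sift.detectAndCompute(cv2.cvtColor(right, cv2.COLOR_BGR2GRAY), None)

# Ratio-test matching
matcher = cv2.BFMatcher()
good = []
for pair in matcher.knnMatch(des1, des2, k=2):
    if len(pair) == 2 and pair[0].distance < 0.75 * pair[1].distance:
        good.append(pair[0])

# Homography that maps right-image points onto the left image
src = np.float32([kp2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
dst = np.float32([kp1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)

# Warp the right image into the left image's frame and paste the left image on top
h, w = left.shape[:2]
panorama = cv2.warpPerspective(right, H, (w * 2, h))
panorama[:h, :w] = left
cv2.imwrite("panorama.jpg", panorama)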
Hi everyone! I'm kind of new to CV and I wanted to try my hand at building an image classifier for the original Pokémon starters (Bulbasaur, Charmander, and Squirtle). I started building my dataset by downloading images of every Pokémon card of the three starters, but there is a surprising lack of images for those three: I only got about 20-30 different images for each. I plan to use data augmentation to increase the dataset size, but I think it will still be far too small.
If anyone has advice on ways I can increase my dataset size, that would be fantastic.
Note: I haven't included any pictures from the show or fan art yet, since I had some concerns about adding those kinds of pictures to my dataset.
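On the augmentation side, a heavier torchvision transform stack is easy to bolt on; with only 20-30 base images per class it won't add real information, but combined with a frozen pretrained backbone it does help against overfitting. A minimal sketch:

import torchvision.transforms as T

# Heavier-than-default augmentation for a tiny dataset; tune strengths to taste
train_transforms = T.Compose([
    T.RandomResizedCrop(224, scale=(0.7, 1.0)),
    T.RandomHorizontalFlip(),
    T.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3, hue=0.05),
    T.RandomRotation(15),
    T.ToTensor(),
    T.RandomErasing(p=0.25),  # operates on tensors, so it goes after ToTensor
])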
Anyone been to any good networking events in the UK (especially London)? Cheers!
Hi everyone! Sorry if my question is trivial, but I can't really understand this matter.
I know that Perspective-n-Point is very useful for finding the camera pose with respect to an object (and thus the object pose relative to the camera), but I can't really work out how to automatically select the most important points to get a 2D-3D correspondence and effectively use PnP.
I'll give an example: I have the 3D models of some objects, and I can manually select some points on them. Then, with a Blender add-on, I can select the same points on a photo and compute the correct camera pose. However, I don't understand how to automate this process (without manually choosing the points) to match the 2D and 3D points. Some research pointed me toward feature detection and matching in images, but I don't see how to implement them for this.
Thanks in advance!
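The usual way to automate the 2D-3D correspondences is to match features between the photo and a reference view of the object whose keypoints already have known 3D coordinates (for example, looked up from the depth buffer of a Blender render of the model), then feed the matched pairs to solvePnPRansac. A hedged sketch, with file names and intrinsics as placeholders:

import cv2
import numpy as np

ref_img = cv2.imread("reference.png", cv2.IMREAD_GRAYSCALE)   # rendered view of the model
query_img = cv2.imread("photo.jpg", cv2.IMREAD_GRAYSCALE)     # real photo
# 3D model coordinate for each reference keypoint, e.g. looked up from the render's
# depth map when the reference keypoints were extracted (assumed precomputed)
ref_points_3d = np.load("reference_points_3d.npy")            # shape (N, 3)

orb = cv2.ORB_create(2000)
ref_kp, ref_des = orb.detectAndCompute(ref_img, None)
qry_kp, qry_des = orb.detectAndCompute(query_img, None)

matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(matcher.match(ref_des, qry_des), key=lambda m: m.distance)[:200]

obj_pts = np.float32([ref_points_3d[m.queryIdx] for m in matches])  # 3D points on the model
img_pts = np.float32([qry_kp[m.trainIdx].pt for m in matches])      # matched 2D points in the photo

K = np.array([[800, 0, 320], [0, 800, 240], [0, 0, 1]], dtype=np.float64)  # placeholder intrinsics
ok, rvec, tvec, inliers = cv2.solvePnPRansac(obj_pts, img_pts, K, None)
print(ok, rvec.ravel(), tvec.ravel())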
There are a handful of reasons why computer vision models achieve low mean Average Precision (mAP) scores. One way to overcome this challenge is by using synthetic image datasets.
But where do you find the best service provider?
Great news!
I just launched my online directory of service providers in the synthetic image data generation and simulation industry. (More listings coming this week)
It's free to browse all providers, and I would appreciate it if you could check it out and share your feedback with me.
Link: https://www.inkmanworkshop.com/
Thank you!
-Eli
Hi everyone,
I'm new to computer vision and I've been stuck on this problem for several hours.
I'm trying to detect 32 keypoints on a badminton court: all line intersections and the feet of the net. For that I used PyTorch with a pretrained ResNet-50 model. I have a dataset of annotated images, and I train the model like this:
import json
import cv2
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms
from torchvision.models import resnet50, ResNet50_Weights

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

class KeypointsDataset(Dataset):
    def __init__(self, img_dir, data_file):
        self.img_dir = img_dir
        with open(data_file, "r") as f:
            self.data = json.load(f)
        self.transforms = transforms.Compose([
            transforms.ToPILImage(),
            transforms.Resize((224, 224)),
            transforms.ToTensor(),
            transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
        ])

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        item = self.data[idx]
        img = cv2.imread(f"{self.img_dir}/{item['id']}.jpeg")
        if img is None:
            raise ValueError(f"Image {item['id']} could not be loaded.")
        h, w = img.shape[:2]
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        img = self.transforms(img)
        # Ensure keypoints are consistent
        kps = np.array(item['kps']).flatten()
        if len(kps) != 64:  # Check that we have 32 keypoints (x, y)
            raise ValueError(f"Expected 64 keypoint values, but got {len(kps)} for item {item['id']}")
        kps = kps.astype(np.float32)
        kps[::2] *= 224.0 / w   # Adjust x coordinates to the 224x224 space
        kps[1::2] *= 224.0 / h  # Adjust y coordinates to the 224x224 space
        return img, kps

train_dataset = KeypointsDataset("data/images", "data/data_train.json")
val_dataset = KeypointsDataset("data/images", "data/data_val.json")
train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=8, shuffle=True)

# Load the model with pretrained weights and replace the classification head
weights = ResNet50_Weights.DEFAULT
model = resnet50(weights=weights)
model.fc = torch.nn.Linear(model.fc.in_features, 32 * 2)
model = model.to(device)

criterion = torch.nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

epochs = 100
best_val_loss = float("inf")
patience = 10  # Number of epochs to wait for improvement before stopping
trigger_times = 0

for epoch in range(epochs):
    # Training phase (switch back to train mode after the eval pass below)
    model.train()
    for i, (imgs, kps) in enumerate(train_loader):
        imgs = imgs.to(device)
        kps = kps.to(device)
        optimizer.zero_grad()
        outputs = model(imgs)
        loss = criterion(outputs, kps)
        loss.backward()
        optimizer.step()
        if i % 10 == 0:
            print(f"Epoch {epoch}, iter {i}, loss: {loss.item()}")

    # Validation phase
    model.eval()
    val_loss = 0.0
    with torch.no_grad():
        for imgs, kps in val_loader:
            imgs = imgs.to(device)
            kps = kps.to(device)
            outputs = model(imgs)
            loss = criterion(outputs, kps)
            val_loss += loss.item()
    val_loss /= len(val_loader)
    print(f"Epoch {epoch}, validation loss: {val_loss}")

    # Early stopping logic
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        trigger_times = 0
        torch.save(model.state_dict(), "best_keypoints_model.pth")  # Save best model
    else:
        trigger_times += 1
        if trigger_times >= patience:
            print(f"Early stopping triggered at epoch {epoch}")
            break
When I call the model, I scale the keypoints back to the original frame size like this:
import cv2
import torch
from torchvision import models, transforms

class CourtLineDetector:
    def __init__(self, model_path):
        self.model = models.resnet50(weights=None)
        self.model.fc = torch.nn.Linear(self.model.fc.in_features, 32 * 2)  # Adjust for the number of keypoints
        self.model.load_state_dict(torch.load(model_path, map_location='cpu'))
        self.model.eval()
        self.transform = transforms.Compose([
            transforms.ToPILImage(),
            transforms.Resize((224, 224)),
            transforms.ToTensor(),
            transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
        ])

    def predict(self, frame):
        img_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        image_tensor = self.transform(img_rgb).unsqueeze(0)
        with torch.no_grad():
            outputs = self.model(image_tensor)
        keypoints = outputs.squeeze().cpu().numpy()
        original_height, original_width = frame.shape[:2]
        # Scale keypoints back to the original frame dimensions
        keypoints[::2] *= original_width / 224.0
        keypoints[1::2] *= original_height / 224.0
        return keypoints

    def draw_keypoints(self, image, keypoints):
        # Plot keypoints on the image
        for i in range(0, len(keypoints), 2):
            x = int(keypoints[i])
            y = int(keypoints[i + 1])
            cv2.putText(image, str(i // 2), (x, y - 10), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 0, 255), 2)
            cv2.circle(image, (x, y), 5, (0, 0, 255), -1)
        return image

    def draw_keypoints_on_video(self, video_frames, keypoints):
        output_video_frames = []
        for frame in video_frames:
            frame = self.draw_keypoints(frame, keypoints)
            output_video_frames.append(frame)
        return output_video_frames
But, as you can see, the dots look roughly in the right position but at the wrong scale.
I've already tried several methods to correct this, like padding the picture before training (and again at inference) to keep the correct scale, but that didn't work either.
My original dataset is really simple:
[
  {
    "id": "badminton_court_keypoints_167",
    "metric": 0.0,
    "kps": [
      [385, 353],
      ...
    ]
  },
  ...
]
Furthermore, if you think there is a better way to achieve the same results, I'd be happy to hear it; I want to understand this field better.
Thank you so much for taking the time to help me.
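One thing worth checking, since the post mentions padding: transforms.Resize((224, 224)) changes the aspect ratio, so the per-axis scaling applied to the keypoints in training has to be inverted exactly at inference. Below is a minimal letterbox (aspect-preserving) sketch of that idea; it is not taken from the post's code, and the function names are just illustrative.

import cv2
import numpy as np

def letterbox(img, kps, size=224):
    # Resize with a single scale factor and pad to a square canvas
    h, w = img.shape[:2]
    scale = size / max(h, w)
    new_w, new_h = int(round(w * scale)), int(round(h * scale))
    resized = cv2.resize(img, (new_w, new_h))
    pad_x, pad_y = (size - new_w) // 2, (size - new_h) // 2
    canvas = np.zeros((size, size, 3), dtype=img.dtype)
    canvas[pad_y:pad_y + new_h, pad_x:pad_x + new_w] = resized
    # Apply the exact same transform to the keypoints
    kps = kps.astype(np.float32).copy()
    kps[::2] = kps[::2] * scale + pad_x    # x coordinates
    kps[1::2] = kps[1::2] * scale + pad_y  # y coordinates
    return canvas, kps, scale, pad_x, pad_y

def unletterbox(kps, scale, pad_x, pad_y):
    # Exact inverse: subtract the padding, divide by the same scale
    kps = kps.copy()
    kps[::2] = (kps[::2] - pad_x) / scale
    kps[1::2] = (kps[1::2] - pad_y) / scale
    return kps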
Hi, I am wondering if it is possible to build an app that allows users to take a picture of themselves in a mirror that automatically adds their clothing items into an inventory, ideally a photo of the item and tags such as clothing category, color, fabric etc. Then the user should be able to mix and match the items in the wardrobe and see the outfit on a realistic avatar of themselves. Is this possible from a technological standpoint? If so, could you recommend which concepts, resources etc. to look into? What kind of expertise would one need to develop this product? Please bear in mind that I have engineering background and coding experience but I don't know anything about computer vision itself.
I am in the process of writing a survey paper that explores the real-world challenges of AI-driven visual sensing across various sectors, such as wearable devices (VR/AR/smart glasses), construction, mining, oil, robotics, retail, and more. My focus is on the limitations and constraints posed by camera technology in these applications. Any insights or contributions on this topic would be greatly appreciated!
I have a skybox panorama image (a 360° view composed of bottom/up/left/right/front/back views). I also have the camera position and rotation vectors, and I've noticed that the rotation vector corresponds to the camera's "bottom" view.
Given the bottom-view rotation vector, I want to calculate the rotation vectors for all the other views (left/right/up/front/back), starting with the "left" view. The problem is that if I only manipulate the Y axis and rotate it 90 degrees, objects that should be on the bottom end up on the left side of the image rendered from the camera's perspective, and if I additionally rotate 90 degrees around the Z axis, objects that should be on the bottom end up slightly to the right, if that makes sense.
As I understand it, this happens because the Z axis rotates along with the Y rotation and is no longer aligned. Is there a way to properly calculate the rotations for the panorama views?
P.S. Sorry if I explained that poorly; I'll try to create an example if what I'm saying doesn't make sense.
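For what it's worth, problems like this usually come from adjusting Euler angles axis by axis instead of composing full rotations. A small sketch with SciPy's Rotation, where the 90° turn is applied in the camera's own frame; the exact axis and the multiplication order depend on your coordinate convention (world-to-camera vs. camera-to-world), so treat the values as placeholders.

import numpy as np
from scipy.spatial.transform import Rotation as R

# Rotation vector (axis-angle) of the "bottom" view, as in the question
r_bottom = R.from_rotvec([0.1, 0.2, 0.3])  # placeholder values

# Compose with a fixed 90-degree turn about the camera's local Y axis to get the
# "left" view. Right-multiplication applies the extra turn in the body frame;
# if the result looks mirrored or flipped, try left-multiplication or another axis.
r_left = r_bottom * R.from_euler("y", 90, degrees=True)

print(r_left.as_rotvec())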
This blog post explores how CV solutions are used in autonomous driving. It is not a technical post, but it can be useful for general knowledge. If you have more interesting use cases or projects, please add them to the thread; it would be very useful to me. Thank you.
Hi, I've been trying to implement edge detection using convolution (Sobel operator) in C, using the pthreads library for multithreading, but when I compare the execution times of the program with and without threads, the threaded version is actually slower. I've tried reducing the number of threads (so there's less overhead to create and manage them) and increasing the image resolution, but nothing worked.
Here is my code:
#include <stdlib.h>
#include <pthread.h>

/* Assumed definitions, reconstructed from how the structs are used below. */
typedef struct {
    int size;
    int **data;
} Kernel;

typedef struct {
    int width;
    int height;
    int **data;
} Image;

typedef struct {
    int **input_image;
    int **output_image;
    Kernel k;
    int width;
    int height;
    int start_row;
    int end_row;
} ThreadData;

void *thread_convolution(void *arg);

Image mul_convolution(int **input_image, Kernel k, int width, int height) {
    int num_threads = 4; // Number of threads
    pthread_t threads[num_threads];
    ThreadData thread_data[num_threads];
    int rows_per_thread = height / num_threads;

    Image img;
    img.width = width;
    img.height = height;
    img.data = (int **)malloc(sizeof(int *) * height);
    for (int i = 0; i < height; i++) {
        img.data[i] = (int *)malloc(sizeof(int) * width);
    }

    for (int i = 0; i < num_threads; i++) {
        thread_data[i].input_image = input_image;
        thread_data[i].k = k;
        thread_data[i].width = width;
        thread_data[i].height = height;
        thread_data[i].start_row = i * rows_per_thread;
        thread_data[i].end_row = (i == num_threads - 1) ? height : (i + 1) * rows_per_thread;
        /* Write directly into the output image's rows. The original code allocated a
           separate buffer per thread that was never copied back into img.data, so the
           returned image stayed uninitialized (and those buffers leaked). */
        thread_data[i].output_image = &img.data[thread_data[i].start_row];
        pthread_create(&threads[i], NULL, thread_convolution, &thread_data[i]);
    }
    for (int i = 0; i < num_threads; i++) {
        pthread_join(threads[i], NULL);
    }
    return img;
}

void *thread_convolution(void *arg) {
    ThreadData *data = (ThreadData *)arg;
    int **input_image = data->input_image;
    Kernel k = data->k;
    int width = data->width;
    int height = data->height;
    int start_row = data->start_row;
    int end_row = data->end_row;

    for (int i = start_row; i < end_row; i++) {
        for (int j = 0; j < width; j++) {
            int sum = 0;
            /* Slide the kernel over the neighbourhood, skipping out-of-bounds pixels. */
            for (int m = 0; m < k.size; m++) {
                for (int n = 0; n < k.size; n++) {
                    int x = i + m - k.size / 2;
                    int y = j + n - k.size / 2;
                    if (x >= 0 && x < height && y >= 0 && y < width) {
                        sum += input_image[x][y] * k.data[m][n];
                    }
                }
            }
            data->output_image[i - start_row][j] = sum;
        }
    }
    pthread_exit(NULL);
}
Hi all, I'm looking for some papers that are beginner-friendly (I am only familiar with basic neural network concepts) that discuss the process of combining multiple perspectives of a photo into a 3D model.
Ideally, I'm looking for something that supports calibration beforehand, so that the reconstruction is as quick as possible.
Right now, I need to do a literature survey and would like some help finding a good direction. All the papers I've found were way too complicated for my skill level and I couldn't get through them at all.
Here's a simple diagram to illustrate what I'm trying to look into: https://imgur.com/a/MJue7I2
Thanks!
Hi Folks,
I have a use case where I am given two images; let's call them IMAGE1 and IMAGE2. My task is to select an object from IMAGE1 (by selection, I mean obtaining the segmented mask of the object) and place this segmented object naturally into IMAGE2, in a masked region provided by the user. We have to ensure that the object from IMAGE1 is naturally blended into IMAGE2. Can someone shed light on what might be the best model or group of models to do this?
Example: place a tree from IMAGE1 into IMAGE2 (a group of people taking a selfie on a grassland), with smooth edge blending, shadows, etc.
Dataset: For now, I chose to work with the COCO dataset, a subset of 60K images.
Since inpainting has many techniques, it's confusing which set of models I should pipeline for my use case to get a good, realistic, natural image.
I have explored the following techniques but could not settle on one strategy.
Partial convolutions
Generative adversarial networks (GANs)
Autoencoders
Diffusion models
Context-based attention models, etc.
Thanks for checking on my post. Please provide some insights if you have some experience or ideas working on such use cases.
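As a point of reference before the generative options listed above: a classical baseline for the compositing step is Poisson blending via OpenCV's seamlessClone, which handles the smooth edge blending (though not relighting or shadows). A minimal sketch, with file names and the paste position as placeholders:

import cv2
import numpy as np

src = cv2.imread("image1.jpg")   # IMAGE1, contains the object (e.g. the tree)
dst = cv2.imread("image2.jpg")   # IMAGE2, the selfie on the grassland
# Binary mask of the object, same size as IMAGE1 (e.g. from a segmentation model)
mask = cv2.imread("object_mask.png", cv2.IMREAD_GRAYSCALE)
mask = (mask > 127).astype(np.uint8) * 255

# Center of the user-selected region in IMAGE2 where the object should land.
# The masked region of src, placed at this center, must fit inside dst; in practice
# you would crop src and mask to the object's bounding box first.
center = (dst.shape[1] // 2, dst.shape[0] // 2)  # placeholder position

blended = cv2.seamlessClone(src, dst, mask, center, cv2.NORMAL_CLONE)
cv2.imwrite("composited.jpg", blended)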
I was wondering if there's anything that could make the plates in this video clearer. Any help would be greatly appreciated.
Fine-Tune Mask RCNN PyTorch on Custom Dataset
https://debuggercafe.com/fine-tune-mask-rcnn-pytorch-on-custom-dataset/
Instance segmentation is an exciting topic with many use cases. It combines object detection and image segmentation to provide a complete solution, and it is already making a mark in fields like agriculture and medical imaging: crop monitoring and tumor segmentation are practical applications where it is extremely useful. But fine-tuning an instance segmentation model on a custom dataset often proves difficult, partly because of the complex training pipeline and partly because good, customizable training code is hard to find. To tackle this, in this article we will learn how to fine-tune the PyTorch Mask RCNN model on a small custom dataset.
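For reference, the standard torchvision recipe for adapting Mask R-CNN to a custom number of classes looks roughly like the sketch below (this is the usual torchvision approach, presumably close to what the linked article walks through):

import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor
from torchvision.models.detection.mask_rcnn import MaskRCNNPredictor

def get_model(num_classes):
    # Start from COCO-pretrained weights
    model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
    # Replace the box predictor with one sized for our classes
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, num_classes)
    # Replace the mask predictor as well
    in_features_mask = model.roi_heads.mask_predictor.conv5_mask.in_channels
    model.roi_heads.mask_predictor = MaskRCNNPredictor(in_features_mask, 256, num_classes)
    return model

model = get_model(num_classes=2)  # background + one object class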
Hi everyone!
I built a pipeline combining a reflectance-based deep Retinex model with Real-ESRGAN to enhance low-light images. The Retinex model separates the image into reflectance and illumination components, allowing brightness and contrast to be adjusted based on predicted coefficients. This helps improve visibility in low-light images while keeping details natural. After that, I realized I was mostly just recreating a paper, so I tried improving it with Real-ESRGAN, which steps in to upscale the images, adding super-resolution for clearer, higher-quality results.
The model has shown decent results in handling challenging low-light conditions by producing images with better visibility and refined details. If you're interested, I’ve shared the code here: Project.
I still wasn't able to exactly reproduce the results from the paper. But the final image is clearer, with a lot less noise than even the ground truth in some places.
Here's an example:
I’d love any feedback or thoughts for improvement using this method.
P.S. I'm only a grad student, take it easy on me xD
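For readers unfamiliar with the Retinex decomposition the post refers to: the idea is to model the image as reflectance times illumination and adjust only the illumination. The toy sketch below illustrates that idea with a crude blurred-max illumination estimate and a hand-set gamma; it is not the author's model, which predicts the adjustment coefficients with a network.

import cv2
import numpy as np

img = cv2.imread("low_light.jpg").astype(np.float32) / 255.0  # hypothetical input
# Crude illumination estimate: per-pixel max over channels, heavily blurred
illumination = cv2.GaussianBlur(img.max(axis=2), (0, 0), sigmaX=15)
illumination = np.clip(illumination, 1e-3, 1.0)
reflectance = img / illumination[..., None]

gamma = 0.45  # < 1 brightens dark regions; a learned model would predict this per image
enhanced = np.clip(reflectance * (illumination ** gamma)[..., None], 0, 1)
cv2.imwrite("enhanced.jpg", (enhanced * 255).astype(np.uint8))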
Has anyone written custom code to calculate metrics like precision, recall, and the confusion matrix for YOLO segmentation? I have my predicted label files, but since I've modified the way I'm getting inference results, the default val function in Ultralytics doesn't work for me anymore. Any advice on implementing these metrics for a custom YOLO segmentation format would be really helpful!
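In case it helps, here is a hedged sketch of computing per-class TP/FP/FN (and from them precision and recall) directly from YOLO-style polygon label files, i.e. lines of "class x1 y1 x2 y2 ..." with normalized coordinates, by rasterizing the polygons and matching predictions to ground truth by mask IoU. Paths, image size, and the IoU threshold are assumptions.

import numpy as np
import cv2

def load_masks(label_path, w, h):
    # Rasterize each polygon line of a YOLO-seg label file into a binary mask
    masks = []
    for line in open(label_path):
        parts = line.split()
        cls = int(parts[0])
        pts = np.array(parts[1:], dtype=np.float32).reshape(-1, 2) * [w, h]
        mask = np.zeros((h, w), dtype=np.uint8)
        cv2.fillPoly(mask, [pts.astype(np.int32)], 1)
        masks.append((cls, mask))
    return masks

def mask_iou(a, b):
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

def evaluate(gt, pred, iou_thr=0.5):
    # Greedy IoU matching within each class; returns per-class TP, FP, FN counts
    tp, fp, fn = {}, {}, {}
    used = set()
    for cls, gmask in gt:
        best_iou, best_j = 0.0, -1
        for j, (pcls, pmask) in enumerate(pred):
            if pcls == cls and j not in used:
                iou = mask_iou(gmask, pmask)
                if iou > best_iou:
                    best_iou, best_j = iou, j
        if best_iou >= iou_thr:
            used.add(best_j)
            tp[cls] = tp.get(cls, 0) + 1
        else:
            fn[cls] = fn.get(cls, 0) + 1
    for j, (pcls, _) in enumerate(pred):
        if j not in used:
            fp[pcls] = fp.get(pcls, 0) + 1
    return tp, fp, fn

gt = load_masks("labels/img001.txt", 640, 640)          # hypothetical paths / image size
pred = load_masks("predictions/img001.txt", 640, 640)
tp, fp, fn = evaluate(gt, pred)
for cls in sorted(set(tp) | set(fp) | set(fn)):
    p = tp.get(cls, 0) / max(tp.get(cls, 0) + fp.get(cls, 0), 1)
    r = tp.get(cls, 0) / max(tp.get(cls, 0) + fn.get(cls, 0), 1)
    print(f"class {cls}: precision={p:.3f} recall={r:.3f}")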
I'm struggling to find a pose estimation model that is accurate enough to estimate poses consistently on sports footage (single person, 30 fps, 17 keypoints).
Do you have any tricks/tips for video post-processing to increase accuracy?
Thanks!
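One cheap post-processing trick that often helps with single-person footage is temporal smoothing of each keypoint track, e.g. confidence-gated interpolation followed by a Savitzky-Golay filter. A small sketch; the (frames, 17, 3) layout with per-keypoint confidence is an assumption about your model's output.

import numpy as np
from scipy.signal import savgol_filter

def smooth_keypoints(kps, conf_thr=0.3, window=9, poly=2):
    # kps: array of shape (num_frames, 17, 3) with (x, y, confidence) per keypoint
    kps = kps.copy()
    t = np.arange(len(kps))
    for j in range(kps.shape[1]):
        good = kps[:, j, 2] > conf_thr
        if good.sum() < 2 or len(kps) < window:
            continue  # not enough reliable frames to smooth this joint
        for c in range(2):  # x and y
            track = kps[:, j, c]
            # Fill low-confidence frames by interpolating from confident ones
            track = np.interp(t, t[good], track[good])
            kps[:, j, c] = savgol_filter(track, window, poly)
    return kps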
I've looked around and haven't found any of the 5K monitors I'm interested in on display. The only retailer that carries anything anymore is Best Buy, and I live in LA. They do have the LG 45" OLED, which is big and beautiful in person, although probably too curved, not much of a hub, and sold as a gaming monitor. The size is nice, being tall AND wide! I'm not a gamer except for some FPV drone simulation on occasion.
What I am is a Mac creative who works in Photoshop, InDesign, Illustrator, and a fair amount of Premiere. I'm looking for a combination of color accuracy, size (but not a fan of narrow 49" monitors), and resolution. I'm currently on an iMac 27", which is what I'm used to with its 5K resolution, and sometimes text is hard to read. Because I have a 23" sidecar monitor, I can't mount a VESA arm and pull it close to my face when needed. However, I do prefer to keep the monitor a little farther from my face for eyeball-tanning's sake. 5K resolution comes in really handy since I'm often using screen grabs.
What I like about the Dell is the resolution, the hub with ample USB-C ports, and the ambient light sensor. But Dell is not a name I associate with computer monitors. I'm also a fan of OLED screens. My TV is an LG OLED and it's been sweet! I like the idea of the screen emitting the light rather than an array of LEDs behind it. I see that LG has a 5K OLED coming in 2025/26.
I'm still debating between an M2 Studio Ultra and an M4 Mini; if you'd like to chime in on that, feel free. If I found a screamin' deal on an M2 Ultra Studio, I'd probably get that. This next computer will likely be a placeholder until the M4 Ultra/Studio, or whatever Apple does next, is released, so an M4 Mini might have better resale value when that time comes.
So with Black Friday looming, is it worth the extra scratch for the Dell or the LG 40"? Or would I be happy with an LG OLED 38" or 45"?