/r/computervision
Computer Vision is the scientific subfield of AI concerned with developing algorithms to extract meaningful information from raw images, videos, and sensor data. This community is home to the academics and engineers both advancing and applying this interdisciplinary field, with backgrounds in computer science, machine learning, robotics, mathematics, and more.
We welcome everyone from published researchers to beginners!
Content which benefits the community (news, technical articles, and discussions) is valued over content which benefits only the individual (technical questions, help buying/selling, rants, etc.).
If you want an answer to a query, please post a legible, complete question that includes details so we can help you in a proper manner!
Hello everyone,
I was reading the nnU-Net paper: https://www.nature.com/articles/s41592-020-01008-z (arXiv version: https://arxiv.org/abs/1809.10486) and I was wondering whether I can find a pre-trained version of their model. Specifically, I'm looking for the nnU-Net instances they themselves trained, since I want to work with the same Medical Decathlon dataset, conducting knowledge distillation specifically.
I found their GitHub: https://github.com/MIC-DKFZ/nnUNet. They provide details on how to train and run inference in general. I was originally going to train an nnU-Net on the existing Medical Decathlon dataset, but that would be redoing work that has already been done. Does anyone know how I can find the trained model instances they worked with? I thought about emailing the authors, but I don't know how acceptable that kind of request is in the CV community.
nnUNetV1 would also be fine.
Would be grateful for any advice.
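For the distillation part itself, here is a minimal sketch of a generic distillation loss for per-voxel segmentation logits (plain Hinton-style KD, nothing specific to nnU-Net's training pipeline; the temperature and weighting are placeholder values):

```
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, target, T=2.0, alpha=0.5):
    # soft targets from the frozen teacher, softened by temperature T
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    # standard supervised loss against the ground-truth labels
    hard = F.cross_entropy(student_logits, target)
    return alpha * soft + (1 - alpha) * hard
```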
Hi, I'm looking for papers that go into more depth on these components mentioned in my professor's lecture. The closest thing I have found is the data association discussed in the ORB-SLAM3 paper, but it does not cover loop closure in the local map or the global map.
My professor said this is discussed in many papers, but I have so far not found one.
Train S3D Video Classification Model using PyTorch
https://debuggercafe.com/train-s3d-video-classification-model/
PyTorch (Torchvision) provides a host of pretrained video classification models. Training and fine-tuning these models can prove to be an invaluable asset in building many real-life applications. However, preparing the right code to start with custom video classification training can be difficult. In this article, we will train the S3D video classification model from PyTorch. Along the way, we will discuss the pitfalls, caveats, and optimization techniques specific to the model.
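As a taste of what the article covers, here is a short sketch of loading the pretrained S3D from torchvision and swapping its classification head for a custom number of classes (attribute names follow torchvision 0.14+ as far as I recall, and the 10-class head is just a placeholder, so check against your own version):

```
import torch
from torchvision.models.video import s3d, S3D_Weights

# load the Kinetics-400 pretrained weights
model = s3d(weights=S3D_Weights.KINETICS400_V1)

# replace the final 1x1x1 conv classifier for a hypothetical 10-class dataset
num_classes = 10
model.classifier[1] = torch.nn.Conv3d(1024, num_classes, kernel_size=1, stride=1)

# S3D expects clips shaped (batch, channels, frames, height, width)
clip = torch.randn(1, 3, 16, 224, 224)
with torch.no_grad():
    print(model(clip).shape)  # torch.Size([1, 10])
```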
Hello, everyone! I'm currently working on a YOLO vision project using the pose model, and I'm having some issues estimating distance using only a camera. I know it might sound a bit arbitrary, but this is the solution we have for now while I wait for a LiDAR sensor I ordered last week. Since I'm in a Latin American country, it may take a month to arrive.
Right now, we're estimating distance by using the focal length with the person facing the camera, and it seems to be working well, with an error margin of around 20-25 cm. Here’s the code we're using:
```
float YoloV8::estimateDistance(const cv::Rect_<float>& bbox, const std::vector<float>& keypoints) {
    // If we have valid keypoints, use the shoulder distance
    if (!keypoints.empty() && keypoints.size() >= 21) { // Ensure we have enough keypoints
        // Get shoulder coordinates (5 and 6 in COCO format)
        float shoulder1X = keypoints[12];    // Right shoulder X (5 * 3)
        float shoulder2X = keypoints[15];    // Left shoulder X (6 * 3)
        float shoulder1Conf = keypoints[14]; // Right shoulder confidence
        float shoulder2Conf = keypoints[17]; // Left shoulder confidence

        // If both shoulders are detected with sufficient confidence
        if (shoulder1Conf > KPS_THRESHOLD && shoulder2Conf > KPS_THRESHOLD) {
            float shoulderWidth = std::abs(shoulder2X - shoulder1X);
            if (shoulderWidth > 0) {
                return (AVERAGE_SHOULDER_WIDTH * CAMERA_FOCAL_LENGTH_SHOULDERS) / shoulderWidth;
            }
        }
    }
    // Fallback to the original method if we cannot use shoulders
    return (AVERAGE_PERSON_WIDTH * CAMERA_FOCAL_LENGTH) / bbox.width;
}
```
The issue I’m currently facing is with profile views; the distance calculation becomes inaccurate, returning values that don't make sense.
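For anyone following along, here is the same pinhole-model idea in a few lines of Python, mainly to make the one-time calibration of the focal-length constant explicit (the numbers below are illustrative placeholders, not the constants from the C++ above):

```
AVERAGE_SHOULDER_WIDTH_M = 0.41  # assumed real-world shoulder width in metres

def calibrate_focal_length(pixel_width, known_distance_m, real_width_m):
    # stand a person at a measured distance once and record their pixel width
    return pixel_width * known_distance_m / real_width_m

def estimate_distance(pixel_width, focal_length_px, real_width_m):
    # pinhole model: distance = real_width * f / pixel_width
    return real_width_m * focal_length_px / pixel_width

f_px = calibrate_focal_length(pixel_width=180, known_distance_m=2.0,
                              real_width_m=AVERAGE_SHOULDER_WIDTH_M)
print(estimate_distance(pixel_width=120, focal_length_px=f_px,
                        real_width_m=AVERAGE_SHOULDER_WIDTH_M))
```

Profile views break the assumption that the apparent pixel width corresponds to the full real shoulder width, which is why the estimate drifts there.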
Background: Worked in the research labs of McGill University and IISC Bangalore in the fields of CV, ML, Robotics and IoT
Tech stacks: PyTorch, OpenCV, Mediapipe, ROS, puredata, C++
Currently looking for contract-based projects. If you are a professional looking to delegate your work, or a college student looking to get your final-year project done at an industrial level, feel free to contact me for my portfolio/profile.
I am trying to figure out who my target market is, and it would be extremely helpful if some of you could fill out this survey for me. We will be releasing more information about it in the future. I think you all will love it, developers and hobbyists alike. https://forms.gle/6KzCHZskboepSpWQ6
The paper aims to shorten acquisition time, reduce costs, and accelerate the deployment of imaging devices.
https://openreview.net/pdf?id=MloaGA6WwX
Contributions:
We expect further applications to similar data types, e.g. data efficiency on multi-channel images, other hyperspectral/multispectral applications, cell microscopy, and weather and climate data.
Code is available, PM me if interested.
Hi all. I am an MBA student at Temple University and we are doing our final project on machine and computer vision. I would be grateful if you could fill out this survey and, if possible, send it to anyone else who works in manufacturing. We are looking for opinions both from people who currently use vision systems and from those who do not. Here is the link to the survey: https://fox.az1.qualtrics.com/jfe/form/SV_0cEBnNUQ9jnxZpI
Alternatively if you would like to do a short interview on your experiences, this would also be much appreciated.
Thanks so much!
Hi all:
I am reading a paper and need to thoroughly understand it. This is the paper: https://ieeexplore.ieee.org/abstract/document/6983606
I can pay. If anyone here is well versed in this and can read through and thoroughly understand/help me implement this, please DM me. Thanks!
I'm looking for a model like gdino, with some sort of open-vocabulary/zero-shot support, but preferably one that is faster (and maybe smaller/less resource-intensive). I looked into yolo-world, but it didn't support the open-vocab part quite the way I wanted (e.g. instead of detecting all apples in a scene, I want to detect "apple on table", which gdino is much better at than yolo-world from what I've tested).
Or should I just maybe fine-tune yolo world to do what I want it to do?
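One lighter option that still accepts free-form phrases is OWL-ViT through the HuggingFace zero-shot-object-detection pipeline; whether it handles queries like "apple on table" as well as gdino is something to verify on your own data, so treat this as a sketch:

```
from transformers import pipeline
from PIL import Image

detector = pipeline("zero-shot-object-detection", model="google/owlvit-base-patch32")
image = Image.open("scene.jpg")  # placeholder path

# the candidate labels can be short phrases, not just class names
for det in detector(image, candidate_labels=["apple on table"], threshold=0.1):
    print(det["label"], round(det["score"], 3), det["box"])
```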
Title
Tried out omniparser and it's pretty decent, but it misses some stuff. Also, I'd like something that can recognize boxes / layouts instead of just icons / text
We are making a robot for you: https://forms.gle/ggVetcDios9m15yV8
Hi!
I recently started my adventure with computer vision. I wrote some code that was supposed to run the YOLO algorithm on the GPU (I am using NVIDIA CUDA), but I encountered a whole lot of errors trying to open video files with it, and I'm still having some problems: it seems to have trouble reading the frames. The code is down below. I spent a couple of hours with ChatGPT and scrolling through the internet in search of help, but nothing worked :( I also checked the directory and the video resolution, and they seem to be fine. Do you have any idea how to fix it? I will be grateful for any kind of help!
P.S. FFmpeg seems to have no problem locating and opening the file from the command prompt.
```
import random
import threading
import cv2 as cv
import numpy as np
from ultralytics import YOLO
import torch
import time
import subprocess as sp
import os

cv.setNumThreads(1)

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

# Load the class list
with open("utils/coco.txt", "r") as my_file:
    class_list = my_file.read().strip().split("\n")

detection_colors = [(random.randint(0, 255), random.randint(0, 255), random.randint(0, 255)) for _ in range(len(class_list))]

model = YOLO("weights/yolov8n.pt", "v8").to(device)


def read_video_ffmpeg(path, frame_wid=640, frame_hyt=480):
    command = ['ffmpeg', '-loglevel', 'error', '-i', path, '-f', 'image2pipe', '-pix_fmt', 'bgr24', '-vcodec', 'rawvideo', '-']
    print("Running FFmpeg command:", " ".join(command))  # Debug print
    pipe = sp.Popen(command, stdout=sp.PIPE, stderr=sp.PIPE, bufsize=10**8)
    while True:
        raw_image = pipe.stdout.read(frame_wid * frame_hyt * 3)
        print(f"Raw image length: {len(raw_image)}")

        # Check if FFmpeg gave any error
        err = pipe.stderr.read().decode()
        if err:
            print("FFmpeg error:", err)
            break

        if not raw_image:
            print("End of video stream or no data received.")
            break

        try:
            frame = np.frombuffer(raw_image, dtype='uint8').reshape((frame_hyt, frame_wid, 3))
            yield frame
        except ValueError as e:
            print(f"Error reshaping frame: {e}")
            break

    pipe.stdout.close()
    pipe.stderr.close()
    pipe.terminate()


class Video:
    def __init__(self, src="D:/OPENCV/videos/DJI_0302.MP4"):
        self.src = src

        # Check if the video file exists
        if not os.path.isfile(self.src):
            print(f"Error: Video file does not exist at path: {self.src}")
            return  # Stop initializing if file does not exist

        self.frame_wid = 2720  # Update frame width
        self.frame_hyt = 1536  # Update frame height
        self.frame_gen = read_video_ffmpeg(self.src, self.frame_wid, self.frame_hyt)
        self.frame = None
        self.running = True
        threading.Thread(target=self.update, daemon=True).start()

    def update(self):
        while self.running:
            try:
                self.frame = next(self.frame_gen)
                print("Frame read successfully")
            except StopIteration:
                self.running = False
            except Exception as e:
                print(f"Error updating frame: {e}")
                self.running = False

    def read(self):
        return self.frame

    def stop(self):
        self.running = False


video_stream = Video(src="D:/OPENCV/videos/DJI_0302.MP4")
fps_limit = 10

while video_stream.running:
    start_time = time.time()
    frame = video_stream.read()
    if frame is None:
        print("No frame to display")
        break

    detect_params = model.predict(source=[frame], conf=0.25, save=False)
    if detect_params:
        boxes = detect_params[0].boxes
        for box in boxes:
            clsID = int(box.cls.cpu().numpy()[0])
            conf = box.conf.cpu().numpy()[0]
            bb = box.xyxy.cpu().numpy()[0]
            cv.rectangle(
                frame,
                (int(bb[0]), int(bb[1])),
                (int(bb[2]), int(bb[3])),
                detection_colors[clsID],
                5,
            )
            cv.putText(
                frame,
                f"{class_list[clsID]} {round(conf * 100, 2)}%",
                (int(bb[0]), int(bb[1]) - 10),
                cv.FONT_HERSHEY_COMPLEX,
                1,
                (255, 255, 255),
                2,
            )

    cv.imshow("Object Detection", frame)

    elapsed_time = time.time() - start_time
    frame_delay = max(1, int((1 / fps_limit - elapsed_time) * 1000))
    if cv.waitKey(frame_delay) == ord("q"):
        break

video_stream.stop()
cv.destroyAllWindows()
```
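For reference, one detail in the reader above that can stall frame reads is the blocking pipe.stderr.read() inside the loop, which waits until FFmpeg closes stderr. A minimal variant of the same generator, with stderr discarded and short reads treated as end of stream, might look like this (the frame size is assumed to match the source video exactly, since the command does not rescale):

```
import subprocess as sp
import numpy as np

def read_video_ffmpeg_nonblocking(path, frame_wid, frame_hyt):
    command = ["ffmpeg", "-loglevel", "error", "-i", path,
               "-f", "image2pipe", "-pix_fmt", "bgr24", "-vcodec", "rawvideo", "-"]
    # send stderr to /dev/null so the loop never blocks waiting on it
    pipe = sp.Popen(command, stdout=sp.PIPE, stderr=sp.DEVNULL, bufsize=10**8)
    frame_bytes = frame_wid * frame_hyt * 3
    try:
        while True:
            raw = pipe.stdout.read(frame_bytes)
            if len(raw) < frame_bytes:  # end of stream or truncated read
                break
            yield np.frombuffer(raw, dtype=np.uint8).reshape((frame_hyt, frame_wid, 3))
    finally:
        pipe.stdout.close()
        pipe.terminate()
```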
Does anyone have a recommendation for a theory-based, up-to-date book on Computer Vision based on deep learning techniques? My main topic of interest is object detection.
Hi there :)
I've got something cool to share with you. Over the past few months I have been running around trying to find a way to make a dream come true.
I'm creating an online hub for people in AI who care about technological innovation and about having a positive impact by building and contributing to projects.
This hub will be a place to find like-minded people to connect with and work on passion projects with.
Currently we are coding a platform so that everyone can find each other and get to know each other.
Once we have some initial users, we will start with short builder programs where individuals and teams can compete in an online competition, and the projects that stand out the most can earn a prize :)
Our goal is to make the world a better place by helping others do the same.
If you like our initiative, please sign up on our website below!
https://www.yournewway-ai.com/
And in a few weeks, once we're ready, we will send you an invite to join our platform :)
Hi, given 2 camera-to-world matrices, I am trying to compute the rotation of the camera from the first image to the second image. For this I calculated the relative transformation between the matrices (multiplying the second matrix by the inverse of the first) and took the top-left 3x3 sub-matrix of the 4x4 relative transform. I have the ground-truth rotation values, but for some reason they do not match the Euler angles I compute using scipy's Rotation package. Any clue what I am doing wrong mathematically?
*The cam2world values are the output obtained from Dust3r, if that makes a difference.
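For comparison, a minimal sketch of the two relative-pose conventions that often get mixed up with camera-to-world matrices; printing both against the ground truth usually shows which one the dataset expects. The Euler angles also depend on the axis sequence and on intrinsic vs. extrinsic rotations, so it's worth trying a few:

```
import numpy as np
from scipy.spatial.transform import Rotation as R

def relative_eulers(c2w_1, c2w_2, seq="xyz"):
    # relative motion expressed in world coordinates (what T2 @ inv(T1) gives)
    rel_world = c2w_2 @ np.linalg.inv(c2w_1)
    # relative motion expressed in camera-1 coordinates (inv(T1) @ T2)
    rel_cam = np.linalg.inv(c2w_1) @ c2w_2
    e_world = R.from_matrix(rel_world[:3, :3]).as_euler(seq, degrees=True)
    e_cam = R.from_matrix(rel_cam[:3, :3]).as_euler(seq, degrees=True)
    return e_world, e_cam
```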
Hello guys, my first post on here, and I just want to say I freaking hate my AMD GPU (running on Windows) so damn much. I have been trying for 6 weeks now to train a simple face detection model using a public dataset, but my AMD GPU refuses to cooperate! I wish I had known how bad AMD was for machine learning and computer vision before I bought it 😔😔 I can't even install Linux due to other reasons. I also tried DirectML, but that failed miserably for some reason. Not really looking for help, but if anyone is considering buying a build for computer vision (which I was not when I got mine), please avoid AMD at all costs.
You can check it out here: https://www.coursera.org/learn/hands-on-data-centric-visual-ai
I want to measure FPS to benchmark different versions of YOLO. I do this by running inference 5 times on a video and then averaging the FPS over all frames. To make sure the task is not interrupted by the scheduler, I put `sudo nice -n -20` before `yolo predict`, and I check the processes with `jtop` (and of course the power mode is fixed). However, under these conditions I sometimes get big differences for the same model (e.g. 50 vs. 75 FPS).
Do you know what the reason could be? Temperature? Or is there a more robust way to achieve my goal?
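If it helps, here is a small sketch of a more robust timing loop: warm-up iterations before measuring, per-frame timing with time.perf_counter, and reporting the median and percentiles instead of a single mean (`frames` is assumed to be a list of images or file paths):

```
import time
import numpy as np
from ultralytics import YOLO

def benchmark(weights, frames, warmup=20):
    model = YOLO(weights)
    # warm-up so model loading, cuDNN autotuning and clock ramp-up don't skew the numbers
    for f in frames[:warmup]:
        model.predict(f, verbose=False)
    times = []
    for f in frames[warmup:]:
        t0 = time.perf_counter()
        model.predict(f, verbose=False)
        times.append(time.perf_counter() - t0)
    times = np.array(times)
    print(f"median FPS: {1.0 / np.median(times):.1f}  "
          f"(p95 frame time: {1000 * np.percentile(times, 95):.1f} ms)")
```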
Hello CV,
I'm currently in the process of training YOLO to identify which industrial complexes do NOT have solar panels on their roofs. I want to feed it training data of Google Maps satellite images, but I'm unsure how to go about this.
The questions that I have:
- How do I determine the correct size (in pixels) for my training data?
- Is there any available API that can make the process easier? (see the sketch below)
- Is there a way to use the globe/3D view to help the model identify whether a roof is flat or slanted?
Thank you, hope someone can help me
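On the API question, here is a rough sketch of pulling satellite chips with the Google Maps Static API (it needs an API key and is subject to Google's usage terms; the zoom and size values are placeholders and determine the ground resolution of each training chip):

```
import requests

API_KEY = "YOUR_KEY"  # placeholder

def fetch_tile(lat, lng, zoom=19, size=640):
    # one satellite chip centred on (lat, lng); zoom ~19-20 gives roof-level detail
    params = {
        "center": f"{lat},{lng}",
        "zoom": zoom,
        "size": f"{size}x{size}",
        "maptype": "satellite",
        "key": API_KEY,
    }
    resp = requests.get("https://maps.googleapis.com/maps/api/staticmap",
                        params=params, timeout=30)
    resp.raise_for_status()
    with open(f"tile_{lat}_{lng}_z{zoom}.png", "wb") as f:
        f.write(resp.content)

fetch_tile(52.0907, 5.1214)  # example coordinates
```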
I was tasked with a project at work to build a facial recognition app that runs on Android tablets for one of our clients, on a tight deadline. The first thing I did was detect the face on the device, send it to a local server, have DLIB create an embedding from the captured face, THEN compare that embedding with the list of saved face embeddings. This worked (albeit with the maximum achievable latency), and the effective accuracy was about 50-60%.
After deploying this solution I started working on the app again, to enable on-device recognition using TFLite and MobileFaceNet (normalized embeddings and L2 normalization). It works, BUT the accuracy dropped by something like 30%.
At the moment I am using one frontal picture of each employee; can I increase the number of comparison (base) pictures per employee?
I (think I) have noticed that base pictures taken in front of a dark background tend to yield more accurate comparisons - is this actually the case (theoretically)?
Any other suggestions would be bloody appreciated. Oh, and by the way, prior to this project I had no knowledge of CV, so please explain things like you're talking to a four-year-old.
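On the multiple-base-pictures question: a common approach is to keep several reference embeddings per employee and take the best (or mean) similarity against the query. A minimal numpy sketch, with the acceptance threshold as a placeholder to be tuned on your own data:

```
import numpy as np

def l2_normalize(v, eps=1e-10):
    return v / (np.linalg.norm(v) + eps)

def match_score(query_emb, reference_embs):
    # cosine similarity of one query embedding against several reference
    # embeddings of the same person; keep the best match
    q = l2_normalize(np.asarray(query_emb, dtype=np.float32))
    return max(float(q @ l2_normalize(np.asarray(r, dtype=np.float32)))
               for r in reference_embs)

ACCEPT_THRESHOLD = 0.6  # placeholder, tune on a validation set of your own faces
```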
Hey, hi. I'm developing my first project with OpenCV using a Basler camera, but I cannot get image acquisition to work: it opens the image and then crashes instantly (stops responding).
Is there any guide anywhere I can use?
Also, I can't see the camera in the Pylon Viewer, but it does run in my Python code in Spyder (the one that crashes).
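In case it's useful for comparison, here is a minimal grab-and-display loop using the official pypylon package (based on Basler's standard OpenCV sample as I remember it, so double-check the names against your installed version):

```
from pypylon import pylon
import cv2 as cv

camera = pylon.InstantCamera(pylon.TlFactory.GetInstance().CreateFirstDevice())
camera.Open()
camera.StartGrabbing(pylon.GrabStrategy_LatestImageOnly)

# convert grabbed buffers to BGR so OpenCV can display them
converter = pylon.ImageFormatConverter()
converter.OutputPixelFormat = pylon.PixelType_BGR8packed

while camera.IsGrabbing():
    result = camera.RetrieveResult(5000, pylon.TimeoutHandling_ThrowException)
    if result.GrabSucceeded():
        frame = converter.Convert(result).GetArray()
        cv.imshow("Basler", frame)
        if cv.waitKey(1) == ord("q"):
            break
    result.Release()

camera.StopGrabbing()
camera.Close()
cv.destroyAllWindows()
```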
Hi everyone, I was assigned the task of lane detection. However, after searching the internet, I found many methods, mainly lane segmentation or polyline-based detection, while all I want is to predict the dots on the lane, like in the attached image. Can you suggest any model or method that has already been used for this?
Thank you very much
I am planning to start a small academic project that takes a street view from a Maps API and then does some processing on it. For example, say we are passing by a monument or any building of some importance and the crowd is covering up most of the space: I would like to erase the cars and people and content-fill the scene to get a clear view. I need help on which papers to read, if anyone has done anything similar to this. Mainly: how do I project the 360 view, and onto what sort of plane should I perform the desired edits? Any other help would also be appreciated.
Planning to make an identifier for manufacturing parts kept in a storage line with nuts and bolts of different sizes. Any recommendations?
This is my experience so far using the Orange Pi 5, and my attempts up to now at making YOLOv9 work on the Orange Pi 5 / RK3588 SoC.
Our company uses the Orange Pi 5 4GB (RK3588 SoC) as the main processing unit of our traffic cameras. These boards come with an NPU, which is very useful considering what happens behind the scenes in the whole detection process. I decided to make 3 different models: one for detecting vehicles, one for detecting license plates, and another one for reading the plates. I chose YOLOv9 since it had better accuracy than YOLOv10 and more speed than YOLOv8; I also chose the t variant of the YOLOv9 models since they are the lightest and probably the fastest on edge devices.
After building a good dataset from company data and doing my best at normalizing it, I got an acceptable accuracy above 70% in the test environment (and 60-82% in real life soon after).
After 3 work days on the Orange Pi, I was able to boot into an OS. The company gave me a board that already had an OS (some old version of PiOH, the specialized Ubuntu for Orange Pi boards), but it had some old dependencies like onnx 1.13.0 and my newer models weren't compatible. So after checking multiple ARM Linux versions (Armbian, Arch, PiOH, etc.), I got my hands on https://github.com/Joshua-Riek/ubuntu-rockchip/wiki which helped me boot the Orange Pi correctly. (In this process I even thought I had damaged a board, since these boards are moody and sometimes they simply don't want to boot from SD card or NVMe, or they show a red light, but we found out they were alive.)
After that, I made a simple Python script for taking frames from cameras and trying to detect objects with my models (vehicle detection -> crop the vehicle image -> send to the license plate detection model -> detect the license plate -> crop the license plate -> send to the OCR model -> read the license plate), then save the images of the car, the license plate, and the OCR output.
After about a week of trying different approaches to converting my .pt models to .rknn, I found out that YOLOv9 models are simply not compatible with the RK3588 NPU, since only models exported via torch.jit.trace can be used and YOLOv9's can't be; and you can't use any other YOLO models except those customized to be convertible to RKNN.
This was my experience. I hope it helps others avoid falling into this hole of not understanding what the docs and manuals of rknn-toolkit2 actually say.
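For anyone building something similar, here is a rough sketch of the cascade described above (vehicle -> plate -> OCR) using ultralytics YOLO for the two detection stages; the model paths and the read_plate() stub are placeholders, not the actual company models:

```
from ultralytics import YOLO

vehicle_model = YOLO("vehicle_det.pt")  # placeholder weights
plate_model = YOLO("plate_det.pt")      # placeholder weights

def read_plate(plate_img):
    # stand-in for the third (OCR) model
    return "UNKNOWN"

def process_frame(frame):
    results = []
    for vbox in vehicle_model.predict(frame, conf=0.5, verbose=False)[0].boxes:
        x1, y1, x2, y2 = map(int, vbox.xyxy[0].tolist())
        vehicle = frame[y1:y2, x1:x2]
        for pbox in plate_model.predict(vehicle, conf=0.5, verbose=False)[0].boxes:
            px1, py1, px2, py2 = map(int, pbox.xyxy[0].tolist())
            plate = vehicle[py1:py2, px1:px2]
            results.append((read_plate(plate), (x1, y1, x2, y2)))
    return results
```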
Hey everyone! For my capstone project, I'm building a system to detect people in wheelchairs from a video stream, but here's the catch: it has to run on a small board like a Raspberry Pi 4 or 5. I'm pretty new to machine learning and YOLO models, so I could really use some advice on a few things:
Thanks in advance for any help! Any advice or resources would be really appreciated.