/r/computervision
Computer Vision is the scientific subfield of AI concerned with developing algorithms to extract meaningful information from raw images, videos, and sensor data. This community is home to the academics and engineers both advancing and applying this interdisciplinary field, with backgrounds in computer science, machine learning, robotics, mathematics, and more.
We welcome everyone from published researchers to beginners!
Content which benefits the community (news, technical articles, and discussions) is valued over content which benefits only the individual (technical questions, help buying/selling, rants, etc.).
If you want an answer to a query, please post a legible, complete question that includes enough detail for us to help you properly!
Does anyone have this project? It's for my own project, please help.
Hi! I'm new to deep learning and working on a project using YOLOv8 to detect insects in museum display images. I'm running this on my school's HPC, which has an NVIDIA A100 GPU and uses Slurm. The training stops around the 4th epoch with a "segmentation error." Oddly enough, it works fine with a much smaller dataset (around 10 images, training for 1 epoch). Has anyone encountered this before, or have any tips on troubleshooting it?
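For context, a minimal sketch of the kind of Ultralytics training call described above, with the settings that are usually worth adjusting first when a run dies with a segfault on a shared cluster (the dataset config, paths, and values are placeholders, not the poster's actual setup):
```
# Minimal sketch (not the poster's script): a YOLOv8 training call with the
# knobs most often changed when chasing crashes on shared HPC nodes.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")  # pretrained nano checkpoint

model.train(
    data="insects.yaml",  # hypothetical dataset config (train/val paths, class names)
    epochs=100,
    imgsz=640,
    batch=16,      # lower this if the crash looks memory related
    workers=0,     # 0 disables multiprocessing dataloaders, a common segfault source
    device=0,      # the A100 visible inside the Slurm allocation
    cache=False,   # avoid caching the full dataset in RAM
)
```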
I am working on a research project which will contribute to my PhD dissertation.
This is a user study in which ML developers answer a survey, so that we can understand the issues, challenges, and needs of ML developers building privacy-preserving models.
If you work on ML products or services or you are part of a team that works on ML, please help me by answering the following questionnaire: https://pitt.co1.qualtrics.com/jfe/form/SV_6myrE7Xf8W35Dv0.
For sharing the study:
Please feel free to share the survey with other developers.
Thank you for your time and support!
Mary
Are the 3D models (OBJ, PLY files, etc.) available for all the scans of the DTU dataset?
I'm currently working at a startup that focuses on drone detection using RF sensors and radars, but my boss recently asked me to explore using computer vision to identify and track drones with our camera system. We're using a solid camera setup (a 1/2" sensor with 31x optical zoom), so we've got good hardware to work with.
I've done a few ML projects in school and played around with YOLOv8 before, but I'm trying to figure out if it's really the best fit for this. Is YOLO good enough for this kind of task, or should I try architecting a custom model specifically tuned for tracking drones in the sky? My priority is to make it fast enough to keep up with the drones while being accurate enough to identify them reliably. We'll be using edge computing too, probably a Jetson or something similar.
Any advice?
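As a baseline before committing to a custom architecture, a minimal sketch of off-the-shelf YOLOv8 detection plus Ultralytics' built-in ByteTrack tracker on a video stream (the weights file and camera URL are placeholders):
```
# Minimal sketch, not a production pipeline: off-the-shelf YOLOv8 detection
# plus the built-in ByteTrack tracker, run frame by frame on a video source.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")  # a custom-trained drone checkpoint would replace this

# stream=True yields results frame by frame instead of buffering the whole video
for result in model.track(
    source="rtsp://camera-ip/stream",  # hypothetical camera feed
    tracker="bytetrack.yaml",          # tracker config shipped with Ultralytics
    imgsz=960,                         # larger input helps with small, distant targets
    conf=0.25,
    stream=True,
):
    boxes = result.boxes  # xyxy, confidence, class, and track IDs when tracking succeeds
    if boxes.id is not None:
        print(boxes.id.tolist(), boxes.xyxy.tolist())
```
On a Jetson, exporting the chosen model to TensorRT (e.g. `model.export(format="engine")`) is a common next step for throughput.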
Goal: to digitize a few notebooks and family history letters (some in English, some in Portuguese) using HTR run locally (I eventually want to package this into an open-source app with some other features).
Resources: no GPU, 16GB of RAM on an Intel MacBook Pro. Extra CPU can be provided by my home cluster.
What I've tried: Microsoft's handwritten OCR models on Hugging Face (the most accurate of what I used, but only good for English), TrOCR, EasyOCR, and uploading to ChatGPT-4o. ChatGPT has been the most successful, with remarkable accuracy; I just uploaded the entire page. I tried downloading a local version of LLaVA to imitate the same idea as ChatGPT, but it wasn't successful at all. I have not tried separating segmentation from transcription; I've mostly used tools that do both, or tiny data samples (a photo cropped to one line).
I really want something I can run from the command line.
I feel a bit overwhelmed by how many tools there are and how different they all are.
Here are some of the attempts:
Original text: "A fim de ser um lider que estimula crescimento e" ("In order to be a leader who stimulates growth and"); note that in the handwriting "estimula" is misspelled as "estumula".
EasyOCR was a bust, while Microsoft's handwritten OCR was impressive.
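For the command-line requirement, a minimal CPU-only sketch of the TrOCR route already mentioned, assuming pages have been cropped to single lines beforehand (the checkpoint name is the public English handwritten model; Portuguese coverage would still need to be evaluated):
```
# trocr_line.py -- minimal CPU-only sketch: transcribe one cropped line image
# with TrOCR from Hugging Face. Usage: python trocr_line.py line.png
import sys
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")

image = Image.open(sys.argv[1]).convert("RGB")
pixel_values = processor(images=image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values)
text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(text)
```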
Hello Everyone,
I need to perform point cloud segmentation. I have many scans of a rock surface and I want to segment them into 3 categories. I have lots of scans from different locations with every point labelled into the 3 categories, so I can do supervised deep learning. I have some experience with machine learning in TensorFlow, but I am new to computer vision.
My two questions are:
Do I need part segmentation or semantic segmentation? The difference really confuses me, sorry!
I have spent a lot of time going through the examples in PointNet and PointNet++, but there are somewhat newer projects like DGCNN and PointCNN. Should I be using these newer packages, or the older, more established ones? Also, many of these packages are only available in TF1, though there are open pull requests for TF2 in some cases. If anyone has a lot of experience, could you advise me on the best place to start? I have labelled XYZRGB data.
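Whichever backbone ends up being used, labelled XYZRGB scans like the ones described boil down to the same per-point semantic segmentation format; a rough sketch of that arrangement, with hypothetical array names and block size:
```
# Sketch of the usual per-point semantic segmentation input format: each scan
# is an (N, 6) array of XYZRGB values plus an (N,) array of integer labels in
# {0, 1, 2}. PointNet-style models train on fixed-size blocks sampled per scan.
import numpy as np

def sample_block(points: np.ndarray, labels: np.ndarray, n_points: int = 4096):
    """Randomly sample a fixed-size block of points with matching labels."""
    assert points.shape[0] == labels.shape[0] and points.shape[1] == 6
    idx = np.random.choice(points.shape[0], n_points, replace=points.shape[0] < n_points)
    block = points[idx].copy()
    # Centre XYZ so the network sees local geometry rather than absolute coordinates
    block[:, :3] -= block[:, :3].mean(axis=0)
    # Scale RGB to [0, 1] if stored as 0-255
    block[:, 3:] /= 255.0
    return block, labels[idx]

# Example with fake data: one scan of 100k points, 3 classes
pts = np.random.rand(100_000, 6) * [10, 10, 2, 255, 255, 255]
lbl = np.random.randint(0, 3, size=100_000)
x, y = sample_block(pts, lbl)
print(x.shape, y.shape)  # (4096, 6) (4096,)
```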
Dear researchers (myself included): please stop acting like we are releasing a software package. I've been working with RT-DETR for my thesis and it took me a WHOLE DAY just to figure out what is going on in the code. Why do some of us think we are releasing a super complicated standalone package? I see this all the time: we take a super simple task like inference or training and make it needlessly complicated with decorators, multiple unnecessary classes, and every single hyperparameter buried in YAML files. The author of RT-DETR created over 20 source files for something that could have been done in fewer than 5. The same goes for Ultralytics and many other repos. Please stop this. You are violating the most basic principle of research code: it makes it very difficult for others to take your work and improve it. We use Python for development because of its simplicity. There is no need for 25 different function calls just to load a model. And don't even get me started on the ridiculous trend of state dicts. Please, please, for God's sake, stop this nonsense.
Hello everyone,
I was reading the nnU-Net paper (https://www.nature.com/articles/s41592-020-01008-z, arXiv version: https://arxiv.org/abs/1809.10486) and I was wondering whether the pre-trained version of their model is available anywhere. Specifically, I'm looking for the nnU-Net instances they themselves trained, since I want to work with the same Medical Decathlon dataset, specifically for Knowledge Distillation.
I found their GitHub (https://github.com/MIC-DKFZ/nnUNet), which covers how to train and run inference in general. I was originally going to train an nnU-Net on the Medical Decathlon dataset myself, but that would be redoing work that has already been done. Does anyone know how I can find the trained model instances they worked with? I thought about emailing them, but I don't know how acceptable such a request is in the CV community.
nnUNetV1 would also be fine.
Would be grateful for any advice.
Hi, I'm looking for papers that go into more depth on these components mentioned in my professor's lecture. The closest thing I have found is the data association discussed in the ORB-SLAM3 paper, but it does not cover loop closure in the local map or the global map.
My professor said this is discussed in many papers, but so far I have not found one.
Train S3D Video Classification Model using PyTorch
https://debuggercafe.com/train-s3d-video-classification-model/
PyTorch (Torchvision) provides a host of pretrained video classification models. Training and fine-tuning these models can prove to be an invaluable asset in building many real-life applications. However, preparing the right code to start with custom video classification training can be difficult. In this article, we will train the S3D video classification model from PyTorch. Along the way, we will discuss the pitfalls, caveats, and optimization techniques specific to the model.
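As a quick orientation, a minimal sketch of loading the pretrained S3D from Torchvision and swapping its classifier head, assuming Torchvision's classifier layout (a Dropout followed by a 1x1x1 Conv3d); the clip shape and class count are placeholders, and the article itself covers the full fine-tuning pipeline:
```
# Minimal sketch (not the article's code): load Torchvision's pretrained S3D,
# replace the classifier head for a hypothetical 5-class dataset, and run a
# dummy clip of shape (batch, channels, frames, height, width) through it.
import torch
import torch.nn as nn
from torchvision.models.video import s3d, S3D_Weights

num_classes = 5  # placeholder for a custom dataset
weights = S3D_Weights.KINETICS400_V1
model = s3d(weights=weights)

# Torchvision's S3D classifier is a Dropout followed by a 1x1x1 Conv3d over
# 1024 feature channels; swap the conv to match the new label set.
model.classifier[1] = nn.Conv3d(1024, num_classes, kernel_size=1, stride=1, bias=True)

model.eval()
clip = torch.randn(1, 3, 16, 224, 224)  # one random 16-frame RGB clip
with torch.no_grad():
    logits = model(clip)
print(logits.shape)  # torch.Size([1, 5])
```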
Hello, everyone! I'm currently working on a YOLO vision project using the pose model, and I'm having some issues estimating distance using only a camera. I know it might sound a bit arbitrary, but this is the solution we have for now while I wait for a LiDAR sensor I ordered last week. Since I'm in a Latin American country, it may take a month to arrive.
Right now, we're estimating distance by using the focal length with the person facing the camera, and it seems to be working well, with an error margin of around 20-25 cm. Here’s the code we're using:
```
float YoloV8::estimateDistance(const cv::Rect_<float>& bbox, const std::vector<float>& keypoints) {
    // If we have valid keypoints, use the shoulder distance
    if (!keypoints.empty() && keypoints.size() >= 21) { // Ensure we have enough keypoints
        // Get shoulder coordinates (keypoints 5 and 6 in COCO format)
        float shoulder1X = keypoints[12];    // Right shoulder X
        float shoulder2X = keypoints[15];    // Left shoulder X
        float shoulder1Conf = keypoints[14]; // Right shoulder confidence
        float shoulder2Conf = keypoints[17]; // Left shoulder confidence

        // If both shoulders are detected with sufficient confidence
        if (shoulder1Conf > KPS_THRESHOLD && shoulder2Conf > KPS_THRESHOLD) {
            float shoulderWidth = std::abs(shoulder2X - shoulder1X);
            if (shoulderWidth > 0) {
                return (AVERAGE_SHOULDER_WIDTH * CAMERA_FOCAL_LENGTH_SHOULDERS) / shoulderWidth;
            }
        }
    }
    // Fallback to the original method if we cannot use shoulders
    return (AVERAGE_PERSON_WIDTH * CAMERA_FOCAL_LENGTH) / bbox.width;
}
```
The issue I’m currently facing is with profile views; the distance calculation becomes inaccurate, returning values that don't make sense.
Background: worked in the research labs of McGill University and IISc Bangalore in the fields of CV, ML, robotics, and IoT.
Tech stack: PyTorch, OpenCV, MediaPipe, ROS, Pure Data, C++.
Currently looking for contract-based projects. If you are a professional looking to delegate your work, or a college student looking to get your final-year project done at an industrial level, feel free to contact me for my portfolio/profile.
I am trying to figure out who my target market is, and it would be extremely helpful if some of you could fill out this survey for me. We will be releasing more information about it in the future; I think you all will love it, developers and hobbyists alike. https://forms.gle/6KzCHZskboepSpWQ6
The paper aims to shorten acquisition time, reduce costs, and accelerate the deployment of imaging devices.
https://openreview.net/pdf?id=MloaGA6WwX
Contributions:
We expect further applications to similar data types, e.g. data efficiency on multi-channel images, other hyperspectral/multispectral applications, cell microscopy, and weather and climate data.
Code is available, PM me if interested.
Hi all. I am an MBA student at Temple University, and for our final project we are looking at machine and computer vision. I would be grateful if you could fill out this survey and, if possible, send it to anyone else who works in manufacturing. We are looking for opinions both from those who currently use vision systems and from those who do not. Here is the link to the survey: https://fox.az1.qualtrics.com/jfe/form/SV_0cEBnNUQ9jnxZpI
Alternatively, if you would like to do a short interview about your experiences, that would also be much appreciated.
Thanks so much!
Hi all:
I am reading a paper and need to thoroughly understand it. This is the paper: https://ieeexplore.ieee.org/abstract/document/6983606
I can pay. If anyone here is well versed in this and can read through and thoroughly understand/help me implement this, please DM me. Thanks!
I'm looking for a model like Grounding DINO (GDINO), with the same sort of open-vocabulary/zero-shot support, but preferably faster (and maybe smaller/less resource-intensive). I looked into YOLO-World, but it didn't support the open-vocab part quite the way I wanted (e.g. instead of detecting all apples in a scene, I want to detect "apple on table", which GDINO handles much better than YOLO-World in my tests).
Or should I just fine-tune YOLO-World to do what I want?
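For reference, a minimal sketch of how Ultralytics exposes YOLO-World with free-form phrase prompts; whether a phrase like "apple on table" behaves as well as it does in GDINO would still need to be checked empirically (the weights name and image path are placeholders):
```
# Minimal sketch: prompt YOLO-World with a free-form phrase via Ultralytics.
from ultralytics import YOLOWorld

model = YOLOWorld("yolov8s-world.pt")  # placeholder weights file

# Each prompt string becomes a detection class in the open-vocabulary head
model.set_classes(["apple on table"])

results = model.predict("scene.jpg", conf=0.25)  # hypothetical test image
for box in results[0].boxes:
    print(box.cls, box.conf, box.xyxy)
```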
Title
I tried OmniParser and it's pretty decent, but it misses some stuff. Also, I'd like something that can recognize boxes/layouts instead of just icons/text.
We are making a robot for you: https://forms.gle/ggVetcDios9m15yV8
Hi!
I recently started my adventure with computer vision. I wrote some code that is supposed to run the YOLO algorithm on the GPU (I am using NVIDIA CUDA). I encountered a whole lot of errors trying to open video files with it, and I'm still having problems: it seems to fail when reading the frames. The code is below. I spent a couple of hours with ChatGPT and scrolling through the internet in search of help, but nothing worked :( I also checked the directory and the video resolution, and they seem fine. Do you have any idea how to fix it? I will be grateful for any kind of help!
P.S. ffmpeg seems to have no problem locating and opening the file from the command prompt.
```
import random
import threading
import cv2 as cv
import numpy as np
from ultralytics import YOLO
import torch
import time
import subprocess as sp
import os

cv.setNumThreads(1)

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

# Load the class list
with open("utils/coco.txt", "r") as my_file:
    class_list = my_file.read().strip().split("\n")

detection_colors = [(random.randint(0, 255), random.randint(0, 255), random.randint(0, 255)) for _ in range(len(class_list))]

model = YOLO("weights/yolov8n.pt", "v8").to(device)


def read_video_ffmpeg(path, frame_wid=640, frame_hyt=480):
    command = ['ffmpeg', '-loglevel', 'error', '-i', path, '-f', 'image2pipe', '-pix_fmt', 'bgr24', '-vcodec', 'rawvideo', '-']
    print("Running FFmpeg command:", " ".join(command))  # Debug print
    pipe = sp.Popen(command, stdout=sp.PIPE, stderr=sp.PIPE, bufsize=10**8)
    while True:
        raw_image = pipe.stdout.read(frame_wid * frame_hyt * 3)
        print(f"Raw image length: {len(raw_image)}")
        # Check if FFmpeg gave any error
        err = pipe.stderr.read().decode()
        if err:
            print("FFmpeg error:", err)
            break
        if not raw_image:
            print("End of video stream or no data received.")
            break
        try:
            frame = np.frombuffer(raw_image, dtype='uint8').reshape((frame_hyt, frame_wid, 3))
            yield frame
        except ValueError as e:
            print(f"Error reshaping frame: {e}")
            break
    pipe.stdout.close()
    pipe.stderr.close()
    pipe.terminate()


class Video:
    def __init__(self, src="D:/OPENCV/videos/DJI_0302.MP4"):
        self.src = src
        # Check if the video file exists
        if not os.path.isfile(self.src):
            print(f"Error: Video file does not exist at path: {self.src}")
            return  # Stop initializing if file does not exist
        self.frame_wid = 2720  # Update frame width
        self.frame_hyt = 1536  # Update frame height
        self.frame_gen = read_video_ffmpeg(self.src, self.frame_wid, self.frame_hyt)
        self.frame = None
        self.running = True
        threading.Thread(target=self.update, daemon=True).start()

    def update(self):
        while self.running:
            try:
                self.frame = next(self.frame_gen)
                print("Frame read successfully")
            except StopIteration:
                self.running = False
            except Exception as e:
                print(f"Error updating frame: {e}")
                self.running = False

    def read(self):
        return self.frame

    def stop(self):
        self.running = False


video_stream = Video(src="D:/OPENCV/videos/DJI_0302.MP4")
fps_limit = 10

while video_stream.running:
    start_time = time.time()
    frame = video_stream.read()
    if frame is None:
        print("No frame to display")
        break
    detect_params = model.predict(source=[frame], conf=0.25, save=False)
    if detect_params:
        boxes = detect_params[0].boxes
        for box in boxes:
            clsID = int(box.cls.cpu().numpy()[0])
            conf = box.conf.cpu().numpy()[0]
            bb = box.xyxy.cpu().numpy()[0]
            cv.rectangle(
                frame,
                (int(bb[0]), int(bb[1])),
                (int(bb[2]), int(bb[3])),
                detection_colors[clsID],
                5,
            )
            cv.putText(
                frame,
                f"{class_list[clsID]} {round(conf * 100, 2)}%",
                (int(bb[0]), int(bb[1]) - 10),
                cv.FONT_HERSHEY_COMPLEX,
                1,
                (255, 255, 255),
                2,
            )
    cv.imshow("Object Detection", frame)
    elapsed_time = time.time() - start_time
    frame_delay = max(1, int((1 / fps_limit - elapsed_time) * 1000))
    if cv.waitKey(frame_delay) == ord("q"):
        break

video_stream.stop()
cv.destroyAllWindows()
```
Does anyone have a recommendation for a theory-focused, up-to-date book on deep-learning-based computer vision? My main topic of interest is object detection.
Hi there :)
I've got something cool to share with you. Over the past few months I have been running around trying to find a way to make a dream come true.
I'm creating an online hub for people in AI who care about technological innovation and about having a positive impact by building and contributing to projects.
This hub will be a place to find like-minded people to connect with and to work on passion projects with.
Currently we are coding a platform so that everyone can find each other and get to know each other.
Once we have some initial users, we will start short builder programs where individuals and teams compete in an online competition, and the projects that stand out the most can earn a prize :)
Our goal is to make the world a better place by helping others do the same.
If you like our initiative, please sign up on our website below!
https://www.yournewway-ai.com/
And in a few weeks, once we're ready, we will send you an invite to join our platform :)
Hi, given two camera-to-world matrices, I am trying to compute the rotation of the camera from the first image to the second. For this I calculated the relative transformation between the matrices (multiplying the second matrix by the inverse of the first) and took the 3x3 rotation sub-matrix of the 4x4 relative transform. I have the ground-truth rotation values, but for some reason they do not match the Euler angles I compute with SciPy's Rotation package. Any clue what I am doing wrong mathematically?
*The camera-to-world values are the output of DUSt3R, if that makes a difference.
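For comparison, a minimal sketch of the computation as described, with the Euler convention made explicit, since a mismatched axis order, an intrinsic/extrinsic mix-up, or a ground truth defined as world-to-camera (the inverse rotation) are the usual culprits; the matrices here are placeholders:
```
# Sketch: relative camera rotation from two 4x4 cam2world matrices.
# T1 and T2 are identity placeholders here; in practice they come from DUSt3R.
import numpy as np
from scipy.spatial.transform import Rotation as R

T1 = np.eye(4)  # cam2world of image 1 (placeholder)
T2 = np.eye(4)  # cam2world of image 2 (placeholder)

# Transform taking camera-2 coordinates into camera-1 coordinates:
# X_c1 = inv(T1) @ T2 @ X_c2, so the relative rotation is R1^T @ R2.
T_rel = np.linalg.inv(T1) @ T2
R_rel = T_rel[:3, :3]
rot = R.from_matrix(R_rel)

# Euler angles depend on the axis sequence and on intrinsic (upper case) vs
# extrinsic (lower case) convention; printing a few makes mismatches obvious.
for seq in ("xyz", "XYZ", "zyx", "ZYX"):
    print(seq, rot.as_euler(seq, degrees=True))

# If the ground truth is stored as world-to-camera, compare against R_rel.T instead.
# The total rotation angle is convention-free and a good sanity check:
print("total rotation (deg):", np.degrees(rot.magnitude()))
```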
Hello guys, my first post on here, and I just want to say I freaking hate my AMD GPU (running on Windows) so much. I have been trying for 6 weeks now to train a simple face detection model using a public dataset, but my AMD GPU refuses to cooperate! I wish I had known how bad AMD was for machine learning and computer vision before I bought it 😔 I can't even install Linux due to other reasons, and I also tried DirectML, but that failed miserably for some reason. Not really looking for help, but if anyone is considering buying a build for computer vision (which I was not when I got mine), please avoid AMD at all costs.
You can check it out here: https://www.coursera.org/learn/hands-on-data-centric-visual-ai
I want to measure FPS to benchmark different versions of YOLO. I do this by running inference 5 times on a video and then averaging the FPS over the frames. To make sure the task is not interrupted by the scheduler, I put `sudo nice -n -20` before `yolo predict`, and I check the processes with `jtop` (and of course the power mode is fixed). However, under these conditions I sometimes get big differences for the same model (e.g. 50 vs 75 FPS).
Do you know what the reason could be? Temperature? Or is there a more robust way to achieve my goal?
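One way to reduce the variance is to time only the model's forward pass in a short script, after a warm-up, instead of relying on the end-to-end FPS printout; a minimal sketch with the Ultralytics Python API (model and video paths are placeholders):
```
# Minimal sketch: average the pure inference latency over a video, with a
# warm-up so kernel selection and memory allocation don't skew the numbers.
import numpy as np
from ultralytics import YOLO

model = YOLO("yolov8n.pt")  # placeholder weights

# Warm-up on a dummy frame: the first calls are always slower
dummy = np.zeros((640, 640, 3), dtype=np.uint8)
for _ in range(10):
    model.predict(dummy, verbose=False)

# Ultralytics reports per-frame preprocess/inference/postprocess times in ms
inference_ms = []
for result in model.predict("video.mp4", stream=True, verbose=False):
    inference_ms.append(result.speed["inference"])

avg_ms = sum(inference_ms) / len(inference_ms)
print(f"avg inference: {avg_ms:.2f} ms -> {1000.0 / avg_ms:.1f} FPS (model only)")
```
Locking the clocks with `jetson_clocks` in addition to fixing the power mode is also worth trying, since thermal or DVFS throttling between runs can easily account for a 50 vs 75 FPS swing.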
Hello CV,
I'm currently in the process of training YOLO to identify which industrial complexes do NOT have solar panels on their roofs. I want to feed it training data of Google Maps satellite images, but I'm unsure how to go about this.
The questions that I have:
- How do I determine the correct size (in pixels) for my training data?
- Is there an available API that can make the process easier? (See the sketch after this list.)
- Is there a way to use the globe/3D view to help the model identify whether a roof is flat or slanted?
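On the API question, a minimal sketch of pulling fixed-size satellite tiles with the Google Static Maps API, assuming an API key and a list of site coordinates (the key, coordinates, and zoom level are placeholders, and usage is subject to Google's terms):
```
# Minimal sketch: download square satellite tiles for labelled sites using the
# Google Static Maps API. API key, coordinates, and zoom are placeholders.
import requests

API_KEY = "YOUR_API_KEY"      # hypothetical
SITES = [(52.3702, 4.8952)]   # hypothetical (lat, lon) of industrial complexes
ZOOM = 19                     # roughly rooftop-level detail
SIZE = "640x640"              # also a convenient training-tile size for YOLO

for i, (lat, lon) in enumerate(SITES):
    url = (
        "https://maps.googleapis.com/maps/api/staticmap"
        f"?center={lat},{lon}&zoom={ZOOM}&size={SIZE}&maptype=satellite&key={API_KEY}"
    )
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    with open(f"site_{i:04d}.png", "wb") as f:
        f.write(resp.content)
```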
Thank you, hope someone can help me