/r/computervision
Computer Vision is the scientific subfield of AI concerned with developing algorithms to extract meaningful information from raw images, videos, and sensor data. This community is home to the academics and engineers both advancing and applying this interdisciplinary field, with backgrounds in computer science, machine learning, robotics, mathematics, and more.
We welcome everyone from published researchers to beginners!
Content which benefits the community (news, technical articles, and discussions) is valued over content which benefits only the individual (technical questions, help buying/selling, rants, etc.).
If you want an answer to a query, please post a legible, complete question that includes enough detail for us to help you properly!
I'm interested in understanding the math behind the image processing scripts and modules used in PixInsight, and I was wondering if there are any resources that can help me get started. Let's say one of my goals would be to write scripts for PixInsight. Currently I'm going through the book Digital Image Processing by Gonzalez and Woods, but I also wanted to know if there are any books that are more catered toward astrophotography image processing.
Thanks!
I have data in the form of screenshots of tables (PNG format) that hold images in their cells. I want to extract the images from the table, along with the data from the row that holds each image.
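One possible starting point, sketched below with OpenCV, is to find the large non-background regions in the screenshot and crop them; it assumes the embedded photos are big, roughly rectangular blobs on a light table background (the file name and size thresholds are placeholders, and the row text would still need a separate OCR step).

import cv2

# Minimal sketch: crop large regions that stand out from the light table background.
screenshot = cv2.imread("table_screenshot.png")  # hypothetical file name
gray = cv2.cvtColor(screenshot, cv2.COLOR_BGR2GRAY)

# Invert-threshold so non-white content becomes foreground.
_, binary = cv2.threshold(gray, 240, 255, cv2.THRESH_BINARY_INV)
contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

crops = []
for c in contours:
    x, y, w, h = cv2.boundingRect(c)
    # Keep only regions big enough to be embedded photos, not text or grid lines.
    if w > 100 and h > 100:
        crops.append(((x, y, w, h), screenshot[y:y + h, x:x + w]))

Each crop's y coordinate can then be matched to a table row (for example, via detected horizontal grid lines), and the remaining cells in that row passed to OCR.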
I'm working on a school project and the aim is to essentially extract a big circular object that rotates (like the New Year's ball), then to calculate a bunch of other things after creating a binary mask of it. I've done a decent implementation by just applying a Gaussian blur and then using a Hough circle transform to extract the object. However, it's not perfect and often either cuts a part of the object out or includes the background (which is of a similar color to the object in some cases since the object is rotating). Was wondering what other methods would be beneficial for this task. Thanks in advance 🙏
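One option worth trying is to keep the Hough circle only as a rough seed and let GrabCut refine the boundary from color statistics. A minimal sketch, assuming 'frame' is a BGR image and (cx, cy, r) comes from the existing HoughCircles step:

import cv2
import numpy as np

def refine_circle_mask(frame, cx, cy, r):
    # Seed GrabCut from the rough circle: far outside = background,
    # a generous disk = probably foreground, the inner core = foreground.
    mask = np.full(frame.shape[:2], cv2.GC_BGD, np.uint8)
    cv2.circle(mask, (cx, cy), int(r * 1.2), cv2.GC_PR_FGD, -1)
    cv2.circle(mask, (cx, cy), int(r * 0.8), cv2.GC_FGD, -1)
    bgd_model = np.zeros((1, 65), np.float64)
    fgd_model = np.zeros((1, 65), np.float64)
    cv2.grabCut(frame, mask, None, bgd_model, fgd_model, 5, cv2.GC_INIT_WITH_MASK)
    # Collapse GrabCut's four labels into a binary mask.
    return np.where((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD), 255, 0).astype(np.uint8)

Temporally smoothing the circle parameters across frames can also help, since the ball's center barely moves between consecutive frames even though it rotates.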
I have multiple detection and classification models running on the OpenCV DNN backend (ONNX), but I cannot run them in parallel.
Please suggest a way to run the models in parallel that works on both GPU and CPU.
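One approach is to give each model its own cv2.dnn.Net and run them from separate threads. A minimal sketch follows; the model file names and the 'preprocess' helper are hypothetical, and the CUDA backend only works if your OpenCV build has CUDA support (otherwise the nets fall back to CPU, or you could use onnxruntime sessions instead).

import cv2
from concurrent.futures import ThreadPoolExecutor

def load_net(path, use_gpu):
    net = cv2.dnn.readNetFromONNX(path)
    if use_gpu:
        # Requires an OpenCV build with CUDA; remove these calls for CPU-only.
        net.setPreferableBackend(cv2.dnn.DNN_BACKEND_CUDA)
        net.setPreferableTarget(cv2.dnn.DNN_TARGET_CUDA)
    return net

def run(net, blob):
    net.setInput(blob)
    return net.forward()

nets = [load_net(p, use_gpu=True) for p in ("det.onnx", "cls.onnx")]  # hypothetical files
pool = ThreadPoolExecutor(max_workers=len(nets))

def infer_frame(frame):
    blobs = [preprocess(frame, net) for net in nets]  # hypothetical per-model preprocessing
    # Each net gets its own task; a cv2.dnn.Net object is not thread-safe,
    # so never share one Net instance between threads.
    return list(pool.map(run, nets, blobs))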
Currently working on an AI model designed to detect potholes per 50 m to 100 m strip of a concrete road. Stereo imaging was the first choice for data collection, but we were informed that stereo cameras don't do well outdoors, especially in direct sunlight, and StereoPi was suggested for collecting stereoscopic images/footage. I've been searching around about StereoPi but did not find definite answers to my questions. Is it okay for outdoor data collection? Can it do depth measurement? Will it work/perform well when mounted on a drone or moving vehicle?
Hello all. SAHI with a full YOLOv11 model at an input resolution of 800 pixels or below causes a lot of latency. If you did a similar project, was it too slow for real-time applications? Whatever your setup was, I am happy to learn from it. The papers I read chose a pixel count of 1000 and above, with detection run on an RTX 3080 or 3090 and video transmission done through 4G and a livestream API.
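For reference, a minimal sliced-inference setup with the SAHI library looks roughly like the sketch below; the model_type string, weight file, and slice sizes are assumptions to adjust for your own YOLO version and latency budget.

from sahi import AutoDetectionModel
from sahi.predict import get_sliced_prediction

detection_model = AutoDetectionModel.from_pretrained(
    model_type="yolov8",          # assumption: pick the type matching your YOLO version
    model_path="weights.pt",      # hypothetical weights file
    confidence_threshold=0.3,
    device="cuda:0",
)

result = get_sliced_prediction(
    "frame.jpg",                  # hypothetical input frame
    detection_model,
    slice_height=640,
    slice_width=640,
    overlap_height_ratio=0.2,
    overlap_width_ratio=0.2,
)
print(len(result.object_prediction_list), "detections")

Latency scales roughly with the number of slices, so larger slices with less overlap are the usual knob for getting closer to real time.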
Hello, I have recently started learning computer vision as part of my course, but I feel the materials provided aren't very detailed, and every lecture slide lists Deep Learning for Vision Systems by Mohamed Elgendy as its reference. Is there a way I can get my hands on the PDF of this book on the internet?
I am starting a new, challenging object detection project that is going to analyze basketball games. It should detect both teams' players (potentially even the player's name based on jersey number), referees (potentially referee gestures, and based on those it could track fouls, violations and other stuff), and the ball (tracking the ball, and if it goes through the hoop it counts as a point). I am also going to map the court lines and, based on that, infer whether a shot is a 2- or 3-pointer.
I have ideas for a lot more features, but I want to make it work first and upgrade it with more functionality later.
Q1: What workflow are you advising me to follow?
Q2: What technologies/frameworks/models are you recommending for this project?
Q3: What are some obstacles (or non-trivial implementations) that you think I am going to face as a beginner/intermediate CV developer, and ways to overcome them?
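For a first pipeline stage (detection plus tracking of players and the ball), something like the hedged sketch below is a common starting point. The model name, class filter, and video path are assumptions; a custom model trained on basketball footage (players, referees, ball, hoop) would replace the generic COCO weights.

from ultralytics import YOLO

model = YOLO("yolov8n.pt")        # placeholder weights; swap for a custom-trained model

results = model.track(
    source="game.mp4",            # hypothetical input video
    persist=True,                 # keep track IDs across frames
    classes=[0, 32],              # COCO: 0 = person, 32 = sports ball
    stream=True,
)

for frame_result in results:
    if frame_result.boxes.id is None:
        continue
    for box, track_id in zip(frame_result.boxes.xyxy, frame_result.boxes.id):
        x1, y1, x2, y2 = box.tolist()
        # Downstream stages: jersey-number OCR per player crop,
        # court-line homography for 2/3-point classification, hoop/ball logic.
        print(int(track_id), x1, y1, x2, y2)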
I am extracting a vanishing point and need to calculate a lot of edge directions for each frame of the video. Only the directions are important; they are later sent to the accumulator, and full edges are not needed. Currently I am using Sobel to calculate gradients and select the strongest ones, which are then turned into perpendicular edge directions. However, the method is not accurate for almost horizontal/vertical edges, as it snaps them to exactly horizontal/vertical even with a 9x9 kernel.
I know there is a way to calculate edgelets instead by doing an SVD in the neighborhood of each strong edge, but this does not seem vectorizable and would thus be too slow, as I have hundreds of strong edges per frame. Are there any other methods to extract more accurate edge/gradient directions than Sobel?
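One fully vectorized alternative is the smoothed structure tensor: it averages gradient products over a window and gives a sub-degree orientation estimate plus a coherence score you can use to keep only the strongest edgels. A minimal sketch, with the window size (sigma) as an assumption:

import cv2
import numpy as np

def structure_tensor_orientation(gray, sigma=2.0):
    gray = gray.astype(np.float32)
    ix = cv2.Sobel(gray, cv2.CV_32F, 1, 0, ksize=3)
    iy = cv2.Sobel(gray, cv2.CV_32F, 0, 1, ksize=3)
    # Smooth the tensor components over a local window.
    jxx = cv2.GaussianBlur(ix * ix, (0, 0), sigma)
    jxy = cv2.GaussianBlur(ix * iy, (0, 0), sigma)
    jyy = cv2.GaussianBlur(iy * iy, (0, 0), sigma)
    # Dominant gradient orientation; the edge direction is perpendicular to it.
    orientation = 0.5 * np.arctan2(2.0 * jxy, jxx - jyy)
    # Coherence near 1 means a single well-defined orientation at that pixel.
    coherence = np.sqrt((jxx - jyy) ** 2 + 4.0 * jxy ** 2) / (jxx + jyy + 1e-12)
    return orientation, coherence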
Hi everyone,
I'm new to Vision Transformers (ViTs). Previously, I used UNet for medical image segmentation. Now, I'm trying to create a hybrid model where SegFormer will be used as the encoder for UNet, and the default UNet decoder will remain. However, I'm unsure how to import SegFormer, use it, and create patches for it. I've already preprocessed my images and masks.
Also, I'm not sure how to use SegFormer as an encoder. If needed, I can share my UNet code. Currently, I want to use this hybrid network for image segmentation, but later on, I might also want to classify the images using a pre-trained classification model. In that case, how can I integrate the segmentation and classification models so that both tasks are performed?
I've tried searching on GitHub and YouTube but couldn't find any resources. Additionally, I'm using TensorFlow.
Any help would be greatly appreciated! (DMs are open)
Here is my code for preprocessing the images and masks:
# Preprocessing images and masks
# (I still don't know how to import SegFormer, use it, or create patches for it.)
import gc
import math
import numpy as np
import pandas as pd
from tensorflow.keras.preprocessing.image import load_img, img_to_array
def preprocess_images_and_masks_batch(image_paths, mask_paths, batch_size=32, target_size=(256, 256)):
    """
    Generator to yield batches of images and masks, reducing memory usage.
    """
    num_images = len(image_paths)
    # Infinite loop to allow multiple epochs
    while True:
        for start in range(0, num_images, batch_size):
            end = min(start + batch_size, num_images)
            batch_images = []
            batch_masks = []
            for img_path, mask_path in zip(image_paths[start:end], mask_paths[start:end]):
                # Load the image and resize it only if it's not already 256x256
                img = load_img(img_path)
                if img.size != target_size:
                    img = img.resize(target_size)
                # Normalize image to [0, 1]
                img = img_to_array(img) / 255.0
                batch_images.append(img)
                # Load the mask and resize it only if it's not already 256x256
                mask = load_img(mask_path, color_mode="grayscale")
                if mask.size != target_size:
                    mask = mask.resize(target_size)
                # Normalize mask to [0, 1] and ensure the values are clipped
                mask = img_to_array(mask) / 255.0
                mask = np.clip(mask, 0, 1)
                batch_masks.append(mask)
            yield np.array(batch_images), np.array(batch_masks)
            # Free up memory after processing a batch
            gc.collect()
def load_split_data(csv_path, batch_size=32, target_size=(256, 256)):
    """
    Load and split the dataset based on the 'split' column in the CSV file and return data in batches.
    """
    # Read CSV file
    data = pd.read_csv(csv_path)
    # Filter by split
    train_data = data[data['split'] == 'train']
    valid_data = data[data['split'] == 'valid']
    test_data = data[data['split'] == 'test']
    # Get image and mask paths
    X_train_paths = train_data['image_path'].tolist()
    Y_train_paths = train_data['mask_path'].tolist()
    X_val_paths = valid_data['image_path'].tolist()
    Y_val_paths = valid_data['mask_path'].tolist()
    X_test_paths = test_data['image_path'].tolist()
    Y_test_paths = test_data['mask_path'].tolist()
    # Generate batches for each split
    train_generator = preprocess_images_and_masks_batch(X_train_paths, Y_train_paths, batch_size, target_size)
    valid_generator = preprocess_images_and_masks_batch(X_val_paths, Y_val_paths, batch_size, target_size)
    test_generator = preprocess_images_and_masks_batch(X_test_paths, Y_test_paths, batch_size, target_size)
    # Calculate steps per epoch
    train_steps = math.ceil(len(X_train_paths) / batch_size)
    valid_steps = math.ceil(len(X_val_paths) / batch_size)
    test_steps = math.ceil(len(X_test_paths) / batch_size)
    print(f"train steps per epoch: {train_steps}")
    print(f"valid steps per epoch: {valid_steps}")
    print(f"test steps per epoch: {test_steps}")
    return train_generator, valid_generator, test_generator, train_steps, valid_steps, test_steps
# Path to your CSV file
csv_path = "/kaggle/input/aug-balance-unet-csv/isic2016_balanced_groundtruth0002 (1).csv"
# Load the data splits in batches
train_batches, valid_batches, test_batches, train_steps, valid_steps, test_steps = load_split_data(csv_path, batch_size=32, target_size=(256, 256))
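For the encoder question, one hedged sketch is to use the Hugging Face TF port of SegFormer as a feature extractor and wire its multi-scale hidden states into a UNet-style decoder. Note that SegFormer builds its own overlapping patch embeddings internally, so you do not create patches yourself; you feed full images. The checkpoint name is a placeholder, the hidden-state layout (channels-first vs. channels-last) should be verified for your transformers version, and pretrained checkpoints usually expect ImageNet-style normalization rather than plain [0, 1] scaling.

import tensorflow as tf
from transformers import TFSegformerModel

encoder = TFSegformerModel.from_pretrained("nvidia/mit-b0")  # placeholder checkpoint

def segformer_unet(input_shape=(256, 256, 3), num_classes=1):
    images = tf.keras.Input(shape=input_shape)
    # HF vision models take channels-first pixel_values.
    pixel_values = tf.transpose(images, [0, 3, 1, 2])
    outputs = encoder(pixel_values, output_hidden_states=True)
    # Features at strides 4/8/16/32 serve as the UNet skip connections.
    # Assumption: hidden states come back channels-first; drop the transpose if
    # your transformers version already returns channels-last tensors.
    skips = [tf.transpose(h, [0, 2, 3, 1]) for h in outputs.hidden_states]
    x = skips[-1]
    for skip in reversed(skips[:-1]):
        x = tf.keras.layers.UpSampling2D(2)(x)
        x = tf.keras.layers.Concatenate()([x, skip])
        x = tf.keras.layers.Conv2D(skip.shape[-1], 3, padding="same", activation="relu")(x)
    x = tf.keras.layers.UpSampling2D(4)(x)  # back to full input resolution
    mask = tf.keras.layers.Conv2D(num_classes, 1, activation="sigmoid")(x)
    return tf.keras.Model(images, mask)

For the later classification idea, the simplest route is to keep segmentation and classification as two separate models and run the classifier on the original image (or on the image masked by the predicted segmentation), rather than merging them into one network.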
I haven't found any classes online yet; do you have books/articles/YouTube videos to recommend? Thanks!
I am trying to submit a paper, and I think the venues with upcoming deadlines are CVPR workshops and ICCP. Are there other options, and how hard is it to get into a CVPR workshop?
I'm currently deciding on a research topic in computer vision and deep learning for AVs, and I want to know more topics that might be useful for the field. So far, all I can think of are
and I need help exploring more. Also, any advice on doing research in this area or working in the field would be appreciated.
P.S. I have interests in all areas (more specifically, vision for autonomous systems, not just cars) and I need to pick one for graduate studies and research applications. Any advice on how to pick a research topic and advisor helps as well.
Hello, I would be curious to know what you think the major future directions of computer vision will be, i.e. those that will gain momentum within 5 to 10 years.
I've open-sourced a product I was developing and selling for a while (Roadometry VTC):
https://github.com/asfarley/vtc_lfs
https://www.youtube.com/@roadometry2011
This is a desktop Windows application capable of counting traffic in video footage. The application uses Darknet YOLO and a network trained on my own training set.
The application generates count reports and a SQLite database.
Users can modify the types of objects detected by providing their own Darknet files.
The application uses Multiple Hypothesis Tracking combined with a trained network for detection. I started working on a tool-chain for training a network to perform combined detection and association, but I don't have the time to finish it now. I think combined detection/association is probably the next step for a big performance increase.
So I'm a graphic designer who has lately been thinking about the possibility of digitizing newspapers or other publications.
Once the typefaces are identified, that could increase the accuracy of the algorithm.
For example, the most-used one in the titles is this one https://fontsinuse.com/typefaces/31793/inserat-grotesk-schmal-aurora-grotesk-viii and I'm not sure about the body text one, but I know this publication used the ubiquitous Times New Roman years later, though before the fully digital era.
My idea is to have two products: one would be a plain-text version for lookup, like a digital version of the newspaper; the second would be a PDF that reads more comfortably, without the analog problems of wonky lines. I'm not sure how feasible it would be for CV and some AI like LLaVA to infer an idealized version of the layout, i.e. 7 equally sized columns (the whole-picture idea), titles that sometimes span two of them, etc.
I think at least getting the text without hyphens at line breaks and without double spaces, with an understanding of the text flow, etc., would go a long way toward keeping this from becoming a gargantuan task.
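For the plain-text product, a basic OCR pass plus a de-hyphenation step already gets surprisingly far. A minimal sketch with Tesseract, assuming the scan is deskewed; the file name and language code are placeholders:

import re
import pytesseract
from PIL import Image

page = Image.open("scan_page_01.png")   # hypothetical scanned page

# psm 3 = fully automatic page segmentation (handles multi-column layouts reasonably).
text = pytesseract.image_to_string(page, lang="spa", config="--psm 3")

# Join words hyphenated across line breaks and collapse repeated spaces.
text = re.sub(r"-\n(\w)", r"\1", text)
text = re.sub(r"[ \t]+", " ", text)
print(text)

For the column-aware PDF version, pytesseract.image_to_data gives word-level boxes that can be grouped into columns before reflowing the text.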
Hello!
I am doing object detection/tracking with PyTorch CNN models for a project. I want my model to output a variable number of classifications and bounding-box results per frame, but I have only ever worked with fixed output sizes, meaning my model can currently only detect a fixed number of objects per frame.
Model input: images
Desired model output: variable list of detected objects with (classification, bounding box info)
Does anyone know how variable output would be incorporated?
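In practice, detectors keep a fixed-size set of candidate predictions internally and the variable-length result comes from confidence thresholding and non-maximum suppression. A minimal sketch with a torchvision detector, which already returns a variable-length list of boxes per image (the threshold value and dummy input are assumptions):

import torch
import torchvision

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

images = [torch.rand(3, 480, 640)]           # dummy frame; replace with real tensors in [0, 1]
with torch.no_grad():
    outputs = model(images)

for out in outputs:
    keep = out["scores"] > 0.5                # confidence threshold is an assumption
    boxes = out["boxes"][keep]                # shape (N, 4); N varies from frame to frame
    labels = out["labels"][keep]
    print(len(boxes), "detections in this frame")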
Hey everyone,
I'm curious: how do you all handle image compression in your computer vision projects?
What are your go-to techniques for compressing images without sacrificing too much detail? Do you have any tips or tricks to share?
I'm particularly interested in how compression affects the performance of different computer vision models. Has anyone experimented with different compression algorithms and their impact on accuracy?
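A quick way to run that experiment yourself is to re-encode frames at several JPEG quality levels, measure size and PSNR, and feed each decoded version to your model to compare accuracy. A minimal sketch with OpenCV (the test image path is a placeholder):

import cv2

img = cv2.imread("sample.jpg")               # hypothetical test image

for quality in (95, 75, 50, 25):
    ok, buf = cv2.imencode(".jpg", img, [cv2.IMWRITE_JPEG_QUALITY, quality])
    decoded = cv2.imdecode(buf, cv2.IMREAD_COLOR)
    psnr = cv2.PSNR(img, decoded)
    print(f"quality={quality}: {buf.size / 1024:.1f} KiB, PSNR={psnr:.2f} dB")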
I developed a model that can perform detections on video streams, and I want to deploy it in practical scenarios on CCTV. Let's say I take the stream input as RTSP and create a pipeline; I want a rough estimate of how much it will cost me to deploy this model on AWS. To give a broader perspective, the end result would look like, say, a website where you can put in your RTSP link and it starts performing detections and gives you the detection outputs. How do I calculate these costs?
I'm hoping someone here has experience with working w/ VR headset APIs. Both Apple and soon Meta have an API to access the front camera of their respective headsets, while other brands (HP Reverb series, Vive) appear to outright have no access. I understand this is largely for privacy reasons, but I am working on a project with important applications in VR (fast optical flow / localization stuff) and I would really benefit from access to the camera streams in any inside-out tracked VR headset.
If I cannot access these streams directly, I think I will attempt to simulate these cameras and their fovs in a CGI environment. This endeavor would benefit from documentation of their relative positions and FOVs (which of course vary from headset to headset).
TL;DR - Know of any dev-friendly VR headsets with an open api for the inside out cameras? Alternatively, any headset with documented inside-out camera intrinsics/relative extrinsics for calibration or simulation?
I am trying to set up my system using DeepStream.
I have 70 live camera streams and 2 models (action recognition, tracking), and my system is
an RTX 4090 with 24 GB VRAM running Ubuntu 22.04.5 LTS.
I don't know where to start.
I'm wondering what the best model to deploy is for a detection tool I'm working on. I'm currently working with the YOLO-NAS-S architecture to minimise latency, but am I right in saying the quantised version of YOLO-NAS-L should perform just as well from a latency point of view, but with improved accuracy?
I wonder whether it's possible to keep this camera on 24 hours a day over PoE. I'm using an a2A640-240gmSWIR,
and this camera gets hot very easily: it reaches 60-69 °C quite quickly.
I have 4 fisheye cameras, one located at each corner of a car. I want to stitch the outputs of the cameras to create a 360-degree surround view. So far, I have managed to undistort the fisheye cameras, but with a loss in FoV. Because of that, my system fails when stitching, since the intersection region of the cameras may not contain enough features to match. Are there any other methods that you might propose? Thanks
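One thing to check first is how the undistortion is parameterized: with OpenCV's fisheye module, the balance parameter trades cropping against retained FoV, so a value near 1.0 keeps most of the original field of view (at the cost of more stretching near the borders) and preserves the overlap regions needed for stitching. A minimal sketch, assuming K and D are the calibrated intrinsics and distortion coefficients for one camera:

import cv2
import numpy as np

def undistort_keep_fov(img, K, D, balance=0.9):
    h, w = img.shape[:2]
    # balance close to 1.0 keeps most of the fisheye FoV in the rectified image.
    new_K = cv2.fisheye.estimateNewCameraMatrixForUndistortRectify(
        K, D, (w, h), np.eye(3), balance=balance)
    map1, map2 = cv2.fisheye.initUndistortRectifyMap(
        K, D, np.eye(3), new_K, (w, h), cv2.CV_16SC2)
    return cv2.remap(img, map1, map2, interpolation=cv2.INTER_LINEAR)

For surround-view systems specifically, the usual alternative to feature matching is a calibrated approach: project each camera onto a common ground plane via extrinsics (an inverse perspective mapping / bird's-eye view) and blend the overlaps, which avoids relying on features in the seam regions.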
I joined my uni's research program as an undergrad. My team of 3 and I have to collect numerous videos of objects under different lighting conditions etc. and run NeRF/3DGS on them. As you know, researchers don't maintain their code after their papers are accepted, so I have to run Docker containers to make it work. (Btw, nerfstudio doesn't work at all.) There are also so many bugs that I have to read the code and fix them myself. The two other team members don't contribute at all. They literally don't do anything, so I have to do all of this by myself. Now my professor is asking us to read through the code and insert custom camera positions to test novel view synthesis near the ground-truth image. This means I would have to understand the abysmal code the researchers wrote better than they do themselves... She is also asking me to get PSNR/LPIPS etc., but that part won't be too hard.
I asked my prof if this is going to be published, and she told me the project lacks novelty so it probably won't be. She told me this project is for her to better understand these models, and that's it.
I was originally interested in 3D reconstruction and novel view synthesis, but this project is making me hate it. It's just grunt work with no real novel ideas to try, and it's eating up so much of my time. I recently talked to the professor I really want to work with, and he told me he will let me into his lab if I do well in his class next semester. I'm worried this project, which I have no passion for anymore, will waste too much time that would be better spent doing well in that class.
What do you think? Should I put in 20+ hours/week and the entirety of my winter break for a project that only serves to enhance my professor's practical knowledge, with absolutely no help from teammates?
* I am on a time crunch, so I don't have enough time to build firm knowledge of the models' foundations; I just skim through the papers multiple times to understand the code.
Currently, I am working on a project to recover license plates given many initial inputs. Can someone suggest some source code or related articles?
Thanks everyone
I'm a seasoned software developer but a latecomer to the computer vision field. It's been just a few weeks, but I'm truly loving it. Here is an important question I have in mind; I couldn't find an "obvious" answer, so I wanted to check what the community thinks in general.
What do you think about the future of inference? Is it reasonable to expect that inference will mostly move to the edge, or will it remain mostly cloud-based? For example, for robotics it pretty much has to be on the edge, right? But for web use cases, probably mostly cloud?
I'm guessing that there wouldn't be a one-size-fits-all answer to this question but I'm hoping to see some informative replies about different aspects to it.
Here is an example sub-question:
I'm thinking of creating a custom-trained object detection/classification model for a certain industry, and I'm going to create a mobile app for users. I'm guessing the common practice would be "use local inference whenever possible"; is this correct? But if the implementation is easier or more feasible, cloud inference can be used as well? So the sub-question is: what are the main questions you ask when deciding on an inference infrastructure for a project?