/r/computervision
Computer Vision is the scientific subfield of AI concerned with developing algorithms to extract meaningful information from raw images, videos, and sensor data. This community is home to the academics and engineers both advancing and applying this interdisciplinary field, with backgrounds in computer science, machine learning, robotics, mathematics, and more.
We welcome everyone from published researchers to beginners!
Content which benefits the community (news, technical articles, and discussions) is valued over content which benefits only the individual (technical questions, help buying/selling, rants, etc.).
If you want an answer to a query, please post a legible, complete question that includes details so we can help you properly!
Hey everyone,
I'm just getting started with Basler cameras for a computer vision project, and I'm pretty new to image acquisition. There are a lot of concepts I need to learn to properly set up the camera and environment for optimal results—like shutter speed, which I only recently discovered.
Does anyone know of any good courses or structured learning materials that cover image acquisition settings and techniques?
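For context, here's about where I am so far: a minimal pypylon sketch for setting exposure ("shutter speed" in machine-vision terms). The node names and values are assumptions that vary by camera model.

```python
from pypylon import pylon

# Connect to the first Basler camera found on the system
camera = pylon.InstantCamera(pylon.TlFactory.GetInstance().CreateFirstDevice())
camera.Open()

# "Shutter speed" = exposure time, in microseconds. Newer USB3/GigE models
# expose an ExposureTime node; some older ones use ExposureTimeAbs instead.
camera.ExposureTime.SetValue(5000.0)  # 5 ms

# Grab one frame to check the effect of the new setting
result = camera.GrabOne(1000)  # 1000 ms timeout
if result.GrabSucceeded():
    frame = result.Array  # numpy array
    print(frame.shape, frame.mean())
camera.Close()
```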
I'm calibrating my camera with a square (9×9) chessboard, but I have noticed that many articles use a rectangular (9×6) one. Does the shape matter for the quality of calibration?
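For reference, this is roughly the detection step I'm running (a minimal OpenCV sketch; the file name is a placeholder). Note that OpenCV counts inner corners, not squares.

```python
import cv2

# 9x6 inner corners: an asymmetric pattern avoids the 90-degree orientation
# ambiguity that a square 9x9 pattern has.
pattern_size = (9, 6)

img = cv2.imread("calib_frame.png")  # placeholder path
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

found, corners = cv2.findChessboardCorners(gray, pattern_size)
if found:
    # Refine to sub-pixel accuracy before passing to cv2.calibrateCamera
    criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 30, 1e-3)
    corners = cv2.cornerSubPix(gray, corners, (11, 11), (-1, -1), criteria)
```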
Given the following image
When using Harris corner detection (from scikit-image), it mostly got the result but missed the two center points, maybe because the angle is too wide for them to be considered corners.
The question is: can it be done with a corner-based approach, or should I detect lines instead? (I have tried some sample code but haven't gotten good results yet.)
Edit, additional info: the small line segment outside is a known-length reference so I can later calculate the area of the polygon.
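For reference, this is roughly what I'm running (a sketch; the file name, k, and thresholds are guesses to tune). Lowering k and the relative threshold should make shallow, wide-angle corners more likely to survive:

```python
from skimage import io, color
from skimage.feature import corner_harris, corner_peaks

img = color.rgb2gray(io.imread("polygon.png"))  # placeholder path

# Wide-angle corners give a weak Harris response; a smaller k and a lower
# relative threshold make them more likely to be detected.
response = corner_harris(img, k=0.02, sigma=2)
corners = corner_peaks(response, min_distance=10, threshold_rel=0.01)
print(corners)  # (row, col) coordinates of detected corners
```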
Hi, are there any algorithms you would recommend for placing wireframes on a person from a bird's-eye view? The algorithms I've tried so far don't seem that robust.
Given a batch of images with different sizes and aspect ratios, how do I form them into a batch, but:
- keeping the aspect ratio;
- with no padding?
Does anyone know a way to do this?
(Or how is Qwen2-VL able to do this?)
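One idea I'm looking at is NaViT/Qwen2-VL-style patch packing: instead of resizing or padding, cut each image into fixed-size patches and concatenate all patch sequences into one long batch, with an index tensor recording which patch belongs to which image. A rough sketch (patch size and image shapes are just examples):

```python
import torch

patch = 14  # example patch size

def patchify(img):  # img: (C, H, W), with H and W divisible by patch
    c, h, w = img.shape
    p = img.unfold(1, patch, patch).unfold(2, patch, patch)
    return p.permute(1, 2, 0, 3, 4).reshape(-1, c * patch * patch)

images = [torch.rand(3, 224, 308), torch.rand(3, 140, 140)]  # different sizes, no resize
tokens = [patchify(im) for im in images]
packed = torch.cat(tokens)  # (total_patches, C * patch * patch)
image_id = torch.cat([torch.full((t.shape[0],), i) for i, t in enumerate(tokens)])
# packed + image_id can then feed a transformer with a block-diagonal attention mask.
```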
Hello everyone.
For context, I am currently working on a project about evaluating SfM methods in various ways, and one of them is something new to me called novel view synthesis.
I am exploring NeRF and Gaussian Splatting but I am not sure which is the best approach in the context of novel view synthesis evaluation.
Does anyone have any advice or experience in this area?
I'm building a tool where you upload photos of a bunch of video games, and GPT-4 extracts the title of each game from the image. Then it gets price data for each game.
I'm running into a problem and need some help. When the image contains too many games, GPT-4 starts to perform poorly. I've found that when I manually crop those same images and send in just one game at a time, it's perfect.
How can I do pre-processing so that it will crop or segment each game and increase the accuracy? Is there a good service for this?
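One approach I'm considering is SAM's automatic mask generator to crop out each game before sending it to GPT-4; a rough sketch (checkpoint path, file name, and area thresholds are assumptions):

```python
import cv2
from segment_anything import SamAutomaticMaskGenerator, sam_model_registry

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
mask_gen = SamAutomaticMaskGenerator(sam, min_mask_region_area=5000)

image = cv2.cvtColor(cv2.imread("games.jpg"), cv2.COLOR_BGR2RGB)
crops = []
for m in mask_gen.generate(image):
    x, y, w, h = m["bbox"]  # XYWH in pixels
    if w * h > 0.01 * image.shape[0] * image.shape[1]:  # skip tiny fragments
        crops.append(image[y:y + h, x:x + w])
# Send each crop to GPT-4 individually instead of the whole shelf photo.
```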
Btw, here is the tool so you can see how it works:
https://frontend-production-bca1.up.railway.app/
Hi everyone. May I ask if it is possible to accelerate Stable Video Diffusion's single-video generation speed with multiple GPUs? I have been reading papers and trying to figure out this problem for a few days. The video generation process seems to follow a strong sequential order in both the denoising process and the frame generation sequence, making it impossible to accelerate by, say, using different GPUs to generate different frames.
It seems the only possibility is to accelerate the denoising process through something like tensor parallelism, but this also seems hard, since the U-Net is not a regular attention block (MLP + multi-head attention).
Does anyone have some related experience? Any suggestion helps. Thank you!
Hi All,
I've been following this channel for months, and have loved seeing the amazing work happening here. As someone deeply involved in synthetic data generation, I want to give back to this awesome community.
I work for a company that specializes in creating synthetic datasets, and I'm reaching out to understand exactly what you need. Our recent Pose Estimation dataset was created to help the community, and now we want to tackle the datasets that will truly move your projects forward.
Some areas we're particularly interested in exploring:
Your input is crucial. What datasets would make your CV work easier, faster, or more precise? What specific challenges are you facing in data collection?
https://huggingface.co/posts/DualityAI-RebekahBogdanoff/175052732651947 - This is the post we shared on HF to get more information.
For the requests that get traction, I will create and share the datasets on HF and our site. Drop in your requests; I'd love to help!
I see a lot of repositories from 5-6 years ago, such as FlowNet + semantic segmentation.
Does anyone know of any recent models that are temporal-consistent and open source for use? Thank you!
I've been following the rapid progress of LLMs with a mix of excitement and, honestly, a little bit of unease. It feels like the entire AI world is buzzing about them, and rightfully so: their capabilities are mind-blowing. But I can't shake the feeling that this focus has inadvertently cast a shadow on the field of Computer Vision.

Don't get me wrong, I'm not saying CV is dead or dying. Far from it. But it feels like the pace of groundbreaking advancements has slowed down considerably compared to the explosion of progress we're seeing in NLP and LLMs. Are we in a bit of a lull? I'm seeing so much hype around LLMs being able to "see" and "understand" images through multimodal models. While impressive, it almost feels like CV is now just a supporting player in the LLM show, rather than the star of its own.

Is anyone else feeling this way? I'm genuinely curious to hear the community's thoughts on this. Am I just being pessimistic? Are there exciting CV developments happening that I'm missing? How are you feeling about the current state of Computer Vision? Let's discuss! I'm hoping to spark a productive conversation.
Hi! Can anyone help me out in finding some weights trained to localize and classify blood cells for an RT-DETR based detection algorithm?
CV noob here. I may have to take a course in CV next, and I was wondering: is working in CV the same as working with graphical representations (like in games and animations, with rotation, translation, where you work with matrices, etc.)? I didn't really enjoy working with games and graphics, so if it's too much like that, then CV is not for me.
Hey everyone,
I'm working on an image classification and object detection model, but I'm running into issues with image reflections and dirty-camera artifacts (e.g., sand, dust, smudges). These distortions are causing a lot of false positives and impacting model performance.
I'm trying to add new data augmentation techniques to simulate these distortions, but the results are still not good.
Has anyone dealt with similar problems before? Do you know any other technique that can help me in this situation?
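For reference, this is the kind of pipeline I'm experimenting with (an albumentations sketch; the specific transforms and probabilities are guesses I'm still tuning):

```python
import cv2
import albumentations as A

transform = A.Compose([
    A.RandomSunFlare(p=0.3),  # glare / specular reflections
    A.Spatter(p=0.3),         # rain or mud-like blobs on the lens
    A.RandomFog(p=0.3),       # dusty haze
    A.GaussNoise(p=0.2),      # sensor noise
])

image = cv2.imread("frame.jpg")  # placeholder path
augmented = transform(image=image)["image"]
```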
I am new to ML and I'm making a project for vehicle detection using drone videos as input, at about 200 meters altitude, so I am thinking about which models I should train for this application. Processing is done after the flight. I am currently thinking of training YOLOv8x on the VisDrone data and later fine-tuning it on custom data after collecting it. The final output is going to be the entire trajectory of each vehicle in that video.
Can someone help me out: is this the correct direction, or do I need to train a different model? Accuracy is a priority. Please give some general advice on how you would approach this, or things I need to watch out for.
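For reference, the rough pipeline I have in mind (paths and settings are placeholders; VisDrone.yaml ships with ultralytics):

```python
from ultralytics import YOLO

model = YOLO("yolov8x.pt")
# Small objects at 200 m altitude -> train and infer at a large image size
model.train(data="VisDrone.yaml", imgsz=1280, epochs=100)

# Offline tracking over the recorded flight: each result carries persistent
# track IDs, from which per-vehicle trajectories can be accumulated.
results = model.track("flight.mp4", tracker="bytetrack.yaml", imgsz=1280)
trajectories = {}
for frame_idx, r in enumerate(results):
    if r.boxes.id is None:
        continue
    for box, tid in zip(r.boxes.xywh, r.boxes.id.int().tolist()):
        cx, cy = float(box[0]), float(box[1])  # box center in pixels
        trajectories.setdefault(tid, []).append((frame_idx, cx, cy))
```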
I have this image containing overlapping objects. I want to find out the mask of each object.
What I tried -
- SAM doesn't segment properly when given just the image. It segments properly when some points covering each part of the object are given as input along with the image.
- I trained YOLO and Detectron models on my data. YOLO doesn't even detect each object properly. Detectron detects and gives bounding boxes better than YOLO (but still not great) but fails in segmentation. I have a dataset of 100 images which I augmented into thousands of images to train the models.
- I could take the segmentation points from Detectron and give them to SAM as input along with the image. But Detectron doesn't segment precisely enough to cover each part of the overlapping objects, so SAM can't perform well.
Help me approach this problem. Any suggestions or links to research papers related to this are appreciated.
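One variation I'm considering (a sketch; checkpoint path, image path, and boxes are placeholders): feed the detector's bounding boxes, rather than its mask points, to SAM as box prompts, which SAM tends to handle more robustly for overlapping objects:

```python
import cv2
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("scene.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

detector_boxes = [(50, 60, 220, 300), (180, 90, 400, 330)]  # (x0, y0, x1, y1) from the detector
masks = []
for box in detector_boxes:
    m, scores, _ = predictor.predict(box=np.asarray(box), multimask_output=True)
    masks.append(m[np.argmax(scores)])  # keep the highest-scoring proposal
```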
ROS 2 packages on a Raspberry Pi? I don't get how it works. I have a project building a search and rescue robot using an OAK-D Pro 9782, and we're going to use Linux. Any suggestions?
BTW, any advice on how to categorize data types for a stereo depth camera? I'm a volunteer on a Senior Design Project and I don't understand what the professor is saying. Any assistance is appreciated, thank you!
DINOv2 for Semantic Segmentation
https://debuggercafe.com/dinov2-for-semantic-segmentation/
Training semantic segmentation models is often time-consuming and compute-intensive. However, with the powerful self-supervised DINOv2 backbones, we can drastically reduce the training compute and time. Using DINOv2, we can just add a semantic segmentation head on top of the pretrained backbone and train a few thousand parameters for good performance. This is exactly what we are going to cover in this article. We will modify the DINOv2 backbone, add a simple pixel classifier on top of it, and train DINOv2 for semantic segmentation.
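A minimal sketch of the idea (the head here is a deliberate simplification; the article's actual segmentation head may differ):

```python
import torch
import torch.nn as nn

# Frozen DINOv2 backbone + a small trainable pixel classifier on top
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
backbone.eval()
for p in backbone.parameters():
    p.requires_grad = False

num_classes = 21  # e.g. Pascal VOC
head = nn.Conv2d(384, num_classes, kernel_size=1)  # ViT-S/14 tokens are 384-dim

x = torch.rand(1, 3, 518, 518)  # H and W must be multiples of the 14-px patch
with torch.no_grad():
    tokens = backbone.forward_features(x)["x_norm_patchtokens"]  # (1, 37*37, 384)
grid = tokens.transpose(1, 2).reshape(1, 384, 37, 37)  # restore spatial layout
logits = head(grid)  # (1, num_classes, 37, 37); upsample to full resolution for the loss
```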
Hi,
I've been thinking a lot about this lately and wanted to get some insights on the current state of computer vision and image processing in medical imaging. I was recently offered an internship in cardiac video segmentation, but I'm wondering if this field still has a strong future given the rapid advancements in AI.
What are your favorite computer vision papers?
Gotta travel a bit and need something nice to read.
It can be any paper, including ones that are just nice and fun to read.
I want to start learning about vision transformers. What previous knowledge do you recommend to have before I start learning about them?
I have worked with and understand CNNs, and I am currently learning about text transformers. What else do you think I would need to understand vision transformers?
Thanks for the help!
Is it possible to use a Siamese Neural Network for finding all instances of a given template in an image? Similar to the Template Matching with Multiple Objects at the end of https://docs.opencv.org/4.x/d4/dc6/tutorial_py_template_matching.html but also output a score of how similar the object is to the template?
My thought is that SNNs are more robust to changes in object appearance / size / lighting than classical template matching. The actual objects I'm looking for are arbitrary items in a video feed. i.e. a person draws a bounding box around an object (one that may not be part of the training set of a normal object detector/classifier like YOLO) in one of the first frames, and then the tracker searches for that item in subsequent frames. I know there are trackers like DaSiamRPN which work pretty well as long as the object stays in the frame and is maybe only briefly occluded (I tested the implementation in OpenCV), but I want to account for the object completely leaving the frame, possibly for hundreds of frames.
I played around with DeepSORT, and it supposedly uses "Re-ID", but it seems they use re-id to mean performing association via visual similarity between a detected object and the tracks it's maintaining, as opposed to long-term re-id.
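What I have in mind is roughly this (a sketch: a pretrained ResNet-18 stands in for a properly trained Siamese branch, and the stride/threshold are guesses):

```python
import numpy as np
import torch
import torch.nn.functional as F
import torchvision.models as models
import torchvision.transforms as T

# Shared embedder scores sliding windows against the template by cosine
# similarity, so every window above a threshold is a match with a score.
embedder = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
embedder.fc = torch.nn.Identity()  # 512-d embedding instead of class logits
embedder.eval()
prep = T.Compose([T.ToPILImage(), T.Resize((128, 128)), T.ToTensor()])

def embed(crop):  # crop: HxWx3 uint8 numpy array
    with torch.no_grad():
        return F.normalize(embedder(prep(crop).unsqueeze(0)), dim=1)

frame = np.random.randint(0, 255, (480, 640, 3), dtype=np.uint8)  # placeholder frame
template = frame[100:180, 200:280]  # user-drawn bounding box
t_emb = embed(template)

matches = []
h, w = template.shape[:2]
for y in range(0, frame.shape[0] - h, 40):        # coarse stride; tune as needed
    for x in range(0, frame.shape[1] - w, 40):
        sim = float(embed(frame[y:y + h, x:x + w]) @ t_emb.T)
        if sim > 0.7:  # similarity threshold is a guess
            matches.append(((x, y, w, h), sim))
```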
Hi all, I have been getting only rejections from all the relevant CV/perception roles that I have been applying to. Some require PhDs or papers in top conferences. It seems like my resume might not be up to the mark.
So I would like to request an honest roast or review of the resume, along with any suggestions you have on improving my profile.
Thank you for your time. ANY SUGGESTION IS GREATLY APPRECIATED!
I am looking for a computer vision AI engineer to join me as a co-founder on a project that I am calling Augmented Endodontics.
The overarching goal is to take stereo image data of surgical microscope videos from root canal procedures, re-project the scene depth and use depth to register 3D CBCT model overlays. (see video)
I am a DDS/PhD and endodontist and have been working on this project for many years now. If you are interested in discussing the project further, please email me at jsimon.endo@gmail.com
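For context, the depth stage looks roughly like this (a minimal OpenCV sketch; file paths are placeholders and Q comes from your stereo calibration):

```python
import cv2
import numpy as np

left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

# Disparity from a rectified stereo pair of the microscope feed
matcher = cv2.StereoSGBM_create(minDisparity=0, numDisparities=128, blockSize=5)
disparity = matcher.compute(left, right).astype(np.float32) / 16.0  # SGBM output is fixed-point

Q = np.load("Q.npy")  # 4x4 reprojection matrix from cv2.stereoRectify
points_3d = cv2.reprojectImageTo3D(disparity, Q)  # HxWx3 metric point cloud
# points_3d then becomes the target for registering the 3D CBCT model overlay.
```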
Hello! I’m the founder of a YC backed company, and we’re trying to make it very easy and very cheap to train ML models. Right now we’re running a free beta and would love some of your feedback.
If it sounds interesting feel free to check us out here: https://github.com/tensorpool/tensorpool
TLDR; free GPUs😂
I'm basically looking to hear what worked for people with similar limitations. I can generate synthetic data for the task, but annotating the real data (a regression task which requires many sensors) is exorbitantly expensive, and might even be impractical due to the conditions of the setting.
I was thinking about using adversarial training as part of the architecture: an encoder with two heads, one for the target task and one to classify the domain of the image (synthetic vs. target domain), where we try to maximize the loss for the latter, with the goal that the encoder extracts only domain-invariant features that are used to compute the target.
But this feels outdated and maybe finicky, so I wondered if you guys could share from your experience.
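For concreteness, what I described is essentially DANN (Ganin & Lempitsky, 2015) with a gradient-reversal layer; a minimal sketch (network sizes are placeholders):

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad):
        return -ctx.lam * grad, None  # flip the gradient sign for the encoder

encoder = nn.Sequential(nn.Conv2d(3, 32, 3, 2, 1), nn.ReLU(),
                        nn.AdaptiveAvgPool2d(1), nn.Flatten())
regressor = nn.Linear(32, 1)   # target regression head
domain_clf = nn.Linear(32, 2)  # synthetic vs. real head

x = torch.rand(8, 3, 64, 64)  # placeholder batch
feat = encoder(x)
y_pred = regressor(feat)                           # trained on labeled synthetic data
d_pred = domain_clf(GradReverse.apply(feat, 1.0))  # trained on both domains; the encoder
                                                   # sees reversed gradients -> invariance
```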
Hi All, thanks a lot for this community; it has helped me a lot in coursework and at work as well. I am currently working on obstacle avoidance, using YOLOv9 for detection, and was trying to figure out which edge device to use so it can run models like YOLO and Depth Anything in parallel. I want to achieve a "good" FPS (considering maritime scenarios), so a lower FPS of 5 will do the job as well. I have looked at options but am unsure of the real-time performance with both models running. Any help relating to these aspects would be highly appreciated. Thanks a lot. I have some funding for this and can go up to $1500 for the edge device (GPU).
What is, in your experience, the best alternative to YOLOv8? I'm building a commercial project and need it to be under a free-use license, not AGPL. Looking for ease of use, training, and accuracy.
EDIT: It’s for general object detection, needs to be trainable on a custom dataset.