/r/computervision
Computer Vision is the scientific subfield of AI concerned with developing algorithms to extract meaningful information from raw images, videos, and sensor data. This community is home to the academics and engineers both advancing and applying this interdisciplinary field, with backgrounds in computer science, machine learning, robotics, mathematics, and more.
We welcome everyone from published researchers to beginners!
Content which benefits the community (news, technical articles, and discussions) is valued over content which benefits only the individual (technical questions, help buying/selling, rants, etc.).
If you want an answer to a query, please post a legible, complete question that includes details so we can help you in a proper manner!
Hi everyone!
I recently wrote an article "Computer Vision 101" for beginners curious about computer vision. It's a guide that breaks down foundational concepts, practical applications, and key advancements in an easy-to-understand way.
I'd appreciate it if you could review this and share your thoughts on the content and structure or suggestions for improvement. Constructive criticism is welcome!
Read "Computer Vision 101" Here
Let me know:
• Does the article flow well, or do parts feel disjointed?
• Are there any key concepts or topics you think I should include?
• Any tips on making it more engaging or beginner-friendly?
Thanks so much for your time and feedback; it means a lot!
I got a NAS (I use a UGREEN DXP6800) as my on-prem, self-hosted solution to manage the datasets and training files for my projects, and it works really well. Here's how it goes:
If you're dealing with small-team storage or general storage issues and want to level up your efficiency, you can definitely try a NAS.
So I recently upgraded from a 3070 Ti to a 3090 for the extra VRAM to train my transformer networks. I know that the 3090 has almost 1.5 times more CUDA and Tensor cores than the 3070 Ti, along with somewhat higher core and memory clocks.
However, even with increased batch sizes, I am not seeing a meaningful reduction in training time with the new setup. I therefore suspect other components in my rig might be the bottleneck:
Asus X570 TUF Wifi plus
Ryzen 7 3800XT
Corsair Vengeance LPX 2x16 GB 3000 MHz
650 W PSU (I know it should be higher, but would it affect performance?)
The code runs from a 256 GB Samsung SATA SSD (probably does not matter)
I see that the RTX 3090 is fully utilized in Task Manager. The 3D graph is maxed out, but memory is not, since I enable memory growth to prevent pre-allocation of the entire 24 GB. The CPU holds steady at around 14% utilization.
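In case it helps anyone reading along: "memory growth" suggests TensorFlow, so here is a minimal sketch of that setting (an assumption about the framework, which the post does not state). If the GPU shows full 3D utilization but training barely speeds up, it is also worth checking whether a CPU-side data pipeline is hiding the extra compute.

```python
# Minimal sketch, assuming TensorFlow (the framework is not stated in the post).
import tensorflow as tf

# Allocate GPU memory on demand instead of reserving the full 24 GB up front.
for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)

# A prefetching input pipeline helps rule out a CPU-side data bottleneck.
# `train_ds` is a placeholder for your tf.data.Dataset:
# train_ds = train_ds.prefetch(tf.data.AUTOTUNE)
```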
Do you guys think that upgrading a specific component in my rig would boost my training speeds, or am I at the point of diminishing returns?
Thanks!
What accuracy do you obtain, without pretraining?
other interesting datasets?...
When I add more parameters, it simply overfits without generalizing on test and val.
I've tried scheduled learning rates and albumentations (data augmentation).
I use a standard vision transformer (the one from the original paper):
https://github.com/lucidrains/vit-pytorch
thanks
EDIT: you can't go beyond that when training from scratch on CIFAR-100:
"With CIFAR-100, I was able to get to only 46% accuracy across the 100 classes in the dataset."
https://medium.com/@curttigges/building-the-vision-transformer-from-scratch-d77881edb5ff
https://github.com/s-chh/PyTorch-Scratch-Vision-Transformer-ViT
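For reference, here is a minimal sketch of how a small ViT from the linked vit-pytorch repo can be configured for 32x32 CIFAR-100 inputs. The width/depth values below are illustrative placeholders, not the poster's configuration or a tuned recipe.

```python
# Minimal sketch: a small ViT for CIFAR-100 using lucidrains/vit-pytorch.
# dim/depth/heads are illustrative, not tuned values.
import torch
from vit_pytorch import ViT

model = ViT(
    image_size=32,    # CIFAR-100 images are 32x32
    patch_size=4,     # 8x8 = 64 patches per image
    num_classes=100,
    dim=256,
    depth=6,
    heads=8,
    mlp_dim=512,
    dropout=0.1,
    emb_dropout=0.1,
)

logits = model(torch.randn(8, 3, 32, 32))  # shape: (8, 100)
```

Small patch sizes matter at this resolution: the default ImageNet-style 16x16 patches would leave only 4 tokens per image.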
Hi everyone,
As part of a project my friend and I are working on, I created a PyTorch Video Dataset Loader capable of loading videos directly for model training. While it's naturally slower than pre-extracting video frames, our goal was to create a loader that simplifies the process by skipping the intermediate frame-extraction step on the user's end.
To test its performance, I used a dataset of 2-3 second videos at 1920x1080 resolution and 25 fps. On average, the loader took 0.7 seconds per video. Reducing the resolution to 1280x720 and the frame rate to 12 fps improved the loading speed to 0.4 seconds per video. Adjusting these parameters is straightforward, requiring only a few changes during dataset creation.
Hardware Note: These benchmarks were measured on my setup.
One interesting potential use case is adapting this loader for live video recognition or classification due to its fast loading speed. However, I haven't explored this possibility yet and would love to hear your thoughts on its feasibility.
I'm looking for feedback and suggestions to improve this loader. If you're curious or have ideas, please take a look at the project here: PyTorch Video Dataset Loader
Thanks in advance for your input!
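For readers who want a feel for the general pattern (this is a generic sketch, not the author's loader), here is what decoding videos on the fly inside a PyTorch Dataset can look like; the paths, labels, and resize target are placeholders.

```python
# Generic sketch of on-the-fly video loading in a PyTorch Dataset.
# Not the author's implementation; paths and sizes are placeholders.
from torch.utils.data import Dataset
from torchvision.io import read_video
import torchvision.transforms.functional as F

class OnTheFlyVideoDataset(Dataset):
    def __init__(self, video_paths, labels, size=(720, 1280)):
        self.video_paths = video_paths
        self.labels = labels
        self.size = list(size)

    def __len__(self):
        return len(self.video_paths)

    def __getitem__(self, idx):
        # Decode the whole clip at load time instead of pre-extracting frames.
        frames, _, _ = read_video(self.video_paths[idx], pts_unit="sec")  # (T, H, W, C), uint8
        frames = frames.permute(0, 3, 1, 2).float() / 255.0               # (T, C, H, W), in [0, 1]
        frames = F.resize(frames, self.size)                              # downscale for the model
        return frames, self.labels[idx]
```

In practice the decode step dominates the per-item time, which is consistent with the benchmark numbers above.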
Hi All,
I'm doing a research project and I am looking for a model that can determine and segment an object based on its material ("this part looks like metal" or "this bit looks like glass" instead of "this looks like a dog"). I'm having a hard time getting results from Google Scholar for this approach. I wanted to check: 1) whether there is a specific term for the type of inference I am trying to do, 2) whether there are any papers anyone could cite that would be a good starting point, and 3) whether there are any publicly available datasets for this type of work. I'm sure I'm not the first person to try this, but my "googling chops" are failing me here.
Thanks!
I need help implementing the methodology exactly as it is described in the research paper. The link to the research paper is this:
https://ieeexplore.ieee.org/document/10707662
1. Utilize YOLOPose for Transfer Learning in FLD
Apply YOLOPose to achieve Facial Landmark Detection (FLD). YOLOPose, which combines object detection with keypoint regression, can be adapted for real-time facial keypoint detection tasks.
2. Focus on Eye and Mouth Keypoints for Fine-tuning
Extract eye and mouth keypoints from the FLDs.
Use EAR (Eye Aspect Ratio) and MAR (Mouth Aspect Ratio) to determine states such as eye closure and yawning, which can be indicators of drowsiness or fatigue.
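For reference, the EAR/MAR computations in step 2 reduce to simple distance ratios over a handful of keypoints. Below is a minimal sketch using the common 6-point eye convention; the exact landmark indices depend on the keypoint layout of your model, so treat them as placeholders.

```python
# Minimal sketch of EAR / MAR from 2D keypoints. Landmark ordering follows the
# common 6-point eye convention (p1..p6 around the eye contour); indices are
# placeholders that must match your model's keypoint layout.
import numpy as np

def eye_aspect_ratio(eye):
    # eye: (6, 2) array of eye landmarks
    v1 = np.linalg.norm(eye[1] - eye[5])   # upper/lower lid, inner pair
    v2 = np.linalg.norm(eye[2] - eye[4])   # upper/lower lid, outer pair
    h = np.linalg.norm(eye[0] - eye[3])    # eye corners
    return (v1 + v2) / (2.0 * h)

def mouth_aspect_ratio(mouth):
    # mouth: (4, 2) array: left corner, right corner, top mid, bottom mid
    v = np.linalg.norm(mouth[2] - mouth[3])
    h = np.linalg.norm(mouth[0] - mouth[1])
    return v / h

# Example (thresholds to be tuned on your data): count the eyes as closed if
# EAR < 0.2, and a yawn if MAR > 0.6, each sustained over several frames.
```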
We have to design a CNN model, then train and fine-tune it.
I am at a very crucial stage of my project where I have to complete it within the stipulated time, and I don't know what to do. I've asked ChatGPT and so on, but it was no use.
I am pasting the methodology screenshots of the stem, head, backbone, and bottleneck of the model.
This is the overall framework I have to design for the CNN model.
As the title says, I want to know how hard or easy it is to get a job (in this job market) in computer vision without prior computer vision work experience and without a PhD, just with academic experience.
I'm trying to extract data from an image which has a simple table. I was already able to detect the table in the image (OpenCV). My question is: how should I continue in order to detect the cells and extract the text/numbers from each one?
Does anyone have an idea or a solution?
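One common pipeline, sketched below: recover the table's grid lines with morphology, treat the enclosed regions as cells, and OCR each crop with pytesseract. The kernel sizes, thresholds, and file path are illustrative and would need tuning to the actual image.

```python
# Rough sketch of a classic OpenCV + Tesseract cell-extraction pipeline.
# Kernel sizes, thresholds, and the input path are placeholders.
import cv2
import pytesseract

img = cv2.imread("table.png")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
binary = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_MEAN_C,
                               cv2.THRESH_BINARY_INV, 15, 10)

# Isolate horizontal and vertical rulings, then merge them into the grid.
h_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (40, 1))
v_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, 40))
grid = cv2.add(cv2.morphologyEx(binary, cv2.MORPH_OPEN, h_kernel),
               cv2.morphologyEx(binary, cv2.MORPH_OPEN, v_kernel))

# Contours of the grid include the enclosed cell regions; filter by size.
contours, _ = cv2.findContours(grid, cv2.RETR_TREE, cv2.CHAIN_APPROX_SIMPLE)
for c in contours:
    x, y, w, h = cv2.boundingRect(c)
    if w > 20 and h > 15:  # skip tiny artifacts and line fragments
        cell = gray[y:y + h, x:x + w]
        text = pytesseract.image_to_string(cell, config="--psm 7").strip()
        print((x, y, w, h), text)
```

If the table has no visible rulings, splitting on whitespace projections (row/column pixel histograms) is the usual fallback.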
I'm currently designing and sourcing parts for a robot that picks frisbees up off the ground and moves them to another location. I'll be using it so I can practice throwing by myself, kinda like a rebound machine but for frisbees.
I plan to use SLAM with a front + rear camera as well as an IMU to localize the robot within the field (I believe this combination is usually called VIO). I have a few concerns about this approach and was wondering if anyone might be willing to offer their input.
Any thoughts or ideas are welcome!
I am confused about what to do next. Here is a brief introduction.
I am a second-year undergraduate and I have been learning deep learning (specifically computer vision) for the past 8 months or so. I have a good grasp of coding and basic data-related work like EDA, cleaning, etc. I have been following computer vision for about the past 3 months. I have the theoretical basics covered for topics like CNNs, attention, etc., and I have also implemented a paper (not the full paper) about a model that fine-tunes a Stable Diffusion model, uses it to generate images, trains an image recognition model on those images, and shows that performance improves. Now I don't know what to do next: should I reach out to a professor for a research internship, should I go for an industry internship, or should I start writing a research paper? Please guide me.
Exploring HQ-SAM
https://debuggercafe.com/exploring-hq-sam/
In this article, we will explore HQ-SAM (High Quality Segment Anything Model), one of the derivative works of SAM.
The Segment Anything Model (SAM) by Meta revolutionized the way we think about image segmentation, moving from a hundred thousand mask labels to more than a billion mask labels for training. By going from class-specific to class-agnostic segmentation, it paved the way for new possibilities. However, the very first version of SAM had its limitations, which in turn led to innovative derivative works like HQ-SAM. HQ-SAM will be our primary focus in this article, and we will absorb as much detail as possible from the released paper.
Need help.
Does anybody know of a model that can achieve this?
Hi guys,
I am working on a personal project where I need to calculate the depth value of each pixel as if the camera were oriented top-down, starting from an existing depth map that was captured while the phone was tilted (non-zero pitch and roll angles).
I can recreate the scene 3D points (in camera coordinates) with these equations:
X = (u - cx) * depth_map / fx
Y = (v - cy) * depth_map / fy
Z = depth_map
So do I now simply multiply the 3D points by the inverse of the rotation matrix to simulate the camera being normal to the capture plane?
I do have the camera intrinsic and extrinsic matrices.
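For illustration, here is a minimal numpy sketch of the two steps described above (back-project with the intrinsics, then rotate the points into a level frame). Whether you apply R or its transpose depends on the convention your extrinsics follow, and the variable names are placeholders, so treat this as a sketch rather than a drop-in answer.

```python
# Minimal sketch: back-project a depth map to camera-frame 3D points, then
# apply the inverse rotation to simulate a level (top-down) camera.
# The correct direction of R depends on how the extrinsics were defined.
import numpy as np

def backproject(depth_map, fx, fy, cx, cy):
    h, w = depth_map.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    X = (u - cx) * depth_map / fx
    Y = (v - cy) * depth_map / fy
    Z = depth_map
    return np.stack([X, Y, Z], axis=-1).reshape(-1, 3)   # (H*W, 3)

def apply_inverse_rotation(points, R):
    # For a pure rotation matrix the inverse is the transpose, so applying
    # R^-1 to row-vector points is simply points @ R.
    return points @ R

# points_level = apply_inverse_rotation(backproject(depth, fx, fy, cx, cy), R)
# The top-down depth of each point is then its new Z component; re-projecting
# with the intrinsics tells you which pixel it lands in (resampling/holes need
# handling, since the warped points no longer fall on a regular grid).
```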
So hello everyone. I am in my third year and I'm building a project using a Raspberry Pi 4B + Google Coral, the Raspberry Pi Camera Module 3, etc. I want to do object detection, detecting sunglasses in real time. I have used YOLO and TensorFlow Lite; they seem a little bit inaccurate, and I don't have a problem with that, but I have no experience with IoT projects and I need help. Am I doing something wrong? Will it come together? Anyone who has experience with this, please help me and suggest something.
I'm trying to infer monocular metric depth for outdoor scenes, and am struggling to obtain good results. Depth Anything V2 trained on virtual KITTI seems to clip everything above ~80 meters even though the training data extends to 655.35 meters.
Any ideas? I'm experimenting with scaling the output from the relative version of Depth Anything per the output from the metric version, within a range of values, but have not had much luck yet. Maybe linear scaling is inappropriate?
https://huggingface.co/depth-anything/Depth-Anything-V2-Metric-Outdoor-Large-hf
https://europe.naverlabs.com/research/computer-vision/proxy-virtual-worlds-vkitti-2/
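Regarding the scaling experiment described above: one hedged way to test it is to fit a scale and shift from the relative output to the metric output by least squares, using only pixels where the metric model is trusted (e.g. below the apparent ~80 m clip), and then extrapolate. The array names and the trusted range below are placeholders, and note that the relative model may output disparity-like values, in which case fitting in inverse depth behaves better than a linear fit in depth.

```python
# Hedged sketch: least-squares scale/shift alignment between a relative depth
# map and a metric depth map, fitted only where the metric output is trusted.
import numpy as np

def fit_scale_shift(relative, metric, max_trusted=70.0):
    mask = (metric > 0) & (metric < max_trusted)
    r = relative[mask].ravel()
    m = metric[mask].ravel()
    A = np.stack([r, np.ones_like(r)], axis=1)   # solve m ~= s * r + t
    (s, t), *_ = np.linalg.lstsq(A, m, rcond=None)
    return s, t

# s, t = fit_scale_shift(rel, met)
# extended = s * rel + t
# If `rel` is disparity-like, fit against 1.0 / met instead and invert at the end.
```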
Hey everyone,
Sorry if this has been asked a bunch of times before.
I wanted to ask the CV community if it's possible to measure a box from an angle.
I have hired someone to train an AI model, implement some measurement logic, and develop a Python app for this; however, we currently have a version that does detect a box but does not measure its dimensions accurately.
(It also has issues detecting the box through an AI model that was trained on 14k images.)
I just wanted to confirm if this concept is even possible with a single Luxonis OAK camera.
Alternatively, is mounting the camera to look down at the box (bird's-eye view) a better option to look into? (I suppose this may make it simpler.) This is what the developer wants to look into now.
Apologies if this is a half arsed question, I am new to the CV world and am still learning :)
I'd appreciate any pointers,
Thanks
UPDATE 1: Sooooo I looked into this more and I am convinced that a 3D angled view of a box should yield accurate results, so I'll put this out there: if any developers or hobbyists want to give this a shot, I'll be more than happy to message and see how we can make this happen!
Urgent: I am working on a face detection and recognition project. For the same code that runs locally on my system and also runs on a cloud instance, why is there a "difference in the number of records on Pinecone"? I am uploading the face embeddings for all faces in all the images to Pinecone.
Is it because of different CUDA/cuDNN versions? Or due to the difference in GPU? Or due to the versions of the dependencies? Or because locally it runs on Windows and the cloud instance runs on Linux?
Can't find a solution to this.
New Paper Alert!
Explainable Procedural Mistake Detection
With coauthors Shane Storks, Itamar Bar-Yossef, Yayuan Li, Zheyuan Zhang and Joyce Chai
Full Paper: http://arxiv.org/abs/2412.11927
Super-excited by this work! As y'all know, I spend a lot of time focusing on the core research questions surrounding human-AI teaming. Well, here is a new angle that Shane led as part of his thesis work with Joyce.
This paper poses the task of procedural mistake detection (in, say, cooking, repair, or assembly tasks) as a multi-step reasoning task that requires explanation through self-Q-and-A! The main methodology sought to understand how the impressive recent results from VLMs translate to task-guidance systems that must verify whether a human has successfully completed a procedural task, i.e., a task whose steps form an equivalence class of accepted "done" states.
Prior works have shown that VLMs are unreliable mistake detectors. This work proposes a new angle to model and assess their capabilities in procedural task recognition, including two automated coherence metrics that evolve the self-Q-and-A output by the VLMs. Driven by these coherence metrics, this work shows improvement in mistake detection accuracy.
Check out the paper and stay tuned for a coming update with code and more details!
[Link to hosted tool](https://bbelk.github.io/ARTagPlacementTool/)
We've all been there, wasting time manually placing AR tags into an image only to struggle later when trying to recreate the same layout. Measuring pixels or relying on approximations is frustrating and time-consuming, and can lead to inconsistent results.
Introducing ARTagPlacementTool, a minimalist image editor designed to simplify the generation and placement of AR tags, including ArUco, AprilTags, and QR codes. This tool allows you to generate markers, create layouts, and save time with its exporting and importing features. No more struggling with copying over marker images or messing with placement: you can instantly recreate exact marker setups for future use.
The application also includes a "finder" tool, allowing you to manually toggle cells to identify marker IDs. These can be markers you've seen in the wild, ones you've forgotten the ID of, or you can try creating any random marker image and seeing if there's a match out there (harder than you think!).
I developed ARTagPlacementTool to solve the common problem of manually placing tags, which often led to inconsistencies and wasted time. Whether you're working on personal projects or professional AR applications, this tool aims to enhance your workflow and maintain cohesive marker setups.
Give it a try and let me know your thoughts!
I'm currently working on a robot simulation developer talent marketplace.
Since computer vision is also in the robotics space, I'd love to chat for about 15 minutes to better understand the problem I'm aiming to solve.
If anyone is available to chat, kindly comment below or DM here on Reddit please.
Looking forward to hearing from you.
Best regards,
Eli
Hello,
I want to create a lightweight model, preferably one that can run in the browser, which can detect the "website logo" on websites from their "screenshots".
Example:
Here are the things that I have tried so far. It would be great if I could get some feedback on whether I am approaching this right.
I am exploring variants of YOLO for my use case. Since the model is trained on the COCO dataset, I won't be able to do zero-shot inference; I will have to fine-tune the model. I am using the Ultralytics API to train the model.
For the dataset, I could not find one on the internet that has website screenshots with annotated logos, so I am thinking of creating one myself for the top 100 websites. I am not sure if this data will be sufficient, but I can get started and see how the performance looks.
I am using Roboflow to annotate the images and then download the dataset so that I can train my YOLO model.
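For reference, the Ultralytics fine-tuning loop for a setup like this is short. The sketch below is illustrative only: the dataset YAML name, epochs, and image size are placeholders, and the YAML would point at the Roboflow export described above with a single "logo" class.

```python
# Minimal sketch of fine-tuning a small YOLO model on a custom logo dataset
# with the Ultralytics API. File names and hyperparameters are placeholders.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")            # small pretrained checkpoint
model.train(
    data="logos.yaml",                # custom dataset with one "logo" class
    epochs=50,
    imgsz=1280,                       # screenshots are large; logos are small
)
results = model.predict("screenshot.png", conf=0.25)
```

For in-browser inference, the trained weights can then be exported (e.g. to ONNX) and run with a web runtime, though that step is outside this sketch.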
My question is: is this the right approach, or is there a simpler approach to this problem?
Thank you!
I'm using YOLOv11s to detect soccer balls in a video, but it keeps detecting multiple false positives. How can I ensure it detects only one ball per frame?
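One common way to handle this, sketched below with the Ultralytics API: raise the confidence threshold and keep only the highest-scoring detection per frame (passing max_det=1 to predict is another option). The weights path, video path, and thresholds are placeholders.

```python
# Sketch: keep only the top-scoring ball detection per frame.
# Paths and thresholds are placeholders.
from ultralytics import YOLO

model = YOLO("best.pt")  # your trained YOLOv11s weights

for result in model.predict("match.mp4", conf=0.5, stream=True):
    boxes = result.boxes
    if len(boxes) == 0:
        continue
    best = boxes[int(boxes.conf.argmax())]   # highest-confidence detection only
    x1, y1, x2, y2 = best.xyxy[0].tolist()
    # ...track/draw the single ball here
```

If false positives persist, adding hard-negative frames (crowd, pitch markings, players' heads) to the training set usually helps more than post-filtering.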
Hi everyone.
We're working on a YOLOv8 model for object detection and we need help, preferably from someone experienced with YOLOv8 who has worked on image detection and OCR models.
Please drop me a message if you fit the above criteria.
Thanks.
Hello Everyone,
I'm currently working on a project that involves ray tracing features from multiple images. To give you an idea, I have around 100 images from which I've already generated a point cloud. Out of these images, I've identified a person appearing in approximately 20 frames. What I want to achieve now is to differentiate the point cloud associated with this person by assigning it a distinct color. This way, I can build a system where, if a user hovers their mouse over that specific part of the point cloud, it will display the corresponding human image that was identified.
I'm looking for advice or suggestions on how to approach this. Are there specific tools, libraries, or workflows that could help me effectively achieve this? Any tips or pointers would be greatly appreciated! Thank you in advance for your help!
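As a small illustration of the coloring step (not the association step, which is the harder part), here is a hedged Open3D sketch that paints a chosen subset of points; the cloud path and the person's point indices are placeholders for whatever your pipeline produces, e.g. by projecting the person's 2D detections into the cloud.

```python
# Hedged sketch: recolor a subset of a point cloud with Open3D.
# "cloud.ply" and person_indices are placeholders.
import numpy as np
import open3d as o3d

pcd = o3d.io.read_point_cloud("cloud.ply")
if pcd.has_colors():
    colors = np.asarray(pcd.colors)
else:
    colors = np.tile([0.6, 0.6, 0.6], (len(pcd.points), 1))  # default gray

person_indices = np.load("person_indices.npy")   # indices of the person's points
colors[person_indices] = [1.0, 0.0, 0.0]          # highlight the person in red

pcd.colors = o3d.utility.Vector3dVector(colors)
o3d.visualization.draw_geometries([pcd])
```

The hover-to-show-image interaction would then need a picking callback that maps the selected point index back to the frames where the person was detected.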
Has anyone had any success running ORB-SLAM3 with multiple cameras active at the same time? Reading the documentation, it doesn't seem to be supported natively, but it should be doable based on how the long-term loop closures work.