/r/computervision
Computer Vision is the scientific subfield of AI concerned with developing algorithms to extract meaningful information from raw images, videos, and sensor data. This community is home to the academics and engineers both advancing and applying this interdisciplinary field, with backgrounds in computer science, machine learning, robotics, mathematics, and more.
We welcome everyone from published researchers to beginners!
Content which benefits the community (news, technical articles, and discussions) is valued over content which benefits only the individual (technical questions, help buying/selling, rants, etc.).
If you want an answer to a query, please post a legible, complete question with enough detail for us to help you properly!
I developed a model that performs detections on video streams, and I want to deploy it in practical scenarios on CCTV. Say I take the stream input as RTSP and build a pipeline: I want a rough estimate of how much deploying this model on AWS will cost me. To give a broader perspective, the end result would look something like a website where you put in your RTSP link and it starts performing detections and returning the detection outputs. How do I calculate these costs?
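For concreteness, this is the kind of back-of-the-envelope arithmetic I have in mind; the instance type, hourly rate, and throughput numbers below are illustrative placeholders, not real quotes:

```python
# Rough AWS cost sketch for GPU inference over RTSP streams.
# All numbers are assumptions -- measure your model's throughput and check current pricing.

streams = 10                 # concurrent RTSP streams
fps_per_stream = 15          # frames actually sent to the model per stream
model_fps_capacity = 300     # frames/sec one GPU instance can sustain (benchmark this!)
gpu_hourly_rate = 0.526      # e.g. a g4dn.xlarge on-demand rate, placeholder only
hours_per_month = 730

instances_needed = -(-streams * fps_per_stream // model_fps_capacity)  # ceiling division
compute_cost = instances_needed * gpu_hourly_rate * hours_per_month

print(f"GPU instances: {instances_needed}, compute/month: ${compute_cost:,.2f}")
# On top of this: data transfer in/out, storage of detection outputs,
# and whatever hosts the website / load balancer.
```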
I'm hoping someone here has experience working with VR headset APIs. Both Apple and, soon, Meta have an API to access the front camera of their respective headsets, while other brands (HP Reverb series, Vive) appear to have no access at all. I understand this is largely for privacy reasons, but I am working on a project with important applications in VR (fast optical flow / localization stuff) and I would really benefit from access to the camera streams in any inside-out tracked VR headset.
If I cannot access these streams directly, I think I will attempt to simulate these cameras and their FoVs in a CGI environment. This endeavor would benefit from documentation of their relative positions and FoVs (which of course vary from headset to headset).
TL;DR - Know of any dev-friendly VR headsets with an open api for the inside out cameras? Alternatively, any headset with documented inside-out camera intrinsics/relative extrinsics for calibration or simulation?
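In case it helps others weighing the simulation route: given a documented FoV and image resolution, the pinhole intrinsics follow directly. A minimal sketch (the FoV and resolution values are made-up placeholders for one simulated inside-out camera):

```python
import numpy as np

def intrinsics_from_fov(width, height, hfov_deg, vfov_deg):
    """Build a pinhole camera matrix K from image size and field of view."""
    fx = width / (2.0 * np.tan(np.radians(hfov_deg) / 2.0))
    fy = height / (2.0 * np.tan(np.radians(vfov_deg) / 2.0))
    return np.array([[fx, 0.0, width / 2.0],
                     [0.0, fy, height / 2.0],
                     [0.0, 0.0, 1.0]])

# Placeholder values -- real headsets would need documented specs.
K = intrinsics_from_fov(width=640, height=480, hfov_deg=100.0, vfov_deg=80.0)
print(K)
```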
I am trying to set up my system using DeepStream. I have 70 live camera streams and two models (action recognition and tracking), and my system is an RTX 4090 with 24 GB VRAM running Ubuntu 22.04.5 LTS.
I don't know where to start.
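To make the question concrete: is something along these lines the right direction? A heavily simplified sketch driving the DeepStream GStreamer elements through `Gst.parse_launch`; the URIs, resolution, and the nvinfer config path are placeholders, and the real setup would add a tracker element and batch all 70 sources:

```python
import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst, GLib

Gst.init(None)

# Placeholder RTSP sources -- in practice this list would hold all 70 streams.
uris = ["rtsp://camera-01/stream", "rtsp://camera-02/stream"]

# nvstreammux batches frames from every source so nvinfer runs the model
# once per batch instead of once per stream.
desc = (
    f"nvstreammux name=mux batch-size={len(uris)} width=1280 height=720 live-source=1 "
    f"! nvinfer config-file-path=detector_config.txt "
    f"! nvvideoconvert ! nvdsosd ! fakesink sync=false "
)
for i, uri in enumerate(uris):
    desc += f"uridecodebin uri={uri} ! mux.sink_{i} "

pipeline = Gst.parse_launch(desc)
pipeline.set_state(Gst.State.PLAYING)
GLib.MainLoop().run()
```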
I'm wondering what the best model to deploy is for a detection tool I'm working on. I'm currently working with the YOLO-NAS-S architecture to minimise latency, but am I right in saying the quantised version of YOLO-NAS-L should perform just as well from a latency point of view, but with improved accuracy?
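I suspect the honest answer is "measure it on the target hardware", so here is the generic PyTorch timing loop I'd use to compare the two variants; the model-loading line is a placeholder for however you instantiate YOLO-NAS:

```python
import time
import torch

def measure_latency(model, input_size=(1, 3, 640, 640), warmup=20, iters=100):
    """Average forward-pass latency in milliseconds on the model's device."""
    device = next(model.parameters()).device
    x = torch.randn(*input_size, device=device)
    model.eval()
    with torch.no_grad():
        for _ in range(warmup):
            model(x)                      # warm-up passes, not timed
        if device.type == "cuda":
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            model(x)
        if device.type == "cuda":
            torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters * 1000.0

# model = ...  # load YOLO-NAS-S or the quantised YOLO-NAS-L here
# print(f"{measure_latency(model):.2f} ms per frame")
```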
I wonder whether this camera can stay powered on 24 hours a day over PoE. I'm using an a2A640-240gmSWIR,
and it gets hot very easily, reaching 60-69 degrees Celsius quite quickly.
I have four fisheye cameras, one located at each corner of a car. I want to stitch the camera outputs to create a 360-degree surround view. So far, I have managed to undistort the fisheye images, but with a loss in FoV. Because of that, my system fails when stitching them, since the overlap region between cameras may not contain enough features to match. Are there any other methods you might propose? Thanks
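For reference, this is roughly how I'm undistorting now; my understanding is that the `balance` parameter in OpenCV's fisheye module controls how much of the original FoV is kept (K and D are my calibrated intrinsics and distortion coefficients, omitted here):

```python
import cv2
import numpy as np

def undistort_keep_fov(img, K, D, balance=1.0):
    """Undistort a fisheye image; balance=1.0 keeps the full field of view."""
    h, w = img.shape[:2]
    new_K = cv2.fisheye.estimateNewCameraMatrixForUndistortRectify(
        K, D, (w, h), np.eye(3), balance=balance)
    map1, map2 = cv2.fisheye.initUndistortRectifyMap(
        K, D, np.eye(3), new_K, (w, h), cv2.CV_16SC2)
    return cv2.remap(img, map1, map2, interpolation=cv2.INTER_LINEAR)
```

I'm also wondering whether projecting all four views onto a common ground plane using the known extrinsics (a bird's-eye IPM) would sidestep the feature-matching problem entirely.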
I joined my uni's undergraduate research program. My team of three and I have to collect numerous videos of objects under different lighting conditions, etc., and run NeRF/3DGS on them. As you know, researchers don't maintain their code after their papers are accepted, so I have to use Docker to get anything running (by the way, nerfstudio doesn't work at all). There are also so many bugs that I have to read the code and fix them myself. The two other team members don't contribute at all; they literally do nothing, so I have to do everything by myself. Now my professor is asking us to read through the code and insert custom camera positions to test the novel view synthesis near the ground-truth images. This means I would have to understand the researchers' abysmal code better than they do themselves... She is asking me to get the PSNR/LPIPS etc., but that part won't be too hard.
I asked my prof if this is going to be published and she told me that this project lacks novelty so it probably won't be published. She told me this project is for her to better understand these models and that's it.
I was originally interested in 3D reconstruction and novel view synthesis, but this project is making me hate it. It is just grunt work with no real novel ideas to try, and it's eating up so much of my time. I recently talked to the professor I really want to work with, and he told me he will let me into his lab if I do well in his class next semester. I am worried this project, which I no longer have any passion for, will waste too much time that would be better spent on doing well in that class...
What do you think? Should I put in 20+ hours/week and the entirety of my winter break for the project that only serves to enhance the practical knowledge of my professor with absolutely no help from teammates?
* I am on a time crunch, so I don't have enough time to build a firm knowledge of the models' foundations; I just skim the papers multiple times to understand the code.
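For what it's worth, the metric part really is the easy bit; something like this should cover it, assuming the rendered and ground-truth images are float arrays in [0, 1] and the `lpips` package is installed:

```python
import torch
import lpips
from skimage.metrics import peak_signal_noise_ratio

lpips_fn = lpips.LPIPS(net="alex")  # downloads the backbone weights on first use

def evaluate_pair(pred, gt):
    """pred, gt: HxWx3 float arrays in [0, 1]. Returns (PSNR, LPIPS)."""
    psnr = peak_signal_noise_ratio(gt, pred, data_range=1.0)
    # LPIPS expects NCHW tensors scaled to [-1, 1].
    to_tensor = lambda a: torch.from_numpy(a).permute(2, 0, 1)[None].float() * 2 - 1
    with torch.no_grad():
        lp = lpips_fn(to_tensor(pred), to_tensor(gt)).item()
    return psnr, lp
```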
Currently, I am working on a project to recover license plates, with many initial inputs. Can someone suggest some source code or related articles?
Thanks everyone
Looking for a tool to extract text from images or PDFs using OCR. What do you use?
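For context, the baseline I'm aware of is Tesseract via `pytesseract`, plus `pdf2image` (which needs poppler installed) to rasterize PDFs first, roughly like this:

```python
import pytesseract
from pdf2image import convert_from_path  # requires poppler on the system
from PIL import Image

def ocr_image(path):
    """Plain OCR of a single image file."""
    return pytesseract.image_to_string(Image.open(path))

def ocr_pdf(path):
    """Rasterize each PDF page, then OCR them one by one."""
    pages = convert_from_path(path, dpi=300)
    return "\n".join(pytesseract.image_to_string(p) for p in pages)
```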
I'm a seasoned software developer but a pretty late comer to the computer vision field. It's been just a few weeks, but I'm truly loving it. Here is an important question I have in mind; I couldn't find an "obvious" answer, so I wanted to check what the community thinks in general.
What do you think about the future of inference? Is it reasonable to expect that inference will mostly move to the edge, or will it remain mostly cloud based? For example, for robotics it pretty much has to be on the edge somehow, right? But for web use cases, probably mostly cloud?
I'm guessing that there wouldn't be a one-size-fits-all answer to this question but I'm hoping to see some informative replies about different aspects to it.
Here is an example sub-question:
I'm thinking of creating a custom-trained object-detection/classification model for a certain industry, and I'm going to create a mobile app for users to use it. I'm guessing the common practice would be "use local inference whenever possible"; is this correct? But if the implementation is easier/more feasible, cloud inference can be used as well? So the sub-question is: what are the main questions you ask when deciding on an inference infrastructure for a project?
I'm a second year PhD student doing my research in animal behavioral genetics. My project has taken a turn where most of my job is to now create models that can extract behavioural features from videos of animals interacting.
I want to stay in this field in the future and try to make it in academia/industry but focusing on computer vision for animal behavior quantification.
What skills should I make sure to acquire and where in the world should I try to apply for jobs after my PhD?
Anyone else who went through the same thing as me? All advice is welcome!
Thanks in advance!
I work for a health-insurance adjacent company in the United States. As a part of our workflow, we need to ingest claims reports from various insurance providers. The meat of these PDF-formatted reports is the “Claims” table.
The “Claims” table has a “row” for each claimant. However, this “row” is actually two rows. For each claimant, the top row has cells with basic identifying information. The bottom row is actually a second subtable with one or more claims.
I’ve included a mockup in Excel of what a “Claims” table might look like. Actual claims reports are PDFs with headers/footers/page breaks/etc. The mockup has two claimants: the first claimant has two claims, the second has just one.
What approach can I take to extract claim and claimant data from this nested table? I’ve tried form-parsing software like Azure Form Recognizer/Document Intelligence, but these solutions don’t support nested tables. I also tried a multi-modal LLM-based approach with Claude but got terrible results.
Do I need to build a custom parser for this? Maybe use a Python OCR library to iterate over rows and apply a different parsing model to the claimant info and the claims info?
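A custom parser is the direction I'm leaning toward; something like this sketch with `pdfplumber`, where claimant rows and claim sub-rows are told apart by a hypothetical layout cue such as indentation (the x-position threshold below is a made-up placeholder):

```python
import pdfplumber
from collections import defaultdict

CLAIMANT_X0 = 40  # hypothetical: claimant rows start further left than claim sub-rows

def parse_claims(pdf_path):
    claimants = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            # Group words into visual lines by their vertical position.
            lines = defaultdict(list)
            for word in page.extract_words():
                lines[round(word["top"])].append(word)
            for top in sorted(lines):
                words = sorted(lines[top], key=lambda w: w["x0"])
                text = [w["text"] for w in words]
                if words[0]["x0"] < CLAIMANT_X0:   # claimant header row
                    claimants.append({"claimant": text, "claims": []})
                elif claimants:                    # claim sub-row under the last claimant
                    claimants[-1]["claims"].append(text)
    return claimants
```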
TIA
Hi guys.
I am developing a model that detects upper-body fidgeting movements while sitting (e.g., swaying left to right, or rocking back and forth on a chair). The goal of the project is to see whether a student shows some ADHD-related behaviors.
Currently, I am thinking of using a rule-based system with a threshold on the left and right shoulder landmarks, but this approach does not seem to work for larger-framed students. I want to ask Reddit for help and hopefully find the right approach.
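The direction I'm considering to handle different body sizes is normalizing the lateral movement by each person's own shoulder width, e.g. with MediaPipe Pose; the 0.15 threshold below is just a placeholder I would tune:

```python
import mediapipe as mp

mp_pose = mp.solutions.pose
pose = mp_pose.Pose()

def normalized_sway(landmarks, baseline_mid_x):
    """Lateral shoulder-midpoint displacement, scaled by the person's shoulder width."""
    ls = landmarks[mp_pose.PoseLandmark.LEFT_SHOULDER]
    rs = landmarks[mp_pose.PoseLandmark.RIGHT_SHOULDER]
    shoulder_width = abs(ls.x - rs.x) + 1e-6   # person-specific scale factor
    mid_x = (ls.x + rs.x) / 2.0
    return (mid_x - baseline_mid_x) / shoulder_width

# Per frame (after results = pose.process(rgb_frame)):
#   sway = normalized_sway(results.pose_landmarks.landmark, baseline_mid_x)
# Flag fidgeting when |sway| repeatedly crosses ~0.15 within a short window
# (placeholder threshold, to be tuned on real recordings).
```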
Thank you for reading
Hello everyone, I am working on a project where I have to label data to segment lodging and weeds in sugarcane crops. I have some drone images, and I am trying to label them using traditional computer vision techniques, e.g. using the height data available from the drone imagery to highlight lodged crops. I am still not getting very accurate results due to the varying height of the crops, and now I feel stuck.
I would love to know if anyone has done something similar and how they overcame this issue.
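The workaround I'm currently experimenting with is comparing each pixel's height to the local canopy height instead of using one global threshold, roughly like this; it assumes a canopy-height/DSM raster already loaded as a float array, and the window size and threshold are placeholders:

```python
from scipy.ndimage import median_filter

def lodging_mask(height, window=201, drop_threshold=0.5):
    """Flag pixels that sit well below the surrounding canopy height.

    height: 2D float array (canopy height model from the drone survey), in metres.
    window: neighbourhood size used as the local canopy reference (placeholder).
    drop_threshold: how far below the local canopy a pixel must be to count as lodged.
    """
    local_canopy = median_filter(height, size=window)   # robust local reference height
    return (local_canopy - height) > drop_threshold
```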
...try to characterize the images within a dataset that their model performs better (or worse) on compared to other models?
For all we know, the authors' contributions might have a huge impact on images with certain characteristics and hurt performance on other images... but we never see any analysis of that.
This seems so obvious...why can't I find any papers on the topic?
Hi everyone!
I was checking out some cool computer vision projects recently (like that insane DALL-E training data and those medical imaging projects) and wow, the number of images needed for training these models is mind-blowing!
How do you all handle this? How much data are you typically working with? What's your setup for managing it all? I'm really curious about the costs too - do you use cloud storage or have your own setup? What kind of challenges do you run into with all this data?
Would love to hear your stories and solutions!
Hey guys, I was given a dataset of several different types of construction cracks, and I need to create a model that identifies each one. I’m a beginner in CV, and none of the images are labeled.
The goal is to take this to production. I have a background in ML and in backend development using FastAPI, but what algorithm should I use for such a use case, and what do I need to consider when deploying a project like this to production?
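On the serving side, the FastAPI part I have in mind is roughly this; the model artifact and the label names are placeholders for whatever classifier I end up training:

```python
import io

import torch
from fastapi import FastAPI, File, UploadFile
from PIL import Image
from torchvision import transforms

app = FastAPI()
model = torch.jit.load("crack_classifier.pt").eval()   # placeholder model artifact
CLASSES = ["transverse", "longitudinal", "alligator"]  # placeholder label names

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])

@app.post("/predict")
async def predict(file: UploadFile = File(...)):
    image = Image.open(io.BytesIO(await file.read())).convert("RGB")
    with torch.no_grad():
        logits = model(preprocess(image).unsqueeze(0))
    return {"crack_type": CLASSES[int(logits.argmax())]}
```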
Hey Redditors,
I’ve been exploring a simple and effective way to implement eye-gaze tracking using deep learning. My goal is to use this system to control a wheelchair, making it more accessible for individuals with limited mobility.
I am new to Python, so I’m still learning the basics. If anyone has advice or suggestions on how to start this project, or even some Python-friendly libraries or tutorials that could help, please let me know!
I’d love to hear your thoughts, feedback, or any resources you think would be useful. Let’s make mobility accessible for everyone!
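From what I've gathered so far, MediaPipe Face Mesh with iris refinement might be a reasonable starting point; is something like this rough sketch on the right track? The landmark indices and the gaze-ratio thresholds are my guesses and would need verifying and calibrating:

```python
import cv2
import mediapipe as mp

face_mesh = mp.solutions.face_mesh.FaceMesh(refine_landmarks=True, max_num_faces=1)

# With refine_landmarks=True, indices 468-477 are the iris landmarks.
# The pairing below (iris centre vs. that eye's corners) is my best reading of the
# landmark map -- verify before relying on it.
IRIS_CENTER = 468
EYE_OUTER, EYE_INNER = 33, 133

cap = cv2.VideoCapture(0)
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    results = face_mesh.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    if results.multi_face_landmarks:
        lm = results.multi_face_landmarks[0].landmark
        # Horizontal position of the iris within the eye: 0 = outer corner, 1 = inner corner.
        ratio = (lm[IRIS_CENTER].x - lm[EYE_OUTER].x) / (
            lm[EYE_INNER].x - lm[EYE_OUTER].x + 1e-6)
        # Placeholder mapping: ratio < 0.4 -> look one way, > 0.6 -> the other (needs calibration).
    cv2.imshow("gaze", frame)
    if cv2.waitKey(1) & 0xFF == 27:  # Esc to quit
        break
cap.release()
```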
How can I do this task? Can anyone guide me, please?
I am working on a CV problem where I have to detect all the faces present in an image...
I've attached an example image for reference... I have exactly these kinds of images: there are photos in a camera roll, and I have to extract the faces from them.
Please suggest a way to achieve this with good accuracy and speed. Also, should I do any image preprocessing steps?
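To show what I've tried so far: the stock OpenCV Haar cascade works as a quick baseline, though I suspect a DNN-based detector would be needed for crowded photos like this:

```python
import cv2

# Baseline face detector shipped with OpenCV (Haar cascade).
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def extract_faces(image_path, out_prefix="face"):
    img = cv2.imread(image_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    gray = cv2.equalizeHist(gray)              # simple preprocessing for uneven lighting
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5,
                                     minSize=(40, 40))
    for i, (x, y, w, h) in enumerate(faces):
        cv2.imwrite(f"{out_prefix}_{i}.jpg", img[y:y + h, x:x + w])
    return len(faces)
```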
I am new to all this, and I have to work on a project that entails landmark and object detection via satellite images.
Any advice and guidance on courses, videos, or code would be really appreciated; I'm really at a loss.
Hello,
I'm looking for recommendations on a project we are working on.
We need to feed the direction and the location of the objects with low precision (5 cm, 5 degree error is okay) into a central computing unit at 30+ FPS.
We are thinking about using
I need feedback about
Thank you guys!
Long time in the making, but I finally found time to finish writing the last post on this project. This final post is about adding a Kalman filter to improve tracking and to smooth and stabilize the projection:
https://bitesofcode.wordpress.com/2024/11/30/augmented-reality-with-python-and-opencv-part-3/
Code: https://github.com/juangallostra/augmented-reality/tree/kf-tracking/src
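For anyone who just wants the flavour of it without reading the full post: a generic constant-velocity `cv2.KalmanFilter` for smoothing a tracked 2D point looks roughly like this (a generic sketch, not the exact code from the repo):

```python
import numpy as np
import cv2

# Constant-velocity Kalman filter for one 2D point: state = [x, y, vx, vy].
kf = cv2.KalmanFilter(4, 2)
kf.transitionMatrix = np.array([[1, 0, 1, 0],
                                [0, 1, 0, 1],
                                [0, 0, 1, 0],
                                [0, 0, 0, 1]], dtype=np.float32)
kf.measurementMatrix = np.array([[1, 0, 0, 0],
                                 [0, 1, 0, 0]], dtype=np.float32)
kf.processNoiseCov = np.eye(4, dtype=np.float32) * 1e-3
kf.measurementNoiseCov = np.eye(2, dtype=np.float32) * 1e-1

def smooth(point_xy):
    """Predict, then correct with the new noisy measurement; return the filtered point."""
    kf.predict()
    est = kf.correct(np.array(point_xy, dtype=np.float32).reshape(2, 1))
    return float(est[0, 0]), float(est[1, 0])
```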
Hi everyone.
So, I am confused about this mAP metric.
Let's consider AP@50. Some sources say that I have to label my predictions, regardless of any confidence threshold, as TP, FP, or FN (with respect to the IoU threshold, of course), then sort them by confidence. Next, I start at the top of the sorted table and compute the accumulated precision and recall by adding predictions one by one. This gives me a set of (precision, recall) pairs. After that, I must compute the area under the PR curve traced out by those pairs (for each class).
And then for mAP@0.5:0.95:0.05, I do the steps above for each IoU threshold and compute their mean.
Others, on the other hand, say that I have to compute precision and recall at every confidence threshold, for every class, and compute the AUC over those points. For example, I take thresholds from 0.1:0.9:0.1, compute precision and recall for each class at these points, and then average them. This gives me 9 points to build a curve, and I simply compute the AUC after that.
Which one is correct?
I know KITTI uses one thing, VOC another, and COCO something totally different, but they all agree on what AP is at its core. So which of the above is correct?
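To illustrate the first interpretation (sort all predictions by confidence, accumulate TP/FP, integrate the PR curve), which is how I understand VOC/COCO-style AP for one class at a fixed IoU threshold:

```python
import numpy as np

def average_precision(scores, is_tp, num_gt):
    """AP for one class at one IoU threshold.

    scores: confidence of every prediction (no confidence cut-off).
    is_tp:  1 if the prediction matched a previously unmatched GT box at the
            IoU threshold, else 0.
    num_gt: number of ground-truth boxes for this class.
    """
    order = np.argsort(-np.asarray(scores))
    tp = np.asarray(is_tp, dtype=float)[order]
    fp = 1.0 - tp
    cum_tp, cum_fp = np.cumsum(tp), np.cumsum(fp)
    recall = cum_tp / max(num_gt, 1)
    precision = cum_tp / (cum_tp + cum_fp)
    # All-point interpolation: area under the precision envelope vs. recall.
    mrec = np.concatenate(([0.0], recall, [1.0]))
    mpre = np.concatenate(([0.0], precision, [0.0]))
    mpre = np.maximum.accumulate(mpre[::-1])[::-1]
    idx = np.where(mrec[1:] != mrec[:-1])[0]
    return float(np.sum((mrec[idx + 1] - mrec[idx]) * mpre[idx + 1]))
```

As far as I can tell, mAP@0.5:0.95 then just averages this over the ten IoU thresholds and over the classes.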
EDIT: Seriously guys? not a single comment?
I have a camera module (attached to eyewear) that scans text (using OCR?). For example, if the scanned text is "abc", I need to output the ASCII value of each letter, i.e. "a = 97, b = 98, c = 99".
P.S. It needs to be fast.
As the title says, I need a step-by-step approach, since I am very new to this project.
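The software side I have in mind is basically this; speed would have to come from running on small cropped frames and a fast OCR engine, and this only shows the OCR-to-ASCII step with pytesseract:

```python
import pytesseract
from PIL import Image

def text_to_ascii(image_path):
    """OCR the image, then map each alphanumeric character to its ASCII code."""
    text = pytesseract.image_to_string(Image.open(image_path)).strip()
    return [(ch, ord(ch)) for ch in text if ch.isalnum()]

# e.g. scanned text "abc" -> [('a', 97), ('b', 98), ('c', 99)]
```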
I trained a YOLOv7 model to detect damage on roads. I want to publish my work on my GitHub, but looking at it, it's not really my code. The only things I changed were the necessary config files to point to my training set, plus one Python file with a single line of code to run the model.
Should I just publish my results folder? If I upload the whole project, it looks like I'm stealing someone else's work, since I barely changed anything.
I need a way to extract the circled text (in the REV box there will usually be a single letter or number). The boxes will often change size and position. The structure of the text will also change along with position, size, and font. Generally, the text will always be in the bottom right.
Problem is that I cannot rely on keywords, positions or regex.
I’m using Tesseract and OpenCV, but I am open to other tools (I can only use Azure for cloud computing).
I’m just looking for suggestions on how y’all would tackle this. I am a beginner.
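The rough direction I'd take unless someone has a better idea: crop the bottom-right corner, detect the small rectangular cells with contours, and OCR each cell with a single-line Tesseract mode. A sketch, where the crop fraction and cell-size filters are placeholders I'd tune per drawing:

```python
import cv2
import pytesseract

def read_bottom_right_cells(image_path, crop_frac=0.25):
    img = cv2.imread(image_path)
    h, w = img.shape[:2]
    roi = img[int(h * (1 - crop_frac)):, int(w * (1 - crop_frac)):]  # bottom-right corner

    gray = cv2.cvtColor(roi, cv2.COLOR_BGR2GRAY)
    binary = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_MEAN_C,
                                   cv2.THRESH_BINARY_INV, 15, 10)
    contours, _ = cv2.findContours(binary, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)

    results = []
    for c in contours:
        x, y, cw, ch = cv2.boundingRect(c)
        if 20 < cw < roi.shape[1] // 2 and 15 < ch < 100:   # placeholder cell-size filter
            cell = gray[y:y + ch, x:x + cw]
            text = pytesseract.image_to_string(cell, config="--psm 7").strip()
            if text:
                results.append(((x, y), text))
    return results
```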