/r/computervision

Computer Vision is the scientific subfield of AI concerned with developing algorithms to extract meaningful information from raw images, videos, and sensor data. This community is home to the academics and engineers both advancing and applying this interdisciplinary field, with backgrounds in computer science, machine learning, robotics, mathematics, and more.

We welcome everyone from published researchers to beginners!

Content which benefits the community (news, technical articles, and discussions) is valued over content which benefits only the individual (technical questions, help buying/selling, rants, etc.).

If you want an answer to a question, please post a legible, complete question that includes enough detail for us to help you properly!

Related Subreddits

Computer Vision Discord group

Computer Vision Slack group

/r/computervision

103,238 Subscribers

9

What was the strangest computer vision project you’ve worked on?

What was the most unusual or unexpected computer vision project you’ve been involved in? Here are two from my experience:

  1. I had to integrate with a 40-year-old bowling alley management system. The simplest way to extract scores from the system was to use a camera to capture the monitor displaying the scores and then recognize the numbers with CV.
  2. A client requested a project to classify people by their MBTI type using CV. The main challenge: the two experts who prepared the training dataset often disagreed on how to type the same individuals.

What about you?

8 Comments
2024/11/16
06:45 UTC

4

what is a good approach for simple furniture recommender

Hey guys, I am trying to make a simple recommender that recommends a piece of furniture from a set of options for a particular spot in a room. I tried cosine similarity between the spot and the furniture, but it doesn't work as expected. What other approaches can I try?
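
If the spot/furniture vectors being compared are raw pixels or colour histograms, one quick thing to try is cosine similarity over learned image embeddings instead. Below is a minimal sketch assuming CLIP via Hugging Face transformers; the model name, file names, and the idea of cropping the empty spot out of the room photo are my assumptions, not a claim that this is the right approach for your data:

# Sketch only: rank furniture images against a crop of the room spot using CLIP embeddings.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed(paths):
    images = [Image.open(p).convert("RGB") for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return feats / feats.norm(dim=-1, keepdim=True)  # L2-normalise so dot product = cosine similarity

spot = embed(["room_spot_crop.jpg"])                       # hypothetical crop of the empty spot
candidates = embed(["chair.jpg", "sofa.jpg", "lamp.jpg"])  # hypothetical furniture catalogue

scores = (candidates @ spot.T).squeeze(1)  # cosine similarity per candidate
print(scores.argsort(descending=True))     # best matches first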

1 Comment
2024/11/16
02:03 UTC

1

Faster Image Reading Options for UHD Sources

I’m running into an issue with a CV project where we’re able to to read in HD ProRes files at 200-300 fps, but moving to UHD takes us down to ~20-30 FPS. This is with heavy multi threading on a M1 Ultra Mac Studio fully maxing out the processor on OpenCV. I’d expect 1/4th performance, but 1/10th seems a bit excessive.

Has anyone worked out a way to utilize the hardware Media Engine for ProRes acceleration. The same machine can read and write a UHD ProRes in Resolve at over 300fps. Processing the actual CV tasks is extremely fast (200+ FPS) after the read, but the read is super slow. Tried FFMPEG-Python, and it was comparable to multithreaded CV2.

If anyone has found a library that can utilize the Mac’s Media Engine to get a NP of the uncompressed frames, I’d be eternally grateful.
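
In case it's useful for comparison, here is a rough sketch of bypassing OpenCV entirely and piping hardware-decoded frames from the ffmpeg CLI into NumPy, using the videotoolbox hwaccel on Apple Silicon. Whether this actually engages the Media Engine for ProRes (rather than falling back to software decode) is an assumption you would need to benchmark, and the file name and frame size are placeholders:

# Rough sketch: decode with ffmpeg's VideoToolbox hwaccel and read raw RGB frames into NumPy.
import subprocess
import numpy as np

W, H = 3840, 2160  # UHD frame size
cmd = [
    "ffmpeg",
    "-hwaccel", "videotoolbox",    # request hardware decode on Apple Silicon
    "-i", "input_uhd_prores.mov",  # placeholder input file
    "-f", "rawvideo",
    "-pix_fmt", "rgb24",
    "pipe:1",
]
proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.DEVNULL, bufsize=10**8)

frame_bytes = W * H * 3
while True:
    buf = proc.stdout.read(frame_bytes)
    if len(buf) < frame_bytes:
        break
    frame = np.frombuffer(buf, dtype=np.uint8).reshape(H, W, 3)
    # ... hand `frame` to the CV pipeline here ...
proc.wait()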

0 Comments
2024/11/16
00:42 UTC

6

Which is the best traditional image segmentation algorithm?

I am trying Watershed and K-means with various pre-processing steps, but the segmentation is not up to the mark. For edge detection I used multi-scale edge detection with CLAHE and Canny, but the results are very poor.

The images are not very crowded, but not very simple either.

(This is just for learning purposes, exploring traditional CV segmentation methods.)
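
For a reference point, this is the classic OpenCV marker-based watershed recipe (Otsu threshold, morphology, distance transform, then cv2.watershed); the thresholds and kernel sizes below are only starting points to tune per image, not values tested on your data:

# Classic marker-based watershed pipeline in OpenCV.
import cv2
import numpy as np

img = cv2.imread("input.jpg")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
_, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

kernel = np.ones((3, 3), np.uint8)
opened = cv2.morphologyEx(binary, cv2.MORPH_OPEN, kernel, iterations=2)  # remove small noise
sure_bg = cv2.dilate(opened, kernel, iterations=3)                       # definite background

dist = cv2.distanceTransform(opened, cv2.DIST_L2, 5)
_, sure_fg = cv2.threshold(dist, 0.5 * dist.max(), 255, 0)               # definite foreground
sure_fg = sure_fg.astype(np.uint8)
unknown = cv2.subtract(sure_bg, sure_fg)                                 # region the watershed must decide

n_markers, markers = cv2.connectedComponents(sure_fg)
markers = markers + 1        # shift so background is 1, not 0
markers[unknown == 255] = 0  # unknown pixels get label 0

markers = cv2.watershed(img, markers)
img[markers == -1] = (0, 0, 255)  # watershed marks boundaries with -1
cv2.imwrite("segmented.jpg", img)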

3 Comments
2024/11/15
22:45 UTC

6

Homography for matching images of two sides of a football field

Hi, everyone. Hope you're doing well. I'm from Argentina and I'm doing some digital image processing applied to football. I want to extract information about the distance a player travelled during a match, and for that I took some videos of a football match. The camera couldn't capture the whole pitch, even with a fisheye lens, so I had to record two videos of the match.

https://preview.redd.it/h9jawa0k451e1.png?width=1351&format=png&auto=webp&s=026fc9aa3cec1662c18a889bcd29c30760a9e750

The main problem is that the two images warp a lot, which makes the post-processing very difficult.

I have two videos, from the left and right sides of a football pitch, and I want to join them. I used this image-stitching implementation:

https://github.com/OpenStitching/stitching

I load two images from the same instant of the match, run the feature detector, and then match the features from the two images. The next step is warping the images, and here is the main problem: the images bend a lot in order to match the features.

https://preview.redd.it/klvczhll451e1.png?width=1361&format=png&auto=webp&s=0d287e1a98d070b83ac7917f731eac479812cd81

The final result is like this:

https://preview.redd.it/bqza9scn451e1.png?width=1370&format=png&auto=webp&s=71370314d01c170685007215b82d6f0a7c09b8e0

I would like to know if there's an approach with less warping that I could use. I found a page that joins the two images with less distortion:

https://preview.redd.it/ap8x8w5r551e1.png?width=1415&format=png&auto=webp&s=0fdd45e104568cf2eba15e9e0db3d9f474800d17

I'm still not sure how they achieve that; any help would be much appreciated! Here is the code for the homography approach I described above:

https://github.com/agusrol/homography_football/blob/main/homography_final.ipynb
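
One possible explanation for the page that shows less distortion: instead of stitching image-to-image, each camera can be mapped separately onto a top-down pitch model using known pitch landmarks (corners, line intersections), and the two views merged in pitch coordinates, which avoids bending one image to fit the other. A rough sketch of that per-camera homography is below; the pixel coordinates, pitch dimensions, and landmark choice are placeholders, not values from your footage:

# Sketch: map one camera view onto a metric top-down pitch model with a homography.
import cv2
import numpy as np

# Known landmarks in the LEFT camera image (pixels)...
img_pts = np.array([[102, 480], [640, 300], [1210, 330], [900, 700]], dtype=np.float32)
# ...and the same landmarks in pitch coordinates (metres), e.g. corner flag, box corners, halfway line.
pitch_pts = np.array([[0.0, 0.0], [0.0, 68.0], [52.5, 68.0], [16.5, 13.85]], dtype=np.float32)

H, _ = cv2.findHomography(img_pts, pitch_pts, cv2.RANSAC, 3.0)

# Project a detected player's foot position into pitch coordinates:
player_px = np.array([[[850.0, 620.0]]], dtype=np.float32)
player_pitch = cv2.perspectiveTransform(player_px, H)
print(player_pitch)  # metres on the pitch; distances can then be accumulated frame to frame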

5 Comments
2024/11/15
22:44 UTC

2

Need advice on building a dataset for an image classifier

Hi everyone! I am kind of new to CV and I wanted to try my hand at building an image classifier for the original Pokémon starters (Bulbasaur, Charmander, and Squirtle). I started building my dataset by downloading images of every Pokémon card of the three starters, but there is a surprising lack of images for them: I only got 20-30 different images for each. I plan to use data augmentation to increase the dataset size, but I think that will still result in way too small a dataset.

If anyone had some advice on some ways I can help increase my dataset size that would be fantastic.

Note: I haven't included any pictures from the show or fan art yet, since I had some concerns about adding pictures like that to my dataset.
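
On the augmentation side, here is a minimal torchvision sketch of the kind of train-time transforms that tend to help with very small image sets; the specific ranges are only starting points, not values tuned for card artwork:

# Sketch of on-the-fly augmentation for a small card-image dataset.
from torchvision import transforms

train_transforms = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.7, 1.0)),  # random zoom/crop
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=15),
    transforms.ColorJitter(brightness=0.3, contrast=0.3, saturation=0.3, hue=0.05),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

Because the transforms are applied on the fly, every epoch sees a different random variant of each of the 20-30 source images, which stretches a tiny dataset further than pre-generating a fixed augmented copy.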

4 Comments
2024/11/15
21:32 UTC

1

UK Networking events, computer vision professionals?

Anyone been to any good networking events in the UK (especially London)? Cheers!

2 Comments
2024/11/15
17:05 UTC

1

Perspective N-Point for object pose estimation

Hi everyone! Sorry if my question is trivial, but I can't really understand this matter.

I know that Perspective-n-Point is very useful for finding the camera pose with respect to an object (and therefore the object pose relative to the camera), but I can't really work out how to automatically select the most important points to get 2D-3D correspondences and effectively use PnP.

I'll give an example: I have the 3D models of some objects, and I can manually select some points on them. Then, with a Blender add-on, I can select the same points on a photo and calculate the correct camera pose. However, I don't really understand how to automate this process (without manually choosing the points) to match 2D and 3D points. Some research pointed me to solutions like feature detection and matching in images, but I don't know how to implement them.
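
One common way to automate the correspondences is to take a reference view of the object in which each detected feature already has a known 3D coordinate (e.g. from a textured render of the model), match features between that reference and the new photo, and feed the resulting 2D-3D pairs to RANSAC PnP. A rough sketch with OpenCV; ref_kp_3d() is a hypothetical lookup from a reference keypoint to its 3D model point, and the intrinsics are placeholders:

# Sketch: automatic 2D-3D correspondences via feature matching + solvePnPRansac.
import cv2
import numpy as np

ref = cv2.imread("reference_view.png", cv2.IMREAD_GRAYSCALE)   # view with known 3D points per keypoint
query = cv2.imread("photo.jpg", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()
kp_ref, des_ref = sift.detectAndCompute(ref, None)
kp_query, des_query = sift.detectAndCompute(query, None)

matches = cv2.BFMatcher().knnMatch(des_ref, des_query, k=2)
good = [m for m, n in matches if m.distance < 0.75 * n.distance]  # Lowe's ratio test

obj_pts = np.array([ref_kp_3d(m.queryIdx) for m in good], dtype=np.float32)    # 3D model points (hypothetical lookup)
img_pts = np.array([kp_query[m.trainIdx].pt for m in good], dtype=np.float32)  # matched 2D points

K = np.array([[1000, 0, 640], [0, 1000, 360], [0, 0, 1]], dtype=np.float32)    # placeholder intrinsics
ok, rvec, tvec, inliers = cv2.solvePnPRansac(obj_pts, img_pts, K, None)
print(ok, rvec.ravel(), tvec.ravel())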

Thanks in advance!

4 Comments
2024/11/15
16:43 UTC

3

Improving Accuracy

https://preview.redd.it/79x0v7vn931e1.jpg?width=1912&format=pjpg&auto=webp&s=aa4759c53bf7f3de585ddd0f123654a9af5402d2

There are a handful of reasons why computer vision models achieve low Mean Average Precision (mAP) ratings. One way to overcome this challenge is by using synthetic image datasets.
But where do you find the best service provider?
Great news!
I just launched my online directory of service providers in the synthetic image data generation and simulation industry. (More listings coming this week)

It's free to browse all providers, and I would appreciate it if you could check it out and share your feedback with me.
Link: https://www.inkmanworkshop.com/
Thank you!
-Eli

0 Comments
2024/11/15
16:05 UTC

6

Badminton court detection using pytorch and resnet50

https://preview.redd.it/o4408rbjt21e1.png?width=3018&format=png&auto=webp&s=aa9134f589e537831d1ce12b8185386e47e8516d

Hi everyone,

I'm new to computer vision and I've been stuck on this problem for several hours.

I'm trying to detect 32 keypoints on a badminton court: all line intersections and the feet of the net. For that I used PyTorch with a pretrained ResNet50 model. I have a dataset of annotated images and I train the model like this:

import json

import cv2
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms

class KeypointsDataset(Dataset):
    def __init__(self, img_dir, data_file):
        self.img_dir = img_dir
        with open(data_file, "r") as f:
            self.data = json.load(f)
        
        self.transforms = transforms.Compose([
            transforms.ToPILImage(),
            transforms.Resize((224, 224)),
            transforms.ToTensor(),
            transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
        ])
    
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):
        item = self.data[idx]
        img = cv2.imread(f"{self.img_dir}/{item['id']}.jpeg")
        if img is None:
            raise ValueError(f"Image {item['id']} could not be loaded.")
        
        h, w = img.shape[:2]
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        img = self.transforms(img)

        # Ensure keypoints are consistent
        kps = np.array(item['kps']).flatten()
        if len(kps) != 64:  # Check that we have 32 keypoints (x, y)
            raise ValueError(f"Expected 64 keypoint values, but got {len(kps)} for item {item['id']}")
        
        kps = kps.astype(np.float32)
        kps[::2] *= 224.0 / w  # Adjust x coordinates
        kps[1::2] *= 224.0 / h  # Adjust y coordinates

        return img, kps

train_dataset = KeypointsDataset("data/images","data/data_train.json")
val_dataset = KeypointsDataset("data/images","data/data_val.json")

train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=8, shuffle=True)

from torchvision.models import resnet50, ResNet50_Weights

# Load the model with pretrained weights
weights = ResNet50_Weights.DEFAULT
model = resnet50(weights=weights)
model.fc = torch.nn.Linear(model.fc.in_features, 32*2)  # regress 32 (x, y) pairs

# Missing in the original snippet: define the device and move the model to it
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

criterion = torch.nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

epochs=100
best_val_loss = float("inf")
patience = 10  # Number of epochs to wait for improvement before stopping
trigger_times = 0

for epoch in range(epochs):
    model.train()  # switch back to training mode (model.eval() is set during validation below)
    for i, (imgs, kps) in enumerate(train_loader):
        imgs = imgs.to(device)
        kps = kps.to(device)

        optimizer.zero_grad()
        outputs = model(imgs)
        loss = criterion(outputs, kps)
        loss.backward()
        optimizer.step()

        if i % 10 == 0:
            print(f"Epoch {epoch}, iter {i}, loss: {loss.item()}")

    # Validation Phase
    model.eval()
    val_loss = 0.0
    with torch.no_grad():
        for imgs, kps in val_loader:
            imgs = imgs.to(device)
            kps = kps.to(device)

            outputs = model(imgs)
            loss = criterion(outputs, kps)
            val_loss += loss.item()

    val_loss /= len(val_loader)
    print(f"Epoch {epoch}, validation loss: {val_loss}")

    # Early Stopping Logic
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        trigger_times = 0
        torch.save(model.state_dict(), "best_keypoints_model.pth")  # Save best model
    else:
        trigger_times += 1
        if trigger_times >= patience:
            print(f"Early stopping triggered at epoch {epoch}")
            break

When I call the model, I scale the keypoints back to the original frame size like this:

import cv2
import torch
from torchvision import transforms, models

class CourtLineDetector:
    def __init__(self, model_path):
        self.model = models.resnet50(weights=None)  # architecture only; trained weights are loaded below
        self.model.fc = torch.nn.Linear(self.model.fc.in_features, 32*2)  # Adjust for the number of keypoints
        self.model.load_state_dict(torch.load(model_path, map_location='cpu'))
        self.transform = transforms.Compose([
            transforms.ToPILImage(),
            transforms.Resize((224, 224)),
            transforms.ToTensor(),
            transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
        ])
    
    def predict(self, frame):
        img_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        image_tensor = self.transform(img_rgb).unsqueeze(0)
        with torch.no_grad():
            self.model.eval()
            outputs = self.model(image_tensor)
        
        keypoints = outputs.squeeze().cpu().numpy()
        original_height, original_width = frame.shape[:2]

        # Scale keypoints back to the original frame dimensions
        keypoints[::2] *= original_width / 224.0
        keypoints[1::2] *= original_height / 224.0

        return keypoints
    
    def draw_keypoints(self, image, keypoints):
        # Plot keypoints on the image
        for i in range(0, len(keypoints), 2):
            x = int(keypoints[i])
            y = int(keypoints[i+1])
            if x is not None and y is not None:
                cv2.putText(image, str(i//2), (x, y-10), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 0, 255), 2)
                cv2.circle(image, (x, y), 5, (0, 0, 255), -1)
        return image
    
    def draw_keypoints_on_video(self, video_frames, keypoints):
        output_video_frames = []
        for frame in video_frames:
            frame = self.draw_keypoints(frame, keypoints)
            output_video_frames.append(frame)
        return output_video_frames

But, as you can see, the dots look roughly in the right positions, but not at the right scale.

I have already tried several methods to correct that, like padding the picture before training (and after) to keep the correct scale, but that didn't work either.

My original dataset is really simple:

[
    {
        "id": "badminton_court_keypoints_167",
        "metric": 0.0,
        "kps": [
            [
                385,
                353
            ],
            ...
        ]
    },
    ...
]

Furthermore, if you think there is a better way to achieve the same result, please let me know; I'd be happy to understand this field better.

Thank you so much for taking the time to help me.

1 Comment
2024/11/15
14:48 UTC

2

Building a wardrobe inventory & virtual try-on app: which technologies are there? where to start from?

Hi, I am wondering if it is possible to build an app that allows users to take a picture of themselves in a mirror that automatically adds their clothing items into an inventory, ideally a photo of the item and tags such as clothing category, color, fabric etc. Then the user should be able to mix and match the items in the wardrobe and see the outfit on a realistic avatar of themselves. Is this possible from a technological standpoint? If so, could you recommend which concepts, resources etc. to look into? What kind of expertise would one need to develop this product? Please bear in mind that I have engineering background and coding experience but I don't know anything about computer vision itself.

1 Comment
2024/11/15
14:40 UTC

7

Real-world challenges of AI-driven visual sensing

I am in the process of writing a survey paper that explores the real-world challenges of AI-driven visual sensing across various sectors, such as wearable devices (VR/AR/smart glasses), construction, mining, oil, robotics, retail, and more. My focus is on the limitations and constraints posed by camera technology in these applications. Any insights or contributions on this topic would be greatly appreciated!

5 Comments
2024/11/15
11:25 UTC

1

Quaternion rotation for each of skybox panorama views

I have a skybox panorama image (a 360° view as bottom/up/left/right/front/back faces). I also have the camera position and rotation vectors, and I've noticed that the rotation vector corresponds to the camera's "bottom" view.

Given the bottom-view rotation vector, I want to calculate the rotation vectors for all the other views (left/right/up/front/back), starting with the "left" view. The problem is that if I only rotate 90 degrees about the Y axis, objects that should be on the bottom end up on the left side of the image rendered from the camera's perspective, and if I additionally rotate 90 degrees about the Z axis, objects that should be on the bottom end up slightly to the right, if that makes sense.

As I understand it, this happens because the Z axis itself rotates with the Y rotation and is no longer aligned. Is there a way to properly calculate the rotation for each panorama view?

PS. Sorry if I explained that poorly; I'll try to create an example if what I'm saying does not make sense.
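
In the meantime, here is a small sketch of the composition-order issue using scipy; the axis choices and the direction of your rotation vector (camera-to-world vs world-to-camera) are assumptions, so the two orders below are the things to experiment with:

# Sketch: derive per-face rotations from the "bottom" view rotation by composing rotations.
import numpy as np
from scipy.spatial.transform import Rotation as R

r_bottom = R.from_rotvec([0.1, -2.2, 2.1])   # placeholder: the given bottom-view rotation vector

# A 90-degree turn about ONE of the camera's own axes (which axis/sign is renderer-specific).
r_face = R.from_euler('x', 90, degrees=True)

# Post-multiplying applies r_face in the camera's local frame...
r_left_local = r_bottom * r_face
# ...while pre-multiplying applies it about the fixed world axes, which is what typically
# produces the "objects end up on the wrong side / tilted" effect described above.
r_left_world = r_face * r_bottom

print(r_left_local.as_rotvec())
print(r_left_world.as_rotvec())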

2 Comments
2024/11/15
11:06 UTC

2

Blog post: Use cases of AI in driverless cars.

This blog post explores how CV solutions are used in autonomous cars. It is not a technical post, but it can be useful for general knowledge. If you have more interesting use cases or projects, please add them to the thread; that would be very useful to me. Thank you.

1 Comment
2024/11/15
10:03 UTC

3

Edge detection using MultiThreading

Hi, I've been trying to implement edge detection using convolution (Sobel operator) in C, using the pthreads library for multithreading, but when I compare the execution time of the program with and without threads, the multithreaded version comes out slower. I've tried reducing the number of threads (to lower the overhead of creating and managing them) and increasing the image resolution, but nothing worked.

Here is my code

#include <pthread.h>
#include <stdlib.h>

/* Image, Kernel and ThreadData are assumed to be defined elsewhere in the project. */
void *thread_convolution(void *arg);

Image mul_convolution(int **input_image, Kernel k, int width, int height) {
    int num_threads = 4; // Number of threads
    pthread_t threads[num_threads];
    ThreadData thread_data[num_threads];
    int rows_per_thread = height / num_threads;

    // Allocate the output image once, up front.
    Image img;
    img.width = width;
    img.height = height;
    img.data = (int **)malloc(sizeof(int *) * height);
    for (int i = 0; i < height; i++) {
        img.data[i] = (int *)malloc(sizeof(int) * width);
    }

    for (int i = 0; i < num_threads; i++) {
        thread_data[i].input_image = input_image;
        thread_data[i].k = k;
        thread_data[i].width = width;
        thread_data[i].height = height;
        thread_data[i].start_row = i * rows_per_thread;
        thread_data[i].end_row = (i == num_threads - 1) ? height : (i + 1) * rows_per_thread;

        // Point each thread at its slice of img.data instead of allocating separate
        // per-thread buffers: the originals were never copied back into img (and leaked),
        // so the returned image was left uninitialized.
        thread_data[i].output_image = &img.data[thread_data[i].start_row];

        pthread_create(&threads[i], NULL, thread_convolution, &thread_data[i]);
    }

    for (int i = 0; i < num_threads; i++) {
        pthread_join(threads[i], NULL);
    }

    return img;
}

void* thread_convolution(void* arg) {
    ThreadData* data = (ThreadData*)arg;
    int **input_image = data->input_image;
    Kernel k = data->k;
    int width = data->width;
    int height = data->height;
    int start_row = data->start_row;
    int end_row = data->end_row;

    for (int i = start_row; i < end_row; i++) {
        for (int j = 0; j < width; j++) {
            int sum = 0;
            for (int m = 0; m < k.size; m++) {
                for (int n = 0; n < k.size; n++) {
                    int x = i + m - k.size / 2;
                    int y = j + n - k.size / 2;
                    if (x >= 0 && x < height && y >= 0 && y < width) {
                        sum += input_image[x][y] * k.data[m][n];
                    }
                }
            }
            data->output_image[i - start_row][j] = sum;
        }
    }

    pthread_exit(NULL);
}

1 Comment
2024/11/15
07:41 UTC

8

Papers on calibrated multi-view geometry for beginners

Hi all, I'm looking for some papers that are beginner-friendly (I am only familiar with basic neural network concepts) that discuss the process of combining multiple photos of a scene, taken from different perspectives, into a 3D model.

Ideally, I'm looking for something that supports calibration beforehand, so that the reconstruction is as quick as possible.

Right now, I need to do a literature survey and would like some help in finding good direction. All the papers I've found were way too complicated for my skill level and I couldn't get through them at all.

Here's a simple diagram to illustrate what I'm trying to look into: https://imgur.com/a/MJue7I2

Thanks!

4 Comments
2024/11/15
04:31 UTC

3

Best Image Inpainting methods to naturally blend objects

Hi Folks,

I have a use case where I am given two images; for notation, let's call them IMAGE1 and IMAGE2. My task is to select an object from IMAGE1 (by selection, I mean obtaining the segmented mask of the object) and place this segmented object naturally in IMAGE2, in a masked region provided by the user. We have to ensure that the object from IMAGE1 blends naturally into IMAGE2. Can someone shed light on what might be the best model or group of models to do this?

Example: Place a tree from IMAGE1 into IMAGE2 ( group of people taking selfie on a grassland)

  1. I have to segment the tree from IMAGE1.
  2. I have to place the tree in the portion highlighted or provided as a mask in IMAGE2.
  3. I have to take care of the light, angle, and vibe (like selfie mode, wide angle, portrait, etc.), context awareness, smooth edge blending, shadows, etc.

Dataset: For now, I choose to work on the COCO dataset. A subset of 60K images

Since inpainting involves many techniques, it's confusing which set of models I should pipeline for my use case to get a good, realistic, natural image (one diffusion-based building block is sketched after the list below).

I have explored the following techniques but could not settle on one strategy.

  1. Partial Convolutions.

  2. Generative Adversarial Networks (GANs)

  3. Autoencoders.

  4. Diffusion Models

  5. Context-based attention models etc.
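
For the diffusion route specifically, one building block worth trying (a sketch under several assumptions, not a full pipeline): paste the segmented object into IMAGE2 at the user's mask, then run a Stable Diffusion inpainting model over a thin band around the paste boundary so the seam, local lighting, and shadows get re-synthesised. The checkpoint name, prompt, band width, and 512x512 working resolution below are all placeholders:

# Sketch: harmonise a pasted object's boundary with a diffusion inpainting model (diffusers).
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

composite = Image.open("composite.png").convert("RGB")             # IMAGE2 with the object naively pasted in
object_mask = cv2.imread("object_mask.png", cv2.IMREAD_GRAYSCALE)  # mask of the pasted object in IMAGE2

# Thin band around the paste boundary so only the seam gets repainted.
kernel = np.ones((15, 15), np.uint8)
band = cv2.dilate(object_mask, kernel) - cv2.erode(object_mask, kernel)
band_mask = Image.fromarray(band)

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

result = pipe(
    prompt="a tree on a grassland, natural lighting",  # describe the local context around the object
    image=composite.resize((512, 512)),
    mask_image=band_mask.resize((512, 512)),
).images[0]
result.save("blended.png")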

Thanks for checking on my post. Please provide some insights if you have some experience or ideas working on such use cases.

2 Comments
2024/11/15
03:22 UTC

0

Is there a way to get the plates off this car?

I was wondering if there's anything that could make these plates legible in this video. Any help would be greatly appreciated.

8 Comments
2024/11/15
03:13 UTC

5

Interview with David Forsyth, Computer Vision Giant. He talks about the biggest problem in vision right now

6 Comments
2024/11/15
03:06 UTC

7

Fine-Tune Mask RCNN PyTorch on Custom Dataset

https://debuggercafe.com/fine-tune-mask-rcnn-pytorch-on-custom-dataset/

Instance segmentation is an exciting topic with a lot of use cases. It combines both object detection and image segmentation to provide a complete solution. Instance segmentation is already making a mark in fields like agriculture and medical imaging. Crop monitoring and tumor segmentation are some of the practical aspects where it is extremely useful. But in deep learning, fine-tuning an instance segmentation model on a custom dataset often proves to be difficult. One of the reasons is the complex training pipeline. Another reason is being able to find good and customizable code to train instance segmentation models on custom datasets. To tackle this, in this article, we will learn how to fine-tune the PyTorch Mask RCNN model on a small custom dataset.

https://preview.redd.it/0wnxvw2tny0e1.png?width=1000&format=png&auto=webp&s=c70db631f4f9c10243ec29711be93183c45d2154

2 Comments
2024/11/15
00:35 UTC

14

Reflectance-Based DIRetinex and Real-ESRGAN Image Enhancement Pipeline

Hi everyone!

I built a pipeline combining a Reflectance-Based Deep Retinex model with Real-ESRGAN to enhance low-light images. The Retinex model separates the image into reflectance and illumination components, allowing us to adjust brightness and contrast based on predicted coefficients. This helps to improve visibility in low-light images while keeping details natural. After this, I thought eh that was kinda just recreating a paper. So, I tried improving it with Real-ESRGAN. It steps in to upscale the images, adding super-resolution for clearer, high-quality results.

The model has shown decent results in handling challenging low-light conditions by producing images with better visibility and refined details. If you're interested, I’ve shared the code here: Project.

I still wasn't able to exactly reproduce the results from the paper. But the final image is clearer and has a lot less noise than even the ground truth in some places.

Here's an example:

https://preview.redd.it/yhbbywyfyw0e1.png?width=1200&format=png&auto=webp&s=071213fb53b0d8d39a4b53e6b8a373daf1f96d02

I’d love any feedback or thoughts for improvement using this method.

P.S. I'm only a grad student, take it easy on me xD

1 Comment
2024/11/14
18:53 UTC

5

Custom Code for Precision, Recall, and Confusion Matrix for YOLO Segmentation Metrics?

Has anyone written custom code to calculate metrics like precision, recall, and the confusion matrix for YOLO segmentation? I have my predicted label files, but since I've modified the way I'm getting inference results, the default val function in Ultralytics doesn’t work for me anymore. Any advice on implementing these metrics for a custom YOLO segmentation format would be really helpful!
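
I'm not aware of a drop-in tool for a custom output format, but the bookkeeping is small enough to hand-roll: match predicted and ground-truth masks of the same class by IoU at a threshold, count TP/FP/FN per class, and derive precision and recall from the counts (the same matched pairs can also populate a class-vs-class confusion matrix). A minimal numpy sketch, assuming each instance is a (class_id, binary HxW mask) pair and using simple greedy matching:

# Sketch: per-class precision/recall for instance segmentation via greedy IoU matching.
import numpy as np
from collections import defaultdict

def mask_iou(a, b):
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union > 0 else 0.0

def evaluate(preds, gts, iou_thr=0.5):
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    matched = set()
    for p_cls, p_mask in preds:
        best_iou, best_j = 0.0, -1
        for j, (g_cls, g_mask) in enumerate(gts):
            if j in matched or g_cls != p_cls:
                continue
            iou = mask_iou(p_mask, g_mask)
            if iou > best_iou:
                best_iou, best_j = iou, j
        if best_iou >= iou_thr:
            tp[p_cls] += 1
            matched.add(best_j)
        else:
            fp[p_cls] += 1  # unmatched prediction
    for j, (g_cls, _) in enumerate(gts):
        if j not in matched:
            fn[g_cls] += 1  # missed ground truth
    classes = set(tp) | set(fp) | set(fn)
    return {c: {"precision": tp[c] / max(tp[c] + fp[c], 1),
                "recall": tp[c] / max(tp[c] + fn[c], 1)} for c in classes}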

0 Comments
2024/11/14
10:28 UTC

7

Increase accuracy pose estimation

I am struggling to find a pose estimation model that is accurate enough to estimate poses consistently on sports footage (single person, 30 fps, 17 keypoints).

Do you have any tricks/tips for video post processing to increase accuracy?
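
One cheap post-processing trick is temporal smoothing of the keypoints, gated by per-keypoint confidence, so that low-confidence detections fall back on the previous frame. A small sketch; the blending factor and confidence gate are guesses to tune, and a One Euro filter or interpolation over dropped keypoints would be the next step up:

# Sketch: confidence-gated exponential smoothing of keypoints, shape (frames, 17, 3) as (x, y, conf).
import numpy as np

def smooth_keypoints(kps, alpha=0.5, min_conf=0.3):
    kps = kps.copy()
    for t in range(1, len(kps)):
        prev, cur = kps[t - 1], kps[t]
        for k in range(cur.shape[0]):
            if cur[k, 2] < min_conf:
                cur[k, :2] = prev[k, :2]  # low confidence: carry the previous position forward
            else:
                cur[k, :2] = alpha * cur[k, :2] + (1 - alpha) * prev[k, :2]  # blend to suppress jitter
    return kps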

Thanks!

5 Comments
2024/11/14
05:15 UTC

0

LG Ultra sharp 40" VS the world

I've looked around and haven't found one of the 5K monitors I'm interested in on display. The only retailer that carries anything anymore is Best Buy, and I live in LA. They do have the LG 45" OLED which is big and beautiful in person, although probably too curved, not much of a hub, and sold as a gaming monitor. The size is nice being tall AND wide! I'm not a gamer except for some FPV Drone Simulation on occasion.

What I am is a Mac creative who works in Photoshop, InDesign, Illustrator and a fair amount of Premiere. I'm looking for a combination of color accuracy, size (but not a fan of narrow 49" monitors) and resolution. I'm currently on an iMac 27", which is what I'm used to with its 5K resolution, and sometimes text is hard to read. Because I have a 23" sidecar monitor I can't mount a VESA arm and pull it close to my face when needed. However, I do prefer to keep the monitor a little further from my face for eyeball tanning's sake. 5K resolution comes in real handy as I'm often using screen grabs.

What I like about the Dell is the resolution, the hub with ample USB C ports, the ambient light sensor. But Dell is not a name I associate with computer monitors. I'm also a fan of OLED screens. My TV is an LG OLED and it's been sweet! I like the idea of the screen emitting the light rather than an array of LED's from behind. I see that LG has a 5K OLED coming 2025/26

I am still debating between an M2 Studio Ultra or an M4 Mini if you'd like to chime in on that feel free. If I found a screamin' deal on a M2 Ultra studio i'd probably get that. This next computer will likely be a placeholder till the M4 Ultra/Studio or whatever Apple does next is released. So an M4 mini might have better resale when that time comes.

So with black Friday looming, is it worth the extra scratch for the Dell or LG 40"? Or would I be happy with an LG OLED 38" or 45"?

3 Comments
2024/11/14
04:19 UTC

9

3D Mesh inner vertices

I hope this question is appropriate here.

I have a 3D mesh generated from an array using marching cubes, and it roughly resembles a tube (from a medical image). I need to color the inner and outer parts of the mesh differently—imagine looking inside the tube and seeing a blue color on the inner surface, while the outer surface is red.

The most straightforward solution seems to be creating a slightly smaller, identical object that shrinks towards the axis centroid. However, rendering this approach is too slow for my use case.

Are there more efficient methods to achieve this? If the object were hollow from the beginning, I could use an algorithm like flood fill to identify the inner vertices. But this isn't the case.
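
One cheaper alternative to building a shrunken copy is to classify each vertex from its normal: for a roughly tubular mesh, the outward radial direction at a vertex is the vector from the local tube axis to that vertex, so a normal pointing against it indicates the inner wall. A numpy sketch, assuming the tube runs roughly along z and using the normals returned by marching cubes; depending on your volume's sign convention the inequality may need to be flipped:

# Sketch: label inner vs outer vertices of a tube-like mesh using marching-cubes normals.
import numpy as np
from skimage import measure

verts, faces, normals, _ = measure.marching_cubes(volume)  # `volume` is the 3D array

# Estimate the tube axis as the centroid of the vertices in each z slice.
z_bins = np.round(verts[:, 2]).astype(int)
axis_xy = np.zeros((len(verts), 2))
for z in np.unique(z_bins):
    sel = z_bins == z
    axis_xy[sel] = verts[sel, :2].mean(axis=0)

radial = verts[:, :2] - axis_xy                          # outward radial direction (xy plane)
outward = np.einsum("ij,ij->i", normals[:, :2], radial)  # alignment of normal with radial direction

inner = outward < 0  # normal points toward the axis -> inner surface
vertex_colors = np.where(inner[:, None], [0, 0, 255], [255, 0, 0])  # blue inside, red outside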

2 Comments
2024/11/14
01:28 UTC

6

voyage-multimodal-3: all-in-one embedding model for interleaved screenshots, photos, and text

Hey r/computervision community — we built voyage-multimodal-3, a natively multimodal embedding model, designed to handle interleaved images and text. We believe this is one of the first (if not the first) of its kind, where text, photos, figures, tables, screenshots of PDFs, etc can be projected directly into the transformer encoder to generate fully contextual embeddings.

We hope voyage-multimodal-3 will generate interest in vision-language models and computer vision more broadly.

Come check us out!

Blog: https://blog.voyageai.com/2024/11/12/voyage-multimodal-3/

Notebook: https://colab.research.google.com/drive/12aFvstG8YFAWXyw-Bx5IXtaOqOzliGt9

Documentation: https://docs.voyageai.com/docs/multimodal-embeddings

1 Comment
2024/11/13
20:22 UTC

0

Machine recommendation

I am confused between buying an M2 MacBook Air and a Mac mini M4, as one is portable and the other is not. An external display would be needed wherever the Mac mini goes.

Which do you think will be more beneficial in the long term? I have a Windows laptop that is 7 years old (it even freezes when loading the Python interpreter, so computer vision is kind of a long shot).

I want to do computer vision, machine learning tasks, and software development.

Please write the reason in the comments.

View Poll

9 Comments
2024/11/13
18:07 UTC
