/r/MachineLearning

2,931,803 Subscribers

1

[D] How Can I Train an AI Model to Automatically Parse and Identify Fields in Diverse PDF Invoices Without Manual Bounding Boxes?

Hello AI Community,

I’m working on a project to streamline the processing of a large volume of invoices from various suppliers. Each invoice may have a unique layout and design, depending on the supplier, and I want to train an AI model to automatically identify specific fields like article numbers, gross amounts, unit prices, etc., across these invoices. I’ll outline my situation below and would appreciate any advice on the best approach, relevant models, or practical considerations to help automate this process.

Project Background and Objectives

I have a substantial collection of PDF invoices from different suppliers. Some of these PDFs contain machine-readable text, while others are scanned images requiring OCR processing. Each invoice has a similar set of fields I need to extract, including:

  • Article Number
  • Gross Amount
  • Unit Price
  • Customer Details (Name, Address, etc.)

Additionally, I have corresponding XML files for each invoice that list the correct field values as structured data. This XML data serves as my “ground truth” and is accurate in labeling each field with the correct values.

Goal: Train an AI model that can automatically parse and map values from new invoices to these field labels without needing manual bounding boxes or annotations on each new layout. My ideal solution would learn from the XML data and understand where each value is likely located on any invoice.

Key Challenges

  1. Varied Invoice Layouts: Each supplier uses a different layout, making fixed positional or template-based extraction challenging.
  2. OCR for Scanned PDFs: Some invoices are image-based, so I need reliable OCR as a pre-processing step.
  3. No Manual Bounding Boxes: I’d like to avoid manually labeling bounding boxes for each field on each layout. Ideally, I would only need to provide the model with PDF and XML pairs.
  4. Field Mapping: The model should learn to associate text fields in the invoice with the correct XML labels across diverse formats.

Initial Research and Thoughts

I’ve looked into some potential approaches and models that might be suitable, but I’m unsure of the best approach given my requirements:

  • OCR: I understand OCR is essential for scanned PDFs, and I’ve looked into tools like Tesseract OCR and Google’s Vision AI. Is there a better option specifically for invoice OCR?
  • Pre-trained Models for Document Understanding:
    • LayoutLM (Versions 2 or 3): I’ve read that LayoutLM can handle layout-aware document analysis and might be effective with minimal supervision.
    • Donut (Document Understanding Transformer): This model seems promising for end-to-end document parsing, as it doesn’t require bounding boxes and might align well with my goal to use XML data directly.
  • Other Approaches: I considered custom pipelines, where OCR is followed by text processing with models like BERT, but I’m unsure if this would be flexible enough to handle varied layouts.

Questions

  1. Model Recommendation: Given my need to train a model to handle varied layouts, would LayoutLM or Donut (or another model) be the best fit? Has anyone here fine-tuned these models on invoice data specifically?
  2. Handling OCR Effectively: For those with experience in OCR for diverse invoice formats, are there particular OCR tools or configurations that integrate well with models like LayoutLM or Donut? Any advice on preprocessing scanned documents?
  3. Training Workflow Suggestions: What would a robust workflow look like for feeding labeled PDFs and XML files to the model without manual bounding boxes? Are there best practices for mapping the structured XML data to the model’s expected inputs?
  4. Performance Tips: Any specific tips on optimizing these models for accuracy in field extraction across variable invoice layouts? For example, do certain preprocessing steps improve performance on semi-structured documents?

Example of My Data Structure

To give you an idea of what I’m working with, here’s a basic breakdown:

  • PDF Invoice: Contains fields in varied positions. For example, “Article Number” may appear near the top for one supplier and further down for another.
  • XML Example:

    <invoice>
      <orderDetails>
        <positions>
          <position>
            <positionNumber>0010</positionNumber>
            <articleNumber>EDK0000379</articleNumber>
            <description>Sensorcable, YF1234-100ABC3EEAX</description>
            <quantity>2</quantity>
            <unit>ST</unit>
            <unitPrice>23.12</unitPrice>
            <netAmount>46.24</netAmount>
          </position>
        </positions>
      </orderDetails>
    </invoice>
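
One possible way to get training labels without manual bounding boxes, sketched under assumptions: flatten the XML into field/value pairs, then match each value against the OCR tokens so that the matching tokens (and their boxes) become weak labels for a layout-aware model. The `ocr_tokens` structure, the sample boxes, and the function names below are hypothetical; a real OCR engine would supply the words and coordinates.

```python
import xml.etree.ElementTree as ET

XML = """<invoice><orderDetails><positions><position>
<positionNumber>0010</positionNumber>
<articleNumber>EDK0000379</articleNumber>
<description>Sensorcable, YF1234-100ABC3EEAX</description>
<quantity>2</quantity>
<unit>ST</unit>
<unitPrice>23.12</unitPrice>
<netAmount>46.24</netAmount>
</position></positions></orderDetails></invoice>"""


def flatten_positions(xml_text):
    """Return one {field: value} dict per <position> element."""
    root = ET.fromstring(xml_text)
    return [{child.tag: (child.text or "").strip() for child in pos}
            for pos in root.iter("position")]


def normalize(text):
    """Make OCR text and XML values comparable (case, decimal comma)."""
    return text.strip().lower().replace(",", ".")


def weak_labels(ocr_tokens, fields):
    """Label each OCR token with the XML field whose value it equals, else 'O'."""
    lookup = {normalize(v): k for k, v in fields.items() if v}
    return [(tok["text"], tok["bbox"], lookup.get(normalize(tok["text"]), "O"))
            for tok in ocr_tokens]


# Hypothetical OCR output: words with (x0, y0, x1, y1) boxes.
ocr_tokens = [
    {"text": "EDK0000379", "bbox": (40, 310, 130, 322)},
    {"text": "23,12", "bbox": (400, 310, 440, 322)},
    {"text": "46,24", "bbox": (480, 310, 520, 322)},
    {"text": "Invoice", "bbox": (40, 40, 110, 60)},
]

fields = flatten_positions(XML)[0]
labels = weak_labels(ocr_tokens, fields)
for text, bbox, label in labels:
    print(f"{text:12} {label}")
```

Exact matching like this is only a starting point: multi-word values (such as the description) span several tokens, and two fields with the same value (for example unit price and net amount when the quantity is 1) would collide, so a real pipeline would likely need fuzzy and position-aware matching.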

Thanks in advance for your insights! I’d be especially grateful for any step-by-step advice on setting up and training such a model, as well as practical tips or pitfalls you may have encountered in similar projects.

0 Comments
2024/11/06
13:22 UTC

1

[R] Help with CNN-RNN Architecture for Self-Supervised Matrix Completion

Hi all, I’m working on a self-supervised learning approach to estimate missing or uncertain data in a freeway traffic density dataset, inspired by matrix completion methods.

The dataset is generated from simulated freeway traffic, discretized in time and space to form a grid of cells. Each cell reflects a traffic density value observed from mobile sensors. I have three core arrays:

  1. actual_density_values: Ground truth density, used only for evaluation, not training.
  2. observed_density_values: Traffic density observed from mobile sensors, with some cells unobserved.
  3. certainty_values: Coverage certainty for each observed cell (range: 0 to 1).

with dimensions (T, E, S, L), where:

  • T: Number of time steps
  • E: Movement directions (edges) – expected to be 2 (e.g., forward and reverse)
  • S: Spatial segments
  • L: Lanes

Goal

The goal here is to build a model that can improve the estimation for cells where the certainty is less than 1. I want the model to capture dependencies over time and space, using self-supervision to “fill in” unobserved or uncertain values more accurately.

Proposed Approach

Here’s what I’m thinking in terms of architecture:

  1. Temporal Dependencies: Using a CNN to capture time-based dependencies over time steps (T).
  2. Spatial Dependencies: Using an RNN to model dependencies across spatial segments (S) and lanes (L).
  3. Model Structure:
    • Data Masking: At each time step, mask some of the observed data, especially the lower-certainty cells, so the model learns to predict uncertain values better.
    • CNN-RNN Combo: Combining CNN and RNN layers to learn from both the temporal and spatial aspects.
    • Loss Function: Using a self-supervised loss function that prioritizes accurate reconstruction of observed densities, particularly focusing on uncertain cells. For training, I won’t use the ground truth array (actual_density_values); it’s only for evaluation.
  4. Evaluation: Once trained, I plan to compute the RMSPE (Root Mean Square Percentage Error) between actual_density_values and the model’s predicted observed_density_values. I’m especially interested in the improvements on the lower-certainty cells.
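
To make the masking, loss, and evaluation steps above concrete, here is a minimal standard-library sketch. It works on flattened lists rather than (T, E, S, L) arrays, and the masking probability and the weight `1 + alpha * (1 - certainty)` are assumed heuristics for illustration, not something I have validated.

```python
import math
import random


def choose_mask(observed, certainty, mask_rate=0.2, seed=0):
    """Pick observed cells to hide; lower certainty gives a higher masking chance."""
    rng = random.Random(seed)
    masked = []
    for i, (obs, c) in enumerate(zip(observed, certainty)):
        if obs is None:          # never observed: nothing to reconstruct
            continue
        p = min(1.0, mask_rate * (2.0 - c))   # assumed heuristic
        if rng.random() < p:
            masked.append(i)
    return masked


def weighted_masked_mse(pred, observed, certainty, masked, alpha=1.0):
    """MSE over masked cells, up-weighting low-certainty cells by 1 + alpha*(1-c)."""
    num = den = 0.0
    for i in masked:
        w = 1.0 + alpha * (1.0 - certainty[i])
        num += w * (pred[i] - observed[i]) ** 2
        den += w
    return num / den if den else 0.0


def rmspe(actual, pred):
    """Root mean square percentage error, skipping cells with zero ground truth."""
    terms = [((a - p) / a) ** 2 for a, p in zip(actual, pred) if a != 0]
    return 100.0 * math.sqrt(sum(terms) / len(terms))


# Toy flattened grid: None marks an unobserved cell.
observed = [10.0, None, 12.0, 8.0]
certainty = [1.0, 0.0, 0.5, 0.25]
pred = [10.0, 11.0, 13.0, 8.0]       # stand-in for model output
actual = [10.0, 11.0, 12.5, 8.0]     # evaluation only, never used in the loss

masked = choose_mask(observed, certainty, mask_rate=1.0)
loss = weighted_masked_mse(pred, observed, certainty, masked)
score = rmspe(actual, pred)
print(masked, loss, score)
```

One caveat with this weighting: low-certainty observations are also the noisiest targets, so pushing the model harder toward them may not help; weighting by certainty instead is the opposite, equally plausible choice.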

Questions

  1. Does this CNN-RNN combination sound like a good fit for this kind of matrix completion task? Are there alternative approaches or tweaks that might make it more effective?
  2. Any recommendations for loss functions that work well in self-supervised setups, especially where I want to prioritize low-certainty values?
  3. Are there best practices for masking observed values in self-supervised learning setups like this?
  4. Any advice on regularization techniques to prevent overfitting, given the self-supervised nature of the task? Also, any tips on ensuring scalability?
0 Comments
2024/11/06
12:44 UTC

13

[D] Want to move away from coding heavy ML but still want to complete the PhD

Hi Folks,

I come from a traditional electrical engineering background, doing things like industrial automation and computer vision. I decided to pursue a PhD in ML as I thought it would be a good field to enter given my past experience. I have now been doing the PhD for the past three years. While I like my group and research, I am getting discouraged/depressed by (1) the publication rat race, (2) post-graduation opportunities mostly being coding heavy, and (3) the inability to carve out a name for myself in the field given how crowded it has become.

Thus, ideally I would like to complete my PhD and move into a more relaxed-paced, non-coding-heavy but still technical job (even if it does not pay as well as ML jobs), where I do not have to constantly upskill myself. Do you folks have any suggestions on what jobs I can look into, or would you suggest dropping the PhD and doing something else?

TLDR: 4th year ML PhD student unsure of sticking with the PhD, as they desire a non-coding-heavy technical job in industry after graduation. Seeking advice on what to do.

0 Comments
2024/11/06
12:18 UTC

8

[D] Evolving Matrix Computation Techniques for Modern AI: What's New?

As AI models continue to scale in both complexity and size, I'm interested in how the field of matrix computations is evolving to meet these new challenges. What are some of the latest advancements or strategies in matrix computation that are improving efficiency and adaptability for modern AI systems? Are there any recent breakthroughs or shifts in our approach to these computations that are making a significant impact in AI research and applications?

0 Comments
2024/11/06
11:51 UTC

0

[D] LLM How to search on a subset of vectors corresponding to a given file

Hi, I am currently trying to use LLMs to see how they could help people query technical documents in French. For my tests, the documents are about the maintenance of water heaters. The format is PDF files (clean PDFs, scans, PDFs with tables, ...). There is one PDF file per water heater reference.

I would like the LLM to be able to answer questions like "very little hot water, what could be the problem?" or "how to drain the heater?"

For the information retrieval module of the RAG part, I am currently trying to use FAISS and SentenceTransformer with the model sentence-camembert-large (I will probably use mixtral-8x7b-instruct-v0.1 for the answer generation).

In some use cases, I know the reference of the water heater that the technician is working on (e.g. "1N11001"), and I am able to pass this information to my agent.

In such cases, do you have any idea how I can constrain the search to the related notice file (e.g. "1N11001.pdf")?

I noticed that the `search` function on the FAISS index has a parameter `labels`. Do you think I could use this in my case? Could anyone explain to me how I can add labels with the names of my files during indexing, so that I can search only in the subset of vectors related to a given file?
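
As far as I can tell, `labels` in FAISS's `search` is the output array of neighbor ids rather than a filter, so it probably cannot be used this way. A simple alternative is to remember which vector ids belong to which file and only score those. The sketch below shows the idea with a toy brute-force store in plain Python; the class, its methods, and the second file name are hypothetical, and with FAISS the same bookkeeping could feed either one index per file or an id-based filter, if the installed version supports one.

```python
import math
from collections import defaultdict


class PerFileIndex:
    """Toy brute-force vector store that remembers which file each vector came from."""

    def __init__(self):
        self.vectors = []                      # vector id -> embedding
        self.ids_by_file = defaultdict(list)   # file name -> vector ids

    def add(self, file_name, embedding):
        self.ids_by_file[file_name].append(len(self.vectors))
        self.vectors.append(embedding)

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    def search(self, query, k=3, file_name=None):
        """Return the top-k (score, vector id) pairs, optionally within one file."""
        candidates = (self.ids_by_file.get(file_name, [])
                      if file_name is not None else range(len(self.vectors)))
        scored = [(self._cosine(query, self.vectors[i]), i) for i in candidates]
        return sorted(scored, reverse=True)[:k]


index = PerFileIndex()
index.add("1N11001.pdf", [1.0, 0.0])   # chunk 0
index.add("1N11002.pdf", [0.9, 0.1])   # chunk 1 (hypothetical second notice)
index.add("1N11001.pdf", [0.0, 1.0])   # chunk 2

hits = index.search([1.0, 0.1], k=2, file_name="1N11001.pdf")
print([i for _, i in hits])
```

Without the `file_name` argument the best match here comes from the other file, which is exactly the situation the filter is meant to avoid.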

Any help, ideas, or advice would be much appreciated :)

Thanks a lot

0 Comments
2024/11/06
10:45 UTC

0

[D] What if LLMs are trained to predict more than 1 token at a time?

Is there any reason to train LLMs to predict only one token at a time? Wouldn't inference be roughly 2 times faster if the model were trained to predict 2 tokens per step? That's a huge gain. Sure, there could be a quality loss, but for inference we already use quantization to increase speed, which decreases quality anyway. Would having the LLM predict more than 1 token decrease it more?

7 Comments
2024/11/06
09:11 UTC

38

[D] As a researcher, how do you become industry-ready?

Being a PhD student, much of my time is spent on supervising students, project management and writing "quick and dirty" code for prototyping. I intend to move to industry after the PhD, but I feel like I'm missing out on key software engineering skills and good coding practices. Does anyone else feel this way? How do you upskill yourself to be industry-ready while doing a PhD?

11 Comments
2024/11/06
07:07 UTC

0

[P] Open-source declarative framework to build LLM applications - looking for contributors

I've been building LLM-based applications, and was super frustrated with all the major frameworks: langchain, autogen, crewAI, etc. They all seem to introduce a pile of unnecessary abstractions. It becomes super hard to understand what's going on behind the curtains, even for very simple stuff.

So I just published this open-source framework, GenSphere. The idea is to have something like Docker for LLMs. You build applications with YAML files that define an execution graph. Nodes can be LLM API calls, regular function executions, or other graphs themselves. Because you can nest graphs easily, building complex applications is not an issue, but at the same time you don't lose control.

You basically code in YAML, stating which tasks need to be done and how they connect. Other than that, you only write individual Python functions to be called during the execution. No new classes and abstractions to learn.
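
The general pattern of a declarative execution graph can be sketched in a few lines of plain Python. This is a generic illustration only, not GenSphere's actual schema or API: the node names, the `function` and `depends_on` keys, and the helper `run_graph` are all hypothetical, and the dict stands in for what a loaded YAML file might contain.

```python
from graphlib import TopologicalSorter

# Hypothetical graph spec, as it might look after loading a YAML file.
graph = {
    "fetch": {"function": "fetch_topic", "depends_on": []},
    "draft": {"function": "write_draft", "depends_on": ["fetch"]},
    "review": {"function": "review_draft", "depends_on": ["draft"]},
}


# Plain Python functions; an LLM API call would be just another function here.
def fetch_topic(inputs):
    return "vector search"


def write_draft(inputs):
    return f"Draft about {inputs['fetch']}"


def review_draft(inputs):
    return inputs["draft"].upper()


FUNCTIONS = {f.__name__: f for f in (fetch_topic, write_draft, review_draft)}


def run_graph(spec, functions):
    """Execute nodes in dependency order, passing each node its parents' outputs."""
    order = TopologicalSorter({n: set(node["depends_on"]) for n, node in spec.items()})
    results = {}
    for name in order.static_order():
        node = spec[name]
        inputs = {dep: results[dep] for dep in node["depends_on"]}
        results[name] = functions[node["function"]](inputs)
    return results


results = run_graph(graph, FUNCTIONS)
print(results["review"])
```

A topological sort only works for acyclic graphs, which is why cycles and conditional nodes need extra machinery.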

It's all open-source. Now I'm looking for contributors to adapt the framework for cycles and conditional nodes, which would allow full-fledged agentic system building! Pls reach out if you want to contribute, there are tons of things to do!

PS: you can read the detailed docs here, and go over this quick Google Colab tutorial.

1 Comment
2024/11/06
03:34 UTC

0

On successful research with a low budget [D]

Hi, I had a research idea and applied it to the nanoGPT repo for LM training. I validated that the transformer generalizes better in terms of validation loss but has worse training loss and is more prone to overfitting (training loss is a little worse, validation loss a little better). I only applied it to the full shakespeare_char dataset and a subset of OpenWebText, because 10 USD on RunPod only allows me to do this much. I am still planning to release a paper, since I got good results and did some math work. Should I do it?

1 Comment
2024/11/06
01:37 UTC

8

[D] Autograd vs JAX? Both are Google products aimed at gradient-based methods. What’s the main difference? (GPU/TPU?)

Just recently saw Autograd (the library) by Google people, which thinly wraps NumPy to offer backprop. JAX also does this but basically rewrites NumPy. What’s the difference? Is it the GPU/TPU support of JAX? Is Autograd meant for smaller models?

6 Comments
2024/11/06
01:15 UTC

0

[D] The Role of Dedicated AI Data Centers in Enhancing Model Training and Fine-Tuning

Just read that Kinetic Seas launched a new AI-specific data center—sounds like they’re aiming to make model training and fine-tuning less of a headache. Their setup includes specialized GPUs and CPUs, supposedly built to handle the demands of large, complex models. If traditional data centers feel like running uphill, maybe these AI-specific centers are the downhill version?

With machine learning models becoming more resource-hungry, I wonder if optimized infrastructure like this might change the game. Think about it: training models faster and with fewer limitations could really boost productivity for researchers and data scientists. Kinetic Seas seems to believe it’s worth building infrastructure just for AI, which feels like a pretty interesting bet.

Has anyone here worked with AI-specific setups like this? Curious to know if it’s really as smooth as it sounds!

https://www.prnewswire.com/news-releases/kinetic-seas-fka-bellatora-announces-completion-of-phase-i-of-its-data-center-for-ai-302168707.html

0 Comments
2024/11/06
00:55 UTC

6

[D] Is LoRA merging (and non linear mode connectivity) the key to better transformer hypernets?

Hi guys! I was thinking that if we could dynamically merge LLM fine-tuning LoRAs depending on the type of task at hand, we could fix catastrophic forgetting and maybe even have transformers generalize better. The thing is, because attention layers are very non-linear in their weights, transformers show poor LMC (linear mode connectivity).

Are you aware of the computational complexity of exact LoRA merging? I have seen quite a lot of papers on the subject of LoRA merging but they seem of poor quality and only empirical, with little mathematical grounding.
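
For reference, the two notions in play can be written down as follows. The notation is mine (not from any specific paper): $W_0$ is the frozen base weight, $B_i A_i$ the low-rank update of the $i$-th LoRA, and $\alpha_i$ hypothetical task-dependent mixing coefficients. The first line is plain linear merging; the second is the usual loss barrier along the straight line between two solutions $\theta_1, \theta_2$, which is small when they are linearly mode connected.

```latex
\begin{align}
W_{\text{merged}} &= W_0 + \sum_{i=1}^{n} \alpha_i \, B_i A_i,
  \qquad B_i \in \mathbb{R}^{d \times r},\; A_i \in \mathbb{R}^{r \times k} \\
\text{barrier}(\theta_1, \theta_2) &= \sup_{t \in [0,1]}
  \Big[ \mathcal{L}\big(t\,\theta_1 + (1-t)\,\theta_2\big)
        - \big(t\,\mathcal{L}(\theta_1) + (1-t)\,\mathcal{L}(\theta_2)\big) \Big]
\end{align}
```

Computing the merged weight itself is cheap (one rank-$r$ product per adapter); the open part is choosing the $\alpha_i$ so that the merged model stays in a low-loss region.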

So if you guys have thought of it, I'd be glad to hear about it!

4 Comments
2024/11/05
21:31 UTC

0

Tools to classify emails - supporting DV victims [Discussion]

Hi all,

Apologies if this is the wrong place to post. I'm looking for tools that can help me support my partner, who has been harassed for a number of years by her ex and father of her child.

She is trying to compile evidence for a restraining order but going back through the years of emails and other messages is psychologically draining for her. I was wondering if there are any tools that have a good use case for analysing and classifying emails, either individually or in bulk, so that I can support her by taking over this work for her?

5 Comments
2024/11/05
20:36 UTC

0

[D] Mastering LLM Testing: Ensuring Accuracy, Ethics, and Future-Readiness for Next-Gen AI Models

Hi everyone! 😊 I just published an article: Mastering LLM Testing: Ensuring Accuracy, Ethics, and Future-Readiness for Next-Gen AI Models. I hope I didn’t miss anything important in there!

I’m planning to turn this into a series on AI model testing and testing in general. Hope you enjoy it, and I’m always open for feedback and discussion! 😄

0 Comments
2024/11/05
19:38 UTC

0

[R] What Are Your Biggest Pain Points in Managing and Scaling Multiple AI Models?

Hey r/MachineLearning! 👋

I’m doing some research to understand the key challenges people face when managing multiple AI models—particularly around scaling, monitoring performance, and handling model failures. I’d love to hear from the community to get a better sense of where the pain points are.

Here are a few questions to start:

  1. Scaling and Load Balancing: Do you find it difficult to scale models for high traffic or load balancing between models?
  2. Model Observability: How challenging is it to monitor multiple models in production?
  3. Fallback and Redundancy: When a model fails or underperforms, what’s your approach? Do you use fallback models, and if so, what would make managing them easier?
  4. User and Permission Management: For those supporting multiple teams or clients, how do you manage access across projects securely? Any struggles with multi-tenant support?

Thanks so much for sharing your experiences—I’m excited to hear your insights!

0 Comments
2024/11/05
18:43 UTC

0

[D] Voice Isolation

Hi!

So ElevenLabs has a pretty good audio isolation API, but it is really expensive. Are there any open-source models that can be self-hosted and get close to the same quality?

0 Comments
2024/11/05
18:41 UTC

29

[D] To what cross-entropy loss value can LLMs converge?

LLMs are usually evaluated on benchmarks that aim to measure broad abilities. However, most publishers of foundational models do not publish the actual cross-entropy loss value that the model achieves at the end of training. I couldn't find any sources on this, but I would like to know what loss value the LLMs can achieve on human language. Is there anyone who knows more about this? Might there be some lower bound?
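
On the lower-bound part, the standard decomposition may help frame the question. Here $p$ is the (unknown) true distribution over token sequences and $q_\theta$ the model; the symbols are my notation.

```latex
\begin{align}
\mathcal{L}(\theta)
  &= \mathbb{E}_{x \sim p}\big[-\log q_\theta(x)\big]
   = H(p) + D_{\mathrm{KL}}\big(p \,\|\, q_\theta\big) \;\ge\; H(p) \\
\text{perplexity} &= e^{\mathcal{L}(\theta)},
  \qquad
  \text{bits per token} = \frac{\mathcal{L}(\theta)}{\ln 2}
\end{align}
```

So the floor is the entropy of the text distribution itself, reached only when the KL term is zero. Since per-token loss depends on the tokenizer, values from models with different vocabularies are not directly comparable unless normalized, for example to bits per byte.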

17 Comments
2024/11/05
15:19 UTC

88

[R] Never Train from scratch

https://arxiv.org/pdf/2310.02980

The authors show that when transformers are pretrained, they can match the performance of S4 on the Long Range Arena benchmark.

22 Comments
2024/11/05
14:02 UTC

0

[P] Getting crazy over a simple problem related to csv formatting

Hi everyone,

I'm facing a frustrating issue with my Python script. I'm processing prices and quantities in a DataFrame, using them to calculate unit prices, and saving the result to a CSV file. Everything seems perfect in Python (correct calculations, high precision), but when I open the CSV file, the values—particularly in the "Unit Prices" column—are incorrect (usually divided by 1000) or rounded, even though I specified high precision.

A few details:

  • I use df.to_csv() with decimal='.' to ensure dot-based decimal formatting.
  • I'm not specifying a float_format, aiming to retain maximum precision for Unit Prices.
  • The data preview in Python shows the correct values before saving, but the saved CSV has discrepancies.

Example Output: Here’s an example of what I'm seeing:

  • Python Output (before saving to CSV): Unit Prices = 0.696

  • CSV Output (opened in Excel): Unit Prices = 696

The weird thing is that this does not happen consistently. In some cases, rows are correct.

Has anyone faced this issue before? Any tips on ensuring that the CSV retains the exact precision and format as seen in Python?
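
One way to narrow this down, sketched with the standard library only: look at the raw text of the CSV instead of Excel's rendering. If the file on disk contains 0.696, the file is fine and the spreadsheet's locale settings are a likely culprit (an assumption, since the raw file isn't shown here): where `.` is treated as a thousands separator, 0.696 can be read as 696. The sample rows and the `write_csv` helper are hypothetical; with pandas, `DataFrame.to_csv` accepts `sep` and `decimal` arguments that serve the same purpose.

```python
import csv
import io

rows = [
    {"Item": "A", "Price": 1.392, "Quantity": 2},
    {"Item": "B", "Price": 10.0, "Quantity": 3},
]


def write_csv(rows, sep=",", decimal="."):
    """Serialize rows to CSV text, computing unit prices at full precision."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=sep, lineterminator="\n")
    writer.writerow(["Item", "Unit Prices"])
    for row in rows:
        unit_price = row["Price"] / row["Quantity"]
        # repr() gives the shortest string that round-trips the float exactly
        writer.writerow([row["Item"], repr(unit_price).replace(".", decimal)])
    return buffer.getvalue()


raw = write_csv(rows)
print(raw)   # inspect the raw text, not the spreadsheet's interpretation of it

# Variant for spreadsheet locales that expect ';' separators and ',' decimals
localized = write_csv(rows, sep=";", decimal=",")
print(localized)
```

If the raw text is already wrong, the problem is upstream in the calculation; if it is right, importing the file through the spreadsheet's text import dialog with an explicit decimal separator is another thing to try.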

6 Comments
2024/11/05
12:26 UTC

5

[D] Text classification: N-shot prompt classification VS training a linear classifier on top of an embedder

I need to make a text classifier at work. I have 200 examples for each of the 5 categories. Each example is an email. Two approaches:

  • Classifying emails with n-shot prompt classification, possibly with LoRA finetuning.
  • Use a pre-trained text embedder (e.g. a sentence transformer or OpenAI text-embeddings-3) and a classification head. Train the classifier on the text embeddings.
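
For the second approach, a very small baseline is worth trying before training anything: a nearest-centroid classifier over the embeddings. Note that this is a simpler stand-in for a trained linear head, not the same thing, and the 3-d vectors and category names below are toy placeholders for real embedder output.

```python
import math
from collections import defaultdict


def centroid_fit(embeddings, labels):
    """Average the embeddings of each class into one centroid per label."""
    sums, counts = {}, defaultdict(int)
    for vec, label in zip(embeddings, labels):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for j, x in enumerate(vec):
            acc[j] += x
        counts[label] += 1
    return {label: [x / counts[label] for x in acc] for label, acc in sums.items()}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def predict(centroids, vec):
    """Return the label whose centroid is most similar to the embedding."""
    return max(centroids, key=lambda label: cosine(centroids[label], vec))


# Toy 3-d "embeddings"; in practice these would come from the embedder.
train_x = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1], [0.1, 0.9, 0.0], [0.0, 0.8, 0.2]]
train_y = ["billing", "billing", "support", "support"]

centroids = centroid_fit(train_x, train_y)
print(predict(centroids, [0.7, 0.3, 0.0]))
```

With 200 examples per class, holding out a test split and comparing this baseline, a logistic-regression head, and the n-shot prompt on the same emails would answer the question empirically.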

Which approach is best?

6 Comments
2024/11/05
11:53 UTC

0

[D] Laptops for Theoretical Deep Learning

Hi, I am going for a PhD in theoretical deep learning and I am looking to buy a new laptop. I am unsure how readily the remote servers will be available (I have not been admitted into a program yet), so I am looking for enough compute power to simply test my code before running it on my lab's servers. I am currently deciding between:

  1. Asus Zenbook 14 OLED with 32GB RAM, Intel Core Ultra 9 185H Processor (24MB Cache, 16 cores, 22 Threads), 1TB M.2 NVMe SSD and 75WHrs 4-cell Li-ion battery
  2. Macbook Air with 24GB RAM, M2 Chip with 8-core CPU, 10-core GPU, 512GB Storage and 58.2-WHrs Li-polymer battery

I understand it would be better to go for an Nvidia GPU, and that neither of these laptops has one, but I am not looking to invest in one.

My thoughts right now are that the Zenbook 14 has a slightly better processor and much more RAM than the MBA. I don't care about the SSD; 512GB is enough for me. However, I frequently see academics use the MBA, which could simply be a fad, but I don't know. I am also wondering if I am missing something by not jumping on the MBA train. They are about the same price, so that's not much of a deciding factor.

I am also not sure if I should look at the cheaper 16GB options. I am currently using a 16GB Zenbook 13 bought 5 years back, but the RAM was limiting me in my Master's thesis project. The processors have improved since then, so I am not sure if 16GB is enough now. Also, I know it would be ideal to wait to learn more about the compute resources available at the lab I join, but my current laptop is in a very poor state, so much so that I cannot carry it anywhere (hardware damage), the screen flickers all the time, and I worry that it will turn off any second and leave my data inaccessible.

Does anyone have any thoughts or suggestions?

28 Comments
2024/11/05
11:11 UTC

2

[D] About the SpeechBrain WSJ0Mix dataset

I can't guarantee that the tag is appropriate.

I got tired of searching the WSJ0Mix dataset.

I want to separate multiple speakers.

The separator model of speechbrain doesn't give me the result I want.

So I wanted to build a model with the dataset I have.

However, no matter how much I searched for the WSJ0Mix dataset, it didn't come up.

I only found the *.m file, but I can't find what is included in the dataset or what is written in the *.csv file.

https://speechbrain.readthedocs.io/en/latest/tutorials/tasks/source-separation.html

The link above doesn't have the information I want either.

I'm very curious how you built the model.

3 Comments
2024/11/05
10:15 UTC

47

[D] Do second-tier papers have any value when applying for industry research jobs?

I think I have come across some industry jobs before that required applicants to have top-tier papers (NIPS/ICML/ICLR/CVPR/ICCV/ECCV), so my question is: do papers from less prestigious conferences (AAAI/IJCAI/WACV/BMVC...) or journals have any value when applying for these jobs? Additionally, do metrics like h-index or citation count matter?

35 Comments
2024/11/05
04:58 UTC

3

[D] Best Resources for Sensitivity Analysis on Large ML Pipelines?

I'm on a team that's launching a large project to examine how an ML pipeline behaves in response to variations in data.

This is the first time I'm doing a sensitivity analysis this large and complex in a while, so I'm looking for help to identify the most up-to-date resources on:

  • Simulated data, and especially any Python tools and how they compare with the best that R has to offer

  • Evaluation tooling

  • Elasticity

  • Any good resources on sensitivity analysis overall, particularly newer ones from the past couple of years

What are the best resources you've found?

0 Comments
2024/11/05
03:58 UTC

39

[D] Advice on Preparing for Google ML Interview – Key Areas to Focus On?

I’m preparing for a machine learning interview with Google, and the recruiter shared the main areas they’ll focus on:

- Theoretical ML concepts and practical applications – including problem definition, model selection, model tuning, and evaluation.

- Industry-Scale ML – covering performance and cost optimization, data handling, and production-oriented experimentation & debugging.

If anyone has insights on what to expect in these areas or tips on what to focus on, I’d really appreciate it! I’m especially struggling to understand what “Industry-Scale ML” questions could actually be.

Thanks in advance for any advice or resources!

edit: for context: I've already done my two LC-style interviews. The first interview was an easy-medium, I would say, and the second interview was definitely hard. I think I did well on both, but only the second interviewer let me know how I did (I did well, apparently). I also did the Googleyness interview, which I think went well too. We had some good conversation.

17 Comments
2024/11/05
03:53 UTC

10

Video Input for your local LLMs [P]

What My Project Does

OpenSceneSense-Ollama is a powerful Python package designed for privacy-focused video analysis directly on your local machine. With this tool, you can leverage Ollama’s local models to analyze frames, transcribe audio, dynamically select key frames, and generate detailed summaries — all without relying on cloud-based APIs. It’s ideal for those needing rich, insightful analysis of video content while ensuring data privacy and minimizing usage costs.

Target Audience

This project is tailored for developers, researchers, data scientists, and privacy-conscious users who require in-depth, locally processed video analysis. It's perfect for applications where data security is critical, including:

- Content creation workflows that need automatic video summarization

- Researchers building labeled datasets for machine learning

- Platforms needing context-rich content moderation

- Offline projects in remote or restricted environments

Comparison

OpenSceneSense-Ollama goes beyond traditional video analysis tools that often separate frame and audio analysis. Instead, it integrates both visual and audio elements, allowing users to prompt the models to produce comprehensive summaries and in-depth contextual insights. Where most tools might identify objects or transcribe audio separately, OpenSceneSense-Ollama unifies these components into narrative summaries, making it ideal for richer datasets or more nuanced content moderation.

Getting Started

To begin using OpenSceneSense-Ollama:

  1. Prerequisites: Make sure you have Python 3.10+, FFmpeg, PyTorch and Ollama installed on your machine.
  2. Install with pip: Run `pip install openscenesense-ollama` to install the package.
  3. Configuration: Start analyzing video with customizable prompts, frame selection, and audio transcription.

Feel free to dive in, try it out, and share your feedback especially if you're working in AI, privacy-focused applications, or video content moderation. Let’s build a powerful, local solution for meaningful video analysis!

https://github.com/ymrohit/openscenesense-ollama

6 Comments
2024/11/05
00:50 UTC

7

[P] NN for creating best camouflage

I have had this idea for some time, and I have created all the functions for generating the data as well as the whole architecture. The problem is that I only have two years of experience in deep learning, and this is a GAN-style network, and GANs are known to be very hard to train. I would like your opinions on the idea, as well as tips, suggestions, advice, and things to change. Also, if someone finds this interesting, I would love to work with someone on this project.

Camouflage Pattern Generation Model

The objective is to create a model that generates optimal camouflage color patterns by training a generator model and using a segmentation model as a discriminator to assess the effectiveness of the generated camouflage. Both the generator and discriminator are trained simultaneously.

Model Structure

Forward Process

  1. Generator:
    • The generator is a simple decoder model that takes a random latent vector of size n_embed = 128 and outputs a 3x32x32 camouflage color pattern.
    • This generated camouflage pattern is then tiled to form a larger texture, matching the size of an image of a soldier.
  2. Creating Camouflaged Soldier:
    • Random black-and-white PNG images of soldiers are sampled and resized to (1, W, H), with the values inverted so the soldier appears in white (foreground) and the background is black.
    • The tiled camouflage pattern is then applied to the soldier by masking with the soldier image, producing a camouflaged soldier figure. This entire operation is batched and allows gradients to flow through.
  3. Placing Camouflaged Soldier on Background:
    • The camouflaged soldier is randomly placed on a background image (e.g., a forest scene).
    • A label mask for the segmentation model is generated simultaneously, with two classes: background and soldier.
  4. Discriminator (Segmentation Model):
    • A pre-trained segmentation model (acting as a discriminator) is used with two output classes (background and soldier).
    • This model assesses how well the camouflage pattern blends the soldier into the background by trying to classify the soldier as the background.

Loss Functions and Optimization

Two loss functions are used, each with separate backpropagation processes:

  1. Generator Loss:
    • This encourages the generator to create a camouflage pattern that makes the soldier indistinguishable from the background.
    • Loss Function: CrossEntropyLoss(output, 0) where the output is the predicted segmentation map from the discriminator, and 0 represents the background class.
  2. Discriminator (Segmentation Model) Loss:
    • This encourages the segmentation model to correctly identify the camouflaged soldier in the background.
    • Loss Function: CrossEntropyLoss(output, label_mask) where the label mask has two classes: background and soldier.
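
Written out, the forward pass and the two losses above look roughly like this. The notation is introduced here for illustration: $G$ is the generator, $D$ the segmentation model, $z \in \mathbb{R}^{128}$ the latent vector, $m$ the binary soldier mask after random placement, $b$ the background image, and $\odot$ element-wise multiplication.

```latex
\begin{align}
x &= m \odot \operatorname{tile}\big(G(z)\big) + (1 - m) \odot b \\
\mathcal{L}_G &= \operatorname{CE}\big(D(x),\ \mathbf{0}\big)
  && \text{(every pixel should be predicted as background)} \\
\mathcal{L}_D &= \operatorname{CE}\big(D(x),\ m\big)
  && \text{(soldier pixels should be predicted as soldier)}
\end{align}
```

Two details that are assumptions rather than part of the description: in the discriminator step $x$ would normally be detached so that no gradient reaches $G$, and restricting $\mathcal{L}_G$ to the pixels where $m = 1$ might give a cleaner signal, since background pixels do not depend on the generator.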

Key Considerations

This setup resembles a Generative Adversarial Network (GAN) but differs in that it uses no "real" camouflage data, only generated samples. Additionally:

  • Separate Optimizers: Different optimizers are recommended for the generator and discriminator.
  • Loss Scaling: Careful tuning of scaling factors or learning rates may be required to stabilize training.
  • Two-Step Backpropagation: Instead of a typical GAN-style loss, a two-step backpropagation approach is used to update the models independently.

https://preview.redd.it/qd2cr2rkyyyd1.png?width=5603&format=png&auto=webp&s=0faee2cb0504a98c36b365b2edbc59253509d8c7

10 Comments
2024/11/04
23:28 UTC

137

What problems do Large Language Models (LLMs) actually solve very well? [D]

While there's growing skepticism about the AI hype cycle, particularly around chatbots and RAG systems, I'm interested in identifying specific problems where LLMs demonstrably outperform traditional methods in terms of accuracy, cost, or efficiency. Problems I can think of are:

- word categorization

- sentiment analysis of not-so-large bodies of text

- image recognition (to some extent)

- writing style transfer (to some extent)

what else?

100 Comments
2024/11/04
20:52 UTC

2

[D] Resources for adding cross attention to a pretrained language model

I want to train new cross attention layers feeding into a pretrained transformer (maybe a small llama model) while keeping the rest of the model constant.

What are some resources that might be helpful?

5 Comments
2024/11/04
16:03 UTC
