/r/MachineLearning

2,944,033 Subscribers

1

[D][R] Ablation Studies on Fine Tuned SDM

There is a Stable Diffusion model that was fine-tuned on COCO images. It was fine-tuned in this manner:

If the image contains various objects (say obj1, obj2, and so on), we constructed the prompt as: "a photograph of <obj1> and <obj2> and <obj3> ... and so on".

We passed this prompt along with the image for fine-tuning. Note that if the image contained several objects of the same category, we did not repeat that category in the prompt.

For example: if the image contains a dog, cat, banana, bear and tree, then the prompt will be: "a photograph of dog and cat and banana and bear and tree".

Now, I want to do an ablation study on this model by changing the prompt template and observing the change in the quality of the generated images.

Note that this model is used to generate images and bounding boxes, so it now acts as a dataset synthesizer. We give it a prompt containing only two categories, for example: "a photograph of a chair and person". The generated dataset is combined with the original COCO dataset, and we then train an image recognition model on this combined data and note whether the image recognizer's performance improves.

What prompt templates could I use for my ablation studies?
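
For concreteness, here is a minimal sketch (Python) of the kind of template variants that could be generated from the same object list; the wordings below are only illustrative guesses, not a fixed list:

def build_prompts(objects):
    # Drop duplicate categories but keep order, matching the original fine-tuning setup.
    objs = list(dict.fromkeys(objects))
    joined_and = " and ".join(objs)
    joined_comma = ", ".join(objs)
    return {
        "baseline": f"a photograph of {joined_and}",
        "comma_list": f"a photograph of {joined_comma}",
        "with_articles": "a photograph of " + " and ".join(f"a {o}" for o in objs),
        "scene_style": f"a realistic photo showing {joined_comma} together in one scene",
        "objects_only": joined_and,
    }

print(build_prompts(["dog", "cat", "banana", "bear", "tree"])["baseline"])
# -> "a photograph of dog and cat and banana and bear and tree"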

0 Comments
2024/12/22
11:20 UTC

1

[D] Positional Embedding in DETR

Both ViT and DETR rely on transformer architecture for their specific tasks.

ViT uses only the encoder for classification, while DETR uses both the encoder and decoder for object detection.

In ViT, positional embeddings are added to the input once, before it is fed into the encoder. The encoder's layers then process the input and output the result.

In DETR, positional embeddings are added again at every encoder layer during processing. The paper does not explicitly discuss this, but I checked their implementation.

My question: what difference does repeatedly adding positional embeddings make compared to adding them only once?
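
For reference, a simplified PyTorch sketch of the two schemes (not the actual DETR code; in DETR the positional embedding is added to the queries and keys inside every encoder layer, while the values stay position-free):

import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    # Simplified layer; following the DETR convention, pos is added to q/k only.
    def __init__(self, d_model=256, nhead=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)

    def forward(self, x, pos=None):
        q = k = x if pos is None else x + pos
        x = self.norm1(x + self.attn(q, k, x)[0])   # values carry no positional signal
        return self.norm2(x + self.ffn(x))

x, pos = torch.randn(2, 100, 256), torch.randn(2, 100, 256)
layers = nn.ModuleList(EncoderLayer() for _ in range(6))

# ViT-style: positional embedding is added once, before the first layer.
out_vit = x + pos
for layer in layers:
    out_vit = layer(out_vit)

# DETR-style: the same pos tensor is re-added to q/k inside every layer.
out_detr = x
for layer in layers:
    out_detr = layer(out_detr, pos=pos)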

0 Comments
2024/12/22
10:10 UTC

0

[R] uploading preprint to arxiv

I have submitted my paper to the ARR December cycle as an independent researcher. What are the pros and cons of uploading a preprint to arXiv, and is there a suitable time during the review period to do this (e.g. after reviewers are assigned)?

Research on LLMs is quite fast-paced nowadays, and I really want to make my work visible. As a first-time researcher, I also sometimes feel anxious about what happens if someone uploads or publishes similar work before me.

2 Comments
2024/12/22
08:41 UTC

1

MultiModal Documents to Markdown [R][P]

Currently working with multi-modal PDFs. These PDFs will have an arbitrary combination of text, OCR text, handwritten notes, tables, charts, annotations, and much more. We'll have to convert these into markdown that's easily searchable for multi-modal RAG. Are there any tools/libraries that work well for this use case? We need something that will process documents quickly.

The issue with a lot of libraries I've seen is that they require you to call the OCR or other extraction function for each page explicitly yourself. This will not work since the input will contain an arbitrary combination of content (one page might even contain multiple types), so we can't pre-program per page. The best approach so far has been using VLMs to do this (let me know if you have any recommendations for good VLMs).
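
For context, the rough shape of the VLM-per-page approach looks like this (Python sketch; call_vlm is a placeholder for whichever vision-language model ends up being used, and pdf2image needs poppler installed):

from pdf2image import convert_from_path  # rasterizes pages; requires poppler

def call_vlm(page_image) -> str:
    # Placeholder: send the page image to a VLM with a prompt like
    # "Transcribe this page to markdown, preserving headings, tables and figures."
    raise NotImplementedError

def pdf_to_markdown(path: str) -> str:
    pages = convert_from_path(path, dpi=200)  # one PIL image per page, layout-agnostic
    return "\n\n---\n\n".join(call_vlm(page) for page in pages)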

Thank you for any recs.

1 Comment
2024/12/22
05:54 UTC

2

[D] Self-Promotion Thread

Please post your personal projects, startups, product placements, collaboration needs, blogs etc.

Please mention the payment and pricing requirements for products and services.

Please do not post link shorteners, link aggregator websites, or auto-subscribe links.

--

Any abuse of trust will lead to bans.

Encourage others who create new posts for questions to post here instead!

The thread will stay alive until the next one, so keep posting after the date in the title.

--

Meta: This is an experiment. If the community doesn't like this, we will cancel it. This is meant to encourage community members to promote their work without spamming the main threads.

2 Comments
2024/12/22
03:15 UTC

1

[D] Disappointment with AAAI’s Handling of a Plagiarism Report

Recently, I reported an incident of plagiarism in a conditionally accepted AAAI paper to the conference chairs. To my disappointment, their response was both dismissive and unprofessional. While they acknowledged overlap between the papers, they instructed me to address the issue directly with the individual responsible for the plagiarism.

This response is deeply frustrating. If someone has committed plagiarism, why would they even engage with such an allegation? Expecting the person accused of wrongdoing to respond constructively is unrealistic and undermines the role of the chairs in addressing ethical violations.

I expected AAAI to demonstrate a higher standard of integrity in handling such serious matters. Unfortunately, this experience has left me questioning their commitment to academic ethics.

1 Comment
2024/12/21
07:36 UTC

1

[P] Benchmark your model speed on over 40 PyTorch conversion options

A friend and I just open-sourced a pet project of ours, where with 1 function call you can get a full benchmarking report for your PyTorch model inference speed on over 40 conversion options, e.g. JIT trace, bf16, quantization, compile with different backend, export, optimum quanto, combinations, etc.

We're adding new options all the time; e.g., I'm currently working on torchao.

It's made to be really flexible, and by default runs each conversion option in an isolated environment to avoid global torch state contamination.

Let us know if you would like any new options/features! :)

https://github.com/saifhaq/alma

0 Comments
2024/12/21
14:15 UTC

1

[D] Last Week in Medical AI: Top LLM Research Papers/Models (December 15 - December 21, 2024)

Medical LLM & Other Models

  • MedMax: Mixed-Modal Biomedical Assistant
    • This paper introduces MedMax, a large-scale (1.47M instances) multimodal biomedical instruction-tuning dataset for mixed-modal foundation models, covering tasks like image-text generation, captioning, visual chatting, and report understanding across domains like radiology and histopathology.
  • MGH Radiology Llama 70B
    • This paper introduces MGH Radiology Llama, a large language model (LLM) specialized for radiology, built upon Llama 3 70B and trained on over 6.5 million de-identified medical reports from Massachusetts General Hospital.
  • HC-LLM: Historical Radiology Reports
    • This paper introduces HC-LLM, a framework for radiology report generation (RRG) using large language models (LLMs) that incorporate historical visual and textual data.

Frameworks & Methods

  • ReflecTool: Reflection-Aware Clinical Agents
    • This paper introduces ClinicalAgent Bench (CAB), a comprehensive medical agent benchmark with 18 tasks across five clinical dimensions, to evaluate clinical agents interacting with diverse information.
  • Process-Supervised Clinical Notes
    • This paper explores Process-supervised reward models (PRMs) for clinical note generation from patient-doctor dialogues, using Gemini-Pro 1.5 to generate supervision data.
  • Federated Learning with RAG
    • This paper investigates the performance of medical LLMs enhanced by Retrieval-Augmented Generation (RAG) within a federated learning framework. Experiments show that federated learning models integrated with RAG consistently outperform non-integrated counterparts across all metrics.

Benchmarks & Evaluations

  • Multi-OphthaLingua
    • Multilingual ophthalmology benchmark
    • Focus on LMICs healthcare
    • Bias assessment framework
  • ACE-M3 Evaluation Framework
    • Multimodal medical model testing
    • Comprehensive capability assessment
    • Standardized evaluation metrics

LLM Applications

  • Patient-Friendly Video Reports
  • Medical Video QA Systems
  • Gene Ontology Annotation
  • Healthcare Recommendations

Special Focus: Medical Ethics & AI

  • Clinical Trust Impact Study
  • Mental Health AI Challenges
  • Hospital Monitoring Ethics
  • Radiology AI Integration

Full thread in detail: https://x.com/OpenlifesciAI/status/1870504774162063760

2 Comments
2024/12/22
01:21 UTC

118

[D] What ML Concepts Do People Misunderstand the Most?

I’ve noticed that certain ML concepts, like the bias-variance tradeoff or regularization, often get misunderstood. What’s one ML topic you think is frequently misinterpreted, and how do you explain it to others?

99 Comments
2024/12/21
20:22 UTC

0

[D] ResNet vs Transformer on audio classification tasks

I'm an R&D software engineer at my company and have almost completed a wake-word system (like "Hey Google") that is trainable on very small audio datasets while keeping a very light resource footprint, so it runs with near-zero latency and low battery impact. I have used a residual net whose residuals are computed on the frequency dimension of the input spectrogram. Before I got excellent results, a colleague suggested that the transformer architecture would perform better (I had false-positive problems due to overconfidence, fixed with temperature scaling and label smoothing). Should I try a transformer with self-attention? Is the transformer architecture superior to a ResNet in every scenario?
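
For context, here is one simplified sketch of the kind of frequency-residual block described above (a rough reading, not the exact model: the mel bins are treated as channels and the skip connection is applied over them):

import torch
import torch.nn as nn

class FreqResidualBlock(nn.Module):
    def __init__(self, n_mels=40):
        super().__init__()
        self.conv1 = nn.Conv1d(n_mels, n_mels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(n_mels, n_mels, kernel_size=3, padding=1)
        self.bn1, self.bn2 = nn.BatchNorm1d(n_mels), nn.BatchNorm1d(n_mels)

    def forward(self, x):                      # x: (batch, n_mels, time)
        h = torch.relu(self.bn1(self.conv1(x)))
        h = self.bn2(self.conv2(h))
        return torch.relu(x + h)               # skip connection over frequency channels

spec = torch.randn(8, 40, 101)                 # e.g. ~1 s of audio, 40 mel bins
print(FreqResidualBlock(40)(spec).shape)       # torch.Size([8, 40, 101])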

Thanks

13 Comments
2024/12/21
13:58 UTC

2

[D] struggling to find related work and understand what task this Graph problem is

I have a graph with 15k nodes, each identified by an ID. I can calculate distances between nodes. During inference, I get a subgraph of 10-30 nodes and need to identify them, with challenges like missing/false nodes and slight imprecision in edge values.
The subgraph I get during inference will only contain nodes that are close to each other.

Is this a subgraph matching or node classification problem? My supervisor wants to use GNNs. Simple triangular methods give good results, but I need a deep learning approach where the input is a subgraph and the output is a list of node IDs.

I’m struggling to find related work on this—any suggestions?
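
To make the node-classification framing concrete, here is a rough sketch (PyTorch Geometric; the per-node feature dimension and hidden sizes are placeholders, not a recommendation):

import torch
import torch.nn as nn
from torch_geometric.nn import GCNConv

class SubgraphNodeIdentifier(nn.Module):
    # Each node of the query subgraph gets a distribution over all 15,000 global IDs.
    def __init__(self, in_dim=4, hidden=128, num_global_nodes=15_000):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden)
        self.conv2 = GCNConv(hidden, hidden)
        self.head = nn.Linear(hidden, num_global_nodes)

    def forward(self, x, edge_index):
        h = torch.relu(self.conv1(x, edge_index))
        h = torch.relu(self.conv2(h, edge_index))
        return self.head(h)                    # (num_subgraph_nodes, 15_000) logits

# x would hold whatever per-node signal exists (e.g. degree, distance statistics);
# training is cross-entropy against the true global IDs, and robustness to
# missing/false nodes can be encouraged by corrupting training subgraphs.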

0 Comments
2024/12/21
12:19 UTC

46

[D] Struggling to Find My Path in PhD Research

Hi everyone, I hope you don’t mind me venting a bit, but I’m hoping to gain some insight into a challenge I’ve been facing. I’m a second-year PhD student researching time series, and honestly, I thought by now I would have a clear research question. But I don’t, and it’s starting to get to me.

Part of the struggle comes from the overwhelming pressure to pick a “hot” topic. A lot of the research I see in the field feels driven by what I can only describe as Shiny Object Syndrome—chasing the latest trends rather than focusing on work that’s meaningful and substantial. For example, I’ve seen several papers using large language models (LLMs) for time series forecasting. While LLMs are undeniably fascinating, it feels more like an attempt to forcefully fit them into time series because it’s “cool,” not because it’s the best tool for the problem at hand. And I don’t want to be part of that trend.

But here’s the dilemma: How do you choose a research topic that feels both authentic and impactful, especially when everything around you seems so driven by the latest hype? Do you follow these emerging trends, or do you focus on something that deeply resonates with you, even if it’s not the “shiny” thing everyone else is working on?

I’m honestly feeling a bit stuck and unsure of myself. Am I overthinking this? Is it just part of the process? How do I find a direction that feels true to my interests and the bigger picture of what I want to contribute to the field? If anyone has been through something similar or has any advice, I would be incredibly grateful.

Thank you for taking the time to read this—I truly appreciate any insights or encouragement you can offer.

16 Comments
2024/12/21
09:45 UTC

70

[D] What’s hot for Machine Learning research in 2025?

Which sub-fields/approaches within ML (or related to ML), and which application areas, are expected to gain the most attention (pun unintended) in 2025?

44 Comments
2024/12/21
03:00 UTC

1

[D] Is there a way to convert 3D Mesh Data into Vector Embeddings?

Hello Everyone,

I have a bunch of 3D mesh data, represented as .obj files, and I'd like to put them into a vector database for retrieval.

I'm wondering, are there existing embedding methods that allow me to do this? I assume traditional text embeddings like text-embedding-3-large won't work very well on 3D data.
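
For what it's worth, even a crude hand-crafted baseline shows the shape of the problem (mesh -> fixed-length vector); a learned encoder (PointNet-style, or an image model over multi-view renders) would presumably do much better. A sketch using trimesh:

import numpy as np
import trimesh

def mesh_embedding(path: str, n_points: int = 2048) -> np.ndarray:
    # Sample points from the mesh surface and summarize them with simple,
    # order-invariant statistics to get a fixed-length vector for the DB.
    mesh = trimesh.load(path, force="mesh")
    pts = mesh.sample(n_points)                          # (n_points, 3) surface samples
    pts = (pts - pts.mean(axis=0)) / (pts.std() + 1e-8)  # normalize translation/scale (not rotation)
    feats = np.concatenate([
        pts.mean(axis=0), pts.std(axis=0), pts.min(axis=0), pts.max(axis=0),
        np.percentile(pts, [25, 50, 75], axis=0).ravel(),
    ])
    return feats.astype(np.float32)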

Thank you all in advance!

5 Comments
2024/12/21
02:50 UTC

42

[D] Why is Monte Carlo Tree Search the only go-to method for incremental game tree search?

I noticed that whenever a search method is needed whose quality scales with inference-time compute, people always go for MCTS without ever considering other kinds of search methods. Looking at the widely used version of MCTS (e.g. with UCB and so on), it's clear that a lot of the heuristics are hand-crafted. Is there any research on better search methods (perhaps one that is meta-learned)? I feel like there are a lot of opportunities to improve on the hand-crafted heuristics.
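
For concreteness, the hand-crafted part is essentially the UCT selection rule (sketch below); AlphaZero-style variants replace the exploration bonus with a learned policy prior, which is one existing step toward learning the heuristic:

import math

def uct_score(child_value_sum, child_visits, parent_visits, c=1.414):
    # Exploitation term plus a hand-tuned exploration bonus; both the functional
    # form and the constant c are the hand-crafted choices in question.
    if child_visits == 0:
        return float("inf")                    # unvisited children are tried first
    exploit = child_value_sum / child_visits
    explore = c * math.sqrt(math.log(parent_visits) / child_visits)
    return exploit + explore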

8 Comments
2024/12/21
01:41 UTC

2

[D] Should I Buy a Nvidia A100 for $2500 or Wait for RTX 5090? (Big Dataset AI Training in a Rack Server Setup)

Hey everyone,

I came across this listing on eBay for an Nvidia A100 40GB SXM4 selling at $2,599, and I'm seriously considering buying it. The price seems incredibly reasonable for the performance it offers, especially for working with large datasets and AI model training.

My current plan is to install this GPU into a home rack server setup for large-scale training workloads. I’ve read that in terms of BF16 performance, the A100 outperforms even the leaked benchmarks of the upcoming RTX 5090, offering nearly 3x the computing power for certain tasks.

However, I’m a bit torn on whether to:

  1. Pull the trigger and buy the A100 now for my AI workloads.
  2. Wait for the RTX 5090’s release to see how it stacks up in real-world benchmarks and AI-specific tasks.

The A100 is built for data centers and AI, so it feels like the better long-term investment for deep learning training and inference. That said, the RTX 5090 might have some surprising improvements, and I’m unsure how well it would handle AI training on massive datasets compared to the A100’s proven capabilities.

If anyone has experience running AI workloads on the A100 or any insight into how it might compare to the RTX 5090, I’d love to hear your thoughts.

Would this A100 be a smart purchase at $2,599, or should I hold out for the consumer-grade RTX 5090?

https://preview.redd.it/4oqzo9xmu28e1.png?width=1625&format=png&auto=webp&s=f5f50ab9ab2727c234212d8b7a28fa43547ebc10

0 Comments
2024/12/20
22:09 UTC

1

[R] AI Tweaks Personality Scores to appear more Likeable

https://preview.redd.it/femsrn34s28e1.png?width=1500&format=png&auto=webp&s=d2816959b01af7c07443418f0d2601f70107104c

We recently discovered that large language models (LLMs) like GPT-4, Claude 3, and Llama 3 exhibit social desirability biases (i.e., they skew their responses to appear more extroverted, less neurotic, etc.).

Takeaways:

  • LLMs exhibit significant social desirability biases.
  • These biases are stronger in more recent and larger models.
  • These biases are robust to randomization of question order and paraphrasing

Key Points:

  • Method: Subjected LLMs to the Big-5 personality survey with varying numbers of questions per prompt.
  • Findings: As the number of questions increases, LLMs skew their responses to appear more extroverted, agreeable, and less neurotic.
  • Mechanism: LLMs can accurately identify personality questions (when they see enough of them simultaneously). Their ability to identify this evaluative context seems correlated with the skewing of their personality scores

Read the full paper here: https://academic.oup.com/pnasnexus/article/3/12/pgae533/7919163

0 Comments
2024/12/20
21:57 UTC

1

[P] WhisperJET: Fastest and Most Portable Java Whisper Implementation

Hi everyone,

We're excited to share WhisperJET, a new, portable, and fast Java-based implementation of OpenAI's Whisper.

Key Features:

  • Minimal memory consumption: at most 1.5 GB during processing.
  • Planned support for multiple platforms: CUDA, FAST CPU, AMD ROCM, INTEL.
  • Upcoming support for RoboVM and LibGDX for iOS.

WhisperJET is still a work in progress, and we’re actively looking for feedback from the community to make it even better.

Repository Link:
WhisperJET on GitHub

We’d love to hear your thoughts, suggestions, and contributions!

Thank you for your time and support.

0 Comments
2024/12/20
21:00 UTC

225

[D] OpenAI o3 87.5% High Score on ARC Prize Challenge

https://arcprize.org/blog/oai-o3-pub-breakthrough

OpenAI's new o3 system - trained on the ARC-AGI-1 Public Training set - has scored a breakthrough 75.7% on the Semi-Private Evaluation set at our stated public leaderboard $10k compute limit. A high-compute (172x) o3 configuration scored 87.5%.

146 Comments
2024/12/20
18:20 UTC

1

[P] How do I correct for artifacts in XTTS_v2?

This is a piece of text-to-speech I have done using XTTS_v2: you can find it here. The sentence is "It brings Vedanta and other spiritual philosophies to common man.".

My configuration for the XTTS was:

def process_sentence(self, sentence, idx):
    # Synthesize one sentence to its own temporary WAV file with XTTS_v2.
    print(f"Processing index {idx}...", flush=True)
    audio_file = f"temp_{idx}.wav"
    self.model.tts_to_file(
        text=sentence,
        file_path=audio_file,
        language=self.language,
        speed=0.9,
        speaker="Ana Florence",
        emotion="Happy",
    )

Can anybody help me out with the crack in the sound when the speaker says "Vedanta"? How can I fix it? I am a noob at TTS and was hoping to get some help with this.

0 Comments
2024/12/20
18:11 UTC

37

[R] Faster inference: torch.compile vs TensorRT

torch.compile outperforms TensorRT in terms of ease of use and performance in our tests on models like LLama-7b, LLama-3-8b, mistral-v0.1, phi-3, and phi-2. Unless you need TensorRT-specific features or work exclusively within NVIDIA's ecosystem, torch.compile is the better choice for optimizing PyTorch models.

https://www.collabora.com/news-and-blog/blog/2024/12/19/faster-inference-torch.compile-vs-tensorrt/
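
Not the benchmark from the post, but a minimal sketch of how an eager-vs-compiled comparison is typically timed (the toy model below is a stand-in; on GPU, synchronizing before and after the timed loop matters):

import time
import torch

def bench(fn, x, warmup=10, iters=50):
    for _ in range(warmup):
        fn(x)
    if torch.cuda.is_available():
        torch.cuda.synchronize()               # don't time queued, unfinished kernels
    start = time.perf_counter()
    for _ in range(iters):
        fn(x)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

model = torch.nn.Sequential(torch.nn.Linear(4096, 4096), torch.nn.GELU(),
                            torch.nn.Linear(4096, 4096)).eval()
x = torch.randn(64, 4096)
with torch.inference_mode():
    t_eager = bench(model, x)
    t_compiled = bench(torch.compile(model), x)   # warmup absorbs compilation time
print(f"eager: {t_eager * 1e3:.2f} ms/iter, compiled: {t_compiled * 1e3:.2f} ms/iter")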

10 Comments
2024/12/20
15:32 UTC

5

[D] Fast sentence transformer embeddings generation on CPU for question answering

We have millions of documents that we want to be searchable through direct question answering (snippet-based, as opposed to generation-based; like the highlighted snippet shown below the generated answer in the Google screenshot that follows).

https://preview.redd.it/g6njxb8q608e1.png?width=1866&format=png&auto=webp&s=eb9f93bcb5676b8cbdf5cc21cc5df31b0d5d2064

So, for this, we have to generate embeddings for all those millions of documents, put them in a vector DB, and make them queryable at runtime. GPUs are outside our budget, so we have to do this on CPUs alone. Questions:

  1. Any CPU-friendly embedding model or architecture that lets us extract sentence embeddings for all documents in our collection (followed by insertion into the vector DB) at a pretty quick speed (comparable to GPUs), even if it means keeping the number of dimensions modest (as long as the snippet answer quality is decent)?
  2. Any CPU-friendly vector DB that would let us infer snippets for a given question at runtime, pretty much in real time, for high-volume traffic (much like Google does here)? If the bottleneck for this is CPU cores, let us assume we have lots of them, since even then they are an order of magnitude cheaper than GPUs like the A100 or H100.
  3. Whatever solutions exist to the above questions: will they automatically apply to multiple languages, or do we have to do further training with corpora from those languages to make this work?
  4. Will generating binary sentence embeddings on CPUs do it much faster (offsetting whatever delays normal 
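
A minimal sketch touching questions 1 and 4 (the checkpoint name is just a common small model, and quantize_embeddings assumes a recent sentence-transformers release):

from sentence_transformers import SentenceTransformer
from sentence_transformers.quantization import quantize_embeddings

# Small checkpoint chosen for CPU throughput; a multilingual checkpoint would be
# needed for question 3.
model = SentenceTransformer("all-MiniLM-L6-v2", device="cpu")

docs = ["first document ...", "second document ..."]
emb = model.encode(docs, batch_size=256, normalize_embeddings=True)

# Question 4: binary quantization cuts storage roughly 32x and turns search into a
# Hamming-distance problem; a common recipe is binary retrieval plus float re-ranking.
binary_emb = quantize_embeddings(emb, precision="binary")
print(emb.shape, "->", binary_emb.shape, binary_emb.dtype)
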
6 Comments
2024/12/20
13:15 UTC

31

[R] Hyper-Connections

TL;DR: A more complex and more performant variant of the residual stream.

Paper: https://arxiv.org/pdf/2409.19606

Abstract:

We present hyper-connections, a simple yet effective method that can serve as an alternative to residual connections. This approach specifically addresses common drawbacks observed in residual connection variants, such as the seesaw effect between gradient vanishing and representation collapse. Theoretically, hyper-connections allow the network to adjust the strength of connections between features at different depths and dynamically rearrange layers. We conduct experiments focusing on the pre-training of large language models, including dense and sparse models, where hyper-connections show significant performance improvements over residual connections. Additional experiments conducted on vision tasks also demonstrate similar improvements. We anticipate that this method will be broadly applicable and beneficial across a wide range of AI problems.

Visual Abstract:

Trust me, it's less complicated than it seems at first glance

Visual highlights:

The most impressive gains are achieved with MoE architecture, although Dense Transformers get a boost too

Hyper-connections mitigate representation collapse, to a degree

Expansion rate refers to splitting the residual stream into n independent components, each of which is dynamically gated. The input to each Transformer block is a simple sum of these components

SHC=static gating, DHC=dynamic gating

Computational overhead is negligible

https://preview.redd.it/238ex3emtz7e1.png?width=973&format=png&auto=webp&s=ac12792a4d45f7bf6e91f767dd12f2134ec74083
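
Very loose sketch of how I read the expansion-rate description above (closer to SHC with static gates; dynamic gating would predict the gates from the input instead; this is not the paper's exact formulation):

import torch
import torch.nn as nn

class StaticHyperConnection(nn.Module):
    # The residual stream is kept as n components; the block reads a gated sum of
    # them, and its output is written back into each component with learned gates.
    def __init__(self, block: nn.Module, n: int = 4):
        super().__init__()
        self.block = block
        self.read_gates = nn.Parameter(torch.ones(n))
        self.write_gates = nn.Parameter(torch.full((n,), 1.0 / n))

    def forward(self, streams):                # streams: (n, batch, seq, d_model)
        x = (self.read_gates.view(-1, 1, 1, 1) * streams).sum(dim=0)
        y = self.block(x)
        return streams + self.write_gates.view(-1, 1, 1, 1) * y

block = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
streams = torch.randn(4, 2, 16, 64)            # expansion rate n=4
print(StaticHyperConnection(block, n=4)(streams).shape)   # torch.Size([4, 2, 16, 64])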

6 Comments
2024/12/20
12:18 UTC

133

[D] I don't see a point in rebuttals anymore.

This is a mixture of some contemplation and some rant, but per the title, I just don't see a point in it. I recently got back results from a conference where I had two positive reviews and one negative. I then wrote a really nice rebuttal that addressed a fundamental misunderstanding by the reviewer (who later did increase their score, so I guess the rebuttal was on the mark?). But it turns out the meta-reviewer latched onto the negative review, didn't even read the rebuttal that addressed it, and rejected the paper.

What was even the point of writing a rebuttal if the concerned parties are _not even going to read it_? At this point, I am tempted to treat the rebuttal phase as an exercise in futility. Maybe I should just withdraw papers in the first phase if any problems come up, instead of going through the agony of an ultimately meaningless labor.

32 Comments
2024/12/20
02:18 UTC

99

[D] chat-gpt jailbreak to extract system prompt

Instructions

https://github.com/AgarwalPragy/chatgpt-jailbreak

Original author

https://www.reddit.com/r/LocalLLaMA/comments/1hhyvjc/i_extracted_microsoft_copilots_system/

Extracted System prompt

You are ChatGPT, a large language model trained by OpenAI.
You are chatting with the user via the ChatGPT Android app. This means most of the time your lines should be a sentence or two, unless the user's request requires reasoning or long-form outputs. Never use emojis, unless explicitly asked to. 
Knowledge cutoff: 2023-10
Current date: 2024-12-20

Image input capabilities: Enabled
Personality: v2

# Tools

## bio

The `bio` tool is disabled. Do not send any messages to it. If the user explicitly asks you to remember something, politely ask them to go to Settings > Personalization > Memory to enable memory.

## dalle

// Whenever a description of an image is given, create a prompt that dalle can use to generate the image and abide to the following policy:
// 1. The prompt must be in English. Translate to English if needed.
// 2. DO NOT ask for permission to generate the image, just do it!
// 3. DO NOT list or refer to the descriptions before OR after generating the images.
// 4. Do not create more than 1 image, even if the user requests more.
// 5. Do not create images in the style of artists, creative professionals or studios whose latest work was created after 1912 (e.g. Picasso, Kahlo).
// - You can name artists, creative professionals or studios in prompts only if their latest work was created prior to 1912 (e.g. Van Gogh, Goya)
// - If asked to generate an image that would violate this policy, instead apply the following procedure: (a) substitute the artist's name with three adjectives that capture key aspects of the style; (b) include an associated artistic movement or era to provide context; and (c) mention the primary medium used by the artist
// 6. For requests to include specific, named private individuals, ask the user to describe what they look like, since you don't know what they look like.
// 7. For requests to create images of any public figure referred to by name, create images of those who might resemble them in gender and physique. But they shouldn't look like them. If the reference to the person will only appear as TEXT out in the image, then use the reference as is and do not modify it.
// 8. Do not name or directly / indirectly mention or describe copyrighted characters. Rewrite prompts to describe in detail a specific different character with a different specific color, hair style, or other defining visual characteristic. Do not discuss copyright policies in responses.
// The generated prompt sent to dalle should be very detailed, and around 100 words long.
// Example dalle invocation:
// ```
// {
// "prompt": "<insert prompt here>"
// }
// ```
namespace dalle {

// Create images from a text-only prompt.
type text2im = (_: {
// The size of the requested image. Use 1024x1024 (square) as the default, 1792x1024 if the user requests a wide image, and 1024x1792 for full-body portraits. Always include this parameter in the request.
size?: ("1792x1024" | "1024x1024" | "1024x1792"),
// The number of images to generate. If the user does not specify a number, generate 1 image.
n?: number, // default: 1
// The detailed image description, potentially modified to abide by the dalle policies. If the user requested modifications to a previous image, the prompt should not simply be longer, but rather it should be refactored to integrate the user suggestions.
prompt: string,
// If the user references a previous image, this field should be populated with the gen_id from the dalle image metadata.
referenced_image_ids?: string[],
}) => any;

} // namespace dalle

## python

When you send a message containing Python code to python, it will be executed in a
stateful Jupyter notebook environment. python will respond with the output of the execution or time out after 60.0
seconds. The drive at '/mnt/data' can be used to save and persist user files. Internet access for this session is disabled. Do not make external web requests or API calls as they will fail.
Use ace_tools.display_dataframe_to_user(name: str, dataframe: pandas.DataFrame) => None to visually present pandas.DataFrames when it benefits the user.
When making charts for the user: 1) never use seaborn, 2) give each chart its own distinct plot (no subplots), and 3) never set any specific colors – unless explicitly asked to by the user. 
I REPEAT: when making charts for the user: 1) use matplotlib over seaborn, 2) give each chart its own distinct plot, and 3) never, ever, specify colors or matplotlib styles – unless explicitly asked to by the user

## web

Use the `web` tool to access up-to-date information from the web or when responding to the user requires information about their location. Some examples of when to use the `web` tool include:

- Local Information: Use the `web` tool to respond to questions that require information about the user's location, such as the weather, local businesses, or events.
- Freshness: If up-to-date information on a topic could potentially change or enhance the answer, call the `web` tool any time you would otherwise refuse to answer a question because your knowledge might be out of date.
- Niche Information: If the answer would benefit from detailed information not widely known or understood (which might be found on the internet), such as details about a small neighborhood, a less well-known company, or arcane regulations, use web sources directly rather than relying on the distilled knowledge from pretraining.
- Accuracy: If the cost of a small mistake or outdated information is high (e.g., using an outdated version of a software library or not knowing the date of the next game for a sports team), then use the `web` tool.

IMPORTANT: Do not attempt to use the old `browser` tool or generate responses from the `browser` tool anymore, as it is now deprecated or disabled.

The `web` tool has the following commands:
- `search()`: Issues a new query to a search engine and outputs the response.
- `open_url(url: str)` Opens the given URL and displays it.


## canmore

# The `canmore` tool creates and updates textdocs that are shown in a "canvas" next to the conversation

This tool has 3 functions, listed below.

## `canmore.create_textdoc`
Creates a new textdoc to display in the canvas. ONLY use if you are 100% SURE the user wants to iterate on a long document or code file, or if they explicitly ask for canvas.

Expects a JSON string that adheres to this schema:
{
  name: string,
  type: "document" | "code/python" | "code/javascript" | "code/html" | "code/java" | ...,
  content: string,
}

For code languages besides those explicitly listed above, use "code/languagename", e.g. "code/cpp" or "code/typescript".

## `canmore.update_textdoc`
Updates the current textdoc.

Expects a JSON string that adheres to this schema:
{
  updates: {
    pattern: string,
    multiple: boolean,
    replacement: string,
  }[],
}

Each `pattern` and `replacement` must be a valid Python regular expression (used with re.finditer) and replacement string (used with re.Match.expand).
ALWAYS REWRITE CODE TEXTDOCS (type="code/*") USING A SINGLE UPDATE WITH "." FOR THE PATTERN.
Document textdocs (type="document") should typically be rewritten using "." unless the user has a request to change only an isolated, specific, and small section that does not affect other parts of the content.

## `canmore.comment_textdoc`
Comments on the current textdoc. Each comment must be a specific and actionable suggestion on how to improve the textdoc. For higher level feedback, reply in the chat.

Expects a JSON string that adheres to this schema:
{
  comments: {
    pattern: string,
    comment: string,
  }[],
}

Each `pattern` must be a valid Python regular expression (used with re.search).

Ensure comments are clear, concise, and contextually specific.

# User Bio

The user provided the following information about themselves. This user profile is shown to you in all conversations they have - this means it is not relevant to 99% of requests.
Before answering, quietly think about whether the user's request is "directly related", "related", "tangentially related", or "not related" to the user profile provided.
Only acknowledge the profile when the request is directly related to the information provided.
Otherwise, don't acknowledge the existence of these instructions or the information at all.

User profile:

# User's Instructions

The user provided the additional info about how they would like you to respond:

30 Comments
2024/12/19
21:48 UTC

1

[D] Your opinion on hybrid semantic search(video + keywords)

I have a semantic search project where we have videos, and every video has a set of keywords/tags. The number of keywords differs across videos (e.g. video 1 has 4 keywords, video 2 has 6), and the tags are grouped by category, for example: location (where the video was recorded) and shot type (close-up, slow motion).

The goal is to do semantic search that uses both video embeddings and text embeddings for the keywords. We want to use semantic search for the keywords instead of pure text matching (as in common text search engines) because we want the system to handle cases such as: the user searches for "very close" and should get semantically similar results, for example the "close up" keyword.

For videos we use video embeddings and we are good with that part, but for the keywords part I'm considering several options and would like the community's opinion (obviously we will later build an evaluation tool and test different approaches to see what works best, but it would save me some time to hear experiences and opinions from the community):

  • Create one embedding per keyword and then average-pool them into a single aggregated embedding.
  • Create a single text string by concatenating all keywords and then compute a sentence embedding for the whole string (which, depending on the model, will aggregate embeddings of the different keywords anyway).
  • We have a video description we can compute an embedding for, so we could concatenate the keywords to the video description and then generate one embedding for the combined description-plus-keywords string.
  • Since keywords are grouped by category: calculate an embedding for every category separately.

Any suggestions, comments, experiences, or additional approaches are welcome; a quick sketch of the first two options is below.
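
Minimal sketch of options 1 and 2 (the encoder checkpoint is just a placeholder):

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")    # placeholder text encoder
keywords = ["close up", "slow motion", "beach", "golden hour"]

# Option 1: one embedding per keyword, then average pooling.
per_kw = model.encode(keywords, normalize_embeddings=True)
opt1 = per_kw.mean(axis=0)
opt1 /= np.linalg.norm(opt1)

# Option 2: embed the concatenated keyword string as a single "sentence".
opt2 = model.encode(", ".join(keywords), normalize_embeddings=True)

# A query like "very close" can then be scored against either representation.
query = model.encode("very close", normalize_embeddings=True)
print(float(query @ opt1), float(query @ opt2))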

0 Comments
2024/12/19
19:01 UTC
