/r/deeplearning
Resources for understanding and implementing "deep learning" (learning data representations through artificial neural networks).
Been reverse engineering WizardMath's architecture (Luo et al., 2023) and honestly, it's beautiful in its simplicity. Everyone's focused on the results, but the 3-step training process is the real breakthrough.
Most "math-solving" LLMs are just doing fancy pattern matching. This approach is different because it's actually learning mathematical reasoning, not just memorizing solution patterns.
I've been implementing something similar in my own work. The results aren't as good as WizardMath yet, but the approach scales surprisingly well to other types of reasoning tasks. You can read more of my analysis here; if you're experimenting with WizardMath, let me know as well: https://blog.bagel.net/p/train-fast-but-think-slow
Hi,
I'm working on a Variational Autoencoder that is fully convolutional, so it can take inputs of different sequence lengths. When calculating the Kullback-Leibler divergence, you would normally reduce the loss by summation and then divide by the batch size. However, because my inputs have different sequence lengths, I think this makes the KL divergence vary a lot between batches. I could normalize by dividing the KL loss by the sequence length of the given batch, but that would not be mathematically correct.
I'm unsure what to do.
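For concreteness, here is a minimal sketch of the two reductions I'm weighing; the shapes and names are just for illustration, and I'm assuming the encoder outputs mu and logvar of shape (batch, latent_dim, seq_len):

import torch

def kl_loss(mu, logvar, normalize_by_length=False):
    # mu, logvar: (batch, latent_dim, seq_len) from a fully convolutional encoder
    # Closed-form KL between N(mu, sigma^2) and N(0, 1), summed over latent dims and time
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp())   # (batch, latent_dim, seq_len)
    kl = kl.sum(dim=(1, 2))                               # sum over latent dim and sequence length
    if normalize_by_length:
        kl = kl / mu.shape[-1]    # per-timestep KL: comparable across batches, but no longer the exact ELBO term
    return kl.mean()              # average over the batch

# Two batches with different sequence lengths
mu_a, logvar_a = torch.randn(8, 16, 100), torch.randn(8, 16, 100)
mu_b, logvar_b = torch.randn(8, 16, 500), torch.randn(8, 16, 500)
print(kl_loss(mu_a, logvar_a), kl_loss(mu_b, logvar_b))              # very different scales
print(kl_loss(mu_a, logvar_a, True), kl_loss(mu_b, logvar_b, True))  # comparable scales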
Deep Learning Innovations: What’s Driving the Next Wave of AI Power?
The field of deep learning is continuously evolving, with innovations like self-supervised learning, new architectures, and neural networks that are capable of unimaginable feats. This post explores recent advancements that are pushing the boundaries of what deep learning can achieve, from image recognition to natural language understanding. Learn about the latest tools, research, and trends shaping the future of AI.🪐
Want to stay at the cutting edge of AI? Join r/deeplearning and see what’s new in the world of deep learning! 👇🏽
I was working with SAM2 and have been trying to figure out the best way to fine-tune it for my specific use case. A few considerations that I was hoping to get some insights on:
Hey there, I hope you're doing well. I'm turning 20 in the next few months and I just dropped out of university for financial reasons; my parents aren't able to support me much, so I'm feeling lost right now. I want to invest my time in something that can earn me some money. I know some electronics repair, but I'm not sure it's a good career, and I'm interested in AI and machine learning, but I heard from someone on YouTube that it's not for people with no coding skills. Please clear this up for me, or suggest some financial advice.
Hello, I have a custom transformer model exported from PyTorch, and I am trying to deploy it as a Chrome extension. For greedy/beam search, what is the best practice? I am in the process of using JavaScript and ort.Tensor to create the attention mask and input sequence at each step, but realized this could be a bit slow. Thanks!
I have been through a great book called Dive into Deep Learning, but I can't understand the intuition behind attention, which means I can't fully comprehend transformers.
So where should I go if I want to fully understand attention mechanisms and transformers?
My second question is: are attention mechanisms a must in order to understand transformers?
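For context, this is the operation I'm trying to build intuition for, as I currently understand it from the book (a minimal sketch of scaled dot-product attention; shapes are just illustrative):

import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (batch, seq_len, d_model)
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d**0.5   # (batch, seq_len, seq_len): similarity of each query to each key
    weights = F.softmax(scores, dim=-1)         # each row is a distribution over positions to attend to
    return weights @ v                          # weighted average of the values

x = torch.randn(2, 10, 64)
out = scaled_dot_product_attention(x, x, x)     # self-attention: queries, keys, values all come from x
print(out.shape)                                # torch.Size([2, 10, 64])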
I see on https://huggingface.co/HuggingFaceTB/SmolLM2-135M-Instruct/tree/main/onnx:
| File Name | Size |
|---|---|
| model.onnx | 654 MB |
| model_fp16.onnx | 327 MB |
| model_q4.onnx | 200 MB |
| model_q4f16.onnx | 134 MB |
I understand that model.onnx is the fp32 model and model_fp16.onnx is the model whose weights are quantized to fp16. I don't understand the sizes of model_q4.onnx and model_q4f16.onnx:
Why is model_q4.onnx 200 MB instead of 654 MB / 4 = 163.5 MB? I thought model_q4.onnx meant that the weights are quantized to 4 bits.
Why is model_q4f16.onnx 134 MB instead of 654 MB / 4 = 163.5 MB? I thought model_q4f16.onnx meant that the weights are quantized to 4 bits and the activations are fp16, since https://llm.mlc.ai/docs/compilation/configure_quantization.html states "qAfB(_id), where A represents the number of bits for storing weights and B represents the number of bits for storing activations", and "Why do activations need more bits (16bit) than weights (8bit) in tensor flow's neural network quantization framework?" indicates that activations don't count toward the model size (understandably).
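As a sanity check on my own arithmetic (just back-of-the-envelope reasoning, not verified against these particular files): fp32 weights are 32 bits, so quantizing everything to 4 bits would give roughly 1/8 of 654 MB rather than 1/4; on top of that, block-wise quantization typically stores a scale (and often a zero point) per group of weights, and some tensors such as embeddings are commonly left at higher precision. A tiny sketch of that estimate:

# Back-of-the-envelope estimate (assumptions: 4-bit block quantization with one
# fp16 scale per block of 32 weights; some tensors left unquantized).
fp32_size_mb = 654
num_params = fp32_size_mb * 1e6 / 4             # ~163.5M values stored as fp32

naive_q4_mb = num_params * 0.5 / 1e6            # 4 bits = 0.5 bytes per weight -> ~82 MB
scale_overhead_mb = num_params / 32 * 2 / 1e6   # one fp16 scale (2 bytes) per 32-weight block -> ~10 MB

print(f"naive 4-bit estimate: {naive_q4_mb:.1f} MB")
print(f"+ block scales:       {naive_q4_mb + scale_overhead_mb:.1f} MB")
# model_q4.onnx is 200 MB, so the remaining gap is presumably tensors (embeddings,
# norms, etc.) kept at higher precision; model_q4f16.onnx presumably stores those
# leftover tensors in fp16 instead of fp32, which would explain why it is smaller (134 MB).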
I want to train a PyTorch model in bfloat16 and convert it to ONNX in bfloat16. Does ONNX Runtime support bfloat16?
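For reference, this is the export path I'm planning to try first (a sketch under the assumption that the bf16 ops survive export; whether the resulting graph actually runs presumably depends on the execution provider's bfloat16 kernel coverage):

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4)).to(torch.bfloat16)
dummy = torch.randn(1, 16, dtype=torch.bfloat16)

# Export the bfloat16 model; if this or the ONNX Runtime session complains about
# unsupported bf16 ops, a common fallback is exporting in fp32 and casting/quantizing afterwards.
torch.onnx.export(model, dummy, "model_bf16.onnx",
                  input_names=["x"], output_names=["y"], opset_version=17)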
Hello there, I was just looking at the FOR-species20K dataset and I noticed that the metadata (that is, the species, genus, and filename of each point cloud) is provided for the training set but not for the test set. Is it provided somewhere else, or not provided at all? Because if I do not know the true labels for the test set, how will I validate my model?
I was interested in making a model for this dataset and was thinking of using PointNet++ or PointNet to do so.
Traffic Sign Detection using DETR
https://debuggercafe.com/traffic-sign-detection-using-detr/
In this article, we will create a small proof of concept for traffic sign detection using the DETR object detection model on a very small dataset. We will focus entirely on the practical steps we take to get the best results.
When is somebody going to use TokenFormer in prompt-to-video, in chatbots, and in robots? https://github.com/Haiyang-W/TokenFormer
Hey DL enjoyers, I feel like LLMs have pretty much hit their limit with innovation. A lot can still be done, but nothing significant enough to completely change the LLM scene. Agents excluded. I did enjoy NLP before the whole LLM thing started. So here I ask: what's next? What can a single individual, or an individual with a research team, do to make the NLP and LLM scene more interesting? My eyes are on explainable NLP (along the lines of BertViz and Shapley values) and human-in-the-loop NLP. Redditors, show me the way.
Full disclosure: I'm going to use some of these ideas to build on my PhD idea.
JSON Mode has been one of the biggest enablers for working with Large Language Models! JSON mode is even expanding into Multimodal Foundation models! But how exactly is JSON mode achieved?
There are generally 3 paths to JSON mode:
Although most of the field has converged on constrained generation, Let Me Speak Freely? is a new paper challenging it by examining the potential tradeoffs of achieving JSON mode this way.
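For anyone who hasn't used it, here is a minimal sketch of what JSON mode looks like from the client side (using the OpenAI Python client purely as one example; how the server enforces the format, e.g. via constrained decoding, is up to the provider):

import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",
    response_format={"type": "json_object"},  # ask the API to guarantee syntactically valid JSON
    messages=[
        {"role": "system", "content": "Reply in JSON with keys 'title' and 'summary'."},
        {"role": "user", "content": "Summarize the Let Me Speak Freely? paper in one sentence."},
    ],
)

data = json.loads(response.choices[0].message.content)  # should always parse
print(data)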
I am BEYOND EXCITED to publish the 108th Weaviate Podcast with Zhi Rui Tam, the lead author of Let Me Speak Freely? A Study on the Impact of Format Restrictions on Performance of Large Language Models!
As the title of the paper suggests, although constrained generation is awesome because of its reliability, we may be sacrificing the performance of the LLM by producing our JSON with this method.
The podcast dives into how the paper's experiments identify this, along with all sorts of details about the potential and implementation of Structured Outputs. I particularly love the conversation about incredibly complex structured outputs, such as generating 10 values in a single inference, or even HTML templates.
I hope you enjoy the podcast! As always please reach out if you would like to discuss any of these ideas further!
I am training different neural network models on an image dataset (40 GB; the size may grow later). I have a laptop with an RTX 260, but training is taking too long, so how do you all handle this? If any of you use online GPUs, how much do they cost, and which is the cheapest option that gets the job done?
I've been reading about Osmo, a startup using AI to predict and recreate scents by analyzing the molecular structures of smells, which they believe could impact fields from healthcare to fragrances.
It’s fascinating to think about machines “smelling” with this level of accuracy, but I’m curious — how might this actually change the way we experience the world around us? I guess I'm struggling to see the practical or unexpected ways AI-driven scent technology could affect daily life or specific industries, so I want to hear different perspectives on this.
Hi!
I am working on a deep learning model training script with checkpointing functionality. I have a question about the order in which to set things up when resuming training from a checkpoint. The checkpoint contains the model weights and optimizer state. What I would like to know is whether there is any difference between these two options:
Thank you in advance for answering.
EDIT: I am using PyTorch
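For reference, this is the kind of resume logic I mean, written as a minimal sketch (the model, checkpoint path, and dictionary keys are placeholders for whatever my script actually uses):

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

model = torch.nn.Linear(10, 2).to(device)                   # stand-in for the real model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)   # optimizer built over the model's parameters

checkpoint = torch.load("checkpoint.pt", map_location=device)    # placeholder path
model.load_state_dict(checkpoint["model_state_dict"])            # restore weights
optimizer.load_state_dict(checkpoint["optimizer_state_dict"])    # restore optimizer state (e.g. Adam moments)
start_epoch = checkpoint["epoch"] + 1

# Saving side, for reference:
# torch.save({"epoch": epoch,
#             "model_state_dict": model.state_dict(),
#             "optimizer_state_dict": optimizer.state_dict()}, "checkpoint.pt")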
I am trying to run Nymbo/Virtual-Try-On (main branch) on my local Ubuntu server. I have set it up and installed the libraries, yet I'm getting [ONNXRuntimeError] : 7 : INVALID_PROTOBUF.
I was, however, able to run this repository successfully on Google Colab.
Error in detail:
python app.py
/home/ubuntu/VTON-env/lib/python3.10/site-packages/huggingface_hub/file_download.py:1142: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`.
warnings.warn(
The config attributes {'decay': 0.9999, 'inv_gamma': 1.0, 'min_decay': 0.0, 'optimization_step': 37000, 'power': 0.6666666666666666, 'update_after_step': 0, 'use_ema_warmup': False} were passed to UNet2DConditionModel, but are not expected and will be ignored. Please verify your config.json configuration file.
Some weights of the model checkpoint were not used when initializing UNet2DConditionModel:
['add_embedding.linear_1.bias, add_embedding.linear_1.weight, add_embedding.linear_2.bias, add_embedding.linear_2.weight']
Traceback (most recent call last):
File "/home/ubuntu/Virtual-Try-On/app.py", line 93, in <module>
parsing_model = Parsing(0)
File "/home/ubuntu/Virtual-Try-On/preprocess/humanparsing/run_parsing.py", line 20, in __init__
self.session = ort.InferenceSession(os.path.join(Path(__file__).absolute().parents[2].absolute(), 'ckpt/humanparsing/parsing_atr.onnx'),
File "/home/ubuntu/VTON-env/lib/python3.10/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 419, in __init__
self._create_inference_session(providers, provider_options, disabled_optimizers)
File "/home/ubuntu/VTON-env/lib/python3.10/site-packages/onnxruntime/capi/onnxruntime_inference_collection.py", line 452, in _create_inference_session
sess = C.InferenceSession(session_options, self._model_path, True, self._read_config_from_model)
onnxruntime.capi.onnxruntime_pybind11_state.InvalidProtobuf: [ONNXRuntimeError] : 7 : INVALID_PROTOBUF : Load model from /home/ubuntu/Virtual-Try-On/ckpt/humanparsing/parsing_atr.onnx failed:Protobuf parsing failed.
I will be really thankful if anyone can help me resolve this error.
I am working with a set of ground truth points in Angstrom units, which I need to rescale to match the size of my 3D data grid. However, when I try to scale the points back to their original Angstrom units, I notice that I'm losing some precision, and the positions no longer match exactly.
Here’s the approach I’ve implemented so far:
However, when I rescale the points back to Angstrom units, they don't match the original positions exactly, leading to a loss of precision.
Let me share my code so you guys understand better
https://github.com/TanetiSanjay/Doubts/blob/main/seg.py
Edit: The code was not readable so I uploaded it in github.
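My actual code is in the repo above, but as a stripped-down, hypothetical version of the round trip I'm describing (names, grid size, and extent are made up for illustration), the lossy step is the rounding to integer voxel indices:

import numpy as np

grid_size = 128
points_angstrom = np.array([[12.37, 45.91, 7.02]])   # hypothetical ground-truth points in Angstrom
extent = 100.0                                       # hypothetical physical extent of the grid, in Angstrom

# Forward: Angstrom -> grid coordinates (rounded to integer voxel indices)
points_grid = np.round(points_angstrom / extent * (grid_size - 1)).astype(int)

# Backward: grid coordinates -> Angstrom
points_back = points_grid / (grid_size - 1) * extent

print(points_angstrom)   # [[12.37 45.91  7.02]]
print(points_back)       # slightly different: the rounding step cannot be undone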
I am trying to work on a project that involves fetching real-time data from APIs and feeding it into an autoencoder model, but most of the APIs have extremely limited request allowances. Are there any free resources that would suit real-time streaming? If not, can you suggest alternatives that would let me stay within the API limits and still build a robust autoencoder model?
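In case it helps frame suggestions, this is roughly the pattern I had in mind: poll the API at a rate that stays under the quota and buffer responses locally, so the model can train on more data than one live request provides (the endpoint and quota below are hypothetical):

import time
import json
import requests

API_URL = "https://example.com/api/stream"   # hypothetical endpoint
REQUESTS_PER_MINUTE = 10                     # hypothetical quota

buffer_path = "samples.jsonl"
interval = 60.0 / REQUESTS_PER_MINUTE

while True:
    try:
        resp = requests.get(API_URL, timeout=10)
        resp.raise_for_status()
        # Append each response to a local buffer so past data can be replayed for training
        with open(buffer_path, "a") as f:
            f.write(json.dumps(resp.json()) + "\n")
    except requests.RequestException as e:
        print("request failed:", e)
    time.sleep(interval)                     # stay under the rate limit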
Description
I am encountering performance bottlenecks while running multi-threaded inference on high-resolution images using TensorRT. The model involves breaking the image into patches to manage GPU memory, performing inference on each patch, and then merging the results. However, the inference time per patch is still high, even when increasing the batch size. Additionally, loading multiple engines onto the GPU to parallelize the inference does not yield the expected speedup. I am seeking advice on optimizing the inference process for faster execution, either by improving batch processing or enabling better parallelism in TensorRT.
Environment
TensorRT Version: 10.5.0
GPU Type: RTX 3050 Ti 4GB
Nvidia Driver Version: 535.183.01
CUDA Version: 12.2
CUDNN Version: N/A
Operating System + Version: Ubuntu 20.04
Python Version: 3.11
Relevant Files
build_engine.py

import os
import tensorrt as trt

# min_batch, max_batch and shape (the optimization profile shapes) are assumed to be defined elsewhere.

def build_engine(onnx_file_path, engine_file_path):
    logger = trt.Logger(trt.Logger.ERROR)
    builder = trt.Builder(logger)
    network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    profile = builder.create_optimization_profile()
    config = builder.create_builder_config()
    parser = trt.OnnxParser(network, logger)

    if not os.path.exists(onnx_file_path):
        print("Failed finding ONNX file!")
        return
    print("Succeeded finding ONNX file!")

    with open(onnx_file_path, 'rb') as model:
        if not parser.parse(model.read()):
            print('Failed parsing the ONNX file')
            for error in range(parser.num_errors):
                print(parser.get_error(error))
            return
        print('Completed parsing of ONNX file')

    # Configure the input optimization profile (min / opt / max batch shapes)
    input_tensor = network.get_input(0)
    profile.set_shape(input_tensor.name,
                      (min_batch, shape[1], shape[2], shape[3]),
                      shape,
                      (max_batch, shape[1], shape[2], shape[3]))
    config.add_optimization_profile(profile)

    # Build and save the serialized engine
    engine_string = builder.build_serialized_network(network, config)
    if engine_string is None:
        print("Failed building engine!")
        return
    print("Succeeded building engine!")
    with open(engine_file_path, "wb") as f:
        f.write(engine_string)
inference.py

import numpy as np
import tensorrt as trt
from cuda import cudart

class TRTModel:
    def __init__(self, trt_path):
        self.trt_path = trt_path
        trt.init_libnvinfer_plugins(None, "")
        self.logger = trt.Logger(trt.Logger.ERROR)
        with open(self.trt_path, "rb") as f:
            engine_data = f.read()
        self.engine = trt.Runtime(self.logger).deserialize_cuda_engine(engine_data)

    def create_execution_context(self):
        return self.engine.create_execution_context()

    def process_async(self, input_data):
        # NOTE: a new stream, execution context and device buffers are created on every call
        _, stream = cudart.cudaStreamCreate()
        context = self.create_execution_context()

        input_size = input_data.nbytes
        output_size = input_data.nbytes
        input_device = cudart.cudaMallocAsync(input_size, stream)[1]
        output_device = cudart.cudaMallocAsync(output_size, stream)[1]

        input_data_np = input_data.cpu().numpy()
        cudart.cudaMemcpyAsync(input_device, input_data_np.ctypes.data, input_data.nbytes,
                               cudart.cudaMemcpyKind.cudaMemcpyHostToDevice, stream)

        context.set_tensor_address('images', int(input_device))
        context.set_tensor_address('output', int(output_device))
        context.execute_async_v3(stream_handle=int(stream))

        output_host = np.empty_like(input_data_np, dtype=np.float32)
        cudart.cudaMemcpyAsync(output_host.ctypes.data, output_device, output_host.nbytes,
                               cudart.cudaMemcpyKind.cudaMemcpyDeviceToHost, stream)
        cudart.cudaStreamSynchronize(stream)

        cudart.cudaFree(input_device)
        cudart.cudaFree(output_device)
        cudart.cudaStreamDestroy(stream)
        return output_host
Steps To Reproduce
Build the engine: use build_engine to convert an ONNX model into a TensorRT engine.
Run inference: use TRTModel to perform inference on cropped image patches.
Observed result: even when batch sizes are increased, the inference time per patch remains high, and running multiple engines for parallel inference also does not improve performance.
Profiling Results
Transfer to device: 0.48 ms
Inference time: 784.75 ms
Transfer to host: 0.67 ms
Total time for a single patch (256x256): 19-22 seconds on average
I am seeking optimization suggestions for improving multi-batch processing or multi-threaded parallel inference in TensorRT.
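One thing I suspect (not verified) is that process_async above pays a lot per call, since it creates a fresh execution context, CUDA stream, and device buffers for every patch. As a sketch of reusing them across patches, assuming the same tensor names and a fixed patch size, and reusing the TRTModel class defined above:

class TRTPatchRunner(TRTModel):
    def __init__(self, trt_path, patch_nbytes):
        super().__init__(trt_path)
        # Create the stream, context and device buffers once and reuse them for every patch
        self.stream = cudart.cudaStreamCreate()[1]
        self.context = self.engine.create_execution_context()
        self.input_device = cudart.cudaMalloc(patch_nbytes)[1]
        self.output_device = cudart.cudaMalloc(patch_nbytes)[1]
        self.context.set_tensor_address('images', int(self.input_device))
        self.context.set_tensor_address('output', int(self.output_device))

    def infer(self, input_data_np):
        # input_data_np: a numpy array whose nbytes matches patch_nbytes
        cudart.cudaMemcpyAsync(self.input_device, input_data_np.ctypes.data, input_data_np.nbytes,
                               cudart.cudaMemcpyKind.cudaMemcpyHostToDevice, self.stream)
        self.context.execute_async_v3(stream_handle=int(self.stream))
        output_host = np.empty_like(input_data_np, dtype=np.float32)
        cudart.cudaMemcpyAsync(output_host.ctypes.data, self.output_device, output_host.nbytes,
                               cudart.cudaMemcpyKind.cudaMemcpyDeviceToHost, self.stream)
        cudart.cudaStreamSynchronize(self.stream)
        return output_host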
Hey Hive mind,
I am new to deep learning and I am using deep learning (a CNN) to predict a timeseries.
I am using the model from this paper:
https://arxiv.org/pdf/2211.02024
So it has been done before and seems to work well.
However, for my data the output of the model is super weird. See images...
train/loss_r is the correlation for the training set and val/loss_r is the correlation for the validation set,
and then, for each region of interest (ROI), the predicted (blue) vs. real (orange) time series is shown.
What is also weird is that for some ROIs it reports r = 0.20 (or so), but the predicted signal (blue) is almost flat.
What am I doing wrong? Any input?
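One general property of Pearson correlation worth keeping in mind here (an observation, not a diagnosis of this particular model): correlation is invariant to the scale of the prediction, so a tiny-amplitude output that weakly tracks the target can still show r around 0.2 even though it looks flat when plotted next to the real signal. A quick illustration:

import numpy as np

rng = np.random.default_rng(0)
true_signal = np.sin(np.linspace(0, 20, 500)) + 0.5 * rng.standard_normal(500)

# A prediction that is almost flat but weakly tracks the true signal
flat_prediction = 0.01 * true_signal + 0.04 * rng.standard_normal(500)

r = np.corrcoef(true_signal, flat_prediction)[0, 1]
print(r)                                          # around 0.2, even though the prediction looks flat when plotted
print(flat_prediction.std(), true_signal.std())   # amplitudes differ by roughly 20x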
Edit: code is available here:
https://github.com/kovalalvi/beira/tree/master
Do you know of any new or old tools or libraries related to AI and deep learning, or to generative AI?
I'm currently having trouble understanding why distillation works in JEPA and BYOL. This is how I'm currently thinking about it:
There are two encoders: a teacher and a student. The teacher's weights are updated via an exponential moving average of the student's weights. So essentially a "dumb" encoder teaching a "smart" encoder?
It's not intuitive to me why distillation would even work. Hope somebody can give a good explanation!
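To make sure I'm describing the right mechanism, this is the update I mean, written as a minimal PyTorch sketch (the encoder and momentum value are placeholders): after each student gradient step, the teacher is nudged toward the student, so it ends up as a smoothed ensemble of past students rather than an independently trained network.

import copy
import torch
import torch.nn as nn

student = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 16))   # placeholder encoder
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)   # the teacher is never updated by backprop

@torch.no_grad()
def ema_update(teacher, student, momentum=0.996):
    # teacher <- momentum * teacher + (1 - momentum) * student
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(momentum).add_(s_param, alpha=1 - momentum)

# Inside the training loop, after optimizer.step() on the student:
ema_update(teacher, student)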
I was reading a textbook and found it cumbersome to highlight text in the PDF, copy it, paste it into ChatGPT, and ask queries about the pasted text. So I thought of building a project, basically an application, that lets us query an LLM simply by selecting text in the PDF. Any thoughts or guidance on where to start, or tools I can use?
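One possible starting point I'm considering (a rough sketch, not a settled design): PyMuPDF can extract the text under a selection rectangle, and that text can then be sent to an LLM API; the file path, page, rectangle, and model name below are placeholders.

import fitz                     # PyMuPDF
from openai import OpenAI

doc = fitz.open("textbook.pdf")                   # placeholder path
page = doc[41]                                    # placeholder page
selection = fitz.Rect(72, 200, 540, 320)          # placeholder rectangle for the highlighted region
selected_text = page.get_text("text", clip=selection)

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Answer questions about the provided textbook excerpt."},
        {"role": "user", "content": f"Excerpt:\n{selected_text}\n\nQuestion: Explain this passage simply."},
    ],
)
print(response.choices[0].message.content)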
I am currently using timm_3d 3D classification models to train on a simple binary classification problem with around 200 samples. With MONAI DenseNet, ResNet, and other networks, I get good train, test, and validation accuracy (above 95% balanced accuracy). But with the MONAI EfficientNet model and the VGG models from timm_3d, the loss does not decrease and accuracy stays just above 50%. I have tried different learning rates and different learning rate schedulers, but none of them work. How can I overcome this issue? Thank you.