/r/LanguageTechnology
This sub will focus on theory, careers, and applications of NLP (Natural Language Processing), which includes anything from Regex & Text Analytics to Transformers & LLMs.
A community for discussion and news related to Natural Language Processing (NLP).
Natural language processing (NLP) is a field of computer science, artificial intelligence and computational linguistics concerned with the interactions between computers and human (natural) languages, and, in particular, concerned with programming computers to fruitfully process large natural language corpora.
Hello! (TL;DR at the bottom)
I am quite new here since I stumbled upon the subreddit by chance while looking up information about a specific master's program.
I recently graduated with a bachelor's degree in (theoretical) Linguistics (phonology, morphology, syntax, semantics, sociolinguistics etc.) and I loved my major (graduated with almost a 3.9 GPA) but didn't want to rush into a master's program blindly without deciding what I would like to REALLY focus on or specialize in. I could always see myself continuing with theoretical linguistics stuff and eventually going down the 'academia' route; but realizing the network, time and luck one would need to have to secure a position in academia made me have doubts. I honestly can't stand the thought of having a PhD in linguistics just because I am passionate about the field, only to end up unemployed at the age of 30+, so I decided to venture into a different branch.
I have to be honest, I am not the most well-versed person out there when it comes to CL or NLP but I took a course focusing on computational methods in linguistics around a year ago, which fascinated me. Throughout the course, we looked at regex, text processing, n-gram language models, finite state automata etc. but besides the little bit of Python I learned for that course, I barely have any programming knowledge/experience (I also took a course focusing on data analysis with R but not sure how much that helps).
I am not pursuing any degree as of now; you can consider it something similar to a gap year. Since I want to look into CL/NLP/LT-specific programs, I think I can use my free time to gain some programming knowledge by the time application periods start; I have at least 6-8 months, after all.
I want to apply to master's programs for the upcoming academic year (2025/2026) and I have already started researching. However, not long after I started, I realized that there were quite a few programs available and they all had different names, different program content and approaches to the area of LT(?). I was overwhelmed by the sheer number of options; so, I wanted to make this post to get some advice.
I would love to hear your advice/suggestions if anyone here has completed, is still doing or has knowledge about any CL/NLP/LT master's program that would be suitable for someone with a solid foundation in theoretical linguistics but not so much in CS, coding or maths. I am mainly interested in programs in Germany (I have already looked into a few there such as Stuttgart, Potsdam, Heidelberg etc. but I don't know what I should look for when deciding which programs to apply to) but feel free to chime in if you have anything to say about any program in Europe. What are the most important things to look for when choosing programs to apply to? Which programs do you think would prepare a student the best, considering the 'fluctuating' nature of the industry?
P.S.: I assume there are a lot of people from the US on the subreddit but I am not located anywhere near, so studying in the US isn't one of my options.
TL;DR: Which CL/NLP/LT master's programs in Europe would you recommend to someone with a strong background in Linguistics (preferably in Germany)?
https://github.com/MaartenGr/BERTopic
BERTopic seems to be a popular method to interpret contextual embeddings. Here's a list of steps from their website on how it operates:
"You can swap out any of these models or even remove them entirely. The following steps are completely modular:
My question is: why not fine-tune on your documents first to get optimized embeddings, rather than directly using a pre-trained model for the embedding representations and then proceeding with the other steps?
Am I missing out on something?
Thanks
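For context on what swapping the embedding step looks like in practice, here is a minimal sketch assuming the standard BERTopic and sentence-transformers APIs; the model name and the 20-newsgroups corpus are placeholders, and the embedding model could just as well be one fine-tuned on your own corpus first.

```python
# Minimal sketch: plugging a (possibly fine-tuned) embedding model into BERTopic.
# Assumes the bertopic / sentence-transformers / scikit-learn packages; the model
# name and the example corpus are placeholders, not anything from the post above.
from sklearn.datasets import fetch_20newsgroups
from sentence_transformers import SentenceTransformer
from bertopic import BERTopic

docs = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes"))["data"]

# This could equally be a model you fine-tuned on your documents beforehand,
# e.g. SentenceTransformer("path/to/my-finetuned-model").
embedding_model = SentenceTransformer("all-MiniLM-L6-v2")

topic_model = BERTopic(embedding_model=embedding_model)
topics, probs = topic_model.fit_transform(docs)
print(topic_model.get_topic_info())
```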
Hello folks, I'm doing research on few-shot learning, conceptual transfer, and analogical reasoning in NLP models, particularly large language models. There’s been significant work on how models achieve few-shot or zero-shot capabilities, adapt to new contexts, and even demonstrate some form of analogical reasoning. However, I’m interested in exploring these phenomena from a different perspective:
How cognitively plausible are these techniques?
That is, how closely do the mechanisms underlying few-shot learning and analogical reasoning in NLP models mirror (or diverge from) human cognitive processes? I haven’t found much literature on this.
If anyone here is familiar with:
I’d love to hear from you! I’m hoping to evaluate the current state of literature on the nuanced interplay between computational approaches and human-like cognitive traits in NLP.
I'm working on a problem where I have a product name, but this product might contain dimensions, measurements and all sorts of engineering technical information.
My database is quite large, and there is absolutely no standardization for these queries, and sometimes they might be in different languages.
For example: "cork screw 7x2x 0.5lbs --in", this should be mapped to "cork screw".
With large LLMs I can easily solve this problem, but I cannot afford to run them.
Do you guys have any suggestions on how to tackle this problem, where inference is relatively fast?
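One fast, model-free option to try first is a rule-based pre-cleaning pass. The sketch below is purely illustrative: the regex patterns are rough assumptions and would need to be extended for a multilingual catalogue, but inference is essentially instant.

```python
# Rough sketch of a rule-based pre-cleaning step (regex only, no model) for
# stripping dimensions, units, and flags from product names. The patterns are
# illustrative assumptions, not a complete solution.
import re

MEASUREMENT = re.compile(
    r"""
    \b\d+(\.\d+)?\s*(x\s*\d+(\.\d+)?\s*)*         # 7x2x0.5-style dimensions
    (mm|cm|m|in|inch|ft|lbs?|kg|g|oz)?\b          # optional trailing unit
    """,
    re.IGNORECASE | re.VERBOSE,
)
FLAGS = re.compile(r"--\w+")  # things like "--in"

def clean_name(name: str) -> str:
    name = FLAGS.sub(" ", name)
    name = MEASUREMENT.sub(" ", name)
    return re.sub(r"\s+", " ", name).strip()

print(clean_name("cork screw 7x2x 0.5lbs --in"))  # -> "cork screw"
```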
Hi everyone,
I'm working on building a multilingual TTS system and am looking for high-quality open-source data in French, Spanish, and Arabic (in that order of priority). Ideally, I'd like datasets that include both text and corresponding audio, but if the audio quality is decent, I can work with audio-only data too.
Here are the specifics of what I'm looking for:
If anyone knows of any sources or projects that offer such data, I’d be extremely grateful for the pointers. Thanks in advance for any recommendations!
Hello everyone,
I'm trying to reproduce an old experiment that uses the wikitext-2 dataset, and it relies on torchtext to import it. However, it seems the link from which the dataset is downloaded is no longer working. Here's the link that's broken: https://s3.amazonaws.com/research.metamind.io/wikitext/wikitext-2-v1.zip
Here's the relevant torchtext source code for reference: https://pytorch.org/text/0.12.0/_modules/torchtext/datasets/wikitext2.html
Does anyone know an updated link or a workaround to get this dataset? Thanks!
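For anyone searching later: one workaround (an assumption on my part, not a torchtext fix) is to pull the dataset from the Hugging Face Hub instead of the dead S3 link.

```python
# Possible workaround: load wikitext-2 via the Hugging Face `datasets` library
# rather than torchtext's broken S3 download.
from datasets import load_dataset

wikitext = load_dataset("wikitext", "wikitext-2-v1")
train_lines = wikitext["train"]["text"]   # raw lines, same splits as the original release
print(len(train_lines), train_lines[10])
```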
Obviously a CS major would be ideal, but since I'm a first year applying out of stream, there is a good chance I won't get into the CS major program. Also, the CS minor would still allow me to take an ML course, a CL course, and an NLP course in my third/fourth years. Considering everything, is this possible? Is there a different minor that would be better suited to CL/NLP than Stats?
Hello!
I am currently in my final semester of my BA in Linguistics, and I really want to go into CompLing after graduating. The problem with this is that it seems impossible to get a job in the field without some sort of formal education in CS. Fortunately, though, I have taken online courses in Python and CS (CS50 courses) and am breezing through my Python for Text Processing course this semester because of it. Math is also a strong suit of mine, so courses in that area would not be a concern if I pursue another degree.
I would love to get another degree in any program that would set me up for a career, though funding is another massive issue here. As of now, it seems that the jobs I would qualify for now with just the BA in Ling are all low-paying (teaching ESL mainly), meaning I would struggle to pay for an expensive masters program. Because of this, these are the current options I have been considering, and I would appreciate insight from anyone with relevant or similar experience:
My questions for you all are:
Have any of you been in a similar position? I often see people mention that they came from Linguistics and pivoted, but I don't actually understand how that process works, how people fund it, or which of the programs I know of are actually reasonable for my circumstances.
I have seen that people claim you should just try to get a job in the industry, but how is that possible when you have no work experience in programming?
Would another Linguistics degree with just a concentration in CL be enough to actually get me jobs, or is that unrealistic?
How the HELL do people fund their master's programs to level up their income when their initial career pays much lower?? One of my biggest concerns about working elsewhere first is that I'll never be able to fund my higher education if I do wait instead of just taking loans and making more money sooner.
I don't expect anyone to provide me with a life plan or anything, but any insight you have on these things would really help since it feels like I've already messed up by getting a Linguistics degree.
I'm working on my graduation project, and my main idea is to fine-tune an LLM to summarize scientific papers. The challenge is that if my summaries end up looking exactly like the abstract, it wouldn’t add much value. So, I’m thinking it should either focus on the novel contributions of the paper or maybe summarize by section. As a user or a developer, do you have any ideas on how I can approach this?
This also seems like a query-based task since the user would send a PDF or an arXiv link along with a specific question. I don’t want it to feel like a chatbot interaction. Any guidance on how to approach this, including datasets, architectures, or general advice, would help a lot. Thanks!
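To make the "summarize by section" idea concrete, here is a minimal sketch assuming plain text has already been extracted from the PDF. The heading regex and the summarization checkpoint ("facebook/bart-large-cnn") are placeholder assumptions, not a recommendation for the final system.

```python
# Sketch of section-wise summarization over extracted paper text. The header
# pattern and model checkpoint are illustrative assumptions.
import re
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

def summarize_by_section(paper_text: str) -> dict:
    # Naive split on numbered headings like "1 Introduction", "2 Related Work" ...
    parts = re.split(r"\n(?=\d+\s+[A-Z][A-Za-z ]+\n)", paper_text)
    summaries = {}
    for part in parts:
        if not part.strip():
            continue
        title = part.strip().splitlines()[0][:80]
        body = part[:3000]  # crude cap; the tokenizer also truncates to the model limit
        summaries[title] = summarizer(
            body, max_length=120, min_length=30, truncation=True
        )[0]["summary_text"]
    return summaries
```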
Is there any way to use a single pretrained model such as BERT for both intent classification and entity extraction, rather than creating two different models for the purpose?
Since loading two models would take quite a bit of memory, I tried the Rasa framework's DIET classifier, but I need something else since I was facing dependency issues.
Also, it's extremely time-consuming to create the custom dataset for NER in BIO format; I'd appreciate some help on that as well.
Right now I'm using BERT for intent classification and a pretrained spaCy model with an entity ruler for entity extraction. Is there a better way to do it? The memory consumption for loading both models is pretty high, so I believe combining them should solve that as well.
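For reference, the general "joint model" idea (a shared encoder with two heads, which is roughly what DIET builds on) can be sketched like this. The checkpoint name, label counts, and example utterance are placeholders, not a drop-in replacement for Rasa.

```python
# Sketch of a single shared BERT encoder with two heads: sequence classification
# for intent and token classification for entities. Label counts and the
# checkpoint are placeholder assumptions.
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class JointIntentEntityModel(nn.Module):
    def __init__(self, model_name="bert-base-uncased", num_intents=5, num_entity_tags=9):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        self.intent_head = nn.Linear(hidden, num_intents)      # uses the [CLS] vector
        self.entity_head = nn.Linear(hidden, num_entity_tags)  # per-token BIO tags

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        sequence_output = out.last_hidden_state                  # (batch, seq_len, hidden)
        intent_logits = self.intent_head(sequence_output[:, 0])  # (batch, num_intents)
        entity_logits = self.entity_head(sequence_output)        # (batch, seq_len, num_tags)
        return intent_logits, entity_logits

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = JointIntentEntityModel()
batch = tokenizer(["book a flight to Paris"], return_tensors="pt")
intent_logits, entity_logits = model(batch["input_ids"], batch["attention_mask"])
```

Training sums a cross-entropy loss over intents and a token-level cross-entropy over BIO tags, so only one encoder is kept in memory.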
I am working on a project that analyzes MRI images and reduces them to numerical values such as the median, standard deviation, and contrast of the image ... Can an LLM such as GPT-4 take those data and convert them into a medical report or medical text? Can it even translate those numeric values into statements, e.g. "median = 1" meaning the tumor is spreading?
In English audio transcription, there's still a ton of issues with homophones (ex. "Greece" and "grease"). With all the characters that share pronunciation in Mandarin, do those models have the same issues? Does it rely more heavily on common compounds?
What kind of storage would you guys use for a Copilot-like RAG pipeline?
Just a vector DB for semantic/hybrid search, or is a graph DB the better choice for retrieving relevant code fragments?
Hi,
I’m looking for a service that provides an API for Whisper v3 that returns word-level confidence scores (not just word-level timestamps).
I have tried Deepgram, but their Whisper endpoint is very unstable. It sometimes takes 30s to return the JSON data for a short audio recording.
Neither Azure Speech nor OpenAI returns word-level confidence data.
Thank you for any suggestions!
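In case self-hosting is acceptable (an assumption; the post asks for a hosted API), one workaround is running Whisper locally via faster-whisper, which exposes a per-word probability alongside the timestamps. The audio path is a placeholder.

```python
# Sketch: word-level confidence from a locally run Whisper large-v3 model via
# faster-whisper. Assumes self-hosting is an option; the file path is a placeholder.
from faster_whisper import WhisperModel

model = WhisperModel("large-v3", device="cpu", compute_type="int8")
segments, _info = model.transcribe("recording.wav", word_timestamps=True)

for segment in segments:
    for word in segment.words:
        print(f"{word.word!r}  start={word.start:.2f}  end={word.end:.2f}  conf={word.probability:.2f}")
```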
Hey Reddit!
We’re working on something that we think could make model discovery a LOT easier for everyone: a model recommendation system where you can just type what you're working on in plain English, and it'll suggest the best AI models for your project. 🎉
The main idea is that you can literally describe your project in natural language, like:
And based on that input, the system will recommend the best models for the job! No deep diving into technical specs, no complex filters—just solid recommendations based on what you need.
Alongside the model suggestions, we’re adding features to make the platform super user-friendly:
We want this platform to actually solve problems for people in the AI/ML space, and that’s where you come in! 🙌
We’re building a tool where you can just describe your project in plain English, and it’ll recommend the best AI models for you. No need for complex searches—just type what you need! Looking for your feedback on what you'd want to see or any features you think are missing from current platforms like Hugging Face.
We'd love to hear your thoughts and ideas! What would make this platform super useful for you? Let us know what you think could improve the model discovery process, or what’s lacking in existing platforms!
Thanks in advance, Reddit! 😊
Hi, I'm looking for jobs related to language technologies and found a hiring company called Anzu Global. Most jobs posted there are contract positions. I googled it and found a rating of 4.4, but I still suspect it might be a scam site, because the only way to submit an application is to send a Word resume to an email address. The website says it mainly hires people with AI, NLP, ML, or CL backgrounds. Does anyone have any experience with this company? Thanks
Are you interested in fine tuning LLMs? Do you want to participate in mental health research using AI? Would you like to win some money doing it?
I have been working on an open source tool called Harmony which helps researchers combine datasets in psychology and social sciences.
We have noticed for a while that the similarity score that Harmony gives back could be improved. For example, items to do with "sleep" are often grouped together (because of the data that off-the-shelf models such as SentenceTransformers are trained on), while a psychologist would consider them to be different.
We are running a competition on the online platform DOXA AI where you can win up to 500 GBP in vouchers (1st place prize). Check it out here: https://harmonydata.ac.uk/doxa/
We *provide training data*, and your code will be evaluated on submission on the platform.
## How to get started?
Create an account on DOXA AI https://doxaai.com/competition/harmony-matching and run the example notebook. This will download the training data.
If you would like some tips on how to train an LLM, I recommend this Hugging Face tutorial: https://huggingface.co/docs/transformers/en/training
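As a minimal illustration of the off-the-shelf baseline the competition aims to beat (not Harmony's actual scoring code): cosine similarity between two questionnaire items with a stock SentenceTransformer. The model name and example items are placeholders.

```python
# Illustration of the stock-embedding baseline: cosine similarity between two
# questionnaire items. Model name and items are placeholder assumptions.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
items = [
    "I have trouble falling asleep.",
    "I wake up several times during the night.",
]
embeddings = model.encode(items, convert_to_tensor=True)
print(util.cos_sim(embeddings[0], embeddings[1]).item())
```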
Hello Everyone ,
I work at a B2B startup that connects pharmacies with sellers (we give them the best discount for each product in our marketplace). We have a list of medicines in our marketplace (40,000+ products), and each seller sends a list of their products; we match the sent product names with the corresponding products in our marketplace.
Each seller sends a sheet with name and price, and we match it and integrate it with the marketplace.
The challenges we face are:
Seller names are mostly misspelled, with a lot of variation and noise.
Seller names often come with extra words added on top of the product name that are not part of the product name itself.
We built a system using TF-IDF + cosine similarity and got an accuracy of 80% (it does not capture the meaning of the words well and generates bad results on small sheets).
Because correcting wrong matches from our model costs us money and time (we have a group of people who review manually), we want to achieve an accuracy of over 98%.
We have a dataset of previously correct matches, containing the seller's input product name and our match,
as well as our unique marketplace catalogue.
Can anyone guide me to possible solutions using a neural network that we feed with seller inputs and target matches to generalize the matching process, or a pre-trained model that we can fine-tune with our data to achieve high accuracy?
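One direction worth sketching (an illustration, not a guaranteed recipe): fine-tune a multilingual sentence-embedding model on your historical (noisy seller name, clean catalogue name) pairs with a contrastive loss, then match new names by nearest-neighbour search over the catalogue embeddings. The base model and the example pairs below are placeholder assumptions.

```python
# Sketch: contrastive fine-tuning of a sentence-embedding model on historical
# seller-name -> catalogue-name matches. Base model and pairs are placeholders.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

# Each example pairs a noisy seller name with the clean marketplace name.
train_examples = [
    InputExample(texts=["paracetamoll 500mg tabs x20", "Paracetamol 500 mg - 20 tablets"]),
    InputExample(texts=["amoxicilin 1g box", "Amoxicillin 1 g - 12 capsules"]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=100)

# At inference: encode the 40k catalogue once, then run nearest-neighbour search
# (e.g. util.semantic_search or FAISS) against each row of the seller's sheet.
```

Inference stays fast because the catalogue embeddings are precomputed and each incoming name only needs one encoder forward pass plus a vector lookup.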
I don't know why this idea (which is cool) never caught on, but I'm wondering if we could build an open-source model for it, e.g. a fine-tuned LLM, with perhaps a small model that tries to distinguish between when the user is dictating "text content" and when they are speaking "editing commands", and then applies the edits.
A "basic prototype" shouldn't be too hard, but could be quite helpful
I have a background in a Computer Science + Linguistics BS, and a couple years of experience in industry as an AI software engineer (mostly implementing LLMs with python for chatbots/topic modeling/insights).
I'm currently doing a part time master's degree and in a class that's revisiting all the concepts that I learned in undergrad and never used in my career.
You know, Naive Bayes, Convolutional Neural Networks, HMMs/Viterbi, N-grams, Logistic Regression, etc.
I get that there is value in having "foundational knowledge" of how things used to be done, but the majority of my class is covering concepts that I learned, and then later forgot because I never used them in my career. And now I'm working fulltime in AI, taking an AI class to get better at my job, only to learn concepts that I already know I won't use.
From what I've read in literature, and what I've experienced, system prompts and/or finetuned LLMs kind of beat traditional models at nearly all tasks. And even if there were cases where they didn't, LLMs eliminate the huge hurdle in industry of finding time/resources to make a quality training data set.
I won't pretend that I'm senior enough to know everything, or that I have enough experience to invalidate the relevance of PhDs with far more knowledge than me. So please, if anybody can make a point about how any of these techniques still matter, please let me know. It'd really help motivate me to learn them more in depth and maybe apply them to my work.
Let's say I have documents that are relatively similar to one another, and I need to process them sentence by sentence, or in windows of sentences, for a similarity search task. How do I fine-tune an embedder like BAAI bge-m3 or similar models so that it learns the language of the documents' specific domain? Any hints? Can I use the plain text without any kind of supervised learning?
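One fully unsupervised option is TSDAE-style domain adaptation, as described in the sentence-transformers documentation: the encoder is trained to reconstruct noised versions of your plain domain sentences, with no labels. The sketch below uses a generic base checkpoint as a placeholder; whether the recipe transfers cleanly to bge-m3 specifically would need to be verified.

```python
# Sketch: TSDAE-style unsupervised domain adaptation with sentence-transformers.
# The base checkpoint and the sentences are placeholder assumptions.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, models, datasets, losses

model_name = "bert-base-uncased"  # placeholder base encoder
word_embedding_model = models.Transformer(model_name)
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(), "cls")
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])

train_sentences = [
    "plain, unlabeled sentence from your domain documents",
    "another unlabeled in-domain sentence",
]
train_dataset = datasets.DenoisingAutoEncoderDataset(train_sentences)
train_dataloader = DataLoader(train_dataset, batch_size=8, shuffle=True)
train_loss = losses.DenoisingAutoEncoderLoss(
    model, decoder_name_or_path=model_name, tie_encoder_decoder=True
)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    weight_decay=0,
    scheduler="constantlr",
)
```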
I'm trying to predict salary from job postings. Sometimes, a job posting will have a salary mentioned (40/hr, 3000 a month.. etc). My colleague mentioned I probably should mask those in the text to prevent leakage.
While I agree, I'm not completely convinced.
I'm modelling with a CNN/LSTM model based on word embeddings, with a dictionary size of 40000. Because I assume I will only very rarely find a salary that I have a token for in my dictionary, I haven't masked my input data so far.
I am also on the fence whether the LSTM would learn the relationship at all on tokens that do make it into its vocabulary. It might "know" a number is a number and that the number is closely related to other numbers near it, but I'm intuitively unable to say how this would influence the regression.
Lastly, the real life use case for this would be to simply predict a salary based on the data that we get. If a number is present in the text and we can predict better because of that, it's a good thing.
Before I spend a day trying to figure this out, can anyone tell me if this a huge problem?
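If you do decide to test the masking route, a quick regex pass is probably enough to see whether leakage matters before committing a day to it. The patterns below are rough assumptions, not an exhaustive salary grammar.

```python
# Quick-and-dirty salary masking to test for leakage. Patterns are illustrative
# assumptions (currency-prefixed amounts, or "<number> per/a <period>" phrases).
import re

SALARY = re.compile(
    r"""
    (?:[\$€£]\s?\d[\d.,]*)                                                   # $40,000 / €3.000
    |
    (?:\b\d[\d.,]*\s*(?:/|per\s+|an?\s+)\s*(?:hr|hour|week|month|year|annum)\b)  # 40/hr, 3000 a month
    """,
    re.IGNORECASE | re.VERBOSE,
)

def mask_salaries(text: str) -> str:
    return SALARY.sub("<SALARY>", text)

print(mask_salaries("Pay is 40/hr or about 3000 a month."))
# -> "Pay is <SALARY> or about <SALARY>."
```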
Is it possible to find a job in the NLP industry with a PhD that focuses more on the linguistic side of NLP?
I’m still an MSc student in NLP, coming from a BA in Linguistics, and at the moment, I’m studying more STEM-related subjects like linear algebra, machine learning, etc. However, my university focuses both on very applied, engineering-oriented research (such as NLP and computer vision, and I have several courses in this area) as well as more linguistically oriented research, like:
-“how parsing is easier in left-branching languages, so English should ideally be written in reverse”
-the performance of transformer models on functional words.
When I enrolled, I chose all the more technical courses with a strong ML foundation, but I’m starting to think that, as a linguist, I actually enjoy the more linguistic side of things. I was wondering, though, how useful such research could be, whether it only serves an academic purpose or if it can also have value outside of academia.
I’m unsure if I want to stay in academia or not, so I’d like to pursue a specialization that could keep both doors open for me.
I’m in the first year of an MSc in Computational Linguistics/NLP and I come from a BA in Languages and Linguistics.
Right from the start, I’ve been struggling with the courses, even before studying actual NLP. At the moment, I’m mainly doing linear algebra and programming, and I feel so frustrated after every class.
I see that many of my classmates are also having difficulties, but I feel especially stupid, particularly when it comes to programming. I missed half of the course (due to medical reasons), but I had already taken a course on Codecademy and thought it wouldn’t be that hard. In reality, I’m not understanding anything about programming anymore, and we’re just doing beginner stuff, mainly working with regular expressions.
It feels so ridiculous to be struggling with programming at this level in a master’s program for ML and NLP, especially when there are so many other master’s students my age who are much better at it. And I wonder how I could ever work in this field with such a low level of programming (and computer science in general). I’ve never been a tech enthusiast, and honestly, I don’t know how to use computers as well as many others who are much more knowledgeable (I’m talking about basic things like RAM, processors, and how to tinker with them).
I wonder how someone like me, who doesn’t even know how to use a computer well, can work with ML and NLP-related tasks.
Has anyone had a similar experience, maybe someone who is now working or doing research in NLP after coming from a humanities-linguistics background? How did you find it, was it tough? Does it even make sense for a linguist to pursue this field of study?
I've been assigned the task of building a GPT for our database (PostgreSQL), where all the info is stored (i.e., all datasets live in this PostgreSQL instance) and which contains millions of data points.
After we fine-tune a model (say we fine-tune GPT-4), what happens when, in another three to four years, OpenAI releases more advanced models, e.g. GPT-8? Should we re-train our fine-tuned model to improve its accuracy, precision, and so on?
How can I build a chatbot for this kind of situation ?
Would also appreciate, if you could post a link or title of the research papers to read !!
I'd like to create a model for intent classification and entity extraction. The intent part isn't an issue, but I'm having trouble with entity extraction. I have some custom entities, such as group_name-ax111, and I want to fine-tune the model. I’ve tried using the Rasa framework, and the DIET classifier worked well, but I can't import the NLP model due to conflicting dependencies.
I’ve also explored Flair, NeMo, SpaCy, and NLTK, but I want the NER model to have contextual understanding. I considered using a traditional model, but I’m struggling to create datasets since, in Rasa, I could easily add entities under the lookup table. Is there any other familiar framework or alternative way to create the dataset for NER more easily?
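If the pain point is generating BIO data from a lookup table, a small conversion helper can do most of the work. The sketch below makes simplifying assumptions (whitespace tokenization, exact single-token entity values); multi-token entities would also need I- tags.

```python
# Helper sketch: turn a lookup-table style entity list into BIO tags, instead of
# annotating by hand. Whitespace tokens and single-token entities are assumptions.
def to_bio(tokens, entity_lookup):
    """entity_lookup maps entity value -> label, e.g. {"ax111": "group_name"}."""
    tags = []
    for token in tokens:
        label = entity_lookup.get(token.lower())
        tags.append(f"B-{label}" if label else "O")
    return tags

tokens = "add user to group ax111 please".split()
print(to_bio(tokens, {"ax111": "group_name"}))
# ['O', 'O', 'O', 'O', 'B-group_name', 'O']
```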
I am currently trying to build a model that can read the emotional aspect of a message. The idea is to find the feelings behind a message through the language used. To do this, I figured an LLM would work best, as there can be a lot of nuance in the sentences that might otherwise go unnoticed. However, a major problem I ran into is that many of the data repositories out there do not focus on the emotional aspect. The NLTK movie reviews corpus only has positive/negative labels. I did find the crowd-sourced NRC Emotion Lexicon, which contains the data of interest, but it is all unigrams rather than sentences.
My first instinct was to use existing tools like the NRCLex module to map the lexicon onto the movie reviews data, but I quickly found that NRCLex really just tallies the non-stopwords present ("not happy" == "happy", since "not" is not tallied).
So now I am looking to extend NRCLex with POS-tag data about the adjacent words. However, this is only half of the problem, as adverbs and adjectives differ in how they modify the meaning of a word: "very happy" and "not happy" both change the meaning of "happy", where "not" flips the meaning and "very" changes the magnitude. I need to know the spin of the word before I can implement a modifier for the emotional data that outputs the correct response.
All of this is in the effort to enrich the movie reviews with emotional data, so I can build an LLM that quantifies the emotional information found in a text.
So right now I am trying to figure out how to generate the enhance/invert information for the adverbs and adjectives. Sentiment analysis won't work, as words like "not" and "none" have no sentiment, and that isn't really the type of data that can be used for inverting a word's meaning. I thought about using it for adverbs, since words like "smartly" do have sentiment, but that only addresses the enhance side of the issue.
Is there a data repository that contains this type of data? Does what I'm thinking make sense? Is there an easier method I may be missing?
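To make the enhance/invert idea concrete, here is a toy valence-shifter sketch: a small table of negators and intensifiers applied to per-word lexicon scores, where negators flip the sign and intensifiers scale it. The word lists and scores are illustrative assumptions, not the NRC lexicon itself.

```python
# Toy valence-shifter sketch: negators flip the running modifier, intensifiers
# scale it, and the modifier is applied to the next lexicon word. All values
# here are illustrative assumptions.
NEGATORS = {"not", "no", "never", "none"}
INTENSIFIERS = {"very": 1.5, "extremely": 2.0, "slightly": 0.5}

def score_emotion(tokens, lexicon):
    """lexicon maps word -> emotion score, e.g. {"happy": 1.0, "sad": -1.0}."""
    total, modifier = 0.0, 1.0
    for token in tokens:
        t = token.lower()
        if t in NEGATORS:
            modifier *= -1.0
        elif t in INTENSIFIERS:
            modifier *= INTENSIFIERS[t]
        elif t in lexicon:
            total += modifier * lexicon[t]
            modifier = 1.0  # reset after hitting a content word
    return total

lex = {"happy": 1.0}
print(score_emotion("not happy".split(), lex))   # -1.0
print(score_emotion("very happy".split(), lex))  #  1.5
```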
Overview
I am working as a research intern with a professor at my university on Machine Translation, I have collected a decent sized text corpus (around 10 GB). Now, my professor has instructed me to find text quality metrics for the data.
Some details about the dataset
First, let me explain how the data is stored and what format it's in. I have stored all the text data in Parquet files (which load as dataframes), with each row containing the text data. The data can consist of a single sentence, an article, or just a paragraph, as I have collected it from various sources such as Hugging Face, scraped articles, etc.
This is the question
What text quality metrics should I find that will help me understand the data better and guide me in the right direction to ultimately improve my machine translation model?
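As a starting point (not an exhaustive metric list), basic per-row statistics over the Parquet shards already say a lot: length distributions, exact-duplicate rate, and empty rows. The file path and the column name "text" below are assumptions about how the corpus is stored.

```python
# Starting-point sketch: basic text-quality statistics over one Parquet shard
# with pandas. File path and the "text" column name are placeholder assumptions.
import pandas as pd

df = pd.read_parquet("corpus_shard_0.parquet")

df["n_chars"] = df["text"].str.len()
df["n_tokens"] = df["text"].str.split().str.len()
df["dup"] = df["text"].duplicated()

print(df[["n_chars", "n_tokens"]].describe())  # length distribution
print("duplicate rate:", df["dup"].mean())     # share of exact duplicates
print("empty rows:", (df["n_chars"] == 0).sum())
```

Language identification, character-level noise ratios, and n-gram repetition are natural next metrics once these basics are in place, but the above is enough to spot the worst sources quickly.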