/r/LanguageTechnology
This sub will focus on theory, careers, and applications of NLP (Natural Language Processing), which includes anything from Regex & Text Analytics to Transformers & LLMs.
A community for discussion and news related to Natural Language Processing (NLP).
Natural language processing (NLP) is a field of computer science, artificial intelligence and computational linguistics concerned with the interactions between computers and human (natural) languages, and, in particular, concerned with programming computers to fruitfully process large natural language corpora.
I need some advice, and maybe this sub can help. I'm a 24-year-old Brazilian with an undergrad degree in Linguistics and Literature from a Brazilian university. My thesis involved NLP with LLMs.
I'm planning to apply for a master's program in Europe. I want to keep studying NLP and, preferably, get a job in this field instead of following an academic path.
I found many Computational Linguistics master's programs, some NLP ones focused on AI, and some AI ones focused on NLP that accept Linguistics undergrads.
What should I look for when deciding between the master's programs I found in the area?
If my question is too vague, please let me know what is missing and I'll provide any information needed. I'd appreciate any help.
So I am implementing a feature that automatically extracts information from a document using pre-trained LLMs (specifically the recent Llama 3.2 3B models). The two main things I want to extract are the title of the document and a list of the names mentioned in it. Basically, this is for a document management system, so having those two pieces of information automatically extracted makes organization easier.
The system should in theory be very simple. It is basically just: Document Text + Prompt -> LLM -> Extracted data. The extracted data would be either the title or an empty string if no title could be identified. The same goes for the list of names: a JSON array of names, or an empty array if no names are identified.
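A minimal sketch of the last step of that pipeline, assuming the model is prompted to reply with a JSON object containing a `title` string and a `names` array. The key names and the `parse_extraction` helper are hypothetical, not part of any library. Anything that fails to parse falls back to the empty values described above.

```python
import json


def parse_extraction(raw_reply: str) -> tuple[str, list[str]]:
    """Turn the raw LLM reply into (title, names), falling back to empty values."""
    try:
        data = json.loads(raw_reply)
    except json.JSONDecodeError:
        # The model did not return valid JSON: treat it as "nothing extracted".
        return "", []
    if not isinstance(data, dict):
        return "", []

    # "title" and "names" are assumed key names chosen for this sketch.
    title = data.get("title")
    names = data.get("names")

    title = title.strip() if isinstance(title, str) else ""
    if not isinstance(names, list):
        names = []
    # Keep only non-empty string entries.
    names = [n.strip() for n in names if isinstance(n, str) and n.strip()]
    return title, names
```

Normalizing malformed replies to the empty values keeps the downstream evaluation simple, since every document ends up with exactly one title string and one list of names.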
Since what I am trying to extract is the title and a list of names involved I am planning to just process the first 3-5 pages (most of the documents are just 1-3 pages, so it really does not matter), which means I think it should fit within a small context window. I have tested this manually through the chat interface of Open WebUI and it seems to work quite well.
Now what I am struggling with is how to evaluate this feature and whether it counts as Named Entity Recognition. If not, what would it be categorized as (so I can do further research)? What I'm planning to use is a confusion matrix and the related metrics: Accuracy, Recall, Precision, and F-Measure (F1).
I'm really sorry I was going to explain my confusion further but I am struggling to write a coherent explanation 😅
Okay, so my confusion is about accuracy. All the resources I've read about evaluating NER or Information Retrieval say that accuracy isn't useful because of class imbalance: the negative class usually makes up the large majority, so the number of true negatives pushes accuracy very high in a way that isn't informative. At least this is how I understand it so far.
Now in my case, a True Positive would be extracting the real title, a True Negative would be extracting no title because there isn't one, a False Positive would be extracting a title incorrectly, and a False Negative would be extracting no title even though there is one.
But in my case I think there isn't a class imbalance? Getting a True Positive is just as important as getting a True Negative, and thus accuracy would be a valid metric? But I think that highlights a difference between this kind of Information Extraction and Named Entity Recognition/Information Retrieval, which makes me unsure whether my task fits those categories. Does that make sense?
So in the information extraction I'm doing, finding and extracting a title (True Positive) and not finding a title and thus returning an empty string (True Negative) are both important outputs, so I think accuracy is a valid way to evaluate the feature.
I think in a way extraction is a step you do after recognition. While doing NER you go through every word in a document and label them as an entity or not, so the output of that is a list of those words with a label for each. Now with extraction, you're taking that list and filtering it by ones labeled by a specific class and then returning those words/entities.
What this means is that the positive and negative classes are different. From what I understand, in NER the positive class is a token that is recognized as an entity, while the negative class is a token that is not. But in extraction, the positive class is when something was found and extracted, and the negative class is when it was not found and thus nothing was extracted.
Honestly I don't know if this makes any sense, I've been trying to wrap my head around this since noon and it is midnight now lol
Here I made a document that shows how I imagine Named Entity Recognition, Text Classification, and my method would work: https://docs.google.com/document/d/e/2PACX-1vTfgySSyn52eEmkYrVEAQt8bp3ZbDRFf_ry1xDBVF77s0DetWr1mSjN9UPGpYnMc6HgfitpZ3Uye5gq/pub
Also, one thing I haven't mentioned is that this is for my final project at my university. I'm working with one of the organizations at my university to use their software as a case study for implementing a feature using an LLM. So for the report I need proper evaluations and proper references/sources for everything. That is why I'm making this post: I'm trying to figure out what my method would be classified as so I can find more related literature/books.
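The title evaluation described in this post can be sketched as follows. This is only an illustration under stated assumptions: it uses the post's own definitions of True/False Positive/Negative, treats a title as correct only on an exact match after trimming and lowercasing (an assumption, a real evaluation might use a softer match), and the function names are hypothetical.

```python
from collections import Counter


def classify(gold: str, predicted: str) -> str:
    """Label one document as TP, TN, FP or FN using the definitions from the post."""
    gold, predicted = gold.strip().lower(), predicted.strip().lower()
    if predicted == "":
        # Nothing extracted: correct only if the document really has no title.
        return "TN" if gold == "" else "FN"
    # Something extracted: correct only if it matches the real title.
    return "TP" if predicted == gold else "FP"


def title_metrics(pairs):
    """pairs is an iterable of (gold_title, predicted_title) tuples."""
    counts = Counter(classify(gold, pred) for gold, pred in pairs)
    tp, tn, fp, fn = (counts[k] for k in ("TP", "TN", "FP", "FN"))
    total = tp + tn + fp + fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical gold/predicted pairs, one of each outcome.
example = [
    ("Annual Report", "Annual Report"),  # TP
    ("", ""),                            # TN
    ("Memo", "Meeting Notes"),           # FP (a title was extracted, but the wrong one)
    ("Budget", ""),                      # FN
]
print(title_metrics(example))
```

Reporting accuracy next to precision, recall and F1 makes it easy to check whether the share of documents without a title is skewing the accuracy figure.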
Hello all - this problem has been bothering me for a long time. I don't think there is a quick and easy answer, but I thought I may as well ask the experts. I have to deduplicate a dataset containing merchant names. I've cleaned the data to a good extent and achieved a reasonably standardized format for the merchant names (though it's still not perfect). For example:
| Uncleaned name | Cleaned name |
|---|---|
| Adidas International Trading Ag Rapresentante | Adidas Ag Rapresentante |
| Adidas International Trading Ag C 0 Rappresentante | Adidas Ag Rapresentante |
| Adidas Argentina S A Cuit 30685140221 | Adidas Argentina Cuit |
| Adidas Argentina Sa Cuyo | Adidas Argentina Cuit |
| Adidas International Trading Bv Warehouse Adc | Adidas Bv Warehouse |
| Adidas International Trading Bv Warehouse Adcse | Adidas Bv Warehouse |
I want to build a model that, given an uncleaned name, outputs the cleaned version. However, the problem I'm facing with RNNs and CNNs is that when the model encounters an out-of-vocabulary (OOV) term, the predictions are extremely poor. I want the model to learn the cleaning and clustering patterns rather than embedding representations of the training data. My dataset is large, with around half a million observations.
I considered building a Named Entity Recognition (NER) model, but it would be difficult to annotate representative data due to the significant geographical variation in the merchant names. FastText isn't ideal for entity recognition in this case, so I'm currently using Sentence-BERT.
I'm looking for a robust model that can generalise well to other similar datasets, using transfer learning. Any ideas on how to approach this?
Hello all - this problem has been bothering me for a long time. I don't think there is a quick and easy answer, but I thought I may as well ask the experts.
In public sector research there are often massive spreadsheets with proper nouns taking up one of the columns. These are usually public entities, companies, or people. Much of the time these are free-text entries.
This means that for proper analysis one needs to standardise. Whilst fuzzy matching can take you some of the way, it's not designed for this kind of use case and has limitations. It can't deal with abbreviations, and it often struggles with different word orders, etc.
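To make those limitations concrete, here is a small sketch using Python's standard-library `difflib`. Sorting the tokens before comparing (a common trick, not something from the original post) removes the word-order problem, but abbreviations still score poorly. The entity names are made up for illustration.

```python
from difflib import SequenceMatcher


def plain_ratio(a: str, b: str) -> float:
    """Character-level similarity of the two strings as written."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def token_sort_ratio(a: str, b: str) -> float:
    """Similarity after sorting the words, so word order no longer matters."""
    def sort_tokens(s: str) -> str:
        return " ".join(sorted(s.lower().split()))

    return SequenceMatcher(None, sort_tokens(a), sort_tokens(b)).ratio()


# Same words in a different order: token sorting makes them identical.
print(plain_ratio("Leeds City Council", "City Council Leeds"))
print(token_sort_ratio("Leeds City Council", "City Council Leeds"))

# An abbreviation: neither variant helps, the score stays low.
print(token_sort_ratio("Department for Education", "DfE"))
```

Abbreviations therefore need a separate step, for example a lookup table of known short forms, before any string similarity is applied.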
Brute forcing with LLMs is one way. The most thorough approach I think I've got to is something like:
But this seems so messy! I was just hoping I'd missed something, or that someone has other advice!
Thanks so much
How impossible is it for a humanities student (specifically English) to get a job in the world of computational linguistics?
To give you some background: I graduated with a degree in English Studies in 2021, and since then I have not known how to fit my studies into a real job without having to be an English teacher. A year ago I found an accredited UDIMA (Universidad a Distancia de Madrid) course on Natural Language Processing at a school aimed at humanities profiles (philology, translation, editing, proofreading, etc.) to introduce them to the world of NLP. I understand that the course serves as a foundation and that from there I would have to continue studying on my own. This course also gives the option of doing an internship at a company, so I could at least get some experience in the sector. The problem is that I am still trying to understand what Natural Language Processing is and why we need it, and from what I have seen there is a lot of statistics and mathematics, which I have never been good at. It is quite a leap, going from analyzing old texts to programming. I am 27 years old and I feel like I am running out of time. I do not know if this field is too saturated or if (especially in Spain) profiles like mine are needed: people with a humanities background who are training to acquire technical skills.
I ask for help from people who have followed a similar path to mine or directly from people who are working in this field and can share with me their opinion and perspective on all this.
Thank you very much in advance.
I have an upcoming onsite interview for a Language Engineer position at Amazon. I'm trying to get a sense of what kinds of NLP/Linguistic concepts they might ask about during the interview (aside from the behavioral questions and leadership principles). Ling is obviously very broad, so I was hoping for some suggestions on what specifically to focus on reviewing. I've searched for older posts on Reddit, but the few I found on this are several years old, so I was hoping to get more recent info. Can anyone who has some insights share their advice?
Thanks!
Hello, I would appreciate any answers! I'm currently a PhD student in a language department with a focus on linguistics. I have an MA in the same field as well. However, I want to try applying to master's programs in computational linguistics. What are my chances? Is it even possible after what is basically an arts major?
I need a dataset from IEEE Dataport. My institution does not have a subscription. If anyone is willing to share it, please let me know and I will send you the link.
Hello everyone. I have scraped forum posts by adolescents in which they talk about their emotional problems. I want to extract (effect/emotion, cause) pairs. For example, "I am sad because I was bullied at school" should return something like "sad, bullied". This is not the exact format I expect, by the way. However, keep in mind that I don't have annotated data. How can I go forward with this in an unsupervised manner? Many thanks!
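One annotation-free starting point is a rule-based baseline built on explicit causal connectives. The sketch below is only that, a baseline under stated assumptions: the connective list is a small hypothetical sample, it returns whole clauses rather than single words like "sad" and "bullied", and it misses any cause that is not signalled by one of these connectives.

```python
import re

# Hypothetical, deliberately small list of causal connectives.
CAUSAL_PATTERN = re.compile(
    r"(?P<effect>.+?)\s+(?:because of|because|due to)\s+(?P<cause>.+)",
    re.IGNORECASE,
)


def extract_pairs(text: str) -> list[tuple[str, str]]:
    """Return (effect, cause) clause pairs found via explicit connectives."""
    pairs = []
    # Naive sentence split on ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        match = CAUSAL_PATTERN.search(sentence.rstrip(".!?"))
        if match:
            pairs.append((match.group("effect").strip(), match.group("cause").strip()))
    return pairs


print(extract_pairs("I am sad because I was bullied at school"))
```

The extracted clauses could then be reduced to keywords in a second step, or used as weak labels to train a model later.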
What's a good translator app that doesn't speak out loud and just shows the translation as text when someone speaks? Working offline would be a bonus. Google Translate speaks out loud, so I'm looking for alternative apps based on your suggestions. Let me know in the comments, please.
I'm a little skeptical that this exists, but is there something like a pre-trained sentence transformer that generates embeddings which carry information about sentiment?
I am trying to build an efficient algorithm for finding word groups within a corpus made of online posts but the various methods I have tried have caveats in different aspects making this a rather difficult nut to crack.
to give a snippet of the data, here are some phrases that can be found in the dataset
Japan has lots of fun environments to visit
The best shows come from Nippon
Nihon is where again
Do you watch anime
jap animation is taking over entertainment
japanese animation is more serious than cartoons
In these,
Japan = Nippon = Nihon
Anime = Jap Animation = Japanese Animation
I want to know what conversational topics are being discussed within the corpus. My first approach was to tokenize everything and count. This did OK, but common words that are not on the stop-word list quickly rose above the more meaningful words and phrases.
My other attempts ran the same calculations on n-grams, phrases, and highly processed sentences (lemmatized, etc.), and they all usually ran into similar trouble.
One potential solution I have thought of is to identify these overlapping words and combine them into word groups. The counts would then be tracked per word group, which should in theory make the topics in question more visible.
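The word-group idea can be sketched with the phrases and groupings listed in this post. The groups are written by hand here, which is exactly the laborious part; discovering them automatically is a separate problem. The variable and function names are hypothetical.

```python
import re
from collections import Counter

# Word groups taken from the examples in the post, keyed by a canonical label.
WORD_GROUPS = {
    "japan": ["japan", "nippon", "nihon"],
    "anime": ["anime", "jap animation", "japanese animation"],
}
ALIAS_TO_GROUP = {
    alias: group for group, aliases in WORD_GROUPS.items() for alias in aliases
}
# Longest aliases first, so "japanese animation" wins over shorter overlaps.
ALIAS_PATTERN = re.compile(
    r"\b("
    + "|".join(re.escape(a) for a in sorted(ALIAS_TO_GROUP, key=len, reverse=True))
    + r")\b"
)

POSTS = [
    "Japan has lots of fun environments to visit",
    "The best shows come from Nippon",
    "Nihon is where again",
    "Do you watch anime",
    "jap animation is taking over entertainment",
    "japanese animation is more serious than cartoons",
]


def count_groups(posts):
    """Count mentions per word group instead of per surface form."""
    counts = Counter()
    for post in posts:
        for alias in ALIAS_PATTERN.findall(post.lower()):
            counts[ALIAS_TO_GROUP[alias]] += 1
    return counts


print(count_groups(POSTS))
```

Counting at the group level means three spellings of the same topic add up to one count instead of competing with each other.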
However this is quite laborious as generating these groupings requires a lot of similarity calculations.
I have thought about using UMAP to convert the embeddings into coordinates and plotting them on a graph, which would help in finding similar words. This paper used a methodology similar to the one I am trying to implement. While implementing it, though, I have run into some issues and am now stuck.
Reducing the 768-dimensional embeddings to 3 dimensions feels random, as words that should be next to each other (tested with cosine similarity) usually end up on opposite sides of the figure.
Is there something I am missing?
I'm looking to get into NLP and computational linguistics. What would be a good framework for starting out with Python?
Came across this paper and GitHub project called Precision Knowledge Editing (PKE), and it seemed like something worth sharing here to get others’ thoughts. The idea is to reduce toxicity in large language models by identifying specific parts of the model (they call them "toxic hotspots") and tweaking them without breaking the model's overall performance.
Here’s the paper: https://arxiv.org/pdf/2410.03772
And the GitHub: https://github.com/HydroXai/Enhancing-Safety-in-Large-Language-Models
I’m curious what others think about this kind of approach. Is focusing on specific neurons/layers in a model a good way to address toxicity, or are there bigger trade-offs I’m missing? Would something like this scale to larger, more complex models?
Haven't tried it out too much yet myself but just been getting more into AI Safety recently. Would love to hear any thoughts or critiques from people who are deeper into AI safety or LLMs.
Hi, everyone! I am currently focusing on constructing a domain-specific benchmark and I would like to ask for some advice.
In order to enhance the benchmark, I want to incorporate several modules from the pipeline of one of the domain-specific SOTA models. These modules form the foundation of my benchmark construction pipeline, in the sense that they do the heavy "language modeling". All questions and answers are built upon the output of these modules (as well as the original raw text, etc.).
However, since benchmarks are used for evaluation, will using a domain-specific model cause "contamination" that makes the evaluation results unreliable? And will it be mitigated if I simply avoid evaluating the SOTA model itself, as well as models that are based on it (given that quality assurance is carefully conducted)?
So far I haven't found any previous work (in any domain) that does this kind of thing for benchmark construction. If any previous benchmarks do this, please point me to the references. Thanks in advance!
Recently, Unsloth added support for fine-tuning multi-modal LLMs as well, starting with Llama 3.2 Vision. This post walks through the code for fine-tuning Llama 3.2 Vision on the Google Colab free tier: https://youtu.be/KnMRK4swzcM?si=GX14ewtTXjDczZtM
Hi, I'm fine-tuning mBART-50-many-to-many-mt on a language that was unseen in its pre-training.
I did a lot of background research and found many papers reporting that fine-tuning NMT models on high-quality data for an unseen language works and gives good results (BLEU: 10).
When I try to replicate this, it doesn't work at all (BLEU: 0.1 after 5 epochs), and I don't know what I'm doing wrong. I basically followed Hugging Face's documentation to write the code, and I verified it by cross-checking against the GitHub repo of someone who fine-tuned the same model.
A little more context:
The dataset consists of En->Xx sentence pairs.
I used the auto tokenizer and Hugging Face's Trainer to train the model.
As for arguments, the important ones are LR: 0.0005, epochs: 5 (runtime constraints), batch size: 16 (memory constraints), optimizer: AdamW. The loss improved from 3.3 to 0.8 after 5 epochs, and BLEU went from 0.04 to 0.1 (I don't know if this counts as improvement).
I even looked into the most common reasons why this could happen, and I've made sure not to overlook anything. The dataset quality is high, the tokenization is correct, and the arguments are correct. So I'm very lost as to why this is happening. Can someone please help me?
Reviews are to be released in less than 24 hours. Nervous
Just sharing our paper presented at EMNLP 2024 main conference, which introduces a sentence embedding model that captures both the semantics and communicative intention of utterances. This allows for the modeling of conversational "steps" and thus the automatic extraction of dialog flows.
We hope some of you find it useful! :)
Resources:
Paper Key Contributions:
Have a nice day everyone! :)
Hi everyone ,
I am a beginner at NLP, and I am trying to train mBART-50 for translation on an unseen language. I have gone through a lot of docs and a hell of a lot of discussions, but nobody seems to address this. So I am not sure whether my issue is valid or just in my head.
As far as I know, BART has a predefined vocabulary in which each token is defined. Given that, if I am training the model on an unseen language, do I have to extend the vocabulary by adding tokens from the new language? Or does the model extend its vocabulary on its own?
To give a little more context: I can tokenize the English sentences using the pretrained tokenizer, and for the unseen language I have a tokenizer that was trained for Indic languages and does tokenize the sentences properly. What I am confused about is this: if I pass those tokens to the model, wouldn't it just treat them as <unk> (unknown token), since they're not present in its vocab?
Kindly help me with this. If someone can guide me, I'd appreciate it!
So, I am about to graduate in about a month with a bachelor's in Linguistics (with a 4.0, if that matters?) and I am trying to make sense of what to do after. I would really love to work in NLP, but unfortunately I didn't have time to complete more than a single Python text processing class before my time ran out. (I've done other things on my own, like CS50, really loved it, and picked up the content fast, so not liking CS is not a concern.) I'd really love to pursue a master's degree in comp ling, for example through the University of Washington, but I don't have $50k ready to go for that, nor do I have the math basics to be admitted.
So, my thought is that I'll get a job that will take any degree, then use that to pay for a second bachelor's in comp sci through something affordable for me like WGU, and use both degrees together to get me into a position I'd really love. I could then decide to pursue a master's once I'm more stable.
Does this sound ridiculous? Essentially what I’m asking before I actually try to go through with it is, would getting a second bachelors in comp sci after my first in linguistics be enough to break into nlp?
I am very new to NLP and the project I am working on is a chatbot, where the pipeline takes in the user query, identifies some unique value the user is asking about and performs a lookup. For example, here is a sample query "How many people work under Nancy Drew?". Currently we are performing windowing to extract chunks of words and performing look-up using FAISS embeddings and indexing. It works perfectly fine when the user asks for values exactly the way it is stored in the dataset. The problem arises when they misspell names. For example, "How many people work under nincy draw?" does not work. How can we go about handling this?
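One simple option to try before the embedding lookup is fuzzy string matching against the list of stored names. The sketch below uses `difflib.get_close_matches` from the Python standard library. The name list, the `resolve_name` helper and the 0.7 cutoff are assumptions for illustration, and the cutoff would need tuning on real data.

```python
from difflib import get_close_matches

# Hypothetical list of names as stored in the dataset.
KNOWN_NAMES = ["Nancy Drew", "Ned Nickerson", "Bess Marvin"]


def resolve_name(candidate: str, known_names=KNOWN_NAMES, cutoff: float = 0.7):
    """Map a possibly misspelled chunk of the query to a stored name, or None."""
    # Compare in lowercase, but return the name in its stored form.
    lookup = {name.lower(): name for name in known_names}
    matches = get_close_matches(candidate.lower(), list(lookup), n=1, cutoff=cutoff)
    return lookup[matches[0]] if matches else None


print(resolve_name("nincy draw"))
```

Each windowed chunk of the query could be passed through a function like this first, with the FAISS lookup then run on the corrected name.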
I’ve been working on a project designed to make audio transcription, translation, and content summarization (like interviews, cases, meetings, etc.) faster and more efficient.
Do you think something like this would be useful in your work or daily tasks? If so, what features or capabilities would you find most helpful?
Let me know your thoughts 💭 💭
PS: DM me if you want to try it out
I am trying to make a new dictionary for my psychology bachelor’s thesis but the programme is refusing to recognise the words.
I have never used LIWC before and I’m at a complete loss. I don’t even know what is wrong. Can someone please help me out?
Use this module if you're tired of relearning regex syntax every couple of months :)
https://github.com/kallyaleksiev/aire
It's a minimalistic library that exposes a `compile` primitive which is similar to `re.compile` but lets you define the pattern in natural language.
I've completely lost faith in Google Gemini. They're flat-out misrepresenting their memory features, and it's really frustrating. I had a detailed discussion with ChatGPT a few weeks ago about some coding issues. It remembered everything and offered helpful advice. When I tried the same thing with Gemini, it was like starting from scratch – it didn't remember anything. To add insult to injury, they market additional memory for a higher price, even though the basic version doesn't work. Google's completely misrepresenting the memory capabilities of Gemini.
I've never formally studied NLP, but I'm familiar with concepts like sentiment analysis, POS tagging, and distributional semantics at a conceptual level. I'd like to read some NLP papers and research to get more into this world and to figure out whether I truly like it or not.