/r/LanguageTechnology


This sub will focus on theory, careers, and applications of NLP (Natural Language Processing), which includes anything from Regex & Text Analytics to Transformers & LLMs.

A community for discussion and news related to Natural Language Processing (NLP).

Natural language processing (NLP) is a field of computer science, artificial intelligence and computational linguistics concerned with the interactions between computers and human (natural) languages, and, in particular, concerned with programming computers to fruitfully process large natural language corpora.

Information & Resources

Related subreddits

Guidelines

  • Please keep submissions on topic and of high quality.
  • Civility & Respect are expected. Please report any uncivil conduct.
  • Memes and other low effort jokes are not acceptable forms of content.
  • Please follow proper reddiquette.


50,422 Subscribers

3

Please help: AI Ethics in Translation: Survey on MT's Impact

Good day!

This survey was created by my student, and she wasn’t sure how Reddit works, so she asked for my help. Here is her message:

Hi everyone! 👋 I’m a 4th-year Translation major, and I’m conducting research on the impact of machine translation (MT) and AI on the translation profession, especially focusing on ethics. If you’re a translator, I would greatly appreciate your insights!

The survey covers topics like MT usage, job satisfaction, and ethical concerns. Your responses will help me better understand the current landscape and will be used solely for academic purposes. It takes about 10-15 minutes, and all responses are anonymous.

👉 https://forms.gle/GCGwuhEd7sFnyqy7A

Thank you so much in advance for your time! 🙏 Your input means a lot to me.

0 Comments
2024/11/10
15:25 UTC

0

Does anyone else find the English language is almost set up for failure

Two, to, too; witch, which; don't forget one, won; sun, son. The list goes on and on, and then you throw in slang, sarcasm, and to finish it off (consciousness) with a splash of individuality.

I just see flaws in the way we communicate. Am I the only one???

6 Comments
2024/11/10
13:09 UTC

2

Recommendations for an Embedding Model to Handle Large Text Files

Hey everyone,

I'm working on a project that requires embedding large text files, specifically financial documents like 10-K filings. Each file has a high token count, and I need a model that can handle this efficiently.
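Since a 10-K far exceeds most embedding models' context windows, the usual pattern is to split the file into overlapping chunks and embed each chunk separately. A minimal sketch of the chunking step (the word-based sizes are a rough stand-in for real token counts):

```python
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    """Split text into word-based chunks with overlap, so context that
    straddles a boundary is not lost. Sizes are in words here, a rough
    proxy for tokens; a real pipeline would use the model's tokenizer."""
    words = text.split()
    if not words:
        return []
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

# Each chunk would then be embedded separately and stored in a
# vector index keyed by filing and chunk position.
doc = " ".join(f"w{i}" for i in range(1000))
chunks = chunk_text(doc, chunk_size=512, overlap=64)
print(len(chunks))
```

With overlap, the first 64 words of each chunk repeat the tail of the previous one, which keeps sentences split at a boundary retrievable from at least one chunk.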

4 Comments
2024/11/10
12:14 UTC

0

Today’s Big Question: Can AI Really Understand Language?🤨

Hey r/LanguageTechnology! 🌐

AI’s making leaps in language processing, but are we hitting the ceiling? Translation is one thing, but can machines truly grasp cultural context, emotions, or the depth of human language?

What’s your take—will AI ever go beyond just statistical predictions to genuine understanding? Or are we stuck with “one-size-fits-all” models?

Let’s discuss!👇🏽🥲

7 Comments
2024/11/10
09:07 UTC

6

How do I find consultants with NLP expertise?

I work at a non-profit, and we just completed a series of interviews. I would like to use NLP to process the text from these interviews but am not sure where to start. Should I hire a consultant, buy a software package, or look for an NLP core group at a university?
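For a sense of what even a first pass can surface before engaging anyone, here is a minimal stdlib sketch of term-frequency analysis over a transcript (the stopword list is a tiny illustrative stand-in for a real one):

```python
from collections import Counter
import re

# Tiny illustrative stopword list; a real pass would use a fuller one.
STOPWORDS = {"the", "a", "an", "and", "or", "to", "of", "in", "we",
             "i", "it", "is", "was", "our"}

def top_terms(transcript: str, n: int = 5) -> list[tuple[str, int]]:
    """Return the n most frequent non-stopword terms in a transcript."""
    tokens = re.findall(r"[a-z']+", transcript.lower())
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    return counts.most_common(n)

interview = "We loved the program. The program helped our community, and the community grew."
print(top_terms(interview, 3))
```

If simple counts like these already answer the question, a consultant may be unnecessary; if they don't, the gap they leave is a good brief for one.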

20 Comments
2024/11/09
00:02 UTC

14

Can I Transition from Linguistics to Tech?

I am looking for some realistic opinions on whether it’s feasible for me to pursue a career in NLP. Here’s a bit of background about myself:

For my Bachelor's, I studied Translation and Interpretation. Although I later felt it might not have been the best fit, I completed the program. Afterward, I decided to shift paths and am now pursuing a Master’s degree in Linguistics/Literature. When choosing this degree, I believed that linguistics or literature were my only options given my undergraduate background.

However, since beginning my Master's, I’ve developed a strong interest in Natural Language Processing, and I genuinely want to build a career in this field. The challenge is that, because of my background and current coursework, I have no formal experience in computer science or programming.

So, is it unrealistic to aim for a career in NLP without a formal education in this field, or is it possible to self-study and acquire the skills I need? If so, how should I start, and what steps can I take to improve my skills?

15 Comments
2024/11/07
18:19 UTC

6

Open-Source PDF Chat with Source Highlights

Hey, we released an open-source project, Denser Chat, yesterday. With this tool, you can upload PDFs and chat with them directly. Each response is backed by highlighted source passages from the PDF, making it very transparent.

GitHub repo: Denser Chat on GitHub

Main Features:

  • Extract text and tables directly from PDFs
  • Easily build chatbots with denser-retriever
  • Chat in a Streamlit app with real-time source highlighting

Hope this repo is useful for your AI application development!
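The source-highlighting pattern can be sketched independently of denser-retriever: score candidate passages against the question and surface the best one as the citation (toy lexical-overlap scoring here, not the project's actual retriever):

```python
def best_passage(question: str, passages: list[str]) -> tuple[int, str]:
    """Rank candidate passages by word overlap with the question and
    return (index, passage) of the best match -- the span a UI would
    highlight next to the answer."""
    q = set(question.lower().split())

    def score(p: str) -> int:
        return len(q & set(p.lower().split()))

    idx = max(range(len(passages)), key=lambda i: score(passages[i]))
    return idx, passages[idx]

pages = [
    "revenue grew 12 percent year over year",
    "the board approved a dividend of 0.5 per share",
]
print(best_passage("what dividend did the board approve", pages))
```

A real retriever swaps the overlap score for dense or hybrid similarity, but the highlight-the-evidence contract stays the same.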

2 Comments
2024/11/07
06:53 UTC

2

Improving NLP models through neurolinguistics and neuroscience

Lately, I’ve been considering specializing in combining neuroscience, particularly neurolinguistics, to improve neural networks and, in general, the language capabilities of AI systems. But I have several doubts about this.

First of all, I don’t come from a computer science or neuroscience background—I have an undergraduate degree in languages and linguistics, and now I’m pursuing a master’s in NLP and neuroscience.

I wanted to ask:

1.	Given the current development of LLMs, transformers, etc., is this type of research between neuroscience and NLP still useful?

2.	Could this kind of research be relevant in the tech industry as well as academia? Some people say that neuroscience has nothing more to offer to AI/NLP, while others believe it’s the future of AI.

3.	What types of research do you know about that combine neurolinguistics with NLP to improve the language of these models? Perhaps you could suggest some papers. So far, I’ve seen some very recent research using neurolinguistic data, like fMRI data, to analyze how language models like BERT represent language compared to the human brain.

4.	I’m not sure what kind of background is necessary for this field. I notice that people working in this area usually have a STEM background in engineering, CS, or neuroscience, and I wonder if my background would be suitable. 

The point is that I don't want to do pure research in neurolinguistics or neuroscience whose results then merely guide AI/NLP researchers; I would like to use neurolinguistics to improve AI and NLP directly, which is somewhat different.

2 Comments
2024/11/06
16:27 UTC

9

What should I major in to pursue a career in language technology?

Hello, I am a high schooler who wants to go into computational linguistics in the future. Is it better to pursue an undergraduate degree in linguistics + computer science or linguistics + data science? And if the school I end up going to offers an undergraduate degree in computational linguistics, should I take it or go more broad?

Thanks in advance!

13 Comments
2024/11/05
23:56 UTC

3

Seeking Help to Build a SaaS MVP for a Niche Market - Open to Collaborations

Hey everyone,

I’m looking to create an MVP for a SaaS product in a very niche area where I have around 11 years of experience. I truly believe this could be a game-changer for both professionals and enthusiastic hobbyists, especially if we manage to get it off the ground with the limited resources I currently have.

Here’s the problem: the type of work this tool would handle requires specialized knowledge that's hard to find. For businesses, finding qualified people is a real challenge, and when they do, the process tends to be really time-consuming. I think if we could make this tool work, it would be easy to market to companies in this niche around the world.

For hobbyists and enthusiasts, this tool could be a huge help too. It would allow them to perform highly technical tasks with just some basic understanding. I’m imagining it like this: watch a couple of general YouTube videos, and you’re good to go.

About the SaaS Tool (MVP)

The idea for the MVP is relatively simple. Imagine an LLM (large language model) that reads a PDF file of electronic schematics and provides a step-by-step guide, asking the user to input measurements and making decisions based on those inputs. It's like having a guided troubleshooting process for diagnostics.

If this MVP works, I’d like to look for funding to develop a full-fledged version, integrating communication with physical bench-top measuring tools, AI vision, and tapping into a wealth of knowledge from forums and resources already out there on the internet.
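The guided-troubleshooting core of the MVP can be prototyped as a plain decision tree before any LLM is involved: each node asks for a measurement and branches on the answer. In the full product the tree would be produced by an LLM reading the schematic PDF; the thresholds and fault labels below are made up for illustration:

```python
# Minimal guided-troubleshooting flow. Each node holds a question and a
# branching rule over the user's measurement. Hard-coded here; the MVP
# would have an LLM generate such a tree from the schematic.
TREE = {
    "start": ("Measure voltage at the power rail (V)?",
              lambda v: "psu_fault" if v < 4.5 else "check_clock"),
    "check_clock": ("Measure the clock frequency (MHz)?",
                    lambda f: "osc_fault" if f < 1.0 else "ok"),
}
VERDICTS = {
    "psu_fault": "Replace/repair the power supply.",
    "osc_fault": "Oscillator not running; check the crystal.",
    "ok": "Power and clock look fine; escalate.",
}

def troubleshoot(answers: list[float]) -> str:
    """Walk the tree, consuming one measurement per question asked."""
    node, i = "start", 0
    while node not in VERDICTS:
        _question, branch = TREE[node]
        node = branch(answers[i])
        i += 1
    return VERDICTS[node]

print(troubleshoot([5.0, 0.5]))
```

Getting this loop right with a hand-written tree first makes the later LLM integration a swap of the tree source, not a rewrite of the product.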

The Problem

Here’s the kicker: I’m not a developer, and I don’t know where to start with building this MVP. But I’m very open to learning, collaborating, and gathering all the help I can to create something that could attract investors and take this concept to the next level.

If anyone is interested in working together on this or has advice, my DMs are open. Whether you’re a developer, someone with experience in SaaS MVPs, or just curious about the concept, I’d love to connect.

Let’s see if we can make something exciting happen!

3 Comments
2024/11/05
20:00 UTC

1

Chatbot Reduction in execution time with reference to paper

Recently, I did a project based on a paper recently uploaded to arXiv.

The paper is "Enhancing Robustness in Large Language Models: Prompting for Mitigating the Impact of Irrelevant Information", which used GPT-3.5.

My idea was: what if we put information indicating which words are irrelevant into the embedding space as context?

I used just one sample as an experiment. The results were:

  1. Original query + no context vector: 5.01 seconds to answer
  2. Original query + context vector: 4.79 seconds
  3. (Original query + irrelevant information) + no context: 8.86 seconds
  4. (Original query + irrelevant information) + context: 6.23 seconds

My question is: is the time difference just a system artifact, or does the model really figure out the purpose of the query more easily when irrelevant information is included but flagged as irrelevant?

By the way, I used GPT-4 via the API.

Thanks

The experiment code is here: genji970/Chatbot_Reduction-in-execution-time_with-reference-to-paper-Enhancing-Robustness-in-LLM-
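On the timing question: one sample cannot separate a model effect from system noise. Repeating each condition and comparing the mean gap to the spread is the minimal fix (the call below is a stub standing in for the real API request):

```python
import random
import statistics
import time

def call_model(prompt: str) -> float:
    """Stand-in for the real API call; returns elapsed seconds.
    Replace the body with the actual request to compare conditions."""
    start = time.perf_counter()
    time.sleep(random.uniform(0.001, 0.003))  # simulated network/model latency
    return time.perf_counter() - start

def benchmark(prompt: str, trials: int = 20) -> tuple[float, float]:
    """Mean and standard deviation of latency over repeated trials."""
    times = [call_model(prompt) for _ in range(trials)]
    return statistics.mean(times), statistics.stdev(times)

mean_a, sd_a = benchmark("query without context vector")
mean_b, sd_b = benchmark("query with context vector")
# If |mean_a - mean_b| is much smaller than the spread (sd), the
# difference is likely just system noise, not a model effect.
print(f"A: {mean_a:.4f}s +/- {sd_a:.4f}  B: {mean_b:.4f}s +/- {sd_b:.4f}")
```

With API-backed LLMs, latency also varies by load and output length, so fixing the output length (e.g. max tokens) per condition makes the comparison fairer.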

2 Comments
2024/11/05
18:22 UTC

1

Run GGUF models using python

GGUF is an optimized file format for storing ML models (including LLMs) that enables faster, more memory-efficient inference. This post explains, with code, how to use GGUF LLMs (text-only) in Python with the help of Ollama and LangChain: https://youtu.be/VSbUOwxx3s0

0 Comments
2024/11/05
06:06 UTC

3

BM25 for Recommendation System

I’ve implemented a modified version of BM25 for a document recommendation system and want to assess its performance compared to the standard BM25. Is it feasible to conduct this evaluation purely through mathematical analysis, or is user-based testing (like A/B testing) necessary? Additionally, what criteria should be used to select the queries for this evaluation?

In the initial phase of my study, I couldn't find many resources on evaluating the reliability of recommendation system methodologies. Thanks
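A purely offline comparison is feasible if you have (or can construct) a query set: run both scorers and measure how much their rankings agree, or, with relevance labels, compute NDCG/MAP per variant. A self-contained sketch with standard Okapi BM25 and a top-k overlap measure (the toy corpus is illustrative):

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Standard Okapi BM25 over whitespace-tokenized docs -- the
    baseline a modified variant would be compared against."""
    toks = [d.lower().split() for d in docs]
    N = len(toks)
    avgdl = sum(len(t) for t in toks) / N
    df = Counter()
    for t in toks:
        df.update(set(t))  # document frequency per term
    scores = []
    for t in toks:
        tf = Counter(t)
        s = 0.0
        for w in query.lower().split():
            if w not in tf:
                continue
            idf = math.log((N - df[w] + 0.5) / (df[w] + 0.5) + 1)
            s += idf * tf[w] * (k1 + 1) / (tf[w] + k1 * (1 - b + b * len(t) / avgdl))
        scores.append(s)
    return scores

def overlap_at_k(scores_a, scores_b, k=2):
    """Agreement of two scorers' top-k rankings. With graded relevance
    labels you would compute NDCG or MAP per scorer instead."""
    top = lambda s: set(sorted(range(len(s)), key=lambda i: -s[i])[:k])
    return len(top(scores_a) & top(scores_b)) / k

docs = ["cheap red shoes", "red running shoes sale", "laptop charger", "running socks"]
a = bm25_scores("red shoes", docs)
print(overlap_at_k(a, a))
```

Ranking agreement alone only tells you how *different* the variants are, not which is *better*; for quality you need labels (editorial judgments or click logs), which is where A/B testing usually enters. Query selection should mirror the production distribution: sample real user queries, stratified by length and frequency.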

4 Comments
2024/11/04
09:21 UTC

1

Newbie

Hi, I'm a 21-year-old guy. I heard about generative AI prompt engineering, and it seemed interesting to me. Can you guys guide me on a pathway to learn it?

1 Comment
2024/11/04
02:30 UTC

13

Biggest breakthroughs/most interesting developments in NLP?

Hello! I have no background in any of this. I've been really curious about the whole field lately. Not necessarily for any particular reason- I'm just fascinated by it. What would you say are some of the most important breakthroughs specifically in NLP and especially in real world applications in recent history? Also, what are some texts or resources you'd recommend for the casually curious pedestrian about machine learning, computational linguistics, etc. in general? Not for someone trying to enter the field or study for a degree. More like a "for Dummies." Thanks!

10 Comments
2024/11/04
00:43 UTC

6

What is the state-of-the-art for entity tagging + resolution?

I am trying to create a mechanism that can tag/identify keywords/phrases/n-grams within a text and match them to a custom vocabulary. Because I am dealing with highly dissimilar word forms such as acronyms (e.g., MBA = Masters of Business Administration), techniques like string or edit distance would not work for my use case.

So far I have tried embeddings (e.g., custom fastText), few-shot models, as well as straight prompting OpenAI, but the results are still not adequate.

Is this simply a function of needing larger custom trained embeddings, or are there more advanced approaches for this specific use case?
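One cheap complement to embeddings for the acronym case specifically: generate candidate acronyms from the vocabulary itself and index both forms, so "MBA" resolves without any similarity computation at all (the vocabulary below is illustrative):

```python
def acronym(phrase: str) -> str:
    """First letters of content words, e.g.
    'Masters of Business Administration' -> 'MBA'."""
    skip = {"of", "the", "and", "for", "in"}
    return "".join(w[0] for w in phrase.split() if w.lower() not in skip).upper()

def build_index(vocabulary: list[str]) -> dict[str, str]:
    """Map both the canonical phrase and its generated acronym
    (lowercased) to the canonical surface form."""
    index = {}
    for term in vocabulary:
        index[term.lower()] = term
        index[acronym(term).lower()] = term
    return index

vocab = ["Masters of Business Administration", "Natural Language Processing"]
index = build_index(vocab)
print(index["mba"], "|", index["nlp"])
```

Exact dictionary hits like these can run first, with embedding or LLM matching reserved for the tokens the index misses; that usually improves precision on acronyms more than scaling the embeddings does.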

5 Comments
2024/11/03
14:24 UTC

2

I am looking for a way to implement AI TTS in Python

Hello, I am trying my best to learn AI and build myself an AI-driven robot. For now I have a basic chatbot, and I wanted to add AI text-to-speech (like Tacotron2 or XTTS). During my research I found Coqui, which has a good Python API, but it looks like it's not maintained anymore; I have a lot of issues using it, and no tutorials have been helpful.

That's why I wanted to ask if somebody could recommend a good replacement for Coqui: something I could fine-tune a model with and then integrate into my Python project for my chatbot. Or maybe someone could help me set up Coqui, if that's still possible and I just can't find good docs.

0 Comments
2024/11/03
10:56 UTC

6

Part time masters specializing in NLP

Hello, I have the opportunity to get reimbursed for advancing my education. I work in a data science team, dealing primarily with natural language data. My knowledge of what I do is based solely on my background in behavioral sciences (I have an MS degree here) and everything that I needed to learn online to perform my job requirements. I would love to get a deeper understanding of the concepts involved in the computational tools I use so I can be more flexible and creative in using the technology available.

That said, I am looking for a part time masters program that specializes in NLP. It has to be part time as I would like to keep this job, and they only reimburse 6 credits per semester. Ideally, I am looking for something that can be done online but I am also open to relocating to other states in the US.

Do you have any recommendations, or are you in a program you like? Would love to get your input.

Thank you!

4 Comments
2024/11/02
13:15 UTC

11

Few Queries around learning NLP

Folks, please assist me by choosing to answer any 1 or all of the below queries.

  1. Could you please suggest a great modern reference book for learning NLP with PyTorch that also has a GitHub page? Something that includes transformers is what I am looking for. I have some older references (4-6 yrs old) from O'Reilly/Manning/Packt on NLP, but I am not sure if they'd still be relevant. Comment if I can use these.

  2. Can someone also demystify whether I should continue learning to build things with PyTorch and the transformers library (which I believe is the richer format for learning), or whether I should learn fastai? I am really not looking for rapid prototyping at the moment, but everyone tells me it's relevant.

  3. How did you teach yourself to build NLP projects? Any insights into the process are welcome. How does one build projects today? Is it all about pre-trained models? What's the better thought process?

Background: I understand the theoretical concepts around NLP (and deep learning in general), but I am not well versed in the developments since transformers. I am comfortable writing code with PyTorch. I'm looking forward to building basic-to-advanced NLP projects in a systematic, organized learning format in order to develop the skill.

Apologies in advance if I have asked too much in a single post. Thanks in advance.

0 Comments
2024/11/02
11:28 UTC

0

A simple LLM-powered Python script that bulk-translates files from any language into English

0 Comments
2024/11/02
06:39 UTC

1

Translation Technology For A Self Made Writing System

Hello everyone! I have what should hopefully be a unique project I wouldn't mind assistance with. Because I am weird, as a mental exercise, I am in the process of creating my own writing system. This includes making new, unique alphabet letters, punctuation marks, and numbers.

I'm wondering if anyone might know of any programs that would allow me to import pictures of the new letters, numbers, and punctuation marks, along with the rules of the writing system (such as the direction of writing), and then use them to translate English into the new writing system.

4 Comments
2024/11/02
06:34 UTC

4

SLM Finetuning on custom dataset

I am working on a use case where we have call-center transcripts (between caller and agent) and need to extract certain information from them (e.g., whether the agent committed to the caller that their issue would be resolved in 5 days).

I tried gpt-4o-mini and the output was great.

I want to fine-tune an SLM like Llama 3.2 1B; out-of-the-box output from it wasn't great.
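For a narrowly specified field like "resolved in N days," a regex baseline is worth measuring before any fine-tuning: it gives a performance floor and can weakly label transcripts for the fine-tuning set (the pattern below is illustrative, not exhaustive):

```python
import re

# Illustrative pattern for commitments like "resolved in 5 days" or
# "fixed within 2 business days"; production would need many more phrasings.
COMMITMENT = re.compile(
    r"\b(?:resolved|fixed|sorted)\b.{0,30}?\b(?:in|within)\s+(\d+)\s+"
    r"(?:business\s+)?days?\b",
    re.IGNORECASE,
)

def extract_commitment_days(transcript):
    """Return the promised number of days, or None if no commitment found."""
    m = COMMITMENT.search(transcript)
    return int(m.group(1)) if m else None

t = "Agent: don't worry, your issue will be resolved in 5 days at most."
print(extract_commitment_days(t))
```

If the fine-tuned SLM cannot beat this kind of baseline on your field, the problem is probably the training data rather than the model size.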

Any suggestions/approach would be helpful.

Thanks in advance.

8 Comments
2024/11/01
16:58 UTC

6

Machine Translation of Maharashtri Prakrit (an ancient Indian language) to English by Fine-Tuning M2M100_418M model on custom made Dataset.

Hey Folks,
I have created a machine translation model to translate Maharashtri Prakrit to English. I created the dataset manually, since Maharashtri Prakrit is an extremely low-resource language; very few texts currently exist in digital form. The dataset, called Deshika, has 1.47k sentences (extremely tiny, but there were no resources from which I could build a larger one). I fine-tuned the M2M100 model, and it achieved a BLEU score of 15.34 and a METEOR score of 0.4723. I know this model, praTranv2, is not that good because of the small dataset. Can you all suggest how I can increase the performance of this model, and how I might grow the dataset?

github link: https://github.com/sarveshchaudhari/praTran.git
dataset link: https://huggingface.co/datasets/sarch7040/Deshika
model link: https://huggingface.co/sarch7040/praTranv2

2 Comments
2024/11/01
16:29 UTC

0

SacreCOMET: Pitfalls of the most popular MT metric

3 Comments
2024/11/01
12:46 UTC

12

CL/NLP/LT Master's Programs in Europe

Hello! (TL;DR at the bottom)

I am quite new here since I stumbled upon the subreddit by chance while looking up information about a specific master's program.

I recently graduated with a bachelor's degree in (theoretical) Linguistics (phonology, morphology, syntax, semantics, sociolinguistics etc.) and I loved my major (graduated with almost a 3.9 GPA) but didn't want to rush into a master's program blindly without deciding what I would like to REALLY focus on or specialize in. I could always see myself continuing with theoretical linguistics stuff and eventually going down the 'academia' route; but realizing the network, time and luck one would need to have to secure a position in academia made me have doubts. I honestly can't stand the thought of having a PhD in linguistics just because I am passionate about the field, only to end up unemployed at the age of 30+, so I decided to venture into a different branch.

I have to be honest, I am not the most well-versed person out there when it comes to CL or NLP but I took a course focusing on computational methods in linguistics around a year ago, which fascinated me. Throughout the course, we looked at regex, text processing, n-gram language models, finite state automata etc. but besides the little bit of Python I learned for that course, I barely have any programming knowledge/experience (I also took a course focusing on data analysis with R but not sure how much that helps).

I am not pursuing any degree as of now, you can consider it to be something similar to a gap year and since I want to look into CL/NLP/LT-specific programs, I think I can use my free time to gain some programming knowledge by the time the application periods start, I have at least 6-8 months after all.

I want to apply to master's programs for the upcoming academic year (2025/2026) and I have already started researching. However, not long after I started, I realized that there were quite a few programs available and they all had different names, different program content and approaches to the area of LT(?). I was overwhelmed by the sheer number of options; so, I wanted to make this post to get some advice.

I would love to hear your advice/suggestions if anyone here has completed, is still doing or has knowledge about any CL/NLP/LT master's program that would be suitable for someone with a solid foundation in theoretical linguistics but not so much in CS, coding or maths. I am mainly interested in programs in Germany (I have already looked into a few there such as Stuttgart, Potsdam, Heidelberg etc. but I don't know what I should look for when deciding which programs to apply to) but feel free to chime in if you have anything to say about any program in Europe. What are the most important things to look for when choosing programs to apply to? Which programs do you think would prepare a student the best, considering the 'fluctuating' nature of the industry?

P.S.: I assume there are a lot of people from the US on the subreddit but I am not located anywhere near, so studying in the US isn't one of my options.

TL;DR: Which CL/NLP/LT master's programs in Europe would you recommend to someone with a strong background in Linguistics (preferably in Germany)?

14 Comments
2024/10/30
17:08 UTC

8

Why not fine-tune first for BERTopic

https://github.com/MaartenGr/BERTopic

BERTopic seems to be a popular method to interpret contextual embeddings. Here's a list of steps from their website on how it operates:

"You can swap out any of these models or even remove them entirely. The following steps are completely modular:

  1. Embedding documents
  2. Reducing dimensionality of embeddings
  3. Clustering reduced embeddings into topics
  4. Tokenization of topics
  5. Weight tokens
  6. Represent topics with one or multiple representations"

My question is: why not fine-tune on your documents first and get optimized embeddings, as opposed to directly using a pre-trained model for the embedding representations and then proceeding with the other steps?

Am I missing out on something?

Thanks

5 Comments
2024/10/29
20:27 UTC

3

How ‘Human’ Are NLP Models in Conceptual Transfer and Reasoning? Seeking Research on Cognitive Plausibility!

Hello folks, I'm doing research on few-shot learning, conceptual transfer, and analogical reasoning in NLP models, particularly large language models. There’s been significant work on how models achieve few-shot or zero-shot capabilities, adapt to new contexts, and even demonstrate some form of analogical reasoning. However, I’m interested in exploring these phenomena from a different perspective:

How cognitively plausible are these techniques?

That is, how closely do the mechanisms underlying few-shot learning and analogical reasoning in NLP models mirror (or diverge from) human cognitive processes? I haven’t found much literature on this.

If anyone here is familiar with:

  • Research that touches on the cognitive or neuroscientific perspective of few-shot or analogical learning in LLMs
  • Work that evaluates how similar LLM methods are to human reasoning or creative thought processes
  • Any pointers on experimental setups, papers, or even theoretical discussions that address human-computer analogies in transfer learning

I’d love to hear from you! I’m hoping to evaluate the current state of literature on the nuanced interplay between computational approaches and human-like cognitive traits in NLP.

1 Comment
2024/10/28
20:50 UTC

1

Model for cleaning queries, for example dimensions and measurements

I'm working on a problem where I have a product name, but this product might contain dimensions, measurements and all sorts of engineering technical information.

My database is quite large, and there is absolutely no standardization for these queries, and sometimes they might be in different languages.

For example: "cork screw 7x2x 0.5lbs --in", this should be mapped to "cork screw".

With large LLMs I can easily solve this problem, but I cannot afford to run them.

Do you guys have any suggestions on how to tackle this problem, where inference is relatively fast?
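If most of the noise is dimensions, unit-bearing measurements, and flag-like suffixes, a regex pass is essentially free at inference time and often catches the bulk of it, leaving only the hard residue for a small model. A sketch (the unit list is illustrative; extend it to match your data and languages):

```python
import re

# Strip dimensions (7x2x0.5), unit-bearing measurements (0.5lbs, 10 mm),
# and flag-like suffixes (--in) from product names. Unit list illustrative.
UNIT = r"(?:lbs?|kg|g|mm|cm|m|in|ft|oz)"
PATTERNS = [
    r"--\w+",                                                        # flags like --in
    r"\b\d+(?:\.\d+)?(?:\s*x\s*\d+(?:\.\d+)?)+\s*" + UNIT + r"?\b",  # 7x2x 0.5lbs
    r"\b\d+(?:\.\d+)?\s*" + UNIT + r"\b",                            # 0.5lbs, 10 mm
]

def clean(query):
    """Lowercase, strip measurement noise, collapse whitespace."""
    q = query.lower()
    for p in PATTERNS:
        q = re.sub(p, " ", q)
    return " ".join(q.split())

print(clean("cork screw 7x2x 0.5lbs --in"))
```

Regex inference is microseconds per query, so it scales to a large database easily; a lightweight token classifier (or a small distilled model) can then handle only the queries the rules leave ambiguous.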

2 Comments
2024/10/28
17:14 UTC

1

Looking for Open-Source Multilingual TTS Training Data (French, Spanish, Arabic)

Hi everyone,

I'm working on building a multilingual TTS system and am looking for high-quality open-source data in French, Spanish, and Arabic (in that order of priority). Ideally, I'd like datasets that include both text and corresponding audio, but if the audio quality is decent, I can work with audio-only data too.

Here are the specifics of what I'm looking for:

  • Audio Quality: Clean recordings with minimal background noise or artifacts.
  • Sampling Rate: At least 22 kHz.
  • Speakers: Ideally, multiple speakers are represented to improve robustness in the TTS model.

If anyone knows of any sources or projects that offer such data, I’d be extremely grateful for the pointers. Thanks in advance for any recommendations!
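Screening candidate corpora against requirements like these is easy to automate; a stdlib sketch that checks the sampling-rate constraint (at least 22 kHz) on WAV files:

```python
import io
import wave

def meets_spec(wav_bytes: bytes, min_rate: int = 22_000) -> bool:
    """Check a WAV file (as bytes) against the sampling-rate requirement."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as w:
        return w.getframerate() >= min_rate

def make_wav(rate: int) -> bytes:
    """Generate one second of silent 16-bit mono WAV at the given rate,
    purely for illustration; real screening would read corpus files."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(rate)
        w.writeframes(b"\x00\x00" * rate)
    return buf.getvalue()

print(meets_spec(make_wav(22_050)), meets_spec(make_wav(16_000)))
```

The same loop over a downloaded corpus quickly filters out sets recorded at 16 kHz (common in ASR data) that would not meet the TTS requirement.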

5 Comments
2024/10/28
16:12 UTC
