/r/LanguageTechnology

Photograph via snooOG

This sub will focus on theory, careers, and applications of NLP (Natural Language Processing), which includes anything from Regex & Text Analytics to Transformers & LLMs.

A community for discussion and news related to Natural Language Processing (NLP).

Natural language processing (NLP) is a field of computer science, artificial intelligence and computational linguistics concerned with the interactions between computers and human (natural) languages, and, in particular, concerned with programming computers to fruitfully process large natural language corpora.

Information & Resources

Related subreddits

Guidelines

  • Please keep submissions on topic and of high quality.
  • Civility & Respect are expected. Please report any uncivil conduct.
  • Memes and other low effort jokes are not acceptable forms of content.
  • Please follow proper reddiquette.

/r/LanguageTechnology

52,381 Subscribers

0

AI Langauge learning app

I am building an ai powered language learning app with features like real time pronunciation, more listening and speaking practice, and personalizes session planning. Anyone interested?? Or any suggestions??

2 Comments
2025/02/01
08:31 UTC

8

What is the minimum amount of parallel corpora needed for Machine Translation of Extremely Low Resource Ancient Language.

I am trying to build an nmt for prakrit languages. But I am having trouble finding the datasets. What must be the minimum threshold for the data size to get a descent BLEU score let's say around 30. You can also refer my earlier project I have posted in this subreddit.

2 Comments
2025/02/01
07:07 UTC

2

[P] Project - Document information extraction and structured data mapping

Hi everyone,

I'm working on a project where I need to extract information from bills, questionnaires, and other documents to complete a structured report on an organization's climate transition plan. The report includes placeholders that need to be filled based on the extracted information.

For context, the report follows a structured template, including statements like:

I need to rewrite all of those statements and merge them in the form a final, complete report. The challenge is that the placeholders must be filled based on answers to a set of decision-tree-style questions. For example:

1.1 Does the organization have a climate transition plan? (Yes/No)

  • If Yes → Go to question 1.2
  • If No → Skip to question 2

1.2 Is the transition plan approved by administrative bodies? (Yes/No)

  • Regardless, proceed to 1.3

1.3 Are the emission reduction targets aligned with limiting global warming to 1.5°C? (Yes/No)

  • Regardless, reference supporting evidence

And so on, leading to more questions and open-ended responses like:

  • "Explain how locked-in emissions impact the organization's ability to meet its emission reduction targets."
  • "Describe the organization's strategies to manage locked-in emissions."

The extracted information from the bills and questionnaires will be used to answer these questions. However, my main issue is designing a method to take this extracted information and systematically map it to the placeholders in the report based on the decision tree.

I have an idea in mind, but always like to have others' insights. Would appreciate your opinion on:

  1. Structuring the logic to take extracted data and answer the decision-tree questions reliably.
  2. Mapping answers to the corresponding sections of the report.
  3. Automating the process where possible (e.g., using rules, NLP, or other techniques).

Has anyone worked on something similar? What approaches would you recommend for efficiently structuring and automating this process?

Thanks in advance!

5 Comments
2025/01/31
10:27 UTC

5

What AI tools can I use for this NLP issue?

I'm looking for an AI solution to an issue I face pretty regularly. I run surveys and receive many open-end text responses. Sometimes there are up to 3k of these responses. From these responses, I need to find overarching themes that encompass the sentiment of the open-end text responses. Doing it manually in a team is an absolute pain as it involves reading each response individually and categorizing it in a theme manually. This takes a lot of time.

I've tried using ChatGPT 4-o and other specialized GPTs within the ChatGPT interface to try this but they do not work well. It randomly categorizes options after a point and only does the first 30-40 responses well. It also fails to recognize responses that have typos. Any solutions or specific tools you would recommend? My friend and I know how to code as well and would be open to using APIs, but ready to go services would be better.

13 Comments
2025/01/30
17:55 UTC

1

Need some help for a project

So the project is we get bunch of unstructured data like emails etc and we have to extract data from it like name, age and in case of order mails things like quantity, company name etc. I think Named Entity Recognition is the way to go but am stuck on how to proceed. Any help would be appreciated. Thank you

Edit: I know that we have can use NER but how do I extract things like quantity, item name etc apart from tags like Person, Location etc. Thanks

5 Comments
2025/01/30
16:39 UTC

1

NER with texts longer than max_length ?

Hello,

I want to do NER on texts using this model: https://huggingface.co/urchade/gliner_large_bio-v0.1 . The texts I am working with are of variable length. I do not truncate or split them. The model seems to have run fine on them, except it displayed warnings like:

UserWarning: The sentencepiece tokenizer that you are converting to a fast tokenizer uses the b
yte fallback option which is not implemented in the fast tokenizers. In practice this means that the fast version of the tokenizer can produce unknown tokens whereas the sentencepiece version would have converted these
unknown tokens into a sequence of byte tokens matching the original piece of text.
 warnings.warn(
Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.

I manually gave a max_length longer than what was in the config file:

model_name = "urchade/gliner_large_bio-v0.1"model = GLiNER.from_pretrained(pretrained_model_name_or_path=model_name, max_length=2048)

What could be the consequences of this?

Thank you!

6 Comments
2025/01/30
09:32 UTC

1

question about creating my own translation

so i dont really know if this is the right place to ask so if this is not the right place to ask this please point me to where is the most appropriate. with that said

my goal is to create my own japanese to english translator tool. i know japanese so even if the tool that i create isnt optimal it would be easy for me to correct.

what tools do i need to do to achieve my goal? does that tool also have a way to visualize the flow of the conversion through maybe a flowvhart? if not im fine with not having that feature.

also might be offtopic but is there a info on net where it shows you how the translator(machine or program) breaks down the sentence and translate it? interested in japanese text

3 Comments
2025/01/30
06:59 UTC

4

A Structure that potentially replaces Transformer

I have an idea to replace the Transformer Structure, here is a short explaination.

In Transformer architicture, it uses weights to select values to generate new value, but if we do it this way, the new value is not percise enough. 

Assume the input vectors has length N. In this method, It first uses a special RNN unit to go over all the inputs of the sequence, and generates an embedding with length M. Then, it does a linear transformation using this embedding with a matirx of shape (N X N) X  M.

Next, reshape the resulting vector to a matrix with shape N x N. This matrix is dynamic, its values depends on the inputs, whereas the previous (N X N) X  M matrix is fixed and trained.

Then, times all input vectors with the matrix to output new vectors with length N.

All the steps above is one layer of the structure, and can be repeated many times.

After several layers, concatanate the output of all the layers. if you have Z layers, the length of the new vector will be ZN.

Finally, use the special RNN unit to process the whole sequence to give the final result(after adding several Dense layers).

The full detail is in this code, including how the RNN unit works and how positional encoding is added: 

https://github.com/yanlong5/loong_style_model/blob/main/loong_style_model.ipynb

 

Contact me if you are interested in the algorithm, My name is Yanlong and my email is y35lyu@uwaterloo.ca

4 Comments
2025/01/29
21:24 UTC

1

installing BRAT on mac/linux

Hi, all.

This might be a long shot. I have some old annotation in .ann. My brat installation used to work. But I have tried multiple ways to install brat on both mac and linux server from source code and image, but all failed. It seems to be some cgi issue.

Since I haven't seen the source code updated for many years, I am not sure if it is still installable. If it can be installed, which source code/docker image has been proven to be working?

thanks!

0 Comments
2025/01/29
08:56 UTC

1

Please advice first ARR (ACL 2025) submission

Hi everyone.

I will submit for the first time to the ARR feb cycle including ACL conference.

The ACL 2025 website regulation states that long paper is up to 8 pages, so can't it be over 1-2 pages?

In fact, long papers in ACL, EMNLP, and NAACL conf have often been 9 to 10 pages.

7 Comments
2025/01/28
21:31 UTC

5

Need help with BERTopic and Top2Vec - Topic Modeling

Hello dear community!
I’m working with a dataset of job postings for data scientists. One of the columns contains the "required skills." I’d like to analyze this dataset using topic modeling to extract the most prominent skills and skill clusters.

The data looks like this:
"3+ years of experience with data exploration, data cleaning, data analysis, data visualization, or data mining. 3+ years of experience with statistical and general-purpose programming languages for data analysis. [...]"

I tried using BERTopic with "normal" embeddings and more tech focused embeddings but got very bad results. I am not experienced with Topic Modeling. I am glad for any help :)

10 Comments
2025/01/28
12:00 UTC

1

How to summarize multimodal content

0 Comments
2025/01/28
09:13 UTC

1

Should I switch to SDE or find NLP-related RA in the UK if I still want to go for a phd several years later?

Hi everyone, I’m an international student who recently graduated from the University of Edinburgh with a Master’s degree (Merit) in a field related to NLP and Machine Learning. My undergraduate background is in linguistics. After graduation, I noticed that finding a MLE role in the UK often requires a PhD. However, after discussing with my supervisor, she suggested that I consider applying for a RA position first, as the PhD application process is highly competitive.

I’m unsure about the best path forward and would appreciate some advice. Should I focus on finding an NLP-related RA position in the UK and then apply for a PhD? Or would it make more sense to first transition into a SDE role, gain industry experience, and later pivot to MLE before applying for a PhD based on my work experience? Alternatively, should I reconsider pursuing a PhD altogether?

Feel free to ask me for more information if it's needed for suggestions! Also appreciate if there is any lab or uni recommendations for RA/Phd.

FYI, I don't have any work experiences so far, only research experiences in linguistics and NLP.

3 Comments
2025/01/27
17:29 UTC

0

I want to learn new languages without straining my eyes. What AI conversation apps are best to do natural and step by step hands free calls with chatbots?

0 Comments
2025/01/26
13:40 UTC

15

How to do PhD research in NLP if we have advance models like GPT and Gemini already.

I am just wondering what avenues of research or what topic to do research on if we have advanced NLP models like Chat GPT and Gemini who have enormous processing power and training data access, I mean isn't the research useless if whatever we do Chat GPT can do better?

13 Comments
2025/01/26
00:14 UTC

3

Which natural language to learn?

Hi!

I'm a 17 years old guy from Moscow, in the 10th grade, and I'm planning to apply to either HSE (Higher School of Economics) or Moscow State University (MSU) for a program in Fundamental and Applied/Computational Linguistics. To do this, I'm planning to take the Unified State Exam (USE) in advanced mathematics, computer science, and English, as well as study some topics from the first-year curriculum in advance. I'm already gradually practicing programming in Python, advanced math (I'm currently reading about limits and integrals), and slowly getting into the basics of linguistics. I also want to start learning a second foreign language, which is mandatory in both universities. However, I don't know which one would be better. Both universities offer a choice of European and Asian languages.

It's important to me that the third language would be a good addition to my future resume or be in demand in NLP.

I'm not afraid of any difficulties. I'm ready for any challenges if I approach them at my own pace, I'm ready to adapt my mindset. I'm left-handed, so writing from right to left is not difficult for me, I tried it. Logograms are not a catastrophe for me to memorize as well. In fact, I love making up my own writing systems just for fun.

Which language would you choose and why?

Thank you!

15 Comments
2025/01/25
19:30 UTC

7

Got really bad scores at ARR Dec24 cycle

First time researcher here. I got assessment scores of 1.5, 1.5 and 2 from three reviewers. All the reviewers acknowledge the novelty of my work in strenghts. But the points reviewers raised in weakness if addressed will increase the paper length from short to long (as this was mainly an initial study as mentioned in limitations). Also reviewers dont seem to understand the point of paper.For such a low score, is their any point for doubling down on convincing reviewers or should I just acknowledge their criticism and improve in another submission? Also what should be my target scores for acceptance into a relevant ACL workshop?

1 Comment
2025/01/25
19:22 UTC

1

NAACL 2025 December Cycle

Anyone know what average overall score required to be accepted to main, or like what is a safe number? Is there anywhere I can see average scores for the October cycle?

2 Comments
2025/01/25
19:09 UTC

6

MSc Interview Speech and Language

Hi!

I've been invited to an interview for the MSc in Speech and Language Processing at Ediburgh. I've never done an interview for a program before so I'm unsure about what they would ask or about the organization of the interview.

Has anyone done an interview for this program or other related?

Any advice on the interview topic is welcomed!

2 Comments
2025/01/25
13:17 UTC

2

Is AI good for translation?

I mean for mainly business purposes, e.g., decks, content, reports, etc. Can AI do it well? Will it make bad mistakes? Should I use a person instead?

5 Comments
2025/01/25
11:29 UTC

3

I want to prepare myself to apply to the computational linguistics program at Université Paris Cité

I’ve been sifting through the website but cannot find some pretty basic info about the program details, such as application deadlines and if GREs are required. Has anyone studied or at least applied to UP Cité? I would really appreciate any help or direction. I’m coming from an unrelated area of study, if that helps at all. Thank you in advance.

11 Comments
2025/01/25
04:38 UTC

0

chatbot capable of interactive (suggestions, followups, context understanding) chat with very large SQL data (lakhs of rows, hundreds of tables)

Hi guys,

* Will converting SQL tables into embeddings, and then retreiving query from them will be of help here?

* How do I make sure my chatbot understands the context and asks follow-up questions if there is any missing information in the user prompt?

* How do I save all the user prompt and response in one chat so as to make context of the chat history? Will not the token limit of the prompt exceed? How to combat this?

* What are some of the existing open source (langchains') agents/classes that can be actually helpful?

**I have tried create_sql_query_chain - not much of help in understanding context

**create_sql_agent gives error when data in some column is of some other format and is not utf-8 encoded [Also not sure how does this class internally works]

* Guys, please suggest me any handy repository that has implemented similar stuff, or maybe some youtube video or anything works!! Any suggestions would be appreciated!!

Pls free to dm if you have worked on similar project!

0 Comments
2025/01/24
18:37 UTC

4

Master’s in CL without prior knowledge in IT

hey there!

I am currently looking for an MA program in Computer linguistics/ Language and AI or other programs that would connect IT with linguistics, yet I don’t have any previous experience in programming. Anyone knows about the programs in Europe (and the UK) which would accept applicants with various backgrounds without prior knowledge in IT? That would immensely help me.

Please, let me know if you’re by any chance aware of scholarships available for these countries/programs ✨✨

Thank you a lot in advance!

3 Comments
2025/01/24
17:39 UTC

0

I need help

Hello everyone. I am newbie in NLP world, and have a task from one firm. It is technical task for intern position. Here is the description of the task:

You task it to process provided technical articles and implement continual training for one of the large Language Models – BERT. The purpose is such that your BERT model understands the context of those papers and ready to answer questions related to those papers. For that, you need to work with Hugging Face. It is also suggested for you to work via Colab. Your deliverables are:

·       Deploy original BERT model and test it by asking the questions

·       Do continual training of BERT and generate a code allowing to ask questions regarding paper context

·       Compare answers of original and your BERT models and show that your model is fit-to-purpose

Here is my problem. As I know, when we finetune BERT we need question, answer, context, start and end positions of answer. But there are too many content provided by them. 6 pdfs which are separated books. Is there a way to generate that questions answers and etc in easy way?

2 Comments
2025/01/24
16:41 UTC

1

ACL Rolling Review December 2024

13 Comments
2025/01/24
09:57 UTC

4

Is there a list of all the shared task in NLP at one place ?

I am looking for currently running or future shared tasks in NLP .

4 Comments
2025/01/24
05:26 UTC

3

Topic Modeling for high volume chat data

0 Comments
2025/01/24
04:11 UTC

19

Have you observed better multi-label classification results with ModernBERT?

I've had success in the past with BERT and with the release of ModernBert I have substituted the new version. However, the results are nowhere near as good. Previously, finetuning a domain adapted BERT model would achieve an f1 score of ~.65, however swapping out for ModernBERT, the best I can achieve is an f1 score of ~.54.

For context, as part of my role as an analyst I partially automate thematic analysis of short text (between sentence and paragraphs). The data is pretty imbalanced and there are roughly 30 different labels with some ambiguous boundaries.

I am curious if anyone is experiencing the same? Could it be the the long-short attention isn't as useful for only shorter texts?

I haven't run an exhaustive hyperparameter search, but was hoping to gauge others' experience before embarking down the rabbit hole.

5 Comments
2025/01/23
22:05 UTC

1

Dataset for character prediction

Hello,

New to NLP and looking for a multilingual dataset/corpus (That won't crash my computer) that allows for a model to be trained that will predict the next character in a sequence. Thanks!

4 Comments
2025/01/23
21:56 UTC

Back To Top