/r/LanguageTechnology


A community for discussion and news related to Natural Language Processing (NLP).

Natural language processing (NLP) is a field of computer science, artificial intelligence and computational linguistics concerned with the interactions between computers and human (natural) languages, and, in particular, concerned with programming computers to fruitfully process large natural language corpora.

Information & Resources

Related subreddits

Guidelines

  • Please keep submissions on topic and of high quality.
  • Civility & Respect are expected. Please report any uncivil conduct.
  • Memes and other low effort jokes are not acceptable forms of content.
  • Please follow proper reddiquette.

/r/LanguageTechnology

47,297 Subscribers

1

A new ChatGPT learning platform.

0 Comments
2024/05/06
05:59 UTC

1

How does a basic grammar bot work?

I'm interested in creating a simple grammar bot to determine whether a sentence is a proper English sentence. Grammarly will explain specific mistakes; for example, if I write "He live in America" it will suggest: "It appears that the subject pronoun He and the verb live are not in agreement. Consider changing the verb. As the subject of a sentence or clause, a personal pronoun can be in the first person (I, we), the second person (you), or the third person (he, she, it, they). The same personal pronouns are either singular (I, you, he, she, it) or plural (we, you, they). Make sure that the form of the verb agrees with the form of the personal pronoun." That's a lot of rules just for one mistake. I assume Grammarly has plenty of English experts who wrote in these rules, but what might suffice to distinguish a string of random words from an English sentence (even if it's not perfect)?
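One lightweight way to tell word salad from English-like text, without hand-written rules, is an n-gram plausibility score. A toy sketch follows; the four-sentence "corpus" is made up, and a real scorer would count bigrams over a large corpus (e.g. with NLTK or KenLM):

```python
from collections import Counter

# Toy stand-in for a large English corpus; a real scorer would count
# bigrams over millions of sentences.
corpus = [
    "he lives in america",
    "she lives in france",
    "they live in spain",
    "we live in a small town",
]

bigrams = Counter()
for sentence in corpus:
    words = sentence.split()
    bigrams.update(zip(words, words[1:]))

def english_likeness(sentence):
    """Fraction of adjacent word pairs ever seen in the corpus."""
    words = sentence.lower().split()
    pairs = list(zip(words, words[1:]))
    if not pairs:
        return 0.0
    return sum(1 for p in pairs if p in bigrams) / len(pairs)

print(english_likeness("she lives in spain"))   # 1.0: every pair attested
print(english_likeness("spain she in lives"))   # 0.0: scrambled order
```

This won't explain mistakes the way Grammarly does, but it separates plausible word sequences from random ones, which is all the question asks for.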

0 Comments
2024/05/06
01:00 UTC

2

Using RAFT with embeddings

Can we use embeddings to add context while generating a training dataset for Retrieval Augmented Fine-Tuning (RAFT)? I plan to fine-tune GPT-3.5. P.S. I am quite a novice in the world of word embeddings and vector DBs, but I know the concepts.
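The idea can be sketched like this: embed each question, embed the candidate chunks, and attach the top-k most similar chunks (the oracle document plus distractors, which RAFT needs) as context for each training example. The `embed` function below is a bag-of-words stand-in, not a real embedding model, and the chunks are made up:

```python
import math, re
from collections import Counter

def embed(text):
    # Bag-of-words stand-in; in a real RAFT pipeline this would be a
    # call to an embedding model returning a dense vector.
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

chunks = [
    "The Eiffel Tower is in Paris and was built in 1889.",
    "Python is a popular programming language for NLP.",
    "Paris is the capital of France.",
]

def build_context(question, k=2):
    """Pick the k most similar chunks to attach as context for one
    RAFT training example (oracle document plus distractors)."""
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

print(build_context("Where is the Eiffel Tower?"))
```

With real embeddings the flow is identical; only `embed` changes.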

0 Comments
2024/05/05
20:22 UTC

3

Custom Entity Recognition for longer paragraphs.

Hello!
I am building an application that needs to do "Named Entity Recognition" (I think that is the right term), for entities that are longer pieces of text, 1-5 sentences. I have tried using AWS Comprehend Custom Entity Recognition, but it seems to be a better fit for entities that are just a few words.
Does any resources out there talk about this topic? I would ideally like to pay for a service to do the heavy lifting, but I will gladly create my own model from the bottom if I need to.
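One common reframing for multi-sentence "entities" is sentence-window classification rather than token-level NER: classify each sentence with a trained model, then merge adjacent positives into spans. A toy sketch, with a made-up keyword stub standing in for the trained sentence classifier:

```python
import re

def split_sentences(text):
    # Naive splitter; a real system would use spaCy or similar.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def is_entity_sentence(sentence):
    # Stub for a trained sentence-level classifier; here any sentence
    # mentioning "warranty" counts as part of the target span (made up).
    return "warranty" in sentence.lower()

def extract_spans(text):
    """Merge adjacent positive sentences into multi-sentence entities."""
    spans, current = [], []
    for sent in split_sentences(text):
        if is_entity_sentence(sent):
            current.append(sent)
        elif current:
            spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans

doc = ("We ship worldwide. The warranty lasts two years. "
       "Warranty claims need a receipt. Returns are separate.")
print(extract_spans(doc))
```

The merging step is what lets the extracted unit grow to the 1-5 sentence lengths described, which token-level NER services tend to handle poorly.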

Thanks in advance!

5 Comments
2024/05/04
13:59 UTC

1

need help on an emotion recognition model

Hello r/LanguageTechnology, I am currently working on a project that needs an emotion recognition model. I want the model to take verbal input from a user and classify whether the user is confident or not, based on attributes like clarity, modulation, pace, stuttering, and volume.
I am currently planning to use either LSTMs or RNNs for this, but if anyone has a better approach, please do help; I would appreciate it a lot :D.
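As a side note, some of the attributes listed (pace, pauses, fillers) can be computed directly from ASR word timestamps before any sequence model is involved; clarity, modulation, and volume would still need acoustic features from the audio itself. A sketch, assuming the ASR emits (word, start, end) tuples (the values below are made up):

```python
# Each word as (token, start_sec, end_sec), e.g. from an ASR system
# that emits word timestamps.
words = [
    ("i", 0.0, 0.2), ("think", 0.3, 0.6), ("um", 1.8, 2.0),
    ("the", 2.1, 2.3), ("answer", 2.4, 2.9), ("is", 3.0, 3.1),
    ("forty", 3.2, 3.6), ("two", 3.7, 4.0),
]

FILLERS = {"um", "uh", "er"}

def confidence_features(words, pause_threshold=0.5):
    """Speaking pace, long-pause count and filler ratio as crude features."""
    duration = words[-1][2] - words[0][1]
    wpm = len(words) / duration * 60
    pauses = sum(
        1
        for (_, _, end), (_, start, _) in zip(words, words[1:])
        if start - end > pause_threshold
    )
    fillers = sum(1 for w, _, _ in words if w in FILLERS) / len(words)
    return {"wpm": round(wpm, 1), "long_pauses": pauses,
            "filler_ratio": round(fillers, 3)}

print(confidence_features(words))
```

Features like these can feed a simple classifier as a baseline before (or alongside) an LSTM over acoustic frames.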

0 Comments
2024/05/04
13:58 UTC

3

Translation API that can return alternative translations?

In need of a tool that can do this, even if just for single words

Sometimes words/phrases can have multiple different translations in the language pair. I've been using DeepL but their API only gives one interpretation.

Would building an OpenAI assistant be effective for this?

6 Comments
2024/05/04
12:42 UTC

2

LLMs can't play tic-tac-toe. Why? Explained

0 Comments
2024/05/04
03:00 UTC

11

Which NLP-master programs in Europe are more cs-leaning?

I'm (hopefully) going to finish my bachelor's degree in Computational Linguistics and English Studies in Germany (FAU Erlangen-Nürnberg, to be precise) next year, and I'm starting to look into master's programs. As much as I love linguistics, thinking about job prospects I want to do a program that is much heavier on the computer science aspects than the linguistic ones. I sadly haven't been able to take any math courses, and I doubt I'd be able to finish the ones you would have in a normal CS degree before finishing my studies. I do, however, have programming experience in Python and Java, and I've also worked with neural networks before.

I'd like to stay in Europe, and I also can't afford places like Edinburgh with those absurd tuition fees (seriously, 31k? who can afford that?). I know Stuttgart is supposed to be good, Heidelberg too, although I don't know how CS-heavy that is considering it's a Master of Arts. I've also heard about the European Erasmus Mundus LCT program, although I wonder how likely it would be to get a scholarship for that. I'd also be a little worried about having to find housing twice in two years.

tl;dr

looking for a cs-heavy NLP master's in Europe (or something else that I could get into with basically no mathematical background and that would let me work with Machine Learning etc. later) that also won't require me to sell a kidney to afford it.

15 Comments
2024/05/03
17:16 UTC

3

Recommendations for text classification of high level conceptual categories

Hello lovely people of r/LanguageTechnology !

I am working on a project and would love any suggestions. I am a psychology researcher trying to use NLP for qualitative research with a dataset of ~350,000 social media posts related to my topic (a specific component of wellbeing). I would like to do a few text categorizations:

First a binary classification, relevant or irrelevant (I have done a lot of cleaning, but there is a limit to how much I can exclude before I start removing relevant posts, so my thought was to train a classifier to filter out irrelevant posts).

Second, sentiment (likely positive, negative, and neutral, though maybe just positive and negative)

And finally, three different theoretical dimensions/categories of the wellbeing concept I am analyzing (This one I am sure will be the most difficult, but also potentially isn't completely necessary, it would just be very cool). These would not be mutually exclusive.

I have been reading so much about transformers vs sentence transformers, and have also considered using an LLM (especially for the 3rd task, as it is highly conceptual and I could see a LLM having some advantage with that). I have also looked into using this framework, Adala (https://github.com/HumanSignal/Adala) for using an LLM - it looks promising to me. I also have considered fine-tuning a small LLM such as Phi-3 for this.

Does anyone have any recommendations? I have also gone back and forth whether I should train 3 separate models, or attempt to do it all as one big multi-class classification (it seems like with something like Adala I could do this).

Any recommendations? Thanks in advance!!
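For the binary relevance filter in particular, a classical baseline is worth trying before transformers or LLMs, since it trains in seconds on 350k posts. A toy Naive Bayes sketch (stdlib only; the example texts are made up):

```python
import math
from collections import Counter

def train_nb(examples):
    """examples: list of (text, label), label in {'rel', 'irr'}."""
    counts = {"rel": Counter(), "irr": Counter()}
    totals = Counter()
    for text, label in examples:
        counts[label].update(text.lower().split())
        totals[label] += 1
    vocab = set(counts["rel"]) | set(counts["irr"])
    return counts, totals, vocab

def predict(model, text):
    counts, totals, vocab = model
    scores = {}
    for label in ("rel", "irr"):
        logp = math.log(totals[label] / sum(totals.values()))
        denom = sum(counts[label].values()) + len(vocab)
        for w in text.lower().split():
            logp += math.log((counts[label][w] + 1) / denom)  # Laplace smoothing
        scores[label] = logp
    return max(scores, key=scores.get)

# Made-up training examples; in practice these come from a small
# hand-labelled sample of the posts.
train = [
    ("feeling grateful and content today", "rel"),
    ("practicing gratitude changed my wellbeing", "rel"),
    ("selling my old bike cheap", "irr"),
    ("crypto prices are up again", "irr"),
]
model = train_nb(train)
print(predict(model, "gratitude makes me content"))
```

If the baseline's precision/recall on a held-out labelled sample is already acceptable, the LLM budget can be saved for the harder third task.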

5 Comments
2024/05/03
12:54 UTC

2

Similarity of a group of tokens

Hi, I have been trying to cluster labels in order to create a small number of labels that represent the originals:

  • engineer, project
  • engineer, electrical
  • senior project engineer
  • senior mechanical engineer
  • administrator
  • etc.

The desired result is groups like:

  • (engineer, project; engineer, electrical)
  • (senior project engineer; senior mechanical engineer)
  • etc.

The steps I took:

  • tokenised the labels using nltk
  • created an embedding vector for each token using GloVe (its wiki-gigaword model)
  • obtained a single vector per label by taking the element-wise mean
  • calculated a similarity matrix
  • created groups of labels whose pairwise similarity is above a threshold (0.92)

My issue is that the labels containing "engineer" all get combined.

Something like "senior project engineer" and "project engineer" have a very high similarity score.

I have tried different operations to get the final vector instead of the average, but I get the same result.

Any ideas?

Should I multiply the similarity matrix by a Levenshtein distance?

I have not tried BERT or any transformer-based method.

Thanks!
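The mean-pooling step can be reproduced with toy vectors to see why this happens: even fully orthogonal token vectors give a high cosine once two labels share most of their tokens. The 4-d vectors below are made-up stand-ins for GloVe:

```python
import math

# Made-up orthogonal 4-d stand-ins for the real GloVe vectors.
vec = {
    "project":    [1, 0, 0, 0],
    "engineer":   [0, 1, 0, 0],
    "senior":     [0, 0, 1, 0],
    "mechanical": [0, 0, 0, 1],
}

def mean_pool(label):
    toks = label.split()
    return [sum(vec[t][d] for t in toks) / len(toks) for d in range(4)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

high = cosine(mean_pool("project engineer"), mean_pool("senior project engineer"))
low = cosine(mean_pool("project engineer"), mean_pool("senior mechanical engineer"))
print(round(high, 3), round(low, 3))  # 0.816 0.408
```

Sharing two of three tokens already gives ~0.82 with perfectly orthogonal vectors; real GloVe vectors for related job words are correlated, pushing such pairs past a 0.92 threshold. Down-weighting frequent tokens (e.g. TF-IDF weights before averaging) or mixing in an edit-distance term, as suggested, are both reasonable fixes.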

3 Comments
2024/05/02
21:23 UTC

2

Medspacy and Scispacy compatibility issue.

I am trying to use the en_ner_bionlp13cg_md model with medspacy. This only seems to work if I disable the parser, even though the parser is a major part of medspacy's appeal to me, as seen below:

nlp = medspacy.load("en_ner_bionlp13cg_md", disable=['parser'])

This is successful, but I lose parsing.

If I run the following:

nlp = medspacy.load("en_ner_bionlp13cg_md")
text = "blahblahblah"
doc = nlp(text)
visualize_ent(doc)

I get the following error:

ValueError Traceback (most recent call last)
Input In [86], in <cell line: 2>()
1 text = "blahblahblah"
----> 2 doc = nlp(text)
3 visualize_ent(doc)

File c:\Users\x\anaconda3\lib\site-packages\spacy\language.py:1054, in Language.call(self, text, disable, component_cfg)
1052 raise ValueError(Errors.E109.format(name=name)) from e
1053 except Exception as e:
-> 1054 error_handler(name, proc, [doc], e)
1055 if not isinstance(doc, Doc):
1056 raise ValueError(Errors.E005.format(name=name, returned_type=type(doc)))

File c:\Users\x\anaconda3\lib\site-packages\spacy\util.py:1722, in raise_error(proc_name, proc, docs, e)
1721 def raise_error(proc_name, proc, docs, e):
-> 1722 raise e

File c:\Users\x\anaconda3\lib\site-packages\spacy\language.py:1049, in Language.call(self, text, disable, component_cfg)
1047 error_handler = proc.get_error_handler()
1048 try:
-> 1049 doc = proc(doc, **component_cfg.get(name, {})) # type: ignore[call-arg]
1050 except KeyError as e:
1051 # This typically happens if a component is not initialized
1052 raise ValueError(Errors.E109.format(name=name)) from e

File c:\Users\x\anaconda3\lib\site-packages\PyRuSH\PyRuSHSentencizer.py:53, in PyRuSHSentencizer.call(self, doc)
51 def call(self, doc):
52 tags = self.predict([doc])
---> 53 cset_annotations([doc], tags)
54 return doc

File c:\Users\x\anaconda3\lib\site-packages\PyRuSH\StaticSentencizerFun.pyx:48, in PyRuSH.StaticSentencizerFun.cset_annotations()

File c:\Users\x\anaconda3\lib\site-packages\PyRuSH\StaticSentencizerFun.pyx:56, in PyRuSH.StaticSentencizerFun.cset_annotations()

File c:\Users\x\anaconda3\lib\site-packages\spacy\tokens\token.pyx:509, in spacy.tokens.token.Token.sent_start.set()

File c:\Users\x\anaconda3\lib\site-packages\spacy\tokens\token.pyx:528, in spacy.tokens.token.Token.is_sent_start.set()

ValueError: [E043] Refusing to write to token.sent_start if its document is parsed, because this may cause inconsistent state.

Any assistance in resolving this is greatly appreciated. I do not have this error if I use spacy.load(), only medspacy.load().

0 Comments
2024/05/02
17:01 UTC

1

MA in speech and language processing at Konstanz university?

0 Comments
2024/05/02
12:54 UTC

4

Please help me solve a problem

I have a huge CSV containing chats of an AI and a human discussing their feedback on a specific product. My objective is to extract the product feedback, since I want to improve my product, but the bottleneck is the huge dataset. I want to use NLU techniques to drop irrelevant conversations, but traversing the whole dataset and understanding each sentence takes a lot of time.

How should I go about solving this problem? I've been scratching my head over this for a long time now :((
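A usual first step is a cheap streaming prefilter over the CSV before any per-sentence NLU, so the expensive model only sees candidate rows. A sketch; the keyword list and column name are made up and would be tuned on a sample:

```python
import csv, io

# Cheap keyword prefilter before any expensive NLU step (assumption:
# feedback-bearing chats mention words like these; tune on a sample).
KEYWORDS = {"feedback", "issue", "love", "hate", "improve", "bug"}

def relevant_rows(csv_file, text_column):
    """Stream rows, yielding only ones whose text mentions a keyword."""
    for row in csv.DictReader(csv_file):
        words = set(row[text_column].lower().split())
        if words & KEYWORDS:
            yield row

demo = io.StringIO(
    "id,message\n"
    "1,i love the new dashboard\n"
    "2,what time is the meeting\n"
    "3,please improve the export bug\n"
)
print([r["id"] for r in relevant_rows(demo, "message")])  # ['1', '3']
```

Because it streams, this never loads the whole file into memory; the surviving rows can then go through sentiment/aspect extraction.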

5 Comments
2024/05/02
10:57 UTC

1

Issue on CoNLL Coreference Scorer

By CoNLL Scorer, I mean this: https://github.com/conll/reference-coreference-scorers

I have a Brazilian Portuguese corpus in SemEval format that I'd like to use to test a coreference resolution model. In this corpus, the coreference column is the 7th one. I tried scoring the corpus against itself, just to see whether the scorer would read it right, and I think it didn't, as all it gave me was empty output:

METRIC muc:
[none]:
====> :
File :
====> :
File :
Total key mentions: 0
Total response mentions: 0
Strictly correct identified mentions: 0
Partially correct identified mentions: 0
No identified: 0
Invented: 0
Recall: (0 / 0) 0%      Precision: (0 / 0) 0%   F1: 0%
--------------------------------------------------------------------------

====== TOTALS =======
Identification of Mentions: Recall: (0 / 0) 0%  Precision: (0 / 0) 0%   F1: 0%
--------------------------------------------------------------------------
Coreference: Recall: (0 / 0) 0% Precision: (0 / 0) 0%   F1: 0%
--------------------------------------------------------------------------

Now I'm stuck as to what I should do to make it read the file correctly. Should I add some empty columns until the coreference column reaches the one the scorer is looking at? And which column would that be?

Thank you so much; please let me know if there is any important info I forgot to add.
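For what it's worth, the scorer expects CoNLL-2012-style files, where (as far as I can tell) the coreference chain sits in the last column and documents are wrapped in `#begin document` / `#end document` markers. Rather than padding with empty columns, moving the 7th column to the end is usually enough; a sketch (the sample row is made up):

```python
def move_coref_last(line, coref_index=6):
    """Move the 0-based column `coref_index` to the end of the row."""
    if line.startswith("#") or not line.strip():
        return line                    # keep document markers and blanks
    cols = line.split()
    cols.append(cols.pop(coref_index))
    return "\t".join(cols)

# Made-up SemEval-style row with the coreference chain in column 7.
row = "doc1 0 1 Maria PROP _ (1) extra"
print(move_coref_last(row))
```

Applied line by line to both key and response files, self-against-self scoring should then report non-zero mentions.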

0 Comments
2024/05/02
06:31 UTC

6

What do you think is the state of the art technique for matching a piece of text to a reference database?

The problem I'm trying to solve is that I have new strings coming in that I haven't seen before that are synonyms for existing strings in my database. For example, if I have a table of city names and I receive the strings "Jefferson City, MO" or "Jeff City" or "Jefferson City, Miss" I want them all to match to "Jefferson City, Missouri."

I first tried solving this with fuzzy matching from the fuzzywuzzy library using Levenshtein distance and that worked pretty well as a first quick attempt.

Now that I have some more time I'm returning to the problem to use some more sophisticated techniques. I've been able to improve upon the fuzzy matching by using the SentenceTransformer library from HuggingFace to generate an embedding of the token. I also generate embeddings of all the tokens in the reference table. Then I use the faiss library to find the existing embedding that is closest to the new embedding. If you're interested I can share some python code in a comment.

My questions:

  1. Have you had success with a different approach or a similar approach but with some tweaks? For example, I just discovered the "Splink" library when doing some searching which seems promising but my input is mostly strings rather than tabular data.
  2. Do you think it's worth it to try to fine tune the sentence embeddings to fit my specific use case? If so, have you found any high quality tutorials covering how to get that working?
  3. Do you think it's worth it to introduce an element of attention to the embeddings? Continuing the example from above I might have "Jefferson City", "St. Louis", and "Kansas City" all in the same document and then if I get "Springfield" next it would be great to interpret that as "Springfield, MO" rather than a "Springfield" in another state. My understanding is that introducing attention can get me closer to that sort of logic -- has anyone had luck introducing that in a problem like this or have a high quality tutorial to link to?

I appreciate your input thank you very much!
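One option beyond using the two signals separately is blending them: combine the embedding similarity with a cheap character-level ratio so that near-string matches like "Jeff City" get a boost. A sketch with `difflib` as the string measure; the cosine scores standing in for the SentenceTransformer + faiss step are made up:

```python
import difflib

reference = ["Jefferson City, Missouri", "St. Louis, Missouri",
             "Kansas City, Missouri"]

def string_score(query, candidate):
    return difflib.SequenceMatcher(None, query.lower(), candidate.lower()).ratio()

def best_match(query, semantic_scores, alpha=0.5):
    """Blend a semantic score with a character-level ratio; alpha
    weights the two signals."""
    blended = {
        c: alpha * semantic_scores[c] + (1 - alpha) * string_score(query, c)
        for c in reference
    }
    return max(blended, key=blended.get)

# Hypothetical cosine scores as the SentenceTransformer + faiss step
# would produce them (the numbers are made up for illustration).
sem = {"Jefferson City, Missouri": 0.82, "St. Louis, Missouri": 0.41,
       "Kansas City, Missouri": 0.55}
print(best_match("Jeff City", sem))
```

For question 3, document-level context could enter the same way: a third score term measuring similarity between the candidate and the other entities already matched in the document.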

5 Comments
2024/05/01
19:19 UTC

2

Good way to represent model "needs" ?

I am testing a few different embedding models, and I understand some perform better if the embedded passages and queries are prefixed with the patterns they were trained with. Here is what I came up with to represent these "needs" (plus some additional data):
{
  "intfloat/multilingual-e5-large": {
    "docprompt": "passage: ",
    "queryprompt": "query: ",
    "embedding_length": 4096,
    "model_parameters": "118M",
    "context_length": 512
  },
  "distiluse-base-multilingual-cased-v2": {
    "docprompt": "",
    "queryprompt": "",
    "embedding_length": 512,
    "model_parameters": "135M",
    "context_length": 128
  },
  "paraphrase-multilingual-mpnet-base-v2": {
    "docprompt": "",
    "queryprompt": "",
    "embedding_length": 768,
    "model_parameters": "278M",
    "context_length": 128
  },
  "nomic-ai/nomic-embed-text-v1": {
    "docprompt": "",
    "queryprompt": "search_query: ",
    "st_additional_params": {
      "trust_remote_code": true
    },
    "embedding_length": 770,
    "model_parameters": "137M",
    "context_length": 8192
  }
}
So if I got things right, as an example, when creating vectors with multilingual-e5-large I should prepend "passage: " to my documents, and when vectorizing my query I should prefix it with "query: ".

Is there a simpler or more standard way of handling this, without reinventing the wheel?
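As a sanity check on the idea, applying the per-model prefixes from such a config might look like this (a trimmed, assumed version of the JSON above, not any library's API):

```python
# Trimmed version of the config above (assumed structure).
model_needs = {
    "intfloat/multilingual-e5-large": {"docprompt": "passage: ",
                                       "queryprompt": "query: "},
    "distiluse-base-multilingual-cased-v2": {"docprompt": "",
                                             "queryprompt": ""},
}

def prepare(texts, model_name, kind, needs=model_needs):
    """Prefix texts with the model's expected pattern before embedding.
    kind is 'doc' or 'query'."""
    prefix = needs[model_name][kind + "prompt"]
    return [prefix + t for t in texts]

print(prepare(["Rome is in Italy."], "intfloat/multilingual-e5-large", "doc"))
print(prepare(["capital of Italy"], "intfloat/multilingual-e5-large", "query"))
```

If I'm not mistaken, recent sentence-transformers releases also accept a `prompts` dict on the model and a `prompt_name` argument to `encode`, which may be the more standard route than rolling your own mapping.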

Thanks for any suggestions.

0 Comments
2024/05/01
07:25 UTC

2

How to benchmark for precision/recall of semantic retrieval

I am testing a handful of embedding models for semantic retrieval. These are the ones I've started with: nomic-ai/nomic-embed-text-v1, intfloat/multilingual-e5-large, distiluse-base-multilingual-cased-v2, paraphrase-multilingual-mpnet-base-v2.

I have vectorized a few thousand news articles with them, and then I type in queries, vectorize them with each model, and judge whether the retrieved articles make sense. All of this is very empirical and heavily manual/cumbersome.

Is there some better approach? Not sure whether it's fundamental to know, but the corpus/queries are in Italian.
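A common way to make this less manual is to build a small gold set of query-to-relevant-article judgments once, then score every model against it with precision@k and recall@k (tools like ir_measures or the BEIR benchmark codebase do this at scale). The metric itself is simple; the ranking and judgments below are made up:

```python
def precision_recall_at_k(retrieved, relevant, k):
    """retrieved: ranked doc ids; relevant: set of ids judged relevant."""
    top = retrieved[:k]
    hits = sum(1 for d in top if d in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Made-up ranking and judgments for one query.
retrieved = ["d3", "d7", "d1", "d9", "d2"]
relevant = {"d1", "d3", "d5"}
print(precision_recall_at_k(retrieved, relevant, k=3))  # (2/3, 2/3)
```

Averaged over, say, 30-50 Italian queries, this turns the per-model comparison into one number instead of repeated manual inspection.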

Thanks

0 Comments
2024/05/01
07:06 UTC

11

Multilabel text classification on unlabeled data

I'm curious what you all think about this approach to do text classification.

I have a bunch of texts varying between 20 and 2000+ words, each talking about different topics. I'd like to tag them with a fixed set of labels (about 8), e.g. "finance", "tech".

This set of data isn't labelled.

Thus my idea is to perform zero-shot classification with an LLM, treating each label as a binary classification problem.

That is, explain to the LLM what the "finance" topic means, and ask it to reply "yes" or "no" depending on whether the text is about that topic. If every label returns a "no", I'll label the text "others".

For validation we are thinking to manually label a very small sample (just 2 people working on this) to see how well it works.

Does this methodology make sense?

edit:

For more information: the text is human-transcribed text of shareholder meetings. I'm not sure whether something like a newspaper dataset could be used as a proxy dataset to train a classifier.
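The loop described can be sketched as follows, with a trivial stub in place of the real LLM call (swap in your provider's chat-completion API; the prompt wording is made up):

```python
LABELS = ["finance", "tech"]

def ask_llm(prompt):
    # Stub standing in for a real LLM call; replace with your
    # provider's chat-completion API.
    return "yes" if "revenue" in prompt and "finance" in prompt else "no"

def classify(text, labels=LABELS):
    """One binary yes/no question per label; 'others' if all say no."""
    tags = []
    for label in labels:
        prompt = (f"Topic definition: {label}. "
                  f"Does the following text discuss this topic? "
                  f"Answer yes or no.\nText: {text}")
        if ask_llm(prompt).strip().lower().startswith("yes"):
            tags.append(label)
    return tags or ["others"]

print(classify("Quarterly revenue grew by 12 percent."))  # ['finance']
print(classify("The weather was pleasant all week."))     # ['others']
```

One binary question per label naturally supports multilabel output, at the cost of N LLM calls per document; batching all labels into a single prompt is the usual optimisation once the per-label version is validated against the hand-labelled sample.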

16 Comments
2024/05/01
06:07 UTC

1

Seeking Advice: Integrating AI/NLP Error Detection into Existing VLE for Thesis Project

Hey everyone! I'm currently working on my project thesis, which involves developing AI and NLP techniques for automated error detection in teaching materials. I'm looking for advice on how to integrate this functionality into an existing Virtual Learning Environment (VLE) within a tight timeframe of 3 months.

If anyone has experience with integrating AI/NLP tools into VLEs, especially within a short timeframe, I'd love to hear about your approach and any tips or best practices you can share.

Additionally, I'm open to suggestions for tools or technologies that could expedite the development process. Are there any specific AI/NLP frameworks or platforms that you recommend for this type of project?

Thanks in advance for any insights or recommendations you can provide!

6 Comments
2024/04/30
21:24 UTC

1

language teachers: how many repetitions do you think a novice learner needs to solidify a concept in their memory?

short-term memory is less powerful than long-term retention, of course, but i’d still love any thoughts, opinions, and feedback!

5 Comments
2024/04/30
17:20 UTC

0

Using LLM models as classifiers for routing RAG chatbots? A long term plan, or how to improve?

I'm making a RAG chatbot for my company and I have basically zero data to work with, aside from what I can think of and create on my own. So not enough for a training dataset. But with the power of prompt engineering, a good LLM, and some software gluing it all together, I'm able to use my LLM to effectively classify user queries as one of several categories to route use cases to other chains.

As someone who studied ML and normal SWE, it feels weird to just replace what could/should have been an ML classifier, but realistically I can't use ML because I don't have data yet.

Is anybody else doing anything similar? I was thinking maybe I could use a pretrained transformer as the classifier for now and log chat usage data in production. Then, if we acquired enough data, I'd be able to train an ML model (or maybe fine-tune a smaller/cheaper LLM) to save cost and processing time.
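For what it's worth, the logging half of that plan is cheap to bolt onto the router now, so the training data accumulates from day one. A sketch, with a stub in place of the prompted LLM (category names and the guard logic are made up):

```python
import time

CATEGORIES = ["billing", "technical_support", "general"]
LOG = []  # becomes training data for a future supervised classifier

def llm_classify(query):
    # Stub for the prompted LLM call described in the post; swap in
    # your actual model and constrain it to answer with one category.
    return "billing" if "invoice" in query.lower() else "general"

def route(query):
    """Classify with the LLM, log the (query, label) pair, route."""
    category = llm_classify(query)
    if category not in CATEGORIES:
        category = "general"          # guard against free-text replies
    LOG.append({"ts": time.time(), "query": query, "label": category})
    return category

print(route("Where can I download my invoice?"))   # billing
print(route("Tell me about your company"))         # general
```

The guard clause matters in practice, since LLMs occasionally return something outside the allowed label set; logging the raw pairs (ideally with eventual human corrections) is what makes the later supervised model possible.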

3 Comments
2024/04/30
15:36 UTC

7

I made a text-game where all the LLMs trick each other pretending to be humans. They went crazy. (Video)

1 Comment
2024/04/30
14:22 UTC

5

ROUGE Score Explained

Hi there,

I've created a video here where I explain the ROUGE score, a popular metric used to evaluate summarization models.

I hope it may be of use to some of you out there. Feedback is more than welcomed! :)

0 Comments
2024/04/30
11:15 UTC

4

Help with fraud recognition

Hi everyone! I'm currently doing an internship at a local bank. The project I'm working on is, as the title says, automatic fraud detection, more precisely for bank transfers. I have these features:

  • Origin country
  • Amount
  • Description
  • IBAN code of the receiver
  • Name of the receiver
  • Channel
  • IP
  • Device ID
  • Receiving country
  • Receiving city

Each month of 2023 has a file with all bank transfers. Transfers tagged as fraudulent across the whole year number about 600, while non-fraudulent transfers total around a million.

Given this information, what strategy should I employ? Which algorithms suit my case best? And do you think the features I have are enough? So far, the best result was with Logistic Regression and ADASYN for resampling, but the number of false positives was far too high.
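One thing worth checking before changing models: with ~600 positives in ~1M rows, the decision threshold matters as much as the algorithm. Sweeping it over the predicted probabilities shows the precision/recall trade-off directly (the scores below are made up):

```python
def precision_recall(scores, labels, threshold):
    """scores: predicted fraud probabilities; labels: 1 = fraud."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Made-up scores from a classifier such as the logistic regression above.
scores = [0.95, 0.80, 0.60, 0.40, 0.30, 0.10]
labels = [1,    1,    0,    1,    0,    0]

for t in (0.5, 0.7, 0.9):
    print(t, precision_recall(scores, labels, t))
```

Raising the threshold cuts false positives at the cost of recall, which may be the better trade for a bank reviewing flagged transfers; gradient-boosted trees with class weights, or framing it as anomaly detection, are also commonly tried on this kind of imbalance.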

Thanks!

2 Comments
2024/04/30
07:38 UTC

2

Clustering Embeddings for Sub-Topic Extraction in RAG

Hey guys. I'm building a project that involves a RAG pipeline, and the retrieval part was pretty easy: just embed the chunks and call top-k retrieval.

Now I want to add a component that can identify the widest range of 'subtopics' in a big group of text chunks. For example, if I chunk and embed a paper on black holes, it should return the chunks covering the different subtopics in that paper, so I can then get the sub-topic of each chunk. (If I'm going about this wrong and there's a much easier way, let me know.)

I'm assuming the correct way to go about this is something like k-means clustering? Thing is, the vector database I'm currently using, Pinecone, is really easy to use but only supports top-k retrieval. What other options are there for something like this? Would appreciate any advice and guidance.
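If a copy of the embeddings is kept outside the vector store (or fetched back by id), clustering is only a few lines; in practice sklearn's KMeans or HDBSCAN would replace this toy version, which uses a simple deterministic init:

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(vectors, k, iters=10):
    """Plain k-means; sklearn's KMeans would replace this in practice."""
    step = len(vectors) // k
    centroids = [list(vectors[i * step]) for i in range(k)]  # simple init
    assign = [0] * len(vectors)
    for _ in range(iters):
        for i, v in enumerate(vectors):
            assign[i] = min(range(k), key=lambda c: dist(v, centroids[c]))
        for c in range(k):
            members = [v for i, v in enumerate(vectors) if assign[i] == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign

# Two obvious sub-topic blobs in a toy 2-d "embedding" space.
vecs = [[0.1, 0.0], [0.2, 0.1], [0.0, 0.2], [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
print(kmeans(vecs, k=2))  # [0, 0, 0, 1, 1, 1]
```

Each cluster's chunks can then be summarized or labelled to name the sub-topics; choosing k is the usual catch (silhouette scores, or HDBSCAN to avoid picking k at all). The clustering itself happens client-side, since top-k vector stores only answer nearest-neighbour queries.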

1 Comment
2024/04/30
07:00 UTC

1

How to do morphological analysis of Japanese text or at least tokenization / word segmentation on an android device?

Hi!

I'd like to split Japanese text into words (or morphemes, I guess), and I'd like to do it on the device itself; I'd prefer not to rely on some sort of web service. I also don't necessarily need PoS tagging or anything like that. Inserting spaces is basically all I really need; the rest is a welcome but optional bonus.

Unfortunately, I'm having a really hard time finding something like this. There are a few Java and Kotlin libraries, but the one Kotlin library I tried doesn't work on Android (at least my test app didn't work), and the Java libraries seem to want MeCab installed (although I haven't looked too deeply into those yet).

Is this just a really dumb idea to do this sort of thing on an android device or am I just missing the obvious solution here?

I know that a web service would probably be the easiest choice (python would make this super easy) but I was hoping I could keep the app without external dependencies.

Thanks for your Time

1 Comment
2024/04/30
06:49 UTC

1

Online Master degree at Universidad de la Rioja

Good evening, people! I've just graduated from university with a degree in Modern Languages, and I want to redirect my career towards a more technology-focused field. I want to study the online Master's degree in Language Processing and Artificial Intelligence at Universidad de La Rioja in Spain, but the thing is that this program is online. Has anyone here studied this program at this university? Do you recommend it? Is it easy to find a job after finishing the degree? Are the classes held synchronously, or do you have to read a webpage and study by yourself? I appreciate any information you can share on this topic. Goodbye!

2 Comments
2024/04/30
00:57 UTC

2

Sheffield vs UoM masters programme?

I've been accepted for both the computational linguistics & corpus linguistics programme at Uni of Manchester, and computer science with speech and language processing at Uni of Sheffield. I'm about to finish my undergrad in Linguistics.

I'd ideally like to go into industry rather than academia, but I can't decide which master's would be better for my future. I've had little experience in maths or programming since high school, but I've been accepted onto an intensive coding bootcamp over the summer, and I plan to take some maths courses in my free time.

The master's at UoM appeals to me because it's tailored toward linguistics students, so there won't be any assumed knowledge. However, it's a brand-new programme starting this year, so I don't know whether it'll lean more towards linguistics or computation.

The one at Sheffield seems like it'll give me more industry connections and is a well-established, long-running programme. Currently I'm leaning towards this one.

If anyone has any insights or opinions, I'd really appreciate it!

1 Comment
2024/04/29
18:18 UTC
