/r/datacleaning

Photograph via snooOG

Data scientists can spend up to 80 percent of their time correcting data errors before extracting value from the data.

We at /r/datacleaning are interested in data cleaning as a preprocessing step to data mining. This subreddit is focused on advances in data cleaning research, data cleaning algorithms, and data cleaning tools. Related topics that we are interested in include: databases, statistics, machine learning, data mining, AI, visualization, etc.

Garbage in, garbage out! Data scientists can spend up to 80 percent of their time correcting data errors before extracting value from the data.

We at /r/datacleaning are interested in data cleaning as a preprocessing step to data mining. This subreddit is focused on advances in data cleaning research, data cleaning algorithms, and data cleaning tools. Related topics that we are interested in include: databases, statistics, machine learning, data mining, AI, information theory, information retrieval, pattern recognition, NLP, data visualization, etc.

Related subreddits :

/r/machinelearning /r/etl

/r/datacleaning

4,617 Subscribers

1

Recreating a database from old exports. Can this be cleaned with Python?

https://preview.redd.it/oacbjt53jsce1.png?width=1129&format=png&auto=webp&s=660adecf6840abb3c509dc685900d29fbef7e792

I'm recreating an old database from the exported data. Many of the tables have "dirty" data. For example, one of the table exports for Descriptions split the description into several lines. There are over 650k lines, so correcting the export manually will take a very long time. I've attempted to clean the data with Python, but haven't succeeded. Is there a way to clean this kind of data with Python? And, more importantly, how?! Any tips are greatly appreciated!!

1 Comment
2025/01/13
17:16 UTC

2

Is Data Cleaning the Hardest Part of Data Analysis?

I've been observing my sister as she works on a data analysis project, and data cleaning is taking up most of her time. She’s struggling with it, and I’m curious—do you also find data cleaning the hardest part of data analysis? How do you handle the challenges of data cleaning efficiently? or is this a problem for every one

3 Comments
2025/01/08
16:02 UTC

0

Expert Data Cleaning Services | Boost Your Data Quality and Accuracy!

Is your data messy and incomplete? Let me help you clean it up and transform it into reliable, accurate insights! As a certified Data Analytics expert, I specialize in data cleaning using advanced tools like Python, Excel, and Power BI.

I can help you:

  • Remove duplicates and errors
  • Fill missing values
  • Standardize data formats
  • Clean and organize large datasets for analysis

With my Data Cleaning services, you’ll get high-quality data ready for analysis, helping you make smarter business decisions. Get in touch now for a free consultation or quote!

Contact - truedatamate@gmail.com

#DataCleaning #DataAnalytics #Excel #PowerBI #Python #DataTransformation #CleanData #DataInsights #BigData #BusinessIntelligence #DataScience #DataAnalysis #Freelancer #AI #DataExperts #MachineLearning,CleanUpData

0 Comments
2024/12/24
04:50 UTC

1

Need Help with Mapping Vague Model data(in CSV) to a JSON File with Specific Boat Manufacturers and Models?

Hi everyone,

I'm working on a data-cleaning project and need some guidance. I have two datasets:

Real Data(JSON): This file contains a structured list of boat manufacturers and their respective models.

[Link] drive.google.com/file/d/1G5xL1ruUeZDazGDgM2RzRmctZeJV5ltv/view?usp=drive_link

Unmapped Data (CSV): This file contains less structured and often vague information about boats, including incomplete or inconsistent manufacturer and model details.

[Link] drive.google.com/file/d/18yHZztu3P7Rd-rXusdvh2wob2e7Q1vaz/view

Goal:
I want to map the data in the CSV file to the JSON file as accurately as possible, so I can standardize the vague entries in the CSV to match the structured data in the JSON.

Challenges:
The CSV data is inconsistent; manufacturer names might be misspelled, abbreviated, or slightly different from the ones in the JSON.

Some model details in the CSV are partial or unclear.

There are many entries, so manual mapping isn’t feasible.

What I’ve Tried:
- Experimenting with fuzzy string matching (fuzzywuzzy or rapidfuzz libraries).
- Looking for exact matches but finding the results too limited.

What I Need Help With:
- What’s the best approach to clean and map this data programmatically?
- Are there any specific tools, libraries, or techniques that can handle such mapping efficiently?
- Any advice on dealing with edge cases, like multiple possible matches or missing data?

I’d appreciate any insights, code snippets, or resources that could help me solve this problem.

Thanks in advance!

0 Comments
2024/12/11
11:57 UTC

0

ow can i implement a new lemmatizing function from scratch

hello good people i am a student at computer science engineering and i have homework at data retrieval field

using Python and i am not that much with this kind of programming language

but the main thing i want to say is how I should implement a steeming function from scratch without using nltk library because my doctor wants us to build it in the homework could anyone tell me where should i start and what I should do i searched everywhere in the google and with no benefits everything talks about the function in the nltk library

what should i do?

thanks for any help

sorry for my bad English

0 Comments
2024/11/18
13:38 UTC

1

DATA CLEANING HELP

Ive just started DATA SCIENCE. Like ive done Numpy, Pandas, Seaborn, Sklearn and some other libraries... and ive also done Machine learning(learned algos). And now i wanna start doing project. Whenever i sit to do project, i get stuck by DATA CLEANING PROCESS! So, anyone could you share how to go ahead in this situation, if youve any good resource related to data cleaning please help me with that too...! THANKS!

2 Comments
2024/11/06
11:01 UTC

0

Need a mentor

Hi guys! Urgent need a mentor who can give me tasks from Data cleaning to visualization. I never studied data analytics formely, just studied from YouTube. Need help, I am counting on this reddit community.

0 Comments
2024/10/27
07:38 UTC

2

Tips cleaning this dictionary?

I don't know if this is the right place for this but I need help cleaning this old dictionary, it is the only dictionary my native language has as of now. I want to make an app from it.

I discovered this pdf from an internet Archive as I had been looking for it for a while. This seems to be a digitized version of the physical copy.

The text can be copied but one letter doesn't copy properly, it is mistaken for other letters like V and U, which is the Ʋ letter I have pointed an arrow to. These days that letter is written with a Ŵ.

The dictionary goes from Tumbuka to Tonga to English and then flips at some point to go from English to Tonga to Tumbuka.

I only want the Tumbuka to English pairs and vice-versa ignoring the Tonga so I make a mobile app more easily.

Here is a link to the dictionary

4 Comments
2024/10/25
18:52 UTC

0

FREE email data cleaning (no catch)

Hi all,

It’s time for us to give back to the Reddit communities we love so much.

Normally when creating an account on Listcleaner.net you get 100 free cleaning credits to try our email cleaning service.

Right now we want to give 25 users of the r/datacleaning subreddit not 100, but 1000 credits to clean your email data, when creating an account.

You DO NOT have to buy anything, and the only contact information required to create your account on Listcleaner.net is your email address.

After creating an account, please tell us via DM your Listcleaner accounts username or email address and we will add the credits to your account.

The credits can be used on our website and via our API.

Happy email cleaning!
The Listcleaner.net team

0 Comments
2024/10/24
13:24 UTC

2 Comments
2024/10/03
07:36 UTC

0

what's the most common dirty data problem?

when working with dirty data, what data issues have you run into the most? what's important to look out for? do your tools look out for these things or do you have to manually build out these checks?

6 Comments
2024/09/09
17:36 UTC

3

How do y'all find datasets for cleaning practice?

I've been trying to find datasets to practice my cleaning skills and I find datasets already clean. Also if there's a way to find datasets to clean above a million rows that'll be so helpful!

2 Comments
2024/09/09
17:18 UTC

1

Tool to write data cleaning scripts in python from natural language. Thoughts & feedback? (Roasting is accepted & appreciated here)

0 Comments
2024/07/28
21:55 UTC

2

Need Help, Suggestions, and Feedbacks

Hi Guys,

Let's keep it short,

I want to learn data cleaning using Power Query/Power Bi and Pandas (Python)

But the problem is that I've no mentor or someone who can check my cleaned and processed data. Like I don't even know if I am cleaning the data appropriately or not.

Please tell me guys how this subreddit can be helpful.

Please help. I'm desperate for help!

2 Comments
2024/07/28
15:06 UTC

1

[PROMO] The NASA Breath Diagnostics Challenge - $55,000 in prizes

https://bitgrit.net/competition/22

The challenge tasks solvers to leverage their expertise to develop a classification model that can accurately discriminate between the breath of COVID-positive and COVID-negative individuals, using existing data. The ultimate goal is to improve the accuracy of the NASA E-Nose device as a potential clinical tool that would provide diagnostic results based on the molecular composition of human breath

0 Comments
2024/07/07
02:47 UTC

2

Cleaning rows with typos

I have a table in Excel filled with typos. For example: Row1: obi LLC, US, SC, 29418, Charlestone, id5 Row2: obi company, US, SC, 29418, Charlestone, id4 Row3: obi gmbh, US, SC, 29418, Charlestone, id3 Row4: obi, US, SC, 29418, Charlestone, id2 Row5: Obi LLC, US, SC, 59418, Charlestone, id1 Row6: Starbucks, US, SC, 1111, Budapest, id9 Row7: Starbucks kft, HU, BP, 1111, Budapest, id8 Row8: Starbucks, HU, BP, 1111, Budapest, id7

The correct rows here are row1 and row8 because their values occur most frequently in the table. I want to create a new table with only the correct directions. The expectation is to assign the standardized value to each row based on its relationship. It's important to consider not only the name but also the name/country/state/zip code/city combination. Fuzzy matching wouldn't work, because I don't have a list with the correct data. I initially tried using VBA, but I only managed to list the one row that occurred most frequently (in this case row 1). I can copy my code if necessary. Have you ever cleaned such messy data? What would you recommend? Thank you for your advice

2 Comments
2024/05/27
11:33 UTC

3

I was so tired of cleaning crappy data, so I made a tool

Hey guys, I think this might be very relevant in this sub. Lately, I was working on a tool to clean any textual data. In a nutshell it can convert inconsistent data like this (see all names are different and hard to analyse):

See first column

Into something like this:

See generated columns

https://data-cleaning.com

I'm actively looking for feedback and whether this meets someones needs / needs to be changed for your specific case. Please let me know what you think!

0 Comments
2024/05/10
19:33 UTC

1

help how to organize this column ?

I have a column named ' informations ' and it has the information of used cars, and this column has an attribute and her value seperated by a comma ( , ) but in the same cell i have multiple attribute and the values like this one :

,Puissance fiscale,4,Boîte de vitesse,Manuelle,Carburant,Essence,Année,2013,Kilométrage,120000,Model,I20,Couleur,bleu,Marque de voiture,Hyundai,Cylindrée,1.2

as you can that is a single cell ine the 1st line in the column named informations

Puissance fiscale has 4 as a value
boite de vitesse has manuelle as a value
ETC

NB: i have around 9000 line and not everyline have the same structure as this

5 Comments
2024/05/02
09:27 UTC

2

Decoding data classification: A simplified yet comprehensive handbook

In today's data-driven world, where data breaches are a constant threat, safeguarding your organization's sensitive information is paramount. Learn how to implement robust data classification processes and explore top tools for securing your data from our blog.

Explore now: https://www.infovision.com/blog/decoding-data-classification-simplified-yet-comprehensive-handbook

#CyberThreats
#DataClassification
#DataBreaches

0 Comments
2024/05/02
05:18 UTC

0

What does it imply when the total cost is negative, the unit selling price is positive and the order is 0? I am trying to clean data in Excel.

ORDER QUANTITY | UNIT SELLING PRICE| TOTAL COST

0 | 151.47 | -86.9076

0 | 690.89 | -1002.1401

0 | 822.75 | -978.8337

I am trying to clean a dataset and wanted to understand if it makes sense or if I should delete it from the table. There are about 28% of total entries with such data. It won't make sense to delete 28% either. Please drop your suggestions and understanding.

0 Comments
2024/04/06
18:04 UTC

1

Pricing Inquiry for Data Cleaning and Analysis Service with Databricks and PySpark Expertise

Hello,

I'm currently exploring options for professional data cleaning and analysis services, particularly those utilizing Databricks and PySpark expertise. I have a dataset that requires thorough cleaning to address inconsistencies and erroneous data, followed by in-depth analysis to extract valuable insights for my business.

Here's a breakdown of the tasks I'm looking to outsource:

  1. Initial Evaluation: Assessing my dataset to identify data quality issues.
  2. Data Cleaning: Applying advanced data cleaning techniques to rectify inconsistencies and erroneous data.
  3. Databricks Analysis: Utilizing Databricks for large-scale data analysis, optimizing processing performance.
  4. PySpark Development: Writing PySpark scripts for efficient processing and analysis of distributed data.
  5. Reporting and Insights: Generating detailed reports and providing insights based on the analysis performed.
  6. Continuous Optimization: Recommending strategies for ongoing improvement of data quality and analysis processes.

I understand that the cost of such services can vary depending on factors such as the complexity of the dataset, the volume of data, and the specific requirements of the analysis. However, I would appreciate any ballpark estimates or insights from forum members who have experience with similar projects.

Additionally, if you have recommendations for reputable service providers or consultants specializing in data cleaning and analysis with Databricks and PySpark, please feel free to share them.

Thank you in advance for your assistance!

1 Comment
2024/03/23
03:26 UTC

3

Cleaning header/footer text from OCR data

Hello! I have a collection of OCR text from about a million journal articles and would appreciate any input on how I can best clean it.

First, a bit about the format of the data: each article is stored as an array of strings where each string is the OCR output for each page of the article. The goal is to have a single large string for each article, but before concatenating the strings in these arrays, some cleaning needs to be done at the start and end of each string. Because we're talking about raw OCR output, and many journals have things like journal titles, page numbers, article titles, author names, etc. at the top and/or bottom of each page, and those have to be removed first.

The real problem, however, is that there is just so much variation in how journals do this. For example, some alternate between journal title and article tile at the top of each page with page numbers at the bottom, some alternate between page numbers being at the top and the bottom of each page, and the list goes on. (So far, I've identified 10 different patterns just from examining 20 arrays.) This is further complicated by most articles having different first and sometimes last pages, tables and captions, etc.

At this point, I could keep going to identify patterns, write some regex to detect what pattern is present, then clean accordingly. But I also wonder if there's a more general approach, like searching for some kind of regularity, either across pages or (more commonly) every other page, but I'm not quite sure how I should approach this task.

Any suggestions would be greatly appreciated!

1 Comment
2024/03/07
22:24 UTC

7

Looking to create a "Clean Data" definition

Hi,

Just wondering what requirements or checklist items people would suggest for a definition of Clean Data ready to be used in machine learning? Akin to "tidy data", but for modelling. I.e.

  • There should be no string fields. All data should be either in a numeric form, or as a categorical data type

etc

I know this will likely be opinionated, hence wanting to "crowd source" it 😃

Feel free to disagree with any statements, as I imagine there will be differences

1 Comment
2024/02/29
12:39 UTC

1

Data Cleaning Freelancer

Hey everyone,

I'm a sophomore studying data science and I've been digging into ways to earn money online. I stumbled upon the idea of freelancing my data cleaning skills, and it seems like an exciting avenue. Though I'm still learning, I'm a quick learner and confident that I can get proficient in data cleaning soon.

I'm keen to get hands-on experience and was wondering if anyone would be open to taking me under their wing as an apprentice or offering advice on where to begin.

While I'm still early in my studies, I've worked on a few exploratory data analyses for my classes. These involved cleaning data and using RStudio to create graphs.

I'm eager to turn this interest into a reality. Any guidance or tips on how to kickstart a career in freelancing data cleaning would be hugely appreciated!

Thanks in advance for any help or advice you can offer!

1 Comment
2023/12/23
22:23 UTC

2

AI tool to extract product characteristics

Hello everyone,

I am trying to clean up some data from our ERP systems regarding our items. I am working for a furniture company, we do have different characteristics that compose a product (size/timber/fabric and so on). So far, those characteristics has been input all in one description field. I'd like to extract those information and assign it to the new correct field (one field per characteristic). Maybe some AI tools might be able to help in that process? I am not a developer / technical IT.

1 Comment
2023/12/05
03:11 UTC

1

Success Story: How data cleaning tools helped support my project

Disclaimer: This is a personal project I did, made possible with RPA (UiPath web scraping). The stats come from SA Rugby website & I developed automation flows to get the stats, player bio & profile pictures from the same website. I used PowerQuery to transform the output & to debug issues & finally Tableau for visualisation. I highly recommend getting comfortable with Power Query, you can do so much with it!

Hi everyone, I'd like to share a personal project I did about the Springboks RWC Campaign. I'd love to get your feedback as PowerBI people, to get your unique perspective. We only use Tableau at so I thought I'd overcome confirmation bias by getting your guys' opinions.

The project is basically match stats for all the games the Springboks played in all championships in 2023. You can see those who are consistently performing well. The stats come from SA Rugby

Each match has highlight reels of the players' game contributions (71 total). The project also covers all the matches that the Boks under Rassie have played NZ (5 Wins, 5 Losses & 1 Draw).

Ultimately, the project shows how tough this World Cup was & the pressure the team faced, especially in the knockout phases.

PS. I think this would be great for those new to rugby, since it covers the biggest matches in the sport with highlight reels to see the entertaining stuff.

You can check out the full work here: https://public.tableau.com/views/Springboks2023RugbyWorldCupCampaign/TheSpringboks2023Campaign?:language=en-US&:display_count=n&:origin=viz_share_link

Final vs NZ

Semi Final vs England

Quarter Final vs France

0 Comments
2023/11/06
17:19 UTC

2

"Cleaning Call Center Data: Seeking Guidance/Help"

Hello everyone,

I am currently working on a call center trend dashboard project, and I've encountered an issue with multiple blank cells in the data. I'm unsure about the best approach to handle this. Should I delete rows with multiple blank cells, or should I use statistics to fill these blank cells?

I would greatly appreciate your guidance and suggestions on this matter. Your assistance would be invaluable. Thank you in advance!

Project Task :

Create a dashboard in Power BI for Claire that reflects all relevant Key Performance Indicators (KPIs) and metrics in the dataset

Possible KPIs include (to get you started, but not limited to):

  • Overall customer satisfaction
  • Overall calls answered/abandoned
  • Calls by time
  • Average speed of answer
  • Agent’s performance quadrant -> average handle time (talk duration) vs calls answered

Some info about data:

Total rows-5000

Total column :10

snapshot of data

"Total rows having missing values: 946 Each of the 946 rows has 3 blank/missing cells.

Please guide me on the approach I should take to clean this data.

Note: The blank column is just a temporary column used to check how many cells are blank in each row."

TL;DR:Seeking advice on handling data with many missing values (946 rows, 3 blank cells each) for a call center trend dashboard project. Also, tasked with creating a Power BI dashboard for Claire, highlighting KPIs and metrics. Please assist. Thanks!

2 Comments
2023/11/06
07:52 UTC

1

How to handle missing categorical values with more than 5% missing data?

I am upskilling in the field of data science. Recently started practicing on Kaggle datasets. Picked up a dataset which have more categorical columns than numerical and these columns have more that 5% (upto 60% null values in some columns) null values. I am confused about what technique to use on them. Cannot find resources where handling object columns specifically is focused upon. Any help please? can anyone suggest a book or website or just tell me how to proceed with this?

2 Comments
2023/07/12
19:43 UTC

5

How to get started with python for data analysis?

If you're embarking on the odyssey of studying Python data analysis, commence by acquiring a mastery of the rudiments of Python programming.

Once you've attained a level of proficiency with Python, plunge into the depths of indispensable libraries such as NumPy for numerical computation and Pandas for data manipulation. Engage in practical exercises utilizing authentic datasets to accrue experiential knowledge, and refine your prowess in data visualization employing Matplotlib and Seaborn.

Delve into the realm of statistical analysis using the comprehensive tools provided by SciPy, and contemplate augmenting your skill set with other pertinent libraries such as scikit-learn for machine learning. Engross yourself in online communities, undertake ambitious projects, and perpetually pursue learning and diligent practice to ascend to a zenith of expertise in Python data analysis—a gratifying pursuit that unveils the portals to unearthing invaluable insights from data.

To get you started, I will highly recommend you look at these articles.

Exploratory Data Analysis and visualization practical example:

https://link.medium.com/FYuBpTyvCAb

Data cleaning with python (a practical example)

https://link.medium.com/GBsdtEFvCAb

How to make data Visualization in python

https://link.medium.com/6rWH2nKvCAb

Python data cleaning made easy

https://link.medium.com/6rWH2nKvCAb

Sales Statistical analysis with python

https://link.medium.com/ZGx7NDRvCAb

https://link.medium.com/OidaOBUvCAb

Python Web App Development: Unleashing the Power of Simplicity and Flexibility

https://medium.com/@mondoa/python-web-app-development-unleashing-the-power-of-simplicity-and-flexibility-d34e9d1bd658

Enhancing Your Web Application with Python’s Data Analysis Tools

https://medium.com/@mondoa/enhancing-your-web-application-with-pythons-data-analysis-tools-2ecd0af29027

The Ultimate Python 3 Guide: Everything You Need to Know

https://medium.com/@mondoa/enhancing-a-comprehensive-python-3-tutorial-b8102f0cfcc4

0 Comments
2023/07/01
18:34 UTC

Back To Top