/r/datacurator
A place for us less messy data hoarders.
/r/datacurator is the place for discussion about the curation of digital data. Be it sorting, file formats, file encoding, best practices, discussion of your setup, tips and tricks, asking for help etc.
I need to digitize my whole physical archive of diplomas, medical documents, bills, records, etc.
I have an Epson V800 Perfection and about 2TB of lifetime storage on pCloud.
Thanks!
Please use this thread to discuss and ask questions about the curation of your digital data.
This thread is sorted to "new" so as to see the newest posts.
For a subreddit devoted to storage of data, backups, accessing your data over a network etc, please check out r/DataHoarder.
Something I've noticed, both when joining a new company with some older guys in IT and when looking at the PCs of friends who took over the files of late family members, is folders named "$$$$" or "§§§§" or something similar.
I have also used special characters to make some folders sort to the very top alphabetically. I mainly use this for technical stuff, or for a general directory where I put things I want to sort into folders later.
I'm surprised to see this more often lately in the file systems of older people I get access to. Was this something people were taught in the past about organizing their systems? I couldn't find anything about it on Google. I'm just curious whether there is a story behind it, or whether so many people independently arrive at the same practical conclusion.
I used to be meticulous about organizing files. But I get busy and lazy about what category this or that falls into... it drops into a single generic "request" folder. Then emails, I give up.
Now? I have two folders, one with final products and one with working versions, and that's really it. I rely entirely on the naming convention of the files to search, plus the fact that I know the timeline of when I saved the work, so it's quick for me to search among the files to find things.
It's not perfect but, honestly, I took just as long sometimes trying to remember the file path I used to save things since that was a compromise too. It relied on the way I thought something should be categorized.
Am I insane for doing this? I haven't lost any files. It doesn't seem to take me any longer to find files. It is a bit distressing when I look at the list and it's most embarrassing when others see the file structure I suppose. But it's also quicker every time I save something. I feel like that time saved is constant.
Any ways to improve this approach further if I wanted to go all-in and ever have to explain myself to others, ha?
Sorry if this isn't the right place to post about this. Wasn't sure where else to go.
I also noticed the wiki hasn't been updated in years and the person who wrote it deleted their Reddit account. Has this subreddit been abandoned to the wolves?
I'm looking for a job similar to the Data Curator position at Veeva Systems (Matching team), at a similar company.
Is anybody familiar with a company like this?
In my spare time, I've been working on archiving a thread of articles from Backstreets Ticket Exchange (a Springsteen fan forum). These articles were reproduced in the thread over the course of 11 years or so. Many of them are available only in print or only on websites that are now dead.
The forum has been in danger of shutting down for about a year or so now, which is why I've undertaken this effort.
I managed to grab them all (about 1,000 of them), and have each article in its own file. Now I'm just struggling with organizing/renaming all of them.
I figured on sorting them into folders by category (album/concert review, commentary, essay, etc.), but then renaming would be a different story and I'm not sure how to go about it.
I figured something like `YYYY-MM-DD_Author(s)_Source_Title.ext` would work, but then there's a number of them with really long titles or author lists. Would those get truncated?
Is there a general "standard" for this kind of thing? Or has anyone undertaken a similar project?
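On the truncation worry: most common filesystems cap a single filename at around 255 bytes or characters, so a name can be too long to save at all if nothing trims it. Here is a minimal sketch of the `YYYY-MM-DD_Author(s)_Source_Title.ext` pattern that trims deliberately instead. The 120-character budget, the "et-al" rule for long author lists, and the names `slug` and `article_filename` are all assumptions made for illustration, not an established standard.

```python
import re
import unicodedata
from datetime import date

MAX_NAME_LEN = 120  # assumed budget, well under the usual 255 limit


def slug(text: str) -> str:
    """Reduce text to ASCII letters and digits joined by hyphens."""
    ascii_text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode()
    return re.sub(r"[^A-Za-z0-9]+", "-", ascii_text).strip("-")


def article_filename(published: date, authors: list[str], source: str,
                     title: str, ext: str = "txt") -> str:
    """Build YYYY-MM-DD_Author(s)_Source_Title.ext, trimming only the title."""
    if len(authors) > 2:
        # Assumed rule: collapse long author lists to the first author.
        author_part = slug(authors[0]) + "-et-al"
    else:
        author_part = "-".join(slug(a) for a in authors)
    prefix = f"{published.isoformat()}_{author_part}_{slug(source)}_"
    # Whatever room is left after the fixed fields goes to the title.
    room = MAX_NAME_LEN - len(prefix) - len(ext) - 1
    title_part = slug(title)[:max(room, 0)].rstrip("-")
    return f"{prefix}{title_part}.{ext}"


if __name__ == "__main__":
    # Made-up article, purely for illustration.
    print(article_filename(date(2009, 5, 1), ["Jane Doe"],
                           "Example Magazine", "A Night to Remember!"))
    # 2009-05-01_Jane-Doe_Example-Magazine_A-Night-to-Remember.txt
```

Because the date, author, and source come first, a trimmed title still sorts and searches sensibly. The full title can be kept inside the file itself.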
I don’t mind paying, but it’s like 500 random pages I don’t feel like manually sorting and labeling. I just skimmed through it, and it’s every tax return since 92, every promotion my mom got, documents from when I got my gallbladder removed in 02, my grandpa's dd214, grandpa's death certificate, all our birth certificates, my dd14 and my military promotions, receipts from our new roof, the warranties for our fridge, washer, dryer, etc., our boiler replacement, and so on.
I'd like it to automatically make folders, like one for appliance warranties, another for tax returns, etc. is that
I'm reorganizing my folder structure and trying to figure out the best way to categorize files. Some are short, practical guides (e.g., a manual for fixing engines), while others are long, detailed resources (e.g., a comprehensive survival guide or books about WW2).
I'm unsure how to decide what counts as a "document" versus a "book." Should the distinction be based on length, purpose, or something else entirely?
Additionally, what would be the best folder structure to accommodate both types of files? Should I have separate folders for "Documents" and "Books," or combine them into a single folder with subcategories?
I'd love to hear how others approach this kind of organization!
If my files weren't so interconnected with automatically generated files, I would probably find organizing much easier. I have Blender projects and coding projects. I attached an image of my C:\users\me. There's stuff I created manually, like Projects and portable apps, but it's mixed in with a lot of autogenerated files. Also, are there any templates I can model mine on that take autogenerated files into account?
So I moved my files from the old HDD to the new HDD, and I want to check if there are any corrupted files that appeared during the process, or if there are any corrupted file/video on the old HDD (there are about 200k files, so I can’t check each one).
I need an app that checks video or photo files for playability issues. I also need a modern-looking (highly preferred but not necessary) app that can check for corrupted files in a huge batch (it includes non-media files too, by the way)
(I might also need another app that fixes those files.)
Also, some of the videos have names like VTS_01_1.vob and report a playing length of 14 seconds, but the video continues past those 14 seconds. Any idea how to fix that? (They might have been extracted from an old DVD to an old hard disk about 10 years ago.) If I were to convert the video to another format like .mp4, would that solve the problem, and would I lose any data in the process?
Also, if this isn’t the right place to ask the second question, any idea where I should ask it?
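For the copy-verification part of the question, one common approach is to hash every file on the old drive and compare it with the file at the same relative path on the new drive. The sketch below uses only the Python standard library. The function names and the report layout are made up for illustration, and it assumes both drives are mounted and share the same folder structure.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of a file, read in chunks so large videos do not fill memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()


def compare_trees(old_root: Path, new_root: Path) -> dict[str, list[str]]:
    """Compare each file under old_root with the same relative path under new_root."""
    report = {"missing": [], "mismatch": [], "unreadable": []}
    for old_file in sorted(p for p in old_root.rglob("*") if p.is_file()):
        rel = old_file.relative_to(old_root)
        new_file = new_root / rel
        if not new_file.is_file():
            report["missing"].append(rel.as_posix())
            continue
        try:
            if file_digest(old_file) != file_digest(new_file):
                report["mismatch"].append(rel.as_posix())
        except OSError:
            # A read error on either drive is itself a warning sign.
            report["unreadable"].append(rel.as_posix())
    return report
```

Matching hashes only show that the copy is identical to the original. A file that was already damaged on the old drive will copy over "successfully". Checking playability needs an actual decoder. One commonly used option is to have ffmpeg decode each file to a null output (`ffmpeg -v error -i FILE -f null -`) and look for reported errors.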
/usr/bin/find -printf, but I also export and load them in other programs like voidtools-everything, wiztree, ncdu (json), etc.
Hi everyone,
This subreddit is like a goldmine, and it got me thinking about how valuable curated information on data curation itself could be. I’m on the hunt for books, articles, and other resources that provide coherent, systematic approaches to the following topics:
If you know of any resources that cover these areas in a structured and practical way - books, articles, blog posts, or anything else - I would love to hear your recommendations. Tools or courses that explore these ideas would also be appreciated.
Thanks for any input!
I have been setting up a VPS with Docker on Debian 12. I want to use this server as a compute platform to host several applications: third-party applications such as Twenty CRM, Kuma Uptime, etc., my own custom in-house applications, which may be Python or PHP, and several websites, typically static sites made with Jekyll.
I have been mostly using docker-compose.
I want to learn how to organize this host properly so that it is easy to maintain and manage, and to keep anything needed to bootstrap a replacement host separate from all the generated stuff. What I mean is, let's say I need to switch hosting providers and rent a VPS somewhere else. I want to be confident I have all config, code, etc. in version control, so that I just need to copy over the data folder/database dumps, check out the apps and config from version control, and then run a script or two to entirely configure the host and containers...
I would like your advice on how to handle deployment of my apps, websites, etc. How to handle having dev and prod versions of each app. How to package and deploy my apps. How to organise my repos.
I would like specific recommendations, such as a directory structure for where to store working copies (I use SVN), docker-compose files, etc.
What to put in version control, what not to.
How to organize nginx configurations, firewall settings, etc.
Would this directory structure make sense?
/opt/apps/                      # Main directory for all applications
    third_party/                # For third-party applications
        twenty_crm/             # Directory for Twenty CRM app
        kuma_uptime/            # Directory for Kuma Uptime app
    custom/                     # For custom in-house applications
        my_python_app/          # Example Python app
        my_php_app/             # Example PHP app
    websites/                   # For static websites
        site1/                  # Example static site 1
        site2/                  # Example static site 2
/docker/                        # Directory for Docker-related configurations
    compose-files/              # Docker Compose files for each service
    images/                     # Custom Docker images, if needed
/srv/data/                      # For persistent application data
/srv/logs/                      # Centralized log storage
/etc/nginx/sites-available/     # Nginx configuration files
/etc/nginx/sites-enabled/       # Symlinks to active Nginx configurations
For version control, I am considering a layout such as this:
/trunk/
    apps/
        my_python_app/
        my_php_app/
    websites/
        site1/
        site2/
/branches/
/tags/
Not sure how to handle secrets...
If this does not belong here, I really hope you can point me in the right direction. The reason I find this relevant here is that I think this is mostly about how to organise the structure of these things and not so much how to actually configure and script stuff. I believe most of you in here have the right mindset and experience to know how to do this.
As a dad, a student, and a researcher I have been asking myself:
"Isn't there a better way to easily organize my downloads and files into proper folders and give them proper names so I can easily find them?"
I wanted to know if this was also a problem for anyone else.
Having to always manually go into my downloads to keep things organized.
I wish I could make custom Rules for my downloads so that anytime I download something, it goes into its respective folder.
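The custom rules idea can be sketched in a few lines of Python. Everything here is hypothetical: the rule table, the folder names, and the function names are placeholders, and a real setup would probably want rules on more than the file extension (name patterns, source site, and so on).

```python
import shutil
from pathlib import Path

# Hypothetical rules: file extension -> destination subfolder.
RULES = {
    ".pdf": "Documents",
    ".jpg": "Pictures",
    ".png": "Pictures",
    ".zip": "Archives",
}
DEFAULT_FOLDER = "Unsorted"


def destination_for(file: Path, rules: dict[str, str] = RULES) -> str:
    """Pick a subfolder from the file extension, ignoring case."""
    return rules.get(file.suffix.lower(), DEFAULT_FOLDER)


def sort_downloads(downloads: Path, target_root: Path,
                   dry_run: bool = True) -> list[tuple[Path, Path]]:
    """Plan, and optionally perform, a move for each file directly inside downloads."""
    moves = []
    for file in sorted(downloads.iterdir()):
        if not file.is_file():
            continue
        dest = target_root / destination_for(file) / file.name
        if dest.exists():
            # Never overwrite: leave name clashes for a human to resolve.
            continue
        moves.append((file, dest))
        if not dry_run:
            dest.parent.mkdir(parents=True, exist_ok=True)
            shutil.move(str(file), str(dest))
    return moves
```

With `dry_run=True` (the default here) the function only returns the planned moves, so the rules can be checked before anything is touched. Running it on a schedule, or from a folder-watching tool, would give the "anytime I download something" behaviour.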
I have a collection of memes and other media. It takes me about 1 hour to organize about 1k files, which is OK, but that's only putting them into folders (e.g. technology memes, fitness memes, esoteric memes, etc.).
Because of that, I run into the classic "file can be in 2 different folders" problem, and I can't be hyper-specific when I need to find a file quickly. That's where tags (or even renaming) would come in handy, but it would probably take waaaaay longer to tag all those files, and after a certain point I feel like it isn't worth it. Curation is supposed to make your life easier. Using AI to organize stuff would probably save some people time.
So how long does it take to tag your files? Was it worth it?
+ Unless there is a problem or I forget, I'm planning to update this over time as well.
+Bookmarks
The bookmarks function itself: I use it to back up tab windows by right-clicking and choosing to bookmark the window. That lets me access all my tabs from my phone and manage another huge folder hierarchy in the browser.
https://chromewebstore.google.com/detail/bookmark-dupes/ombpkjoelcapenbepmgifadkgpokfgfd
https://chromewebstore.google.com/detail/bookmarks-clean-up/oncbjlgldmiagjophlhobkogeladjijl
The two above are for detecting duplicate bookmarks.
https://chromewebstore.google.com/detail/rewind/oghafdocdmlkkjipdmnikdcgekjpiapf
This one is for figuring out when you bookmarked something, in case you ever need to know.
+Tabs
https://chromewebstore.google.com/detail/session-buddy/edacconmaakjimmfgnblocblbcdcpbko
I only discovered this one recently, so I'm no expert with it, but I use it for various purposes, including backing up tabs.
https://chromewebstore.google.com/detail/tab-manager-plus-for-chro/cnkdjjdmfiffagllbiiilooaoofcoeff
This is the one I used before. It's more visual for seeing many tabs at once, and better in a few other ways (it also shows more duplicate tabs by comparison, for some reason I don't know yet).
https://chromewebstore.google.com/detail/wayback-machine/fpnmgdkabkmnadcjpehmlllkndpkmiak
For backing up tabs in case something gets deleted (a YouTube video, for example).
So far I’ve been using a private Discord user as my own dump for content I wanted to save (URLs, videos to watch later, memes, etc.), but I’ve realized this probably isn’t the most secure. What works similarly to Discord and lets me organize and save content? I would also appreciate it being cross-platform, since I have an iPhone but use a Windows desktop, so something like Apple Notes wouldn’t work well.
I have a decent system for my documents and media, but I'm struggling a little with how best to save local copies of important reference articles (not scholarly-type works that often have reference systems built in) and how to find them. Link rot is a real thing and I fully expect it to get worse. Also, I'd like to clear out my browser tabs lol.
My initial thought, for longevity, is to just save the text of the article in a .txt file, with a filename of the originalHeadline_author_date_tag1tag2tag3.txt in one large folder so I can just search for tags. But then I thought, maybe I want the main tag first, since headline and author and date aren't likely to be good for organization. I'd prefer to at least look by Psychology or NaturalWorld or Politics, without necessarily needing to remember the tags I gave it.
Another option is to have a txt or md file with this info that I use as a guide, so any new article gets added there and as its own txt file. This would be faster to search, and I'd prepend an ID to each article txt file so I can easily find it. This does free me from a particular naming schema (though probably good to keep some data in the article txt files), but adds overhead for every article I add. I'm not anticipating doing thousands (or even hundreds) of articles to start, but over time, it should be robust. I'd also like to keep the original link somewhere, in case I need to hit it up for some reason (updates, clarifications, send to someone else).
Right now, this would all live in my NAS structure, and backed up to a cloud service periodically.
Thanks for any tips and ideas!
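The "guide file plus ID-prefixed text files" option described above could look roughly like this. It is only a sketch under assumptions: the index is a CSV named `index.csv`, IDs are assigned by counting existing rows (which breaks if rows are ever deleted), and all function and field names are invented for illustration.

```python
import csv
from pathlib import Path

INDEX_NAME = "index.csv"  # assumed name for the guide file
FIELDS = ["id", "category", "headline", "author", "date", "tags", "url", "filename"]


def read_index(library: Path) -> list[dict[str, str]]:
    """Load all index rows, or an empty list if no index exists yet."""
    index_path = library / INDEX_NAME
    if not index_path.exists():
        return []
    with index_path.open(newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def add_article(library: Path, *, category: str, headline: str, author: str,
                date: str, tags: list[str], url: str, text: str) -> Path:
    """Save the article as ID_Category.txt and record it in the index."""
    library.mkdir(parents=True, exist_ok=True)
    rows = read_index(library)
    article_id = f"{len(rows) + 1:04d}"
    filename = f"{article_id}_{category}.txt"
    # Repeat the metadata inside the file so it still makes sense without the index.
    header = (f"Headline: {headline}\nAuthor: {author}\nDate: {date}\n"
              f"Tags: {', '.join(tags)}\nURL: {url}\n\n")
    article_path = library / filename
    article_path.write_text(header + text, encoding="utf-8")
    rows.append({"id": article_id, "category": category, "headline": headline,
                 "author": author, "date": date, "tags": ";".join(tags),
                 "url": url, "filename": filename})
    with (library / INDEX_NAME).open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
    return article_path


def find_by_tag(library: Path, tag: str) -> list[str]:
    """Return the filenames of all indexed articles carrying the given tag."""
    return [row["filename"] for row in read_index(library)
            if tag in row["tags"].split(";")]
```

Putting the main category in the filename keeps the folder browsable by topic, while the tags and original link live in the index and in each file's header, so neither depends on the other surviving.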
Most DAM tools I look at only support image, video, audio and compressed file types. I'm looking for something that can handle 3D assets like .obj files. I would prefer something self-hosted with a visual grid, rather than a long list of file names as the only way to view the files. Please help, and thanks for taking the time to read the post.
Hi, I am a student. I found a useful PDF resource, but I couldn't track down where it came from. If I could, maybe I could find what else they have created on other subjects. Any help is appreciated. Thank you all in advance.
Who has the best definition of what data curation is, and what it definitely is not? I’m seeing confusion on this topic and overlaps with other things like data wrangling and data preparation. Any thoughts 💭?
Is there a tool to extract the transcripts from offline videos? Something like Submagic for YouTube? The issue is that I no longer have the original source URLs. The videos are saved on the hard drive, and I can't realistically sit and play hundreds of hours of video.
I have a few thousand scanned documents: handwritten letters, old printed letters, newspaper articles, etc. Some are in PDF format, some are JPG/HEIC. I recently figured out that the ones residing in Apple Photos are "automatically" made searchable for most of their text.
But what's your expert advice here, if I want both to keep the original scans (in PDF, JPG or similar) _and_ to have all the text as easily searchable as possible?
Apple Photos, iCloud Drive, OneDrive, OCR with WonderShare PDF and then into HTML files, or something completely different?