21,064 Subscribers

How to archive documents

I need to digitize my whole physical archive of diplomas, medical documents, bills, records, etc.

I have an Epson Perfection V800 and about 2 TB of lifetime storage on pCloud.

Is PDF/A the right format for long-term storage?
What DPI should I scan at, keeping in mind the space I have and that some documents have fine details and might be printed later from the scan? Is 1200 a good value?
What lossless compression do you recommend? Is JPEG 2000 lossless suitable?
What software could a) convert to PDF/A, since Epson Scan cannot natively scan to PDF/A, b) add multilingual OCR, and c) let me add advanced metadata, ideally in bulk?
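
To put the DPI question in terms of storage, here is a rough back-of-the-envelope sketch in Python. It assumes A4 pages, 24-bit colour and no compression, none of which is stated in the post, so treat the numbers as an upper bound rather than what a losslessly compressed file would actually take.

```python
# Rough uncompressed size of one scanned page at a given resolution.
# Assumptions (not from the post): A4 paper, 24-bit colour, no compression.

A4_INCHES = (8.27, 11.69)  # width and height of an A4 sheet in inches


def uncompressed_scan_bytes(dpi: int, size_in=A4_INCHES, bytes_per_pixel: int = 3) -> int:
    """Return the raw pixel-data size in bytes for one page."""
    width_px = round(size_in[0] * dpi)
    height_px = round(size_in[1] * dpi)
    return width_px * height_px * bytes_per_pixel


for dpi in (300, 600, 1200):
    mb = uncompressed_scan_bytes(dpi) / 1_000_000
    print(f"{dpi:>5} dpi: about {mb:,.0f} MB per page before compression")
```

Under those assumptions, a page comes to roughly 26 MB at 300 dpi, 104 MB at 600 dpi and 418 MB at 1200 dpi, so doubling the DPI quadruples the raw size. How much lossless compression recovers depends on the content of each page.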

Thanks!

2 Comments

2025/01/31
21:10 UTC

Monthly /r/datacurator Q&A Discussion Thread - 2025

Please use this thread to discuss and ask questions about the curation of your digital data.

This thread is sorted to "new" so as to see the newest posts.

For a subreddit devoted to storage of data, backups, accessing your data over a network etc, please check out r/DataHoarder.

0 Comments

2025/01/31
20:00 UTC

Meaning of $$$$ Folders?

Something I've noticed, both when joining a new company with some older guys in IT and when looking at the PCs of friends who took over the files of late family members, is folders named "$$$$" or "§§§§" or something similar.

I have also used special characters to make some folders show up at the top of the alphabetical order. I mainly use this for technical stuff or as a general directory for things I want to sort into folders later.

I'm surprised to see this more often recently in the file systems of older people that I get access to. Was this something people were taught in the past about organizing their systems? I couldn't find anything about it on Google. I'm just curious whether there is a story behind it, or whether so many people independently arrived at the same practical conclusion.

11 Comments

2025/01/31
11:42 UTC

Am I insane for dropping file directories and email folders?

I used to be meticulous about organizing files. But I get busy and lazy about what category this or that falls into... it drops into a single generic "request" folder. Then emails, I give up.

Now? I have 2 folders, one with final products and one with working versions, and that's really it. I rely entirely on the naming convention of the files for searching, and on the fact that I know the timeline of when I saved the work, so it's quick for me to search among the files to find things.

It's not perfect but, honestly, I took just as long sometimes trying to remember the file path I used to save things since that was a compromise too. It relied on the way I thought something should be categorized.

Am I insane for doing this? I haven't lost any files. It doesn't seem to take me any longer to find files. It is a bit distressing when I look at the list and it's most embarrassing when others see the file structure I suppose. But it's also quicker every time I save something. I feel like that time saved is constant.

Any ways to improve this approach further if I wanted to go all-in and ever have to explain myself to others, ha?

Sorry if this isn't the right place to post about this. Wasn't sure where else to go.

9 Comments

2025/01/29
08:03 UTC

Meta: why is this subreddit full of AI-generated posts, spam, advertising, and bizarre posts and comments?

I also noticed the wiki hasn't been updated in years and the person who wrote it deleted their Reddit account. Has this subreddit been abandoned to the wolves?

7 Comments

2025/01/26
05:41 UTC

Data Curator Jobs like Veeva Systems

I'm looking for a job similar to the Data Curator position at Veeva Systems (Matching team), at a similar company.

Is anybody familiar with a company like this?

2 Comments

2025/01/25
01:04 UTC

Organizing/Naming a ton of articles

In my spare time, I've been working on archiving a thread of articles from Backstreets Ticket Exchange (Springsteen fan forum). These articles were reproduced in the thread over the course of 11 years or so. Many of them are either only available in print or are now only on dead websites.

The forum has been in danger of shutting down for about a year or so now, which is why I've undertaken this effort.

I managed to grab them all (about 1,000 of them), and have each article in its own file. Now I'm just struggling with organizing/renaming all of them.

I figured on sorting them into folders by category (album/concert review, commentary, essay, etc.), but then renaming would be a different story and I'm not sure how to go about it.

I figured something like `YYYY-MM-DD_Author(s)_Source_Title.ext` would work, but then there's a number of them with really long titles or author lists. Would those get truncated?

Is there a general "standard" for this kind of thing? Or has anyone undertaken a similar project?
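
As a minimal sketch of the naming scheme above, the Python below builds `YYYY-MM-DD_Author_Source_Title.ext` names and handles the long-title and long-author-list cases by shortening them deliberately instead of leaving it to the file system. The 120-character cap, the "et-al" convention and the function names are assumptions for illustration, not an established standard.

```python
import re
import unicodedata

MAX_STEM = 120  # assumed cap on the name without its extension; adjust to taste


def slug(text: str) -> str:
    """Reduce text to ASCII letters and digits separated by hyphens."""
    ascii_text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode()
    return re.sub(r"[^A-Za-z0-9]+", "-", ascii_text).strip("-")


def article_filename(date: str, authors: list[str], source: str, title: str, ext: str = "txt") -> str:
    """Build YYYY-MM-DD_Author_Source_Title.ext, shortening long parts.

    Expects at least one author in the list.
    """
    # Keep the first author only and mark any others with "et-al".
    author = slug(authors[0]) + ("-et-al" if len(authors) > 1 else "")
    prefix = f"{date}_{author}_{slug(source)}_"
    # Whatever room is left under the cap goes to the title.
    room = max(MAX_STEM - len(prefix), 0)
    return f"{prefix}{slug(title)[:room].rstrip('-')}.{ext}"
```

For example, `article_filename("1984-06-29", ["Jane Doe", "John Roe"], "Example Magazine", "A Very Short Title")` (made-up values) gives `1984-06-29_Jane-Doe-et-al_Example-Magazine_A-Very-Short-Title.txt`. The full title and author list can still live inside the file or in a separate index if the shortened name loses too much.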

5 Comments

2025/01/22
21:04 UTC

Just got a Synology NAS and found about 500 pages of random documents in my mom's attic. I have an ADF scanner; what's the best way to save and automate sorting?

I don't mind paying, but it's like 500 random pages that I don't feel like manually sorting and labeling. I just skimmed through it, and it's like every tax return since '92, every promotion my mom got, documents from when I got my gallbladder removed in '02, my grandpa's DD214, grandpa's death certificate, all our birth certificates, my DD214 and my military promotions, receipts from our new roof, our warranties for our fridge, washer, dryer, etc., our boiler replacement, etc.

I'd like it to automatically make folders, like one for appliance warranties, another for tax returns, etc. is that

7 Comments

2025/01/22
17:56 UTC

How to distinguish between a document and a book for folder structure?

I'm reorganizing my folder structure and trying to figure out the best way to categorize files. Some are short, practical guides (e.g., a manual for fixing engines), while others are long, detailed resources (e.g., a comprehensive survival guide or books about WW2).

I'm unsure how to decide what counts as a "document" versus a "book." Should the distinction be based on length, purpose, or something else entirely?

Additionally, what would be the best folder structure to accommodate both types of files? Should I have separate folders for "Documents" and "Books," or combine them into a single folder with subcategories?

I'd love to hear how others approach this kind of organization!

8 Comments

2025/01/21
15:40 UTC

Should I put folders in C:/ or use the C:/users/username?

If my files weren't so interconnected with files that are automatically generated, I would probably find organizing much easier. I have Blender projects and coding projects. I attached an image of my C:\users\me. There's stuff I manually created, like Projects and portable apps, but it's mixed with a lot of autogenerated files. Also, are there any templates I can model mine on that have autogenerated files in mind?

https://imgur.com/a/V8zXAiB

5 Comments

2025/01/19
23:45 UTC

Looking for a good file integrity checker app for my HDD, open to suggestions

So I moved my files from the old HDD to the new HDD, and I want to check whether any files were corrupted during the process, or whether there are any corrupted files/videos on the old HDD (there are about 200k files, so I can't check each one).

I need an app that checks video or photo files for playability issues. I also need an app, ideally modern-looking (highly preferred but not necessary), that can check for corrupted files in a huge batch, including non-media files.
I might also need another app that fixes those files.

Also, some of the videos have names like VTS_01_1.vob and report a length of 14 seconds, but the video keeps playing after those 14 seconds. Any idea how to fix it? (They might have been extracted from an old DVD to an old hard disk about 10 years ago.) If I were to convert the video to another format like .mp4, would that solve the problem, and would I lose any data in the process?

Also, if this isn't the right place to ask the second question, any idea where I should ask it?
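
For the copy-verification part of the question, a minimal Python sketch is to hash every file on the old drive and compare it with the file at the same relative path on the new drive. The function names and report layout are made up for illustration. Note that this only shows whether the copy matches the original; it does not tell you whether a video was already unplayable on the old drive.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large videos do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        while chunk := handle.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


def compare_trees(old_root: Path, new_root: Path) -> dict[str, list[str]]:
    """Compare each file under old_root with the same relative path under new_root."""
    report = {"missing": [], "mismatch": [], "unreadable": []}
    for old_file in sorted(p for p in old_root.rglob("*") if p.is_file()):
        rel = old_file.relative_to(old_root)
        new_file = new_root / rel
        if not new_file.is_file():
            report["missing"].append(rel.as_posix())
            continue
        try:
            if sha256_of(old_file) != sha256_of(new_file):
                report["mismatch"].append(rel.as_posix())
        except OSError:
            # A read error on either side is worth a closer look.
            report["unreadable"].append(rel.as_posix())
    return report
```

Calling `compare_trees(Path("D:/"), Path("E:/"))` (drive letters are placeholders) returns three lists of relative paths. With about 200k files this reads every byte on both drives, so expect it to take a while.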

11 Comments

2025/01/17
14:55 UTC

Common file format / tools for recursive indexing of filesystems?

It's a common task for me to need to create big recursive file lists saved to something like a .csv / .tsv / .sfv file
- Fields usually include: filepath, size, modtime
  - Sometimes I store various types of checksums and other metadata too
- I'll usually generate these lists using /usr/bin/find -printf, but I also export and load them in other programs like voidtools-everything, wiztree, ncdu (json) etc.
But over the years, I've created and used so many similar-but-different formats for this...
- and it's always struck me as odd that there isn't really a common file format for this in a standard way?
- nor really any CLI tools that seem to be centered around saving the results to some kind of standard/consistent file format
- Is there anything I'm missing? Either formats or tools?
Once again... I'm spending my day on re-inventing the wheel, because I need something more efficient...
- So I'm looking at using parquet files...
  - Something like this that stores structured metadata about what fields it contains is pretty useful for varying use cases, e.g. when I do include checksums vs not needing them
  - Keen to hear any thoughts on this format, or if there might be anything better?
But still... yeah... surely lots of people across all sectors of IT, plus home enthusiasts, would be just like me?
- It's just weird that I haven't come across even an attempt at a standard here (re: xkcd 927)?
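
For reference, the kind of list described above (filepath, size, modtime) can be sketched in a few lines of Python. The column names, the TSV layout and the `write_index` function are assumptions for illustration, not an existing standard or tool.

```python
import csv
import os
from datetime import datetime, timezone

FIELDS = ["path", "size_bytes", "mtime_utc"]  # assumed column names


def write_index(root: str, out_path: str) -> int:
    """Walk root recursively and write one TSV row per file; return the row count."""
    count = 0
    # surrogateescape keeps file names that are not valid UTF-8 from aborting the run.
    with open(out_path, "w", newline="", encoding="utf-8", errors="surrogateescape") as handle:
        writer = csv.writer(handle, delimiter="\t")
        writer.writerow(FIELDS)
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in sorted(filenames):
                full = os.path.join(dirpath, name)
                try:
                    info = os.stat(full)
                except OSError:
                    continue  # skip files that vanish or cannot be read
                mtime = datetime.fromtimestamp(info.st_mtime, tz=timezone.utc)
                writer.writerow([os.path.relpath(full, root), info.st_size, mtime.isoformat()])
                count += 1
    return count
```

Keeping the header row in the file is a cheap way to record which fields a given list contains, which is the same need the Parquet idea addresses with its embedded schema. Write the output file outside `root` if it should not appear in its own index.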

12 Comments

2025/01/10
07:14 UTC

Books and other resources about digital organization, data curation, etc.

Hi everyone,

This subreddit is like a goldmine, and it got me thinking about how valuable curated information on data curation itself could be. I’m on the hunt for books, articles, and other resources that provide coherent, systematic approaches to the following topics:

- Digital organization - frameworks or strategies for efficiently organizing digital information. This could include personal or team-level systems for structuring files, naming conventions, or general workflow organization.
- Data curation, tagging, and metadata creation - best practices for designing meaningful tagging systems, creating metadata, or curating data so it remains usable and relevant in the long term.
- Optimizing retrieval and search - methods for improving how stored data or information is retrieved later, such as organizational techniques, filing systems, or other search optimization strategies.
- High-level data management - more abstract approaches to organizing, storing, and categorizing different types of data. Not from an analytical perspective like data science or machine learning, but practical, general-purpose advice for handling diverse data types. Also, avoiding data duplication or redundancy.
- Keeping data safe - recommendations for backup strategies, redundancy practices, or methods to minimize risks of data loss.

If you know of any resources that cover these areas in a structured and practical way - books, articles, blog posts, or anything else - I would love to hear your recommendations. Tools or courses that explore these ideas would also be appreciated.

Thanks for any input!

6 Comments

2025/01/07
20:02 UTC

How to organise containerised apps and config on a dev/prod server?

I have been setting up a VPS with Docker on Debian 12. I want to use this server as a compute platform to host several applications: third-party applications such as Twenty CRM, Kuma Uptime, etc., my own custom in-house applications, which may be Python or PHP, and several websites that are typically static sites made with Jekyll.

I have been mostly using docker-compose.

I want to learn how to organize this host properly so that it is easy to maintain and manage. I also want to keep anything needed to bootstrap a new replacement host separate from all the generated stuff. What I mean is, let's say I need to switch hosting providers and rent a VPS somewhere else. I want to be confident that I have all config, code, etc. in version control, so that I just need to copy over the data folder/database dumps, check out the apps and config from version control, and then run a script or two to entirely configure the host and containers...

I would like your advice on how to handle deployment of my apps, websites, etc. How to handle having dev and prod versions of each app. How to package and deploy my apps. How to organise my repos.

I would like specific recommendations, such as a directory structure for where to store working copies (I use SVN), docker-compose files, etc.

What to put in version control, what not to.

How to organize nginx configurations, firewall settings, etc.

Would this directory structure make sense?

/opt/apps/                    # Main directory for all applications
  third_party/                # For third-party applications
    twenty_crm/               # Directory for Twenty CRM app
    kuma_uptime/              # Directory for Kuma Uptime app
  custom/                     # For custom in-house applications
    my_python_app/            # Example Python app
    my_php_app/               # Example PHP app
  websites/                   # For static websites
    site1/                    # Example static site 1
    site2/                    # Example static site 2
/docker/                      # Directory for Docker-related configurations
  compose-files/              # Docker Compose files for each service
  images/                     # Custom Docker images, if needed
/srv/data/                    # For persistent application data
/srv/logs/                    # Centralized log storage
/etc/nginx/sites-available/   # Nginx configuration files
/etc/nginx/sites-enabled/     # Symlinks to active Nginx configurations

For version control, I am considering a layout such as this:

/trunk/
  apps/
    my_python_app/
    my_php_app/
  websites/
    site1/
    site2/
/branches/
/tags/

Not sure how to handle secrets...

If this does not belong here, I really hope you can point me in the right direction. The reason I find this relevant here is that I think this is mostly about how to organise the structure of these things and not so much how to actually configure and script stuff. I believe most of you in here have the right mindset and experience to know how to do this.

2 Comments

2025/01/07
18:31 UTC

Am I the only one with a Messy Downloads Folder?

As a dad, a student, and a researcher I have been asking myself:
"Isn't there a better way to easily organize my downloads and files into proper folders and give them proper names so I can easily find them?"

I wanted to know if this was also a problem for anyone else.

I always have to go into my downloads manually to keep things organized.

I wish I could make custom rules for my downloads so that anytime I download something, it goes into its respective folder.
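
The kind of custom rules wished for above can be sketched in Python as a small mapping from file extension to folder. The rule table, folder names and `sort_downloads` function are made up for illustration; a real setup would run this on a schedule or whenever a download finishes.

```python
import shutil
from pathlib import Path

# Hypothetical rules: file extension -> destination folder name.
RULES = {
    ".pdf": "Documents",
    ".jpg": "Pictures",
    ".png": "Pictures",
    ".zip": "Archives",
}


def sort_downloads(downloads: Path, dest_root: Path, rules: dict[str, str] = RULES) -> list[Path]:
    """Move files that match a rule into dest_root/<folder>; return the new paths."""
    moved = []
    for item in sorted(downloads.iterdir()):
        folder = rules.get(item.suffix.lower())
        if not item.is_file() or folder is None:
            continue  # leave folders and unmatched files where they are
        target_dir = dest_root / folder
        target_dir.mkdir(parents=True, exist_ok=True)
        target = target_dir / item.name
        if target.exists():
            continue  # never overwrite an existing file
        shutil.move(str(item), str(target))
        moved.append(target)
    return moved
```

Files with no matching rule stay in the downloads folder, so nothing disappears silently. Renaming files to "proper names" would need extra rules beyond the extension and is not covered here.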

35 Comments

2025/01/01
15:10 UTC

how long did it take to tag your files? (and other concerns about time management)

I have a collection of memes and other media. It takes me about 1 hour to organize about 1k files, which is OK, but that's only putting them into folders (e.g. technology memes, fitness memes, esoteric memes, etc.).

Because of that, I run into the classic "file can be in 2 different folders" problem, or the fact that I can't be hyper-specific if I need to find a file quickly. That's where tags (or even renaming) would come in handy, but the problem is that it would probably take waaaaay longer to tag all those files, and after a certain point I feel like it isn't worth it. Curation is supposed to make your life easier. Using AI to organize stuff would probably save some people time.

So how long does it take to tag your files? Was it worth it?

7 Comments

2025/01/01
07:34 UTC

Monthly /r/datacurator Q&A Discussion Thread - 2024

Please use this thread to discuss and ask questions about the curation of your digital data.

This thread is sorted to "new" so as to see the newest posts.

For a subreddit devoted to storage of data, backups, accessing your data over a network etc, please check out /r/DataHoarder.

2 Comments

2024/12/31
20:00 UTC

what are some of tools or tricks you use for managing your complexity/files (also what i use)

+ Unless there's a problem or I forget, I'm planning to update this as time goes on as well.

-For Backup

+ small trick: once in a while I take screenshots and screen recordings of my extensions, folders, desktops, apps, etc. and put them in a folder named "recovery" on my desktop, in case I accidentally delete something or move to a new device. Combined with Google Drive sync, this lets me recover my computer faster if something happens. Combined with Everything it gets a bit more complex, but I can easily remember/recover the folder structure/hierarchy as well (you can't use it to copy anything, but it's good for checking whether you missed something).
+ Google Drive sync: I use it with a 2 TB limit shared with my family and back up my whole desktop, photos folder, videos folder and documents folder. I also move files from my desktop section to My Drive to back up the whole desktop at once. (I wouldn't suggest using sync, by the way; they say it might conflict with other apps or the system, so just use cloud backup.) (You can't move the main desktop section at once, so you have to cancel sync from the app, then open the desktop page in Drive, press Ctrl+A and Ctrl+X or drag, go to My Drive, and press Ctrl+V or release the click, or something along those lines; I haven't mastered it yet.)
+ Everything: this one is a bit advanced. I use it to back up the whole folder structure into my recovery folder, so I can see whether I missed an app or folder while moving.
+ I haven't used them yet, but TeraCopy or Unstoppable Copier for moving folders (around 200 GB, I suppose). They say these are faster than Windows Explorer. As far as I know, TeraCopy has a modern interface, while Unstoppable Copier is better with damaged disks and file recovery (it appears TeraCopy also has transfer confirmation, which is a plus imo).

-In Browser Extensions

+Bookmarks
the bookmarks function itself: I use it to back up tab windows by right-clicking and choosing to bookmark the window. This lets me access all my tabs from my phone and manage another huge folder hierarchy in the browser.

https://chromewebstore.google.com/detail/bookmark-dupes/ombpkjoelcapenbepmgifadkgpokfgfd https://chromewebstore.google.com/detail/bookmarks-clean-up/oncbjlgldmiagjophlhobkogeladjijl
the ones above are for detecting duplicate bookmarks

https://chromewebstore.google.com/detail/rewind/oghafdocdmlkkjipdmnikdcgekjpiapf

this is for figuring out when you bookmarked something, in case you need that for some reason

+Tabs

https://chromewebstore.google.com/detail/session-buddy/edacconmaakjimmfgnblocblbcdcpbko
I only discovered this one recently, so I'm not an expert in it, but I use it for various purposes and for backing up tabs

https://chromewebstore.google.com/detail/tab-manager-plus-for-chro/cnkdjjdmfiffagllbiiilooaoofcoeff

this was the one I used before. It's more visual for seeing many tabs at once, plus a few other nice things (it also shows more duplicate tabs by comparison, for some reason I don't know yet)

https://chromewebstore.google.com/detail/wayback-machine/fpnmgdkabkmnadcjpehmlllkndpkmiak

for backing up tabs in case something gets deleted (a YouTube video, for example)

-for file related

+ TreeSize: I use it to find big folders, apps, games, etc. when I'm low on storage and delete them (quite useful). Also, sometimes when I clear app data from within an app it doesn't shrink much, so I delete the whole app and redownload it (e.g. Spotify).
+ Duplicate Cleaner: as the name suggests, I use it to delete duplicate files and folders and to find very similar folders (by manual observation).
+ FreeFileSync: I use it to find differences between two very similar folders. If you were moving a folder to another device and it got interrupted, I think you can continue from there (not sure how it handles partial files, e.g. an unfinished torrent download; both would look like a 3.6 GB video, if I'm not wrong).
+ a tag-based folder app which I haven't decided on yet (TagStudio, etc.)
+ the fourth item from the backup list above (TeraCopy or Unstoppable Copier for moving folders)

5 Comments

2024/12/30
11:22 UTC

Is there any app that puts all my health data together and gives AI-based insights?

3 Comments

2024/12/27
11:28 UTC

Fastest possible hard drive RAID?

2 Comments

2024/12/25
16:23 UTC

Where do you store everything?

So far I've been using a private Discord account as my own dump for content I wanted to save (like URLs, videos to watch later, memes, etc.), but I've realized this probably isn't the most secure. What works similarly to Discord and lets me organize and save content? I would also appreciate it being cross-platform, since I have an iPhone but use a Windows desktop, so something like Apple Notes wouldn't work well.

12 Comments

2024/12/22
18:11 UTC

Saving web articles and making them findable

I have a decent system for my documents and media, but I'm struggling a little with how best to save local copies of important reference articles (not scholarly-type works that often have reference systems built in) and how to find them. Link rot is a real thing and I fully expect it to get worse. Also, I'd like to clear out my browser tabs lol.

My initial thought, for longevity, is to just save the text of the article in a .txt file, with a filename of the originalHeadline_author_date_tag1tag2tag3.txt in one large folder so I can just search for tags. But then I thought, maybe I want the main tag first, since headline and author and date aren't likely to be good for organization. I'd prefer to at least look by Psychology or NaturalWorld or Politics, without necessarily needing to remember the tags I gave it.

Another option is to have a txt or md file with this info that I use as a guide, so any new article gets added there and as its own txt file. This would be faster to search, and I'd prepend an ID to each article txt file so I can easily find it. This does free me from a particular naming schema (though probably good to keep some data in the article txt files), but adds overhead for every article I add. I'm not anticipating doing thousands (or even hundreds) of articles to start, but over time, it should be robust. I'd also like to keep the original link somewhere, in case I need to hit it up for some reason (updates, clarifications, send to someone else).
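
The guide-file option described above can be sketched in Python roughly as follows: each article is saved as its own ID-prefixed `.txt` file, and one index row with the headline, author, date, tags and original link is appended per article. The file name `index.tsv`, the column names and the `add_article` function are assumptions for illustration.

```python
import csv
from pathlib import Path

INDEX_NAME = "index.tsv"  # assumed name for the guide file
COLUMNS = ["id", "date", "author", "headline", "tags", "url"]


def add_article(library: Path, text: str, date: str, author: str,
                headline: str, tags: list[str], url: str) -> str:
    """Save the article text as <id>.txt and append a row to the index; return the ID."""
    library.mkdir(parents=True, exist_ok=True)
    index = library / INDEX_NAME
    is_new = not index.exists()
    if is_new:
        existing = 0
    else:
        # Count existing data rows (everything except the header) to pick the next ID.
        with index.open(newline="", encoding="utf-8") as handle:
            existing = sum(1 for _ in csv.reader(handle, delimiter="\t")) - 1
    article_id = f"{existing + 1:05d}"
    (library / f"{article_id}.txt").write_text(text, encoding="utf-8")
    with index.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t")
        if is_new:
            writer.writerow(COLUMNS)
        writer.writerow([article_id, date, author, headline, " ".join(tags), url])
    return article_id
```

Searching then means searching one small text file for a tag such as Psychology and opening the matching ID. The per-article overhead is a single function call, and the original link is kept in the `url` column.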

Right now, this would all live in my NAS structure and be backed up to a cloud service periodically.

Thanks for any tips and ideas!

14 Comments

2024/12/18
14:47 UTC

Looking for a DAM for game development

Most DAMs I look at only support image, video, audio and compressed file types. I'm looking for something that can handle 3D assets like .obj files. I would prefer something self-hosted, with a visual grid rather than a long list of file names as the only way to view the files. Please help, and thanks for taking the time to read the post.

6 Comments

2024/12/15
19:40 UTC

Cloud-based library app for movie, TV, and music collection?

5 Comments

2024/12/12
15:02 UTC

How to find origin of a pdf

Hi, I am a student. I found a useful PDF resource, but I couldn't track down where it came from. I'd like to find the source so I can see what else they have created on other subjects. Any help is appreciated. Thank you all in advance.

6 Comments

2024/12/11
12:53 UTC

What’s your definition of data curation?

Who has the best definition of what data curation is and definitely is not? I’m seeing confusion on this topic and overlaps with other things like data wrangling and data preparation. Any thoughts 💭?

14 Comments

2024/12/11
11:24 UTC

How to extract transcripts from offline videos? Needs to have AI?

Is there a tool to extract the transcripts from offline videos? Something like Submagic for YouTube? The issue is that I no longer have the original source URLs. The videos are saved on my hard drive, and I can't realistically sit and play hundreds of hours of video.

5 Comments

2024/12/07
12:39 UTC

What do I do with 17 TB of very well curated nsfw content?

56 Comments

2024/12/07
02:57 UTC

Curate old letters, newspaper articles and similar?

I have a few thousand scanned documents in the form of handwritten letters, old printed letters, newspaper articles, etc. Some are in PDF format, some are in JPG/HEIC. I recently figured out that those residing in Apple Photos are "automatically" made searchable for most of the text.

But what's your expert advice here, if I want both to keep the original scans (as PDF, JPG or similar) _and_ to have all the text as easily searchable as possible?

Apple Photos, iCloud Drive, OneDrive, OCR with WonderShare PDF and then into HTML files, or something completely different?

4 Comments

2024/12/06
12:00 UTC