/r/datacurator

A place for us less messy data hoarders.

/r/datacurator is the place for discussion about the curation of digital data. Be it sorting, file formats, file encoding, best practices, discussion of your setup, tips and tricks, asking for help etc.

Rules:

  1. Follow common Reddiquette rules.
  2. No direct linking to pirated/illegal content.
  3. Keep discussion and commenting civil.

18,315 Subscribers

2

How to create custom metadata tags for .mp4 and .mov files on Windows?

Hi there, I am trying to solve an issue at work. We are upgrading our digital filing system, currently a database-type setup in File Explorer, to an ECM system. We have been asked to tidy up and name thousands of data files spanning decades. Naturally, I am trying to automate a solution for my team, as doing this by hand would take an unachievable amount of time.

I have already been able to find a solution using Exiftool to add custom 'keywords' or tags for all the jpgs using a script that scans a parent folder and then all the sub-folders, and then assigns a common tag relating to that parent folder, in this case a site code. This is so that when you search for a site in the new ECM, any file with that site code will show up by using the associated metadata tag.

I am hoping to do the same for the hundreds of .mp4 and .mov files. I understand that Exiftool is limited when writing metadata to video files, and I am struggling to find the best solution. Any advice would be much appreciated!
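One possible direction, sketched in Python: ExifTool can write XMP to MP4/MOV containers, so the same site code could be added as an `XMP-dc:Subject` keyword. Whether the ECM indexes that field is an assumption to verify first. The function names and the folder layout (the first folder under the root is the site code) are hypothetical.

```python
from pathlib import Path
import subprocess

VIDEO_EXTENSIONS = {".mp4", ".mov"}


def site_code_for(video: Path, root: Path) -> str:
    """Return the name of the first folder under root that contains the video."""
    parts = video.relative_to(root).parts
    if len(parts) < 2:
        raise ValueError(f"{video} is not inside a site folder")
    return parts[0]


def build_command(video: Path, site_code: str) -> list[str]:
    """Build an ExifTool call that adds site_code to the XMP-dc:Subject list."""
    return [
        "exiftool",
        f"-XMP-dc:Subject+={site_code}",  # += adds to the list, keeping existing keywords
        "-overwrite_original",            # do not leave *_original backup copies
        str(video),
    ]


def tag_videos(root: Path, dry_run: bool = True) -> list[list[str]]:
    """Collect (and optionally run) one ExifTool command per video under root."""
    commands = []
    for video in sorted(root.rglob("*")):
        if video.is_file() and video.suffix.lower() in VIDEO_EXTENSIONS:
            command = build_command(video, site_code_for(video, root))
            commands.append(command)
            if not dry_run:
                subprocess.run(command, check=True)
    return commands
```

Running `tag_videos(Path("D:/archive"))` with the default `dry_run=True` only returns the commands, which makes it easy to review them on a copy of a few files before writing anything.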

0 Comments
2024/04/14
22:54 UTC

6

Do you store personal pictures/other personal files in a separate folder from your non-personal ones?

My root folder has a videos folder and an images folder that has further categorization within. But I also have a "personal" folder in the root folder that has personal videos/images. Ideally these would go in the respective videos/images folders but I keep it like this. Anyone else organize it like this or is it generally not recommended as when these folders grow it would get tough to manage? Sorry if this was a bit confusing to read.

4 Comments
2024/04/13
17:11 UTC

7

How to number folders and files?

Hi, how do you think I should number my folders and files? Should I add the number of the parent folder to the number of the child folders (option 1) or not (option 2)?

Option 1:

01 Animals (parent folder)

  • 01.01 Dogs (child folder)
    • 01.01.01 Bulldog (file)
    • 01.01.02 German Shepherd (file)
  • 01.02 Cats (child folder)
    • 01.02.01 Maine Coon (file)
    • 01.02.02 Siamese cat (file)

Option 2:

01 Animals (parent folder)

  • 01 Dogs (child folder)
    • 01 Bulldog (file)
    • 02 German Shepherd (file)
  • 02 Cats (child folder)
    • 01 Maine Coon (file)
    • 02 Siamese cat (file)

Edit: wrong numbering
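For anyone who wants to generate the option 1 labels rather than type them, here is a minimal Python sketch. It assumes the hierarchy is described as a nested dict (a hypothetical representation, not something from the post) and gives every item a label that extends its parent's label.

```python
def number_tree(tree, prefix=""):
    """Yield (label, name) pairs; each label extends the label of its parent."""
    for index, (name, children) in enumerate(tree.items(), start=1):
        label = f"{prefix}.{index:02d}" if prefix else f"{index:02d}"
        yield label, name
        yield from number_tree(children, label)


animals = {
    "Animals": {
        "Dogs": {"Bulldog": {}, "German Shepherd": {}},
        "Cats": {"Maine Coon": {}, "Siamese cat": {}},
    }
}

for label, name in number_tree(animals):
    print(label, name)
```

This prints the labels from option 1 (01, 01.01, 01.01.01, and so on). Dropping the `prefix` argument from the recursive call would give the option 2 numbering instead.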

7 Comments
2024/04/12
07:15 UTC

1

Is there an easy way to auto-organize/tag a collection of old memes / gifs / pics without reviewing each one?

For years, I have pretty much stopped saving new content because I can't keep track of existing data. Whenever I'm looking for a meme, it's easier to Google it than look at my folders. The problem is...a lot of good stuff I have is hard to find even on Google. I'm looking for a way to organize my content without spending days on it. Thoughts?

4 Comments
2024/04/12
04:20 UTC

10

Reorganizing files from scratch

I am going to be reorganizing a computer filing system for a friend. She basically has chaos as she has a few drives with home and work files, plus her deceased mother’s files to organize. This will be on a Mac system. I don’t think it’s an extraordinary number of files, maybe 20-30k possibly less.

My approach will be to first sort by media type (get photos and video separated), then to order by date and sort into broad categories, probably by file type. There will be a lot of .doc and .xls stuff. I’m not sure how much is already in project folders vs loose. But the final detailing will be her task — my job will be to set up a structure and group similar things together. I will use smart folders to do this (preserving whatever structure exists).

I’m thinking that I should prepend an ISO date to all file names. I’m looking for an easy way to do this. I’m not a programmer and would prefer not to use the terminal. Anyone know of a good tool?
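If no GUI tool turns up, the renaming itself is small enough to script. The Python sketch below is one way to do it, under two assumptions that are not from the post: the date comes from the file's modification time, and it is rendered in UTC. The function names are hypothetical.

```python
import re
from datetime import datetime, timezone
from pathlib import Path

# Matches names that already start with an ISO date followed by a space.
ISO_PREFIX = re.compile(r"^\d{4}-\d{2}-\d{2} ")


def dated_name(name: str, timestamp: float) -> str:
    """Return name with a YYYY-MM-DD prefix, unless it already has one."""
    if ISO_PREFIX.match(name):
        return name
    day = datetime.fromtimestamp(timestamp, tz=timezone.utc).date().isoformat()
    return f"{day} {name}"


def prefix_file(path: Path) -> Path:
    """Rename one file so its name starts with its modification date."""
    target = path.with_name(dated_name(path.name, path.stat().st_mtime))
    if target != path:
        path.rename(target)
    return target
```

Because already-prefixed names are left alone, running it twice over the same folder does not stack dates. It is worth trying on a copy first, since copying files between drives can reset modification times.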

Then the big question… what file structure? I’m thinking Johnny.Decimal (J.D) because it will impose structure in an understandable way, and most decisions can be made up front. It should be compatible with organizing by date, and eliminate the ambiguity inherent in descriptive naming. I’m prepared to alter it some if necessary, or create separate structures for home and work. I’m aware that it’s less flexible than others, but that may be a strength in this case. Thoughts?

16 Comments
2024/04/11
15:08 UTC

0

I can't create an Archive.org account because I'm not receiving the confirmation emails. Wat do?

halp

1 Comment
2024/04/06
12:05 UTC

0

Looking for a .m4a file renamer based on meta/exif data

Hi guys, I have a load of .m4a files that I'd like to quickly and easily batch rename based on their Media Created date.

Is there a Windows desktop app that can do this? I've tried searching but not found anything except ExifTool, but that's command line and I'd much prefer a GUI.

Many thanks! :-)

2 Comments
2024/04/05
12:22 UTC

8

Media sharing and collaborative curation software?

I'm looking for an open source program compatible with Linux that facilitates media sharing and collaborative curation among users. I would still like to hear about any similar software, even closed source or not compatible with Linux. Ideally the program would have an edit history or some way to approve/reject edits for moderation. I think the closest software to what I have in mind would be image boards, musicbrainz and stash-box. But those are specific to some kind of media only. On the other hand there's NextCloud or P2P file sharing programs where you can share any media but other users can't help curate the media or there is no moderation if you allow someone edit access. I would appreciate your suggestions.

3 Comments
2024/04/05
10:06 UTC

4

Can AI sort 'noisy' photos?

Several colleagues and I use iPhones to take hundreds of photos, day and night, of posters (aka jobs) taped to poles. The posters are all 900mm x 320mm and generally advertise concerts.

Usually we're photographing 10-20 jobs each day. The photos are amalgamated every few days & manually sorted into 'jobs'. Sorting the photos is tedious & time consuming.

The clients are sent a link to a google pin map (where their posters can be found) & another link to the photos of their posters.

Could anyone suggest any software that could be trained to sort the photos based on the pattern of shapes/colours/text that each photo contains?

thanks!

2 Comments
2024/04/02
17:27 UTC

3

Monthly /r/datacurator Q&A Discussion Thread - 2024

Please use this thread to discuss and ask questions about the curation of your digital data.

This thread is sorted to "new" so as to see the newest posts.

For a subreddit devoted to storage of data, backups, accessing your data over a network etc, please check out /r/DataHoarder.

1 Comment
2024/03/31
20:00 UTC

18

I made a one-click AI-powered Windows app that renames and adds relevant metadata to your images and gifs, where you can change the language and use custom prompts for renaming

14 Comments
2024/03/31
16:08 UTC

0

Does anyone here have an email list of all principals or management of schools in South Korea?

3 Comments
2024/03/23
21:17 UTC

3

Training Tesseract 5 on custom dataset

Hi everyone, I'm working on my senior project which involves reading a card from a game using a screenshot. To do so I'm using Tesseract but there's a part of the card the OCR doesn't identify so I wanted to train it using a bunch of different screenshots of that part along with a text file containing the content of that screenshot. If anyone knows how that could be done or if there's a better alternative I'd love the help!

1 Comment
2024/03/20
10:11 UTC

1

Similar / not same file identification

Goal - find "oh, I forgot that" useful data, documents, and emails for various projects (personal and professional) that I have in flight. Maybe even some of my web bookmarks. Tagging and maybe some content clustering (extract text, then cluster on bag-of-words).

As part of this, I found myself writing a tool that includes a locality preserving hash to identify "similar" files that are not exactly the same, like revisions and re-orderings of documents and code. That way I can put all of "one" document in one place, and then link into that from a project-oriented directory.

Does anyone else use (or even have) a tool that already does something like this?
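To make the idea concrete, here is a minimal Python sketch of one well-known locality-sensitive fingerprint, SimHash, computed over a bag of words. It is not the poster's tool, and the tokenisation and 64-bit size are assumptions chosen for brevity. Similar texts tend to get fingerprints that differ in only a few bits.

```python
import hashlib
import re

BITS = 64


def simhash(text: str) -> int:
    """Return a 64-bit SimHash fingerprint of the words in text."""
    counts = [0] * BITS
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
        value = int.from_bytes(digest, "big")
        # Each token votes +1 or -1 on every bit position.
        for bit in range(BITS):
            counts[bit] += 1 if (value >> bit) & 1 else -1
    # A bit is set in the fingerprint when the votes for it are positive.
    return sum(1 << bit for bit in range(BITS) if counts[bit] > 0)


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")
```

Because the fingerprint depends only on word counts, a re-ordered document gets exactly the same value as the original, while a revision usually lands a small Hamming distance away. The distance threshold for "similar" has to be tuned on real files.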

10 Comments
2024/03/18
17:28 UTC

8

Sorting chaotic backups on external drives and

Hello everyone,

I have 4 external hard drives (between 1 and 5TB of space per drive) that are filled with all my files from the last ~6-8 years. The problem is that the files are not sorted properly and are a mix of Time Machine backups, copy-and-paste backups, backups of backups, etc.

The type of files also ranges from text/pdf documents, media files to programming projects.

Can you recommend any resources and/or programs to help me sort this chaos and set up a long-term, sustainable backup system that is not dependent on any one platform (like Time Machine is on Mac)?

3 Comments
2024/03/14
11:37 UTC

12

I've built a CLI tool for file management automation

2 Comments
2024/03/10
01:30 UTC

3

Embedding barcode info in photos

Hey all,

We have a fabrication joint where expensive parts are used for prototype systems for our customers. Occasionally, these parts will be damaged in shipping and thank goodness insurance covers that! But we have to prove that it was in good condition when it left our place. For this reason we've got TONS of photos of parts, and it's become cumbersome to sort through them when we have to.

Someone came up with the idea of using something like Entagged to put the barcode information of the parts in the metadata of the photo. This would allow us to simply search up the barcode and see all photos of that part. From there, it would be easy to narrow down which photo is for that project based on date, context etc.

My issue with Entagged is that it seems like a frustrating workflow. We'd need to buy a compatible camera, the device itself, train everyone on how to use it, and have all the techs download the app. Then if Tech 1's phone is connected to it via bluetooth but Tech 2 needs to use it...

I need this to be easy to do, otherwise the techs won't use it at all!

I'd love to A) buy a camera with this feature inbuilt, so we don't have to use peripheral tech or B) find a simple cellphone app that everyone can learn to use

Any help pointing me in the right direction would be greatly appreciated! Thanks!

Edit: we generate codes for our products, so this could be done with QR codes or whatever would work as well.

5 Comments
2024/03/04
17:28 UTC

17

Batch rename audio files from an excel list

Hello,

I have hundreds of audio files named "Track 01", "Track 02" and so on, and I'd like to rename them sequentially using an Excel file where all the correct names are written. So, Track 01 would become whatever is written in row 1, Track 02 whatever is written in row 2, and so on.

Is there a way to do so? I'm not a programmer, so if we can avoid coding it'd be better, but I'm willing to learn something if that's the only way to do this.

I'm using macOS Ventura 13.5.2
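If a little code turns out to be acceptable, the pairing logic is short. The Python sketch below assumes the Excel sheet has been exported to a CSV file with one name per row in the first column; the function names are hypothetical. It pairs files with names by track number and keeps each file's extension.

```python
import csv
import re
from pathlib import Path


def track_number(path: Path) -> int:
    """Extract the first number in the file name, e.g. 'Track 02' -> 2."""
    match = re.search(r"\d+", path.stem)
    if match is None:
        raise ValueError(f"no track number in {path.name}")
    return int(match.group())


def read_names(csv_path: Path) -> list[str]:
    """Read the first column of the exported sheet, skipping empty rows."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return [row[0].strip() for row in csv.reader(handle) if row and row[0].strip()]


def plan_renames(files: list[Path], names: list[str]) -> list[tuple[Path, Path]]:
    """Pair files (sorted by track number) with names, keeping the extension."""
    ordered = sorted(files, key=track_number)
    if len(ordered) != len(names):
        raise ValueError("number of files and number of names differ")
    return [(old, old.with_name(new + old.suffix)) for old, new in zip(ordered, names)]
```

The plan is returned rather than applied, so it can be printed and checked first; applying it is then a loop calling `old.rename(new)` for each pair.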

14 Comments
2024/03/01
13:04 UTC

2

Monthly /r/datacurator Q&A Discussion Thread - 2024

Please use this thread to discuss and ask questions about the curation of your digital data.

This thread is sorted to "new" so as to see the newest posts.

For a subreddit devoted to storage of data, backups, accessing your data over a network etc, please check out /r/DataHoarder.

0 Comments
2024/02/29
20:00 UTC

6

Help! Seeking cure/mitigation - How do I stop myself?

I have pathological OCD with organising text from a monthly scrapbook into separate word docs by the topic of the text, which takes massive amounts of time and leaves me exhausted.

Typically the text is extracts from things I've read or random thoughts.

The desire to organise takes precedence over socialising, which isn't great. Though ngl it does feel good when I get some chunk of organising done.

Has anyone found any effective strategies / techniques / therapies to help please?

I also have a problem with saving pdf/bookmark reading material

PS. Is there a good program for tagging sections of text in a large document by topic and then applying filters to view by topic?

This would reduce the cut-paste work.

[Elaboration:

I don't have capacity to switch to Linux or Mac. Windows is a must and a small learning curve is important.

I currently save everything of varied topics as I go in a monthly docx scrapbook which fills to >70 pages.

Then at end of month I cut-paste from that monthly scrapbook docx to >30 longterm topic docx documents. Lots of low-skilled admin in clicking around :'(

I haven't found it useful to decrease the number of topic types, unfortunately.

The topical docx can be read like normal documents with no further clicking, which I like.

I search for strings using AgentRansack]

19 Comments
2024/02/28
15:47 UTC

7

Anybody recognize this OCR format?

Anybody recognize this OCR format?

It seems not to be proper hOCR, because gImageReader, which uses Tesseract, can't open it.

https://preview.redd.it/7ihzblwb07lc1.jpg?width=1433&format=pjpg&auto=webp&s=a5009402464be898591b21b80da8d3ad5083791b

0 Comments
2024/02/27
21:05 UTC

12

Thinking about building a NAS

I used Drobos in the past to backup and archive my data. Lesson learned - do not rely on proprietary systems. I'm now considering building my own NAS and need a little advice. As far as software, I'm undecided between Unraid and TrueNAS but leaning toward Unraid because it seems a little easier to set up and manage. As far as hardware, I already have lots of SATA drives (5 x 14TB, 10 x 10TB, 10 x 8TB, 6 x 6TB, plus a few other scattered sizes) so I think I would like to stick with those instead of reinvesting in SAS drives. Beyond that, I don't really know. I kind of like the idea of a desktop setup because I've built several Windows/Linux PCs before and am familiar with the process. I don't know anything about rack-mounted homelabs and wouldn't know where to begin. But at the same time I recognize that a desktop setup isn't going to accommodate as many drives or be as expandable as a rack system so I am wondering if climbing that learning curve would be worth the while.

My purposes for the NAS would be 1) backup of my main PCs' hard drives and SSDs, 2) media player (Plex, Jellyfin, etc), 3) file server, 4) maybe some VMs. Budget: maybe $5000. I wouldn't need to buy any drives, at least to start out, since as mentioned I already have a lot of drives lying around.

Advice please?

Xposted to r/DataHoarder and r/datacurator. Thanks!

14 Comments
2024/02/25
15:15 UTC

7

Looking for good table OCR software to convert multiple tables in the same format from books in image format to a single spreadsheet table.

Currently I'm using Docsumo table OCR. It is the most accurate one I could find, but the problem is that I have ~1000 images = ~1000 tables (with the same formatting) in total, and doing it manually is very time consuming (around 5 minutes per table, so about 5000 minutes or 83 hours total). I could merge all the images into a single .pdf file and convert that, but from past experience the result is horrible, with misaligned data in different columns everywhere. Any help is much appreciated.

0 Comments
2024/02/20
13:33 UTC

59

M-Disc archive + QR code organisational system (work in progress!)

19 Comments
2024/02/12
01:10 UTC

7

What do you think are the most important metadata for an archive file containing manga images?

Comics have ComicRack's comicinfo.xml, but that isn't very specific to manga and the main data source is ComicVine. You can't really do anything with the language aspect and alt titles. Like if I wanted to store the mangaka's Japanese name and a furigana/kana version of it, I couldn't. If you were to make a mangainfo.xml, what would you include?

1 Comment
2024/02/08
15:11 UTC

35

I'm currently at stage 3.

4 Comments
2024/02/06
20:24 UTC

3

Service to extract images from scanned PDF?

Would be very glad if anyone can recommend OCR but for images

9 Comments
2024/02/05
17:30 UTC

26

I made a script to bulk convert videos and preserve their metadata

A friend and I are in the process of converting several TBs of recordings made with SONY cameras and action cameras. They all have insanely high bitrates and use H264.

Our GPUs are pretty fast in converting to H265 format, to halve the used space (at least).

I noticed that HandBrake doesn't keep the recorded-time metadata, so converted videos lose all time information, which is a huge issue to me.

So I created a PowerShell script that uses HandBrakeCLI and exiftool to automate the job. You just need to provide source and destination folders, and to choose which profile you want to use. The script will convert and transfer the metadata of every video file found (MTS and MP4).

Would you be interested in this? I also created a light version that only does the metadata part without the conversion.

I can tidy up these scripts and publish them on GitHub.
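For readers who want to see the shape of such a pipeline before the scripts are published, here is a minimal Python sketch. It is not the poster's PowerShell script. It assumes a HandBrake preset name chosen by the user and an MP4 output container, and the function names are hypothetical. It only builds the two command lines per file: one HandBrakeCLI conversion and one exiftool call that copies tags from the source to the converted file.

```python
from pathlib import Path


def output_path(source: Path, source_root: Path, destination_root: Path) -> Path:
    """Mirror the source folder layout under destination_root, as .mp4."""
    return destination_root / source.relative_to(source_root).with_suffix(".mp4")


def convert_command(source: Path, destination: Path, preset: str) -> list[str]:
    """Build the HandBrakeCLI call for one file."""
    return ["HandBrakeCLI", "-i", str(source), "-o", str(destination), "--preset", preset]


def copy_metadata_command(source: Path, destination: Path) -> list[str]:
    """Build the exiftool call that copies tags from source into destination."""
    return [
        "exiftool",
        "-TagsFromFile", str(source),
        "-all:all",             # copy every tag that can be written to the target
        "-overwrite_original",  # do not keep a *_original backup of the output
        str(destination),
    ]
```

Each list can be passed to `subprocess.run(command, check=True)`. Not every tag carries across container formats, so MTS sources in particular may need explicit tag mappings for the recording date.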

12 Comments
2024/02/04
17:55 UTC
