/r/datacurator

A place for us less messy data hoarders.

/r/datacurator is the place for discussion about the curation of digital data. Be it sorting, file formats, file encoding, best practices, discussion of your setup, tips and tricks, asking for help etc.

Rules:

  1. Follow common Reddiquette rules.
  2. No Direct linking to pirated/illegal content.
  3. Keep discussion and commenting civil.

20,427 Subscribers

2

Monthly /r/datacurator Q&A Discussion Thread - 2024

Please use this thread to discuss and ask questions about the curation of your digital data.

This thread is sorted to "new" so as to see the newest posts.

For a subreddit devoted to storage of data, backups, accessing your data over a network etc, please check out /r/DataHoarder.

0 Comments
2024/10/31
20:00 UTC

1

Saving favorite Threads on Site that is going down?

Is there a good "tool" I can use to extract some of my favorite threads and my friends' favorite writings there? It's a senior site that is having a lot of trouble, and I fear some threads will be gone forever.

I heard of a "scraping tool" but couldn't find one, and if possible, I'd like an open-source tool. Thank you for any help at all ;)

3 Comments
2024/10/31
16:40 UTC

3

New Solution Thoughts?

0 Comments
2024/10/30
00:08 UTC

5

Compressed folders (zip, rar...) of images (jpg, png...) into PDF?

(This is machine-translated; I am not computer savvy.)
I am addicted to scanning and collecting various manuals.
For many years I saved the images as PNG
and then zipped each folder.
Many a little makes a mickle:
the collection had grown so much that I had to review it.
I converted one folder to PDF and it took up a lot less space
than the identical zipped folder.
It doesn't look or feel any worse.
Why is it smaller?
When converting to PDF, the software has a function called "optimization"
that reduces the image quality to make the file smaller.
However, I am not using this function, and the images are still in PNG format.
Strange!
I'm thinking of changing everything to PDF if it makes no difference and only reduces the size.
Is there a reason why most of the world uses formats like zip or rar instead of plain PDF?

9 Comments
2024/10/28
21:05 UTC

6

Managing Bookmarks, Images, and Memes: A Digital Overload?

2 Comments
2024/10/27
19:22 UTC

3

speed up the tagging process

Hi, I have a problem to solve and I'm here to ask for suggestions.

I have a huge quantity of files (~50,000 PDFs) that need to be tagged, and a fixed structure in which to store the tags.

Something like:

_system VARCHAR(50) NOT NULL,
version VARCHAR(10) NOT NULL,
name VARCHAR(50) NOT NULL,
y YEAR NOT NULL,
language ENUM (...) NOT NULL,
type ENUM (...) NOT NULL,

So some tags are limited (enums) but others are not (strings, numbers).

My question is: is it possible to automate or speed up the process somehow? Manually processing these files would consume hundreds of working hours.
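
One possible way to cut the manual work is to pre-fill whatever can be guessed and review only the gaps. The sketch below is a rough illustration, not a known solution from the thread: it assumes (the post does not say so) that the file names carry hints such as a year, a version number, or a language code. The `suggest_tags` helper, the `LANGUAGES` set, and the file name pattern are all hypothetical; only the field names `y`, `version`, and `language` come from the schema above.

```python
import re
from pathlib import PurePath

# Hypothetical allowed values; the post elides the real ENUM list.
LANGUAGES = {"en", "it", "de", "fr"}

YEAR = re.compile(r"(19|20)\d{2}")
VERSION = re.compile(r"v?(\d+(?:\.\d+)+)", re.IGNORECASE)


def suggest_tags(filename):
    """Guess y, version and language from a file name; None means 'ask a human'."""
    tags = {"y": None, "version": None, "language": None}
    # Split the name (without extension) on spaces, underscores and hyphens.
    for token in re.split(r"[ _-]+", PurePath(filename).stem):
        version = VERSION.fullmatch(token)
        if tags["y"] is None and YEAR.fullmatch(token):
            tags["y"] = int(token)
        elif tags["version"] is None and version:
            tags["version"] = version.group(1)
        elif tags["language"] is None and token.lower() in LANGUAGES:
            tags["language"] = token.lower()
    return tags
```

Calling `suggest_tags("AcmeOS_v2.1_user-guide_2019_en.pdf")` would fill all three fields, while a name like `scan0001.pdf` leaves them all `None`, so those files can be queued for manual tagging. If the hints live in the PDF text rather than the file name, the same matching could be run over extracted text instead.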

16 Comments
2024/10/26
12:44 UTC

7

10 years and 30,000 files of audit data

Greetings! I am a data hoarder/curator in my spare time and a compliance engineer by trade. After our last audit I'm starting to dig into the task of curating all of our previous audit responses to help look up answers for future audits.

To that end I'm looking for a tool or combination of tools that can process all 30,000 files (Word, Excel, PDF, TXT and image files) and curate them: auto-tag them, pull everything into one big searchable database to search on key words and phrases, etc.

As this is audit data, it would have to stay on-prem, but in my early searches I've found that anything that leverages AI for auto-tagging isn't on-prem.

Any suggestions are appreciated. Really just trying to wrap my arms around it at this point.

11 Comments
2024/10/25
14:00 UTC

50

TikTok Bots Using Layered Video Encoding to Bypass Moderation?

Hey everyone,

I've recently noticed an increase in bot accounts on TikTok posting inappropriate content that promotes OF accounts. However, these accounts don’t seem to get banned, despite violating TikTok’s ToS. After digging into this, I downloaded one of these videos and found something interesting.

When I download the video through TikTok, the frames appear as abstract patterns (like lines over gradient backgrounds). However, when I download the same video externally, it shows the inappropriate content that users are seeing. This leads me to believe that these bots are using a technique where they layer video content, sending one version of the video to TikTok's moderation tools and another version to actual users.

Here’s what I think is happening: The video likely uses layered video encoding, where it has two "layers" or streams—one with harmless frames and another with the actual inappropriate content. It could be manipulating metadata, specifically keyframes and predictive frames, so that TikTok’s AI moderation only detects the innocuous content, while human viewers see the real video. This allows the bots to bypass moderation since TikTok’s AI may be scanning the abstract frames, approving the video, while different frames are shown to users.

  • Has anyone seen or experienced something similar with layered video encoding?
  • How do these bots achieve this separation between frames seen by TikTok’s moderation system and frames seen by users?
  • What tools (FFmpeg, HandBrake, etc.) and techniques might be used to encode videos like this?

Looking forward to your insights on this!

10 Comments
2024/10/23
16:31 UTC

12

dublin core to mods crosswalk transfers

hello, i’m not sure if this is the correct subreddit for this but i’m currently completing a task for school that requires me to create crosswalk transfers from dublin core to mods. i need to convert a sizable amount of dc elements (alongside their respective metadata) but i can’t seem to locate a program that can do this for me—it seems as though i have to manually map each element individually using this guide: https://www.loc.gov/standards/mods/dcsimple-mods.html.

so, i’m pondering these two things: 

  1. am i stupid and is there actually an encoding program that does exist but i just can’t find it? i’ve used this program (https://nsteffel.github.io/dublin_core_generator/generator_nq.html) for a past assignment to generate xml from dc elements so surely there should be one for dc2mods?
  2. if no such programs exist, does this mean that in professional settings massive collections are all encoded by hand? that seems a bit unreasonable and a bad use of time

for example if i have the dc element “bluebird” as the title, can’t i just input it somewhere so it gives me the mods version "<titleInfo><title>bluebird</title></titleInfo>" without having to manually type it all out?

i apologize if this sounds really asinine, pls be kind. i’m incredibly new to the field of metadata and am still a student
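
A crosswalk like this can be scripted rather than typed by hand. The following is a minimal sketch using only Python's standard library, not an existing dc2mods tool. The `dc_to_mods` function and the `CROSSWALK` table are hypothetical names, the input is assumed to be a plain dict of DC element names to lists of values, and the mapping covers only a few elements, so the remaining ones should be checked against the LoC guide linked above.

```python
import xml.etree.ElementTree as ET

MODS_NS = "http://www.loc.gov/mods/v3"

# Partial mapping (an assumption to verify against the guide):
# DC element name -> nested MODS element path.
CROSSWALK = {
    "title": ("titleInfo", "title"),
    "subject": ("subject", "topic"),
    "publisher": ("originInfo", "publisher"),
    "identifier": ("identifier",),
}


def dc_to_mods(record):
    """Convert {dc_element: [values]} to a MODS XML string.

    Returns the XML plus a list of DC elements that had no mapping,
    so they can be handled manually.
    """
    mods = ET.Element("mods", {"xmlns": MODS_NS})
    unmapped = []
    for element, values in record.items():
        path = CROSSWALK.get(element)
        if path is None:
            unmapped.append(element)
            continue
        for value in values:
            node = mods
            # Build one nested wrapper per value, e.g. titleInfo > title.
            for tag in path:
                node = ET.SubElement(node, tag)
            node.text = value
    return ET.tostring(mods, encoding="unicode"), unmapped
```

With `{"title": ["bluebird"]}` as input, the returned string contains `<titleInfo><title>bluebird</title></titleInfo>` inside the `mods` root. Looping this over a spreadsheet or XML export of the DC records would cover a whole collection.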

1 Comment
2024/10/18
22:23 UTC

6

Looking for free bulk image OCR?

Hello, I have thousands of image files that all follow the same format, and I'd like to extract the data from about 20 fields in the images. I currently have 500 images but anticipate gathering many more. Do you know of any free image OCRs with high accuracy and that allow customization of which fields of pixels on the image to pull from? I'll be compiling all of the data into a CSV and there's too much data to split it myself, which is why it's important I find an OCR where I can specify which pixels on the image to look at for each data point. Thank you in advance!

2 Comments
2024/10/10
23:12 UTC

13

Help in applying OCR to 3000 Pages (1.5 GB) scanned PDF file

Hello Guys,

I have a problem that I need to solve. I have a huge file (1.5 GB) of 3000 pages. I need to make it editable/searchable with great precision. What is the best approach for this, and how do I do it, even if it means using an online or cloud tool? The file mostly contains drawings and reports. Any help is appreciated.

3 Comments
2024/10/10
16:26 UTC

0

Do you hate all these invoice(7).pdf filenames? PDFnamer is the Solution

Hi,
I recently launched pdfnamer.xyz
A tool that helps you rename your PDF Files according to their content.
I started this project for myself because I hated searching through PDF invoices when I was doing my VAT return.
If you download or scan PDFs they have all kinds of names (invoice.pdf, 2134343223.pdf, etc.), but none matched my template YYMMDD_Supplier_Topics.pdf (I am a Monk in this regard).
So I created this tool for myself and after a lot of friends and colleagues told me to make it public, I invested some time and created a SaaS around it.
And here we are :)

If you are interested, please check it out. Your feedback is highly welcome!

Regards Christian

Rename your PDFs now: pdfnamer.xyz

30 Comments
2024/10/08
19:07 UTC

7

Is there a way to set a custom thumbnail for a folder in Windows 11?

I do digital art as a hobby so I have a big folder of projects that need organizing. My workflow is that I create a folder for every drawing that I do. Inside is the PSD file and all the reference images I need plus intermediate output files etc. But the end result of any project is just one image. I would love it if that final image were the thumbnail for my folder so that I can get a bird's-eye view of my portfolio and search through it more easily. I don't like relying on names because as we know file names are a complicated topic. Plus if I want to show it to someone they can easily get a quick glance.

Ideally I'd like to be able to say "For each folder look for a file called PREVIEW.jpeg and use it as a thumbnail". Just like a README.

Edit: Also, if there is a way to set the preview of each PSD file, that would be useful.

2 Comments
2024/10/06
07:40 UTC

19

Photo sorting suggestion needed

I have 500 GB+ worth of family photos that my parents keep. They never really sorted anything properly, so it's a complete mess. I wanna make it easier to navigate; it's gonna be hard but possible.

So I wanted to ask if there are any good tools or something that can help me do exactly that? It might be really hard, as many of the extremely old photos are from a digital camera and an old 2008 phone.

If I'm gonna do it myself, I seriously have no damn clue how I'll do it.

13 Comments
2024/10/03
20:15 UTC

5

Monthly /r/datacurator Q&A Discussion Thread - 2024

Please use this thread to discuss and ask questions about the curation of your digital data.

This thread is sorted to "new" so as to see the newest posts.

For a subreddit devoted to storage of data, backups, accessing your data over a network etc, please check out /r/DataHoarder.

0 Comments
2024/09/30
21:00 UTC

0

There is a problem exporting my camscanner word (or OCRed) document

When I export my normal 8-page document, the document becomes 23 pages long, with blank pages and separated paragraphs. Please help.

0 Comments
2024/09/28
03:55 UTC

4

OCR automation software for Windows. Batch OCR converter with folder monitoring

OCR automation software for Windows that can help you batch OCR an entire folder of scanned PDFs. Simply configure any folder in your computer as a magic folder. OCRvision automatically adds an invisible text layer to the scanned PDF document, making it easy to retrieve important information. Try OCRvision today and see how it can streamline your workflow!

https://www.ocrvision.com/

0 Comments
2024/09/24
11:13 UTC

5

Trying to remember name of an unusual photo organising program

I'm trying to find an image viewing/categorisation program that I used a few years ago. I cannot remember its name. It had an unusual way of presenting the images in the collection (directory): they were all shown as items arranged on an infinite canvas. They were not necessarily arranged in a highly ordered way, but might be shown scattered or in clumps. You could zoom in and out very far. You could sort the photos, which would "clump" them together based on the sorting criteria. You could also manually arrange them. You could drag to select a group of photos, then tag them, move them, etc.

It was obviously not the most efficient way to browse or organise photos but it had its appeal. To emphasise, the "canvas" was the program's metaphor for viewing the photos, not any kind of document that was being created.

Does this ring any bells for anyone?

10 Comments
2024/09/24
02:57 UTC

3

Moving files with same name into folder

I am currently in the process of cleaning up all of my different folder systems and consolidating them in a PARA framework. I have run into the issue multiple times where I have a folder with files in it (e.g. Planned Projects.md) and I find a file that is named exactly the same and should go in the same folder. Because I am bulk moving folders, I don't want to rename each file (a pain even with PowerToys); I just want to be able to drop it in the folder and have it "renamed" automatically, like Windows does when creating copies. I currently run Windows 10 and Windows 11. I am very grateful for any tips, tricks and software recommendations.
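
If a small script is acceptable, the behaviour described above can be approximated in Python. This is a minimal sketch under assumptions, not a built-in Windows feature: `move_without_overwrite` is a hypothetical helper, and the " (2)", " (3)" suffix pattern is chosen here only to resemble what Windows does when both files are kept.

```python
import shutil
from pathlib import Path


def move_without_overwrite(src, dest_dir):
    """Move src into dest_dir, adding ' (2)', ' (3)', ... if the name is taken."""
    src = Path(src)
    dest_dir = Path(dest_dir)
    target = dest_dir / src.name
    counter = 2
    # Keep trying new names until one is free in the destination folder.
    while target.exists():
        target = dest_dir / f"{src.stem} ({counter}){src.suffix}"
        counter += 1
    shutil.move(str(src), str(target))
    return target
```

Calling it in a loop over `Path("inbox").iterdir()` would handle a bulk move. The check-then-move step is not atomic, so it is best run while nothing else is writing to the destination folder.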

4 Comments
2024/09/22
21:46 UTC

1

(ab)using git for a collaborative non-chronological historical archive? [ideas wanted]

1 Comment
2024/09/22
19:34 UTC

11

Why is removing exact duplicates still so hard?

8 Comments
2024/09/20
20:41 UTC

17

Where do you put a file, when it belongs in one or more places in a file structure?

Hi All. I recently purchased a NAS, and am in the process of moving and backing up heaps of files, from various places, onto this NAS.

While I am at it, I'll sort them for future reference.

One issue that regularly occurs for me is files that could be dropped in multiple folders within a folder structure system.

Consider, vehicle insurance, under assets/vehicle, or insurances? A health report, under the person's individual folder, or under the Medical folder?

I get to thinking about this and then it just becomes unproductive.

But it got me to thinking about a folder structure which commences again every month, like

-- 2024

---- 01 - January

------ 2024.01.00 - Vehicle Insurance

------ 2024.01.15 - Bobs medical reports

This structure would self-sort by date. It'd be mandatory that all files are named appropriately, with tags added to the filename. Search would be my best friend if I couldn't find the year or month the file belongs to.

Has anybody else set up something like this? It's less of a strict folder structure, and more an organisational system around file creation / retrieval dates.

I'd be interested to get feedback please.

Thanks all.
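
The naming scheme in the tree above is regular enough to generate rather than type. A minimal sketch, assuming the layout shown in the example tree; `dated_path` is a hypothetical helper, and day `0` is used the way the post's "2024.01.00" entry suggests, for items with no specific day.

```python
from pathlib import PurePosixPath

MONTHS = ("January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December")


def dated_path(root, year, month, description, day=0):
    """Build root/YYYY/MM - Month/YYYY.MM.DD - description."""
    if not 1 <= month <= 12:
        raise ValueError("month must be between 1 and 12")
    month_dir = f"{month:02d} - {MONTHS[month - 1]}"
    item = f"{year}.{month:02d}.{day:02d} - {description}"
    return PurePosixPath(root) / str(year) / month_dir / item
```

For example, `dated_path("archive", 2024, 1, "Bobs medical reports", day=15)` gives `archive/2024/01 - January/2024.01.15 - Bobs medical reports`. Zero-padding the month and day is what keeps a plain alphabetical sort in date order.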

20 Comments
2024/09/20
11:36 UTC

4

Using Cleanarr or Maintainarr to Remove Duplicates?

I was going through my Plex content and when I toggled over the library to show duplicated content, I had more than 2800 records. It looks to be about 17 TB worth of storage being taken up by dupes. I'd really like to just have one copy of each show/movie in my library, and I'd like it to be the lower bit-rate (~12-15 Mbps) option. Consequently, The TRaSH Guide ended up adding a few movies from the 1980s with bitrates up around 125. Yikes.

I've tried using Cleanarr, but there's very little documentation for it, and what there is is poorly written. I'm finding that Cleanarr crashes about 20 seconds into a run, only deleting a few tens of files at a go. My file permissions are good, so beyond that I'm at a loss on how to make it work.

People have also said that "Maintainarr is the new Cleanarr" so I also tried spinning up a copy of Maintainarr, but I'm having a hard time figuring out how to set up rules to both identify and choose the dupes I want to remove.

Can anyone guide me in the right direction?

Oh, I've also tried running Plex Duplicate Detector python script, but without a docker with its dependencies supporting it, I can't get it to run on Unraid. (slackware is pretty limited) If I can get it running, I'd be fine using this and just running it once or twice a year to keep the library a little cleaner.

Thanks.

9 Comments
2024/09/18
18:37 UTC

2

OCR translation?

Does anyone know of an OCR tool, other than Capture2Text, that also translates text?

0 Comments
2024/09/18
16:38 UTC

8

Anxiety Log -- Could use some data advice!

Hi all! I have always been obsessed with collecting data for myself using Google Forms, to help with some physical and mental issues I've been encountering. I work in finance.. so have decent skills but am looking for some advice on how you might organize the following data.

Type: Google Form that I fill out during an anxiety episode. Data received from form:

TimeStamp:

Date:

Scale:

Trigger:

Description:

I would love to convert the data into a visual of some sort, to show # of anxiety episodes & severity over a course of time. I'm open to Sheets, Excel, or any other free platform to try!

I will share a screenshot of some data (personal notes removed), and try to link the dummy data as well.

LINK (editable): https://docs.google.com/spreadsheets/d/1zPWbt8oIQociic3wioDW7IxQmVXO-B3DeeYoS3Vnhao/edit

https://preview.redd.it/kfs85b2jiuod1.png?width=883&format=png&auto=webp&s=69fbaae87706585f97b9abc7473857e7c082a85f

I would love to hear any feedback or direction! I also have other response sheets on medication use, and physical symptoms that I'm hoping to integrate after I have a better picture of where to start.
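
One way to get from the raw form responses to something chartable is to summarise them per week first. A minimal sketch in Python, assuming the Date column has been exported as ISO text (YYYY-MM-DD) and Scale is a number; `weekly_summary` is a hypothetical helper, and the same grouping could also be done with a pivot table in Sheets or Excel.

```python
from collections import defaultdict
from datetime import date
from statistics import mean


def weekly_summary(rows):
    """rows: iterable of (iso_date_text, scale) pairs.

    Returns {(iso_year, iso_week): (episode_count, average_scale)}.
    """
    by_week = defaultdict(list)
    for day_text, scale in rows:
        iso = date.fromisoformat(day_text).isocalendar()
        # iso[0] is the ISO year, iso[1] the ISO week number.
        by_week[(iso[0], iso[1])].append(int(scale))
    return {week: (len(scales), mean(scales))
            for week, scales in sorted(by_week.items())}
```

Each entry gives the number of episodes and the average severity for one week, which maps directly onto a bar chart (count) with a line overlay (average scale).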

0 Comments
2024/09/14
21:49 UTC

2

YouTube channels or playlists

I've just started dipping my toe into archiving YouTube channels, and in some cases just certain playlists. Wondering what channels/playlists others think are worth archiving?

2 Comments
2024/09/13
13:23 UTC

11

Entry Level Archivist Seeks Advice

Hello!

I'm a recent graduate of a master's program and am beginning to build my career as an archivist. I am among the candidates for a project to establish an archive of alumni records held in an offsite archive center. These are hard-copy records I would parse through and create an inventory for the org's permanent usage (not an exhibition). I've worked on numerous archiving projects, almost always dealing with textiles and garments, but in those cases, I entered a job with already established archival procedures and proprietary software. I'm seeking advice on how I can approach this project as a consultant. Do you have any recommendations for how I can establish archiving procedures for a project of this nature? How might I log this kind of data and inventory any additional material for individual alums? Is there any software you recommend aside from Microsoft/Google spreadsheets? Any advice would be greatly appreciated :)

2 Comments
2024/09/11
14:32 UTC

27

How do you organize your data when you have many digital hobbies? (Music, videos, art, programming, etc.)

I'm always curious how other people do this and I feel like there's gotta be a way to do this more effectively.

So I have a number of hobbies: music production, video production, programming, digital art, and animation.

All of this stuff lives on my D:/ Data drive. I also have an R:/ Resources drive that contains just big sets of downloaded data, like music sample packs, video asset packs, sound effect libraries, etc. It's the kind of stuff I wouldn't lose sleep over if I lost it, but it would suck. So that gets backed up to my server but doesn't also get backed up to the cloud like my Data drive does.

Overall the issue I have is when things bleed between "areas". So the difference between a music project file for my band, or a music project file for a background track for a video. I typically store the final mix in the folder itself, but when I'm creating assets, those final .wav files are best viewed in a single folder, but then where do I store the project files?

Then there's stuff like graphics for videos, pictures that I save that I didn't make but I like the look of (wallpapers, inspo, etc.), and digital art. Plus digital art I made myself or for a client.

Then with like sound effects, I have sound effects for videos but sometimes I do like to use those for music, too. And I have samples for music (drum loops, instrument loops, maybe samples I've made) but sometimes I like to use those for videos, too.

Not asking for direct answers to these questions, just overall trying to paint a picture of the frustrations of organizing data for multiple areas.

I think there's essentially 2 ways I could do this:

  1. Generalize everything to asset type. Keep music together, keep audio files together, keep image files together, just keep things together based on what type of "art" they are.
  2. Specify everything to specific areas. Have a clear video production area, music production area, graphic design area, and don't cross boundaries. Potentially also allow myself redundant data if I have sound effects for videos that also could work in musical contexts (booms, transitions, etc.)

Curious if anyone else deals with this and how you structure your files! Would love to see some file trees if possible.

15 Comments
2024/09/11
00:25 UTC
