
I'm working on a utility to archive and organize old data that I want to keep forever, but I don't want cluttering up my local hard drives and Dropbox account. The initial goals of the project are:

* Design for cold data only. Stuff that is done changing and won't be accessed regularly, if at all: completed projects, annual financial records, RAW image files organized by month, etc.

* Store items in flat collections, not in folder hierarchies. Store directories as compressed archives. I'm not a librarian and I find folder hierarchies difficult to maintain.

* Store everything such that the data will still be easily accessible even if the index is lost or the software stops working. Use well-known formats and human-readable file names.

* Automatically store data in multiple locations, including S3-compatible services, Amazon Glacier, file servers, or local disks. Allow each collection of items to have its own mix of storage locations.

* Organize within collections using tags and metadata.

* Provide a simple checkout system to download items when needed.

I have the core features working and I am now building the desktop application, which I intend to be cross-platform. However, I've never written a desktop application before, let alone a cross-platform one, so development has slowed while I learn and experiment.
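The multi-location goal above could be sketched as a small backend abstraction, where each collection carries its own list of storage targets and adding an item fans out to all of them. This is a minimal illustrative sketch, not the actual project's code: the names StorageBackend, LocalBackend, and Collection are hypothetical, and a real S3 or Glacier backend would implement the same interface with API calls instead of file copies.

```python
# Hypothetical sketch: each collection has its own mix of storage
# backends, and storing an item writes it to every backend.
from abc import ABC, abstractmethod
from pathlib import Path
import shutil


class StorageBackend(ABC):
    @abstractmethod
    def store(self, item: Path) -> None: ...


class LocalBackend(StorageBackend):
    """Copies items into a local directory. An S3 or Glacier
    backend would implement store() with upload calls instead."""

    def __init__(self, root: Path):
        self.root = root
        root.mkdir(parents=True, exist_ok=True)

    def store(self, item: Path) -> None:
        shutil.copy2(item, self.root / item.name)


class Collection:
    """A flat collection: a name plus its own list of backends."""

    def __init__(self, name: str, backends: list[StorageBackend]):
        self.name = name
        self.backends = backends

    def add(self, item: Path) -> None:
        # Fan out to every configured storage location.
        for backend in self.backends:
            backend.store(item)
```

Because every item lands in each location under its human-readable file name, the data stays accessible even if the index or the software is lost, per the goals above.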



I wrote a similar app for archivists to push materials into preservation repositories. I used Electron, since it has good cross-platform support. The source is at https://github.com/aptrust/dart with documentation at https://aptrust.github.io/dart-docs/users/getting_started/

The underlying JavaScript code in that app started getting messy because I was working on several other projects simultaneously, but you might find it useful to play with as you consider your desktop app.

The archival community uses a simple text-based packaging format from the Library of Congress called BagIt, which allows you to include metadata and checksums with your archived materials so you can ensure their integrity and make sense of them when you get them back.
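To make the BagIt idea concrete, here is a minimal stdlib-only sketch of creating a bag: a data/ payload directory, a bagit.txt declaration, and a SHA-256 payload manifest, following the structure from RFC 8493. The make_bag function name is illustrative; in practice you'd likely use the Library of Congress's bagit Python package, which provides a function of the same name.

```python
# Minimal BagIt bag sketch (structure per RFC 8493): copy the payload
# into data/, write the bag declaration, and record SHA-256 checksums.
import hashlib
from pathlib import Path


def make_bag(source: Path, bag_dir: Path) -> None:
    data = bag_dir / "data"
    data.mkdir(parents=True)
    manifest_lines = []
    for f in sorted(source.rglob("*")):
        if f.is_file():
            rel = f.relative_to(source)
            dest = data / rel
            dest.parent.mkdir(parents=True, exist_ok=True)
            dest.write_bytes(f.read_bytes())
            digest = hashlib.sha256(dest.read_bytes()).hexdigest()
            manifest_lines.append(f"{digest}  data/{rel.as_posix()}")
    # Bag declaration: version and tag-file encoding.
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n")
    # Payload manifest: one "checksum  path" line per payload file.
    (bag_dir / "manifest-sha256.txt").write_text(
        "\n".join(manifest_lines) + "\n")
```

Verifying integrity later is just recomputing each file's checksum and comparing it against the manifest, which works even if the original software is long gone.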

Anyway, you're working on an interesting problem. I'd be interested to see how it goes.


Thanks! Your project looks like great inspiration and BagIt might be very useful too.


For anyone interested in these features today, check out Git Annex. It fits all these requirements except tagging. You add files just as with git, then run git annex copy file --to some-remote. It's intended for large files, but you can zip directories too if you like. I personally prefer directory organization, but that's optional.


FWIW, I've often thought of building a cold storage cloud for this type of stuff: basically the same functionality everyone has (API, web GUI, etc.), except files need to be requested and may take some time to become hot/available to the user. It's really just because I think it's silly that the only reason I pay $100+/year to my provider is that I have some archived videos/photos that put me over their free limit. I never touch those files but don't want to get rid of them either. (I realize I could store them myself, but then I'm the one responsible if they get lost.)


Have you got a site/email list/github/twitter I can follow for a release announcement?


Not yet, but you can email me at the address in my profile and I'll let you know when something is available.


What's a flat collection hierarchy?


I just mean that a collection has no subfolders or other structure. It is simply a list of items like an S3 bucket.


How do you find an item then? I've read numerous research studies showing that people still prefer navigation over search. Ofer Bergman has done a lot of work in this area.


The thought is that collections should be homogeneous so that for most use cases,

* The number of items would be so small that search would not be necessary, e.g. a collection of personal projects

* The items would fall naturally into a timeline, so you can find things trivially by scrolling, e.g. RAW photos grouped together by month

* The items would be easily identified by name, e.g. MP3 files grouped by album (why am I still holding onto these?)

The intention is not to upload thousands of individual files in a jumble, but a much smaller number of archives. For example, if you are archiving the previous semester's homework assignments, instead of uploading a bunch of random documents, each item would be an archive of the assignments from a particular class. You could tag each item with 'Fall 2020' if you want to improve the organization. I'm intending to make that an easy process, where you point the program at a directory and it packages, tags and uploads each subfolder.
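The point-at-a-directory workflow could be sketched roughly like this: each subfolder becomes one zip archive named after the folder, with a tag recorded alongside it. Everything here is illustrative, not the actual tool: the function name package_subfolders is hypothetical, the sidecar .tags.txt file stands in for whatever index the real program keeps, and the upload step is omitted.

```python
# Hypothetical sketch: point at a parent directory and package each
# subfolder into its own archive, tagging each resulting item.
import shutil
from pathlib import Path


def package_subfolders(parent: Path, out_dir: Path, tag: str) -> list[Path]:
    out_dir.mkdir(parents=True, exist_ok=True)
    archives = []
    for sub in sorted(p for p in parent.iterdir() if p.is_dir()):
        # e.g. subfolder "CS101" -> out_dir/CS101.zip
        archive = shutil.make_archive(str(out_dir / sub.name), "zip", sub)
        # Record the tag next to the archive (stand-in for a real index).
        (out_dir / f"{sub.name}.tags.txt").write_text(tag + "\n")
        archives.append(Path(archive))
    return archives
```

Running it over a semester's directory would yield one human-readably named zip per class, each tagged 'Fall 2020', ready to hand off to the storage backends.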



