Best Practices for Data Integrity and Security are Frequently Written in Blood

Gavin

In my job I touch on a lot of technologies. That’s good, because otherwise I would probably get pretty bored! But every now and again a good question comes up about an aspect of technology that I think bears publishing an article about.

The question in its unedited form;

Serious but probably stupid question. No one in my household has more than a halfway remedial knowledge regarding computers. (My partner was a self-proclaimed “technotard” before our activist teenage daughter gave him some much needed “sensitivity training” lol). All of these comments are talking about taking, what seems to me at least, extreme measures to back up their important files and photos.


For all my pics and important docs I scan or just upload to my Google drive or Amazon photos account. For pictures alone, is using a cloud-type storage not sufficient? I have let my Amazon account lapse before and was able to access it again months later with all the photos I ever uploaded still stored. Is there any threat that I could somehow lose them on either of these? Of course barring a complete breakdown of society.

It of course got me thinking about an appropriate response. While I won’t post the exact response, here’s the meat (slightly expanded) of what I posted;

Is Cloud Storage Best for Photos?

We can probably short-circuit this entire article with the simple answer: for the questioner’s use case, yes, cloud storage is probably just fine.

So why the long build up for such a terse little answer? Well, there’s a lot more to it than that, and of course I am going to delve into that as we proceed. There are some factors that you need to consider when deciding how to store your photos.

Integrity

If you’re not tech savvy and want to offload the responsibility of ensuring your data integrity to a third party, then cloud-based systems are just fine. The only caveat is that none of the providers make any guarantees as to the long-term viability of your data… for example, will you be able to retrieve it in a decade? The answer is likely yes, but they don’t guarantee that. Not even slightly. They don’t even guarantee your data will be there tomorrow. Statistically speaking, the number of things that can go wrong and render your data irretrievable is actually pretty small. But it can happen, and it has happened to others.

Data integrity in “The Cloud” is “best effort”

Security

There’s also the privacy aspect of entrusting your data to a third party who promise to do their best to ensure your data is secure. Worked well for Scarlett Johansson… and Jennifer Lawrence (the iCloud hack nicknamed “The Fappening” if you want to read up on what I’m talking about).

Simply put; you are entrusting your data to a third party that actually has only a passing amount of responsibility to secure it. Depending on how sensitive you consider the information at hand, that could prompt either a “Meh, I don’t care” or a “WTF, that’s insane!” sort of response. Cloud storage largely relies on the idea that in a literal sea of information, your little puddle will basically go unnoticed; a bit of “Security by Obscurity”.

But that’s not always the case as can be seen with any number of hacks and cracks that have put all of our personal information out there in the last few years.

Data security in “The Cloud” is “best effort”

But is it best for photos?

With photos in particular you also have to note that only specialist providers actually give you the ability to store raw photos in any significant quantity. Almost everywhere else, your photos are compressed in some way that loses data. This doesn’t mean much when you’re talking about snapshots of your kids at the zoo (well, it might… but that’s another thing).

But I do studio-type photography and portraits for professionals as a bit of a sideline, so I want to keep the RAW data around (and it can be huge). That lets me clean up, edit and re-balance shots in ways you can’t with JPEG or other lossy compression schemes, because they’ve lost so much of the original data to compression. That loss is acceptable to most people because a JPEG at 10MP resolutions and up (pretty typical for a cellphone camera) looks good enough on most modern monitors… even 4K.

But here’s the rub; you don’t really have a 10MP picture; you have roughly a 4MP picture (an example, not literally) with artifacts and literal missing data. To get 10MP or above you need to be using RAW format… and don’t even get me started on lenses…

I have literally just gone into my Photoshop and created a quick albeit “worst case” scenario. I took a 12MP shot from a Canon camera, shot in RAW with a good quality lens on a good day. I then compressed it using a pretty typical loss factor for JPEG in order to see the differences at high zoom. This is the result I got;

Yes, it’s a picture of the front of my car…

As you can see, there’s a pretty significant loss of information on the right in the interest of creating a smaller image. Now granted this is at a high level of zoom and as such when zoomed out you probably wouldn’t notice the difference. But this is exactly the profile used by Instagram and Facebook… did you know your photos were compressed this much?

And it gets worse. Your compressed photos on your phone? When you upload them to Facebook or Instagram they’re almost certainly compressed again, which can make it even worse. There’s just not a good way realistically to recover the data that was lost.
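To see why that lost data can’t be recovered, here is a minimal sketch in Python. It is a simplification: real JPEG quantizes frequency coefficients rather than raw pixel values, and the sample values and step sizes below are made up for illustration. The principle is the same though; rounding throws information away, and a second round of rounding on top of the first drifts further from the original.

```python
def quantize(values, step):
    """Round each value to the nearest multiple of `step` (a lossy operation)."""
    return [round(v / step) * step for v in values]


def total_error(reference, candidate):
    """Sum of absolute differences between two equal-length lists."""
    return sum(abs(r - c) for r, c in zip(reference, candidate))


# Hypothetical brightness values from one row of an image.
original = [12, 14, 18, 101, 104, 107, 200, 203]

# First lossy pass, standing in for the compression done on the phone.
first_pass = quantize(original, 10)

# Second lossy pass with a different step, standing in for a re-compress on upload.
second_pass = quantize(first_pass, 12)

print("distinct values in original:  ", len(set(original)))
print("distinct values after 1 pass: ", len(set(first_pass)))
print("error after 1 pass:           ", total_error(original, first_pass))
print("error after 2 passes:         ", total_error(original, second_pass))
```

Several different original values collapse onto the same rounded value, so there is no way to work backwards from the compressed row to the original one.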

So what do you do for your data?

Would I entrust all my data to a cloud provider? Oh hell no. But that’s not specifically because of any of these reasons I’ve already cited, though they are all factors. I don’t have to; I am a technical specialist and build large computer systems (including archival storage systems) for a living… so doing the same thing on a smaller scale in my own home is actually a bit of a hobby to me and I can leverage the tools and knowledge I have professionally for exactly this purpose. It’s fun to do this (to me) and I love being able to entrust my RAW photos to storage that I know is highly unlikely to lose or corrupt that data. Because I can.

I also back that data up in two places in my house and one outside (Amazon S3 Glacier storage… encrypted when I upload it). I intend to add another one as soon as my neighbor friend has gotten his Internet installed in his new house and we start backing up each other’s critical data to each other’s arrays. Do I need to do all this? No… but as I said, it’s also a hobby of mine.

Why so serious?

Years ago I lost a bunch of photos and documents to a hard drive failure. It took out a bunch of pictures of my kids when they were little, pictures from my first house, my first actually new car, my wedding day and so on. Documents weren’t quite as critical, but I know I lost a lot that day. As a result, I tend to lean toward storage that is data-centric and thus provides a modicum of protection from “bit-rot”.

For those that don’t know; bit-rot is when a random bit or series of bits in a piece of data like a file becomes corrupted. This could be due to a hard drive just aging, or could be due to literally a cosmic ray hitting the platter. It could also occur due to the shifting of the Earth’s magnetic fields (in theory, though in practice this happens far too slowly to be a likely source of issues). In essence, bit-rot happens, and WILL happen to all data stored on magnetic media or in transistors over time. It is also referred to as silent corruption because virtually no storage solutions have a way to detect it. They will just faithfully return the data that comes from the media itself whether it’s “right” or not.
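The general way to catch this kind of corruption is to record a checksum of each file while it is known to be good, then recompute and compare later. The following is a minimal sketch in Python of that idea, not how any particular product does it; the function names are hypothetical. Note that it flags any change at all, so it is only a corruption detector for files you never intend to edit, such as finished photos.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(folder):
    """Map each file's path (relative to `folder`) to its SHA-256 digest."""
    folder = Path(folder)
    return {
        str(path.relative_to(folder)): sha256_of(path)
        for path in sorted(folder.rglob("*"))
        if path.is_file()
    }


def find_changed(folder, manifest):
    """Return the names from `manifest` whose contents now differ or are missing."""
    current = build_manifest(folder)
    return sorted(
        name for name, digest in manifest.items() if current.get(name) != digest
    )
```

In use, you would call `build_manifest` once on a photo folder, keep the result somewhere safe, and run `find_changed` against it periodically. A name showing up in the result means that file no longer matches what was originally stored.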

There are filesystems that are designed to detect and even correct bit-rot, and I happen to use one of these called ZFS. It does this by storing a checksum for every block of data and using a chunk of your space for redundant copies or parity spread across multiple devices. The odds of silent corruption going unnoticed are then drastically reduced to near zero, because ZFS can detect and even correct errors if your array is set up correctly. The cost? Well, in my main storage array I have 12x 4TB hard drives to store my data. Because of the overhead of that redundancy I actually only have about 23TB of usable space on that array instead of the 48TB you’d predict from the raw sizes. But to me the protection from bit-rot is invaluable.
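As a rough sanity check on those numbers, here is a small Python sketch of the arithmetic. The exact pool layout isn’t given above, so the two-way mirror used here is purely an assumption for illustration; other layouts trade space for protection differently. The conversion to TiB is included because drives are sold in decimal terabytes while many tools report binary units.

```python
DRIVES = 12                # number of drives in the array (from the text)
DRIVE_TB = 4               # size of each drive in decimal terabytes (from the text)
REPORTED_USABLE_TB = 23    # approximate usable space quoted in the text

raw_tb = DRIVES * DRIVE_TB
usable_fraction = REPORTED_USABLE_TB / raw_tb


def mirrored_capacity_tb(drives, drive_tb, copies=2):
    """Capacity if every block is kept on `copies` drives (an assumed layout)."""
    return drives * drive_tb / copies


def tb_to_tib(terabytes):
    """Convert decimal terabytes (10**12 bytes) to binary tebibytes (2**40 bytes)."""
    return terabytes * 10**12 / 2**40


print(f"raw capacity:            {raw_tb} TB")
print(f"reported usable share:   {usable_fraction:.0%}")
print(f"two-way mirror capacity: {mirrored_capacity_tb(DRIVES, DRIVE_TB):.0f} TB "
      f"({tb_to_tib(mirrored_capacity_tb(DRIVES, DRIVE_TB)):.1f} TiB)")
```

Under that assumed layout roughly half the raw space goes to the second copy of everything, which is in the same ballpark as the usable figure quoted above.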

But what about backups?

Backups can also protect from bit-rot simply because they are a separate copy of your data that you store in a different location. However, this is “cold” (or possibly “warm”) storage, so even if you do detect bit-rot you still have to restore the data from a backup. And the problem is that unless you detect it immediately, how do you know how far back you have to go in your backups to find a good copy? It can be a challenge. For that reason, your backups need to be immutable; that is, they never change. If a file is altered, the backup must be versioned instead of overwritten; a new copy is created while the old copy is kept as well. This is critical to a good backup methodology; just constantly creating fresh backups and discarding the old ones opens you up to silent corruption, because a corrupted file will eventually replace every good copy you had.
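To make the “versioned, never overwritten” idea concrete, here is a minimal Python sketch. It is only one possible approach and the function names are hypothetical: each backup copy is named after a hash of the file’s contents, so a changed file always lands under a new name and the older copy is left alone. It reads the whole file into memory to hash it, which is fine for a sketch but not ideal for very large files.

```python
import hashlib
import shutil
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def backup_versioned(source, backup_dir):
    """Copy `source` into `backup_dir` under a name that includes its content hash.

    Existing backup files are never overwritten: unchanged content maps to the
    same name and is skipped, changed content maps to a new name.
    """
    source = Path(source)
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)

    digest = file_digest(source)
    target = backup_dir / f"{source.stem}.{digest[:12]}{source.suffix}"
    if not target.exists():
        shutil.copy2(source, target)  # copy2 also preserves timestamps
    return target
```

If a photo silently corrupts and gets backed up again, the corrupted version shows up as a new file next to the good one rather than replacing it, so the good copy is still there to restore.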

Is there a solution that’ll work for me?

Bear in mind my problem of lost data was years ago, on a technology you probably aren’t using any more (spinning rust hard drives). I made a stupid decision not to have good backups… I learned that one the hard way but I’ve moved on from it thankfully. For most people, I would say the “Cloud” is perfect. It’ll suit your needs; statistically you probably won’t ever have a data loss problem and it just doesn’t require any real thought on your part.

HOWEVER, I will caveat that with the idea that you SHOULD have another backup. Yes, you have your photos probably on your phone and computer, and uploaded to the cloud. Great. Copy them to another cloud. Sign up for SmugMug or some such and upload your photos there as well… because you never know when that data may suddenly become inaccessible.

There are plenty of horror stories out there about people whose accounts got hacked, or locked, or deleted… rendering all their data inaccessible. Having multiple accounts with different passwords is vital in my opinion in case a service gets hacked. Use a password manager like LastPass (I use Dashlane personally) to ensure your cloud accounts are locked up and change your passwords often.
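A password manager will generate strong, unique passwords for you, but for the curious, this is roughly what that looks like under the hood. It is a minimal Python sketch using the standard library’s `secrets` module; the service names are placeholders, and the length and character set are arbitrary choices for illustration.

```python
import secrets
import string

# Characters to draw from: letters, digits and punctuation.
ALPHABET = string.ascii_letters + string.digits + string.punctuation


def generate_password(length=20):
    """Build a random password using a cryptographically secure generator."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


# One independent password per account (placeholder service names).
passwords = {
    service: generate_password()
    for service in ("photos-cloud-a", "photos-cloud-b")
}

for service, password in passwords.items():
    print(f"{service}: {len(password)} characters generated")
```

The point is simply that each account gets its own password that has nothing in common with the others, so one service being breached doesn’t hand over the keys to the rest.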

Don’t let stories like mine scare you. I just happen to have enough caution with technological solutions that I tend to overdo redundancy at times. But as I’m fond of saying to my customers; “Best practices for data integrity and security are frequently written in blood.” I have had experiences I’ve learned from… and I’ve got the knowledge and skills to build something that I’m comfortable entrusting my data to.

I have bled for this knowledge, as have others in my field. That does mean that today’s cloud providers are better than ever at ensuring the integrity of your data because they use many of the same tools I do.

The average home user doesn’t use those tools… and shouldn’t have to. We as technology experts are entrusted with making these complex things accessible to people like yourself… and honestly your cloud backup is probably backed by something not dissimilar to what I’ve built, just on a far larger scale.
