Making Obnam faster

Lars Wirzenius (liw@liw.fi)

2015-11 Debian miniconf, Cambridge

Abstract

Obnam is the backup program I have been developing for about ten years. It seems to have mostly the right feature set, and is easy to use. Unfortunately, it is only barely faster than breeding a genetically engineered leopard whose spots encode the data that gets backed up. This talk covers my efforts to speed up Obnam so that it is usable for people with real amounts of data and finite patience.

Obnam is a backup program

store backups on disk (not tape)
every generation is a full snapshot (not a delta from a previous backup)
data is de-duplicated in backups

Glossary

live data: the data you actually use
repository: where the backup is stored
repository format: on-disk data structures for backups
generation: one copy of live data
chunk: a piece of data from a live data file
metadata: any data about a live data file that is not its contents (filename, stat(2), extended attributes, ...)

Obnam is not fast: Real data (1)

Home file server to home backup server.

4.6 TiB, 560 thousand files
sftp over gigabit to another local machine
initial backup: 2 weeks, or about 4 MiB/s
and 0.5 files/s
incremental: at least two hours with no changes (about 80 files/s)

This is not impressive.

Helllooooo, leopard spots.

Obnam is not fast: Real data (2)

Laptop to remote backup server.

350 GiB, 170 thousand files
sftp over 50/5 Mbit/s cable modem to remote server, 50 ms ping
initial backup: measurement not recorded
incremental: at least three hours with no changes (20 files/s)

This backup is encrypted. The image is of an encrypted leopard.

Obnam is not fast: Real data (3)

Laptop to home backup server.

Same data as before.
SFTP over wifi to local server, ping times a few ms
Initial backup: overnight
Incremental, no changes: 5 mins (600 files/s)

This is acceptable, not just tolerable. No leopard for you.

Obnam is not fast: benchmarks

http://benchmark.obnam.org/e2obbench-6-v1/html-6/
many files: 1 million 1-byte files, to local disk:
- initial backup: 6195 seconds (103 minutes), or about 161 files/s
- incremental (no changes): 1015 seconds (17 minutes), or about 985 files/s
one big file: one 10 GB file, to local disk:
- initial backup: 3409 seconds (57 minutes), or about 3 MiB/s
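
The rates quoted above follow from the measured times. A small sketch of the arithmetic, assuming that "10 GB" means 10 * 10^9 bytes (the slide does not say which kind of gigabyte); the function names are made up for illustration:

```python
MIB = 1024 * 1024


def files_per_second(files, seconds):
    """Average number of files processed per second."""
    return files / seconds


def mib_per_second(size_bytes, seconds):
    """Average throughput in MiB/s."""
    return size_bytes / seconds / MIB


# Many files: 1 million files, times taken from the benchmark above.
initial_rate = files_per_second(1_000_000, 6195)
noop_rate = files_per_second(1_000_000, 1015)

# One big file: ASSUMPTION, 10 GB = 10 * 10**9 bytes.
big_file_rate = mib_per_second(10 * 10**9, 3409)

print(f"initial: {initial_rate:.0f} files/s")
print(f"no-op: {noop_rate:.0f} files/s")
print(f"big file: {big_file_rate:.1f} MiB/s")
```

With binary gigabytes (GiB) the last figure would come out slightly higher; either way it rounds to about 3 MiB/s.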

High level overview of how Obnam works

For each new or changed live data file:

read a chunk of data
is the chunk already in the repository? if not, upload it, else reuse the existing chunk
store file's metadata and list of chunks
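
The steps above can be sketched roughly like this. This is not Obnam's code: the Repository class is an in-memory stand-in, the chunk size is tiny so the example is easy to follow, and SHA-256 is an arbitrary choice of checksum.

```python
import hashlib

CHUNK_SIZE = 4  # tiny, for illustration only


class Repository:
    """Hypothetical in-memory stand-in for a backup repository."""

    def __init__(self):
        self.chunks = {}   # chunk id -> chunk data
        self.index = {}    # checksum -> chunk id
        self.files = {}    # filename -> (metadata, list of chunk ids)
        self.uploads = 0   # how many chunks were actually uploaded

    def find_chunk(self, checksum):
        return self.index.get(checksum)

    def upload_chunk(self, checksum, data):
        chunk_id = len(self.chunks)
        self.chunks[chunk_id] = data
        self.index[checksum] = chunk_id
        self.uploads += 1
        return chunk_id


def backup_file(repo, filename, content, metadata):
    """Back up one file: de-duplicate its chunks, then store its metadata."""
    chunk_ids = []
    for offset in range(0, len(content), CHUNK_SIZE):
        data = content[offset:offset + CHUNK_SIZE]
        checksum = hashlib.sha256(data).hexdigest()
        chunk_id = repo.find_chunk(checksum)
        if chunk_id is None:
            # Not in the repository yet: upload it.
            chunk_id = repo.upload_chunk(checksum, data)
        chunk_ids.append(chunk_id)
    repo.files[filename] = (metadata, chunk_ids)


repo = Repository()
backup_file(repo, "a.txt", b"aaaabbbbaaaa", {"mode": 0o644})
backup_file(repo, "b.txt", b"bbbbcccc", {"mode": 0o644})
print(repo.uploads)            # 3 distinct chunks uploaded, out of 5 read
print(repo.files["a.txt"][1])  # [0, 1, 0]
```

Five chunks are read, but only three are uploaded: the repeated "aaaa" and "bbbb" chunks are reused.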

Backup repository structure

The chunks
- just files with random names (NOT content hash), in a directory tree
- no locks required
Chunk indexes
- find chunks whose data has a given checksum
- a Larch B-tree
Per-client data
- generations for that client
- filenames, other metadata for each generation
- a Larch B-tree

This is simplified, but not incorrect.
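
A minimal sketch of the chunk part of that structure, under loud assumptions: the class and method names are invented, a plain dict stands in for the Larch B-tree, the directory layout is made up, and SHA-256 is an arbitrary checksum. It only illustrates the idea of randomly named chunk files plus a separate checksum index.

```python
import hashlib
import os
import tempfile
import uuid


class ChunkStore:
    """Hypothetical chunk store: random file names, checksum index on the side."""

    def __init__(self, root):
        self.root = root
        self.index = {}  # checksum -> list of chunk ids (stand-in for a B-tree)

    def _path(self, chunk_id):
        # ASSUMPTION: one subdirectory level, named after the id's first 2 chars.
        return os.path.join(self.root, chunk_id[:2], chunk_id)

    def put(self, data):
        chunk_id = uuid.uuid4().hex  # random name, NOT derived from content
        path = self._path(chunk_id)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as f:
            f.write(data)
        checksum = hashlib.sha256(data).hexdigest()
        self.index.setdefault(checksum, []).append(chunk_id)
        return chunk_id

    def get(self, chunk_id):
        with open(self._path(chunk_id), "rb") as f:
            return f.read()

    def find(self, data):
        """Return ids of all stored chunks whose checksum matches data's."""
        checksum = hashlib.sha256(data).hexdigest()
        return list(self.index.get(checksum, []))


with tempfile.TemporaryDirectory() as root:
    store = ChunkStore(root)
    first = store.put(b"hello")
    second = store.put(b"hello")  # same content, different random name
    found = store.find(b"hello")
    restored = store.get(first)

print(found == [first, second], restored)
```

Because the names are random, two writers are very unlikely to pick the same file name, which is one plausible reading of the "no locks required" point above.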

Why is Obnam slow?

a lot of round trip times to the repository
- every non-empty file is 1+ chunks, each chunk is a round trip
Larch copy-on-write B-trees are fun, but slow
- might be possible to optimise, but never going to be quite fast enough
- no local caching, either
- also not used very wisely
too much data in memory

Strategy

measure, change, measure again, analyse, merge/discard
- Python profiler, custom instrumentation
- systematic benchmarks: http://benchmark.obnam.org/
simplify, have fewer (but more sensible) abstraction layers, making it easier to reason about the program
design a new on-disk data structure: FORMAT GREEN ALBATROSS
- don't break compatibility with existing backups
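
As an illustration of the "measure" step, here is a minimal sketch using the Python standard library profiler (cProfile and pstats). The workload is a made-up toy, not Obnam code; the point is only the shape of a measurement run.

```python
import cProfile
import io
import pstats


def slow_lookup(items, wanted):
    # Deliberately slow: linear search in a list.
    return wanted in items


def workload():
    items = list(range(2000))
    hits = 0
    for wanted in range(2000):
        if slow_lookup(items, wanted):
            hits += 1
    return hits


profiler = cProfile.Profile()
profiler.enable()
result = workload()
profiler.disable()

# Collect the report into a string, sorted by cumulative time.
out = io.StringIO()
stats = pstats.Stats(profiler, stream=out)
stats.sort_stats("cumulative").print_stats(5)
report = out.getvalue()
print(report)
```

Running the same measurement before and after a change (for example, replacing the list with a set) is what makes it possible to decide whether to merge or discard the change.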

Strategy, concretely

avoid unnecessary work
- calling fast code is slower than not calling slow code
avoid unnecessary round trips
- ping time to my off-site backups: about 50 ms
- that means 20 SFTP round trips per second!
avoid rewriting existing files in the repository
- will enable locally caching repository objects, avoiding round trips for cached objects
faster lookup structures
- Git-like tree objects (metadata about tree and its files, plus refs to its subdirectories)
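
The round trip arithmetic can be written down explicitly. The 50 ms ping is from the slide; the assumption that round trips happen strictly one after another, and the example count of 170 thousand round trips (one per file in the laptop example), are mine and purely illustrative.

```python
def round_trips_per_second(ping_seconds):
    """Upper bound on sequential round trips per second."""
    return 1.0 / ping_seconds


def min_seconds(round_trips, ping_seconds):
    """Lower bound on elapsed time if round trips are done one at a time."""
    return round_trips * ping_seconds


PING = 0.050  # 50 ms, from the slide

rate = round_trips_per_second(PING)

# HYPOTHETICAL: 170 thousand files, exactly one round trip each.
seconds = min_seconds(170_000, PING)

print(f"{rate:.0f} round trips/s")
print(f"{seconds / 3600:.1f} hours for 170000 sequential round trips")
```

Under those assumptions the network latency alone costs over two hours, before any actual work is done, which is why avoiding round trips matters so much.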

Useful changes

DIR objects that contain all the metadata for files in the directory, plus references to DIR objects for subdirectories
- this is a simpler data structure that already seems to be faster than Larch
combine DIRs and chunks into bags
- a no-brainer
most of the rest is just avoiding doing unnecessary stuff
- this is not always easy
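
A rough sketch of what DIR objects and bags might look like. Everything here is an assumption made for illustration: the class names, the fields, and especially the object ids (a bag id plus an index into the bag). The real on-disk format is not described on the slide.

```python
from dataclasses import dataclass, field


@dataclass
class DirObject:
    """Metadata for all files in one directory, plus refs to subdirectories."""
    files: dict = field(default_factory=dict)    # basename -> metadata
    subdirs: dict = field(default_factory=dict)  # basename -> id of child DirObject


@dataclass
class Bag:
    """Several objects stored together as one repository file."""
    bag_id: int
    objects: list = field(default_factory=list)

    def add(self, obj):
        self.objects.append(obj)
        # HYPOTHETICAL object id: (bag id, position within the bag).
        return (self.bag_id, len(self.objects) - 1)


def lookup(bags, root_id, path):
    """Walk from the root DirObject down to the metadata of the file at path."""
    bag_id, index = root_id
    current = bags[bag_id].objects[index]
    *dirs, basename = path.split("/")
    for name in dirs:
        bag_id, index = current.subdirs[name]
        current = bags[bag_id].objects[index]
    return current.files[basename]


# Children are added before parents, so a parent can refer to them by id.
bag = Bag(bag_id=1)
docs_id = bag.add(DirObject(files={"talk.txt": {"size": 1234}}))
root_id = bag.add(DirObject(files={"README": {"size": 10}},
                            subdirs={"docs": docs_id}))
bags = {1: bag}

print(lookup(bags, root_id, "docs/talk.txt"))  # {'size': 1234}
print(lookup(bags, root_id, "README"))         # {'size': 10}
```

In this sketch both DIR objects live in one bag, so fetching that single bag is enough to answer lookups for the whole (tiny) tree.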

Not so useful

not measuring before making changes
putting all GA bags in the same directory: filesystems can't cope well
- find MY/DIR -delete
setting in-memory cache sizes to 0
not measuring after making changes
slow, stupid algorithms/data structures in memory
NOT MEASURING
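
One common way around the "everything in one directory" problem is to spread files over subdirectories named after part of their id. The slide does not say what layout was eventually chosen, so the function below, its parameters, and the use of random 64-bit ids are all assumptions for illustration.

```python
import os
import random
from collections import Counter


def bag_path(bag_id, levels=2, width=2):
    """Map an integer bag id to a path with `levels` subdirectory levels.

    HYPOTHETICAL layout: the leading hex digits of the id name the directories.
    """
    name = f"{bag_id:016x}"
    parts = [name[i * width:(i + 1) * width] for i in range(levels)]
    return os.path.join(*parts, name)


rng = random.Random(0)
paths = [bag_path(rng.getrandbits(64)) for _ in range(10_000)]

# Count how many bags land in each directory.
per_dir = Counter(os.path.dirname(p) for p in paths)

print(bag_path(0xDEADBEEF))
print("directories used:", len(per_dir))
print("most bags in one directory:", max(per_dir.values()))
```

This only helps if the ids are well spread: with small sequential ids the leading digits would all be zero and everything would again end up in one directory.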

Current results (2015-11-01)

Many files benchmark (1 million files, 1 byte each):

what	initial	no-op
FORMAT 6	6195	1015
FORMAT GREEN ALBATROSS	842	355

One big file (10 GB random data):

what	initial
FORMAT 6	3409
FORMAT GREEN ALBATROSS	1419

All values are times in seconds.

Things I want to try next: benchmarks

Bigger benchmarks: more files, more data in the one big file.
More benchmarks:
- all Obnam operations: FUSE mount, forget, verify, fsck
- use sftp (even if physically on the same machine)
- use encryption
Different measurements:
- memory use
- bandwidth use
- repository (sftp) round trips
Try FORMAT GREEN ALBATROSS with our real data at home.

Things I want to try next: changes

Find and fix bottlenecks.
Split chunk indexes into smaller objects.
Local caching of some repository objects (DIR objects).
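
A minimal sketch of why local caching pays off, and why it depends on repository objects never being rewritten. Both classes are invented for this example; the "remote" side just counts simulated round trips.

```python
class SlowRepository:
    """Stand-in for a remote repository; counts simulated round trips."""

    def __init__(self, objects):
        self._objects = dict(objects)
        self.round_trips = 0

    def get(self, object_id):
        self.round_trips += 1
        return self._objects[object_id]


class CachingRepository:
    """Keeps fetched objects locally.

    Only correct if an object, once written, is never rewritten;
    otherwise the cache could return stale data.
    """

    def __init__(self, remote):
        self._remote = remote
        self._cache = {}

    def get(self, object_id):
        if object_id not in self._cache:
            self._cache[object_id] = self._remote.get(object_id)
        return self._cache[object_id]


remote = SlowRepository({"dir-1": b"metadata for /home",
                         "dir-2": b"metadata for /etc"})
repo = CachingRepository(remote)

for _ in range(100):
    repo.get("dir-1")
    repo.get("dir-2")

print(remote.round_trips)  # 2, not 200
```

At 50 ms per round trip, the difference between 2 and 200 round trips is the difference between a tenth of a second and ten seconds.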

Someday:

More concurrency inside Obnam.
Try for more parallelism over the ssh connection: multiple SFTP sessions multiplexed over the same ssh connection.
Server mode Obnam: SFTP is not very smart.
Better chunking approaches
- more de-duplication means less data to upload?
- but many more chunks?
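
The slide does not name a chunking approach. As one example of what "better chunking" could mean, here is a toy comparison of fixed-size chunks with content-defined chunks, where a chunk boundary depends only on the bytes just before it. The window sum used here is a deliberately simple stand-in for a real rolling hash, and all names and parameters are assumptions.

```python
import random


def fixed_chunks(data, size=256):
    """Split data into chunks of exactly `size` bytes (last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def content_defined_chunks(data, window=16, mask=0xFF):
    """Cut after every position where the sum of the last `window` bytes
    has all `mask` bits zero, so boundaries depend only on nearby content."""
    chunks = []
    start = 0
    rolling = 0
    for i, byte in enumerate(data):
        rolling += byte
        if i >= window:
            rolling -= data[i - window]  # drop the byte leaving the window
        if i >= window - 1 and (rolling & mask) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks


rng = random.Random(0)
original = bytes(rng.randrange(256) for _ in range(100_000))
modified = b"X" + original  # one byte inserted at the front


def shared(chunker):
    """Number of distinct chunks that both versions have in common."""
    return len(set(chunker(original)) & set(chunker(modified)))


print("fixed-size chunks shared:", shared(fixed_chunks))
print("content-defined chunks shared:", shared(content_defined_chunks))
print("chunk counts:", len(fixed_chunks(original)),
      len(content_defined_chunks(original)))
```

Inserting one byte shifts every fixed-size chunk, so almost nothing can be reused, whereas the content-defined boundaries line up again shortly after the insertion. The cost side of the trade-off is the one the slide hints at: smaller or more numerous chunks mean more index entries and, potentially, more round trips.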

How you can help

Experiment with Obnam FORMAT GREEN ALBATROSS under profiling with real data.
- analyse and find bottlenecks
- send patches that improve things (measured, not guessed)
Improve the obbench program.
- more measurements, more benchmarks
- better reporting of results

All help most welcome. Think of the genetically modified leopard kittens!

Thank you

Bytemark: donated capacity in the BigV cloud for running benchmarks
FUUG Foundation: gave a grant for buying a development/benchmark machine I can run at home

Images:

https://en.wikipedia.org/wiki/File:African_Leopard_5.JPG by user Danh on Wikipedia.
https://en.wikipedia.org/wiki/File:Blackleopard.JPG by user Qilinmon on Wikipedia.
https://commons.wikimedia.org/wiki/File:Zambia_leopard.jpg by FrontierEnviro