diff --git a/doc/details.mdwn b/doc/details.mdwn
new file mode 100644
index 0000000..e0f85e5
--- /dev/null
+++ b/doc/details.mdwn
@@ -0,0 +1,365 @@
+[[!toc]]
+
+## Storing a key
+
+* Input password and name from user. The name is a combination of their own
+ name and a more obscure name (such as the name of their high-school
+ sweetheart).
+* Get the keyid of the key. (This can be any public value
+ unique to the private key, eg a gpg keyid. It's only used to allow
+ storing multiple keys under a given name. If a gpg public key is not on the
+ keyservers, or the key is not a gpg key, "" can be used for the keyid.)
+* Generate N by argon2(name, salt=keyid), tuned to take 10 minutes.
+* Generate N1-N3 by sha256 of N concatenated with 1, 2, 3 respectively.
+* Generate decryption puzzle P, a byte chosen at random.
+* Generate K by argon2(password, salt=name+P), tuned to take 0.195 minutes.
+* AES encrypt (data + checksum) with K as the key.
+* Shamir the encrypted data with N=2, M=3, yielding S1-S3.
+* Servers reject attempts to store an object under a name that is
+ already in use.
+* Servers do not allow enumerating all objects stored,
+ and require a proof of work to handle any request.
+* Upload S1-S3 to separate servers under the names N1-N3.
+ If any of the uploads is rejected as name already in use,
+ ask user to enter a different name or password.
+
+So, storing a key takes 10 minutes.
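The derivation steps above can be sketched in Python. Here `hashlib.scrypt` is a stand-in for argon2 (which is not in the Python standard library), and the cost parameters are small illustrative values, not the tuned 10-minute and 0.195-minute settings:

```python
import hashlib

def kdf(secret: bytes, salt: bytes) -> bytes:
    # stand-in for argon2; keysafe tunes the real cost to minutes of CPU work
    return hashlib.scrypt(secret, salt=salt, n=2**14, r=8, p=1, dklen=32)

def derive_names(name: bytes, keyid: bytes):
    n = kdf(name, keyid)  # N = argon2(name, salt=keyid), tuned to 10 minutes
    # N1-N3 = sha256 of N concatenated with 1, 2, 3
    return [hashlib.sha256(n + bytes([i])).hexdigest() for i in (1, 2, 3)]

def derive_k(password: bytes, name: bytes, p: int) -> bytes:
    # K = argon2(password, salt=name+P), where P is a single random byte
    return kdf(password, name + bytes([p]))
```

In the real scheme, the N derivation is tuned to be far more expensive than the K derivation; only the structure is shown here.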
+
+## Recovering a key
+
+* Input password and name from user.
+* Calculate N and N1-N3
+* Request N1-N3 from servers until two objects are available.
+* Shamir recombine the objects.
+* Guess a value for P.
+* Generate K by argon2(password, salt=name+P)
+* AES decrypt
+* Repeat with new P until checksum verifies.
+
+This takes 10 minutes to calculate N, plus on average 128 guesses of P.
+Total recovery time varies from 10 minutes to 60 minutes, with an
+average of 35 minutes.
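The Shamir split and recombine used in storage and recovery can be illustrated over GF(256). This is a sketch of a 2-of-3 scheme (each byte shared with a degree-1 polynomial), not the secret-sharing library keysafe actually uses:

```python
import os

def _gmul(a, b):                      # multiply in GF(2^8), AES polynomial
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        hi = a & 0x80
        a = (a << 1) & 0xFF
        if hi:
            a ^= 0x1B
        b >>= 1
    return p

def _ginv(a):                         # inverse via Fermat: a^254
    r = 1
    for _ in range(254):
        r = _gmul(r, a)
    return r

def split(secret: bytes):
    """Split into shares S1-S3, any 2 of which recover the secret."""
    shares = {1: bytearray(), 2: bytearray(), 3: bytearray()}
    for byte in secret:
        r = os.urandom(1)[0]          # random slope; f(x) = byte + r*x
        for x in shares:
            shares[x].append(byte ^ _gmul(r, x))
    return {x: bytes(s) for x, s in shares.items()}

def recombine(two_shares: dict) -> bytes:
    """Lagrange-interpolate f(0) from any two shares {x: data}."""
    (x1, s1), (x2, s2) = two_shares.items()
    # weights: l1 = x2/(x1+x2), l2 = x1/(x1+x2)  (addition is XOR)
    inv = _ginv(x1 ^ x2)
    l1, l2 = _gmul(x2, inv), _gmul(x1, inv)
    return bytes(_gmul(a, l1) ^ _gmul(b, l2) for a, b in zip(s1, s2))
```

Any two of the three shards reconstruct the data; a single shard reveals nothing, since the random slope is unknown.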
+
+## Difficulty of brute forcing a single encrypted key
+
+* Assume we know the name and keyid, or have otherwise found a way to
+ determine the shards of a key. Download and recombine the shards.
+* Guess a password.
+* For each possible value of P, AES decrypt with
+ K = argon2(password, salt=name+P), and check if checksum verifies.
+ This takes 0.195 minutes * 256 = 50 minutes total.
+* Repeat for next password.
+
+So, for a password with N bits of entropy, the number of CPU-years of work
+to crack it is: `2^(N-1)*50/60/24/365`
+
+* Strong password (50 bits of entropy): 53553077761 CPU-years
+* Weak password (30 bits of entropy): 51072 CPU-years
+* Super-weak password (19 bits of entropy): 25 CPU-years
+
+So, if an attacker is able to find the right shards for a secret key, it's
+feasible for them to crack super-weak and weak passwords, assuming the
+secret key is worth the cost of doing so. Stronger passwords quickly
+become infeasible to crack.
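As a quick check, the formula evaluates to the figures above:

```python
# The CPU-years formula, evaluated for the three password strengths.
def crack_cpu_years(entropy_bits: int) -> int:
    # 2^(N-1) average guesses, 50 minutes per guess, converted to years
    return round(2 ** (entropy_bits - 1) * 50 / 60 / 24 / 365)

for bits in (50, 30, 19):
    print(bits, crack_cpu_years(bits))
```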
+
+## Attack methods
+
+An attacker who wants to target a particular person can guess the name they
+used, derive N1-N3, download two of S1-S3 and start brute forcing the
+password soon after the object is stored. This is the most likely attack
+method, so any user who could potentially be targeted like this should
+choose a strong password. A name that attackers are unlikely to guess
+prevents this attack, which is why keysafe prompts for not only the
+user's name, but also a more obscure name. Each name guess that the
+attacker makes takes 10 minutes of CPU time to generate N, as well
+as whatever proof of work the servers require.
+
+The sharding prevents a single malicious server from blindly
+guessing weak passwords across its entire collection of objects.
+It takes two servers colluding to try to recombine their shards.
+
+If recombining two shards yielded data which could be checked to see if
+it's valid, then it would become fairly inexpensive to try all combinations
+of shards, and obtain all the encrypted keys for further cracking. So it's
+important that there not be any checksum or header in addition to the AES
+encrypted data. (AES encrypted data cannot be distinguished from random
+garbage except by block size.) Get that right, and with N keysafe users, an
+attacker would need to try `2^(N-1)` combinations of shards to find one on
+average, and would need to brute force the password of each combination.
+With only 20 keysafe users, assuming all users have super-weak passwords,
+this attack takes 13107200 years of CPU work `(2^19*25)` to crack one. With
+50 users, this jumps to quadrillions of years of CPU work.
+
+Colluding servers can try to correlate related objects based on access
+patterns and recombine pairs of those, and then brute force the password.
+The correlation needs to be very fine-grained for this to work.
+If 15 users' objects are all bucketed together by the correlation,
+then the attacker has 16384 combinations to try on average before
+finding a correct combination. Multiply by 5 years CPU work for cracking
+only super-weak passwords.
+
+Colluding servers who want to target a particular person
+can guess their N1-N3, check if those objects exist on the server, and
+begin brute-forcing the password. This is not much cheaper than the same
+attack performed by a third-party attacker, except that the attacker
+doesn't need to wait for an object to download from the server to check if
+it exists.
+
+A state-level entity may try to subpoena the entire contents of keysafe
+servers. Once 2 servers are compromised, the state-level entity can try the
+same attacks that colluding servers can use (but probably cannot use
+attacks involving correlation because the server operators should not be
+retaining the data needed for correlation). Of course, they probably have
+many more resources to throw at the problem. But with only 50 keysafe
+users, recombining shards and trying super-weak passwords would be
+prohibitively expensive as detailed above. So, a state-level entity
+will probably only find it feasible to target particular people.
+Since such an attack can be performed by anyone as discussed above,
+there seems to actually be no incentive for a state-level entity to subpoena data.
+
+A state-level entity's best bet at getting lots of keys is probably to use
+their resources to compromise keysafe servers, and modify them to log data.
+Then a correlation attack can be done as discussed above.
+
+A different kind of attack: Legal/extralegal action to get a
+particular person's key removed from storage servers to try to
+deny them access to it. Or, to entirely take down storage servers.
+
+### Malicious data attack
+
+Two servers could collude to serve up malicious data to try to exploit the
+user's system.
+
+For example, if the user is using their gpg key to encrypt emails,
+and they restore a different gpg key, they might encrypt with it, and
+the attacker could then decrypt what they send.
+
+To perform this attack, the attacker first has to manage to crack the user's
+password. Then they can replace the objects with malicious versions, encrypted
+with the same password.
+
+So, this is not too useful for gpg key replacement, since the attacker
+must already know the secret key. However, perhaps they could exploit bugs
+in gpg to compromise the user's system.
+
+## Server list
+
+There's a server list shipped with the client, giving each server's tor onion
+address and the organization providing it.
+
+Three of the servers in the list are recommended servers.
+Shards are stored on these unless overridden by other configuration.
+
+When recovering a key, the client tries the recommended servers first. But,
+if it fails to recover the key using those, it goes on to try other servers
+on the list. This way we don't need to remember which servers the shards
+were stored on, and can change the recommended servers when necessary.
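The fallback might look like this sketch, where `fetch` is a hypothetical stand-in for requesting an object from a server over tor:

```python
# Try the recommended servers first, then the rest of the list,
# until enough shards have been found to recombine.
def gather_shards(names, recommended, others, fetch, needed=2):
    shards = {}
    for server in list(recommended) + list(others):
        for name in names:
            if name in shards:
                continue
            data = fetch(server, name)    # returns bytes, or None if absent
            if data is not None:
                shards[name] = data
                if len(shards) == needed:
                    return shards
    return shards
```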
+
+See [[servers]] for more on the server list.
+
+## Servers
+
+Servers run exclusively as tor hidden services. This prevents them from
+finding out the IP addresses of clients, as well as providing transport
+level encryption.
+
+Servers should avoid keeping any logs, and should sanitize
+the timestamps of any files stored on them. (Eg, set timestamps back to
+the epoch and store on a filesystem mounted with noatime.)
+
+Only small objects are accepted for storage. This prevents keysafe
+servers from being used as a general purpose data storage system,
+making them only useful for storing keys.
+
+However, `gpg --export-secret-key` can be hundreds of KB in size
+when a key has a lot of signatures etc. While this size can be cut
+down to 4 KB using `paperkey` to only include the secret key, it's
+desirable to store the public and secret key together. This way,
+a user does not have to publish the key to keyservers, which makes it much
+less likely that some attack methods will be tried against their key.
+
+So, the object size should be at least a few tens of KB. 64 KB seems
+reasonable. If keysafe needs to store a larger key, it can chunk it up.
+
+Objects should be padded to a fixed size before they are encrypted, to
+prevent attackers from correlating and targeting particular objects based
+on size.
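Padding can be sketched as a length prefix plus random fill out to a fixed 64 KB object size; since padding is applied before encryption, the prefix and fill are not visible to the server. The exact format here is illustrative:

```python
import os
import struct

OBJECT_SIZE = 64 * 1024

def pad(data: bytes) -> bytes:
    # 4-byte length prefix, then the data, then random fill to 64 KB
    assert len(data) + 4 <= OBJECT_SIZE, "larger keys must be chunked"
    fill = os.urandom(OBJECT_SIZE - 4 - len(data))
    return struct.pack(">I", len(data)) + data + fill

def unpad(obj: bytes) -> bytes:
    (length,) = struct.unpack(">I", obj[:4])
    return obj[4:4 + length]
```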
+
+## Client-server proof of work
+
+The Proof of Work prevents servers from being flooded with requests.
+This is particularly important to prevent misuse of keysafe servers
+to store large data on them. It's also somewhat useful to prevent attackers
+guessing the name someone used to store a key; but the cost of generating
+N from a name makes the server's proof of work only a secondary line of
+defense against such an attack. Finally, PoW is useful to protect against
+DoS attacks.
+
+Assuming that the client communicates with the server over http:
+
+ PUT /keysafe/objects/N
+ GET /keysafe/objects/N
+
+The server's response can be either the result of the request,
+or a proof of work requirement, which specifies the difficulty D
+(the number of leading 0's needed), a random salt RS, and the number of
+argon2 iterations. The client provides the proof of work in a query
+parameter, which is a string S such that argon2(N,S+RS) starts with D
+0's.
+
+(The server should only be able to control the number of iterations,
+not other argon2 parameters, to avoid forcing the client to use too much
+memory. Normally, the server will keep iterations small, probably 1,
+since it does need to calculate the argon2 hash once itself.)
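The client's side of this exchange can be sketched as follows, with sha256 standing in for the argon2 hash (so the example runs quickly) and the difficulty D counted in leading zero bits:

```python
import hashlib
from itertools import count

def verify(n: bytes, s: bytes, rs: bytes, d: int) -> bool:
    # the first D bits of the hash must be zero
    h = int.from_bytes(hashlib.sha256(n + s + rs).digest(), "big")
    return h >> (256 - d) == 0

def solve(n: bytes, rs: bytes, d: int) -> bytes:
    # try candidate strings S; takes about 2^D hashes on average
    for i in count():
        s = str(i).encode()
        if verify(n, s, rs, d):
            return s
```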
+
+The server can use a [token bucket](https://en.wikipedia.org/wiki/Token_bucket)
+to throttle requests to a given rate. In fact, there can be a whole
+series of token buckets B0,B1.., for increasing difficulty proofs of work.
+
+A request without a proof of work is checked in B0. If that bucket is empty,
+the server responds with a proof of work requirement D=1, and
+the client makes a new request whose proof of work allows access to B1.
+If the server is under load, B1 might also be empty, and so the client
+will have to try again with D=2 to access B2, and so on.
+
+If there are 4 buckets, and each bucket refills at the rate of a
+token every minute, then the maximum allowed throughput is 4 requests
+per minute. If calculating the proof of work takes 2^D seconds on average,
+then it will take on average 16 seconds of work to access B4.
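A minimal token bucket of the kind described (a sketch, not the server's actual implementation):

```python
import time

class TokenBucket:
    def __init__(self, rate_per_sec: float, capacity: float):
        self.rate, self.capacity = rate_per_sec, capacity
        self.tokens, self.last = capacity, time.monotonic()

    def allow(self) -> bool:
        # refill according to elapsed time, capped at capacity
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

Each bucket B0, B1, ... would be one such instance, consulted according to the difficulty of the proof of work presented.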
+
+The server can generate a different RS for each request, and can
+insert them into a bloom filter to keep track of ones it has given out.
+Bloom filter false positives are not a problem because they are quite
+rare and so it is not efficient for a client to make up RS in hope that
+there will be a false positive.
+
+To guard against DOS attacks that reuse proofs of work, the server can
+maintain a second bloom filter, putting RS into it once it's used, and
+rejecting attempts that reuse an RS. Since the bloom filter will
+(with a low probability) yield a false positive, the server should reject
+an attempt by generating a new RS' and requesting a new proof of work from
+the client.
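The two bloom filters can be sketched like this; the filter is a simple sha256-based illustration rather than a tuned implementation:

```python
import hashlib

class Bloom:
    def __init__(self, bits=4096, hashes=3):
        self.bits, self.hashes, self.bitmap = bits, hashes, 0

    def _positions(self, item: bytes):
        for i in range(self.hashes):
            h = hashlib.sha256(bytes([i]) + item).digest()
            yield int.from_bytes(h, "big") % self.bits

    def add(self, item: bytes):
        for p in self._positions(item):
            self.bitmap |= 1 << p

    def __contains__(self, item: bytes) -> bool:
        return all(self.bitmap >> p & 1 for p in self._positions(item))

issued, spent = Bloom(), Bloom()   # RS values handed out / already used

def accept_proof(rs: bytes) -> bool:
    # reject an RS the server never issued, or one that was already spent
    if rs not in issued or rs in spent:
        return False
    spent.add(rs)
    return True
```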
+
+## Avoiding correlations
+
+As noted above, the more objects that are stored, the more secure
+keysafe becomes, because it becomes harder to correlate objects
+that belong to a user.
+
+Ways the server could draw correlations include:
+
+* Time that the data was stored.
+ (Avoid by waiting some period of time between uploading
+ various objects, probably on the order of days. If every keysafe client
+ uploads its objects at midnight GMT, the server will find it hard to
+ correlate them.)
+* IP address used to store the data.
+ (Using tor avoids this.)
+* When the data was retrieved.
+ (Avoid by waiting some period of time between retrieving shards.
+ Although, this makes key recovery take longer, which could be
+ frustrating.)
+* A user ID associated with the data.
+ (Avoid by allowing anyone to store data, and by not associating data
+ with any user ID.)
+* Same-sized objects may be related shards.
+ (Avoid by making all objects stored be a standard size.)
+
+## Detecting corrupt data
+
+ ori> if a single server is compromised, it can return bogus data when you request its fragment of the shared secret
+ ori> if you only have three servers, you can't know which two to trust, short of just trying
+ ori> you end up with three possible ways to reconstruct the secrets, A-B, A-C, B-C. only one is legit.
+
+This could also happen due to error rather than compromise, or even due to
+asking the wrong server for an object and getting back a dummy response.
+
+To guard against this, include a sha256sum of the secret key in the
+data that is sharded. This way the validity of the restored key can be
+verified.
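Embedding and verifying the checksum can be sketched as:

```python
import hashlib

def wrap(key: bytes) -> bytes:
    # append a sha256 checksum to the secret key before sharding
    return key + hashlib.sha256(key).digest()

def unwrap(blob: bytes):
    # after recombining shards, verify the checksum
    key, digest = blob[:-32], blob[-32:]
    if hashlib.sha256(key).digest() != digest:
        return None            # corrupt shard or wrong combination
    return key
```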
+
+Note that this may make it marginally easier for brute force attacks, since
+they can use that checksum to detect when the attack is successful. Only
+marginally easier because it's not hard to fingerprint a gpg secret key.
+Even the minimised raw key output by paperkey can be recognised by a parser.
+
+## Versioning
+
+The algorithms and parameters chosen by keysafe could turn out to be
+wrong, or need adjusting to keep up with technology. While we'd like to
+avoid this by getting the design right, it's important to have a plan in
+case it becomes necessary.
+
+The simplest approach would be to include a version number with each shard.
+But, this causes a problem: When switching to a new version, the early
+adopters of that version are a small group, and so it becomes easier to
+correlate their shards.
+
+The version number could instead be included in the data that is sharded.
+This avoids the above problem, but leads to an even worse problem:
+An attacker can look for the version number after combining some random
+shards, and if they see it, they know they picked shards that are actually
+related. As described in "Attack methods", if an attacker can do that,
+it becomes easy for two colluding servers to find the right combinations of
+all shards.
+
+A safe alternative to embedding a version number anywhere in the data is
+to adjust the parameters of the argon2 hash used to generate the shard
+names. Maintain a list of versions and their shard name argon2
+parameters (as well as other parameters etc). During key recovery,
+keysafe can try different versions in turn, use argon2 to generate the
+shard names, and see if those shards can be downloaded. Once it finds
+shards, it knows the version that created them, and the other parameters
+of how the data is encoded.
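Version discovery might be sketched like this; the `VERSIONS` table, the scrypt stand-in for argon2, and the `server_has` lookup are hypothetical placeholders:

```python
import hashlib

# hypothetical version table: later versions use a larger memory parameter
VERSIONS = [
    {"version": 1, "mem": 2 ** 13},
    {"version": 2, "mem": 2 ** 14},
]

def shard_names(name: bytes, params):
    # scrypt stands in for argon2 with a per-version memory parameter
    n = hashlib.scrypt(name, salt=b"keysafe",
                       n=params["mem"], r=8, p=1, dklen=32)
    return [hashlib.sha256(n + bytes([i])).hexdigest() for i in (1, 2, 3)]

def detect_version(name: bytes, server_has):
    # try each version's parameters until shards turn up on the servers
    for params in VERSIONS:
        if any(server_has(s) for s in shard_names(name, params)):
            return params["version"]
    return None
```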
+
+The only downside to this method is that each new version added to the list
+makes recovering data from all older versions take 10 minutes longer, as
+the argon2 hash has to be run an additional time.
+
+Note that the argon2 hash parameter varied between versions should not be
+merely the number of rounds, as that would allow an attacker to compute the
+hashes for every version's number of rounds in a single pass. Instead, vary
+the hash memory parameter.
+
+## Ideas
+
+Some ideas for improvements to keysafe.
+
+### Assisted Password-based Key Derivation
+
+An idea from [Nigori](http://www.links.org/files/nigori/nigori-protocol-01.html#anchor15):
+
+If the user has an account at some server, the server can contribute part
+of the data used to generate the key encryption key K. So not only is the
+user's keysafe password needed for key recovery, but the user has to
+authenticate to the server. As well as adding entropy to K, the server
+can do rate limiting to prevent password guessing.
+
+Risks include:
+
+* The server becomes a target to be compromised.
+* If the server goes down, the user loses the ability to recover their gpg
+ key from keysafe.
+
+### Entangled objects
+
+An idea from Anthony Towns:
+
+Two or more objects stored on a keysafe server can be entangled.
+When one of them is requested, the server should delete the others.
+
+This could be used in several ways:
+
+* Keysafe could upload random objects entangled with a key's object,
+ and keep local records of the names of the random objects. Then if the
+ user wants to stop storing a key on keysafe servers, keysafe can request
+ the entangled objects to (hopefully) delete the real objects from the
+ server.
+* Keysafe could pick or prompt for some tripwire names that an attacker
+ might try if they were looking for a user's data. Then an attacker
+ risks deleting the data stored on the server. Although this also has
+ DoS potential.
+* The user could pick a real name and fake name and keysafe uploads
+ objects for both. If the user is being forced to give up their keysafe
+ name and password, they could provide the fake name, and if it were
+ used, their data would get deleted from the keysafe servers.