[[!toc]]

## Storing a key

* Input password and name from user. The name is a combination of their own
  name and a more obscure name (such as the name of their high-school
  sweetheart).
* Get the keyid of the key. (This can be any public value
  unique to the private key, eg a gpg keyid. It's only used to allow
  storing multiple keys under a given name. If a gpg public key is not on the
  keyservers, or the key is not a gpg key, can use "" for the keyid.)
* Generate N by argon2(name, salt=keyid), tuned to take 10 minutes.
* Generate N1-N3 by sha256 of N with 1, 2, 3 appended.
* Generate decryption puzzle P, a byte chosen at random.
* Generate K by argon2(password, salt=name+P), tuned to take 0.195 minutes.
* AES encrypt (data + checksum) with K as the key.
* Shamir split the encrypted key with N=2, M=3, yielding S1-S3.
* Servers reject attempts to store an object under a name that is
  already in use.
* Servers do not allow enumerating all objects stored,
  and require a proof of work to handle any request.
* Upload S1-S3 to separate servers under the names N1-N3.
  If any of the uploads is rejected as name already in use, 
  ask user to enter a different name or password.

So, storing a key takes 10 minutes.
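The derivation of the storage names can be sketched as follows. A fast sha256 stands in here for argon2(name, salt=keyid), which is tuned to take 10 minutes, so this shows only the structure of the derivation, not its hardness:

```python
import hashlib

def derive_names(name, keyid):
    """Derive the storage names N1-N3 from the user's name and keyid."""
    # N = argon2(name, salt=keyid) in the real scheme; sha256 is a
    # fast stand-in for illustration only.
    n = hashlib.sha256(keyid + name).digest()
    # N1-N3 are the sha256 of N with 1, 2, 3 appended.
    return [hashlib.sha256(n + str(i).encode()).hexdigest()
            for i in (1, 2, 3)]

names = derive_names(b"Jean Smith chris", b"0123ABCD")
```

The three names are deterministic given the inputs, so the client can re-derive them at recovery time, but they reveal nothing about each other to the servers.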

## Recovering a key

* Input password and name from user.
* Calculate N and N1-N3
* Request N1-N3 from servers until two objects are available.
* Shamir recombine the objects.
* Guess a value for P.
* Generate K by argon2(password, salt=name+P)
* AES decrypt
* Repeat with new P until checksum verifies.

This takes 10 minutes to calculate N, plus on average 128 guesses of P.
Total recovery time varies from 10 minutes to 60 minutes, with an
average of 35 minutes.
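These timing figures follow from simple arithmetic, which can be checked directly:

```python
# 10 minutes to calculate N, plus one 0.195-minute argon2 run per
# guess of the random byte P (1 to 256 guesses, 128 on average).
n_minutes = 10
per_guess = 0.195

best = n_minutes + per_guess * 1
average = n_minutes + per_guess * 128
worst = n_minutes + per_guess * 256

print(round(best), round(average), round(worst))  # 10 35 60
```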

## Difficulty of brute forcing a single encrypted key

* Assume we know the name and keyid, or have otherwise found a way to
  determine the shards of a key. Download and recombine the shards.
* Guess a password.
* For each possible value of P, AES decrypt with
  K = argon2(password, salt=name+P), and check if checksum verifies.  
  This takes 0.195 minutes * 256 = 50 minutes total.
* Repeat for next password.

So, for a password with N bits of entropy, the number of CPU-years of work
to crack it is: `2^(N-1)*50/60/24/365`

* Strong password (50 entropy): 53553077761 CPU-years
* Weak password (30 entropy): 51072 CPU-years
* Super-weak password (19 entropy): 25 CPU-years

So, if an attacker is able to find the right shards for a secret key, it's
feasible for them to crack super-weak and weak passwords, assuming the
secret key is worth the cost of doing so. Stronger passwords quickly
become infeasible to crack.
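The estimate is easy to reproduce; the figures above round to the same values:

```python
def cpu_years(entropy_bits):
    # Each password guess costs 50 CPU-minutes (256 values of P at
    # 0.195 minutes each); on average half the password space of
    # 2^entropy_bits must be searched.
    minutes = 2 ** (entropy_bits - 1) * 50
    return round(minutes / 60 / 24 / 365)

for bits in (50, 30, 19):
    print(bits, cpu_years(bits))
```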

## Attack methods

An attacker who wants to target a particular person can guess the name they
used, derive N1-N3, download two of S1-S3 and start brute forcing the
password soon after the object is stored. This is the most likely attack
method, so any user who could potentially be targeted like this should
choose a strong password. A name that attackers are unlikely to guess
prevents this attack, which is why keysafe prompts for not only the
user's name, but also a more obscure name. Each name guess that the
attacker makes takes 10 minutes of CPU time to generate N, as well
as whatever proof of work the servers require.

The sharding prevents a single malicious server from blindly 
guessing weak passwords across its entire collection of objects.
It takes two servers colluding to try to recombine their shards.

If recombining two shards yielded data which could be checked to see if
it's valid, then it would become fairly inexpensive to try all combinations
of shards, and obtain all the encrypted keys for further cracking. So it's
important that there not be any checksum or header in addition to the AES
encrypted data. (AES encrypted data cannot be distinguished from random
garbage except by block size.) Get that right, and with N keysafe users, an
attacker would need to try `2^(N-1)` combinations of shards to find one on
average, and would need to brute force the password of each combination.
With only 20 keysafe users, assuming all users have super-weak passwords,
this attack takes 13107200 years of CPU work `(2^19*25)` to crack one. With
50 users, this jumps to quadrillions of years of CPU work.

Colluding servers can try to correlate related objects based on access
patterns and recombine pairs of those, and then brute force the password.
The correlation needs to be very fine-grained for this to work.
If 15 users' objects are all bucketed together by the correlation,
then the attacker has 16384 combinations to try on average before
finding a correct combination. Multiply by the 25 CPU-years of work per
combination for cracking only super-weak passwords.

Colluding servers who want to target a particular person
can guess their N1-N3, check if those objects exist on the server, and
begin brute-forcing the password. This is not much cheaper than the same
attack performed by a third-party attacker, except that the attacker
doesn't need to wait for an object to download from the server to check if
it exists.

A state-level entity may try to subpoena the entire contents of keysafe
servers. Once 2 servers are compromised, the state-level entity can try the
same attacks that colluding servers can use (but probably cannot use
attacks involving correlation because the server operators should not be
retaining the data needed for correlation). Of course, they probably have
many more resources to throw at the problem. But with only 50 keysafe
users, recombining shards and trying super-weak passwords would be
prohibitively expensive as detailed above. So, a state-level entity
will probably only find it feasible to target particular people.
Since such an attack can be performed by anyone as discussed above,
there seems to actually be no incentive for a state-level entity to
subpoena data.

A state-level entity's best bet at getting lots of keys is probably to use
their resources to compromise keysafe servers, and modify them to log data.
Then a correlation attack can be done as discussed above.

A different kind of attack: Legal/extralegal action to get a 
particular person's key removed from storage servers to try to
deny them access to it. Or, to entirely take down storage servers.

### Malicious data attack

Two servers could collude to serve up malicious data to try to exploit the
user's system.

For example, if the user is using their gpg key to encrypt emails,
and they unknowingly restore a different gpg key, they might encrypt
with it, and then what they sent could be decrypted by the attacker.

To perform this attack, the attacker first has to manage to crack the user's
password. Then they can replace the objects with malicious versions, encrypted
with the same password.

So, this is not too useful for gpg key replacement, since the attacker
must already know the secret key. However, perhaps they could exploit bugs
in gpg to compromise the user's system.

## Server list

There's a server list shipped with the client, giving their tor onion address
and the organization providing the server. 

Three of the servers in the list are recommended servers.
Shards are stored on these unless overridden by other configuration.

When recovering a key, the client tries the recommended servers first. But,
if it fails to recover the key using those, it goes on to try other servers
on the list. This way we don't need to remember which servers the shards
were stored on, and can change the recommended servers when necessary.

See [[servers]] for more on the server list.

## Servers

Servers run exclusively as tor hidden services. This prevents them from
finding out the IP addresses of clients, and also provides transport
level encryption.

Servers should avoid keeping any logs, and should sanitize
the timestamps of any files stored on them. (Eg, set them back to the
epoch and store them on a filesystem mounted with noatime.)

Only small objects are accepted for storage. This prevents keysafe
servers from being used as a general purpose data storage system; they
are only useful for storing keys.

However, `gpg --export-secret-key` can be hundreds of KB in size
when a key has a lot of signatures etc. While this size can be cut
down to 4 KB using `paperkey` to only include the secret key, it's
desirable to store the public and secret key together. This way,
a user does not have to publish the key to keyservers, which makes it
much less likely that some attack methods will be used to try to crack
their key.

So, the object size should be at least a few tens of KB. 64 KB seems
reasonable. If keysafe needs to store a larger key, it can chunk it up.

Objects should be padded to a fixed size before they are encrypted, to
prevent attackers from correlating and targeting particular objects based
on size.
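One way to do the padding might look like this; the 2-byte length prefix is an illustrative choice, not keysafe's actual format. The padding must happen before encryption so that ciphertexts are indistinguishable by size:

```python
import os

OBJECT_SIZE = 64 * 1024  # the fixed object size, in bytes

def pad(data):
    # Prefix the data with its length, then fill to OBJECT_SIZE with
    # random bytes. Larger keys would need to be chunked first.
    if len(data) > OBJECT_SIZE - 2:
        raise ValueError("data too large for one object; chunk it")
    prefix = len(data).to_bytes(2, "big")
    return prefix + data + os.urandom(OBJECT_SIZE - 2 - len(data))

def unpad(obj):
    length = int.from_bytes(obj[:2], "big")
    return obj[2:2 + length]
```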

## client-server Proof of Work

The Proof of Work prevents servers being flooded with requests.
This is particularly important to prevent misuse of keysafe servers
to store large data on them. It's also somewhat useful to prevent attackers
guessing the name someone used to store a key; but the cost of generating
N from a name makes the server's proof of work only a secondary line of
defense against such an attack. Finally, PoW is useful to protect against
DOS attacks.

Assuming that the client communicates with the server over http:

	PUT /keysafe/objects/N
	GET /keysafe/objects/N

The server's response can be either the result of the request,
or a proof of work requirement, which specifies the difficulty D 
(number of 0's needed), random salt RS, and the number of argon2
iterations. The client provides the proof of work in a query parameter,
which is a string S such that argon2(N,S+RS) starts with D 0's.

(The server should only be able to control the number of iterations,
not other argon2 parameters, to avoid forcing the client to use too much
memory. Normally, the server will keep iterations small, probably 1,
since it does need to calculate the argon2 hash once itself.)
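The exchange can be sketched as follows; sha256 stands in for the (deliberately slow) argon2 hash, and the 0's are counted as leading zero hex digits for simplicity:

```python
import hashlib
from itertools import count

def pow_valid(n, s, rs, d):
    # Server side: check that the hash of N, S and RS -- sha256 here
    # as a stand-in for argon2 -- starts with D zero hex digits.
    h = hashlib.sha256((n + s + rs).encode()).hexdigest()
    return h.startswith("0" * d)

def solve_pow(n, rs, d):
    # Client side: search for a string S meeting the difficulty.
    for i in count():
        if pow_valid(n, str(i), rs, d):
            return str(i)

s = solve_pow("N1-objectname", "server-salt", 2)  # ~256 hashes on average
```

Each extra zero digit multiplies the client's expected work by 16, while the server only ever computes the hash once to verify.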

The server can use a [token bucket](https://en.wikipedia.org/wiki/Token_bucket)
to throttle requests to a given rate. In fact, there can be a whole
series of token buckets B0,B1.., for increasing difficulty proofs of work.

A request without a proof of work is checked in B0. If that bucket is empty,
the server responds with a proof of work requirement D=1, and
the client makes a new request whose proof of work allows access to B1.
If the server is under load, B1 might also be empty, and so the client 
will have to try again with D=2 to access B2, and so on.

If there are 4 buckets, and each bucket refills at the rate of a
token every minute, then the maximum allowed throughput is 4 requests
per minute. If calculating a proof of work of difficulty D takes 2^D
minutes on average, then it will take on average 16 minutes of work to
access B4.
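A sketch of this escalating token bucket scheme (the capacities and refill rates are illustrative):

```python
import time

class TokenBucket:
    def __init__(self, capacity, refill_per_sec):
        self.capacity = capacity
        self.tokens = capacity
        self.refill = refill_per_sec
        self.last = time.monotonic()

    def allow(self):
        # Refill according to elapsed time, then try to spend a token.
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Buckets B0..B3, each refilling at one token per minute.
buckets = [TokenBucket(capacity=5, refill_per_sec=1 / 60) for _ in range(4)]

def handle_request(pow_difficulty):
    """A request arrives with a proof of work of the given difficulty
    (0 = none). Return None to serve it, or the difficulty to demand."""
    if pow_difficulty < len(buckets) and buckets[pow_difficulty].allow():
        return None
    return pow_difficulty + 1

served = [handle_request(0) for _ in range(6)]  # 5 served, then D=1 demanded
```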

The server can generate a different RS for each request, and can
insert them into a bloom filter to keep track of ones it has given out.
Bloom filter false positives are not a problem, because they are rare
enough that it is not efficient for a client to make up an RS in the
hope of hitting a false positive.

To guard against DOS attacks that reuse proofs of work, the server can
maintain a second bloom filter, putting RS into it once it's used, and
rejecting attempts that reuse a RS. Since the bloom filter will
(with a low probability) yield a false positive, the server should reject
an attempt by generating a new RS' and requesting a new proof of work from
the client.

## Avoiding correlations

As noted above, the more objects that are stored, the more secure
keysafe becomes, because it becomes harder to correlate objects
that belong to a user.

Ways the server could draw correlations include:

* Time that the data was stored.  
  (Avoid by waiting some period of time, probably on the order of days,
  between uploading the various objects. Also, if every keysafe client
  uploads its objects at midnight GMT, the server will find it hard to
  correlate them.)
* IP address used to store the data.  
  (Using tor avoids this.)
* When the data was retrieved.  
  (Avoid by waiting some period of time between retrieving shards.
  Although, this makes key recovery take longer, which could be
  frustrating..)
* A user ID associated with the data.  
  (Avoid by allowing anyone to store data and don't associate data with
  any user ID.)
* Same sized objects may be related shares.  
  (Avoid by making all objects stored be a standard size.)

## Detecting corrupt data

	ori> if a single server is compromised, it can return bogus data when you request its fragment of the shared secret
	ori> if you only have three servers, you can't know which two to trust, short of just trying
	ori> you end up with three possible ways to reconstruct the secrets, A-B, A-C, B-C. only one is legit.

This could also happen due to error, not compromise. Or even due to
asking the wrong server for an object and getting back a dummy response.

To guard against this, include a sha256sum of the secret key in the
data that is sharded. This way the validity of the restored key can be
verified.

Note that this may make it marginally easier for brute force attacks, since
they can use that checksum to detect when the attack is successful. Only
marginally easier because it's not hard to fingerprint a gpg secret key.
Even the minimised raw key output by paperkey can be recognised by a parser.
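To illustrate, here is a toy 2-of-3 Shamir split over GF(256) with a sha256 of the key embedded in the sharded data, so that a recombination involving a bogus shard fails the checksum. This is a sketch of the idea, not keysafe's actual encoding:

```python
import hashlib, os

def gf_mul(a, b):
    # Multiplication in GF(2^8) with the AES polynomial x^8+x^4+x^3+x+1.
    p = 0
    while b:
        if b & 1:
            p ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return p

def gf_inv(a):
    # a^254 = a^-1 in GF(2^8), since a^255 = 1 for nonzero a.
    r = 1
    for _ in range(254):
        r = gf_mul(r, a)
    return r

def shamir_split(secret):
    # 2-of-3: per byte, f(x) = s + r*x; the shares are f(1), f(2), f(3).
    rnd = os.urandom(len(secret))
    return {x: bytes(s ^ gf_mul(r, x) for s, r in zip(secret, rnd))
            for x in (1, 2, 3)}

def shamir_recombine(x1, y1, x2, y2):
    # Lagrange interpolation at 0 (addition is XOR in GF(2^8)).
    inv = gf_inv(x1 ^ x2)
    c1, c2 = gf_mul(x2, inv), gf_mul(x1, inv)
    return bytes(gf_mul(a, c1) ^ gf_mul(b, c2) for a, b in zip(y1, y2))

# The sharded data includes a sha256 of the secret key, so a bad
# recombination is detected when the checksum fails to verify.
key = b"example secret key material"
blob = key + hashlib.sha256(key).digest()
shares = shamir_split(blob)

restored = shamir_recombine(1, shares[1], 3, shares[3])
assert hashlib.sha256(restored[:-32]).digest() == restored[-32:]
```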

## Versioning

The algorithms and parameters chosen by keysafe could turn out to be
wrong, or need adjusting to keep up with technology. While we'd like to
avoid this by getting the design right, it's important to have a plan in
case it becomes necessary.

The simplest approach would be to include a version number with each shard.
But, this causes a problem: When switching to a new version, the early
adopters of that version are a small group, and so it becomes easier to
correlate their shards.

The version number could instead be included in the data that is sharded.
This avoids the above problem, but leads to an even worse problem:
An attacker can look for the version number after combining some random
shards, and if they see it, they know they picked shards that are actually
related. As described in "Attack methods", if an attacker can do that,
it becomes easy for two colluding servers to find the right combinations of
all shards.

A safe alternative to embedding a version number anywhere in the data is
to adjust the parameters of the argon2 hash used to generate the shard
names. Maintain a list of versions and their shard name argon2
parameters (as well as other parameters etc). During key recovery,
keysafe can try different versions in turn, use argon2 to generate the
shard names, and see if those shards can be downloaded. Once it finds
shards, it knows the version that created them, and the other parameters
of how the data is encoded.

The only downside to this method is that each new version added to the list
makes recovering data from all older versions take 10 minutes longer, as
the argon2 hash has to be run an additional time.

Note that the argon2 hash parameters to vary between versions should not be
merely the number of rounds, as that would allow an attacker to hash for
each version's number of rounds in one pass. Instead, vary the hash
memory parameter.
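The recovery-time trial loop might look like the following; the version table, the memory parameter values, and the sha256 stand-in for argon2 are all hypothetical:

```python
import hashlib

# Hypothetical version table; each version varies the argon2 memory
# parameter used to generate shard names.
VERSIONS = [
    {"version": 1, "argon2_mem_kb": 64 * 1024},
    {"version": 2, "argon2_mem_kb": 256 * 1024},
]

def shard_names(name, keyid, params):
    # Mixing the memory parameter into a sha256 stands in for actually
    # running argon2 with that much memory.
    n = hashlib.sha256(
        keyid + name + str(params["argon2_mem_kb"]).encode()).digest()
    return [hashlib.sha256(n + str(i).encode()).hexdigest()
            for i in (1, 2, 3)]

def detect_version(name, keyid, server_has):
    # Try each version in turn; the version whose derived shard names
    # exist on the servers is the one that stored the key.
    for params in VERSIONS:
        if any(sn in server_has for sn in shard_names(name, keyid, params)):
            return params["version"]
    return None
```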

## Ideas

Some ideas for improvements to keysafe.

### Assisted Password-based Key Derivation

An idea from [Nigori](http://www.links.org/files/nigori/nigori-protocol-01.html#anchor15):

If the user has an account at some server, the server can contribute part
of the data used to generate the key encryption key K. So not only is the
user's keysafe password needed for key recovery, but the user has to
authenticate to the server. As well as adding entropy to K, the server 
can do rate limiting to prevent password guessing.

Risks include:

* The server becomes a target to be compromised.
* If the server goes down, the user loses the ability to recover their gpg
  key from keysafe.

### Entangled objects

An idea from Anthony Towns:

Two or more objects stored on a keysafe server can be entangled.
When one of them is requested, the server should delete the others.

This could be used in several ways:

* Keysafe could upload random objects entangled with a key's object,
  and keep local records of the names of the random objects. Then if the
  user wants to stop storing a key on keysafe servers, keysafe can request
  the entangled objects to (hopefully) delete the real objects from the
  server.
* Keysafe could pick or prompt for some tripwire names that an attacker
  might try if they were looking for a user's data. Then an attacker
  risks deleting the data stored on the server. Although this also has 
  DOS potential.
* The user could pick a real name and fake name and keysafe uploads
  objects for both. If the user is being forced to give up their keysafe
  name and password, they could provide the fake name, and if it were
  used, their data would get deleted from the keysafe servers.

### Better object-id derivation

An idea from Ben M:

> I was the fellow who mentioned using an HMAC instead of
> append-index-and-hash to generate the object-ids in keysafe.
> 
> That's probably an okay approach if you need to bind the output to a
> particular input string, but on reflection (unless I missed something)
> it would be equivalent for keysafe to take a stream and chop it up, then
> just "number" the chunks sequentially.
> 
> In that case, the "most correct" choice would probably be HKDF (RFC5869
> [1]).  Specifically, the second part of HKDF -- "HKDF-Expand".
> 
> (The first part, HKDF-Extract, is appropriate to apply /before/ key
> stretching, but stretching itself serves much the same purpose --
> removing "structure" from the input key.  Especially given that Argon2
> is designed specifically to handle user passwords, I expect that
> HKDF-Extract is entirely unnecessary here.)
> 
> HKDF is what TLS 1.3 will use to expand its per-session master keys into
> individual keys for encryption and MACing [2], and AFAIK is generally
> considered The Right Way to generate a stream of distinct keys from a
> master key, where the compromise of any key should not permit derivation
> of the others.
> 
> So, um.  Pretend I never mentioned HMAC, but spruiked HKDF instead :)
> 
> (Of course, this is pretty much bikeshedding.  A first pre-image attack
> on SHA-2 in the near term would be a rude shock, and a full break would
> break HKDF too.  But HKDF may prove more robust in the face of partial
> breaks, giving more time to move everyone to a new hash or scheme.)