ńňđ. 10 
and Highland [16] provide a brief, analysis and conjecture on possible moti
vations for some aspects of the attack. Here, we attempt to summarize the
crucial points of the attack, primarily following Daum.
First, it is interesting to consider the development of attacks on ciphers
in the MD4 familywhich includes MD5. Dobbertin applied the method he
used to attack MD4 to MD5. He had considerable success against MD5, as
indicated by the discussion in [34]. However, Dobbertin was unable to obtain
a collision for the full MD5 hash function. Dobbertinâ€™s technique is based on
modular differences and equation solving.
Chabaud and Joux [27] had success against SHA0 using XOR differences.
In their method, the analysis is accomplished by approximating the nonlinear
parts of the hash by XOR. This technique is somewhat analogous to what is
done in the linear cryptanalysis of block ciphers.
Wangâ€™s attack [157] on MD5 uses the modular difference for inputs and
the â€śmore preciseâ€ť signed differences for outputs. In this way, Wang has
considerable control over the outputs, yet she is still able to work with the
actual step functions, not approximations. Another important feature of
Wangâ€™s attack is that the colliding messages she finds consist of two message
blocks. In this way, the attack on the second block becomes, in effect, a
chosen IV attack.
But the real magic in Wangâ€™s attack lies in the selection of the difference
patterns. Apparently, the input differences were selected so as to behave
nicely in the later rounds. More precisely, Wang takes advantage of the
fact that an output difference of 2â€ť â€śis propagated from step to step with
probability 1 in the third round and with probability l / 2 per step in a large
part of the fourth roundâ€ť [34]. In fact, this is precisely the â€śspecial propertyâ€ť
HASH FUNCTIONS
238
of MD5 alluded to in Section 5.4.3 that allows Wangâ€™s methodwhich would
be expected to work on any threeround MDlike hashto succced on the
fourround MD5 hash.
The choice of the output differences is the greater mystery. Perhaps the
best that can be done is to analyze the bit differences, as per 1641, in an effort
to gain some insight into the Wangâ€™s approach. We delve into this analysis
tielow.
There is no known method for automatically generating useful difference
patterns. Daum 1341 suggests building a â€śtree of difference patternsâ€ť, in
cluding both input and output differences. Specifying the input difference
pattern would limit the branching to something more manageable, but the
the tree must still be pruned since the growth is exponential. However, most
branches would have low probabilities, so a cost function that incorporates
probability could be used for pruning. The suggestion is t o use use a meet
inthemiddle approach to find â€śinnerâ€ť collisions, that, is, collisions after a
few steps. Then, presumably, these inner collisions could be strung together
to create a collision for an entire message block. However, Daumâ€™s approach
has not yet produced a useful difference pattern, and neither has any other
approach other than Wangâ€™s intuitionat least not yet.
In spite of the fact that the mechanics behind Wangâ€™s attack are now
relatively wellunderstood, and several incremental improvements in the at
tack have been made, nobody has been able to produce a different useful
differential pattern. This is perhaps the strongest indication of the extreme
cleverness that underlies Wangâ€™s attack, even if Wang herself cannot fully
explain it.
Reverse Engineering Wangâ€™s Attack
5.4.5
As mentioned above, when Wangâ€™s team initially revealed an AID5 collision,
t,hey provided virtually no information on how the collision was obtained.
This led Hawkes, Paddon, and R.ose [64] to do some extremely detailed de
tective work, based entirely on the one published collision. This work is
interesting in its own right, but it is also significant since the most efficient
a.tt,acksare based on this work, not on the limited details provided by Wang.
Amazingly, this reverse engineering effort discovered iiseful information on
the computational part, of the attack that was apparently unknown to Wangâ€™s
teairi. In addition, this analysis provides the best hope of obtaining additional
insight into the constriiction of output differentials.
In 1641, the authors began with the only MD5 collision known at that
timc, and carefully analyzed the intermediate values (the outputs) that are
generated when these two inputs are hashed. By carefully analyzing the
differential conditions at each step, they werct able to derive conditions on the
the outputs and thereby obhiri a set of conditions that, if satisfied, will yield
5.4 MD5 239
a collision. Remarkably, this reverseengineering work is the basis for all of
the fast MD5 collision attacks developed to date. Here, we only consider the
first few steps for the first message block Ado.
Conditions on Tj
In this section. we use the notation
+ Qj4 + Kj + Wj
Tj = F ( Q j  1 , Q j  2 , Qj3)
Rj = Tj << s j
<
Rj
Qj = Qji+
which is valid for the first round, j = 0 , 1 , . . . ,15, since the function F is
specified. Here, each Q j , K j , W j , and sj are identical t o those given in the
description of the MD5 algorithm as presented above, and the initial values
are denoted Q  4 , Q  3 , Q  2 , and Q1.
Define A t o be the modular difference operator, A X = Xâ€™  X , where
the difference is taken modulo 2â€ť. For j = 0, 1, . . . ,15, we have
+ A Q j  4 + AWj (5.49)
ATâ€™ = A F ( Q j  1 , Q j  2 , Qj3)
ARj (ATj)<< sj < (5.50)
+ ARj (5.51)
AQj1
AQj=
where the approximation in the second line holds with a high probability (see
Problem 23) and
= F(QS1, Qi2, Qi3)  F ( Q j  1 , Q j  2 , Q j  3 ) .
AF(Qj11 Qj3)
Qj2,
We use AFj as shorthand for A F ( Q j , Q j  1 , Q j  2 ) .
Using the only MD5 collision available a t the time, the authors of [64]
computed AQj, A F j , AT,, and A Rj for each j . Then they were able t o derive
conditions on the the bits of AT, that ensure the desired differential path
will hold. These conditions for the first round of the message block Ado are
summarized in Table 5.11. To save space, in Table 5.11 and in the remainder
â€ś+â€ť
on top of n t o indicate 2n, â€śâ€ť t o indicate 2n
of this section, we put
and P to indicate that the number could be 2n or 2n. Then, for example,
(31 23 6 ) = f 2 3 1 + 223  26.
ff
This compact notation appears in several MD5 papers.
Next, we analyze the first few rows of Table 5.11 in some detail to de
termine conditions on Tj that will ensure the rotation yields the desired ef
fect. The rotation requires careful analysis. For example, suppose Tâ€™ = 2z0
and T = 219 and s = 10. Then A T = 219 and
( A T ) < s = ( T I T ) << = (Tâ€™ << s)  (T << s ) = 229.
<< < <
<
HASH FUNCT1OI"US
240
Table 5.11: First Round of Mo [64]
 
31 31
* + 1â€™9 li &&
12
0
&T
Il
+ +
& ?76 + +
 

14 10 151410 17
0
+ ++
5+ ++
+ 
27 2 16 10 5 2

2725 16 10 5 2 22 27 24 17 15 6 1
0
* + ++ + + +
+˜++t
 
31 24 16 10 8
31 24 16 10 8 6 7 312317156
0
++
+
'& 2 6 80
0
i 
62
312623206 0 12
0
+++ +
1â€™3
1
 
27 30 12
23 13 6 0 0 17
I
+ 
     
15 23 17 8
80 3013 7
22
2+41+3$
i 7
$ 1'7 1'786
0
12 1
l1
3i & 13

0 12 12 24
13
+ +
i
3i 18 30&

c5
31 17
31 15 3 â€™
13
++
2'5 13 7
31; 5 G T5 3
0
31 25
9 22
15
In this example, the difference and rotation commute, that is, it does not
matter whether the difference or rotation is applied first. But this is not
+
+
always the case. Consider, for example, T' = 222 and T = 221 220 2".
Then. as in the previous example, AT = 2'", and hence for s = 10,
( A T ) < s = 229,
<<
but
+ 22") = 229 + 1
(2"1 + 230
(T' << s)  (T << s) = 2 O
<
< ˜
Now consider another example involving negative numbers. Suppose that
wc have T' = 2" and T = 220 and s = 10. Then
( A T )<< Ly = ( T I<< s )  (T << s ) = 229.
< < <
17,
However, if's =
(AT)< s
<< 25
=
but
(T' << S )
< (T <<
< 24  25 24.
= =

These examples illustrate that when AT is specified, we can still obtain
differcnt values for A R = AT << s by placing various restrictions on T .
<
In particular, it is possible to specify T so that bits propagate via the left
rotation into lowerorder bit positions, regardless of the magnitude of the
5.4 MD5 24 1
rotation s. This fact can be used to determine conditions on Tj that must be
satisfied for the required differential path t o hold.
Next, we outline a few steps in the process whereby restrictions on Tj are
deduced from a given collision. But first note that 2"' = 231 (mod 232)
+
always holds and, modulo 232, we also have 231 231 = 0. We make use of
these facts below.
Now consider the second row of Table 5.11 (step 4). From the known
collision we find AQ3 = 0, for AQ4 = 26. From (5.51), we have
AQ4 = AQ3 + ARq = 2 6 ,
which implies that with high probability
AR4 = (AT4<< 7 ) = 2 6 .
<
Since AT4 = Ti  T = 231, the condition (T4 = l ) is sufficient t o ensure
4 ˜
that AT4 = 231. As in the MD4 attack, here we use the notation (Y = a ) i
to indicate that bit i of Y is a , and (Y = ˜ ) i , ,denotes that bits i through j
.˜
of Y are all set t o a.
Step 5 is also reasonably straightforward to analyze. In this step, we
+
have AT, = 219 211. Also, we have
AQ4 = 26 and AQs = 1 t 2 ˜ ˜ + 26, ˜
2 ˜
and it follows that
= 12, we want
Since sg
which holds provided that AT, = 2'' + 2l' does not propagate into higher
+
order bit positions. For example, we cannot have AT5 = 220 2'' 211, since
this would cause bits t o "wrap around" after the shift by 12, resulting in an
incorrect value for ARs. We see that the condition (Ts = 0)12 is necessary,
as is a more complex condition that will restrict borrows; see [64] for details
on this latter condition.
For step 6, the situation is slightly more complicated. At this step, we
have AT,; = 214  21Â°. Since we have s6 = 17, we want
But if we simply rotate AT6 = 2'*  21Â° by 17, we obtain 231  227, which
is not the desired result. In this case, we must rewrite AT6 so that a bit
HASH FUNCTIONS
242
wraps around into the loworder position after the rotation. This is easily
accomplished by rewriting AT, as
+ 214  2â€™O.
AT, = 215
Among other conditions, this implies we must have (Tb = 0)17; again, see [64]
for more details on the additional restrictions.
Continuing in this nianner, it is possible to obtain a set of conditions on
the T3 that must be met for the differential to hold. These conditions are
specified i n excruciating detail in [64]. It is interesting to note that prior to
Stevensâ€™ work [144], these conditions were not used directly in the collision
finding attack. We briefly discuss Stwensâ€˜ implementation of Wangâ€™s attack
after analyzing the outputs.
Conditions on Qj
Next, we consider conditions on the outputs, that is, the Qj. These arc the
conditions that all attacks to date attempt to satisfythe rnore of these con
ditions that can be satisfied deterministically, the more efficient the resulting
a,tt,ack will be. For this analysis, we again restrict our attention to the first
few rounds.
To analyze the output differences, we require a difference operator that
is rnore â€śpreciseâ€ť than either the modular difference or the XOR difference.
Here, we use the signed difference, V X , discussed above. Recall that this
difference operation provides more information than the modular difference
arid the XOR difference combined.
WC consider 32bit words so that, for example, if
x 0x02000020 and X 0x80000000,
= =
then V X is given b y
........ ........
.....+. ..+.....â€ť
â€˜i
For all steps of the MD5 collision provided by Wang, the authors of [64]
computed the values of AQj and 0Q.j. In Table 5.12 we have reproduced
these results for the first round of Mc,,translated into our notation. Of course,
this MD5 collision was generated using hâ€™angâ€™s method, so an analysis of this
collision should provide clucs as to Wa,ngâ€™s approach.
Also, for this same &ID5 collision, the authors of [64] found the values
of AFj arid V F j for all steps. III Table 5.13 we have reproduced these resiilts
for t,lie first round. The results in Table 5.13 were computed from the same
WID5 collision that was used to generate the data in Table 5.12.
Here, we only consider the first rouiitl; where the function F is used.
Recall that
F ( A ,B, ) = (A A B)V ( 1 A A C ) .
C
5.4 MD5 243
Table 5.12: First Round Output Differences [64]
0Q.j
................................
0
03
. . . . . . . . .++t+++ ++++++++  I  t . . . . . .
˜
6
4
3i s *. . . . . . . + . . . . . . . . . . . . . . . .  . . . . . .
$3
5  +
*+tttt ., . . . . . . . . .  +++ +++++++
6 2723 6 0
17 ti . . . . . . .  . .+++ + . . . . . . . . . . . . . . +
.
7 ˜3 15
&s?j * . . . . . . . . . . . . . . . . . . . . . .  ++....+
8
f+
*. . . . . . . . . . . . . . . ..+.... . . . . . . . .
31 12
9
&3fo *+...... . . . . . . . . . . . . . . . . . . . . . . . .
10
* . . , . , . . . . . .  + +t +++.... + . . . . . . .
I
$ 13
1
11
*+
* . . . . .+ . . . . . . . . . . . . . . . . . . . . . . . .
31 24
12
*. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
i
31
13
3i 15; zk . . . . . . . . . . . . . . .  . . . . . . . . . . . f . . .
14


This function uses the bits of A to choose between the corresponding bits
of B and C. That is, if bit i of A is 1, then bit 1; of F ( A , B , C )is bit i
of B ; otherwise, bit i of F ( A , B , C )is bit i of C. Using the information in
Tables 5.12 and 5.13, and the definition of F , we can derive conditions on the
bits of the Q j .
For j I 3 we have AQj = AQj1 = AQj2 = 0 which implies AFj = 0,
that is, Fj = Fj. Using the notation defined above, it follows that V F j is
................................ >
LL
for j 5 3 . This imposes no restrictions on the corresponding Qj. Conse
quently, no conditions are imposed on any Qj based on the analysis of steps 0
through 3 .
Now consider F4. In this case, the attacker has AQ2 = AQ3 = 0
and AQ4 = 26, and wants to obtain AFq = 219 + 2â€ś. From the collision
results in Tables 5.12 and 5.13, the relevant information for F4 is collected in
Table 5.14.
From VQ4 in Table 5.14, it immediately follows that
o)˜˜...˜˜.
and
(Q4 = 1 ) g (Q4 =
We refer to bits 9,10,. . . , 2 5 as the Lâ€˜nonconstantâ€ťbits of Q 4 , while the re
maining bits are the â€śconstantâ€ť bits of Q 4 . That is, Q&= Q 4 on the constant
bits, while Q&# Q 4 on the nonconstant bits.
HASH FUNCTIONS
244
AF, VF7
jl
. . . . . . . . .++++++ +.++++.. . . . . . . . .
14 10
51
+ ++
27 25 16 10 5 2 . . . . _ . . . . . . . .+ . . . . . + . . . . + . .  . .
.
+ ++
i
. . . . . . .+ . . . . .+ . + . + . . . . . .

31 24 16 10 8 6 i
7
*I
f+ i+
31 26 23 20 6 0 I....+.. . . . . . . . . . . . . . + . . . , . +
..
 ++f
. . . . . . . . . . . . . . . ,.+..... .+.....+
23 13 G 0
9  
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
.
80
10
*.. . . . . . . . . .+. . . . . . . . . +. . . . . . .
$ ?7?
1
11
* . . . . . . . ....+++ +++ . . . . . . . . . . . . .
F3
? l
12
*+ *. . . . . . . . . . . . + . . . . . . . . . . . . . . . . . .
31 18
13
it
i. . . . . + . . . . . . . . . . . . . . . . . . . . . . . . .
31 25
14
Tablc 5.14: F4 Computation
v
A
................................
0
Q2
................................
0
Q3 
++......
. . . . . . . . .++++++ ++++I+++
6
Q4
++
F$ 1911 . . . . . . . . . . . . + . . . . . . . + . . . . . . . . . . .
First, we consider the constant bitas of Q 4 . l l i e function F ( Q 4 , QR, z )
Q
selects the bits of Q:j or Q 2 based on the hits of Q 4 . From VQ4, we have
and for each of these bits of Q 4 it, follows that:
( Q 4 = Q&)0...8,2(j...31
Qi)j.
l ) j , then (Fa = Q 3 ) j and ( F i =
If (Q.1 =
0
and (Pi = Qâ€™,),?.
0 ) j , then
If (F4 = Q 2 ) . j
(Q4 =
0
Siiicc: Q 2 = Qh arid Q:j = Q:%, have ( F i = F4)j for each constant bit j .
we
From Table 5.14 we see that this is the desired condition, since each of tllc
constant bit,s of Q d is also a constant bit of F4. Conscquently, the desircd
coiidit,ions on F4 arc rnet, and no restrictions on Q j are implied.
Next, we deal with the noiicorista,rlt bits of Q 4 . Note that VQ4, which
appears in Table 5.14:immediately implies
5.4 MD5 245
Also, on the nonconstant bits of we have:
Q4,
and ( F i = Qâ€™,)j.
If l ) j ,then
(Q4 = (F4 = Q 3 ) j
0
and ( F i = Q k ) j .
If O ) j , then (F4 = Q 2 ) j
(Q4 =
0
For VFq in Table 5.14 to hold, it is necessary that (Fi = F 4 ) l o , l l , l 3... 19,21... 25.
But, we have (Q& = 1)10,11,13... 19,21...25 so that (Fi = Q ˜ ) i o , i i , 3 . . .19,2125.
i
Since (Q4 = 0)10,11,13 ... 19,21...25 it follows that ( F 4 = Q2)10,11,13 ...19,21...25. Fur
thermore, since Q3 = Qâ€™,, for the V F 4 condition to hold for bits 10, 11, 13
through 19 and bits 21 through 25, the conditions
(Q3 = Q2)10,11,13 ...19,21...25
are sufficient.
We still must consider the nonconstant bits in positions 12 and 20. Here,
we require that (Fi = 1)12,20 and (F4 = 0)12,20. From V Q 4 it follows
that (Qk = 1)12,20 and (Q4 = 0)12,20, which implies that ( F i = Qk)12,20
and (F4 = Q 2)12, 20. Since Qâ€˜, = Qs, the desired condition on F 4 holds pro
vided that
(Q3 = 1)1 2 ,2 0 and (Q2 = 0)12,20.
Finally, we have to deal with bit 9. From V Q 4 we have (Q& = O ) g
and ( Q 4 = 1 ) 9 which implies (8â€™; = Q k ) g and (F4 = Q 3 ) g . From VF4,
the desired condition is ( F i = F 4 ) g . Since Q 2 = QL, the required condition
here is
(Q2 = Q 3 )9 .
All of the conditions derived at step 4 are listed in Table 5.15
Table 5.15: Step 4 Conditions
0 ) i o ...,25
(Q4 =
(Q4 = 1 ) 9
(Q3 = 1)12,20
(Q2 = 0)12,20
(Q2 = Q 3 ) i o , i i , i 3 ... 19,21...25
Next, we analyze one more step in this process. Note that all of the steps
for both message blocks are analyzed in [64].
From Tables 5.12 and 5.13, we have extracted the relevant information
for step 5 and collected it in Table 5.16. In this case, the attacker knows
that AQs = 0, A Q 4 = 26 and A Q 5 = f 2 3 1 + 223  26, and he wants to
ensure AF, = 214  21Â°.
HASH FTJNCTIONS
246
Table 5.16: F:, Computation
v
In1
From VQs, we have (Q5 = 0 ) 8 and (Q5 = l ) 2 5 . Here, no restriction
is implied on ( Q 5 ) O (although we do obtain a restriction on this bit in the
analysis below).
Now consider the constant bits of Q5, that is, those bits where Qk = Q5.
From Table 5.16, we have (Qk = Q 5 ) 1 ...7,9...24,26...31. The function F5 selects
between the bits of Q 4 and Q 3 , depending on the corresponding bits of Qj,
with Q 3 selected if the bit of Q5 is 0, and Q 4 selected if the bit of Qs is 1.
The analogous statement holds for Fi on the constant bits of Q 5 .
From Table 5.16, we see that the desired condition on VF:, is gâ€™ riven
by (VF5 = VQ4)9 ... 16.18...21 and this will hold provided F selects Q&and F5
A
selects Q 4 on these bits. Since we have (Qk = Q5)9...16,18...21, (i.e., these
bits are among the constant bits of Q5) this will hold true provided that
we have ( Q 5 = 1 ) 9...16,18...21. Also, (VF5 = V Q 3 ) 1 7 , 2 2 , 2 3 , 2 4 and this holds
provided F selects Q and F5 selects Q 3 . Since these are also among the
A L
constant bits of Q 5 , we require t,hat ( Q 5 = 0)17,22,23,24.
For the remaining constant bits of Q we want (I$ = F5)j. But each such
5
constant bit of Q5 is also a constant bits of Q 4 and Q3, that is, (Qi= Q 4 ) j
and (QS = Q 3 ) j . Consequently, on these bits it does not matter which bits
arc selected, and therefore, no additional restrictions are implied.
Now consider the nonconstant bits of Q5, that is, ( Q 5 ) 0 , 8 . 2 5 . Since we
havc (Qh = 1)s and ( Q 5 = 0)s. it follows that (I$ = Q&)8and (F5 = Q:3)8.
From Table 5.16, the desired condition here is (F; = F5)8. Since (Qi= Q4)8,
wc require that ( Q 4 = Q:3)8.
Also, we want (F; = F,)25. From VQ5 in Table 5.16 we havc: the condi
tions (Qk = 0 ) 2 5 and ( Q 5 = 1 ) ˜ s Therefore, (Fâ€™; = Qi)2:, and ( f = Q4)25. ˜
.
Sincc we have (Q3 Q!3)25r this yields the condition (Q3 = Q 4 ) 2 5 . How
ever, from V Q 4 in Table 5.16, we have ( Q 4 = 0)25 (which was already noted,
above), and, consequently, the new condition here is ( Q 3 = 0)25.
Finally, we have (Qk # Q 5 ) 0 arid we want t o ensure that ( F i = F5)o.
Sirice (QL = Q 3 ) o and (Q& Q 4 ) 0 , the desired condition will hold provided
=
that ( Q 4 = Q s ) ˜ .In Table 5.17 we list all of the conditions derived based on
of the constant and nonconstant bits in step 5.
5.4 MD5 247
Table 5.17: Step 5 Conditions
Continuing in this manner, it is possible to obtain a set of conditions on
the outputs Qj. If these conditions are all satisfied, then a collision will result.
Each such condition can be expected to hold, at random, with a probability of
about 1/2. Based on the examples presented here, it is apparent that there
will be a large number of conditions that must be satisfied. In fact, there
are over 300 conditions on the first message block that must be satisfied.
If the attacker simply generates randoin messages and checks whether the
conditions are met, the â€śattackâ€ť would be much worse than the birthday
attack.
Fortunately, all conditions that occur in the first round (that is, within
the first 16 steps) can be satisfied simply by directly modifying the message
words. In addition, some conditions in later steps can be satisfied by a more
complex modification procedure. It is critically important that the differential
has been constructed so that the number of conditions is large in the early
steps, but very small in the later steps. In this way, the number of conditions
that can be satisfied deterministically is large, while the number that must be
satisfied probabilisticallywhich determines the work factor of the attackis
much smaller.
The work presented in [64] (and outlined above) can be used t o develop an
efficient attack to find MD5 collisions, as suggested in the previous paragraph.
The idea behind the attack is straightforward. First, we choose the message
words so that all of the conditions on Q j , for j = 0, 1,.. . ,15, are satisfied
using a â€śsinglestep modificationâ€ť technique (as explained below). Then we
use a â€śmultistep modificationâ€ť technique, due to Wang, whereby some of the
conditions on Qj for j > 15 can be forced to hold, while all of the conditions
on the Qj, for j 5 15, still hold (this is outlined below). Then we test all
of the remaining &;I conditions. If all of these are satisfied, then we have
found a collision; if not, we generate another candidate message that satisfies
all of the deterministic conditions (that is, the conditions in the early steps)
and again test the probabilistic conditions (that is, the conditions in the later
steps), and we repeat this until it collision is found.
The Q j conditions that are satisfied probabilistically each hold with a
probability of about 1/2. Consequently, the work factor for the attack is de
termined by the number of conditions that must be satisfied probabilistically.
HASH FUNCTIONS
248
Of course, this entire process needs t o be repeated for both message blocks,
A40 and MI. With some refinements, this is the path that all of the MD5
collisioii attacks have followed up to the time of this writing.
All iniprovements in the collision attack to date revolve around providing
a way to det.erministically satisfy more conditions by choosing the incssage
block appropriately, thereby reducing the number of conditions that must be
satisfied probabilistically. Before leaving this part of the attack, we show that
it is also possible to force some of the conditions on the Tj to hold, that is,
some of the Tj conditions can be satisfied deterministically.
Stevens [144] observes that
antl he notes that this opens the possibility of specifying conditions on Qj
antl Q,i1 that will force conditions on Tj to hold. For example, above we
derived the condition (T4 = 1)". Since 74 = 7, we have (T4)o = ( 4 2 and R)5
from the analysis above, ( Q 4 = 0)25, ( Q 4 = O ) ˜ Sand (Q4 = 1 ) 2 6 . With the
additional condit,ions (Q3 = l ) 2 7 , ( Q 3 = 1 ) 2 6 and ( Q 4 = 0)25, we are assured
that ( 8 4 = 1 ) 2 5 , and hence (T4 = l ) ˜ , desired. To see that this is the case,
as
riotc that the subtraction is given by
*
010 . . . (Q4)25,26,27
* (Q3)25,26,27
. . . IIx.. . (R4)253,27
t
f
where (R, = l ) 2 5 follows since a borrow from higherorder bit positions must
occur.
Continuing in this manner, Stevens [144] is able to derive conditions that
force several of the Tj conditions to hold. This work yields the fastest MD5
attack as of the time of this writing. Using a typical modern PC, Stevens'
attack takes about, two minutes, on average, to find a collision. We give
Stevens' a.lgorithni, below.
The conditions for the first message block, as specified by Stevens' attack,
are given in the Appendix in Tables A6 and A7. For Q,j, a indicates
"^"
that, thc specified bit must agree with the corresponding bit of Qj1, while
L( I " indicates that the specified bit must not agree with the corresponding
.
bit of Qj1. On the other hand, a "." indicates that there is no restriction
on that, particular bit.
The corresponding tables for the second message block M I are due to
Klima [81] and these tables can also be found in [144]. For completeness, we
reproduce these in the Appendix in Tables A8 antl A9. The attack to find
the first message block, A[", is more costly than the attack for the second
rnessagc hlock, MI, the overall work is dominated by finding a 512bit
so
rriessagc block satisfying the conditions in Tables A6 and A7.
5.4 MD5 249
Singlestep Modification
The idea behind the singlestep modification technique (also known as single
message modification) is straightforward. We simply use the fact that each
of the 16 message words appears once in the first 16 steps, and the fact that
by modifying a message word W j ,we can change the output Q j . An example
should make the process clear.
As mentioned in the previous section, we first select the message block at
random. Then we use the singlestep modification technique to force all of
the conditions on Q j to hold, for j = 0 , 1 , . . . ,15, by modifying the message
words. Next, we show precisely how this works, but first note that if
then from Table 5.10 we have Wi = X i , for i = O , l , . . . ,15.
Suppose that we have randomly selected &fo = ( X o , Xi,. . . , X I S ) the
as
first message block. Let mi, for i = 0,1,. . . , 6 3 be the corresponding input
words to the MD5 algorithm. Our goal is to modify &fo to obtain a mes
sage block Mo = ( X o ,X I , .. . , X i s ) for which all of the first round output
conditions hold, that is, all of the conditions on Q for i < 16 hold.
i
Now suppose that we have already found Xo and X1 and consider step 2.
Recall that the IV is denoted Q  4 , Q  1 , Q2, Q  3 . Then using A&, we com
pute
++ + +K2) << SZ,
< (5.52)
Q2 = Qi Q2 W2
(fi
where fi = F ( Q j , Q o , Q  l ) . We want to transform Q 2 to QZ so that the
conditions in the Q 2 row of Table A6 hold, namely, (Qz 0)12,20,25. For
=
each i = 0 , 1 , . . . ,31, let Ei be the 32bit word defined by
# i,
(Ei = l ) i and (Ei = 0 ) j for j (5.53)
that is, Ei is 0 except for bit i, which is 1. Then we have Ei = 231i. Denote
the bits of Q 2 as 
(qo,qi,q2,...,qx).
Q2=
Let D = 412E12  q20E20 Then the desired conditions on are
q25E25. Q2

satisfied by letting
+ D.
Qz = Q 2 (5.54)
Now suppose that we replace 1?/2 in (5.52) with the value of W2 for which
be determined algebraically. Doing so, we find
Then this value W2 can
HASH FUNCTIONS
250
From (5.54), we see that, Q2 is known, and all other terms on the righthand
side of (5.55) arc known, so we have determined W2. Then letting X z = W2,
we obtain Qz which satisfies the required output conditions at step 2.
After repeating a similar process for each of steps 0 through 15, we will
have determined a message Mo = ( X o ,X I ,. . . , X I S for which all of the output
)
conditions in these steps hold. We could then simply test the remaining
conditions and if all hold, we have found a collision. If any condition beyond
step 15 does not hold, then we could select a new random A& and repeat
the entire process. Since each condition is expected to hold at random with
a probability of about 1 / 2 , this yields an attack with a work factor on the
order of 2': where c is the number of conditions in steps 16 through 63.
Using only singlestep modificat,ions provides a feasible shortcut attack.
However, it is possible to fiirt,her reduce the work factor by using the multi
st>epmodification technique described in the next section.
Multistep Modification
Wang's multistep modifications (also known as multimessage modifications)
make it possible to satisfy some of the conditions in steps beyond 15. It is
critical that when we satisfy conditions by this approach, we do not violate
the out,put conditions from previous steps. This makes the multistep modi
fication more complex than the singlestep modifications.
There are actually several multistep modification techniques, some of
which are very convoluted, and some of which are not entirely deterniinis
tic, that is; the condition can fail with some small probabilit,y. Here, we
describe the simplest example of a multistep modification. The paper [16]
discusses some other multistep modifications, while Daum [34] provides a
good description of several such techniques.
Let, f = (XO, I ,
i
" X 2 1 5 ) be the message block h f ˜
after singlestep
p 16, where we want the output condition specified
modifications. Consider
by ( Q l 6 = 0)" to hold (see the Qle row of Table A6). We have
+ (f15 + Q I + w1(i + K l s ) << S I A :
<
˜
Q15
Q1(;
where I ˜˜ I= .X i and f 1 5 = G(Q15, Q 1 4 , Q I : ˜ ) .
o
and define D = qoEo, where E is given
i
Let Q l ( j = (qO,ql, ? q : % l )
in (5.53). Then it is easy t,o verify that Q16 = Q l s + D will satisfy the required
condition at step 16. As with the singlestep modification, we replacc
with W1(; that
so
+
+ ( f i 5 + Q12 + w˜c, KIG)< S I F
<<
Qic, = Q i 5
Solving, we find
Q I ˜ >> c51s)
)>
Wici KIG.
f15
= ((QIA Q12
   
5.4 MD5 251
Since W ˜ S= XI, we must ensure that all of the conditions in the first round
involving X1 still hold. Since Qi, for i = 1,2,3,4,5, also depend on XI, we
must carefully consider each of these steps. However, since no conditions were
previously specified on Q1, the i = 1 case is not a concern.
We have determined a new input at step 16, namely, W ˜ S XI. From
=
the singlestep modification, we have
+
Q + (.fo + K1) << ˜ 1 .
o <
Q1 = Q3 +Xi
Now compute
+ (fo + Q3 + + Ki) <<
= QO <
Xi SI,
that is, 2 is the new Q1 that results from the modified X1 computed in
step 16. Since no conditions were specified on Q1, we will not violate any
previous conditions by letting Q1 = 2.
Next, recall that
z+ ( f l ( Q 1 , Qo, QI) + Q2 + 2;. + K2) << ˜
<
Q2 = 2.
so
Using the same approach as the singlestep modification, we choose X2
that
z+ ( f l ( z ,Qo, Q1) + Qz + X2 + K2) << 5 2 ,
<
Q2 =
which implies that
2)>> s2)  f l ( z , Qo, QI)
>
((Qz
X=
2 K2.
Q2
  
Observe that by selecting X2 in this way, the modification we made when
selecting XI, as required for step 16, will not affect any of the output condi
tions from step 2. That is, all of the conditions on Q 2 that hold as a result
of the singlestep modifications still hold true.
Similarly, we choose
((Q:< Q2) >> ˜
> Qo)
X3 = 3  f2(Q2,z,
)
and
>> s4)  f 3 ( Q 3 ,
>
Q3)
X = (64
4 (2 Q2,

and, finally,
>> ˜
>
X5 = 2  K5.
5  f 4 ( Q 4 , Q3, Q2)
((Q5 )
Q4)
 
Since Z (the new Q1) is not used in the calculation of any other Q i , no
other X i must be modified.
The bottom line here is that we now have deterministically satisfied the
conditions on step 16, while maintaining all of the conditions on steps 0
HASH FUNCTIONS
252
through 15 that resulted from the singlestep modifications. The multistep
modification considered here is the sirnplest case. Several other methods have
been developed in an effort t o slightly reduce the work factor of Wangâ€™s attack.
The evolution of Wangâ€™s attack seems t o have reached the point wherc the
attack is so efficient (about two minutes for Stevens [144] implementation)
and the difficulty and complexity of finding improved multistep modifications
is now so high and many of the more advanced modification techniques only
hold probabilisticallythat it appears likely that further improvement along
these lines will be incremental, a t best.
Stevensâ€™ Implementation of Wangâ€™s Attack
5.4.6
Stevens [144] gives the algorithm in Table 5.18 for finding a message block n/l,
satisfying Wangâ€™s differential conditions. This attack is based on the set of
output conditions given in Tables A6 and A7, which appear in the Appendix.
Table 5.18: Efficient Algorithm t o Find h (
f,
// Find 1 1 = ( X â€ť ,X I , .. . , X I S )where â€śall 1110 conditionsâ€ť refers to:
,
10
// all Table A7 conditions,
all IV conditions for M I (see Table A8),
//
// both ( T 2 1 = 0)14 and ( T J = O),G ˜
Find Al
repeat
Choose Q o ,Q 2 , Q 3 , . . . , Q15 satisfying conditions in Table A6
Compute X o , X s ,X i , . . . , Xi5
repeat
Choose Q ˜ satisfying conditions
G
Compute X I using j = 16
Compute Q1 and X 2 , X : $X4, X s ,
Compute Q17, Q I ˜ QN, Q ˜ o ,
u n t i l QlH, Q 1 7 , . . . , Q 2 0 satisfy conditions in Table A6
f o r ( Q 8 , Q9) consistent with X I1
Compute X x ? g , Xio, Xi2, Xi:$
X
Compute Q21, Q 2 2 , . . . , Qss
i f all A40 conditions arc satisfied then
return M
end i f
next ( Q x , Q g )
u n t i l all 1110 conditions are satisfied
end Find
Strveris also presents ail efficient algorithm for the second block, All,
5.4 MD5 253
which we give in Table 5.19. This is identical to the algorithm given by
Klima in [81]. However, finding MO dominates the overall collisionfinding
work, so all efforts to improve Wangâ€™s attack have been focused on the first
message block, Mo.
Table 5.19: Efficient Algorithm to Find M1
// Find M1 = (XO, I , ,. . , XIS),where â€śall MI conditionsâ€ť refers to:
X
// all Table A9 conditions,
// both (7â€™21 = 0)14 and (T33 = 0)ls
Find M1
repeat
Choose Q1, Q 2 , . . . ,6215 satisfying conditions in Table A8
Compute Xd, Xg , . . . , XI4
repeat
Choose Qo satisfying conditions
,
Compute Xo, XI,XZ X 3 , X ,
Compute Q16, Qi7, Ql81Q19, Q20
until Q16, Q 1 7 , . . . ,Q 2 0 satisfy conditions in Table A8
f o r (Q8, Qs) consistent with X11
Compute X8rXS,XlOrX12,X13
Compute Q21, Q22, . . . , Q63
i f all M1 conditions are satisfied then
return M
end i f
next ( Q 8 , Q g )
until all M I conditions are satisfied
end Find M I
A Practical Attack
5.4.7
It is sometimes claimed that most hash collision attacks are of little or no
practical significance. For the MD5 attack, it is presently not possible to
produce arbitrary collisions, so it seems highly improbable that a meaningful
collision can be constructed. However, there are cases where an apparently
useless collision can be used to create a security vu1nt:rability. Here, we con
sider one such scenario; see Daum and Lucks [35] for the original description
of this clever attack.
Suppose that Alice wants to digitally sign the letter of â€śrecommendationâ€ť
shown in Figure 5.6, which is in the form of a postscript file, rec.ps. Alice
carefully reads the letter, which was created by her new secretary, Trudy, then
Alice digitally signs it. As usual, the signature is computed by first hashing
HASH FUNCTIONS
254
the file and the resulting hash value is signed. Suppose that the MD5 hash
function is used in this signing operation. In this case, Alice computes
s = [h(rec.Ps)lAlice.
where h is the MD5 hash, and [M]Alice denotes the digital signature of M
using Aliceâ€™s private key. Then S and the original letter, r e c . p s , can be sent
to the intended recipient, who can verify the signature using Aliceâ€™s public
key.
To Whom it May Concern:
Tom Austin and Ying Zhang have demonstrated decent programming
ability. They should do OK in any programming position, provided that
the work is not too complex, and that the position does not require
any independent thought or initiative.
However, I think they like to steal office supplies, so I would keep
a close eye on them. Also, their basic hygiene is somewhat lacking
so I would recommend that you have them telecommute.
Sincerely,
Alice
Figure 5.6: Recommendation letter.
Now consider the letter in Figure 5.7, which was printed from the file
a u t h . p s . This letter is obviously much different than the letter in Figure 5.6,
hut, incrcdibly, the files auth . p s and rec . p s have the same MD5 hash valucs.
For the specific files used in this example,
h,(rec . p s ) = h ( a u t h . p s ) = Oxc3261825f 024565d0731f a07ed660f 22,
where h is the MD5 hash.
How can these two very different letters have the same MD5 hash? After
all, in the MD5 collision attack discussed above, the colliding messages are
almost identical, with the precise bit difference per 512bit message block
specified by (5.47) and (5.48). It would seem that these two messages could
riot possibly have been generated using the MD5 attack outlined above.
In fact, the messages in Figures 5.6 and 5.7 were not directly generated
using the attack above and, furthermore, the printed text from these two
letters does not yield an MD5 collision. The identical hash values of the two
files as distinct from the actual displayed text is made possible by the fact
that postscript has a conditional statement, which enables Trudy to include
5.4 MD5 255
To Bank of America:
Tom Austin and Ying Bang are authorized access to all of my account
information and may make withdrawals or deposits.
Sincerely,
Alice
Figure 5.7: Authorization letter.
the text of both letters in both postscript files. Trudy can use a â€śmeaninglessâ€ť
MD5 collision to force the hashes of the two files to be the same, even though
the hashes of the actual printed text is not the same. To see how this works,
we need to examine the postscript inside these files.
Figure 5.8 contains excerpts of rec .ps. These excerpts have been slightly
modified for clarity. Also, â€śuâ€™â€™ represents a blank space inserted so that the
header information is precisely 64 bytes.
The postscript conditional statement is of the form
where TOis processed if the text X is identical to Y and TI is processed
otherwise. Let W be the first 64 bytes of the file in Figure 5.8. Then W
consists of all bytes up to and including the opening â€˜â€˜(â€ť in â€ś ( X )(Y>eq{.
Now let 2 = R/ID50,..63(IV,W),that is, 2 is the result of compressing the
initial block of the file. Using the MD5 attack discussed above, we find a
collision where 2 is used as the IV (the MD5 attack can be modified to work
for any IV). Denote the resulting pair of 1024byte values as M and 111 .
Let L be the file obtained by letting X = Y = M in Figure 5.8, and let L â€˜
be the file obtained by by letting X = Mâ€™ and Y = M in Figure 5.8. Then for
the file L , the two strings before the eq are identical (since both are M ) ,
which causes the postscript interpreter to only display (or print) the text in
Figure 5.6. On the other hand, in the file L the two strings before the eq
â€™
differ and therefore the else condition holds, which implies that only the text
in Figure 5.7 will be visible when the file is processed through postscript.
Furthermore, the MD5 hashes of the files L and Lâ€™ are identical. This follows
from the fact that M and Mâ€™ have the same MD5 hash (since the initial block
is W for both files), and from the fact that all bits after X are identical in
both L and Lâ€˜. Letting rec.ps = L and auth.ps = Lâ€˜ yields the results
displayed in Figures 5.6 and 5.7, with the two files having identical MD5
hashes.
H A S H FUNCTIONS
256
%!PSAdobe1.0
(x)Y )
( eq{
%%BoundingBox: 0 0 612 792, I II ,I I
IuI ,I
/TimesRoman findfont 20 scalefont setfont
25 450 moveto (To Whom it May Concern:) show
25 400 moveto
(Tom Austin and Ying Zhang have demonstrated . . .
(Sincerely,
show
25 150 moveto
(Alice)
show
}{/TimesRoman findfont 20 scalefont setfont
25 450 moveto (To Bank of America:) show
25 400 moveto
(Tom Austin and Ying Zhang are authorized access
(Sincerely,)
show
25 250 moveto
(Alice)
show
}ifelse
showpage
Figure 5.8: Postscript file.
Of course, anyone who examines either of the postscript files in a text
editor will quickly realize that something is amiss. But the whole point of
a cryptographic integrity check is that integrity problems can be detected
automatically, without human intervention. To detect this particular attack
aiitornatically is possible, but to deal with all possible attacks of this type
would be a challenge. Consequently, this attack is a realistic threat and
it nicely illustrates that there is a potential risk when any hash collision is
known, whether or not the collision itself is meaningful.
5.5 Summary
For many years, it seems that hash functions had been largely ignored by
cryptographers. But with the successful attack on MD5, and similar results
for SHA1 pending, hash functions have moved from a sleepy cryptographic
backwater to the forefront of research. It is likely that new hashing techniques
that thwart differential attacks like those described in this chapter will soon
ernrrge. This might occur through the usual research process, or there might
5.6 PROBLEMS 257
be an organized â€śbake offâ€ť similar to the process that produced the Advanced
Encryption Standard (AES).
Finally, we note that MD5 and SHA1 are not only the two most widely
used hash functions, but they are also very similar in design. This was prob
ably not a major issue when both hash functions were considered secure, but
when MD5 was broken, it became clear that this lack of â€śgenetic diversityâ€ť in
hashing was a potential problem. It is worth noting that public key cryptog
raphy suffers from a similar lack of diversity. Today, almost all of t,he public
key cryptosystems used in practice rely on the difficulty of factoring or the
difficulty of the discrete log problem (or the elliptic curve equivalents). If
a significant shortcut is found for either of these problems, it would leave a
gaping hole in the realm of public key cryptography, which would be far more
severe than the temporary turmoil in the world of hashing that was created
by Wangâ€™s MD5 attack.
Problems
5.6
1. Justify the following statements regarding cryptographic hash func
tions.
a. Strong collision resistance implies weak collision resistance.
b. Strong collision resistance does not imply oneway.
2. Suppose that we have a block cipher, where C = E ( P ,K ) , and want to
use this block cipher as a hash function. Let X be a specified constant
and let M be a message consisting of a single block, where the block
size is the size of the key in the block cipher. Define the hash of M
as Y = E ( X , M ) .
a. Assuming that the underlying block cipher is secure, verify that
this hash function satisfies all of the requirements of a hash func
tion as listed at the start of this chapter.
b. Extend the definition of this hash so that messages of any length
can be hashed.
c. Why must a block cipher used in this way be resistant to a â€śchosen
keyâ€ť attackâ€˜? Hint: Suppose that if we are given plaintext P , we
can find two keys KO and K1 such that E(P,Ko) = E ( P , K l ) .
Show that this block cipher is insecure for use as a hash function.
3 . Suppose that a hash function h has M different possible outputs. In
Section 5 . 2 we showed that about M hashes must be computed before
we expect to find a w such that h(w)= h ( z ) .In terms of M , precisely
how many hashes must, we compute before we expect to find such a w?
HASH FUNCTIONS
258
4. Suppose that a hash function h has A4 different possible outputs. In
Section 5.2 we showed that about hashes must be computed before
we expect to find a collision, that is, before we can expect to find x and w
such that h(x) = h ( w ) . In t,erms of M , give an explicit and simple
formula that is more accurate than for the number of hashes that
must be computed before we expect t o find such a x and w?
5. How could the digital signature attack discussed in Section 5.2.3 be
prevented?
6. Recall the online bid scheme discussed in Section 5.2.4.
a. Describe a forward search attack (see Section 6.5.1 for a definition)
against this scheme.
b. Describe a simple modification to this scheme that will prevent a
forward search attack
7. The Nostradamus attack is discussed in Section 5.2.4. We showed that
to apply this attack to MD5, which generates a 128bit hash, the dia
mond structure has a height of 2â€˜â€™ and the work factor is on the order
of 2x7.
a. What is the height of the diamond structure arid what, is the work
factor to apply this attack to the Bobcat hash, which generates a
48bit output (Bobcat is a â€śtoyâ€ť hash function discussed in [142])?
b. What is the height of the diamond structure arid what. is the work
factor to apply this attack to the SHA1 hash, which generates a
160bit output?
c. What is the height of the diamond structure arid what is the work
factor to apply this attack to the Tiger hash, which generates a
192bit output?
ti. Suppose that the Nostradarnus attack in Sect,ion 5.2.4 uses the diamond
structure in Figure 5.2. Then the â€śpredictedâ€ť hash value is y = d:(o.
If f ( I V , P, Sâ€™) = dos, what is the resulting message M . in terms of P ,
Sâ€™ and Mi,, such that h ( M )= y?
9. Write computer programs to verify the following, where a , b and c are
%bit bytes (as opposed to the 32bit words used in MD4).
a. For F ( a , b, c) = (a, A b ) V ( l a . A c ) :
i. F ( a , b, c) = F ( l u , b, c ) if and only if b = c
ii. F ( a , b, c ) = F ( a , h, c ) if and only if a = Ox00
iii. F ( a , b, c ) = F ( a , b, i c ) if and only if a = Oxff.
5.6 PROBLEMS 259
b. For G(u,b, c) = (u A b) V (u A c) V ( b A c),
i. G ( a ,b, c) = G ( l u ,6 , c) if and only if b = c
ii. G ( a ,b, c) = G(a,b, c ) if and only if a = c
iii. G(a,b, c) = G(a,b, y e ) if and only if a = b.
c. For H ( a , b, c) =a @b@ c,
i. H ( u ,b, c) = H(a, b,c) = y H ( a , b, c ) =  H ( u , b, y e )
ii. H ( a , b, c ) = H ( l a , 4 , c ) = H ( i u ,b, y e ) = H ( a , 4c).
,
10. Recall that for the differential attack on MD4 (which is outlined in
Table 5.5), the only difference between M and Mâ€˜ is in word X12,
+
and Xi2 = X12 1. Assuming that a similar attack could succeed
using a different word, why would it make sense to focus the attack on
word 12?
11. Show that the following statements are true, where â€śstepsâ€ť refer to the
steps of the differential attack on MD4, as given in Table 5.5. Also, G
and H are defined in (5.5) and (5.6), respectively.
a. To verify steps 22, 23, 26, 27, 30, and 31 it is sufficient to show
that
+
2 ( f l << n)),
G(X, Y , = G(X, Y,
2)
with a probability of about p = 1/3 for any n E {0,1,2,. . . ,31}.
b. To verify steps 24, 25, 28 and 29 it is sufficient to show that
+ ($1 << n ) ,2 + ( ˜ l <m ) ) ,
G ( X ,Y, = G ( X ,Y
2) <
for any n, E { 0 , 1 , 2 , . . . ,31}
m
with a probability of about p = 1/9
with n # m.
c. To verify steps 32 and 33 it is sufficient to show that
H ( X ,Y , = H ( X ,Y + l,z  l ) ,
2)
with a probability of about p 1/3.
=
d. To verify steps 34 it is sufficient to show that
+ 1)
H ( X ,Y , = H ( X ,Y ,
2) 1,
2 
with a probability of about p 1/3.
=
12. Consider the formulas given in Problem 11. Suppose that we replace
the 32bit words X, Y ,and Z with 8bit bytes x, y and z . Then we can
compute the probabilities exactly. Write a computer program to verify
the claimed probability in each part of Problem 11 using 8bit words
instead of 32bit words.
H A S H FUNCTIONS
260
13. Verify equations (5.30) and (5.32).
ńňđ. 10 