The initialization me->locked=1 in lock() must happen before
next->locked=0 in unlock(), otherwise a thread may hang forever,
waiting me->locked become 0. On weak memory systems (such as ARMv8),
the current implementation allows me->locked=1 to be reordered with
announcing the node (pred->next=me) and, consequently, to be
reordered with next->locked=0 in unlock().
This fix adds a release barrier to pred->next=me, forcing
me->locked=1 to happen before this operation.