In this paper, we propose a class of division algorithms that aims to reduce the delay of quotient-digit selection by introducing more concurrency and flexibility into its computation. From this class, we select an algorithm that moves part of the selection function off the critical path, shortening the critical path relative to existing alternatives. We present the algorithm and describe architectures for radix 4 and radix 16; the radix-16 design overlaps two radix-4 stages. For both radices, we show that our algorithms allow the design of units with well-balanced critical paths and, consequently, shorter cycle times. In the radix-16 case, we also include some additional speculation techniques. To estimate the speedup, we use a rough timing model based on logical effort. For both radices, we estimate a speedup of about 25 percent with respect to previous implementations. In the radix-4 case, this is achieved with roughly the same area, while in the radix-16 case the area increases by about 30 percent. We verified our estimates by synthesizing the radix-4 units.
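
For orientation, the following is a minimal software sketch of the generic radix-4 digit recurrence that algorithms of this kind build on. It is not the algorithm proposed here. It assumes the common redundant digit set {-2, ..., 2} and replaces the hardware selection function with an idealized one (exact rounding of the shifted residual divided by the divisor), so it shows only where quotient-digit selection sits inside each iteration. The function name `radix4_divide` and its parameters are hypothetical, chosen for illustration.

```python
from fractions import Fraction


def radix4_divide(x, d, n_digits):
    """Generic radix-4 digit-recurrence division (illustrative sketch only).

    Assumes d > 0 and |x| <= (2/3) * d, so that the residual stays bounded
    when digits are drawn from {-2, -1, 0, 1, 2}.
    Returns (q, w, digits) with x == q * d + w / 4**n_digits exactly.
    """
    x, d = Fraction(x), Fraction(d)
    if d <= 0 or abs(x) > Fraction(2, 3) * d:
        raise ValueError("requires d > 0 and |x| <= (2/3) * d")

    w = x                      # residual w[0]
    q = Fraction(0)            # accumulated quotient
    digits = []
    for j in range(1, n_digits + 1):
        shifted = 4 * w        # shift the residual left by one radix-4 digit
        # Idealized selection: nearest integer to shifted / d, clamped to the
        # digit set. Hardware would instead inspect a few leading bits of the
        # residual and divisor; this step is the one on the critical path.
        digit = max(-2, min(2, round(shifted / d)))
        w = shifted - digit * d            # residual update w[j]
        q += Fraction(digit, 4 ** j)       # append the new quotient digit
        digits.append(digit)
    return q, w, digits


if __name__ == "__main__":
    q, w, digits = radix4_divide(Fraction(1, 3), Fraction(3, 4), 8)
    print(digits, float(q))
```

Exact rational arithmetic is used so that the invariant x = q * d + w / 4^n can be checked without rounding error; a hardware unit would keep the residual in redundant (carry-save) form and use an approximate selection function instead.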