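Try different strides. Compare strides that are large powers of two (such as 4096) with nearby strides that are not (such as 4095). Depending on the system, particularly its memory and cache architecture, you may see very different performance, because power-of-two strides often map successive elements to the same cache sets.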
MPICH and some other implementations detect certain kinds of vector datatypes and optimize for them. The MPI_Type_struct form (using MPI_UB) is less likely to be optimized.