In many CUDA kernels, we conventionally write a strided loop like this:
for (int i = thread_idx; i < numel; i += num_threads)
    out[i] = 0;
But in tilelang, the same loop currently has to be written as:
for i in T.serial(0, T.ceildiv(numel - thread_idx, num_threads)):
    j = thread_idx + i * num_threads
    out[j] = 0
Computing the trip count and the index transformation by hand is tedious. It also uses more registers and makes the index computation slower.
So I propose supporting a step (stride) in T.serial.
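
For reference, here is a minimal plain-Python sketch (not tilelang code) of the two loop shapes, to show that the manual rewrite visits exactly the same indices as the strided loop. The helper names `ceildiv`, `strided_indices` and `manual_indices` are made up for this illustration, and `ceildiv` is only a stand-in for T.ceildiv with a positive divisor.

```python
def ceildiv(a, b):
    # Ceiling division for a positive divisor b (hypothetical stand-in for T.ceildiv).
    return -(-a // b)


def strided_indices(thread_idx, numel, num_threads):
    # CUDA-style loop: start at thread_idx and step by num_threads.
    return list(range(thread_idx, numel, num_threads))


def manual_indices(thread_idx, numel, num_threads):
    # Current workaround: unit-step loop plus an explicit index remap.
    trip_count = ceildiv(numel - thread_idx, num_threads)
    return [thread_idx + i * num_threads for i in range(trip_count)]


if __name__ == "__main__":
    # Both forms should list the same indices for thread 1 of 4 over 10 elements.
    print(strided_indices(1, 10, 4))
    print(manual_indices(1, 10, 4))
```

The strided form needs only a start, a stop and a step, which is the shape a stepped T.serial would let users write directly instead of deriving the trip count themselves.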