[Feature request] Support strides for `T.serial` #991

In many CUDA kernels, we conventionally write a strided loop like this:

```CUDA
for (int i = thread_idx; i < numel; i += num_threads)
    out[i] = 0;
```

But in TileLang, the same loop has to be written as:
```python
for i in T.serial(0, T.ceildiv(numel - thread_idx, num_threads)):
    j = thread_idx + i * num_threads
    out[j] = 0
```
It is annoying to compute the loop range and the index transformation by hand, and doing so also uses more registers and slows down index computation.

So I propose supporting strides in `T.serial`.
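
For reference, the rewrite that a strided loop would save users from doing by hand can be checked in plain Python. This is only a sketch of the index arithmetic, not TileLang code: `ceildiv`, `strided_indices`, and `normalized_indices` are hypothetical helper names introduced here for illustration.

```python
def ceildiv(a: int, b: int) -> int:
    """Ceiling division for positive b (hypothetical helper, not a TileLang API)."""
    return -(-a // b)


def strided_indices(start: int, stop: int, step: int) -> list:
    """Indices visited by the CUDA-style loop: for (i = start; i < stop; i += step)."""
    return list(range(start, stop, step))


def normalized_indices(start: int, stop: int, step: int) -> list:
    """Same indices via a unit-stride loop plus a manual index transformation."""
    trip_count = ceildiv(stop - start, step)  # number of iterations
    return [start + i * step for i in range(trip_count)]


# Each simulated thread should visit the same indices under both forms.
numel, num_threads = 10, 4
for thread_idx in range(num_threads):
    strided = strided_indices(thread_idx, numel, num_threads)
    normalized = normalized_indices(thread_idx, numel, num_threads)
    assert strided == normalized
```

Both forms visit the same elements, so the request is about letting the user write the first form directly instead of spelling out the trip count and the `start + i * step` transformation.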
