
[FEA]: cccl.c and cuda.parallel should support indirect_iterator_t which can be advanced on both host and device to support streaming algorithms #4148

oleksandr-pavlyk opened this issue Mar 14, 2025 · 1 comment
Labels: feature request (New feature or request)

oleksandr-pavlyk commented Mar 14, 2025

Is this a duplicate?

Area

cuda.parallel (Python)

Is your feature request related to a problem? Please describe.

To attain optimal performance, kernels for some algorithms must use 32-bit types to store problem-size arguments.

Supporting these algorithms for problem sizes in excess of INT_MAX can be done with a streaming approach, with the streaming logic encoded in the algorithm's dispatcher. The dispatcher needs to increment iterators on the host.
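
For illustration, a minimal sketch of such a streaming dispatcher loop (the names partition_size and run_device_algorithm are hypothetical placeholders, not part of cccl.c):

#include <algorithm>
#include <cstdint>

// Hypothetical streaming loop: the device algorithm only ever sees 32-bit
// partition sizes, while the host advances the iterators between launches.
template <typename InputIt, typename OutputIt>
void dispatch_streaming(InputIt in, OutputIt out, uint64_t num_items)
{
  constexpr uint64_t partition_size = uint64_t{1} << 30; // fits into int32_t
  for (uint64_t processed = 0; processed < num_items;)
  {
    const auto this_partition =
      static_cast<int32_t>(std::min(partition_size, num_items - processed));
    run_device_algorithm(in, out, this_partition); // kernel launch with a 32-bit size
    in += this_partition;  // host-side advance -- this is what is currently missing
    out += this_partition;
    processed += this_partition;
  }
}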

This is presently not supported by cccl.c.parallel, since indirect_arg_t does not implement an increment operator.

Since indirect_arg_t is used to represent cccl_value_t, cccl_operation_t and cccl_iterator_t, and incrementing only makes sense for iterators, a dedicated type indirect_iterator_t must be introduced, which may implement operator+=.

If the entirety of the iterator state is user-defined, cuda.parallel must provide a host function pointer that increments the iterator's state, obtained by compiling the advance function for the host.

If we instead define the state as a struct that contains a size_t linear_id in addition to the user-defined state, we could get rid of the user-defined advance function altogether, but we would need to provide access to linear_id to the dereference function.

These approaches need to be prototyped and compared.

Describe the solution you'd like

The solution should unblock #3764

Additional context

#3764 (comment)

oleksandr-pavlyk (Contributor, Author) commented:

Approach 1: augment the state of the iterator

CCCL_POINTER

It should be noted that for cccl_iterator_kind_t::CCCL_POINTER we can advance the state on the host if the host type representing the type-erased device pointer stores the value size in bytes.

struct indirect_pointer_t {
  void *ptr;          // points at the iterator state, i.e., at the stored device pointer
  size_t value_size;  // size of the pointed-to value type, in bytes

  indirect_pointer_t(cccl_iterator_t &it) : ptr(&it.state), value_size(it.value_type.size) {
    assert(it.type == cccl_iterator_kind_t::CCCL_POINTER);
  }

  indirect_pointer_t& operator+=(uint64_t offset) {
    // advance the stored device pointer by `offset` elements
    char **ptr_ptr = reinterpret_cast<char **>(ptr);
    *ptr_ptr += (offset * value_size);
    return *this;
  }

  void* operator&() const
  {
    return ptr;
  }
};

With no modification needed on the device side, the launcher would only copy the pointer value pointed to by ptr.
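
A hypothetical use from the dispatcher's streaming loop (input_it and partition_size are placeholders; the actual kernel-argument packing in cccl.c may differ):

// Wrap a CCCL_POINTER iterator, advance it on the host, and pass the address
// returned by operator& as the kernel-argument slot holding the device pointer.
indirect_pointer_t in(input_it);
in += partition_size;                      // host-side advance by one partition
void *kernel_args[] = { &in /* operator& yields &it.state */ };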

CCCL_ITERATOR

For CCCL_ITERATOR we need to augment the state of the iterator with a difference_type offset struct member on both the host and the device side.

struct indirect_iterator_t {
    // -----
    // type definitions
    // -----
    void *ptr;
    difference_type *offset_ptr;

    indirect_iterator_t(cccl_iterator_t &it) : ptr() {
      size_t offset_offset = align_up(it.size, sizeof(difference_type));
      size_t combined_nbytes = offset_offset + sizeof(difference_type);
      // allocate zero-initialized memory for the user-defined state followed by the offset
      ptr = calloc(1, combined_nbytes);
      // copy the content of the state from `cccl_iterator_t` into the allocation
      ::memcpy(ptr, it.state, it.size);
      // the offset lives right after the (aligned) user-defined state; calloc set it to zero
      offset_ptr = reinterpret_cast<difference_type *>(static_cast<char *>(ptr) + offset_offset);
    }

    ~indirect_iterator_t() noexcept {
      // deallocate memory
      free(ptr);
    }

    indirect_iterator_t& operator+=(difference_type offset) {
      *offset_ptr += offset;
      return *this;
    }
};

On the device side, the make_input_iterator needs to be modified as suggested by @gevtushenko:

struct __align__(ALIGNMENT) input_iterator_t {
  // type definitions

  __device__ inline value_type operator*() const {
       // apply the host-accumulated offset before calling the user-provided dereference
       const input_iterator_t it = (*this + offset);
       return DEREF(it.data);
  }
  __device__ inline input_iterator_t& operator+=(difference_type diff) {
      ADVANCE(data, diff);
      return *this;
  }
  __device__ inline value_type operator[](difference_type diff) const {
      return *(*this + diff);
  }
  __device__ inline input_iterator_t operator+(difference_type diff) const {
      input_iterator_t result = *this;
      result += diff;
      return result;
  }

   char data[STATE_SIZE];
   int64_t offset;
};

Somehow these two structs have to be combined into a single type that services cccl_iterator_t regardless of the iterator kind it represents.
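
One possible shape of such a combined type (a rough sketch, not a proposal for the final design; the member names and the dispatch on the iterator kind are assumptions):

// Rough sketch of a single host-side wrapper serving both iterator kinds.
// `difference_type` is assumed to be int64_t here.
using difference_type = int64_t;

struct indirect_iterator_arg_t {
  cccl_iterator_kind_t kind;
  void *ptr;                    // CCCL_POINTER: &it.state; CCCL_ITERATOR: host copy of state + offset
  size_t value_size;            // element size in bytes, used for CCCL_POINTER
  difference_type *offset_ptr;  // points into the host copy, used for CCCL_ITERATOR

  indirect_iterator_arg_t& operator+=(difference_type n) {
    if (kind == cccl_iterator_kind_t::CCCL_POINTER) {
      // advance the stored device pointer in place
      char **p = reinterpret_cast<char **>(ptr);
      *p += n * static_cast<difference_type>(value_size);
    } else {
      // accumulate the offset; the device-side operator* applies it lazily
      *offset_ptr += n;
    }
    return *this;
  }

  // the launcher takes the address of the (possibly augmented) state from here
  void* operator&() const { return ptr; }
};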

Approach 2: user-provided host function

We could augment cccl_iterator_t with a function pointer to advance_host_fn(void *, uint64_t offset). This pointer would only be used for cccl_iterator_kind_t::CCCL_ITERATOR.
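
Schematically (the field name host_advance and the struct shown below are assumptions for illustration, not the actual cccl_iterator_t definition):

#include <cstdint>

// Hypothetical shape of the augmented descriptor; the real cccl_iterator_t has more fields.
using host_advance_fn_t = void (*)(void *state, uint64_t offset);

struct cccl_iterator_t_with_host_advance {
  // ... existing fields: size, alignment, type, advance, dereference, value_type, ...
  void *state;
  host_advance_fn_t host_advance; // nullptr for CCCL_POINTER; set by cuda.parallel for CCCL_ITERATOR
};

// In the streaming dispatcher, between partition launches:
//   if (it.host_advance) { it.host_advance(it.state, partition_size); }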

Numba allows us to compile the advance function to obtain a native function pointer:

import numba
import ctypes

def advance(state, incr):
    state[0] = state[0] + incr

# signature: void advance(int64_t *state, int64_t incr)
numba_t = numba.types.CPointer(numba.types.int64)
sig = numba.void(numba_t, numba.int64)

# compile `advance` for the host; `.ctypes` exposes it as a ctypes callable
c_advance_fn = numba.cfunc(sig)(advance)

state_ = ctypes.c_int64(73)
state_ptr = ctypes.pointer(state_)

c_advance_fn.ctypes(state_ptr, ctypes.c_int64(17))
assert state_.value == 73 + 17

c_advance_fn.ctypes(state_ptr, ctypes.c_int64(10))
assert state_.value == 100

# raw function pointer that could be handed to cccl.c
raw_function_ptr = ctypes.cast(c_advance_fn.ctypes, ctypes.c_void_p)
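
The resulting raw_function_ptr (or, equivalently, c_advance_fn.address) could then be stored in the hypothetical host-advance slot sketched above and invoked by the dispatcher between partition launches.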

Approach 3: Make advance function internal

We could make linear_id a mandatory member of the indirect_iterator_t struct, and generate both device and host functions for mutating linear_id in cccl.c.parallel.

Then the dereference and assign functions would need to become like __getitem__ and __setitem__ functions in Python and take linear_id as an argument. So instead of dereference(state) we would call dereference(state, linear_id), and instead of assign(state, val) we would call assign(state, linear_id, val).
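
A rough device-side sketch of what the generated iterator could look like under this approach (the macro names mirror the input_iterator_t sketch above; the exact shape of the generated code is an assumption):

// Sketch: linear_id is part of the generated state; advance only mutates it,
// and the user-provided dereference/assign receive it as an explicit argument.
struct __align__(ALIGNMENT) iterator_t {
  // type definitions

  __device__ inline iterator_t& operator+=(difference_type diff) {
      linear_id += diff;                 // generated internally, no user ADVANCE needed
      return *this;
  }
  __device__ inline value_type load() const {
      return DEREF(data, linear_id);     // dereference(state, linear_id)
  }
  __device__ inline void store(value_type v) {
      ASSIGN(data, linear_id, v);        // assign(state, linear_id, val)
  }

  char data[STATE_SIZE];
  size_t linear_id;
};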
