CUTLASS
CUDA Templates for Linear Algebra Subroutines and Solvers

| File | Description |
| --- | --- |
| aligned_buffer.h | AlignedBuffer is a container for trivially copyable elements suitable for use in unions and shared memory |
| arch.h | Defines tags for architecture-specific configurations |
| array.h | Statically sized array of elements that accommodates all CUTLASS-supported numeric types and is safe to use in a union |
| array_subbyte.h | Statically sized array specialized for sub-byte numeric types and safe to use in a union |
| batched_reduction.h | Implements a software-pipelined efficient batched reduction. D = alpha * Reduction(A) + beta * C |
| batched_reduction_traits.h | Defines structural properties of complete batched reduction. D = alpha * Reduction(A) + beta * C |
| command_line.h | Utilities for parsing command-line arguments |
| complex.h | Defines complex-valued types and basic arithmetic for host and device code |
| conversion_op.h | Functor performing conversion operations used by epilogues |
| coord.h | A Coord is a coordinate of arbitrary rank into a tensor or matrix |
| core_io.h | Helpers for printing cutlass/core objects |
| cutlass.h | Basic include for CUTLASS |
| include/cutlass/util/debug.h | Debugging and logging functionality |
| tools/util/include/cutlass/util/debug.h | Utilities for debugging CUTLASS code |
| default_epilogue_complex_tensor_op.h | Epilogue for threadblock scoped complex GEMMs using Tensor Ops |
| default_epilogue_simt.h | Epilogue for threadblock scoped GEMMs using SIMT |
| default_epilogue_tensor_op.h | Epilogue for threadblock scoped GEMMs using Tensor Ops |
| default_epilogue_volta_tensor_op.h | Epilogue for threadblock scoped GEMMs using Tensor Ops on Volta |
| default_epilogue_wmma_tensor_op.h | Epilogue for threadblock scoped GEMMs using WMMA Tensor Ops |
| default_gemm.h | Default kernel-level GEMM definitions combine threadblock-scoped matrix multiply-add with the appropriate threadblock-scoped epilogue |
| default_gemm_configuration.h | Definitions for GEMM structures |
| default_gemm_splitk_parallel.h | Default kernel-level GEMM definitions combine threadblock-scoped matrix multiply-add with the appropriate threadblock-scoped epilogue |
| default_gemv.h | Default configurations for threadblock-scoped GEMV kernels |
| default_gemv_core.h | Defines basic properties needed by CTA-level batched GEMV assuming expectations about data layout of the global memory fragments, data types, and internal tile sizes |
| default_mma.h | Template for a pipelined GEMM kernel. Does not support batched computation or split-K |
| default_mma_core.h | Defines basic properties needed by CTA-level GEMMs assuming expectations about data layout of the global memory fragments, data types, and internal tile sizes |
| default_mma_core_simt.h | Defines basic properties needed by CTA-level GEMMs assuming expectations about data layout of the global memory fragments, data types, and internal tile sizes |
| default_mma_core_sm50.h | Defines basic properties needed by CTA-level GEMMs assuming expectations about data layout of the global memory fragments, data types, and internal tile sizes |
| default_mma_core_sm70.h | Defines basic properties needed by CTA-level GEMMs assuming expectations about data layout of the global memory fragments, data types, and internal tile sizes |
| default_mma_core_sm75.h | Defines basic properties needed by CTA-level GEMMs assuming expectations about data layout of the global memory fragments, data types, and internal tile sizes |
| default_mma_core_wmma.h | Defines basic properties needed by CTA-level GEMMs assuming expectations about data layout of the global memory fragments, data types, and internal tile sizes |
| default_mma_tensor_op.h | Default warp-level GEMM operators selected by data type, size, and layouts of operands |
| default_mma_wmma_tensor_op.h | Default warp-level GEMM operators selected by data type, size, and layouts of operands |
| default_thread_map_simt.h | Defines the optimal thread map for SIMT accumulator layouts |
| default_thread_map_tensor_op.h | Defines the optimal thread map for TensorOp accumulator layouts |
| default_thread_map_volta_tensor_op.h | Defines the optimal thread map for Volta TensorOp accumulator layouts |
| default_thread_map_wmma_tensor_op.h | Defines the optimal thread map for WMMA TensorOp accumulator layouts |
| device_dump.h | C++ interface to dump fragments and shared memory contents for debugging |
| device_kernel.h | Template for generic CUTLASS kernel |
| device_memory.h | C++ interface to CUDA device memory management functions |
| direct_epilogue_tensor_op.h | Epilogue for tensor operations |
| distribution.h | This header contains a class to parametrize a statistical distribution function |
| epilogue.h | Epilogue for threadblock scoped GEMMs using Tensor Ops |
| epilogue_base.h | Epilogue for threadblock scoped GEMMs using Tensor Ops |
| epilogue_workspace.h | Epilogue for threadblock scoped GEMMs |
| exceptions.h | C++ exception semantics for CUDA error codes |
| fast_math.h | Math utilities |
| fragment_iterator_complex_tensor_op.h | This defines a "fragment" iterator for visiting the fragments of an accumulator tile that participate in one warp-level store operation |
| fragment_iterator_simt.h | This defines a "fragment" iterator for visiting the fragments of an accumulator tile that participate in one warp-level store operation |
| fragment_iterator_tensor_op.h | This defines a "fragment" iterator for visiting the fragments of an accumulator tile that participate in one warp-level store operation |
| fragment_iterator_volta_tensor_op.h | This defines a "fragment" iterator for visiting the fragments of an accumulator tile that participate in one warp-level store operation |
| fragment_iterator_wmma_tensor_op.h | This defines a "fragment" iterator for visiting the fragments of an accumulator tile that participate in one warp-level store operation |
| functional.h | Define basic numeric operators with specializations for Array<T, N>. SIMD-ize where possible |
| include/cutlass/gemm/device/gemm.h | Template for a pipelined GEMM kernel. Does not support batched computation or split-K |
| include/cutlass/gemm/gemm.h | Defines common types used for all GEMM-like operators |
| include/cutlass/gemm/kernel/gemm.h | Template for a pipelined GEMM kernel. Does not support batched computation or split-K |
| tools/util/include/cutlass/util/reference/device/gemm.h | Reference implementation for GEMM in device-side code |
| tools/util/include/cutlass/util/reference/device/kernel/gemm.h | Reference implementation for GEMM in device-side code |
| tools/util/include/cutlass/util/reference/device/thread/gemm.h | Reference implementation for GEMM in device-side code |
| tools/util/include/cutlass/util/reference/host/gemm.h | Reference implementation for GEMM in host-side code |
| device/gemm_batched.h | Template for a pipelined batched GEMM kernel |
| kernel/gemm_batched.h | Template for a pipelined batched GEMM kernel |
| include/cutlass/gemm/device/gemm_complex.h | Template for a pipelined complex-valued GEMM kernel |
| tools/util/include/cutlass/util/reference/host/gemm_complex.h | Reference implementation for complex-valued GEMM in host-side code |
| gemm_pipelined.h | Template for a pipelined GEMM kernel. Does not support batched computation or split-K |
| device/gemm_splitk_parallel.h | Template for GEMM performing a reduction over K partitions in parallel |
| kernel/gemm_splitk_parallel.h | Template for GEMM performing a reduction over K partitions in parallel |
| gemv.h | Template for a threadblock-scoped GEMV kernel |
| gemv_batched_strided.h | Template for a batched GEMV kernel using strided access |
| half.h | Defines a class for using IEEE half-precision floating-point types in host or device code |
| host_reorder.h | Reorder data from the host side |
| host_tensor.h | HostTensor manages allocations in both host and device memory |
| inner_product.h | Reference implementation for inner product in host-side code |
| integer_subbyte.h | Defines a class for using integer types smaller than one byte in host or device code |
| interleaved_epilogue.h | Epilogue for threadblock scoped GEMMs using Tensor Ops |
| kernel_launch.h | Defines structures and helpers to launch CUDA kernels within CUTLASS |
| layout.h | Defines layout functions used by TensorRef and derived classes |
| library.h | CUTLASS Library is an object-oriented approach to managing operations implemented by CUTLASS |
| linear_combination.h | Functor performing linear combination operations used by epilogues |
| linear_combination_clamp.h | Functor performing linear scaling operations used by epilogues. Values are clamped before converting to the output element type |
| linear_combination_relu.h | Functor performing linear combination operations used by epilogues. Values are clamped before converting to the output element type |
| manifest.h | Manifest of CUTLASS Library |
| layout/matrix.h | Defines layout functions used by TensorRef and derived classes |
| thread/matrix.h | Defines a matrix object intended for storing data in registers and operations within a CUDA thread |
| matrix_coord.h | Defines a canonical coordinate for rank=2 matrices offering named indices |
| matrix_shape.h | Defines a Shape template for matrix tiles |
| matrix_traits.h | Defines properties of matrices used to denote layout and operands to GEMM kernels |
| memory.h | Architecture-specific operators on memory |
| memory_sm75.h | Architecture-specific operators on memory added for SM75 |
| arch/mma.h | Templates exposing architecture support for multiply-add operations |
| gemm/thread/mma.h | Templates exposing architecture support for thread-level multiply-add operations |
| gemm/warp/mma.h | Templates exposing architecture support for warp-level multiply-add operations |
| mma_base.h | Template for a double-buffered threadblock-scoped GEMM kernel |
| mma_complex_tensor_op.h | Templates implementing warp-level matrix multiply-accumulate operations targeting Tensor Cores |
| mma_pipelined.h | Template for a double-buffered threadblock-scoped GEMM kernel |
| mma_simt.h | Templates implementing warp-level matrix multiply-accumulate operations |
| mma_simt_policy.h | Describes the lane policy used by warp-level matrix multiply operators targeting SIMT instructions |
| mma_simt_tile_iterator.h | Defines iterators used by warp-level matrix multiply operators targeting SIMT instructions |
| mma_singlestage.h | Template for a double-buffered threadblock-scoped GEMM kernel |
| arch/mma_sm50.h | Matrix multiply for SM50 |
| gemm/thread/mma_sm50.h | Templates exposing architecture support for multiply-add operations |
| arch/mma_sm60.h | Matrix multiply for SM60 |
| gemm/thread/mma_sm60.h | Templates exposing architecture support for multiply-add operations |
| arch/mma_sm61.h | Matrix multiply for SM61 |
| gemm/thread/mma_sm61.h | Templates exposing architecture support for multiply-add operations |
| mma_sm70.h | Matrix multiply for SM70 |
| mma_sm75.h | Matrix multiply for SM75 |
| mma_tensor_op.h | Templates implementing warp-level matrix multiply-accumulate operations targeting Tensor Cores |
| mma_tensor_op_policy.h | Policy describing implementation details of warp-level GEMM targeting Tensor Cores |
| mma_tensor_op_sm70.h | Templates implementing warp-level matrix multiply-accumulate operations targeting Tensor Cores |
| mma_tensor_op_tile_iterator.h | Defines iterators used by warp-level matrix multiply operations targeting Tensor Cores |
| mma_tensor_op_tile_iterator_sm70.h | Defines iterators used by warp-level matrix multiply operations targeting Tensor Cores |
| mma_tensor_op_tile_iterator_wmma.h | Defines iterators used by warp-level matrix multiply operations targeting Tensor Cores |
| mma_tensor_op_wmma.h | Templates implementing warp-level matrix multiply-accumulate operations targeting Tensor Cores |
| numeric_conversion.h | Boost-like numeric conversion operator for CUTLASS numeric types |
| numeric_types.h | Top-level include for all CUTLASS numeric types |
| output_tile_thread_map.h | Metaprogram for determining the mapping of output elements to threads for epilogue tiles |
| pitch_linear.h | Defines layout functions used by TensorRef and derived classes for pitch-linear memory |
| pitch_linear_thread_map.h | Templates implementing how threads are mapped to a given tile |
| platform.h | C++ features that may be otherwise unimplemented for CUDA device functions |
| predicate_vector.h | Defines container classes and iterators for managing a statically sized vector of boolean predicates |
| predicated_tile_access_iterator.h | Templates calculating addresses and predicates for loading tiles from pitch-linear rank=2 tensors |
| predicated_tile_access_iterator_2dthreadtile.h | Templates calculating addresses and predicates for loading tiles from pitch-linear rank=2 tensors |
| epilogue/threadblock/predicated_tile_iterator.h | Tile iterator used by epilogues of threadblock scoped GEMMs to access output tiles in global memory |
| transform/threadblock/predicated_tile_iterator.h | Templates implementing loading of tiles from pitch-linear rank=2 tensors |
| predicated_tile_iterator_2dthreadtile.h | Templates implementing loading of tiles from pitch-linear rank=2 tensors |
| real.h | Helpers for treating real-valued types uniformly with complex-valued types |
| reduce.h | Defines basic thread level reduction with specializations for Array<T, N> |
| reduce_split_k.h | Kernel performing a reduction over densely packed tensors in global memory |
| reduction_op.h | Functor performing reduction operations used by epilogues |
| reduction_operators.h | Kernel performing a reduction over densely packed tensors in global memory |
| regular_tile_access_iterator.h | Templates computing addresses for storing tiles to pitch-linear rank=2 tensors |
| regular_tile_access_iterator_pitch_linear.h | Templates computing addresses for storing tiles to pitch-linear rank=2 tensors |
| regular_tile_access_iterator_tensor_op.h | Templates computing addresses for storing tiles to pitch-linear rank=2 tensors |
| regular_tile_iterator.h | Templates implementing storing of tiles from pitch-linear rank=2 tensors |
| regular_tile_iterator_pitch_linear.h | Templates implementing loading of tiles from pitch-linear rank=2 tensors |
| regular_tile_iterator_pitch_linear_2dthreadtile.h | Templates implementing loading of tiles from pitch-linear rank=2 tensors |
| regular_tile_iterator_tensor_op.h | Templates implementing storing of tiles from pitch-linear rank=2 tensors |
| regular_tile_iterator_tensor_op_sm70.h | Templates implementing loading of tiles from pitch-linear rank=2 tensors |
| relatively_equal.h | Defines functions for comparing values within a relative-error tolerance |
| semaphore.h | Implementation of a CTA-wide semaphore for inter-CTA synchronization |
| shared_load_iterator.h | Iterator for loading accumulator tiles from shared memory in epilogues of threadblock scoped GEMMs |
| simd.h | Templates exposing SIMD operators |
| simd_sm60.h | Templates exposing SIMD operators for SM60 |
| simd_sm61.h | Templates exposing SIMD operators for SM61 |
| simt_policy.h | Defines basic structures needed for implementing the warp-scoped phase of the epilogue. These quantities assume a 'column-major' arrangement of SimtOp instructions, of which a row-oriented slice is visible per iteration |
| subbyte_reference.h | Provides a mechanism for packing and unpacking elements smaller than one byte |
| tensor.h | Defines layout functions used by TensorRef and derived classes for common 4-D and 5-D tensor formats |
| device/tensor_compare.h | Functions for comparing tensors element-wise in device-side code |
| host/tensor_compare.h | Functions for comparing tensors element-wise in host-side code |
| tensor_coord.h | Defines a canonical coordinate for rank=4 tensors offering named indices |
| tensor_copy.h | Functions for copying tensors element-wise in host-side code |
| device/kernel/tensor_elementwise.h | Kernels for element-wise tensor operations in device-side code |
| host/tensor_elementwise.h | Functions for element-wise tensor operations in host-side code |
| device/tensor_fill.h | Functions for filling tensors with constant, random, or procedurally generated data in device-side code |
| host/tensor_fill.h | Functions for filling tensors with constant, random, or procedurally generated data in host-side code |
| device/kernel/tensor_foreach.h | Kernels applying a functor to each element of a tensor |
| device/tensor_foreach.h | Launches kernels that apply a functor to each element of a tensor |
| host/tensor_foreach.h | Applies a functor to each element of a tensor in host-side code |
| tensor_norm.h | Computes the norm of a tensor in host-side code |
| tensor_op_multiplicand_sm70.h | Defines layout functions for Tensor Core multiplicand operands targeting SM70 |
| tensor_op_multiplicand_sm75.h | Defines layout functions for Tensor Core multiplicand operands targeting SM75 |
| tensor_op_policy.h | Defines basic structures needed for implementing the warp-scoped phase of the epilogue. These quantities assume a 'column-major' arrangement of TensorOp instructions, of which a row-oriented slice is visible per iteration |
| tensor_ref.h | Defines a structure containing strides, bounds, and a pointer to tensor data |
| tensor_view.h | Defines a structure containing strides and a pointer to tensor data |
| tensor_view_io.h | Output stream operators for TensorView objects |
| gemm/threadblock/threadblock_swizzle.h | Implements several possible threadblock-swizzling functions mapping blockIdx to GEMM problems |
| reduction/threadblock_swizzle.h | Defines functors for mapping blockIdx to partitions of the batched reduction computation |
| tile_iterator_simt.h | Warp-level tile iterators used by epilogues of GEMMs using SIMT instructions |
| tile_iterator_tensor_op.h | Warp-level tile iterators used by epilogues of GEMMs using Tensor Ops |
| tile_iterator_volta_tensor_op.h | Warp-level tile iterators used by epilogues of GEMMs using Tensor Ops on Volta |
| tile_iterator_wmma_tensor_op.h | Warp-level tile iterators used by epilogues of GEMMs using WMMA Tensor Ops |
| transpose.h | Basic copy routines for tensor views |
| type_traits.h | Type traits for common CUDA types |
| vector.h | Defines layout functions used for rank=1 vectors |
| volta_tensor_op_policy.h | Defines basic structures needed for implementing the warp-scoped phase of the epilogue. These quantities assume a 'column-major' arrangement of TensorOp instructions, of which a row-oriented slice is visible per iteration |
| wmma.h | Templates exposing architecture support for warp matrix multiply-add (WMMA) operations |
| wmma_array.h | Statically sized array of elements that accommodates all CUTLASS-supported numeric types and is safe to use in a union |
| wmma_ptx.h | Templates exposing warp matrix multiply-add (WMMA) operations |
| wmma_sm70.h | WMMA matrix multiply for SM70 |
| wmma_sm72.h | WMMA matrix multiply for SM72 |
| wmma_sm75.h | WMMA matrix multiply for SM75 |
| wmma_tensor_op_policy.h | Defines basic structures needed for implementing the warp-scoped phase of the epilogue. These quantities assume a 'column-major' arrangement of TensorOp instructions, of which a row-oriented slice is visible per iteration |