Using images acquired by different satellite sensors has been shown to improve classification performance in the framework of crop mapping from satellite image time series (SITS). Existing state-of-the-art architectures use self-attention mechanisms to process the temporal dimension and convolutions for the spatial dimension of SITS. Motivated by the success of purely attention-based architectures in crop mapping from single-modal SITS, we introduce several multi-modal multi-temporal transformer-based architectures. Specifically, we investigate the effectiveness of Early Fusion, Cross Attention Fusion and Synchronized Class Token Fusion within the Temporo-Spatial Vision Transformer (TSViT). Experimental results demonstrate significant improvements over state-of-the-art architectures with both convolutional and self-attention components.

Index Terms— Multi-modal fusion, time series classification, crop mapping, transformers, remote sensing.