From the code accompanying Roland Memisevic's paper "Gradient-based learning of higher-order image features," I've diagrammed the structure of the relational autoencoder. The input consists of corrupted samples from two sources, X and Y. These are mapped to a hidden layer via a 3rd-order tensor W, which is factored into three matrices (wxf, wyf, and whf_in in the code): each input is projected onto the factors, the two factor responses are multiplied elementwise, and the product is mapped to the hiddens. On the decoding side, the hidden activations are projected back onto the factors and combined multiplicatively with the factor responses of one input; the reconstruction of the other input is the dot product of this multiplicative activation with the transpose of that input's factor weights. The output nonlinearity depends on the data type (e.g. sigmoid for binary data, linear for real-valued data).
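The encode/decode path described above can be sketched in NumPy. This is a minimal sketch, not the paper's implementation: the dimensions, random initialization, and helper names (`encode`, `reconstruct_y`) are assumptions for illustration; only the factor-weight names `wxf`, `wyf`, `whf_in` follow the naming in the paper's code, and I use a separate `whf_out` for the decoding side.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Hypothetical sizes: D input dimensions, F factors, H hidden (mapping) units.
D, F, H = 16, 8, 4

# Factored weights standing in for the full 3rd-order tensor W (D x D x H):
wxf = 0.1 * rng.standard_normal((D, F))      # input X -> factors
wyf = 0.1 * rng.standard_normal((D, F))      # input Y -> factors
whf_in = 0.1 * rng.standard_normal((H, F))   # factors -> hiddens (encoding)
whf_out = 0.1 * rng.standard_normal((H, F))  # hiddens -> factors (decoding)

def encode(x, y):
    """Hidden activations from the multiplicative (gated) interaction of x and y."""
    fx = x @ wxf                 # factor responses of x
    fy = y @ wyf                 # factor responses of y
    return sigmoid((fx * fy) @ whf_in.T)  # elementwise product, then map to hiddens

def reconstruct_y(x, h):
    """Reconstruct y given x and the hiddens (linear output for real-valued data)."""
    fx = x @ wxf                 # factor responses of x
    fh = h @ whf_out             # factor responses of the hiddens
    return (fx * fh) @ wyf.T     # multiplicative activation times wyf transposed

x = rng.standard_normal(D)
y = rng.standard_normal(D)
h = encode(x, y)
y_hat = reconstruct_y(x, h)
```

Reconstructing x instead of y is symmetric: swap the roles of `wxf` and `wyf` and gate the hiddens with the factor responses of y.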
 R. Memisevic, “Gradient-based learning of higher-order image features,” Proc. IEEE Int. Conf. Comput. Vis., pp. 1591–1598, Nov. 2011.