If you don't plan to compare visual to auditory conditions with a
contrast, then the combined model is inappropriate. Each modality (visual/auditory) should be
tested in its own model.
If you were able to set up the model (I wasn't aware that flexible factorial could take 4 factors; previously it was limited to three), then you can follow the approach used here:
Complex example:
This is for a design with 18 subjects in group 1, 9 subjects in group
2, 2 group terms and 2 conditions. Start with the simplest element,
a single subject in a single condition, build its contrast, repeat for
all subjects and conditions, and then combine the ones you want.
S1G1C1=[1 zeros(1,26) 1 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0]
S1G1C2=[1 zeros(1,26) 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0]
....
Now average G1C1 across subjects by summing the single-subject
contrasts and dividing by the number of subjects. You'd get
G1C1=[ones(1,18)/18 zeros(1,9) 1 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0]
and
G1C2=[ones(1,18)/18 zeros(1,9) 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0]
and
G2C1=[zeros(1,18) ones(1,9)/9 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0]
and
G2C2=[zeros(1,18) ones(1,9)/9 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0]
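The averaging step can be checked with a short script. This is only a sketch in plain Python (not SPM code), assuming the column layout of the example above: 27 subject columns followed by 23 further columns. The helper names (subject_condition, average) are made up for illustration, and the positions in the trailing block are simply copied from the G1C1 vector.

```python
from fractions import Fraction

N_G1, N_G2 = 18, 9        # subjects per group, as in the example
N_SUBJ = N_G1 + N_G2      # 27 subject columns
N_REST = 23               # columns after the subject block (assumed from the example)

def subject_condition(subj, tail_ones):
    """Contrast for one subject in one cell (hypothetical helper, 0-based indices)."""
    subj_part = [Fraction(int(i == subj)) for i in range(N_SUBJ)]
    tail = [Fraction(int(j in tail_ones)) for j in range(N_REST)]
    return subj_part + tail

def average(vectors):
    """Column-wise mean of a list of equal-length contrast vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

# Trailing columns set for G1C1 in the example: 1st, 3rd and 10th (0-based 0, 2, 9).
g1c1 = average([subject_condition(s, {0, 2, 9}) for s in range(N_G1)])

print(g1c1[:N_G1])    # weights on the group 1 subject columns
print(g1c1[N_SUBJ:])  # trailing block
```

Averaging leaves the trailing block unchanged and spreads a weight of 1/18 over the group 1 subject columns, which is what ones(1,18)/18 expresses.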
Now subtract G1C1-G1C2 and G2C1-G2C2:
G1C1-G1C2=[zeros(1,27) 0 0 1 -1 0 0 0 0 0 1 -1 0 0 0 0 0 0 0 0 0 0 0 0]
and
G2C1-G2C2=[zeros(1,27) 0 0 1 -1 0 0 0 0 0 0 0 0 0 0 0 0 1 -1 0 0 0 0 0]
Now subtract these two:
Interaction contrast=[zeros(1,27) 0 0 0 0 0 0 0 0 0 1 -1 0 0 0 0 0 -1 1 0 0 0 0 0]
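To double-check the final vector, the difference of differences can be recomputed from the four group-by-condition contrasts. This is a sketch in plain Python rather than MATLAB, with the vectors copied from the example; vec, tail and sub are hypothetical helpers, and the trailing positions are 1-based.

```python
from fractions import Fraction as F

def tail(ones):
    """23-element trailing block with 1s at the given 1-based positions."""
    return [F(1) if j in ones else F(0) for j in range(1, 24)]

def vec(g1_weight, g2_weight, trailing):
    """18 group 1 subject columns, 9 group 2 subject columns, then the trailing block."""
    return [g1_weight] * 18 + [g2_weight] * 9 + trailing

def sub(a, b):
    """Element-wise difference a - b."""
    return [x - y for x, y in zip(a, b)]

# The four averaged contrasts, as listed in the example.
G1C1 = vec(F(1, 18), F(0), tail({1, 3, 10}))
G1C2 = vec(F(1, 18), F(0), tail({1, 4, 11}))
G2C1 = vec(F(0), F(1, 9), tail({2, 3, 17}))
G2C2 = vec(F(0), F(1, 9), tail({2, 4, 18}))

# (G1C1 - G1C2) - (G2C1 - G2C2)
interaction = sub(sub(G1C1, G1C2), sub(G2C1, G2C2))

# 1-based positions of the nonzero weights within the trailing block.
nonzero = {j + 1: int(w) for j, w in enumerate(interaction[27:]) if w != 0}
print(nonzero)
```

The subject, group and condition columns all cancel, so only the four cell columns carry weight and the contrast sums to zero.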