Dear Joelle,
> By the constant term you mean the last column in the design matrix right? And the task regressor you mean the first column in the design matrix (ie the task/rest regressor)?
Yes. However, we usually just call it "task", not "task/rest". In your case the boxcar function (on / off) indeed corresponds to task / rest, but if there are some additional task regressors this is no longer the case.
> Why do long blocks and short rest periods create a high correlation between the two?
The task predictor usually has some "ups" and "downs" over time. In case of very long blocks there's the usual initial dip at the beginning of the block, followed by the usual overshoot, followed by a very long plateau, and only at the end of the block the usual drop, undershoot, and return to "baseline". With very long blocks combined with short rest periods in between, the predictor has a positive value most of the time (that of the plateau phase), which leads to multicollinearity https://en.wikipedia.org/wiki/Multicollinearity , as the constant term also has a fixed value most of the time (actually all the time).
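To put a rough number on this, here is a minimal sketch. It uses a plain, unconvolved boxcar (my simplification, SPM would convolve it with the HRF first) and the made-up helper names cosine_with_constant and boxcar. As the constant term has no variance, the sketch reports the cosine of the angle between the two columns instead of a correlation; a value close to 1 means the task column is nearly parallel to the constant.

```python
import math

def cosine_with_constant(regressor):
    """Cosine of the angle between a regressor and a column of ones."""
    n = len(regressor)
    dot = sum(regressor)  # dot product with the all-ones column
    norm = math.sqrt(sum(v * v for v in regressor))
    return dot / (norm * math.sqrt(n))

def boxcar(on, off, cycles):
    """Unconvolved boxcar: `on` scans of 1 followed by `off` scans of 0."""
    return ([1.0] * on + [0.0] * off) * cycles

balanced = boxcar(on=20, off=20, cycles=6)     # equal task and rest
long_blocks = boxcar(on=40, off=5, cycles=6)   # long task, short rest

print(round(cosine_with_constant(balanced), 3))     # 0.707
print(round(cosine_with_constant(long_blocks), 3))  # 0.943
```

The longer the "on" phase relative to the "off" phase, the closer the value gets to 1.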
> Why actually would you expect the signal in long blocks to extend to the low frequency range?
A time course can be decomposed into a set of sine waves with certain frequencies and amplitudes via the Fourier transform. Instead of the time domain of the signal you can also look at the corresponding frequency domain as illustrated in https://en.wikipedia.org/wiki/Frequency_domain#/media/File:Fourier_transform_time_and_frequency_domains_%28small%29.gif . For different time courses, different frequencies (or certain ranges) will contribute to a different extent; for some examples see http://www.sbirc.ed.ac.uk/cyril/fMRI3.html . However, there's not just signal in the data but also noise, e.g. slow scanner drifts. That's why we usually go with a high-pass filter with a cutoff of something like 1/128 Hz to remove the slow frequencies, as there's usually not much signal in that range. This becomes problematic of course if the signal mainly consists of slow frequencies.
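As a back-of-the-envelope check, one can compare the fundamental frequency of a block design (one full on/off cycle) with the 1/128 Hz cutoff. This is only a sketch: the function names are made up, the block lengths are arbitrary examples, and it looks at the fundamental frequency alone, ignoring harmonics and the exact shape of the filter.

```python
HIGH_PASS_CUTOFF_S = 128.0  # cutoff period in seconds, i.e. 1/128 Hz

def fundamental_frequency(on_s, off_s):
    """Frequency (Hz) of one full on/off cycle of a block design."""
    return 1.0 / (on_s + off_s)

def above_cutoff(on_s, off_s, cutoff_s=HIGH_PASS_CUTOFF_S):
    """True if the design's fundamental frequency lies above the cutoff."""
    return fundamental_frequency(on_s, off_s) > 1.0 / cutoff_s

# (on, off) durations in seconds, arbitrary examples
for on_s, off_s in [(20, 20), (60, 60), (120, 30)]:
    freq = fundamental_frequency(on_s, off_s)
    print(on_s, off_s, round(freq, 5), above_cutoff(on_s, off_s))
```

With 120 s blocks and 30 s rest the cycle takes 150 s, so the main task frequency falls below the cutoff and would largely be filtered out together with the drifts.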
> why is the corresponding regressor close to a flat line?
Because the BOLD responses to each of the densely packed stimuli / Go trials add up, so the summed BOLD response doesn't return to "baseline" level.
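A toy illustration of this summation, assuming a crude made-up response shape (NOT the canonical HRF) and one sample per second; the names convolve, kernel, dense and sparse are just for this sketch:

```python
def convolve(onsets, kernel, n):
    """Add one copy of `kernel` at every time point where `onsets` is 1."""
    out = [0.0] * n
    for t, stim in enumerate(onsets):
        if stim:
            for k, h in enumerate(kernel):
                if t + k < n:
                    out[t + k] += h
    return out

# Crude, made-up response lasting 8 samples (not the canonical HRF)
kernel = [0.0, 0.3, 0.8, 1.0, 0.7, 0.3, 0.1, 0.0]
n = 120

dense = [1] * n                                        # a stimulus every sample
sparse = [1 if t % 30 == 0 else 0 for t in range(n)]   # a stimulus every 30 samples

def peak_to_peak(values):
    return max(values) - min(values)

# Ignore the first samples, where the dense response is still building up
steady = slice(20, 100)
print(peak_to_peak(convolve(dense, kernel, n)[steady]))
print(peak_to_peak(convolve(sparse, kernel, n)[steady]))
```

For the dense sequence the overlapping responses sum to a constant level, so the regressor shows essentially no modulation, while the sparse sequence keeps its full ups and downs.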
> we would actually be looking at activation related to task versus rest.
Except with a separate explicit regressor "rest" there's never really a contrast relative to rest, and if you wanted to compare to the constant term you would need a differential contrast like [1 -1]. A simple T contrast [1 0] tests whether the linear combination 1 * beta estimate 1 + 0 * beta estimate 2 is larger than zero, and zero really means the value 0. Think of a simple linear regression y = ax + b with the slope a and the intercept b. b could be anywhere. In terms of fMRI data we are not interested in b (beta zero / constant term), as there's just some average signal in the image unrelated to BOLD effects due to our manipulations. We also don't care about whether a is larger than b (asking whether the slope is larger than the intercept doesn't make much sense in most linear regressions). Instead we want to know whether a is significantly different from 0. In case of a simple linear regression this results in a certain regression line y1 = a1 * x + b differing from a horizontal line y2 = 0 * x + b = b. In case of fMRI data this results in a certain time course (estimate multiplied by predictor time course) differing from a flat line as reflected by the beta zero. This is what we mean with "relative to implicit baseline the task abc led to pos. activations in regions xyz". It does NOT mean that the value of beta estimate 1 differs from the value of beta estimate 2 or that we tested for something related.
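In code, a contrast is nothing but a weighted sum of the beta estimates. A minimal sketch with made-up beta values (the function name contrast_value is also just for illustration):

```python
def contrast_value(weights, betas):
    """Linear combination of beta estimates defined by a contrast vector."""
    return sum(w * b for w, b in zip(weights, betas))

# Hypothetical estimates: [task beta, constant term]
betas = [2.0, 500.0]

print(contrast_value([1, 0], betas))   # 2.0    -> task beta tested against 0
print(contrast_value([1, -1], betas))  # -498.0 -> task beta minus constant term
```

The first value is what the usual "task vs. implicit baseline" contrast tests against 0; the second one depends on the arbitrary mean image intensity and is not something we would want to test.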
> I guess this contrast is (A + B)/2 = 0.
> When we say that the mean (of the first five and the second five) doesn't differ from 0, are we saying that the H0 is: combined activity in
> the first five and the second five is not actually any different than baseline? Or what does 0 in this case really mean?
In statistical terms, we test whether the linear combination differs from value 0, in this case 0.5 * A + 0.5 * B, which corresponds to the arithmetic mean. If this is the case, then the combined time course differs from a flat line with y = certain value (in practice not the case as the constant term is not constant after prewhitening, but this is another detail).
> What hypothesis would A+B= the sum be testing? What is meant by sum really? What contrast would examine this?
As we test against 0 the weighting has no impact on the statistics (p/T values) as long as the scaling is consistent, be it [1 1] or [sqrt(pi) sqrt(pi)]. It does matter when we extract parameter estimates from the beta or con images, as the images do not store the weighting factors of the contrast vectors. Thus it is up to the user to check whether the values as stored in the images are already meaningful (difference between A and B) or whether they have to be divided by sqrt(pi) to result in a meaningful value.
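A small numerical sketch of this point, using the usual T = c'b / sqrt(sigma2 * c' (X'X)^-1 c). All numbers (betas, the matrix standing in for (X'X)^-1, the residual variance) are invented for illustration, and t_statistic is a made-up helper:

```python
import math

def t_statistic(c, betas, xtx_inv, sigma2):
    """Return (contrast value, T value) for contrast vector c."""
    con = sum(ci * bi for ci, bi in zip(c, betas))
    quad = sum(c[i] * xtx_inv[i][j] * c[j]
               for i in range(len(c)) for j in range(len(c)))
    return con, con / math.sqrt(sigma2 * quad)

# Hypothetical values, not taken from any real model
betas = [3.0, 1.0]
xtx_inv = [[0.02, 0.005],
           [0.005, 0.02]]
sigma2 = 4.0

k = math.sqrt(math.pi)
con_unit, t_unit = t_statistic([1, 1], betas, xtx_inv, sigma2)
con_scaled, t_scaled = t_statistic([k, k], betas, xtx_inv, sigma2)

print(con_unit, con_scaled)  # contrast values differ by the factor sqrt(pi)
print(t_unit, t_scaled)      # T values agree (up to rounding)
```

The scaling factor appears in both numerator and denominator of T and cancels, but it stays in the contrast value, which is what ends up in the con image.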
Christophe has given a meaningful example @ https://www.jiscmail.ac.uk/cgibin/webadmin?A2=spm;536e4d9a.1509 that contrasts the sum of two conditions against a third instead of contrasting the average of two conditions against a third. In most instances you don't want to do so though; either you want to look at simple effects (with the weighting factors adding up to 1 or -1) or differential effects (weighting factors adding up to 0, but note that not every differential contrast whose weighting factors sum up to 0 is meaningful).
> Other simple effects (A, -A, B, -B) can be determined based on these two contrasts
It's just different coding. Instead of two predictors A, B based on individual onsets, durations we can go with a design matrix based on two predictors C and D that reflect (A+B)/2 and A-B (thus, C is the average time course of predictors A and B, and D is the difference). We obtain some beta estimate for C and D, say 10 and 5.
(I): (A + B)/2 = 10
(II): A - B = 5
Now rephrase
(II'): B = A - 5
Insert into (I)
(I'): (A + A - 5)/2 = 10
A - 5/2 = 10
A = 12.5, B = A - 5 = 7.5
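The same substitution in general form gives A = C + D/2 and B = C - D/2, where C is the estimate for the mean and D the estimate for the difference. A tiny sketch (the function name is made up) that reproduces the numbers above:

```python
def conditions_from_mean_and_difference(mean_ab, diff_ab):
    """Recover A and B from (A + B)/2 and A - B."""
    a = mean_ab + diff_ab / 2
    b = mean_ab - diff_ab / 2
    return a, b

print(conditions_from_mean_and_difference(10, 5))  # (12.5, 7.5)
```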
Predictors like C and D are uncommon nowadays, but in the early days there were design matrices like that (coding effects A+B, A-B, instead of factor levels or conditions like A, B). Actually you can set up predictors like A-B via the GUI, as it is possible to specify negative durations, but to have this work properly (the response to a negative duration being symmetrical to the ordinary BOLD response) you need some more adjustments, so forget about it ;)
> Where is the group x condition interaction being tested?
A group x condition interaction corresponds to a group difference with regard to a condition difference. Thus, if you have set up differential contrasts on single-subject level like [1 -1] you can take these con images and forward them into a second-level two-sample t-test.
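For a single voxel this boils down to something like the following sketch. The per-subject values and group names are invented, and the pooled-variance formula assumes equal variances in both groups, which is a simplification and need not match the variance settings you choose in SPM:

```python
import math
from statistics import mean, variance

def two_sample_t(x, y):
    """Pooled-variance two-sample t statistic (equal variances assumed)."""
    nx, ny = len(x), len(y)
    pooled = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    se = math.sqrt(pooled * (1 / nx + 1 / ny))
    return (mean(x) - mean(y)) / se

# Hypothetical first-level contrast values (A minus B), one per subject
group1 = [1.2, 0.8, 1.5, 1.1, 0.9]
group2 = [0.3, 0.5, 0.1, 0.4, 0.2]

t_interaction = two_sample_t(group1, group2)
print(t_interaction)  # positive: the A minus B difference is larger in group1
```

Each number already is a condition difference, so a group difference in these numbers is the group x condition interaction.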
Best
Helmut
