Clinical Applicability of the Test-retest Reliability of qEEG Coherence

Measurement reliability is an important aspect of establishing the utility of scores used in clinical practice. Although much is known about the reliability of quantitative electroencephalographic (qEEG) metrics related to absolute power, less is known about the reliability of coherence metrics. The current study examined the measurement reliability of coherence metrics across standard frequency bands during an eyes-closed resting state. Reliability was examined both within channel pairs, and averaged across spatially contiguous channels, to summarize global patterns. We found that while most channel pairs were highly reliable on average, there was substantial variability across channels. Finally, we estimated the effect of measurement reliability on the detection of treatment-related neural change. We concluded that estimates of reliability for treated channels are crucial, and should factor into clinical assessment of treatment efficacy for EEG biofeedback (neurofeedback), especially in cases where large cross-channel variability is present.


Introduction
Technological advances in basic measures of electroencephalographic (EEG) recordings have led to a significantly expanded range of quantitative metrics of brain functioning.
McEvoy and colleagues (2000) also investigated test-retest reliability during cognitive tasks. They found that task-related reliability was higher (i.e., r > .9 for working memory tasks, r > .8 for psychomotor vigilance tasks) than that at rest (mean r > .7 across 4 resting state recordings). However, mean r remained ≥ .80 for theta and alpha regardless of condition.
Another study (Corsi-Cabrera et al., 2007) examined within-subject variability and inter-session stability of EEG power in women over time, and found coefficients of r = .92 to r = .98 for absolute power. Gudmundsson, Runarsson, Sigurdsson, Eiriksdottir, and Johnsen (2007) investigated the effects of montage selection and length of the raw data epochs on test-retest reliability and similarly found that most of the frequency bands had reliability coefficients of r ≥ .80. Finally, Thatcher (2010) reported that test-retest reliability of qEEG is both high and stable with small sample sizes. He claimed that even as little as a 20-s epoch results in r ≈ .80, and suggested that test-retest reliability follows an exponential function, such that as the size of the sample of raw EEG data increases, so too does the reliability coefficient (i.e., 20 s, r ≈ .80; 40 s, r ≈ .90; 60 s, r ≈ .95).
Although research has found uniformly high reliabilities in absolute power, variations in reliability have also been found depending on spectral band and electrode location.
For example, Gasser, Bächer, and Steinberg (1985) studied test-retest reliability of both relative and absolute power. While they found mean reliabilities ranging from r = .47 to r = .80 and r = .58 to r = .80 for relative and absolute power, respectively, reliability in the alpha band was consistently the highest, with mean r = .80 for both. Salinsky, Oken, and Morehead (1991) also studied relative and absolute power, and using a 5-min test-retest interval, they found reliability coefficients ≥ .90, with a median r = .93 across all frequency bands. Additionally, Salinsky et al. found that this remained relatively stable over time.
Although numerous studies have investigated a variety of aspects of absolute power reliability, much less is known about the reliability of qEEG coherence. Though the term "coherence" can be used to describe comodulation, here we will refer to it as in Thatcher's conception, that it is "a measure of the variability of time differences between two time series in a specific frequency band" (Thatcher, 2012). In this view, signals with complete phase-locking will display coherence values of 1.0, with a full absence of phase-locking representing a value of 0, and the magnitude of coherence representing the degree of functional association between two signals (e.g., brain regions). Currently, reliability research is mixed, with some studies suggesting that coherence is a relatively stable measure of qEEG (Cannon et al., 2012; Chabot et al., 1996; Corsi-Cabrera et al., 2007; Corsi-Cabrera, Solís-Ortiz, & Guevara, 1997; John, 1977; Thatcher, Krause, & Hrybyk, 1986; Thatcher, Walker, Biver, North, & Curtin, 2003), and other studies finding it to be one of the least reliable measures (Gudmundsson et al., 2007). There is some evidence that coherence tends to be higher in the right hemisphere in comparison to the left hemisphere (Gootjes, Bouma, Van Strien, Scheltens, & Stam, 2008; Miskovic, Schmidt, Boyle, & Saigal, 2009; Tucker, Roth, & Bair, 1986). Additionally, previous studies have found a variety of gender differences in coherence (e.g., higher intrahemispheric connectivity for males, differential patterns of local coherence changes after photic stimulation or completion of cognitive tasks), with some suggesting that this is due to differences in lateralized brain organization between the sexes (e.g., Gootjes et al., 2008; Koles, Lind, & Flor-Henry, 2010; Rappelsberger & Petsche, 1988; Shaywitz et al., 1995; Volf & Razumnikova, 1999; Voyer, Voyer, & Bryden, 1995; Wada et al., 1996). However, many of these results have been found during cognitive tasks (i.e., verbal and/or spatial tasks), rather than during resting state.
Coherence has been linked to a number of cognitive processes (Thatcher & Lubar, 2009) and sensorimotor tasks (Minc et al., 2010; Silva et al., 2012) as well as neuropsychiatric disorders, such as attention deficit hyperactivity disorder (Murias, Swanson, & Srinivasan, 2007), anxiety disorders (Velikova et al., 2010), and depression (Leuchter, Cook, Hunter, Cai, & Horvath, 2012). As such, understanding the reliability and validity of this metric is of utmost importance as the use of EEG increases in the treatment of these disorders.

Clinical Implications of Measurement Reliability
Understanding the measurement reliability of coherence is important for several reasons. First, the utility of qEEG coherence is directly related to its reliability.
Indeed, few would support using unreliable measures for making important clinical decisions concerning the care and treatment of individuals with various disorders.
Second, as coherence is often targeted as an outcome measure in neurofeedback treatment (e.g., Friedrich et al., 2014; Gruzelier, 2014; Keizer, Verment, & Hommel, 2010), it is important to establish objective parameters for determining whether treatment has led to a change in brain functioning. Finally, the amount of change needed to determine a meaningful clinical difference as a result of treatment is also directly related to the reliability of the measures used (e.g., Evans, Margison, & Barkham, 1998; Jacobson & Truax, 1991). Specifically, less reliable measures require greater change for demonstrating clinical effects, whereas more reliable measures are more powerful for detecting differences. The Reliability of Change (RC) index provides a formal association between measurement reliability and clinical outcomes. For example, the reliable change definition provided by Jacobson and Truax (1991) formulates whether a client has made clinically significant change. The following equation was used in this study to calculate the reliable change (RC) metric:

$$RC = \frac{X_1 - X_2}{S_{diff}} \tag{1}$$

As indicated by the formula, reliable change is determined by the measured difference of functioning at two time points ($X_1 - X_2$) divided by the standard error of the difference ($S_{diff}$). The $S_{diff}$ represents the variability in the difference between the two time points as a result of measurement error alone (Christensen & Mendoza, 1986). The $S_{diff}$ characterizes variability of the measure through the use of the test-retest reliability coefficient ($r_{xx}$) and the standard deviation of the pre-test score ($s_1$) using the following formula (see Jacobson & Truax, 1991 for further computational details):

$$S_{diff} = \sqrt{2\left(s_1\sqrt{1 - r_{xx}}\right)^2} \tag{2}$$

Thus, the RC metric can be interpreted similarly to a z-score, in which absolute values larger than 1.96 are unlikely to occur by chance (p < .05, two-tailed) if actual change is not present. As an important caveat, the reliability estimate used in the equation should provide an accurate gauge of measurement error related to the measurement instrument. Consequently, test-retest estimates should be based on relatively small intervals of time to ensure the change in scores is not due to a change in the underlying construct being tested.
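The reliable change computation described above can be sketched in a few lines of Python. This is an illustrative sketch rather than the code used in the study: the function names and the example client values (a pre-test z-score of -2.0, a post-test z-score of -0.5, and a reliability of .90) are hypothetical, and the sign convention follows the text ($X_1 - X_2$).

```python
import math


def standard_error_of_difference(s1: float, r_xx: float) -> float:
    """S_diff = sqrt(2 * S_E**2), where S_E = s1 * sqrt(1 - r_xx)."""
    if not 0.0 <= r_xx <= 1.0:
        raise ValueError("r_xx must lie in [0, 1]")
    s_e = s1 * math.sqrt(1.0 - r_xx)  # standard error of measurement
    return math.sqrt(2.0 * s_e ** 2)


def reliable_change(x1: float, x2: float, s1: float, r_xx: float) -> float:
    """RC = (x1 - x2) / S_diff, using the sign convention of the text."""
    s_diff = standard_error_of_difference(s1, r_xx)
    if s_diff == 0.0:
        raise ValueError("RC is undefined when r_xx is exactly 1")
    return (x1 - x2) / s_diff


# Hypothetical client (values are assumptions, not data from the study).
rc = reliable_change(x1=-2.0, x2=-0.5, s1=1.0, r_xx=0.90)
print(f"RC = {rc:.2f}; exceeds 1.96 in absolute value: {abs(rc) > 1.96}")
```

Because the pre-test score is entered first, an increase in the score yields a negative RC, so the absolute value is what gets compared against the chosen cutoff.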

The Current Study
The goal of this study was to demonstrate how reliability statistics can provide a basis from which to evaluate qEEG data as a pre- and post-test measure of treatment efficacy. Whereas most coherence reliability research has been conducted either during resting state or while participants were completing cognitive tasks (e.g., Fernández et al., 1993; Thornton & Carmody, 2009)
As previously stated, these measures were used as an interference task, in order to evaluate the test-retest reliability of qEEG after the performance of a cognitively challenging task. Although the scores obtained were not analyzed in this study, future studies will examine the relationship between subjects' working memory and/or executive functioning performance and their qEEG.

Equipment and Software
Dell laptop and desktop computers were used in the collection and analysis of the electroencephalography (EEG) recordings. The BrainMaster Discovery 24 amplifier and corresponding Discovery software (Version 1.8, 2011) were used to record raw EEG data at a sampling rate of 256 Hz. During data collection, the 60 Hz notch filter was used to filter out noise due to other electronic devices in the laboratory. The BrainMaster Discovery amplifier was selected as a result of its compatibility with Neuroguide (Version 2.6.4, n.d.), which was used to analyze the raw EEG data as well as to produce the qEEG maps. MATLAB (Release 2007b, 2007), SPSS (Version 19, 2007), and Microsoft Excel (2007) were also used for data exportation and final data analysis.

Procedure
Participants were fitted with a standard 19-channel Electro-Cap (Electro-Cap International, Inc., Eaton, OH), which used the international 10-20 system for electrode placement. Impedance was kept below 20 kΩ (below 10 kΩ for most subjects) for each of the electrodes.
Additionally, reference leads were placed on participants' ears, and impedance was kept at or below 5 kΩ. These leads were used as a common point of reference for the data collection, and the linked ears montage was used during subsequent data analysis (in Neuroguide). Baseline recordings were taken for 3 min each while the participants' eyes were closed and then open. Participants were also asked to complete one standardized measure of cognitive ability between the baseline EEGs. The average time of completion for the cognitive measure was 5 min 26 s (SD = 5 s). Upon completion of the measure, participants then completed secondary baseline EEG recordings with their eyes closed and then open for another 3 min each. The average time between the start of the two eyes-closed conditions was 11 min 33 s (SD = 5 s). Thirty-nine of the 40 subjects completed the WJ III numbers reversed subtest between the baselines, while one subject performed the WCST. As these were used as an interference task, it is unlikely that the nature of the cognitive task significantly impacted the test-retest reliability. Additionally, the authors did not find any significant differences as a result of the two intermediary cognitive tasks.

Data Analysis
Prior to running analyses, all EEG data was visually inspected by a single examiner to select a minimum of ten seconds of artifact-free data within the first minute of each sample. Care was taken to select data in 2-s epochs whenever possible. This allowed for the use of the drowsiness and eye movement rejection options in Neuroguide, which helped to eliminate artifact from the data that followed recognizable patterns due to eye movement and/or drowsiness. Additionally, the automatic selection function was employed, which used the ten seconds of selected data as a template to automatically select similar data within the sample. This was done to ensure a minimum of one minute of artifact-free data for each session. Following artifacting, data from the eyes-closed EEG recordings were processed into qEEG metrics through fast-Fourier analysis.
A variety of qEEG measures (e.g., absolute power, coherence, phase lag, peak amplitude) were obtained through Neuroguide. MATLAB R2007b was used to collate the relevant raw coherence data from the full Neuroguide reports and to run correlations between Time 1 (T1) and Time 2 (T2) for each of the 171 electrode pairings. Data were then exported to Microsoft Excel and SPSS for additional summary and analysis. Note that while eyes-closed data were used here as an illustration of our method, equivalent eyes-open data are available from the authors, upon request.
In order to summarize patterns in the data, the electrode pairings were grouped into seven zones, based on location in the brain. Zone one (FP1, F3, F7) represented the left frontal lobe, while zone two (FP2, F4, F8) represented the right frontal lobe. Zones three (C3, T3) and four (C4, T4) represented the left and right centro-temporal areas, respectively, while zones five (T5, P3, O1) and six (T6, P4, O2) represented the left and right posterior areas of the brain. The final zone, zone seven (Fz, Cz, Pz), represented the midline (see Figure 1). The electrode pairings were then coded based on the zones in which the electrodes fell, such that each pairing was given two codes. For example, the coherence between the left prefrontal (FP1) and left posterior (O1) electrodes would be coded for zones one and five, respectively. After all of the electrode pairs were assigned dual codes, the pairings were regrouped, such that there were groups representing the coherence between the different zones. For example, one group represented the coherence within the left frontal area of the brain, while others represented the coherence between the frontal, centro-temporal, and posterior areas in addition to the midline. There were seven zones (see Figure 1), and four EEG bands (delta [0.5-4.0 Hz], theta [4-8 Hz], alpha [8-12 Hz], beta [12-25 Hz]), forming 28 groups in all. The reliability coefficients were then averaged and collapsed within each group, which significantly reduced the number of statistical comparisons.
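The dual-coding step described above can be illustrated with a short Python sketch. The zone membership is taken from the text; everything else (the names `ZONES`, `zone_code`, `pair_codes`, and `groups`, and the choice to list the smaller zone number first) is a hypothetical convenience, and the original analysis was carried out in MATLAB, Excel, and SPSS rather than Python.

```python
from itertools import combinations

# Zone membership as listed in the text (19 channels of the 10-20 system).
ZONES = {
    1: ("FP1", "F3", "F7"),   # left frontal
    2: ("FP2", "F4", "F8"),   # right frontal
    3: ("C3", "T3"),          # left centro-temporal
    4: ("C4", "T4"),          # right centro-temporal
    5: ("T5", "P3", "O1"),    # left posterior
    6: ("T6", "P4", "O2"),    # right posterior
    7: ("Fz", "Cz", "Pz"),    # midline
}
ZONE_OF = {channel: zone for zone, channels in ZONES.items() for channel in channels}


def zone_code(ch_a: str, ch_b: str) -> tuple:
    """Return the dual zone code for a channel pair, smaller zone number first."""
    return tuple(sorted((ZONE_OF[ch_a], ZONE_OF[ch_b])))


# Every unordered channel pair, keyed by its two channels, with its dual code.
pair_codes = {frozenset(pair): zone_code(*pair) for pair in combinations(ZONE_OF, 2)}

# Regroup pairs by dual code, e.g. (1, 1) is within left frontal and
# (1, 5) is left frontal with left posterior.
groups = {}
for pair, code in pair_codes.items():
    groups.setdefault(code, []).append(tuple(sorted(pair)))

print(len(pair_codes), "channel pairs")
print("FP1-O1 is coded", zone_code("FP1", "O1"))
```

Keying each pair by an unordered set avoids treating FP1-O1 and O1-FP1 as different pairings.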
Within each group, correlations were run for each electrode pair at T1 and T2 in order to calculate the test-retest reliability of the coherences between the two electrodes.
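As a rough illustration of this step, the sketch below computes a Pearson test-retest coefficient for each electrode pair from T1 and T2 coherence values. It is not the authors' MATLAB code: the function names, the dictionary layout, and the synthetic coherence values (40 simulated subjects, two pair labels) are all assumptions made for the example.

```python
import math
import random


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


def retest_reliabilities(t1, t2):
    """Map each electrode pair to r between its T1 and T2 values.

    t1 and t2 map a pair label to one coherence value per subject,
    with subjects in the same order at both time points.
    """
    return {pair: pearson_r(t1[pair], t2[pair]) for pair in t1}


# Synthetic data for illustration only (not values from the study).
random.seed(0)
t1 = {pair: [random.uniform(0.2, 0.9) for _ in range(40)] for pair in ("FP1-O1", "T3-C3")}
t2 = {pair: [v + random.gauss(0.0, 0.05) for v in values] for pair, values in t1.items()}

reliabilities = retest_reliabilities(t1, t2)
for pair, r in reliabilities.items():
    print(f"{pair}: r = {r:.2f}")
```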
Although a Pearson Product Moment Correlation (r) can be interpreted in terms of size, it cannot be directly combined, as it is restricted in range, and is subject to reduced variances near its extremes (i.e., -1 ≤ r ≤ 1; Cohen, Cohen, West, & Aiken, 2003). As such, these correlations were then transformed using the Fisher's Z' transformation:

$$z' = \frac{1}{2}\ln\left(\frac{1 + r}{1 - r}\right) \tag{3}$$

This was completed in order to calculate mean reliability coefficients for each of the 28 groups, because previous research has suggested that average z' values are less biased than average r values (Corey, Dunlap, & Burke, 1998). Additional statistics were then calculated based on these z' values (e.g., mean, standard deviation, and standard error of the mean [SEM]) in order to calculate confidence intervals (CI). The average z' scores and the confidence intervals were then inverse transformed back to the r metric for ease of interpretation. For additional information regarding this transformation, the reader is directed to Cohen et al. (2003) and Corey et al. (1998).
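The averaging procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `mean_reliability` is hypothetical, the three input coefficients are made-up values, and the 95% critical value of 1.96 is an assumption, since the text does not state which confidence level was used. Python's `math.atanh` and `math.tanh` implement the Fisher transformation and its inverse.

```python
import math
import statistics


def mean_reliability(rs, z_crit=1.96):
    """Average correlations via Fisher's z' and return (mean r, (CI low, CI high)).

    z_crit = 1.96 (a 95% interval) is an assumption of this sketch.
    """
    zs = [math.atanh(r) for r in rs]                # z' = 0.5 * ln((1 + r) / (1 - r))
    mean_z = statistics.fmean(zs)
    sem = statistics.stdev(zs) / math.sqrt(len(zs))  # standard error of the mean z'
    low, high = mean_z - z_crit * sem, mean_z + z_crit * sem
    # Inverse transform back to the r metric for interpretation.
    return math.tanh(mean_z), (math.tanh(low), math.tanh(high))


# Hypothetical reliability coefficients for one group of electrode pairs.
mean_r, (ci_low, ci_high) = mean_reliability([0.80, 0.90, 0.95])
print(f"mean r = {mean_r:.3f}, CI = [{ci_low:.3f}, {ci_high:.3f}]")
```

For these inputs the transformed mean is somewhat higher than the simple arithmetic mean of the three coefficients, which illustrates why the two averaging approaches are not interchangeable when coefficients are close to 1.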
Finally, the authors used the most and least reliable zones to demonstrate the clinical applicability of these reliability estimates using Equation 1. These metrics were chosen to demonstrate the vast variability in the amount of change needed to establish the effectiveness of a given treatment, based solely on the reliability of the measure being used.

Bands
The data were first analyzed by EEG band. Overall, coherence in the alpha band was the most reliable across the two time points, with reliability coefficients ranging from .87 to .97. The next highest reliability for coherence was within the theta range, with reliability coefficients ranging from .83 to .98. Theta was followed by beta (r = .80 to r = .99), and finally delta (r = .74 to r = .96), suggesting that both the low and high extremes are less reliable than the midrange brain waves. These results are consistent with previous research, which has shown that alpha waves contribute significantly to the base rhythm of electrical activity in the brain, and are frequently associated with the default brain network in resting state with eyes closed (Noachtar et al., 1999).
Coherence within the bands was further analyzed, and additional patterns emerged in specific areas of the brain. For instance, reliability of coherence within zones 3 (T3, C3) and 4 (T4, C4) was the highest of any other areas, regardless of band, with reliability coefficients ranging from r = .86 to r = .97 and r = .82 to r = .98, respectively. On the other hand, the reliability of coherence between anterior and posterior areas of the brain (i.e., zones 1 and 2 with zones 5 and 6) demonstrated the least test-retest reliability, with coefficients ranging from r = .74 to r = .99. This too is consistent with previous literature, in that areas close together have been shown to have higher test-retest reliability for coherence than areas that are further apart.

Zones
Due to the differential pattern of results from the band analysis, the data were also analyzed based on location. Zone 1 had the lowest average reliabilities for coherence (r = .74 to r = .98, mean r = .90), while zone 7 had the highest (r = .90 to r = .98, mean r = .93). In ranking the zones from lowest to highest average reliabilities, zone 1 was followed by zones 2 and 6 (r = .80 to r = .99; r = .74 to r = .98, mean r = .91), zones 5, 3, and 4 (r = .78 to r = .99; r = .84 to r = .98; r = .87 to r = .98, mean r = .92), respectively, and finally, zone 7. Additionally, clearer patterns emerged from these analyses than from those based solely on the type of wave. In fact, the reliability of coherence within zones as well as between zones appeared to cluster together based on bands, and followed different patterns across each area of the brain. For the sake of time and space, these zoned reliability coefficients are depicted in graphical form (see Figure 2). To assess numerical patterns among the mean reliabilities across bands and zones, a two-way (7 zones by 4 bands) ANOVA was conducted on the mean reliability values for each zone and band. We found a main effect of band, F(3,168) = 15.52, p < .0001, but no effect of zone, F(6,168) = 1.64, p = .14, and no band by zone interaction, F(18,168) = 1.42, p = .13. Post-hoc tests revealed that delta had lower reliability than all other bands, but that no other bands differed from each other. Detailed means for the coherence reliability coefficients, including additional frequency bands, are summarized in Supplementary Table 1, with further detail available upon request from the authors.

Reliable Change
As previously reviewed, one of the primary benefits of estimating measurement reliability is to help inform parameters for determining clinically significant change as a result of an intervention. To demonstrate the impact of reliability on clinically significant outcomes, a case demonstration is given using the reliable change method for the least and most reliable individual coherence metrics found in the current study. We start with a lower reliability estimate, Delta O2-F8 coherence, which had a reliability of approximately $r_{12} = .70$. To establish Reliability of Change parameters, the coherence reliability metric is first used to calculate the standard error of measurement (SEM), here for a z-scored coherence metric with a standard deviation of $s_1 = 1$:

$$SEM = s_1\sqrt{1 - r_{12}} = 1 \times \sqrt{1 - .70} \approx .55$$

The calculated SEM is then used to calculate the standard error of the difference. Technically, the reliable change equation examines the SEM at two different measurement periods. Here, we assume the reliability estimate for time 1 is also an accurate estimate of the reliability of measurement at time 2. Thus, the standard error of the difference ($SE_{diff}$) can be calculated as follows:

$$SE_{diff} = \sqrt{2\,(SEM)^2} = \sqrt{2\,(.55)^2} \approx .78$$

The standard error of the difference provides an estimate to be used for confidence intervals. Confidence levels are conventionally chosen values that determine the range of score difference needed to conclude that a change in score values is beyond what would be expected from measurement error. The 90% confidence interval is created by multiplying the $SE_{diff}$ by a z-score of 1.64. The estimated range (.78 × 1.64 = 1.28) suggests that a z-scored coherence measure with a reliability of .70 would need to change by approximately 1.28 z-score points to conclude, with 90% confidence, that an intervention (e.g., neurofeedback) had a clinically significant effect. That is, if a client obtained a z-score of -2.0 on a z-score coherence measure and a neurofeedback intervention was implemented to normalize the coherence metric, then a score difference of 1.28 is needed to determine with a 90% confidence level that the intervention has had an impact on the z-score metric, which would be obtained with a z-score of -.72 or higher (-2.0 + 1.28 = -.72).
To further demonstrate the impact of reliability on treatment outcomes, a confidence interval will be calculated for coherence values with higher reliability, such as Beta coherence in FP2-O1, which had a reliability of r = .99. Using the same equation as above, the SEM would be .10. Entering this estimate into the $SE_{diff}$ equation would yield an estimate of .14. For establishing 90% confidence intervals, this estimate would be multiplied by 1.64 to yield an estimate of .23. Thus, the standardized coherence value would need to change by an estimated .23 to conclude that a significant amount of change has occurred beyond what may be attributed to measurement error. To allow use by interested clinicians, individual channel pair reliabilities, as well as $SE_{diff}$ values for each channel pair, are given in Supplementary Table 2.
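The two worked examples above can be reproduced with a small Python helper. This is a sketch under stated assumptions: the function name `minimum_reliable_change` is hypothetical, and the default standard deviation of 1 reflects the z-scored coherence metric used in the examples. Because the sketch carries full precision through each step, the threshold for r = .70 comes out at about 1.27 rather than the 1.28 obtained when the SEM and the standard error of the difference are rounded to two decimals first.

```python
import math


def minimum_reliable_change(r_xx: float, sd: float = 1.0, z_crit: float = 1.64) -> float:
    """Smallest score change exceeding measurement error at the given z cutoff.

    sd = 1.0 assumes a z-scored metric; z_crit = 1.64 is the 90% level used in the text.
    """
    sem = sd * math.sqrt(1.0 - r_xx)     # standard error of measurement
    se_diff = math.sqrt(2.0) * sem       # same reliability assumed at both time points
    return z_crit * se_diff


for label, r in (("Delta O2-F8", 0.70), ("Beta FP2-O1", 0.99)):
    change = minimum_reliable_change(r)
    print(f"{label}: r = {r:.2f}, required change = {change:.2f} z-score points")
```

Passing a different `z_crit` (for example 1.96) gives the corresponding threshold for a stricter confidence level.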

Discussion
Overall, the results of this study suggest that the test-retest reliability of coherence is sufficiently high for most areas (i.e., r ≥ .80). Although not all frequency bands or all areas of the brain demonstrated reliabilities above r = .80, consistent with the power literature, alpha and theta had the highest reliability coefficients. Furthermore, certain patterns emerged, which were also consistent with previous research. For instance, in examining the reliability coefficients by band, the inter-hemispheric reliability of T3-C3 and T4-C4 was the highest of any other areas, across bands. Corsi-Cabrera et al. (2007) found similar results, suggesting that interhemispheric reliabilities tend to be higher than those of intrahemispheric electrode pairs. Also consistent with their study is that many of the highest reliabilities in the current study involve the right hemisphere (i.e., zone 4, zone 2 with zones 4, 5, and 6), which could be due to the higher coherences typically found in the right hemisphere. In general, the results from the current study demonstrate that qEEG coherence, much like absolute power, is a reliable measure of qEEG.
Additionally, as demonstrated with the above examples, the reliability estimates from qEEG metrics may have a large impact on concluding whether or not a treatment has worked. The current study found a large range of reliability estimates for coherence measures. Although most metrics were considered highly reliable, a fair percentage of metrics had low reliability and some were completely unreliable.
Although the causative factors for differences in reliability metrics are unknown and beyond the scope of the current study, coherence values with lower reliability (.70) may require a change in coherence values of over a standard deviation (z = 1.28) due to a large amount of measurement error. In contrast, highly reliable metrics (> .90) require much smaller changes to infer meaningful clinical change (z = .23). The difference in clinical change needed between a highly reliable versus a less reliable metric is over 1 standard deviation.
This provides a concrete demonstration of the importance of reliability in determining treatment outcomes. Given the fact that reliability values may vary differentially across channel pairings, and that this may impact the assessment of clinical effectiveness, both researchers and practitioners may consider incorporating Reliability of Change metrics as part of NF efficacy demonstrations.
Although such parameters are not typically provided in most software packages, the current study provides the basic procedures for estimating these parameters.

Limitations
The current sample was sufficiently large to estimate test-retest reliability; however, larger sample sizes generally provide more stable parameter estimates. Future studies may benefit by replicating the current study with larger sample sizes as well as systematically varying the time interval between the measurement periods. Additionally, although the 60 Hz pass filter was used to filter out typically occurring electrical interference, for some subjects the 50 Hz pass filter was also used (i.e., due to experimenter error), resulting in low estimations of delta, specifically below the 0.5 Hz range, due to overlap in the two filters between 0 Hz and 0.5 Hz. As coherence within the delta range was found to be one of the least reliable, it is possible that these results could be due to this underestimation. Alternatively, delta can be contaminated by EMG and EOG. Thus, the method of artifacting used in this study might have included artifact in the delta frequency. Future studies should examine these possibilities.

Clinical Implications for Assessing Intervention Effectiveness
The applications of qEEG are far reaching, as shown by the immense literature base on the topic. The use of qEEG in psychology is growing, and with it, the importance of research such as this study. However, the validity of qEEG for practical applications will always be limited by its measurement reliability. This study focused on test-retest reliability for coherence because it has been less reported in the research, yet has become a primary qEEG measure used in clinical practice. Indeed, as reported by Thatcher, North, and Biver (2005), coherence is a better predictor of IQ and various cognitive abilities than power. Regardless of the mechanism, cognition has consistently been demonstrated to be an important construct within psychology. In fact, qEEG data has already been linked to a variety of neurocognitive profiles, as well as neuropsychiatric disorders, specifically through the measurement of coherence.
As such, the reliability and validity of qEEG have become increasingly important. This study has demonstrated consistency with previous literature in showing that coherence is a reliable and stable measure of qEEG, and identified patterns of reliability, which can provide further confidence in the use of such methodology for treating cognitive and/or neuropsychiatric deficiencies. Additionally, the study demonstrated the utility of these reliability estimates in measuring reliable change, thereby extending the utility of qEEG to a progress-monitoring tool as well.

Figure 1. Depiction of the zones used for analysis. The bold black lines demarcate the seven zones as defined above (i.e., Zone 1 represents coherence within the left frontal region, between electrode sites FP1, F3, and F7; Zone 6 represents the coherence between electrodes in the right posterior region, P4, T6, and O2).

Figure 2. Mean reliabilities by zone and band. For this study, the bands were defined as follows: delta (0.5-4 Hz), theta (4-8 Hz), alpha (8-12 Hz), and beta (12-25 Hz). Reliabilities were generally high (> .90) across zones and bands, with the highest average values in the alpha band, and lowest in the delta band. Within-zone reliabilities, denoted by bold lines, also tended to be higher than cross-zone values.
Individual channel pair reliabilities and $SE_{diff}$ values for each channel pair.