Reliability and validity of the AGREE instrument used by physical therapists in assessment of clinical practice guidelines

Background The AGREE instrument has been validated for evaluating Clinical Practice Guidelines (CPG) pertaining to medical care. This study evaluated the reliability and validity of physical therapists using the AGREE to assess quality of CPGs relevant to physical therapy practice. Methods A total of 69 physical therapists participated and were classified as generalists, specialist or researchers. Pairs of appraisers within each category evaluated independently, a set of 6 CPG selected at random from a pool of 55 CPGs. Results Reliability between pairs of appraisers indicated low to high reliability depending on the domain and number of appraisers (0.17–0.81 for single appraiser; 0.30–0.96 when score averaged across a pair of appraisers). The highest reliability was achieved for Rigour of Development, which exceeded ICC> 0.79, if scores from pairs of appraisers were pooled. Adding more than 3 appraisers did not consistently improve reliability. Appraiser type did not determine reliability scores. End-users, including study participants and a separate sample of 102 physical therapy students, found the AGREE useful to guide critical appraisal. The construct validity of the AGREE was supported in that expected differences on Rigour of Development domains were observed between expert panels versus those with no/uncertain expertise (differences of 10–21% p = 0.09–0.001). Factor analysis with varimax rotation, produced a 4-factor solution that was similar, although not in exact agreement with the AGREE Domains. Validity was also supported by the correlation observed (Kendall-tao = 0.69) between Overall Assessment and the Rigour of Development domain. Conclusion These findings suggest that the AGREE instrument is reliable and valid when used by physiotherapists to assess the quality of CPG pertaining to physical therapy health services.


Background
Clinical practice guidelines (CPGs) are one option for promotion of quality in health services [1-6] Many countries are faced with common challenges in delivering high-quality health care with available resources and have pursued the development of CPGs as a means to optimize effective and efficient care. As a result there is a need to evaluate CPGs guidelines to assess their quality and their impact on practice. The Appraisal of Guidelines for Research and Evaluation (AGREE) instrument was developed by a group of researchers from 13 countries to provide a systematic framework for assessing guideline quality [7,8] This instrument was thoroughly evaluated and refined and is now a commonly used assessment instrument for CPGs [2,3,[9][10][11][12][13] A large-scale validation study focussing primarily on medical (i.e. physician) guidelines, was conducted supporting the reliability and validity of this instrument [14].
The AGREE Collaboration published the development process and associated reliability and validity data in 2003 [14] This report outlined the rigorous process undertaken to develop the AGREE instrument which included item generation, selection and scaling followed by fieldtesting and refinement. This process resulted in the final instrument with 23 items distributed across six subscales termed "domains", for which reliability and validity data were presented. Reliability was determined by calculating the internal consistency of each domain within the final instrument and assessing the agreement between different appraisers. A total of 33 guidelines were evaluated by the four appraisers. Internal consistency ranged from 0.64-0.88. 'Scope of Purpose' and 'Rigour of Development' were the most homogeneous domains. The inter-rater reliability exhibited a wide range from 0.25-0.91. Reliability was higher with four appraisers and the most reliable domain was 'Rigor of Development'. Higher reliability within the domain of 'Rigor of Development' is a positive finding, as this domain should contain items that are more objective than items contained on other subscales of the AGREE instrument. That is, because the 'Rigor of Development' questions relate to the methodology of developing a CPG and thus there are optimal criteria that would be expected regardless of the content of the CPG. In measurement terms, it is more likely that a "true" score exists for elements within this domain. Therefore, variability observed on repeated assessments of the same CPG should reflect measurement error between appraisers. Other scales such as stakeholder involvement and applicability might reasonably have different criteria depending on clinical expertise or application. In measurement terms, no single true score may exist for these items. Therefore, variability observed between appraisers on these domains might reflect a combination of measurement error, as well as true variations in perspective. This concept is important when assessing and interpreting the reliability of evaluation instruments like the AGREE.
The AGREE Collaboration also assessed face, construct and criterion validity of the AGREE Instrument. Face validity was determined by surveying appraisers attitudes and opinions about the instrument and its associated user guide. Construct validity was determined by comparing scores of guidelines in different subgroups to determine whether they fit three specific constructs. The constructs tested included whether established quality guideline programs produced guidelines with higher domain scores than those developed outside of established systems; whether guidelines supported by well-documented technical information had higher domain scores than those without such documentation; and finally whether guidelines developed as national policies were higher quality than regional or local CPGs. The first hypothesis was supported with respect to editorial independence, but not other domains. The second and third hypotheses were supported with respect to the domain Rigor of Development, but not other domains. Finally, criterion validity was determined by assessing the rank correlation between appraisers domain scores and their overall assessment scores (final item on the AGREE instrument). These correlations were all highly significant (range Kendall's Tau-b = 0.67 -0.88).
Physical therapy and other health care disciplines shares common challenges in providing effective care within limited resources. While many disciplines currently use the AGREE the validation paper emphasized medical practice and practitioners. The nature of physical therapy practice differs substantively from medical practice in a number of ways including access, funding, the nature of interventions, research systems and professional associations; all of these might affect the type of CPGs, developed. Differences in training between disciplines might also cause variations in how the AGREE was interpreted. Additional validation with other settings or users would strengthen the validation of the AGREE across a variety of applications.
CPGs have arisen within the field of physical therapy from a variety of sources [15,16] Professional associations have assisted physical therapy practitioners in becoming aware of the existence of such guidelines through websites and newsletters. Ideally, information on the quality of such guidelines should also be available to assist practitioners in selecting between available guidelines. Currently, this is not the case in physical therapy. While professional associations may help co-ordinate the evaluation of CPGs, they must inevitably must make decisions about this process including which members of the profession are able to evaluate CPGs, how many appraisers should be selected and which instrument should be used. For this reason, our purpose was to determine whether the AGREE instrument is a reliable and valid tool when used by physical therapists to assess CPGs that pertain to physical therapy practice. Our hypothesis was that the AGREE instrument would be a valid tool when used by physical therapists to evaluate CPGs. Our secondary question was whether an ideal number or type of physical therapist appraiser would be evident from reliability data. We hypothesized that 4 appraisers might be best, as the AGREE Collaboration recommends " al least two appraisers and preferably four as this will increase the reliability of the assessment". [7]

Methods
This study was a cross-sectional study conducted to evaluate inter-appraiser reliability and validity of quality appraisal of CPGs performed by physical therapists using the AGREE instrument Permission to use the AGREE instrument was obtained from the AGREE Collaboration. Ethics approval for this study was received from the University of Toronto, Office of Research Services.

Clinical practice guidelines
Clinical practice guidelines (CPGs) are "systematically developed statements to assist practitioner and patient decisions about appropriate healthcare for specific clinical circumstances" [17]. The CPGs evaluated during this study were identified through an inventory that was created by the study authors from a series of systematic searches that included electronic databases, websites, contact of professional associations and guideline developers. The inventory included all located documents that were identified by authors as Clinical Practice Guidelines. This inventory was completed in 2002 and updated yearly and subsequently posted on the website for the Canadian and Ontario Physiotherapy Associations (members only access). Within this database, guidelines were categorized according to the area of physical therapy practice (e.g., musculoskeletal, neurological, cardiorespiratory). Sixty guidelines published in the last five years were selected from this database for inclusion in the present study. Four of these CPGs were excluded because they were not actually CPGs (e.g. systematic reviews included in error) and one CPG was excluded because it was not relevant to physical therapy. Thus in total, a sample of fifty-five guidelines were evaluated in the present study.

Participants/training
All participants were physical therapists who were recruited through advertisements in professional newsletters. A total of 72 therapists responded to advertisements and agreed to participate, 69 actually participated in training and study evaluations -two others had personal circumstances that prevented them from attending training and one failed to respond further. The participants were classified according to the following criteria: 1. Clinical Specialists were Physical Therapists who were currently practicing in a specific area, had a minimum of three years experience and had participated in at least one post-graduate course per year in their area of clinical expertise. 2. Generalists were Physical Therapists practicing in a variety of areas of physical therapy or an ongoing general practice, which covered a broad-spectrum of neurological, orthopaedic and cardiorespiratory health problems. 3. Researchers were Physical Therapists with or approaching completion of a graduate research degree (M.Sc. and/or PhD) with experience in conducting clinical trials or outcomes research and experience in formal critical appraisal of clinical research.
Demographic data was collected on all participants. Eligible participants were provided with training materials which included the AGREE instrument (form and associated interpretation guidelines)[7,18] as well as a multiplechoice test that required participants to answer questions on the content and structure of the AGREE (see Additional file 1). In addition, a sample guideline was provided to participants with instructions to read the guideline (on management of lymphedema following treatment for breast cancer [19]) and appraise it using the AGREE instrument and associated documentation.
Subsequently, all participants met by teleconference for one hour with a subgroup (4-8) of study participants. The sessions (a total of 6 teleconferences conducted) were led by a single facilitator (first author). During this session participants reviewed their responses to each question and discussed their rationale or concerns regarding scoring. The facilitator did not indicate a correct score for any individual item. Participants were instructed that the group facilitator would not indicate whether a given score was correct, as this was not possible for many of the items. Participants were directed to appreciate the difference between items that had clear answers because they inquired about specific factual information versus those that had a subjective element where responses might vary according to context. Participants were corrected if they incorrectly interpreted the meaning of a given item on the AGREE instrument. Although consensus was not the objective of the training sessions, participants tended to come to consensus after discussing items with colleagues and a facilitator. In total, 69 appraisers attended the training sessions.

Appraisal instrument
The appraisal instrument used to evaluate the CPGs was the AGREE [18]. This instrument consists of 23 items organized in six domains ; each domain is intended to capture a separate dimension of guideline quality. The following domains are included: 1. Scope And Purpose: 3 items that address the overall aim of the guideline, the clinical question and the target population 2. Stakeholder Involvement: 4 items that address the composition, expertise and representation of the development group 3. Rigor Of Development: 7 items that evaluate the process used to locate and synthesize the evidence and to formulate and update the recommendations, 4. Clarity And Presentation: 4 items that address language and format 5. Applicability: 3 items that address the potential organizational, behavioural and cost implications of implementation and 6. Editorial Independence: 2 items that address potential conflicts of interest.
Items are rated on a 4-point scale with endpoints of 4 'strongly agree' and 1 'strongly disagree'; the two midpoints are 3 'agree' and 2 'disagree'. A section for overall assessment is included at the end of the instrument that requires the appraiser to make a judgment about the overall quality of the CPG. Appraisers are asked whether they would 'strongly recommend', 'recommend (with provisos or alterations)', 'would not recommend' or are 'unsure' if they would recommend the CPG for practice.

Evaluation process
All participants completed the training program and proceeded to evaluate a set of 6 CPGs. These CPGs, six copies of the AGREE instrument, and a pre-paid return were provided by mail. Each guideline was evaluated independently by three pairs of appraisers who were randomly picked from the three pools of Physical Therapists (i.e. two clinical specialists in the area of the CPG, two generalists and two researchers). All appraisers returned their packages, although up to 3 reminders calls were required for late returns. Participants who completed the study were provided with an honorarium ($100).

Data analysis
Data analysis was conducted to verify the quality of data, assess instrument reliability and determine the validity of the AGREE instrument for physical therapy practice. SPSS statistical software for Windows (Version 11.0; SPSS Inc, Chicago, Illinois) was used for all statistical analyses. Pvalues of 0.05 or less were considered significant.

Data Entry/Quality
Data entry was completed by a single research assistant who inspected data for errors once the data file was complete. The first author conducted random checks of data entry against original data sheets. Descriptive statistics were conducted to identify outliers or unusual values.
Domain scores of each CPG were calculated as recommended by the AGREE Collaboration. The scores of the individual items in the domain were summed and standardized as a percentage of the maximum possible score for that domain (AGREE Collaboration, 2001).

Reliability analyses
The internal consistency of each domain was evaluated using Cronbach's Alpha. The reliability between appraisers was determined for each question and each domain of the AGREE. Intraclass correlation coefficients (ICC 1,1) were calculated within each pair of appraisers and across the pool of appraisers. A one-way random effects model was used as pairs of appraisers were randomly selected from our pool of physical therapist appraisers. An unweighted and quadratic weighted kappa were calculated to indicate the agreement within pairs of appraisers on whether a CPG was appropriate for clinical utilization.
ICCs or kappa values above 0.75 were considered to represent good, 0.40-0.75 moderate and <0.40 poor reliability [20] Validity analyses Face validity The face validity of the instrument for physical therapy practice was determined from feedback provided on the instrument from two sources. Participants (experienced physical therapists) in the study were invited to provide feedback at the training sessions (open-ended questions regarding the training session and the AGREE itself -verbal response). They were also asked to add comments about any items, any issues with clarity or concerns directly on their AGREE form when they were using the AGREE on their assigned CPGs. These were returned, by mail, with their ratings. In addition, over the course of two years, a sample of 102 entry-level-masters trainees at McMaster University were provided the training materials (except for multiple-choice questionnaire) and were required to complete an assignment where they evaluated the same CPG [19] used during the study training session. This assignment consisted of a facilitated group component where students worked in groups of 4-6 to complete the AGREE evaluation for the assigned CPG. The individual component of this assignment required each student to write a 1-2 page essay evaluating the instrument itself in standardized domain score obtained score minimum possible = − s score maximum possible score minimum possible score x100 − terms of its relevance to clinical practice, validity and their personal preference about whether they would use it again. This information was summarized by the course instructor (first author) and the percentage of students who responded that they would use the instrument again was tabulated.

Factor (domain) validity
The validity of the domain structure was evaluated using a principal components, varimax rotated factor analysis. Item means across all 6 appraisers were entered into the analysis. Coefficients were evaluated to determine whether they supported the domain structure and followed a similar pattern as to that reported for a previous factor analysis published by the AGREE Collaboration.

Construct Validity
Construct validity was assessed by evaluating 2 hypotheses. The first hypothesis was selected to match the hypothesis tested by the AGREE Collaboration guidelines [14] and supported by others [21,22], that CPGs developed by established guideline developers should have higher quality scores than those created outside of established system. All guidelines in the database were classified as having been developed by established guideline developers if it could be identified that an experienced guideline developer or development group was responsible for a specific guideline. A CPG was classified as having "Experienced Guideline Developers" if it 1) had more than 3 authors (also fulfilled by an agency) and 2) at least one team member could be identified an a methodologist experienced in CPG development -either by descriptions contained within the body or the CPG or after a review of listed authors (conducted by JM and DB). If this could not be verified the CPG was classified as " No or Uncertain Methodology Expertise". Our second hypothesis was that physical therapists would be more likely to recommend a guideline that was rated as having higher Rigor of Development scores. An independent t-test was used to evaluate the scores obtained for Rigor of Development for guidelines judged as acceptable versus those that were not. This hypothesis test is not ideal, as we are testing whether this subscale contributes to the overall rating within the same instrument. Nevertheless, in the absence of an external criterion, we choose to use this analysis given that it was also conducted in the original validation paper and there was an advantage to having a comparable analysis.

Criterion validity
Finally, criterion validity was assessed in the same manner as reported by the AGREE Collaboration in their validation study. Again we recognize we did not have an external criterion. Kendall Tau B Rank correlation coefficients were calculated between the appraisers domain scores and the overall assessment score.
The majority of specialists were orthopaedic physical therapists (55%), followed by neurological (24%) and cardiorespiratory (10%). One participant reported to specialize in paediatric physical therapy. Two participants did not specify their area of specialization.

Reliability
Analysis of reliability of individual items indicates a trend for higher reliability in items within the domain from Rigor of Development (See Table 2). Intraclass correlation coefficients (ICCs) for each domain of the AGREE instrument for pairs of appraisers are presented in Table 3. No systematic differences were observed that would indicate that type of appraisers had any substantial impact on the reliability obtained. Variation in reliability was observed across domains with Rigor of Development demonstrat- ing the highest level of reliability. Few ICCs reached the excellent benchmark of 0.75, if a single appraiser performed the evaluation. ICC models that estimate the reliability when appraisers were averaged using models (1,2) for pairs or (1,6) across all six appraisers indicated substantial improvement in reliability if appraisals were averaged across appraisers. When comparing the reliability across different numbers of appraisers (Table 4) the improvement in reliability was most notable when going from two to three appraisers, with the exception of editorial independence). Additional benefit for adding additional appraisers was inconsistent. Agreement on the overall assessment of the CPG had low reliability for generalists and specialists and moderate reliability for researchers. Quadratic weighting demonstrated some improvement in reliability coefficients for generalists and researchers, but not for specialists. (Table 5)

Validity
Face Validity/ User Feedback Study participants provided feedback during training sessions primarily with respect to the training session itself. They found the opportunity to discuss the results with others to be useful as a means of understanding the intent of individual items. Only three participants had previously been exposed to the AGREE instrument and that majority expressed positive comments about the value of learning about the AGREE instrument. Some clinicians expressed some anxiety about the role for CPGs and how they might be used. None of the study participants provided any feedback when returning their mail packages.
The entry-level physical therapists (students) uniformly agreed that the AGREE instrument provided a useful structure and guidance in the evaluation of the CPG. Students compared the AGREE instrument to evaluation instruments they had used for critical appraisal of different study designs, such as clinical trials and systematic reviews. Students frequently commented that these previous instruments had a more open-ended format and expected the reviewer to understand issues in critical appraisal with little direction as to expectations or scoring criteria. Thus, they found the concrete nature of the AGREE instrument and the clear instruction on interpretation to be a useful framework for the evaluation process. Students stated that this direction increased their confi- dence that they had addressed all important issues. Although students differed on their ratings for individual questions, as well as the overall usefulness of the CPG evaluated, the majority of students understood and correctly interpreted the intent of the items from all of the AGREE domains. The majority of the students, 96 %, stated that they would use the AGREE instrument in other situations. A concern raised by the remaining 4% and other students who would continue to use the instrument was the length of the form and the amount of time required to complete the evaluation, given the busy nature of clinical practice.

Factor analysis
The factor analysis supported a 4-factor solution. The first factor explained 45% of the variance and contained items primarily from the Scope and Purpose or Rigor of Development domains. The second component explained 12% of the variance and contained items primarily from the Clarity and Presentation or Applicability domains. The     Table 6.

Construct validity
The construct that the CPGs developed by expert guideline development groups would have a higher score on the domain Rigor of Development was supported (See Table 7). The construct that therapists would be more likely to recommend for usage a CPG with a higher quality on the domain Rigor of Development was also supported (mean of 76 vs. 58 p <0.001).

Criterion validity
The correlation between overall assessment and the domain scores ranged from low (0.38) to moderate (0.64), with the highest correlation being observed for the Rigor Of Development domain. (Table 8)

Discussion
The study findings suggest that the AGREE instrument is reliable and valid when used by physical therapists to evaluate CPGs. While some differences exist between the results reported in the original validation study authored by the AGREE Collaboration, the similarities far outweigh the differences. This would suggest that the process of evaluating CPGs using the AGREE instrument can be transferred to physical therapy practice to support the translation of higher quality CPGs into physical therapy health services.
Typically, individual item reliability is of little relevance when evaluating the properties of a measurement scale, as items are not intended to be used in isolation. However, it may be useful from a practical point of view to form hypotheses about were further training might be necessary, if it appears that appraisers have particular difficulty with certain items. As discussed above, one must be careful in interpreting reliability coefficients in isolation, particularly for instruments like the AGREE where some items have "relatively true" scores and others have a spectrum of true scores. For the AGREE this spectrum might exist on application items. Thus, we would expect low item reliability might result from items where there is large measurement error, but also from items where there is substantial variability in how the item might apply in different circumstances. Our data support this. Low reliability was evident on items such as whether the guideline was editorially independent (2/8 ICCs were not greater than 0) and whether key review criteria for monitoring or audit had been provided (2/8 ICCs were not greater than 0). Study participants also appeared to have difficulty interpreting these items during the training session and thus these disagreements may reflect a lack of consensus about the meaning of these items. In some CPGs information on Editorial Independence is contained in footnotes and was missed by some evaluators. Improvements in reliability on these items might be anticipated with further training. Other items where low reliability was observed included whether the clinical question was specifically described (6/8 ICCs were not greater than 0) and whether the patients to whom the guideline applied were specifically described (2/8 ICCs were not greater than 0). During training, participants appeared to understand these questions, although expectations about what constituted an appropriate description of the clinical question or patient population tended to vary according to the participant's level of expertise in that given clinical area. Thus, disagreements on these items may partially relate to differences in priorities or familiarity with relevant issues between participants. Further training is unlikely to enhance reliability in this case, but raises the question about the importance and specificity of relevant clinical expertise when selecting evaluators. Although our intent was to evaluate the importance of clinical expertise, we classified participants quite broadly. Participants classified as clinical specialists were provided CPGs within their broad area of practice; but, we did not ascertain whether in fact the specific topic of the CPG was an area in which they actually did have knowledge or experience. For example, a clinician with expertise in musculoskeletal practice might practice primarily in a narrower area within the field, such as, rheumatology, upper extremity, lower extremity, joint replacements etc. In such cases, their familiarity with the salient features of the specific conditions might be less detailed and result in a differential evaluation regarding whether the clinical population had This table contains the scores for the Agree Domain on Rigour of Development. CPGs were classified as having an expert panel if 1) there were more than 3 authors listed (or an agency) and 2) there was an experienced CPG methodologist clearly identified as a panel member or if one of the study investigators was recognized as such. All others were classified as "No or Uncertain Expertise". The p value for the independent samples ttest is shown. Kendall's Tau-b correlations were conducted between the mean rating of the over assessment of the CPG across all raters as compared to the mean of each Domain score. As hypothesized the correlation was highest with Rigour of development.
been appropriately defined. Future investigations should focus on content knowledge in a more specific sense to determine the importance of this issue when selecting evaluators.
It is not uncommon for the policy developers to use reliability data to set standards for the number of appraisers required when establishing guideline evaluation processes. Our data provide little direction in this regard, except to suggest that more than two appraisers are advisable. This is consistent with the minimum recommendation of 2 made by the AGREE Collaboration.
[7] Measurement theory suggest that additional appraisers/ ratings should produces higher levels of reliability [20] as do the AGREE Collaboration when stating that 4 appraisers is preferable. Our data did not follow this trend, beyond three appraisers. Our study evaluated the reliability of appraisers performing their assessment in isolation. This process replicates that which might be used by a working committee independently assessing quality scores. However, individual therapists might also need to evaluate CPG when no committee has been established to do so. In these cases, we would recommend three or more appraisers should still be recruited. We recognize that when evaluating CPGs, it is not just the score, but the process that is important. Adding additional appraisers with different perspectives may increase the variability/ disagreements. In fact, our data support this, as quadratic weighting of disagreements improved reliability coefficients in generalists and researchers, but not specialists. Specialists tended to disagree on whether a CPG should be used are not, but were less uncertain about that recommendation. We suggest the ideal approach to evaluation of CPGs is one where a group of potential endusers with clinical and guideline expertise work together to review a potential CPG by using the AGREE to facilitate a consensus process to determine both the quality and relevance to practice. These quality ratings should be disseminated to the relevant clinical communities. We recognize that the ideal process will not always be possible. Individual clinicians who must make decisions about utilization of CPGs should recruit colleagues to assist with the evaluation process before modifying their clinical practice based on CPGs with unknown validity.
Although interpretation of validity analyses is complex and requires some subjective decisions, overall our validity analyses are supportive of the AGREE instrument and are substantively similar to that reported by the AGREE Collaboration for 33 medically based CPGs [14] A factor analysis structure which suggests that the concepts measured by the AGREE are similar to the Domains described by the AGREE underlies the content and structural validity of the scale. The factor analysis was strongly supportive of the domains of Stakeholder Involvement and Editorial Independence. This would support uniqueness of items within these domains. The largest factor contained items from the domain Rigor of Development as well as the first two items in Scope and Purpose. Conceptually these two items, which require specific objectives and a well-defined clinical question, fit well with a process of rigorous development. That the largest factor relates to methodological issues supports the AGREE as an evaluative tool. The second factor contained the items regarding whether the patients were specifically described, whether procedures for updating the guideline were provided, all of the items from the applicability domain and 3/4 items from the clarity and presentation. Clarity of presentation, applicability and a defined patient population may all relate to the ability to implement CPGs and thus all retain a conceptual relationship to the domains Applicability or Clarity of Presentation described by the AGREE Collaboration. The third factor contained all items of the stakeholder involvement as well as the additional question in rigor of development addressing whether experts had externally reviewed the guideline. In general, these results support stakeholder involvement as a subscale and include external involvement within this distinct domain. The fourth factor contained only two items, one the item on health benefits/ side effects and a second whether recommendations were specific and unambiguous. This factor explained a small percentage of the variance and contained only two items, making it difficult to relate these results to the factor analysis presented by the AGREE Collaboration. While the results vary somewhat from those reported by the AGREE Collaboration, there is substantial similarity, particularly when one examines the concepts that are represented by the items that clumped together. We view our factor analysis results with caution given the small sample sizes. The AGREE Collaboration study use 100 guidelines and conducted this analysis during the field-testing prior to inclusion of item 23. Our analysis included only 56 guidelines. We did not conduct formal sample size calculations, as practical limitations on guideline availability determined our sample size. Rules of thumb suggest ten "subjects" per item for factor analysis, requiring 230 CPGs for a well-powered analysis.
Thus, some inherent instability should be expected in the factor analyses conducted to date. With the proliferation of CPG, we expect that larger samples might be evaluated in future studies and provide more definitive results on the factor validity of the AGREE instrument.
Construct validity tests also supported the validity of the AGREE instrument. Constructs were developed to replicate those used in the previous validation of the AGREE [14] and provided similar findings. Despite our inability to be confident about whether some CPGs had a methodologist, we still found large differences between the Rigour of Development Domain score obtained where we could identify a team containing a methodologist as compared to where this was not present or unclear (21 point difference). The AGREE Collaboration found a 9 point advantage in this domain for guidelines developed within a guideline development program and 19 points for those developed as national initiatives. Similarly, our correlations between overall assessment scores and those reported for Rigor of Development (0.64) concurred with those reported by AGREE Collaboration data (0.87) in that both suggest a substantive relationship.
Overall, our findings were remarkably consistent with those reported by the AGREE collaboration when validating the AGREE as used by medical practitioners [14] Validation by independent groups in different settings provides a stronger foundation for any instrument. These findings can be used to increase confidence in the current practice of different health care disciplines to use the AGREE to facilitate the evaluation of CPGs.

Conclusion
1. The AGREE instrument is reliable and valid when used by physical therapists to assess the quality of clinical practice guidelines 2. A minimum of three appraisers should be used to optimise reliability, although issues on effecting knowledge transfer or maximizing validity might increase these requirements.
3. There is no evidence that specific types of physical therapists provide more reliable scores. As this study did not address whether familiarity with the actual content of the CPG influenced reliability, this should be studied in the future.
4. There is evidence that guidelines developed with the assistance of experienced guideline developers are more rigorous.
5. The AGREE instrument was found to be useful in assisting both novice and experienced therapists in evaluation of clinical practice guidelines. The majority of therapists would continue to use the instrument.
6. Validity analysis supported the majority of results reported by the AGREE Collaboration) [14] The type of CPGs and evaluators were different in the current study supporting the validity of this instrument across a spectrum of circumstances.