State Teacher Quality Toolsets

May 14th, 2013

We’re working diligently on adding toolsets to support individual state standards for teacher quality observations. We’ve passed the 25-state mark, and there are more on the way. Go to eCOVE Home/Solutions to see the current list of states. All state toolsets are FREE add-ons. You’ll need a license key to install one; just email and let me know which state toolset you want.

There are also a few toolsets to match state standards for administrator evaluations. And we have partnered with ISTE to create and distribute (also free) the NETS & Computational Thinking Toolset.

The teacher quality toolsets include both scale and choice/checklist tools. The number of tools depends on how the standards are written, but the ranking in the scale tools matches the requirements of the specific state (i.e., ineffective to distinguished). For each item in the scale tools there is a matching, aligned checklist tool with ‘observed / not observed’ as its indicators.
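For those who like to picture the structure, here is a minimal sketch of how a scale item and its aligned checklist item might be represented; the field names, the sample standard, and the ranking labels are my own placeholders, not the actual eCOVE toolset format.

```python
# Illustrative sketch only: the field names, sample standard, and ranking
# labels below are placeholders, not the actual eCOVE toolset format.
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class ScaleItem:
    standard: str                    # the state standard being observed
    levels: Tuple[str, ...]          # state-specific scale, e.g. ineffective..distinguished
    rating: Optional[str] = None     # the observer's selected level

@dataclass
class ChecklistItem:
    standard: str                    # same standard, aligned with the scale item
    observed: Optional[bool] = None  # 'observed' / 'not observed'

# One standard recorded two ways in the same observation
scale = ScaleItem("Uses questioning to check for understanding",
                  ("ineffective", "developing", "effective", "distinguished"))
checklist = ChecklistItem("Uses questioning to check for understanding")
scale.rating = "effective"
checklist.observed = True
```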

If you want to work with a teacher to improve the classroom, you can then choose from the many timer/counter objective data tools to provide a clear and easily measured record of progress.

If your state is not on the list and there is a current state-approved teacher quality standard in place, please email me. Include a link to the standards if possible and we’ll add it to the available toolsets.

Final note: It really does take all three types of data to go beyond the simple (and faulty) checklist systems and present an accurate and useful record of teaching practices for the resulting teacher evaluation.

Inter-rater reliability training

April 17th, 2013

Establishing inter- and intra-rater reliability is a complex and important issue. As the Director of the Graduate School of Education at Willamette University, I constantly struggled with this problem with our student teacher supervisors. It was that experience that led to the development of the eCOVE software.

 

Here is the issue as I see it:

There are three types of data that can be collected in an observation: evaluative, summary, and descriptive.

Evaluative data is the judgment of the observer based on a common rubric; typically a scale is used to record a ranking from unsatisfactory to distinguished. This is the most problematic, the least valid (depending on the wording of the rubric), and the least reliable. Calibration is a continuous process and is only effective when done in real classrooms. Calibrating against a video is valid for that video and won’t hold up when compared to other observations. The practice of multiple observers doing ‘rounds’ together in live classrooms is a common approach, but it must be repeated very frequently to hold up statistically.

Summary data is the summary of behaviors (teacher and/or student) noticed during the observation, and typically a checklist is used to record ‘observed / not-observed’. For example, the observer is looking for student engagement and sees many forms of engagement – students writing, asking questions, working on a project, discussion, etc., and ‘observes’ that the students are ‘engaged’. This also has validity and reliability problems as the internal definitions of ‘student engagement’ can vary from observer to observer. Again, calibrating against a video does not reliably translate to real classroom observations.

Descriptive data is the objective record of the frequency and/or duration of teacher and student behaviors during the observation. These behaviors are, in fact, the actual basis for the decisions made in summary or evaluative observations (even when not recorded). Using tools that record the count or length of time of specific behaviors that demonstrate the satisfactory/proficient levels of a standard is the most valid and accurate record that can be made. This is the most valid and reliable level of observation data, and establishing inter- and intra-rater reliability is much, much easier.

Calibration problems:

Observer bias: nine types of observer bias have been identified in classroom observations, and the observer is most often unaware of the bias. These become most evident in evaluative and summary observations, where the values are internal and not easily identified. One observer may be biased in favor of enthusiastic teachers or more polite students when judging teaching quality or student engagement, and the wide variety of values between observers is nearly impossible to control for in a calibration.

Timing: Observers have been found to judge behaviors differently depending on the day of the week and the time of day, and to be influenced by events prior to the observation. Even within a single observer, ratings vary between Monday and Friday, morning and afternoon, or after an unrelated stressful event. When this is not controlled for, the calibrations of evaluative and summary-type observations can be flawed and unreliable.

Recorded versus live classrooms: Recorded classroom video just cannot accurately represent the real dynamics of teaching styles/practices, student demographics, or behaviors that exist in the wide range of real classrooms. The videos have been carefully selected to demonstrate the intended behaviors; agreement between observers, or between observers and experts, will not carry over to observations in the real classroom. While it appears to be an efficient approach, it’s not one that can result in high confidence of consistency across multiple observers and multiple real observations.

The eCOVE Approach:

Useful and accurate calibration is not between the judgments of observers, but between the actual behaviors occurring and the observer’s ranking. For example, if a standard is related to efficient use of classroom time and the observation covers a typical class period from start to finish, the calibration is between the ranking (unsatisfactory, basic, proficient, distinguished) and the actual percent of the total time in which there was an opportunity for students to learn. Where one observer might rank a lesson as basic and another as proficient, rater reliability can be reached much more easily by comparing each rating to the descriptive data. The training involved is two-fold: learning to accurately record the descriptive behaviors (i.e., learning time, external interruptions, internal interruptions) and coming to agreement on the application of the standard. For example, when the learning time is 68% of the observation time, is that to be rated as basic or proficient?
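To make that calibration step concrete, here is a small sketch that turns timer totals into a learning-time percentage and looks the result up in a set of rating bands; the cutoff values are placeholders that a group of observers would have to agree on, not eCOVE or state-defined thresholds.

```python
# Sketch only: the band cutoffs are hypothetical placeholders that a
# calibration group would set, not official criteria.
def learning_time_percent(learning_seconds, total_seconds):
    """Percent of the observed period with an opportunity to learn."""
    return 100.0 * learning_seconds / total_seconds

def rating_for(percent, bands=((50, "unsatisfactory"),
                               (65, "basic"),
                               (80, "proficient"),
                               (100, "distinguished"))):
    """Return the first band whose upper bound covers the percent."""
    for upper, label in bands:
        if percent <= upper:
            return label
    return bands[-1][1]

# e.g. 68% learning time: is that basic or proficient? The answer is
# whatever cutoff the observers agreed on beforehand.
pct = learning_time_percent(learning_seconds=2448, total_seconds=3600)  # 68.0
print(round(pct, 1), rating_for(pct))  # -> 68.0 proficient (with these cutoffs)
```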

In all levels of observer judgment, the descriptive behaviors form the basis. Observers see the behaviors and, in their minds, form a summary before they apply the criteria of quality (the standard/rubric). The key to consistent observer judgment has to start with accurate observation of the behaviors. That’s why eCOVE includes tools for collecting all three levels of data, so that the full process can be compared and calibrated.

Conclusion:

We don’t provide video, nor a process for comparing an observer’s ranking against a preset ranking.

We do have descriptive tools, and the ability to create new tools, that track any observable behavior. These can be used to calibrate observers’ ability to identify and record the observable behaviors, and video is useful for this purpose. In-district video is the least expensive and has other advantages. We also have, or can create, the checklist or scale tools to record observers’ rankings. The discussion to establish the criteria for the ranking – based on the levels of observed, descriptive behaviors – is the calibration process. This approach goes much deeper in establishing dependable and defensible observations and teacher evaluations, and has a longer-lasting effect.

Self-Directed Professional Growth and eCOVE

May 22nd, 2010

I’ve been in numerous discussions with Teacher Leaders recently, and come across a common frustration. It seems that no matter how hard they try to convey that they are ‘just there to help’, there lingers a resistance and trust issue. When I dig deeper, I find that the observation process is one of identifying ‘good’ practices, and sometimes ‘bad’ practices; or it’s one of taking notes on ‘what occurred during the observation’.

I think that when the process involves, in any way, judgment or valuing language, there will be a defensive resistance. Even notes are a problem, as the choice of what to record is a judgment made by the observer.

Since this doesn’t occur in the Data-Based Observation Method, it’s really easy to get past the initial trust concerns – you just have to follow the system and prove that you are not there to hammer them with the data.

Key concepts: Don’t Praise, Don’t Criticize, Don’t Provide Solutions!

Follow this sequence of interaction:

Pre-conference: Centered on determining what data to collect – “What do you want to know about your classroom?” This can be in light of a teacher’s individual goals, a teacher-perceived problem, or building/district/state standards.

During the observation, gather data without making any comments reflecting praise, criticism, or solutions.

In the post-conference, ask these questions when presenting the data:

  • Is this what you thought was happening in the classroom? (teacher reflection and interpretation)
  • Do you think a change is indicated? (teacher and observer professional discussion about the interpretation of the data)
  • If so, what will you change? (teacher ownership and empowerment; enhanced professional discussion)
  • How can I support you? (professional collaboration)
  • When should follow-up data be collected to see if the change is effective? (making the entire process not one of pleasing the observer, but in implementing effective change)

This approach shifts the dynamic from defensiveness to empowerment, from judge to colleague. There is no observer, Teacher Leader or Administrator, who can solve every classroom problem. It’s far better to develop the teachers’ skills in reflection and problem solving. This can be accomplished by basing the discussions on data rather than opinion.

Adding classes and students with Excel

May 22nd, 2010

How to add classes and teachers/students with an Excel spreadsheet.

It’s quite easy using the picture-book instructions and templates. Just type in the information, save as a tab-delimited file, and import from within eCOVE on the computer. The new setup will be available immediately, and if you sync an iPhone/iTouch the additions will show up there.

This is useful for entering or updating a small to medium number of observees. If you want to add an entire school, it’s better to export from the student information system, convert to the eCOVE template format, and then import. Look on the website under Support/Manuals for manuals #4 and #5.
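If you are scripting the whole-school route, a rough sketch of the conversion step is below; the column headings and field names are invented placeholders, so substitute the actual headings from the eCOVE import template in the manuals.

```python
# Sketch only: the column names below are hypothetical placeholders.
# Use the headings from the eCOVE import template, not these.
import csv

def csv_to_tab_delimited(csv_path, out_path):
    """Re-save a student information system export as a tab-delimited file."""
    with open(csv_path, newline="", encoding="utf-8") as src, \
         open(out_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=["Class", "Student"],
                                delimiter="\t")
        writer.writeheader()
        for row in reader:
            # Map whatever the SIS calls these fields to the template headings.
            writer.writerow({"Class": row["course_name"],
                             "Student": row["student_name"]})

# csv_to_tab_delimited("sis_export.csv", "ecove_import.txt")
```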

If you have trouble uploading the file, take a second look at it, especially at the column headings. If you don’t see something obvious, email the file with a note to  

Bloom’s Original versus Revised Bloom’s

July 5th, 2009

In a nutshell, the Revised Bloom’s Taxonomy is easier to use reliably during an observation. The original Bloom’s is very useful when examining written questions, but if you’re not quite skilled with it, it’s difficult to categorize questions on the fly. The revised version actually covers the same behaviors, but the terms used make it easier to identify spoken questions. The Revised Bloom’s tool tracks the cognitive demand placed on the student when responding to the question – are they being asked to remember or to analyze?
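As a rough illustration of what that tracking produces (not the eCOVE report format), the record is simply a tally of questions by Revised Bloom’s level, something like the sketch below; the six level names follow the published Revised Taxonomy, while the sample data is invented.

```python
# Sketch only: a running tally of observed questions by Revised Bloom's level.
from collections import Counter

REVISED_LEVELS = ("remember", "understand", "apply",
                  "analyze", "evaluate", "create")

questions = Counter({level: 0 for level in REVISED_LEVELS})

# During the observation, record the level of each question you heard:
for heard in ("remember", "remember", "understand", "analyze", "remember"):
    questions[heard] += 1

total = sum(questions.values())
for level in REVISED_LEVELS:
    share = 100 * questions[level] / total
    print(f"{level:10s} {questions[level]:3d}  ({share:.0f}%)")
```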

In addition, the Revised Bloom’s Taxonomy includes sub-categories for each level, which can be very useful when doing more detailed observations of question-asking or question-answering behaviors. The original Bloom’s Taxonomy is in the Basic Edition; the new Revised Bloom’s is in the current advanced Administrator and General Editions. All of the Bloom’s tools (there are 12) are available under the Solutions/View 200+ Tools heading on the website.

Here is a comparison chart.

Comparison chart – Original Bloom’s versus Revised Bloom’s

Observing for Fidelity of Implementation

December 20th, 2008

As we continue to create new tools for general classroom observation, special education, sheltered instruction, and the implementation of a curriculum or behavior plan, it’s becoming clearer that there is a major need to focus on the fidelity of implementation. So often the evaluation efforts are focused on either student outcomes (test scores or a culminating performance) or the level/type of student engagement. While those are critical pieces of data, the first data need to be on whether the teacher is implementing the curriculum or behavior plan as it was designed to be used.

I’m not in favor of lock-step following of the directions of ‘experts’, especially textbook publishers. To expect someone not familiar with the students and the school culture to lay out a specific sequence of teaching steps is asking too much. However, good curriculum is carefully designed and reviewed, and should have a consistent delivery system that should be initially followed. It’s very important to track the fidelity of implementation if you want to determine if the new curriculum or behavior plan is effective. If the teachers are consistent in the organization and delivery, then the data on change of student behavior can be trusted. Without the fidelity of implementation data, you would have no way of determining the cause of the success or failure of the efforts. Is it because the curriculum and/or delivery system is flawed or is the actual delivery by the teacher inconsistent or significantly changed?

With eCOVE Observation Software and the ability to create tools to match the desired behaviors, it’s possible to track both in the same observation. And with the ability to see the data over time and in comparison to other groups, it’s easy to get serious about making data-based decisions. I get excited thinking about the time saved and frustration avoided when you can use the data to make in-progress adjustments, schedule retraining, or engage the teachers in an objective evaluation of the intervention. Rather than waiting for the outcomes to determine that it’s not working, tracking the student behavior can indicate the effectiveness of the intervention early in the process. Where that effectiveness is lacking, the data on the fidelity of implementation can help identify the cause so that in-progress corrections can be made in a timely manner.

Of course, this applies not only to school-wide curriculum implementation but also to an individual student’s IEP. Whatever the level, the three basic questions are the same: Is the intervention being appropriately implemented? Is it having the desired effect on the student’s behavior? Does that result in greater learning? You need all three pieces of data to make professional decisions.

Peace,   John

Thoughts after a presentation on Observation Reliability

December 15th, 2008

This is an email sent to a person who requested a copy of the PowerPoint presentation I gave on Observation Reliability, where I presented my new idea about the sequence from research to standards to indicators to data collection to teacher support and evaluation. If you’d like to see the PowerPoint, send me an email.

When I first wrote eCOVE I was focused on giving helpful feedback to student teachers. From years of working with student teachers and new teachers I knew that they needed help thinking through the problems that came up in their classrooms. Providing them with ‘my’ answers and ideas was of much less benefit than getting them to think through things and devise their own solutions.

I also knew, again from personal experience working with them, that giving them data (pencil and paper before eCOVE) helped them honestly reflect on their own actions and outcomes, and it also greatly diminished the fear factor that came with the ‘evaluator’ role of a supervisor.

When I first started working with administrators and eCOVE I was totally focused on changing their role from judge to support and staff development. I preached hard that working collaboratively would have great effects and would/could create a staff of self-directed professionals. I still strongly believe that, and have enough feedback to feel confirmed.

However, a recent conversation with an ex-student, now an administrator, has added to my perspective. He likes eCOVE and would love to use it except that his district has a 20 page (gulp!) evaluation system that he needs to complete while observing – so he doesn’t have the time to work with teachers. We agree that it’s a waste of time, and corrupts the opportunity for collaborative professionalism.

As I thought about his situation and the hours of development time that went into the creation and adoption of that ‘evaluation guide’, I realized that my approach to observation as staff development had ignored the reality of the required and necessary role of administrator as evaluator. The guide that he’s stuck with seems to me to be the main flaw in the process, and what I believe is wrong with it (and with the thousands in use across the country) is that these guides ask the observer to make a series of poorly defined judgments based on a vaguely defined set of ‘standards’. It’s an impossible task and is functionally a terrible and ineffectual burden on both administrators and teachers.

When I thought about how a standards-based system might be improved, I developed the basis for the idea in the PowerPoint: standards should be based on research; the implementation of the standard should be in some way observable, if not directly then by keystone indicators; and the criteria for an acceptable level of performance should be concrete and collaboratively determined. I say collaboratively since I believe that administrators, teachers, parents, and the general public all have value to add to the process of educating our youth. Setting those criteria in terms of observable behavior data should, again, be based on research and confirmed by localized action research efforts. That’s not as difficult as it sounds when the systematic process already includes data collection.

For the last couple of years, whenever I presented eCOVE I made a big point of saying that I was against set data targets for all teachers, that context played such a big part in it all that only the teacher could interpret the data. I think now that I was wrong about that, partially at least. A simple example might be wait time, the time between a question and calling on a student for an answer. There’s lots of research showing that a wait time of 3 seconds has consistent positive benefits. While I’m sure it’s not the exact time of 3 seconds that is critical, the researched recommendation is a useful concrete measure. If a teacher waits less than one second (the research finding on new teachers), the children are robbed of the opportunity to think, and that’s not OK. An important facet of the process I’m proposing has to do with how the data is presented and used. My experience has been that the first approach to a teacher should be “Is this what you thought was happening?” This question, honestly asked, will empower the teacher and engage him or her in the process of reflection, interpretation, and problem solving. During the ensuing professional-level discussion, the criterion for an acceptable level (in this case, a 3-second wait period) should be agreed on, and that’s the measure to be used in the final evaluation. For, in the end, a judgment does have to be made, but it should not be based on the observer’s opinion or value system, but on set measurable criteria, criteria set and confirmed by sound research.
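A worked sketch of the wait-time measure, with invented timestamps: each pair is (seconds when the question ended, seconds when a student was called on), and the average is compared informally to the roughly 3-second benchmark from the research. None of this is an actual eCOVE tool, just the underlying arithmetic.

```python
# Sketch only: wait time = seconds between the end of a question and
# calling on a student; the 3-second benchmark comes from the research
# cited above, and the sample data is invented.
question_end_and_callon = [(0.0, 0.8), (31.5, 32.5), (60.2, 63.5), (95.0, 95.9)]

waits = [called - ended for ended, called in question_end_and_callon]
average = sum(waits) / len(waits)
below_benchmark = sum(1 for w in waits if w < 3.0)

print(f"average wait: {average:.1f}s")                             # -> 1.5s
print(f"{below_benchmark} of {len(waits)} questions under 3 seconds")  # -> 3 of 4
```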

A more complex example is class learning time. The standard illustrated in the PowerPoint stated that ‘students should be engaged in learning’, a standard commonly included in most systems. There is extensive research indicating that the more time a student is engaged in learning activities, the greater the learning. While the research does not propose a specific percent of learning time as a recommended criterion, I believe we as a profession can at least identify the ranges for unsatisfactory, satisfactory, and exceptional. I think we’d all agree that if a class period had only 25% of the time organized for teaching and/or student engagement in learning activities, it would be absolutely unsatisfactory. Or is that number 35%? 45%? 60%? What educator would be comfortable with a class where 40% of the time lacked any opportunity for students to learn? I don’t know what the right number is, but I am confident that it is possible to come to a consensus on a minimum level. Class learning time is a good example of a keystone data set – something that underlies the basic concept in the standard ‘engaged students’. I know there are others.

But then my personal experience as a teacher comes into focus, and the objection “How can you evaluate me on something I don’t have full control over?” pops up. I remember my lesson plans not working out when the principal took 10 minutes with a PA announcement and there were 4 interruptions from people with important messages, requests for information, or requests for students. How could it be fair to be concerned about my 50% learning time when there were all these outside influences?

That would be a valid concern where the evaluation system is based on the observer’s perception and judgment, but less so when it is based on data collection. It is an easy task to set up the data collection to identify the non-learning time by sub-categories: time under the teacher’s control and time when an outside event took control away from the teacher. The time under the teacher’s control should meet the criteria for acceptable performance; the total time should be examined for needed systematic changes to provide the teacher with the full allotment of teaching/learning time. Basing the inspection of school functioning on observable behavior data will reveal many possible solutions for problems currently included in the observer’s impression of teaching effectiveness.
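Here is that split as arithmetic, with invented numbers: the learning-time percentage is reported both against the full period and against only the minutes the teacher actually controlled.

```python
# Sketch with invented numbers: separate teacher-controlled loss from
# externally imposed loss before judging the learning-time percentage.
total_minutes = 50
learning = 30          # minutes with an opportunity to learn
internal_loss = 8      # transitions, materials, etc. (teacher-controlled)
external_loss = 12     # PA announcements, office interruptions, etc.

assert learning + internal_loss + external_loss == total_minutes

raw_pct = 100 * learning / total_minutes                # 60%
teacher_controlled = total_minutes - external_loss      # 38 minutes
adjusted_pct = 100 * learning / teacher_controlled      # about 79%

print(f"learning time, full period:        {raw_pct:.0f}%")
print(f"learning time, teacher-controlled: {adjusted_pct:.0f}%")
```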

It’s reasonable to be suspicious of data collected and used as an external weapon, and for that reason I believe it to be critical that the identification of the keystone research and indicators, and the setting of the target level be a collaborative process. Add to that the realization that good research continues to give us new knowledge about teaching and learning, and with that the process should be in a constant state of discussion and revision. That’s my vision of how a profession works – critical self-examination and improvement.

So now my thinking has come to a point where I believe (tentatively, at least) that we have sufficient research to develop standards, or to better focus the standards we do have; that we can identify keystone indicators for those standards; that we can use our collective wisdom to determine concrete levels for acceptability in those keystone indicators; that we can train observers to accurately observe and gather data; and that that data can be used to both further the teacher’s self-directed professional growth and to ensure that the levels of effective performance as indicated by sound research are met.

I’m hoping that my colleagues in the education field (and beyond) will join in this discussion and thinking. What is your reaction? Can you give me “Yes, but what if…..?” instances? Do we really have the credible research to provide us with keystone indicators? How could a system like this be abused? How could we guard against the abuse?

Students and Data-Based Observation

December 15th, 2008

Lots of good teachers, me included, work quite hard to get students to think at a ‘higher level’. In Bloom’s Taxonomy this would be at the Analysis/Synthesis level, or in some thoughtful response to a divergent question. Thinking at a higher level about the content at hand would be great, but a deeper desire is just that they exercise their brains for more than stimulus-response game playing or repeating the obvious.

After years of challenging, encouraging, praising, and modeling, I came to the conclusion that higher-order thinking will only occur naturally (not forced) if the topic is related to the life of the student. In the broad sense, ‘related’ can be as simple as having fun… solving puzzles, creating new ideas… self-directed mental challenges that end with the intrinsic reward of a self-approved solution.

And now that I’m engrossed with data-based observation, I have discovered something quite interesting – give kids data on their own behavior, either as individuals or as a group, and they go immediately to the analytic level, and love it. They will reflect, think divergently, propose and test changes, and eagerly look forward to the next round of higher-order thinking.

And it’s a pretty easy step to transfer that analytic thinking to school related content — “Remember how 40% of your statements to each other were negative? How does that relate to the X versus Y conflict (take your pick)?” or “Compare your individual time-on-task rate with the campaign promises of President X (take your pick) for greater government efficiency.”

My observation is that the data collected needs to be real (not how many are wearing red, or how many pencils were dropped), and best if collaboratively identified as something of interest. Assigning a student to be the data gatherer further engages them.

Tools I’ve seen used with students include Time On Task, Positive/Negative, Verbal Tics, Bloom’s Taxonomy (levels of questions answered or asked by students), Teacher Travel (tracking what % of the time each part of the room was engaged in a discussion), and of course, the Generic Tools. Tracking a small group working on a project together and then presenting them with the % of time each member contributed to the discussion is enlightening.
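A sketch of that last example, with invented timings: total the seconds each student spoke and convert each total to a share of the group’s talk time.

```python
# Sketch with invented data: seconds of talk time per student in a
# small-group discussion, converted to each student's share.
talk_seconds = {"Student A": 310, "Student B": 95, "Student C": 40, "Student D": 155}

total = sum(talk_seconds.values())   # 600 seconds of discussion
for name, seconds in talk_seconds.items():
    print(f"{name}: {100 * seconds / total:.0f}% of the talk time")
# -> Student A: 52%, Student B: 16%, Student C: 7%, Student D: 26%
```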

Give them the data and ask “Is this what you thought was happening?” “Why/Why not?” “Is there a need for a change?”… and away you go.

Professional Development Rubrics

December 15th, 2008

There seems to be some conversation about the right term for these – rubrics, scoring guides, continuums, etc. – but I’m sure we are all picturing the same table of headings describing a scale from not-good to great.

In the business world, and somewhat in education, they are also called Behaviorally Anchored Rating Scales (BARS). I’m adding one word to that, making it Data-based Behaviorally Anchored Rating Scales (D-BARS). If you ever see that somewhere else, you can say you know where it started.

If you’re adopting, amending, or writing your own D-BARS, there are some errors to avoid lest the outcome be less than helpful to the observer and observee.

A very common error has to do with creating a continuum of behavior indicators. Since across the top of these documents is a scale that progresses from one extreme to the other, conceptually with no gaps or overlaps between the divisions, the D-BARS (the physical observable behavior that exemplifies each division) should also be a continuum. As I look at these documents from across the country and world, one of the most common errors is that the actual behavior being used as an indicator changes from one division/cell to another. It shouldn’t. What should be described is one behavior across the continuum, poor to great. An example…..

The target standard/behavior is “Teachers involve and guide all students in assessing their own learning.” The category headings are Unsatisfactory, Emerging, Basic, Proficient, and Distinguished.

The behavior indicator for the Unsatisfactory level is “Students do not assess their own learning.” That’s a clear statement, but there’s more to the unsatisfactory level than no assessment at all. Students might be assessing themselves once a year, unguided, inaccurately, using the wrong criteria, etc. The descriptor for Unsatisfactory should describe the range of indicators, all of which are unsatisfactory.

The next level, Emerging, has this as a behavior indicator: “Teacher checks student work and communicates progress through the report card.” This indicator is unrelated to student assessment of their own learning, and doesn’t provide the guidance for whoever would use the D-BARS to clearly be able to determine the difference between Unsatisfactory and Emerging. This statement might fit well in a target standard related to ‘communicating progress to students’, and in that standard might well fit in the ‘Emerging’ category.

Perhaps (and this is brainstorming – collaborative discussion needed)…

Emerging would be “Students are asked to state/guess what their grade on an assignment will be.” or “Students are asked to grade each other’s papers without the use of a scoring guide.” [Students are assessing their work related to grades, and with little guidance]

The Basic category could be something akin to “Students are assessed by the teacher according to a scoring guide and asked to describe why they agree/disagree with the grade.” [Students are asked to apply the scoring guide in their reflection, but do not actually self-assess]

A descriptor for the Proficient level might be “Using a teacher provided scoring guide, students are asked to assess their work before they hand it in to the teacher.” [Students assess their work according to a scoring guide]

And finally, the Distinguished level could read “Using collaboratively developed (teacher and students) scoring guides, students are engaged in self and peer assessment of progress toward meeting the standards.” [Students are engaged and guided in the process of creating the criteria, and then applying those criteria to themselves and others.]

I would hope that there would be discussion about my choices and wording, as this is only to illustrate the need for a continuum in the described behavior indicators.

The next step might be thinking about which keystone observable behaviors should be tracked to gather data on “involving and guiding students in assessing their own learning.” Is it the amount of time students are engaged in assessing learning? The number of references to standards made by the teacher? The number and/or level of questions asked by students related to assessment and standards? Every observer who makes a determination of level is doing so on the basis of something they see. We need to come to consensus concerning what’s valid and reliable.

Related Research

December 15th, 2008

I get regular requests to include the research that’s related to the various tools in the reports, and that was my original intent. Then I found that since I was no longer considered an ‘educator’ but a commercial enterprise instead, a different set of rules applies. I can’t legally copy abstracts or research documents and post them. I can point you to the research – BUT a great deal of the research is owned by on-line companies that charge a bundle to provide access. As a business, I can’t copy and post any of that, and I’ve been trying to find a way to provide the research information to you.

In a webinar today a participant came up with a new thought – he suggested that I include my observations of observations. For instance, we find that the first period of the day in middle and high school has a lower percentage of class learning time than the other periods (last period is the second lowest). Generally it’s because of outside interruptions (attendance, PA announcements, etc.), but if that’s not taken into account in scheduling, the students in those classes have a reduced opportunity to learn.

My hesitation is that there’s nothing in my observations of observations that remotely looks like research. I’m not setting out to determine the truth about a practice or condition; it’s just my conclusions based on a growing number of experiences. I’m afraid that my informal conclusions would take on a life of their own.

But another thought occurred to me that might strengthen it somewhat: how could I create a system where all of the observers could share their observation data (no names, of course) with me and each other? I think we’d all be better informed if we could get a feel for the global norm. Maybe? Serious researchers will cringe at this, but if we’re careful not to draw firm conclusions or apply the global data to a specific situation without careful inspection, it could be more valid than my single-person experience.
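The aggregation itself would be simple. Here is a sketch with invented submissions, summarizing shared learning-time percentages as a rough norm (median and range), with the usual caution against applying it blindly to any single classroom.

```python
# Sketch with invented submissions: anonymized learning-time percentages
# from many observations, summarized as a rough norm.
from statistics import median

submitted_pcts = [61, 74, 58, 69, 80, 66, 72, 55, 77, 63]

print(f"observations: {len(submitted_pcts)}")
print(f"median learning time: {median(submitted_pcts):.0f}%")
print(f"range: {min(submitted_pcts)}%-{max(submitted_pcts)}%")
```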

Also, I think I need to write a description of how easy it is to do some focused action research with eCOVE. That’s when it really starts to get useful to you and your particular situation.

So Please Tell Me….. do you want the formal research even if you have to pay someone (not me) for it? Would my informal conclusions be of value, in spite of their validity weaknesses? Would a global norm be useful and would you take the time to send in the data? Are you interested in a simple system for doing action research with eCOVE?