Integrating CAI Development and Empirical Research:

Opportunities and Responsibilities
 
 

Marvin J. Croy

Department of Philosophy

The University of North Carolina at Charlotte

Charlotte, NC, USA 28223


Published in the Computerized Logic Teaching Bulletin, vol. 4, no. 1, March, 1991, pp. 2-12.

The use of computers in education is often thought of as a means of putting sound pedagogical principles and techniques into practice. However, such use can also contribute to building the empirical foundations for those techniques. This can occur in two ways. First, CAI programs can collect data on student performance for the purpose of identifying prominent weaknesses and for investigating processes involved in mastering various tasks and learning particular subject matters. In every discipline, from the sciences to the humanities, there are important questions to be answered concerning the difficulties students have in learning and applying various concepts and techniques. CAI can aid in answering those questions. Second, CAI programs can be used to document student performance for the purpose of program evaluation. While providing instruction, computers can serve as data collection devices that provide precise behavioral measures, including response time, of student learning. CAI programs can thus facilitate their own evaluation. In both of these applications, computers function akin to other observational instruments in the sciences.
 

The logic CAI project described here has made some attempts to move in these directions. The results of these efforts are reported in Croy (1988a, 1988b, 1989). Here, an overview of these activities is given along with some encouragement for other projects to take up this task.
 
 

An Overview of Previous Findings

 

It has previously been observed that, when using an inference/replacement rule set, students applied "negative" rules (defined as any rule containing a negation sign) less successfully than "positive" rules (those containing no negation signs). This difference, moreover, was greater than that observed for inference versus replacement rules, and the degree to which students used these negative rules was found to be inversely correlated with rule application proficiency. That is, greater use of negative rules was associated with lower application success rates. In addition, particular patterns of rule misapplications (termed "pseudo-rules") can be identified.
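
As a rough illustration of the classification just described, the following Python sketch labels a rule "negative" when its schematic form contains a negation sign and then computes success rates for each class. The rule table, the notation, and the application log are all invented for illustration; they are not the rule set or the data from the studies cited.

```python
from collections import defaultdict

# Hypothetical rule table: rule name -> schematic form ('~' marks negation).
# These forms are illustrative only, not the rule set used in the studies.
RULE_FORMS = {
    "MP": "p > q, p / q",
    "MT": "p > q, ~q / ~p",
    "DS": "p v q, ~p / q",
    "Conj": "p, q / p & q",
    "DeM": "~(p & q) :: ~p v ~q",
    "DN": "p :: ~~p",
}


def is_negative(rule):
    """A rule counts as 'negative' if its form contains a negation sign."""
    return "~" in RULE_FORMS[rule]


def success_rates(applications):
    """applications: iterable of (rule_name, was_correct) pairs.

    Returns a dict mapping 'negative' / 'positive' to the fraction of
    correct applications in that class (only classes that occur appear).
    """
    counts = defaultdict(lambda: [0, 0])  # class -> [correct, total]
    for rule, ok in applications:
        cls = "negative" if is_negative(rule) else "positive"
        counts[cls][0] += int(ok)
        counts[cls][1] += 1
    return {cls: correct / total for cls, (correct, total) in counts.items()}


# Invented application log, for illustration only.
log = [
    ("MP", True), ("MP", True), ("Conj", True), ("Conj", False),
    ("MT", True), ("DeM", False), ("DS", True), ("DN", False),
]
rates = success_rates(log)
print(rates)
```

Classifying by the presence of a negation sign in the rule form, rather than by rule name, keeps the definition of "negative" in one place if the rule set changes.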
  

These observations can potentially be of immediate practical value. If students experience difficulties with negative rules, particular lessons can be devised to help remediate such difficulties both within the classroom and CAI programs. But before these efforts can begin, questions about the reliability and generalizability of the observations should be faced. What about the next semester? Will the next class of students also exhibit these weaknesses? Are the findings limited to a given problem set? What exactly are the conditions under which the difficulties occur? These questions are not purely of scientific interest. Implementing a plan to address observed weaknesses will require a variety of resources. The generalizability of the observed weaknesses will dictate the scope and usefulness of the remediation effort and the degree to which those resources are effectively used. The relevance of observed student difficulties and corresponding remediation to students in similar courses at other universities is also an issue of interest.
  

Resolving these issues is not a simple task. While statistical techniques are normally employed to address such issues, educational settings often work against their usefulness. Many statistical tests, for example, assume that the subjects observed have been randomly selected from some larger pool. In the case of a correlation (such as that reported above), a measure of statistical significance would indicate the likelihood that observations of the selected group would hold for members of the larger group as well. Students enrolled in a particular section during a particular semester, however, cannot be assumed to constitute a random sample of all students taking the course during the year. Even less can they be considered a useful sample for generalizing to all students studying logic at large. This impairs the ability to make inferences from one's own students to those attending universities elsewhere. This problem is significant given the way in which CAI programs and textbooks developed at one university are exported to a wide variety of educational institutions.
 

On the other hand, it should be emphasized that the unending procession of students through courses, coupled with data-collecting CAI programs, is an asset. Repetition of results affords some confidence in their reliability. The observed difficulty with negative rules, for example, has shown up each semester for several consecutive sessions. Consequently, it seems more reasonable to predict that the difficulties will continue to occur (given the same problem sets and sequence of exercises) than that they will not.
  

Nevertheless, questions can be raised concerning the observation that negation rules seem to be particularly troublesome. Would, for example, this observation hold up for all problem sets? One reason that it might not is that certain instantiations of a particular rule are more error-prone than others. Consequently, it might be that observed difficulties are a function of the particular problems (and application instances) involved rather than of any general characteristic of the rules themselves.
  

One approach to answering this question is to explore a variety of problem sets in various textbooks. The problems initially used in our studies are found in Copi's Introduction to Logic (seventh edition) and Butrick's Deduction and Analysis (revised edition). Students worked these problems in an introductory course which averages about 30 students each semester. During the Fall of 1988, an attempt was made to determine whether similar results would also show up with advanced students using problems from Kahane's Logic and Philosophy (fifth edition). Ten students in an upper level logic course worked justification exercises prior to facing full proof problems just as with the Copi problems. Observations of these students' performance (involving a total of 1,578 rule applications) confirmed much of what was found under the Copi problem set. For example, the difference between success rates for positive and negative rules (89.1% versus 83.7%) was greater than that between inference and replacement rules (86.4% versus 86.2%). The inverse correlation of negative rule use and overall rule application success rate was also evident. In fact, this correlation (-.83) was the highest ever observed. (This may be due to the fact that the Kahane text does a good job of forcing students to face the more difficult instances of negative rule applications.) By contrast, the correlation of success rates with the use of replacement rules or other connective-defined classes of rules does not begin to approach the magnitude of that found with negative rules.
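
The kind of correlation reported above can be sketched as follows. This is a minimal Pearson correlation written out by hand, assuming one pair of values per student: the share of that student's applications that used negative rules, and that student's overall success rate. The five data points are invented and do not reproduce the -.83 figure.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


# Invented per-student values, for illustration only.
negative_use_share = [0.20, 0.25, 0.30, 0.35, 0.40]
overall_success = [0.95, 0.90, 0.88, 0.85, 0.80]

r = pearson_r(negative_use_share, overall_success)
print(round(r, 2))  # negative: more negative-rule use, lower success
```

With only nine or ten students per course, a coefficient of this kind describes the observed group; as discussed earlier, it does not by itself license inference to a wider population.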
 
 

Very similar results were found in the Spring of 1990 using problems selected from Klenk's Understanding Symbolic Logic (second edition). A total of 1,016 applications made by nine advanced students working eight proof problems each were analyzed. While the differences are not as sharp, the same patterns described above emerge. A negative correlation (-.41) holds between the extent of negative rule use and overall application success rate. A similar, though somewhat weaker, inverse relation exists with respect to the use of replacement rules, but no other connective-defined class of rules shows relationships nearly as strong. So, once again, negative rules seem to contain an important source of difficulty for students attempting to master the application of symbolic transformation rules. Previous analyses have shown that these difficulties increase when these rules are applied to expressions already containing negations; for example, when DeMorgan's is applied to '~(~A & ~B)'. Moreover, these findings support the view that difficulties with negative rules are not merely a function of particular problem sets. Of additional interest is the fact that these advanced students working on more demanding problems displayed some of the same difficulties characteristic of introductory level students and problems.
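
To make the example concrete, the short Python sketch below checks by truth table what a correct application of DeMorgan's to '~(~A & ~B)' yields. The "mistaken" result is a hypothetical misapplication invented here for contrast; it is not one of the pseudo-rules documented in the earlier analyses.

```python
from itertools import product


def equivalent(f, g, n_vars=2):
    """True if f and g agree on every assignment of truth values."""
    return all(
        f(*vals) == g(*vals)
        for vals in product([False, True], repeat=n_vars)
    )


# ~(~A & ~B)
original = lambda a, b: not ((not a) and (not b))
# Correct DeMorgan's step: ~~A v ~~B
demorgan = lambda a, b: (not (not a)) or (not (not b))
# After double negation: A v B
simplified = lambda a, b: a or b
# Hypothetical misapplication (negations dropped, connective kept): A & B
mistaken = lambda a, b: a and b

print(equivalent(original, demorgan))    # True
print(equivalent(original, simplified))  # True
print(equivalent(original, mistaken))    # False
```

The stacked negations in the correct intermediate step, '~~A v ~~B', give some sense of why applying a negative rule to an already negated expression offers extra room for error.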
 
 

Program Evaluation

 

One thing that has become clear from these studies is the importance of being able to apply transition rules correctly. Close observation reveals many more types of difficulties than were previously suspected. Although the inability to construct proofs is certainly related to failures in strategic thinking, rule application deficiencies account for many troubles and may even interfere with some forms of strategic planning. These deficiencies, moreover, arise not from the inability to define rules but from failures to apply a given rule pattern to particular expressions correctly. Learning to apply rules correctly under a variety of trying circumstances thus constitutes a significant segment of mastering proof construction, and CAI programs which promote this learning are doubly valuable.
 

In light of this, we have spent considerable effort developing a program for this purpose. This program, named JUSTIFIED THOUGHT (hereafter, JT), provides exercises on applying rules of transition to a variety of particular expressions. The form of these exercises is similar to that found in many textbooks. Students must (1) accurately name particular rule applications, (2) supply concluding expressions when presented with premises, and (3) accept or reject potential applications on the basis of their legitimacy. In the Spring of 1989, we initiated what is hoped to be the first of several evaluations of the JT program. Thirty introductory level students practiced justification exercises prior to attempting proof construction just as in previous semesters. About half of the class (randomly selected and designated Group E) used the JT program while the other half (Group C) worked the identical exercises on paper, to be handed in and returned as regular homework assignments. Given that both groups faced identical exercises, it was expected that any subsequent performance differences observed would be small and would indicate the effectiveness of computer presentation and capabilities rather than of justification exercises themselves. These capabilities included the ability to immediately compare misapplications with rule forms and to receive more informative error messages for pseudo-rules.
 

One measure of the effectiveness of the JT program can be obtained by comparing the proof construction efforts of the two groups. After completing justification exercises (via JT or on paper), all students worked proofs using a data-collecting, proof-checking program (DEEP THOUGHT). Each student was assigned to one of four problem sets (a, b, c, d), each of which contained five proof problems. Since the use of negative rules appears to be a stumbling block, we have been interested in how well the JT program can help students overcome this weakness. Thus, student performance using these particular rules was compared across both groups (E and C) and the four problem sets. The performance measure taken was the total number of correct applications using negative rules minus the number of incorrect applications using those same rules. Figure 1 shows the results for each problem set and group. Each cell in Figure 1 contains two numbers, one representing the number of students in that condition and one representing average performance. For example, there were five students assigned to problem set b in Group E, and the average number of correct minus incorrect applications for these five students was thirteen. With the exception of problem set c, the students who used JT scored higher than those who did not, and the overall totals also favor Group E. However, the analysis of variance (ANOVA) used to analyze these data shows that the differences observed here are not quite large enough to be statistically reliable. The differences between Group E and Group C and among problem sets approach but do not achieve significance (P = .10 and P = .14, respectively) using the .05 criterion.
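
The performance measure and the cell layout described above can be sketched in Python as follows. The records are invented, and the F ratio shown is for a single factor only (group), which is a simplification: the comparison reported here involves both group and problem set, and no P value is computed in this sketch.

```python
from collections import defaultdict
from statistics import mean


def negative_rule_score(correct, incorrect):
    """Correct minus incorrect applications of negative rules."""
    return correct - incorrect


def cell_summary(records):
    """records: iterable of (group, problem_set, correct, incorrect).

    Returns {(group, problem_set): (n_students, mean_score)}.
    """
    cells = defaultdict(list)
    for group, pset, correct, incorrect in records:
        cells[(group, pset)].append(negative_rule_score(correct, incorrect))
    return {cell: (len(scores), mean(scores)) for cell, scores in cells.items()}


def one_way_f(groups):
    """F ratio for a one-factor analysis of variance.

    groups: list of lists of scores, one inner list per level of the factor.
    """
    all_scores = [x for g in groups for x in g]
    grand_mean = mean(all_scores)
    k, n = len(groups), len(all_scores)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Invented records: (group, problem set, correct, incorrect).
records = [
    ("E", "b", 15, 2), ("E", "b", 14, 3),
    ("C", "b", 12, 4), ("C", "b", 10, 2),
]
summary = cell_summary(records)
print(summary)

scores_e = [negative_rule_score(c, i) for g, _, c, i in records if g == "E"]
scores_c = [negative_rule_score(c, i) for g, _, c, i in records if g == "C"]
f_ratio = one_way_f([scores_e, scores_c])
print(f_ratio)
```

Because the measure subtracts errors from successes, a student who applies negative rules often but carelessly can score lower than one who applies them rarely but correctly.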
 

In addition to rule application proficiency, we were also interested in comparing the two groups with respect to the amount of time required to solve proof problems. Microcomputers (as opposed to time-sharing systems) provide a reliable mechanism for measuring this variable, but a number of other factors are currently restricting our ability to interpret these data in a meaningful way. One of these impediments involves the existence of unsolved problems. It is risky to compare the average time spent working on a problem for different groups of students when both sets of data contain unfinished problems. In order to minimize these risks, two problems were selected from the five-problem sequence assigned to each student. The first problem in the five-problem sequence was excluded from consideration to control for possible "warm-up" effects due to the newness of the CAI environment for Group C. The second and third problems in each problem set's sequence contained only four occurrences of unfinished problems (two for each group) out of a total of 58 problems worked, and the results for these two problems are summarized in Figure 2. The average time (minutes) spent working on these two problems is given for each group and problem set. For example, the average time required for the five students in Group E to solve the two problems selected from set b was 7.7 minutes. (One student made no attempt at either of these problems and was deleted from Group E, problem set c.) The results of this analysis are very similar to those presented above. Overall, Group C students solved their problems in less time, and this finding holds for each problem set except set c. Again, however, the differences observed here both between the groups and among problem sets are not quite large enough to attain statistical significance (P = .15 and P = .08, respectively).
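
The selection procedure just described might be sketched as below. Two points are assumptions made for this illustration and are not stated in the text: each student's time is taken as the total across the selected problems, and a student who attempted only one of the two is kept with that single time. The timing data are invented.

```python
from statistics import mean

# Positions in the five-problem sequence that enter the analysis.
# Problem 1 is excluded to control for possible warm-up effects.
SELECTED = (2, 3)


def cell_time(students, selected=SELECTED):
    """students: list of dicts mapping problem position -> minutes worked.

    A student with no attempt at any selected problem is dropped.
    Returns (n_students_kept, mean_total_minutes), or (0, None) if
    no student attempted a selected problem.
    """
    totals = []
    for times in students:
        attempted = [times[p] for p in selected if p in times]
        if attempted:  # keep only students who tried a selected problem
            totals.append(sum(attempted))
    if not totals:
        return (0, None)
    return (len(totals), mean(totals))


# Invented timing data (minutes) for one group/problem-set cell.
cell = [
    {1: 9.0, 2: 4.0, 3: 3.0},
    {1: 12.0, 2: 5.0, 3: 4.0},
    {1: 6.0},  # attempted neither selected problem; dropped
]
result = cell_time(cell)
print(result)
```

Unfinished problems are not handled specially in this sketch; as noted above, they are the main obstacle to interpreting time data of this kind.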

 

Interpretation of Results

 

These analyses should be taken as no more than initial, quasi-experimental explorations which can and should undergo continued refinement. There are several variables which still need to be controlled, and our plan is to conduct such studies about once per year, tightening them up progressively. The results will initially be treated as formative rather than summative evaluations. That is, they are used for guiding program development rather than for rating final products. Of course, development should evolve to a point where clear advantages of program use can be demonstrated.
 

In any event, the current results do not show any clear advantage to using the JT program, either as a means of overcoming the difficulties which attend the use of negative rules or as a means of expediting proof construction. The JT program did not prove itself to deliver the benefits of justification exercises any better than standard homework exercises did. (This is assuming that there are such benefits to be delivered, but without any control group of subjects working proofs without having previously worked justification exercises, that assumption is untested here.) Nevertheless, these results provide an important source of formative feedback, and the JT program is undergoing a number of modifications which, upon completion, will also be subject to evaluation.
 
 

Ethical Tightropes


Carrying out these empirical studies cannot proceed without walking an ethical tightrope in trying to balance diverse concerns and interests. This challenge is best seen not as a conflict between project needs and student needs but between short term and long term student interests. Clearly, collecting and analyzing data and publishing the results require steps that ensure privacy and anonymity. Data should be coded in ways that protect student identities. Particularly where students are divided into experimental and control groups, guarantees must be provided that no student is disadvantaged in terms of grades or learning opportunities. Individual projects will have to anticipate, seek out, and address particular ways in which students may be adversely affected by the development and use of CAI.
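
One possible way of coding data to protect student identities is sketched below in Python. It replaces each student identifier with a keyed hash, so that records from the same student can still be linked while the identifier itself never appears in the analysis file. The field names, the sample identifiers, and the key are hypothetical, and this is offered as one illustrative approach rather than a description of the procedure used in this project.

```python
import hashlib
import hmac


def pseudonym(student_id, secret_key, length=10):
    """Derive a stable code for a student from a private key.

    The same student_id and key always give the same code; without the
    key, the code cannot be recomputed from a class roster.
    """
    digest = hmac.new(
        secret_key, student_id.encode("utf-8"), hashlib.sha256
    ).hexdigest()
    return digest[:length]


def code_records(records, secret_key):
    """Return copies of the records with 'student_id' replaced by a code."""
    coded = []
    for rec in records:
        rec = dict(rec)  # copy so the raw records are left untouched
        rec["student"] = pseudonym(rec.pop("student_id"), secret_key)
        coded.append(rec)
    return coded


# Placeholder key and invented records, for illustration only.
KEY = b"replace-with-a-private-key"
raw = [
    {"student_id": "jdoe17", "rule": "MT", "correct": False},
    {"student_id": "jdoe17", "rule": "MP", "correct": True},
    {"student_id": "asmith3", "rule": "DeM", "correct": True},
]
coded = code_records(raw, KEY)
print(coded)
```

The key should be stored apart from the coded data; anyone holding both the key and a roster could regenerate the codes.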
  

Recommendations relevant to this issue are discussed in a case analysis given by Moor (1986) and Overall (1986). In fact, the evaluation reported here was carried out in line with Moor's suggestion that, should statistically significant differences in grades exist between control and experimental groups, the disadvantaged group should be compensated by the amount of that difference. No such difference existed between the exam or homework assignment grades of Group E versus Group C students, however. We also offered students the opportunity of changing groups after random assignment (but none chose to do so). Providing this opportunity jeopardizes the experimental design, but this may be acceptable in early stages of development where evaluations play the formative role of providing directions for further development and indications of when well-controlled studies are needed. Had the JT program appeared to be highly effective in this preliminary study, development could have been discontinued and more serious evaluation initiated.
 
 

Organizing the Effort

 

If these suggestions were followed, CAI development projects would simultaneously become educational research projects. Issues worth investigating are plentiful. In every discipline, questions about aspects of the subject matter that prove most troublesome are worth pursuing. The same can be said for identifying effective techniques of remediation. With the possible exception of mathematics, these questions have not been systematically pursued. Since CAI programs are often designed to either accompany or follow particular textbooks, the pedagogical techniques employed in those textbooks are also prime targets for evaluation.
 

It is not clear, however, how the burden of this research should be distributed, and there is no doubt that it is in fact a burden. Would the interests of developing effective CAI be best served if each project incorporated a research component into its overall structure? Or should particular projects focus on these efforts while others emphasize other components of CAI development? Whatever the answer, the need for widespread research is evident given the diversity of settings in which CAI is developed and adopted. (Suggestions related to these themes have also been made by Millican (1988) and Twidale (1989).)
 
 

In order to be productive, these studies need to be both long-term and coordinated with similar research efforts at other universities. A long-run coordinated effort could increase the generalizability and reliability of findings. For example, projects using similar rule sets for proof construction could jointly investigate the difficulties associated with those rules. Also, comparisons could be made among students with different characteristics at different universities working on identical problem sets. It should be expected that one project, or even a few working in isolation, will not be able to accomplish much. But wherever students are using computers to learn, the opportunity exists for observing and recording actual student strengths and weaknesses. Processing and analyzing these data adds a new dimension to the CAI effort. It seems clear that it is the responsibility of the CAI movement to conduct this research, and that responsibility will be most fully met by a widespread effort in which many projects participate.
 
 

References


Butrick, R. (1981). Deduction and Analysis, revised edition. Washington, D.C.: University Press of America.
 
 

Copi, I. (1986). Introduction to Logic, seventh edition. New York: Macmillan.
 
 

Croy, M. (1988a). "Computer-Assisted Instruction and Rule Applications for Deductive Proof Construction," Collegiate Microcomputer, VI, 51-56.
 
 

_______. (1988b). "The Use of CAI to Enhance Human Interaction in the Learning of Deductive Proof Construction," Computers and the Humanities, 22, 277-284.
 
 

_______. (1989). "CAI and Empirical Explorations of Deductive Proof Construction," Computers and Philosophy Newsletter, 4, 111-127.
 
 

Kahane, H. (1986). Logic and Philosophy, fifth edition. Belmont, CA: Wadsworth Publishing Company.
 
 

Klenk, V. (1989). Understanding Symbolic Logic, second edition. Englewood Cliffs: Prentice-Hall.
 
 

Millican, P. (1988). "Prospects and Problems for Computers in Logic Teaching," Computerized Logic Teaching Bulletin, 1, 32-38.
 
 

Moor, J. (1986). "Computer-Assisted Instruction and the Guinea Pig Dilemma," Teaching Philosophy, 9, 351-354.
 
 

Overall, C. (1986). "Innovation and Injustice: Commentary on 'I'm Not a Guinea Pig'," Teaching Philosophy, 9, 354-358.
 
 

Twidale, M. (1989). "Explicit Planning and Instantiation as a Means of Facilitating Student Computer Dialog," Computerized Logic Teaching Bulletin, 2, 2-12.