The Oxford Handbook of Thinking and Reasoning


35 Scientific Thinking and Reasoning

Kevin N. Dunbar, Department of Human Development and Quantitative Methodology, University of Maryland, College Park, MD

David Klahr, Department of Psychology, Carnegie Mellon University, Pittsburgh, PA

Published: 21 November 2012

Scientific thinking refers to both thinking about the content of science and the set of reasoning processes that permeate the field of science: induction, deduction, experimental design, causal reasoning, concept formation, hypothesis testing, and so on. Here we cover both the history of research on scientific thinking and the different approaches that have been used, highlighting common themes that have emerged over the past 50 years of research. Future research will focus on the collaborative aspects of scientific thinking, on effective methods for teaching science, and on the neural underpinnings of the scientific mind.

There is no unitary activity called “scientific discovery”; there are activities of designing experiments, gathering data, inventing and developing observational instruments, formulating and modifying theories, deducing consequences from theories, making predictions from theories, testing theories, inducing regularities and invariants from data, discovering theoretical constructs, and others. — Simon, Langley, & Bradshaw, 1981, p. 2

What Is Scientific Thinking and Reasoning?

There are two kinds of thinking we call “scientific.” The first, and most obvious, is thinking about the content of science. People are engaged in scientific thinking when they are reasoning about such entities and processes as force, mass, energy, equilibrium, magnetism, atoms, photosynthesis, radiation, geology, or astrophysics (and, of course, cognitive psychology!). The second kind of scientific thinking includes the set of reasoning processes that permeate the field of science: induction, deduction, experimental design, causal reasoning, concept formation, hypothesis testing, and so on. However, these reasoning processes are not unique to scientific thinking: They are the very same processes involved in everyday thinking. As Einstein put it:

The scientific way of forming concepts differs from that which we use in our daily life, not basically, but merely in the more precise definition of concepts and conclusions; more painstaking and systematic choice of experimental material, and greater logical economy. (The Common Language of Science, 1941, reprinted in Einstein, 1950 , p. 98)

Nearly 40 years after Einstein's remarkably insightful statement, Francis Crick offered a similar perspective: that great discoveries in science result not from extraordinary mental processes but from rather commonplace ones. The greatness of the discovery lies in the thing discovered.

I think what needs to be emphasized about the discovery of the double helix is that the path to it was, scientifically speaking, fairly commonplace. What was important was not the way it was discovered, but the object discovered—the structure of DNA itself. (Crick, 1988, p. 67; emphasis added)

Under this view, scientific thinking involves the same general-purpose cognitive processes—such as induction, deduction, analogy, problem solving, and causal reasoning—that humans apply in nonscientific domains. These processes are covered in several different chapters of this handbook: Rips, Smith, & Medin, Chapter 11 on induction; Evans, Chapter 8 on deduction; Holyoak, Chapter 13 on analogy; Bassok & Novick, Chapter 21 on problem solving; and Cheng & Buehner, Chapter 12 on causality. One might question the claim that the highly specialized procedures associated with doing science in the “real world” can be understood by investigating the thinking processes used in laboratory studies of the sort described in this volume. However, when the focus is on major scientific breakthroughs, rather than on the more routine, incremental progress in a field, the psychology of problem solving provides a rich source of ideas about how such discoveries might occur. As Simon and his colleagues put it:

It is understandable, if ironic, that ‘normal’ science fits … the description of expert problem solving, while ‘revolutionary’ science fits the description of problem solving by novices. It is understandable because scientific activity, particularly at the revolutionary end of the continuum, is concerned with the discovery of new truths, not with the application of truths that are already well-known … it is basically a journey into unmapped terrain. Consequently, it is mainly characterized, as is novice problem solving, by trial-and-error search. The search may be highly selective—but it reaches its goal only after many halts, turnings, and back-trackings. (Simon, Langley, & Bradshaw, 1981 , p. 5)

The research literature on scientific thinking can be roughly categorized according to the two types of scientific thinking listed in the opening paragraph of this chapter: (1) One category focuses on thinking that directly involves scientific content. Such research ranges from studies of young children reasoning about the sun-moon-earth system (Vosniadou & Brewer, 1992) to college students reasoning about chemical equilibrium (Davenport, Yaron, Klahr, & Koedinger, 2008), to research that investigates collaborative problem solving by world-class researchers in real-world molecular biology labs (Dunbar, 1995). (2) The other category focuses on “general” cognitive processes, but it tends to do so by analyzing people's problem-solving behavior when they are presented with relatively complex situations that involve the integration and coordination of several different types of processes, and that are designed to capture some essential features of “real-world” science in the psychology laboratory (Bruner, Goodnow, & Austin, 1956; Klahr & Dunbar, 1988; Mynatt, Doherty, & Tweney, 1977).

There are a number of overlapping research traditions that have been used to investigate scientific thinking. We will cover both the history of research on scientific thinking and the different approaches that have been used, highlighting common themes that have emerged over the past 50 years of research.

A Brief History of Research on Scientific Thinking

Science is often considered one of the hallmarks of the human species, along with art and literature. Illuminating the thought processes used in science thus reveals key aspects of the human mind. The thought processes underlying scientific thinking have fascinated both scientists and nonscientists because the products of science have transformed our world and because the process of discovery is shrouded in mystery. Scientists talk of the chance discovery, the flash of insight, the years of perspiration, and the voyage of discovery. These images of science have helped make the mental processes underlying the discovery process intriguing to cognitive scientists as they attempt to uncover what really goes on inside the scientific mind and how scientists really think. Furthermore, the possibilities that scientists can be taught to think better by avoiding mistakes that have been clearly identified in research on scientific thinking, and that their scientific process could be partially automated, make scientific thinking a topic of enduring interest.

The cognitive processes underlying scientific discovery and day-to-day scientific thinking have been a topic of intense scrutiny and speculation for almost 400 years (e.g., Bacon, 1620; Galilei, 1638; Klahr, 2000; Tweney, Doherty, & Mynatt, 1981). Understanding the nature of scientific thinking has been a central issue not only for our understanding of science but also for our understanding of what it is to be human. Bacon's Novum Organum in 1620 sketched out some of the key features of the ways that experiments are designed and data interpreted. Over the ensuing 400 years philosophers and scientists vigorously debated the appropriate methods that scientists should use (see Giere, 1993). These debates over the appropriate methods for science typically resulted in the espousal of a particular type of reasoning method, such as induction or deduction. It was not until the Gestalt psychologists began working on the nature of human problem solving, during the 1940s, that experimental psychologists began to investigate the cognitive processes underlying scientific thinking and reasoning.

The Gestalt psychologist Max Wertheimer pioneered the investigation of scientific thinking (of the first type described earlier: thinking about scientific content ) in his landmark book Productive Thinking (Wertheimer, 1945 ). Wertheimer spent a considerable amount of time corresponding with Albert Einstein, attempting to discover how Einstein generated the concept of relativity. Wertheimer argued that Einstein had to overcome the structure of Newtonian physics at each step in his theorizing, and the ways that Einstein actually achieved this restructuring were articulated in terms of Gestalt theories. (For a recent and different account of how Einstein made his discovery, see Galison, 2003 .) We will see later how this process of overcoming alternative theories is an obstacle that both scientists and nonscientists need to deal with when evaluating and theorizing about the world.

One of the first investigations of scientific thinking of the second type (i.e., collections of general-purpose processes operating on complex, abstract components of scientific thought) was carried out by Jerome Bruner and his colleagues at Harvard (Bruner et al., 1956). They argued that a key activity engaged in by scientists is to determine whether a particular instance is a member of a category. For example, a scientist might want to discover which substances undergo fission when bombarded by neutrons and which substances do not. Here, scientists have to discover the attributes that make a substance undergo fission. Bruner et al. saw scientific thinking as the testing of hypotheses and the collecting of data with the end goal of determining whether something is a member of a category. They invented a paradigm where people were required to formulate hypotheses and collect data that test their hypotheses. In one type of experiment, the participants were shown a card such as one with two borders and three green triangles. The participants were asked to determine the concept that this card represented by choosing other cards and getting feedback from the experimenter as to whether the chosen card was an example of the concept. In this case the participant may have thought that the concept was green and chosen a card with two green squares and one border. If the underlying concept was green, then the experimenter would say that the card was an example of the concept. In terms of scientific thinking, choosing a new card is akin to conducting an experiment, and the feedback from the experimenter is similar to knowing whether a hypothesis is confirmed or disconfirmed. Using this approach, Bruner et al. identified a number of strategies that people use to formulate and test hypotheses. They found that a key factor determining which hypothesis-testing strategy people use is the amount of memory capacity that the strategy takes up (see also Morrison & Knowlton, Chapter 6; Medin et al., Chapter 11). Another key factor that they discovered was that it was much more difficult for people to discover negative concepts (e.g., not blue) than positive concepts (e.g., blue). Although Bruner et al.'s research is most commonly viewed as work on concepts, they saw their work as uncovering a key component of scientific thinking.
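To make the logic of this paradigm concrete, the following minimal sketch simulates a concept-attainment loop of the general kind Bruner et al. described: an "experimenter" gives yes/no feedback about a hidden single-attribute concept, and a simple hypothesis-elimination routine stands in for the participant. The attribute values and the elimination strategy are illustrative assumptions, not Bruner et al.'s actual materials or their participants' strategies.

```python
# Toy concept-attainment loop (illustrative; not Bruner, Goodnow, & Austin's procedure).
from itertools import product

ATTRIBUTES = {
    "color":   ["green", "red", "blue"],
    "shape":   ["triangle", "square", "circle"],
    "number":  [1, 2, 3],
    "borders": [1, 2, 3],
}

def make_cards():
    keys = list(ATTRIBUTES)
    return [dict(zip(keys, values)) for values in product(*ATTRIBUTES.values())]

def is_example(card, concept):
    # The hidden concept is a single attribute-value pair, e.g. ("color", "green").
    attr, value = concept
    return card[attr] == value

def eliminate(chosen_cards, concept):
    # Start with every single-attribute hypothesis, then drop any hypothesis that
    # disagrees with the experimenter's yes/no feedback on a chosen card.
    hypotheses = [(a, v) for a, values in ATTRIBUTES.items() for v in values]
    for card in chosen_cards:                    # each card choice is an "experiment"
        feedback = is_example(card, concept)     # experimenter's yes/no answer
        hypotheses = [h for h in hypotheses if is_example(card, h) == feedback]
    return hypotheses

if __name__ == "__main__":
    cards = make_cards()
    remaining = eliminate(cards[:10], concept=("color", "green"))
    print("Hypotheses still consistent with the feedback:", remaining)
```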

A second early line of research on scientific thinking was developed by Peter Wason and his colleagues (Wason, 1968). Like Bruner et al., Wason saw a key component of scientific thinking as being the testing of hypotheses. Whereas Bruner et al. focused on the different types of strategies that people use to formulate hypotheses, Wason focused on whether people adopt a strategy of trying to confirm or disconfirm their hypotheses. Using Popper's (1959) theory that scientists should try to falsify rather than confirm their hypotheses, Wason devised a deceptively simple task in which participants were given three numbers, such as 2-4-6, and were asked to discover the rule underlying the three numbers. Participants were asked to generate other triads of numbers and the experimenter would tell the participant whether the triad was consistent or inconsistent with the rule. They were told that when they were sure they knew what the rule was they should state it. Most participants began the experiment by thinking that the rule was even numbers increasing by 2. They then attempted to confirm their hypothesis by generating a triad like 8-10-12, then 14-16-18. These triads are consistent with the rule and the participants were told yes, that the triads were indeed consistent with the rule. However, when they proposed the rule—even numbers increasing by 2—they were told that the rule was incorrect. The correct rule was numbers of increasing magnitude! From this research, Wason concluded that people try to confirm their hypotheses, whereas normatively speaking, they should try to disconfirm their hypotheses. One implication of this research is that confirmation bias is not just restricted to scientists but is a general human tendency.
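The structure of the 2-4-6 task can be made explicit with a short sketch (the triads below are illustrative). The key point is that every positive test of the narrower "even numbers increasing by 2" hypothesis is also consistent with the broader true rule, so confirming feedback alone can never reveal that the hypothesis is too narrow.

```python
# Toy version of Wason's 2-4-6 task showing why positive tests fail to falsify here.
def true_rule(triad):                 # the experimenter's hidden rule
    a, b, c = triad
    return a < b < c                  # any numbers of increasing magnitude

def hypothesis(triad):                # the typical initial hypothesis
    a, b, c = triad
    return a % 2 == 0 and b == a + 2 and c == b + 2

positive_tests = [(8, 10, 12), (14, 16, 18), (20, 22, 24)]   # expected to fit the hypothesis
negative_test = (1, 3, 9)                                    # a triad the hypothesis forbids

for triad in positive_tests:
    print(triad, "-> experimenter says:", "yes" if true_rule(triad) else "no")

print(negative_test, "-> experimenter says:", "yes" if true_rule(negative_test) else "no")
# The "yes" on (1, 3, 9) is what falsifies the increase-by-2 hypothesis; the positive
# tests alone could never have revealed that it was too narrow.
```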

It was not until the 1970s that a general account of scientific reasoning was proposed. Herbert Simon, often in collaboration with Allen Newell, proposed that scientific thinking is a form of problem solving. He proposed that problem solving is a search in a problem space. Newell and Simon's theory of problem solving is discussed in many places in this handbook, usually in the context of specific problems (see especially Bassok & Novick, Chapter 21). Herbert Simon, however, devoted considerable time to understanding many different scientific discoveries and scientific reasoning processes. The common thread in his research was that scientific thinking and discovery are not mysterious, magical processes but rather processes of problem solving in which clear heuristics are used. Simon's goal was to articulate the heuristics that scientists use in their research at a fine-grained level. By constructing computer programs that simulated the process of several major scientific discoveries, Simon and colleagues were able to articulate the specific computations that scientists could have used in making those discoveries (Langley, Simon, Bradshaw, & Zytkow, 1987; see the section on “Computational Approaches to Scientific Thinking”). Particularly influential was Simon and Lea's (1974) work demonstrating that concept formation and induction consist of a search in two problem spaces: a space of instances and a space of rules. This idea has influenced problem-solving accounts of scientific thinking that will be discussed in the next section.

Overall, the work of Bruner, Wason, and Simon laid the foundations for contemporary research on scientific thinking. Early research on scientific thinking is summarized in Tweney, Doherty, and Mynatt's 1981 book On Scientific Thinking, where they sketched out many of the themes that have dominated research on scientific thinking over the past few decades. Other more recent books such as Cognitive Models of Science (Giere, 1993), Exploring Science (Klahr, 2000), Cognitive Basis of Science (Carruthers, Stich, & Siegal, 2002), and New Directions in Scientific and Technical Thinking (Gorman, Kincannon, Gooding, & Tweney, 2004) provide detailed analyses of different aspects of scientific discovery. Another important collection is Vosniadou's handbook on conceptual change research (Vosniadou, 2008). In this chapter, we discuss the main approaches that have been used to investigate scientific thinking.

How does one go about investigating the many different aspects of scientific thinking? One common approach to the study of the scientific mind has been to investigate several key aspects of scientific thinking using abstract tasks designed to mimic some essential characteristics of “real-world” science. Numerous methodologies have been used to analyze the genesis of scientific concepts, theories, hypotheses, and experiments: researchers have run experiments, collected verbal protocols, built computer programs, and analyzed particular scientific discoveries. A more recent development has been to increase the ecological validity of such research by investigating scientists as they reason “live” (in vivo studies of scientific thinking) in their own laboratories (Dunbar, 1995, 2002). From a “Thinking and Reasoning” standpoint the major aspects of scientific thinking that have been most actively investigated are problem solving, analogical reasoning, hypothesis testing, conceptual change, collaborative reasoning, inductive reasoning, and deductive reasoning.

Scientific Thinking as Problem Solving

One of the primary goals of accounts of scientific thinking has been to provide an overarching framework to understand the scientific mind. One framework that has had a great influence in cognitive science is that scientific thinking and scientific discovery can be conceived as a form of problem solving. As noted in the opening section of this chapter, Simon (1977; Simon, Langley, & Bradshaw, 1981) argued that both scientific thinking in general and problem solving in particular could be thought of as a search in a problem space. A problem space consists of all the possible states of a problem and all the operations that a problem solver can use to get from one state to the next. According to this view, by characterizing the types of representations and procedures that people use to get from one state to another it is possible to understand scientific thinking. Thus, scientific thinking can be characterized as a search in various problem spaces (Simon, 1977). Simon investigated a number of scientific discoveries by bringing participants into the laboratory, providing the participants with the data that a scientist had access to, and getting the participants to reason about the data and rediscover a scientific concept. He then analyzed the verbal protocols that participants generated and mapped out the types of problem spaces that the participants searched in (e.g., Qin & Simon, 1990). Kulkarni and Simon (1988) used a more historical approach to uncover the problem-solving heuristics that Krebs used in his discovery of the urea cycle. Kulkarni and Simon analyzed Krebs's diaries and proposed a set of problem-solving heuristics that he used in his research. They then built a computer program incorporating the heuristics and biological knowledge that Krebs had before he made his discoveries. Of particular importance are the search heuristics that the program uses, which include experimental proposal heuristics and data interpretation heuristics. A key heuristic was an unusualness heuristic that focused on unusual findings, which guided search through a space of theories and a space of experiments.

Klahr and Dunbar ( 1988 ) extended the search in a problem space approach and proposed that scientific thinking can be thought of as a search through two related spaces: an hypothesis space and an experiment space. Each problem space that a scientist uses will have its own types of representations and operators used to change the representations. Search in the hypothesis space constrains search in the experiment space. Klahr and Dunbar found that some participants move from the hypothesis space to the experiment space, whereas others move from the experiment space to the hypothesis space. These different types of searches lead to the proposal of different types of hypotheses and experiments. More recent work has extended the dual-space approach to include alternative problem-solving spaces, including those for data, instrumentation, and domain-specific knowledge (Klahr & Simon, 1999 ; Schunn & Klahr, 1995 , 1996 ).
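A minimal sketch of what coordinated search in two spaces might look like is given below. The candidate rules, the simulated "device," and the discrimination heuristic are illustrative assumptions rather than Klahr and Dunbar's task; the sketch simply shows how the current contents of the hypothesis space can constrain which experiment is chosen next.

```python
# Toy dual-space search: hypotheses constrain the choice of experiments, and
# experimental outcomes prune the hypothesis space (illustrative only).
hypothesis_space = {
    "doubles":  lambda x: 2 * x,
    "squares":  lambda x: x * x,
    "adds_two": lambda x: x + 2,
}
experiment_space = [0, 1, 2, 3, 5, 10]      # inputs we could feed the device

def device(x):                              # the true mechanism, unknown to the searcher
    return x * x

def most_discriminating(experiments, hypotheses):
    # Heuristic: pick the experiment whose predicted outcomes differ most across
    # the hypotheses still under consideration.
    return max(experiments, key=lambda x: len({h(x) for h in hypotheses.values()}))

while len(hypothesis_space) > 1:
    exp = most_discriminating(experiment_space, hypothesis_space)
    outcome = device(exp)
    hypothesis_space = {name: h for name, h in hypothesis_space.items()
                        if h(exp) == outcome}
    print(f"ran input {exp}, observed {outcome}; remaining hypotheses: {list(hypothesis_space)}")
```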

Scientific Thinking as Hypothesis Testing

Many researchers have regarded testing specific hypotheses predicted by theories as one of the key attributes of scientific thinking. Hypothesis testing is the process of evaluating a proposition by collecting evidence regarding its truth. Experimental cognitive research on scientific thinking that specifically examines this issue has tended to fall into two broad classes of investigations. The first class is concerned with the types of reasoning that lead scientists astray, thus blocking scientific ingenuity. A large amount of research has been conducted on the potentially faulty reasoning strategies that both participants in experiments and scientists use, such as considering only one favored hypothesis at a time, and on how such strategies prevent scientists from making discoveries. The second class is concerned with uncovering the mental processes underlying the generation of new scientific hypotheses and concepts. This research has tended to focus on the use of analogy and imagery in science, as well as the use of specific types of problem-solving heuristics.

Turning first to investigations of what diminishes scientific creativity, philosophers, historians, and experimental psychologists have devoted a considerable amount of research to “confirmation bias.” This occurs when scientists only consider one hypothesis (typically the favored hypothesis) and ignore other alternative hypotheses or potentially relevant hypotheses. This important phenomenon can distort the design of experiments, formulation of theories, and interpretation of data. Beginning with the work of Wason (1968) and as discussed earlier, researchers have repeatedly shown that when participants are asked to design an experiment to test a hypothesis they will predominantly design experiments that they think will yield results consistent with the hypothesis. Using the 2-4-6 task mentioned earlier, Klayman and Ha (1987) showed that in situations where one's hypothesis is likely to be confirmed, seeking confirmation is a normatively incorrect strategy, whereas when the probability of confirming one's hypothesis is low, then attempting to confirm one's hypothesis can be an appropriate strategy. Historical analyses by Tweney (1989), concerning the way that Faraday made his discoveries, and experiments investigating people testing hypotheses, have revealed that people use a “confirm early, disconfirm late” strategy: When people initially generate or are given hypotheses, they try to gather evidence that is consistent with the hypothesis. Once enough evidence has been gathered, then people attempt to find the boundaries of their hypothesis and often try to disconfirm their hypotheses.
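Klayman and Ha's point can be illustrated with a small enumeration (a toy sketch, not their formal analysis): whether a positive test can falsify depends on how the hypothesized rule overlaps the true rule. The specific rules and number ranges below are invented for illustration.

```python
# Positive versus negative tests of "even numbers increasing by 2" under two
# hypothetical true rules (illustrative enumeration over triads of 1..10).
from itertools import product

triads = list(product(range(1, 11), repeat=3))

def hypothesis(t):                 # the hypothesized rule (narrow)
    a, b, c = t
    return a % 2 == 0 and b == a + 2 and c == b + 2

def true_broad(t):                 # true rule broader than the hypothesis (2-4-6 case)
    a, b, c = t
    return a < b < c

def true_overlapping(t):           # true rule that only partly overlaps the hypothesis
    a, b, c = t
    return a % 2 == 0 and c - a == 4 and a <= 4

def falsification_rate(test_is_positive, truth):
    tests = [t for t in triads if hypothesis(t) == test_is_positive]
    # A test falsifies when the experimenter's answer differs from what the hypothesis predicts.
    return sum(truth(t) != hypothesis(t) for t in tests) / len(tests)

for name, truth in [("broad true rule", true_broad), ("overlapping true rule", true_overlapping)]:
    print(f"{name}: positive-test falsification rate = {falsification_rate(True, truth):.3f}, "
          f"negative-test falsification rate = {falsification_rate(False, truth):.3f}")
# Under the broad rule, positive tests can never falsify; under the overlapping rule,
# positive tests are far more likely to falsify than negative ones.
```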

In an interesting variant on the confirmation bias paradigm, Gorman (1989) showed that when participants are told that there is the possibility of error in the data that they receive, participants assume that any data that are inconsistent with their favored hypothesis are due to error. Thus, the possibility of error “insulates” hypotheses against disconfirmation. This intriguing hypothesis has not been confirmed by other researchers (Penner & Klahr, 1996), but it warrants further investigation.

Confirmation bias is very difficult to overcome. Even when participants are asked to consider alternate hypotheses, they will often fail to conduct experiments that could potentially disconfirm their hypothesis. Tweney and his colleagues provide an excellent overview of this phenomenon in their classic monograph On Scientific Thinking (1981). The precise reasons for this type of block are still widely debated. Researchers such as Michael Doherty have argued that working memory limitations make it difficult for people to consider more than one hypothesis. Consistent with this view, Dunbar and Sussman (1995) have shown that when participants are asked to hold irrelevant items in working memory while testing hypotheses, the participants will be unable to switch hypotheses in the face of inconsistent evidence. While working memory limitations are involved in confirmation bias, they are not the whole story: even groups of scientists can display it. The controversy over cold fusion, for example, illustrates confirmation bias: large groups of scientists had alternative explanations available for their data yet maintained their favored hypotheses in the face of more standard alternatives. Mitroff (1974) provides some interesting examples of NASA scientists demonstrating confirmation bias, which highlight the roles of commitment and motivation in this process. See also MacPherson and Stanovich (2007) for specific strategies that can be used to overcome confirmation bias.

Causal Thinking in Science

Much of scientific thinking and scientific theory building pertains to the development of causal models between variables of interest. For example, do vaccines cause illnesses? Do carbon dioxide emissions cause global warming? Does water on a planet indicate that there is life on the planet? Scientists and nonscientists alike are constantly bombarded with statements regarding the causal relationship between such variables. How does one evaluate the status of such claims? What kinds of data are informative? How do scientists and nonscientists deal with data that are inconsistent with their theory?

A central issue in the causal reasoning literature, one that is directly relevant to scientific thinking, is the extent to which scientists and nonscientists alike are governed by the search for causal mechanisms (i.e., how a variable works) versus the search for statistical data (i.e., how often variables co-occur). This dichotomy can be boiled down to the search for qualitative versus quantitative information about the paradigm the scientist is investigating. Researchers from a number of cognitive psychology laboratories have found that people prefer to gather more information about an underlying mechanism than covariation between a cause and an effect (e.g., Ahn, Kalish, Medin, & Gelman, 1995 ). That is, the predominant strategy that students in simulations of scientific thinking use is to gather as much information as possible about how the objects under investigation work, rather than collecting large amounts of quantitative data to determine whether the observations hold across multiple samples. These findings suggest that a central component of scientific thinking may be to formulate explicit mechanistic causal models of scientific events.
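The contrast between the two kinds of evidence can be made concrete with a toy computation. Delta-P is one standard covariation index from the causal reasoning literature, defined as P(effect | cause) − P(effect | no cause); the contingency counts below are invented purely for illustration.

```python
# Covariation evidence (Delta-P from a 2x2 contingency table) versus a mechanism query.
def delta_p(cause_effect, cause_no_effect, no_cause_effect, no_cause_no_effect):
    """Delta-P = P(effect | cause) - P(effect | no cause)."""
    p_e_given_c = cause_effect / (cause_effect + cause_no_effect)
    p_e_given_not_c = no_cause_effect / (no_cause_effect + no_cause_no_effect)
    return p_e_given_c - p_e_given_not_c

# Hypothetical counts: did plants given a fertilizer bloom?
print(delta_p(cause_effect=30, cause_no_effect=10,
              no_cause_effect=15, no_cause_no_effect=25))   # 0.75 - 0.375 = 0.375

# The mechanism-oriented question people tend to prefer is qualitative instead:
# "How does the fertilizer make the plant bloom?"
```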

One type of situation in which causal reasoning has been observed extensively is when scientists obtain unexpected findings. Both historical and naturalistic research has revealed that reasoning causally about unexpected findings plays a central role in science. Indeed, scientists themselves frequently state that a finding was due to chance or was unexpected. Given that claims of unexpected findings are such a frequent component of scientists' autobiographies and interviews in the media, Dunbar ( 1995 , 1997 , 1999 ; Dunbar & Fugelsang, 2005 ; Fugelsang, Stein, Green, & Dunbar, 2004 ) decided to investigate the ways that scientists deal with unexpected findings. In 1991–1992 Dunbar spent 1 year in three molecular biology laboratories and one immunology laboratory at a prestigious U.S. university. He used the weekly laboratory meeting as a source of data on scientific discovery and scientific reasoning. (He termed this type of study “in vivo” cognition.) When he looked at the types of findings that the scientists made, he found that over 50% of the findings were unexpected and that these scientists had evolved a number of effective strategies for dealing with such findings. One clear strategy was to reason causally about the findings: Scientists attempted to build causal models of their unexpected findings. This causal model building results in the extensive use of collaborative reasoning, analogical reasoning, and problem-solving heuristics (Dunbar, 1997 , 2001 ).

Many of the key unexpected findings that scientists reasoned about in the in vivo studies of scientific thinking were inconsistent with the scientists' preexisting causal models. A laboratory equivalent of the biology labs involved creating a situation in which students obtained unexpected findings that were inconsistent with their preexisting theories. Dunbar and Fugelsang ( 2005 ) examined this issue by creating a scientific causal thinking simulation where experimental outcomes were either expected or unexpected. Dunbar ( 1995 ) has called the study of people reasoning in a cognitive laboratory “in vitro” cognition. These investigators found that students spent considerably more time reasoning about unexpected findings than expected findings. In addition, when assessing the overall degree to which their hypothesis was supported or refuted, participants spent the majority of their time considering unexpected findings. An analysis of participants' verbal protocols indicates that much of this extra time was spent formulating causal models for the unexpected findings. Similarly, scientists spend more time considering unexpected than expected findings, and this time is devoted to building causal models (Dunbar & Fugelsang, 2004 ).

Scientists know that unexpected findings occur often, and they have developed many strategies to take advantage of their unexpected findings. One of the most important places that they anticipate the unexpected is in designing experiments (Baker & Dunbar, 2000 ). They build different causal models of their experiments incorporating many conditions and controls. These multiple conditions and controls allow unknown mechanisms to manifest themselves. Thus, rather than being the victims of the unexpected, they create opportunities for unexpected events to occur, and once these events do occur, they have causal models that allow them to determine exactly where in the causal chain their unexpected finding arose. The results of these in vivo and in vitro studies all point to a more complex and nuanced account of how scientists and nonscientists alike test and evaluate hypotheses about theories.

The Roles of Inductive, Abductive, and Deductive Thinking in Science

One of the most basic characteristics of science is that scientists assume that the universe we live in follows predictable rules. Scientists reason using a variety of different strategies to make new scientific discoveries. Three frequently used reasoning strategies are inductive, abductive, and deductive reasoning. In the case of inductive reasoning, a scientist may observe a series of events and try to discover a rule that governs them. Once a rule is discovered, scientists can extrapolate from the rule to formulate theories of observed and yet-to-be-observed phenomena. One example is the discovery using inductive reasoning that a certain type of bacterium is a cause of many ulcers (Thagard, 1999). In a fascinating series of articles, Thagard documented the reasoning processes that Marshall and Warren went through in proposing this novel hypothesis. One key reasoning process was the use of induction by generalization. Marshall and Warren noted that almost all patients with gastritis had a spiral bacterium in their stomachs, and they formed the generalization that this bacterium is the cause of stomach ulcers. There are numerous other examples of induction by generalization in science, such as Tycho Brahe's induction about the motion of planets from his observations, Dalton's use of induction in chemistry, and the discovery of prions as the source of mad cow disease. Many theories of induction have used scientific discovery and reasoning as examples of this important reasoning process.

Another common type of inductive reasoning is to map a feature of one member of a category to another member of a category. This is called categorical induction. This type of induction is a way of projecting a known property of one item onto another item that is from the same category. Thus, knowing that the Rous Sarcoma virus is a retrovirus that uses RNA rather than DNA, a biologist might assume that another virus that is thought to be a retrovirus also uses RNA rather than DNA. While research on this type of induction typically has not been discussed in accounts of scientific thinking, this type of induction is common in science. For an influential contribution to this literature, see Smith, Shafir, and Osherson ( 1993 ), and for reviews of this literature see Heit ( 2000 ) and Medin et al. (Chapter 11 ).

While less commonly mentioned than inductive reasoning, abductive reasoning is an important form of reasoning that scientists use when they are seeking to propose explanations for events such as unexpected findings (see Lombrozo, Chapter 14; Magnani et al., 2010). In Figure 35.1, taken from King (2011), the differences between inductive, abductive, and deductive thinking are highlighted. In the case of abduction, the reasoner attempts to generate explanations of the form “if situation X had occurred, could it have produced the current evidence I am attempting to interpret?” (For an interesting analysis of abductive reasoning, see the brief paper by Klahr & Masnick, 2001.) Of course, as in classical induction, such reasoning may produce a plausible account that is still not the correct one. However, abduction does involve the generation of new knowledge, and is thus also related to research on creativity.

Fig. 35.1 The different processes underlying inductive, abductive, and deductive reasoning in science. (Figure reproduced from King, 2011.)

Turning now to deductive thinking, many thinking processes that scientists adhere to follow traditional rules of deductive logic. These processes correspond to conditions in which a conclusion necessarily follows from, or is deducible from, a hypothesis. Though they are not always phrased in syllogistic form, deductive arguments can be phrased as “syllogisms,” or as brief, mathematical statements in which the premises lead to the conclusion. Deductive reasoning is an extremely important aspect of scientific thinking because it underlies a large component of how scientists conduct their research. By looking at many scientific discoveries, we can often see that deductive reasoning is at work. Deductive reasoning statements all contain information or rules that state an assumption about how the world works, as well as a conclusion that would necessarily follow from the rule. Numerous discoveries in physics, such as the discovery of dark matter by Vera Rubin, are based on deductions. In the dark matter case, Rubin measured galactic rotation curves and, based on the differences between the predicted and observed angular motions of galaxies, deduced that the structure of the universe was uneven. This led her to propose that dark matter existed. In contemporary physics, the CERN Large Hadron Collider is being used to search for the Higgs boson. The Higgs boson is a deductive prediction from contemporary physics. If the Higgs boson is not found, it may lead to a radical revision of the nature of physics and a new understanding of mass (Hecht, 2011).
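The rotation-curve deduction can be illustrated with a rough worked example (the masses, radii, and speeds below are illustrative stand-ins, not Rubin's data): if only the visible mass were present, orbital speed should fall off with radius, so a flat observed rotation curve implies that the enclosed mass must keep growing with radius, i.e., that unseen mass is present.

```python
# Toy rotation-curve deduction: flat observed speed implies enclosed mass grows with radius.
import math

G = 6.674e-11            # gravitational constant, m^3 kg^-1 s^-2
M_visible = 5.0e40       # assumed visible mass, kg (illustrative)
v_observed = 2.0e5       # flat observed orbital speed, m/s (illustrative)

for r in (1e20, 2e20, 4e20):                        # radii in metres
    v_predicted = math.sqrt(G * M_visible / r)      # Keplerian prediction from visible mass
    M_needed = v_observed ** 2 * r / G              # mass implied by the flat observed speed
    print(f"r = {r:.0e} m: predicted v = {v_predicted:,.0f} m/s, "
          f"mass needed for observed v = {M_needed:.2e} kg")
# The required mass grows linearly with r and exceeds M_visible at every radius here,
# which is the premise-to-conclusion step behind the dark matter inference.
```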

The Roles of Analogy in Scientific Thinking

One of the most widely mentioned reasoning processes used in science is analogy. Scientists use analogies to form a bridge between what they already know and what they are trying to explain, understand, or discover. In fact, many scientists have claimed that the making of certain analogies was instrumental in their making a scientific discovery, and almost all scientific autobiographies and biographies feature one particular analogy that is discussed in depth. Coupled with the fact that there has been an enormous research program on analogical thinking and reasoning (see Holyoak, Chapter 13 ), we now have a number of models and theories of analogical reasoning that suggest how analogy can play a role in scientific discovery (see Gentner, Holyoak, & Kokinov, 2001 ). By analyzing several major discoveries in the history of science, Thagard and Croft ( 1999 ), Nersessian ( 1999 , 2008 ), and Gentner and Jeziorski ( 1993 ) have all shown that analogical reasoning is a key aspect of scientific discovery.

Traditional accounts of analogy distinguish between two components of analogical reasoning: the target and the source (Holyoak, Chapter 13; Gentner, 2010). The target is the concept or problem that a scientist is attempting to explain or solve. The source is another piece of knowledge that the scientist uses to understand the target or to explain the target to others. What the scientist does when making an analogy is to map features of the source onto features of the target. By mapping the features of the source onto the target, new features of the target may be discovered, or the features of the target may be rearranged so that a new concept is invented and a scientific discovery is made. For example, a common analogy used with computers is to describe a harmful piece of software as a computer virus. Once a piece of software is called a virus, people can map features of biological viruses onto it, such as being small, spreading easily, self-replicating using a host, and causing damage. People map not only individual features of the source onto the target but also systems of relations among those features. For example, if a computer virus is similar to a biological virus, then an immune system can be created on computers to protect them from future variants of a virus. One of the reasons that scientific analogy is so powerful is that it can generate new knowledge, such as the creation of a computational immune system having many of the features of a real biological immune system. This analogy also leads to the prediction that new computer viruses will appear that are the computational equivalent of retroviruses, lacking standard instructions (the analog of DNA) and thereby eluding the computational immune system.
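To make the mapping step concrete, the following toy sketch (our own hypothetical illustration in Python, not an implementation of any published model such as the structure-mapping engine) aligns shared relations between the biological source and the computational target and carries an unmapped source relation over as a candidate inference, mirroring how the virus analogy suggests inventing a "computational immune system." All of the predicate and entity names below are invented for the example.

```python
# Toy analogy sketch: relations in the source (biology) and target (computing)
# are written as predicate -> (argument, argument) pairs.

source = {  # biological domain
    "infects": ("virus", "host cell"),
    "replicates_in": ("virus", "host cell"),
    "defends_against": ("immune system", "virus"),
}

target = {  # computational domain, only partially understood
    "infects": ("malware", "computer"),
    "replicates_in": ("malware", "computer"),
}

# Correspondences obtained by aligning the relations the two domains share.
mapping = {"virus": "malware", "host cell": "computer"}

# Any source relation whose arguments can be translated, but which is missing
# from the target, becomes a candidate inference about the target.
for relation, args in source.items():
    if relation not in target:
        inferred = tuple(mapping.get(a, f"<analog of {a}>") for a in args)
        print(f"Candidate inference: {relation}{inferred}")
# Prints: Candidate inference: defends_against('<analog of immune system>', 'malware')
# i.e., the analogy suggests positing a computational immune system.
```

In real models of analogy the correspondences are computed from the relational structure itself rather than supplied by hand, but the sketch shows why mapping systems of relations, rather than isolated surface features, is what licenses new inferences.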

The process of making an analogy involves a number of key steps: retrieval of a source from memory, aligning the features of the source with those of the target, mapping features of the source onto those of the target, and possibly making new inferences about the target. Scientific discoveries are made when the source highlights a hitherto unknown feature of the target or restructures the target into a new set of relations. Interestingly, research on analogy has shown that participants do not easily use remote analogies (see Gentner et al., 1997 ; Holyoak & Thagard 1995 ). Participants in experiments tend to focus on the sharing of a superficial feature between the source and the target, rather than the relations among features. In his in vivo studies of science, Dunbar ( 1995 , 2001 , 2002 ) investigated the ways that scientists use analogies while they are conducting their research and found that scientists use both relational and superficial features when they make analogies. Whether they use superficial or relational features depends on their goals. If their goal is to fix a problem in an experiment, their analogies are based upon superficial features. However, if their goal is to formulate hypotheses, they focus on analogies based upon sets of relations. One important difference between scientists and participants in experiments is that the scientists have deep relational knowledge of the processes that they are investigating and can hence use this relational knowledge to make analogies (see Holyoak, Chapter 13 for a thorough review of analogical reasoning).

Are scientific analogies always useful? Sometimes analogies can lead scientists and students astray. For example, Evelyn Fox-Keller (1985) shows how an analogy between the pulsing of a lighthouse and the activity of the slime mold Dictyostelium led researchers astray for a number of years. Likewise, the analogy between the solar system (the source) and the structure of the atom (the target) has been shown to be potentially misleading to students taking more advanced courses in physics or chemistry. The solar system analogy misaligns with the structure of the atom in several ways: electrons repel rather than attract one another, and electrons do not have individual orbits like planets but occupy clouds of electron density. Furthermore, students often hold serious misconceptions about the solar system itself, which can compound their misunderstanding of the atom (Fischler & Lichtfeldt, 1992). While analogy is a powerful tool in science, like all forms of induction it can lead to incorrect conclusions.

Conceptual Change in Science

Scientific knowledge continually accumulates as scientists gather evidence about the natural world. Over extended time, this knowledge accumulation leads to major revisions, extensions, and new organizational forms for expressing what is known about nature. Indeed, these changes are so substantial that philosophers of science speak of “revolutions” in a variety of scientific domains (Kuhn, 1962 ). The psychological literature that explores the idea of revolutionary conceptual change can be roughly divided into (a) investigations of how scientists actually make discoveries and integrate those discoveries into existing scientific contexts, and (b) investigations of nonscientists ranging from infants, to children, to students in science classes. In this section we summarize the adult studies of conceptual change, and in the next section we look at its developmental aspects.

Scientific concepts, like all concepts, can be characterized as containing a variety of “knowledge elements”: representations of words, thoughts, actions, objects, and processes. At certain points in the history of science, the accumulated evidence has demanded major shifts in the way these collections of knowledge elements are organized. This “radical conceptual change” process (see Keil, 1999 ; Nersessian 1998 , 2002 ; Thagard, 1992 ; Vosniadou 1998, for reviews) requires the formation of a new conceptual system that organizes knowledge in new ways, adds new knowledge, and results in a very different conceptual structure. For more recent research on conceptual change, The International Handbook of Research on Conceptual Change (Vosniadou, 2008 ) provides a detailed compendium of theories and controversies within the field.

While conceptual change in science is usually characterized by large-scale changes in concepts that occur over extended periods of time, it has been possible to observe conceptual change using in vivo methodologies. Dunbar (1995) reported a major conceptual shift in a group of immunologists who obtained a series of unexpected findings that forced them to propose a new concept in immunology, which in turn forced changes in other concepts. The driver of this conceptual change was a series of unexpected findings, or anomalies, that required the scientists both to revise and to reorganize their conceptual knowledge. Interestingly, this conceptual change was achieved by a group of scientists reasoning collaboratively, rather than by a scientist working alone. Different scientists tend to work on different aspects of concepts, and on different concepts, that when put together lead to a rapid change in entire conceptual structures.

Overall, accounts of conceptual change in individuals indicate that it is similar to conceptual change in entire scientific fields. Individuals need to be confronted with anomalies that their preexisting theories cannot explain before entire conceptual structures are overthrown. However, replacement conceptual structures have to be generated before the old conceptual structure can be discarded. Sometimes people never overthrow their original conceptual theories and maintain their original views of many fundamental scientific concepts throughout their lives. Whether people actually possess naive theories, or whether they merely appear to have naive theories because of the demand characteristics of the testing context, is a lively source of debate within the science education community (see Gupta, Hammer, & Redish, 2010).

Scientific Thinking in Children

Well before their first birthday, children appear to know several fundamental facts about the physical world. For example, studies with infants show that they behave as if they understand that solid objects endure over time (e.g., objects do not simply disappear and reappear, they cannot move through each other, and they move as a result of collisions with other solid objects or the force of gravity) (Baillargeon, 2004; Carey, 1985; Cohen & Cashon, 2006; Duschl, Schweingruber, & Shouse, 2007; Gelman & Baillargeon, 1983; Gelman & Kalish, 2006; Mandler, 2004; Metz, 1995; Munakata, Casey, & Diamond, 2004). Even 6-month-olds are able to predict the future location of a moving object that they are attempting to grasp (Von Hofsten, 1980; Von Hofsten, Feng, & Spelke, 2000). In addition, they appear to be able to make nontrivial inferences about causes and their effects (Gopnik et al., 2004).

The similarities between children's thinking and scientists' thinking have an inherent allure and an internal contradiction. The allure resides in the enthusiastic wonder and openness with which both children and scientists approach the world around them. The contradiction comes from the fact that different investigators of children's thinking have reached diametrically opposed conclusions about just how "scientific" children's thinking really is. Some claim support for the "child as scientist" position (Brewer & Samarapungavan, 1991; Gelman & Wellman, 1991; Gopnik, Meltzoff, & Kuhl, 1999; Karmiloff-Smith, 1988; Sodian, Zaitchik, & Carey, 1991; Samarapungavan, 1992), while others offer serious challenges to this view (Fay & Klahr, 1996; Kern, Mirels, & Hinshaw, 1983; Kuhn, Amsel, & O'Laughlin, 1988; Schauble & Glaser, 1990; Siegler & Liebert, 1975). Such fundamentally incommensurate conclusions suggest that this very field, children's scientific thinking, is ripe for a conceptual revolution!

A recent comprehensive review (Duschl, Schweingruber, & Shouse, 2007 ) of what children bring to their science classes offers the following concise summary of the extensive developmental and educational research literature on children's scientific thinking:

Children entering school already have substantial knowledge of the natural world, much of which is implicit.

What children are capable of at a particular age is the result of a complex interplay among maturation, experience, and instruction. What is developmentally appropriate is not a simple function of age or grade, but rather is largely contingent on children's prior opportunities to learn.

Students' knowledge and experience play a critical role in their science learning, influencing four aspects of science understanding: (a) knowing, using, and interpreting scientific explanations of the natural world; (b) generating and evaluating scientific evidence and explanations; (c) understanding how scientific knowledge is developed in the scientific community; and (d) participating in scientific practices and discourse.

Students learn science by actively engaging in the practices of science.

In the previous section of this chapter we discussed conceptual change with respect to scientific fields and undergraduate science students. However, the idea that children undergo radical conceptual change, in which old "theories" need to be overthrown and reorganized, has been a central topic in understanding changes in scientific thinking both in children and across the life span. This radical conceptual change is thought to be necessary for acquiring many new concepts in physics and is regarded as the major source of difficulty for students. The factors at the root of this conceptual shift have been difficult to determine, although there have been a number of studies in cognitive development (Carey, 1985; Chi, 1992; Chi & Roscoe, 2002), in the history of science (Thagard, 1992), and in physics education (Clement, 1982; Mestre, 1991) that give detailed accounts of the changes in knowledge representation that occur as people switch from one way of representing scientific knowledge to another.

One area in which students show great difficulty in understanding scientific concepts is physics. Analyses of students' changing conceptions, using interviews, verbal protocols, and behavioral outcome measures, indicate that large-scale changes in students' concepts occur during physics education (see McDermott & Redish, 1999, for a review of this literature). Following Kuhn (1962), many (but not all) researchers have noted that students' changing conceptions resemble the sequences of conceptual changes that have occurred in the history of physics. These notions of radical paradigm shifts, and the ensuing incompatibility with past knowledge states, have called attention to interesting parallels between the development of particular scientific concepts in children and in the history of physics. Investigations of nonphysicists' understanding of motion indicate that students hold extensive misconceptions about it. Some researchers have interpreted these findings as an indication that many people hold erroneous beliefs about motion similar to a medieval "impetus" theory (McCloskey, Caramazza, & Green, 1980). Furthermore, students appear to maintain "impetus" notions even after one or two courses in physics. In fact, some authors have noted that students who have taken one or two courses in physics can perform worse on physics problems than naive students (Mestre, 1991). Thus, it is only after extensive learning that we see a conceptual shift from impetus theories of motion to Newtonian scientific theories.
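A concrete version of the kind of item used in these studies (our own illustration, in the spirit of the curved-tube problems described by McCloskey et al., 1980): a ball is shot through a spiral tube lying flat on a frictionless table and exits the opening with some speed. Once outside the tube, no net horizontal force acts on the ball, so Newton's laws give

\[
\vec{F}_{\text{net}} = m\,\vec{a} = \vec{0} \quad\Longrightarrow\quad \vec{v} = \text{constant},
\]

that is, the ball leaves along the tangent and travels in a straight line at constant speed. The typical "impetus" answer is instead that the ball has acquired a curvilinear impetus from the tube and continues to curve for a while before straightening out.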

How one's conceptual representation shifts from "naive" to Newtonian is a matter of contention: some have argued that the shift involves a radical conceptual change, whereas others have argued that the conceptual change is never really complete. For example, Kozhevnikov and Hegarty (2001) argue that many naive impetus notions of motion are maintained at the expense of Newtonian principles even after extensive training in physics, but that these impetus principles are maintained at an implicit level. Thus, although students can give the correct Newtonian answer to problems, their reaction times indicate that they are also drawing on impetus theories when they respond. An alternative view asks whether there are coherent naive theories to be changed at all. Gupta, Hammer, and Redish (2010) and diSessa (2004) have conducted detailed investigations of changes in physics students' accounts of phenomena covered in elementary physics courses. They have found that rather than possessing a naive theory that is replaced by the standard theory, many introductory physics students have no stable physical theory but rather construct their explanations from elementary pieces of knowledge of the physical world.

Computational Approaches to Scientific Thinking

Computational approaches have provided an increasingly complete account of the scientific mind. Computational models provide specific, detailed accounts of the cognitive processes underlying scientific thinking. Early computational work consisted of taking a scientific discovery and building computational models of the reasoning processes involved in it. Langley, Simon, Bradshaw, and Zytkow (1987) built a series of programs that simulated discoveries such as those of Copernicus, Bacon, and Stahl. These programs had various inductive reasoning algorithms built into them, and when given the data that the original scientists used, they were able to propose the same rules. Computational models make it possible to propose detailed accounts of the cognitive subcomponents of scientific thinking that specify exactly how scientific theories are generated, tested, and amended (see Darden, 1997, and Shrager & Langley, 1990, for accounts of this branch of research). More recently, the incorporation of scientific knowledge into computer programs has resulted in a shift in emphasis from using programs to simulate past discoveries to building programs that help scientists make new discoveries. A number of these programs have made novel discoveries. For example, Valdes-Perez (1994) has built systems for discoveries in chemistry, and Fajtlowicz has done so in mathematics (Erdos, Fajtlowicz, & Staton, 1991).
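To give a flavor of the kind of inductive, data-driven search such programs perform, the sketch below is a deliberately simplified, hypothetical stand-in written in Python (it is not the BACON code of Langley et al., 1987): it searches small integer exponents for a combination of planetary distance D and period P that stays invariant across planets, and with Kepler's data it recovers the third law, D^3/P^2 = constant.

```python
# A much-simplified stand-in for a BACON-style search: brute-force small integer
# exponents a, b and report the monomial D^a * P^b whose value varies least
# across the planets. (Illustrative toy code, not the original system.)

from itertools import product
from statistics import mean, pstdev

# Orbital data: mean distance from the Sun (AU) and orbital period (years).
planets = {
    "Mercury": (0.387, 0.241),
    "Venus":   (0.723, 0.615),
    "Earth":   (1.000, 1.000),
    "Mars":    (1.524, 1.881),
    "Jupiter": (5.203, 11.862),
    "Saturn":  (9.537, 29.457),
}

def variation(values):
    """Coefficient of variation: small means 'roughly constant'."""
    m = mean(values)
    return pstdev(values) / abs(m)

best = None
for a, b in product(range(-3, 4), repeat=2):
    if a == 0 and b == 0:
        continue
    vals = [(d ** a) * (p ** b) for d, p in planets.values()]
    cv = variation(vals)
    if best is None or cv < best[0]:
        best = (cv, a, b)

cv, a, b = best
print(f"Most nearly constant combination: D^{a} * P^{b} (CV = {cv:.4f})")
# Expected: D^3 * P^-2 (or the equivalent D^-3 * P^2), i.e., Kepler's third law.
```

With more variables this brute-force idea quickly becomes intractable, which is why the original programs relied on heuristics (for example, forming a ratio or product of two terms that covary) rather than exhaustive search.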

These advances in computational discovery have led to new fields, conferences, journals, and even departments that specialize in developing programs devised to search large databases in the hope of making new scientific discoveries (Langley, 2000, 2002), a process commonly known as "data mining." This approach has become viable only relatively recently, owing to advances in computer technology. Biswal et al. (2010), Mitchell (2009), and Yang (2009) provide recent reviews of data mining in different scientific fields. Data mining is now at the core of drug discovery, our understanding of the human genome, and our understanding of the universe for a number of reasons. First, vast databases concerning drug actions, biological processes, the genome, the proteome, and the universe itself now exist. Second, high-throughput data-mining algorithms make it possible to search for new drug targets, novel biological mechanisms, and new astronomical phenomena in relatively short periods of time. Research programs that once took decades, such as the development of penicillin, can now be carried out in days (Yang, 2009).

Another recent shift in the use of computers in scientific discovery has been to have computers and people make discoveries together, rather than expecting computers to make entire discoveries on their own. Instead of mimicking the whole discovery process as carried out by humans, computers can apply powerful algorithms that search for patterns in large databases and pass those patterns to humans, who then use the output to make discoveries ranging from the structure of the human genome to the structure of the universe. However, some robots, such as ADAM, developed by King (2011), can perform the entire scientific process, from the generation of hypotheses to the conduct of experiments and the interpretation of results, with little human intervention. The ongoing development of scientific robots (King et al., 2009) thus continues the tradition started by Herbert Simon in the 1960s, although controversies over whether such a robot is a "real scientist" continue to the present (Evans & Rzhetsky, 2010; Gianfelici, 2010; Haufe, Elliott, Burian, & O'Malley, 2010; O'Malley, 2011).
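The logic of such a closed hypothesize-experiment-interpret loop can be sketched schematically. The toy Python example below is our own illustration, not ADAM's actual architecture; the candidate models, "doses," and tolerance are invented for the example. The loop keeps a pool of rival hypotheses, runs the experiment on which they disagree most, and discards the hypotheses that the result refutes.

```python
# Schematic sketch of a closed-loop "robot scientist" cycle (toy example).

# Candidate hypotheses: rival models of how an outcome depends on a dose x.
hypotheses = {
    "linear":     lambda x: 2.0 * x,
    "saturating": lambda x: 10.0 * x / (5.0 + x),
    "quadratic":  lambda x: 0.4 * x * x,
}

def true_process(x):
    """The hidden 'nature' being investigated (unknown to the loop)."""
    return 10.0 * x / (5.0 + x)

candidate_doses = [0.5, 1.0, 2.0, 4.0, 8.0]

while len(hypotheses) > 1:
    # Experiment design: pick the dose on which surviving hypotheses disagree most.
    def disagreement(x):
        preds = [h(x) for h in hypotheses.values()]
        return max(preds) - min(preds)
    x = max(candidate_doses, key=disagreement)

    observed = true_process(x)   # "run" the experiment

    # Interpretation: keep only the hypotheses consistent with the observation.
    hypotheses = {name: h for name, h in hypotheses.items()
                  if abs(h(x) - observed) < 0.5}
    print(f"dose={x}: surviving hypotheses -> {sorted(hypotheses)}")
```

The interesting design choice, which real systems share in far more sophisticated form, is that the experiment is chosen to discriminate among hypotheses rather than to confirm a favored one.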

Scientific Thinking and Science Education

Accounts of the nature of science and research on scientific thinking have had profound effects on science education at many levels, particularly in recent years. Science education from the 1900s until the 1970s was primarily concerned with teaching students both the content of science (such as Newton's laws of motion) and the methods that scientists use in their research (such as using experimental and control groups). Beginning in the 1980s, a number of reports (e.g., American Association for the Advancement of Science, 1993; National Commission on Excellence in Education, 1983; Rutherford & Ahlgren, 1991) stressed the need for teaching scientific thinking skills rather than just methods and content. The addition of scientific thinking skills to the science curriculum from kindergarten through adulthood was a major shift in focus. Many of the particular scientific thinking skills that have been emphasized are skills covered in previous sections of this chapter, such as deductive and inductive thinking strategies. However, rather than focusing on one particular skill, such as induction, researchers in education have focused on how the different components of scientific thinking are put together in science. Furthermore, science educators have focused on situations in which science is conducted collaboratively, rather than being the product of one person thinking alone. These changes in science education parallel changes in the methodologies used to investigate science, such as analyzing the ways that scientists think and reason in their laboratories.

By looking at science as a complex, multilayered, and group activity, many researchers in science education have adopted a constructivist approach. This approach sees learning as an active rather than a passive process, and it suggests that students learn by constructing their own scientific knowledge. We will first describe a few examples of the constructivist approach to science education. Following that, we will address several lines of work that challenge some of its assumptions.

Often the goal of constructivist science education is to produce conceptual change through guided instruction in which the teacher or professor acts as a guide to discovery rather than the keeper of all the facts. One recent and influential approach is inquiry-based learning, which poses a problem or puzzling event to students and asks them to propose a hypothesis that could explain it. Next, the students are asked to collect data that test the hypothesis, draw conclusions, and then reflect upon both the original problem and the thought processes they used to solve it. Often students use computers that aid in their construction of new knowledge and allow them to practice many of the different components of scientific thinking. For example, Reiser and his colleagues have developed a learning environment for biology in which students are encouraged to develop hypotheses in groups, codify the hypotheses, and search databases to test them (Reiser et al., 2001).

One of the myths of science is the lone scientist suddenly shouting "Eureka, I have made a discovery!" Instead, in vivo studies of scientists (e.g., Dunbar, 1995, 2002), historical analyses of scientific discoveries (Nersessian, 1999), and studies of children learning science at museums have all pointed to collaborative discovery mechanisms as one of the driving forces of science (Atkins et al., 2009; Azmitia & Crowley, 2001). During collaborative scientific thinking there is usually a triggering event, such as an unexpected result or a situation that a student does not understand. Other members of the group then add new information to the person's representation of knowledge, often contributing new inductions and deductions that both challenge and transform the reasoner's old representations (Chi & Roscoe, 2002; Dunbar, 1998). Social mechanisms thus play a key role in fostering changes in concepts, a role that has been ignored in traditional cognitive research but is crucial for both science and science education. In science education there has been a shift toward collaborative learning, particularly at the elementary level; in university education, however, the emphasis is still on the individual scientist. As many domains of science now involve collaborations across scientific disciplines, we expect the explicit teaching of heuristics for collaborative science to increase.

What is the best way to teach and learn science? Surprisingly, the answer to this question has been difficult to uncover. For example, toward the end of the last century, influenced by several thinkers who advocated a constructivist approach to learning, ranging from Piaget (Beilin, 1994 ) to Papert ( 1980 ), many schools answered this question by adopting a philosophy dubbed “discovery learning.” Although a clear operational definition of this approach has yet to be articulated, the general idea is that children are expected to learn science by reconstructing the processes of scientific discovery—in a range of areas from computer programming to chemistry to mathematics. The premise is that letting students discover principles on their own, set their own goals, and collaboratively explore the natural world produces deeper knowledge that transfers widely.

The research literature on science education is far from consistent in its use of terminology. However, our reading suggests that “discovery learning” differs from “inquiry-based learning” in that few, if any, guidelines are given to students in discovery learning contexts, whereas in inquiry learning, students are given hypotheses and specific goals to achieve (see the second paragraph of this section for a definition of inquiry-based learning). Even though thousands of schools have adopted discovery learning as an alternative to more didactic approaches to teaching and learning, the evidence showing that it is more effective than traditional, direct, teacher-controlled instructional approaches is mixed, at best (Lorch et al., 2010 ; Minner, Levy, & Century, 2010 ). In several cases where the distinctions between direct instruction and more open-ended constructivist instruction have been clearly articulated, implemented, and assessed, direct instruction has proven to be superior to the alternatives (Chen & Klahr, 1999 ; Toth, Klahr, & Chen, 2000 ). For example, in a study of third- and fourth-grade children learning about experimental design, Klahr and Nigam ( 2004 ) found that many more children learned from direct instruction than from discovery learning. Furthermore, they found that among the few children who did manage to learn from a discovery method, there was no better performance on a far transfer test of scientific reasoning than that observed for the many children who learned from direct instruction.

The idea of children learning most of their science through a process of self-directed discovery has some romantic appeal, and it may accurately describe the personal experience of a handful of world-class scientists. However, the claim has generated some contentious disagreements (Kirschner, Sweller, & Clark, 2006 ; Klahr, 2010 ; Taber 2009 ; Tobias & Duffy, 2009 ), and the jury remains out on the extent to which most children can learn science that way.

Conclusions and Future Directions

The field of scientific thinking is now a thriving area of research with strong underpinnings in cognitive psychology and cognitive science. In recent years, a new professional society has been formed that aims to facilitate this integrative and interdisciplinary approach to the psychology of science, with its own journal and regular professional meetings. 1 Clearly, the different aspects of scientific thinking discussed here need to be combined in order to produce a truly comprehensive picture of the scientific mind.

While much is known about certain aspects of scientific thinking, much more remains to be discovered. In particular, there has been little contact between cognitive, neuroscience, social, personality, and motivational accounts of scientific thinking. Research in thinking and reasoning has been expanded to use the methods and theories of cognitive neuroscience (see Morrison & Knowlton, Chapter 6 ). A similar approach can be taken in exploring scientific thinking (see Dunbar et al., 2007 ). There are two main reasons for taking a neuroscience approach to scientific thinking. First, functional neuroimaging allows the researcher to look at the entire human brain, making it possible to see the many different sites that are involved in scientific thinking and gain a more complete understanding of the entire range of mechanisms involved in this type of thought. Second, these brain-imaging approaches allow researchers to address fundamental questions in research on scientific thinking, such as the extent to which ordinary thinking in nonscientific contexts and scientific thinking recruit similar versus disparate neural structures of the brain.

Dunbar (2009) has used some novel methods to explore Simon's assertion, cited at the beginning of this chapter, that scientific thinking uses the same cognitive mechanisms that all human beings possess (rather than being an entirely different type of thinking), but combines them in ways that are specific to a particular aspect or discipline of science. For example, Fugelsang and Dunbar (2009) compared causal reasoning about two colliding circular objects that were labeled either balls or subatomic particles, and obtained different patterns of brain activation depending on the label. In another series of experiments, Dunbar and colleagues used functional magnetic resonance imaging (fMRI) to study patterns of activation in the brains of students who have and who have not undergone conceptual change in physics. For example, Fugelsang and Dunbar (2005) and Dunbar et al. (2007) found differences in the activation of specific brain sites (such as the anterior cingulate) when students encounter evidence that is inconsistent with their current conceptual understandings. These initial cognitive neuroscience investigations have the potential to reveal the ways that knowledge is organized in the scientific brain and to provide detailed accounts of the nature of the representation of scientific knowledge. Petitto and Dunbar (2004) proposed the term "educational neuroscience" for the integration of research on education, including science education, with research on neuroscience. However, see Fitzpatrick (in press) for a very different perspective on whether neuroscience approaches are relevant to education. Clearly, research on the scientific brain is just beginning. We are beginning to get a reasonable grasp of the inner workings of the subcomponents of the scientific mind (i.e., problem solving, analogy, induction). However, great advances remain to be made concerning how these processes interact so that scientific discoveries can be made. Future research will focus on both the collaborative aspects of scientific thinking and the neural underpinnings of the scientific mind.

1. The International Society for the Psychology of Science and Technology (ISPST). Available at http://www.ispstonline.org/

References

Ahn, W., Kalish, C. W., Medin, D. L., & Gelman, S. A. (1995). The role of covariation versus mechanism information in causal attribution. Cognition, 54, 299–352.

American Association for the Advancement of Science. ( 1993 ). Benchmarks for scientific literacy . New York: Oxford University Press.

Atkins, L. J., Velez, L., Goudy, D., & Dunbar, K. N. ( 2009 ). The unintended effects of interactive objects and labels in the science museum.   Science Education , 54 , 161–184.

Azmitia, M. A., & Crowley, K. ( 2001 ). The rhythms of scientific thinking: A study of collaboration in an earthquake microworld. In K. Crowley, C. Schunn, & T. Okada (Eds.), Designing for science: Implications from everyday, classroom, and professional settings (pp. 45–72). Mahwah, NJ: Erlbaum.

Bacon, F. ( 1620 /1854). Novum organum (B. Montague, Trans.). Philadelphia, PA: Parry & McMillan.

Baillargeon, R. ( 2004 ). Infants' reasoning about hidden objects: Evidence for event-general and event-specific expectations (article with peer commentaries and response, listed below).   Developmental Science , 54 , 391–424.

Baker, L. M., & Dunbar, K. ( 2000 ). Experimental design heuristics for scientific discovery: The use of baseline and known controls.   International Journal of Human Computer Studies , 54 , 335–349.

Beilin, H. ( 1994 ). Jean Piaget's enduring contribution to developmental psychology. In R. D. Parke, P. A. Ornstein, J. J. Rieser, & C. Zahn-Waxler (Eds.), A century of developmental psychology (pp. 257–290). Washington, DC US: American Psychological Association.

Biswal, B. B., Mennes, M., Zuo, X.-N., Gohel, S., Kelly, C., Smith, S.M., et al. ( 2010 ). Toward discovery science of human brain function.   Proceedings of the National Academy of Sciences of the United States of America , 107, 4734–4739.

Brewer, W. F., & Samarapungavan, A. ( 1991 ). Children's theories vs. scientific theories: Differences in reasoning or differences in knowledge? In R. R. Hoffman & D. S. Palermo (Eds.), Cognition and the symbolic processes: Applied and ecological perspectives (pp. 209–232). Hillsdale, NJ: Erlbaum.

Bruner, J. S., Goodnow, J. J., & Austin, G. A. ( 1956 ). A study of thinking . New York, NY: Science Editions.

Carey, S. ( 1985 ). Conceptual change in childhood . Cambridge, MA: MIT Press.

Carruthers, P., Stich, S., & Siegal, M. ( 2002 ). The cognitive basis of science . New York: Cambridge University Press.

Chi, M. ( 1992 ). Conceptual change within and across ontological categories: Examples from learning and discovery in science. In R. Giere (Ed.), Cognitive models of science (pp. 129–186). Minneapolis: University of Minnesota Press.

Chi, M. T. H., & Roscoe, R. D. ( 2002 ). The processes and challenges of conceptual change. In M. Limon & L. Mason (Eds.), Reconsidering conceptual change: Issues in theory and practice (pp 3–27). Amsterdam, Netherlands: Kluwer Academic Publishers.

Chen, Z., & Klahr, D. ( 1999 ). All other things being equal: Children's acquisition of the control of variables strategy.   Child Development , 54 (5), 1098–1120.

Clement, J. ( 1982 ). Students' preconceptions in introductory mechanics.   American Journal of Physics , 54 , 66–71.

Cohen, L. B., & Cashon, C. H. ( 2006 ). Infant cognition. In W. Damon & R. M. Lerner (Series Eds.) & D. Kuhn & R. S. Siegler (Vol. Eds.), Handbook of child psychology. Vol. 2: Cognition, perception, and language (6th ed., pp. 214–251). New York: Wiley.

National Commission on Excellence in Education. ( 1983 ). A nation at risk: The imperative for educational reform . Washington, DC: US Department of Education.

Crick, F. H. C. ( 1988 ). What mad pursuit: A personal view of science . New York: Basic Books.

Darden, L. ( 2002 ). Strategies for discovering mechanisms: Schema instantiation, modular subassembly, forward chaining/backtracking.   Philosophy of Science , 69, S354–S365.

Davenport, J. L., Yaron, D., Klahr, D., & Koedinger, K. ( 2008 ). Development of conceptual understanding and problem solving expertise in chemistry. In B. C. Love, K. McRae, & V. M. Sloutsky (Eds.), Proceedings of the 30th Annual Conference of the Cognitive Science Society (pp. 751–756). Austin, TX: Cognitive Science Society.

diSessa, A. A. ( 2004 ). Contextuality and coordination in conceptual change. In E. Redish & M. Vicentini (Eds.), Proceedings of the International School of Physics "Enrico Fermi": Research on physics education (pp. 137–156). Amsterdam, Netherlands: IOS Press/Italian Physical Society.

Dunbar, K. ( 1995 ). How scientists really reason: Scientific reasoning in real-world laboratories. In R. J. Sternberg, & J. Davidson (Eds.), Mechanisms of insight (pp. 365–395). Cambridge, MA: MIT press.

Dunbar, K. ( 1997 ). How scientists think: Online creativity and conceptual change in science. In T. B. Ward, S. M. Smith, & S. Vaid (Eds.), Conceptual structures and processes: Emergence, discovery and change (pp. 461–494). Washington, DC: American Psychological Association.

Dunbar, K. ( 1998 ). Problem solving. In W. Bechtel & G. Graham (Eds.), A companion to cognitive science (pp. 289–298). London: Blackwell

Dunbar, K. ( 1999 ). The scientist InVivo : How scientists think and reason in the laboratory. In L. Magnani, N. Nersessian, & P. Thagard (Eds.), Model-based reasoning in scientific discovery (pp. 85–100). New York: Plenum.

Dunbar, K. ( 2001 ). The analogical paradox: Why analogy is so easy in naturalistic settings, yet so difficult in the psychology laboratory. In D. Gentner, K. J. Holyoak, & B. Kokinov (Eds.), The analogical mind: Perspectives from cognitive science (pp. 313–334). Cambridge, MA: MIT Press.

Dunbar, K. ( 2002 ). Science as category: Implications of InVivo science for theories of cognitive development, scientific discovery, and the nature of science. In P. Carruthers, S. Stich, & M. Siegal (Eds.), The cognitive basis of science (pp. 154–170). New York: Cambridge University Press.

Dunbar, K. ( 2009 ). The biology of physics: What the brain reveals about our physical understanding of the world. In M. Sabella, C. Henderson, & C. Singh. (Eds.), Proceedings of the Physics Education Research Conference (pp. 15–18). Melville, NY: American Institute of Physics.

Dunbar, K., & Fugelsang, J. ( 2004 ). Causal thinking in science: How scientists and students interpret the unexpected. In M. E. Gorman, A. Kincannon, D. Gooding, & R. D. Tweney (Eds.), New directions in scientific and technical thinking (pp. 57–59). Mahway, NJ: Erlbaum.

Dunbar, K., Fugelsang, J., & Stein, C. ( 2007 ). Do naïve theories ever go away? In M. Lovett & P. Shah (Eds.), Thinking with data: 33rd Carnegie Symposium on Cognition (pp. 193–206). Mahwah, NJ: Erlbaum.

Dunbar, K., & Sussman, D. ( 1995 ). Toward a cognitive account of frontal lobe function: Simulating frontal lobe deficits in normal subjects.   Annals of the New York Academy of Sciences , 54 , 289–304.

Duschl, R. A., Schweingruber, H. A., & Shouse, A. W. (Eds.). ( 2007 ). Taking science to school: Learning and teaching science in grades K-8. Washington, DC: National Academies Press.

Einstein, A. ( 1950 ). Out of my later years . New York: Philosophical Library

Erdos, P., Fajtlowicz, S., & Staton, W. ( 1991 ). Degree sequences in the triangle-free graphs,   Discrete Mathematics , 54 (91), 85–88.

Evans, J., & Rzhetsky, A. ( 2010 ). Machine science.   Science , 54 , 399–400.

Fay, A., & Klahr, D. ( 1996 ). Knowing about guessing and guessing about knowing: Preschoolers' understanding of indeterminacy.   Child Development , 54 , 689–716.

Fischler, H., & Lichtfeldt, M. ( 1992 ). Modern physics and students' conceptions.   International Journal of Science Education , 54 , 181–190.

Fitzpatrick, S. M. (in press). Functional brain imaging: Neuro-turn or wrong turn? In M. M., Littlefield & J.M., Johnson (Eds.), The neuroscientific turn: Transdisciplinarity in the age of the brain. Ann Arbor: University of Michigan Press.

Fox-Keller, E. ( 1985 ). Reflections on gender and science . New Haven, CT: Yale University Press.

Fugelsang, J., & Dunbar, K. ( 2005 ). Brain-based mechanisms underlying complex causal thinking.   Neuropsychologia , 54 , 1204–1213.

Fugelsang, J., & Dunbar, K. ( 2009 ). Brain-based mechanisms underlying causal reasoning. In E. Kraft (Ed.), Neural correlates of thinking (pp. 269–279). Berlin, Germany: Springer

Fugelsang, J., Stein, C., Green, A., & Dunbar, K. ( 2004 ). Theory and data interactions of the scientific mind: Evidence from the molecular and the cognitive laboratory.   Canadian Journal of Experimental Psychology , 54 , 132–141

Galilei, G. ( 1638 /1991). Dialogues concerning two new sciences (A. de Salvio & H. Crew, Trans.). Amherst, NY: Prometheus Books.

Galison, P. ( 2003 ). Einstein's clocks, Poincaré's maps: Empires of time . New York: W. W. Norton.

Gelman, R., & Baillargeon, R. ( 1983 ). A review of Piagetian concepts. In P. H. Mussen (Series Ed.) & J. H. Flavell & E. M. Markman (Vol. Eds.), Handbook of child psychology (4th ed., Vol. 3, pp. 167–230). New York: Wiley.

Gelman, S. A., & Kalish, C. W. ( 2006 ). Conceptual development. In D. Kuhn & R. Siegler (Eds.), Handbook of child psychology. Vol. 2: Cognition, perception and language (pp. 687–733). New York: Wiley.

Gelman, S., & Wellman, H. ( 1991 ). Insides and essences.   Cognition , 54 , 214–244.

Gentner, D. ( 2010 ). Bootstrapping the mind: Analogical processes and symbol systems.   Cognitive Science , 54 , 752–775.

Gentner, D., Brem, S., Ferguson, R. W., Markman, A. B., Levidow, B. B., Wolff, P., & Forbus, K. D. ( 1997 ). Analogical reasoning and conceptual change: A case study of Johannes Kepler.   The Journal of the Learning Sciences , 54 (1), 3–40.

Gentner, D., Holyoak, K. J., & Kokinov, B. ( 2001 ). The analogical mind: Perspectives from cognitive science . Cambridge, MA: MIT Press.

Gentner, D., & Jeziorski, M. ( 1993 ). The shift from metaphor to analogy in western science. In A. Ortony (Ed.), Metaphor and thought (2nd ed., pp. 447–480). Cambridge, England: Cambridge University Press.

Gianfelici, F. ( 2010 ). Machine science: Truly machine-aided science.   Science , 54 , 317–319.

Giere, R. ( 1993 ). Cognitive models of science . Minneapolis: University of Minnesota Press.

Gopnik, A. N., Meltzoff, A. N., & Kuhl, P. K. ( 1999 ). The scientist in the crib: Minds, brains and how children learn . New York: Harper Collins

Gorman, M. E. ( 1989 ). Error, falsification and scientific inference: An experimental investigation.   Quarterly Journal of Experimental Psychology: Human Experimental Psychology , 41A , 385–412

Gorman, M. E., Kincannon, A., Gooding, D., & Tweney, R. D. ( 2004 ). New directions in scientific and technical thinking . Mahwah, NJ: Erlbaum.

Gupta, A., Hammer, D., & Redish, E. F. ( 2010 ). The case for dynamic models of learners' ontologies in physics.   Journal of the Learning Sciences , 54 (3), 285–321.

Haufe, C., Elliott, K. C., Burian, R., & O'Malley, M. A. ( 2010 ). Machine science: What's missing.   Science , 54 , 318–320.

Hecht, E. ( 2011 ). On defining mass.   The Physics Teacher , 54 , 40–43.

Heit, E. ( 2000 ). Properties of inductive reasoning.   Psychonomic Bulletin and Review , 54 , 569–592.

Holyoak, K. J., & Thagard, P. ( 1995 ). Mental leaps . Cambridge, MA: MIT Press.

Karmiloff-Smith, A. ( 1988 ) The child is a theoretician, not an inductivist.   Mind and Language , 54 , 183–195.

Keil, F. C. ( 1999 ). Conceptual change. In R. Wilson & F. Keil (Eds.), The MIT encyclopedia of cognitive science . (pp. 179–182) Cambridge, MA: MIT press.

Kern, L. H., Mirels, H. L., & Hinshaw, V. G. ( 1983 ). Scientists' understanding of propositional logic: An experimental investigation.   Social Studies of Science , 54 , 131–146.

King, R. D. ( 2011 ). Rise of the robo scientists.   Scientific American , 54 (1), 73–77.

King, R. D., Rowland, J., Oliver, S. G., Young, M., Aubrey, W., Byrne, E., et al. ( 2009 ). The automation of science.   Science , 54 , 85–89.

Kirschner, P. A., Sweller, J., & Clark, R. ( 2006 ) Why minimal guidance during instruction does not work: An analysis of the failure of constructivist, discovery, problem-based, experiential, and inquiry-based teaching.   Educational Psychologist , 54 , 75–86

Klahr, D. ( 2000 ). Exploring science: The cognition and development of discovery processes . Cambridge, MA: MIT Press.

Klahr, D. ( 2010 ). Coming up for air: But is it oxygen or phlogiston? A response to Taber's review of constructivist instruction: Success or failure?   Education Review , 54 (13), 1–6.

Klahr, D., & Dunbar, K. ( 1988 ). Dual space search during scientific reasoning.   Cognitive Science , 54 , 1–48.

Klahr, D., & Nigam, M. ( 2004 ). The equivalence of learning paths in early science instruction: effects of direct instruction and discovery learning.   Psychological Science , 54 (10), 661–667.

Klahr, D., & Masnick, A. M. ( 2002 ). Explaining, but not discovering, abduction. Review of L. Magnani (2001), Abduction, reason, and science: Processes of discovery and explanation.   Contemporary Psychology , 47, 740–741.

Klahr, D., & Simon, H. ( 1999 ). Studies of scientific discovery: Complementary approaches and convergent findings.   Psychological Bulletin , 54 , 524–543.

Klayman, J., & Ha, Y. ( 1987 ). Confirmation, disconfirmation, and information in hypothesis testing.   Psychological Review , 54 , 211–228.

Kozhevnikov, M., & Hegarty, M. ( 2001 ). Impetus beliefs as default heuristic: Dissociation between explicit and implicit knowledge about motion.   Psychonomic Bulletin and Review , 54 , 439–453.

Kuhn, T. ( 1962 ). The structure of scientific revolutions . Chicago, IL: University of Chicago Press.

Kuhn, D., Amsel, E., & O'Laughlin, M. ( 1988 ). The development of scientific thinking skills . Orlando, FL: Academic Press.

Kulkarni, D., & Simon, H. A. ( 1988 ). The processes of scientific discovery: The strategy of experimentation.   Cognitive Science , 54 , 139–176.

Langley, P. ( 2000 ). Computational support of scientific discovery.   International Journal of Human-Computer Studies , 54 , 393–410.

Langley, P. ( 2002 ). Lessons for the computational discovery of scientific knowledge. In Proceedings of the First International Workshop on Data Mining Lessons Learned (pp. 9–12).

Langley, P., Simon, H. A., Bradshaw, G. L., & Zytkow, J. M. ( 1987 ). Scientific discovery: Computational explorations of the creative processes . Cambridge, MA: MIT Press.

Lorch, R. F., Jr., Lorch, E. P., Calderhead, W. J., Dunlap, E. E., Hodell, E. C., & Freer, B. D. ( 2010 ). Learning the control of variables strategy in higher and lower achieving classrooms: Contributions of explicit instruction and experimentation.   Journal of Educational Psychology , 54 (1), 90–101.

Magnani, L., Carnielli, W., & Pizzi, C. (Eds.) ( 2010 ). Model-based reasoning in science and technology: Abduction, logic, and computational discovery. Studies in Computational Intelligence (Vol. 314). Heidelberg/Berlin: Springer.

Mandler, J.M. ( 2004 ). The foundations of mind: Origins of conceptual thought . Oxford, England: Oxford University Press.

Macpherson, R., & Stanovich, K. E. ( 2007 ). Cognitive ability, thinking dispositions, and instructional set as predictors of critical thinking.   Learning and Individual Differences , 54 , 115–127.

McCloskey, M., Caramazza, A., & Green, B. ( 1980 ). Curvilinear motion in the absence of external forces: Naive beliefs about the motion of objects.   Science , 54 , 1139–1141.

McDermott, L. C., & Redish, E. F. ( 1999 ). Resource letter on physics education research.   American Journal of Physics , 54 , 755.

Mestre, J. P. ( 1991 ). Learning and instruction in pre-college physical science.   Physics Today , 54 , 56–62.

Metz, K. E. ( 1995 ). Reassessment of developmental constraints on children's science instruction.   Review of Educational Research , 54 (2), 93–127.

Minner, D. D., Levy, A. J., & Century, J. ( 2010 ). Inquiry-based science instruction—what is it and does it matter? Results from a research synthesis years 1984 to 2002.   Journal of Research in Science Teaching , 54 (4), 474–496.

Mitchell, T. M. ( 2009 ). Mining our reality.   Science , 54 , 1644–1645.

Mitroff, I. ( 1974 ). The subjective side of science . Amsterdam, Netherlands: Elsevier.

Munakata, Y., Casey, B. J., & Diamond, A. ( 2004 ). Developmental cognitive neuroscience: Progress and potential.   Trends in Cognitive Sciences , 54 , 122–128.

Mynatt, C. R., Doherty, M. E., & Tweney, R. D. ( 1977 ) Confirmation bias in a simulated research environment: An experimental study of scientific inference.   Quarterly Journal of Experimental Psychology , 54 , 89–95.

Nersessian, N. ( 1998 ). Conceptual change. In W. Bechtel, & G. Graham (Eds.), A companion to cognitive science (pp. 157–166). London, England: Blackwell.

Nersessian, N. ( 1999 ). Models, mental models, and representations: Model-based reasoning in conceptual change. In L. Magnani, N. Nersessian, & P. Thagard (Eds.), Model-based reasoning in scientific discovery (pp. 5–22). New York: Plenum.

Nersessian, N. J. ( 2002 ). The cognitive basis of model-based reasoning in science. In P. Carruthers, S. Stich, & M. Siegal (Eds.), The cognitive basis of science (pp. 133–152). New York: Cambridge University Press.

Nersessian, N. J. ( 2008 ) Creating scientific concepts . Cambridge, MA: MIT Press.

O'Malley, M. A. ( 2011 ). Exploration, iterativity and kludging in synthetic biology.   Comptes Rendus Chimie , 54 (4), 406–412.

Papert, S. ( 1980 ). Mindstorms: Children, computers, and powerful ideas. New York: Basic Books.

Penner, D. E., & Klahr, D. ( 1996 ). When to trust the data: Further investigations of system error in a scientific reasoning task.   Memory and Cognition , 54 (5), 655–668.

Petitto, L. A., & Dunbar, K. ( 2004 ). New findings from educational neuroscience on bilingual brains, scientific brains, and the educated mind. In K. Fischer & T. Katzir (Eds.), Building usable knowledge in mind, brain, and education. Cambridge, England: Cambridge University Press.

Popper, K. R. ( 1959 ). The logic of scientific discovery . London, England: Hutchinson.

Qin, Y., & Simon, H.A. ( 1990 ). Laboratory replication of scientific discovery processes.   Cognitive Science , 54 , 281–312.

Reiser, B. J., Tabak, I., Sandoval, W. A., Smith, B., Steinmuller, F., & Leone, T. J. ( 2001 ). BGuILE: Strategic and conceptual scaffolds for scientific inquiry in biology classrooms. In S. M. Carver & D. Klahr (Eds.), Cognition and instruction: Twenty-five years of progress (pp. 263–306). Mahwah, NJ: Erlbaum.

Riordan, M., Rowson, P. C., & Wu, S. L. ( 2001 ). The search for the Higgs boson.   Science , 54 , 259–260.

Rutherford, F. J., & Ahlgren, A. ( 1991 ). Science for all Americans. New York: Oxford University Press.

Samarapungavan, A. ( 1992 ). Children's judgments in theory choice tasks: Scientific rationality in childhood.   Cognition , 54 , 1–32.

Schauble, L., & Glaser, R. ( 1990 ). Scientific thinking in children and adults. In D. Kuhn (Ed.), Developmental perspectives on teaching and learning thinking skills. Contributions to Human Development , (Vol. 21, pp. 9–26). Basel, Switzerland: Karger.

Schunn, C. D., & Klahr, D. ( 1995 ). A 4-space model of scientific discovery. In Proceedings of the 17th Annual Conference of the Cognitive Science Society (pp. 106–111). Mahwah, NJ: Erlbaum.

Schunn, C. D., & Klahr, D. ( 1996 ). The problem of problem spaces: When and how to go beyond a 2-space model of scientific discovery. Part of symposium on Building a theory of problem solving and scientific discovery: How big is N in N-space search? In Proceedings of the 18th Annual Conference of the Cognitive Science Society (pp. 25–26). Mahwah, NJ: Erlbaum.

Shrager, J., & Langley, P. ( 1990 ). Computational models of scientific discovery and theory formation . San Mateo, CA: Morgan Kaufmann.

Siegler, R. S., & Liebert, R. M. ( 1975 ). Acquisition of formal scientific reasoning by 10- and 13-year-olds: Designing a factorial experiment.   Developmental Psychology , 54 , 401–412.

Simon, H. A. ( 1977 ). Models of discovery . Dordrecht, Netherlands: D. Reidel Publishing.

Simon, H. A., Langley, P., & Bradshaw, G. L. ( 1981 ). Scientific discovery as problem solving.   Synthese , 54 , 1–27.

Simon, H. A., & Lea, G. ( 1974 ). Problem solving and rule induction. In H. Simon (Ed.), Models of thought (pp. 329–346). New Haven, CT: Yale University Press.

Smith, E. E., Shafir, E., & Osherson, D. ( 1993 ). Similarity, plausibility, and judgments of probability.   Cognition. Special Issue: Reasoning and decision making , 54 , 67–96.

Sodian, B., Zaitchik, D., & Carey, S. ( 1991 ). Young children's differentiation of hypothetical beliefs from evidence.   Child Development , 54 , 753–766.

Taber, K. S. ( 2009 ). Constructivism and the crisis in U.S. science education: An essay review.   Education Review , 54 (12), 1–26.

Thagard, P. ( 1992 ). Conceptual revolutions . Cambridge, MA: MIT Press.

Thagard, P. ( 1999 ). How scientists explain disease . Princeton, NJ: Princeton University Press.

Thagard, P., & Croft, D. ( 1999 ). Scientific discovery and technological innovation: Ulcers, dinosaur extinction, and the programming language Java. In L. Magnani, N. Nersessian, & P. Thagard (Eds.), Model-based reasoning in scientific discovery (pp. 125–138). New York: Plenum.

Tobias, S., & Duffy, T. M. (Eds.). ( 2009 ). Constructivist instruction: Success or failure? New York: Routledge.

Toth, E. E., Klahr, D., & Chen, Z. ( 2000 ) Bridging research and practice: A cognitively-based classroom intervention for teaching experimentation skills to elementary school children.   Cognition and Instruction , 54 (4), 423–459.

Tweney, R. D. ( 1989 ). A framework for the cognitive psychology of science. In B. Gholson, A. Houts, R. A. Neimeyer, & W. Shadish (Eds.), Psychology of science: Contributions to metascience (pp. 342–366). Cambridge, England: Cambridge University Press.

Tweney, R. D., Doherty, M. E., & Mynatt, C. R. ( 1981 ). On scientific thinking . New York: Columbia University Press.

Valdes-Perez, R. E. ( 1994 ). Conjecturing hidden entities via simplicity and conservation laws: Machine discovery in chemistry.   Artificial Intelligence , 54 (2), 247–280.

Von Hofsten, C. ( 1980 ). Predictive reaching for moving objects by human infants.   Journal of Experimental Child Psychology , 54 , 369–382.

Von Hofsten, C., Feng, Q., & Spelke, E. S. ( 2000 ). Object representation and predictive action in infancy.   Developmental Science , 54 , 193–205.

Vosniadou, S. (Ed.). ( 2008 ). International handbook of research on conceptual change . New York: Taylor & Francis.

Vosniadou, S., & Brewer, W. F. ( 1992 ). Mental models of the earth: A study of conceptual change in childhood.   Cognitive Psychology , 54 , 535–585.

Wason, P. C. ( 1968 ). Reasoning about a rule.   Quarterly Journal of Experimental Psychology , 54 , 273–281.

Wertheimer, M. ( 1945 ). Productive thinking . New York: Harper.

Yang, Y. ( 2009 ). Target discovery from data mining approaches.   Drug Discovery Today , 54 (3–4), 147–154.


SCIENTIFIC REASONING: RESEARCH, DEVELOPMENT, AND ASSESSMENT

DISSERTATION

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

Jing Han, B.S., M.S.

The Ohio State University 2013

Dissertation Committee:

Professor Lei Bao, Adviser
Professor Fengyuan Yang
Professor Andrew F. Heckler
Professor Evan R. Sugarbaker

Approved by: Adviser, Physics Graduate Program

© Copyright by Jing Han, 2013

ABSTRACT

Education in Science, Technology, Engineering, and Math (STEM) is emphasized worldwide. Reports from large-scale international studies such as TIMSS and PISA continually rank U.S. students behind many other nations. As a result, the U.S. has increased its emphasis on the implementation of a more extensive science and mathematics curriculum in K-12 education.

In STEM education, widely accepted teaching goals include not only the development of solid content knowledge but also the development of general scientific abilities that will enable students to successfully handle open-ended real-world tasks in future careers. One such ability, scientific reasoning, is closely related to a wide range of general cognitive abilities such as critical thinking and reasoning. Existing research has suggested that scientific reasoning skills can be trained and transferred. Training in scientific reasoning may also have a long-term impact on student academic achievement. In the STEM education community, it has been widely agreed that student development of transferable general abilities is at least as important as specific learned STEM knowledge. Therefore, it is important to investigate how to implement a STEM education program that can help students develop both STEM content knowledge and scientific reasoning.

In order to develop such a knowledge base and to assess and evaluate the impact and effectiveness of education methods and resources, we need good assessment tools that can be easily applied at large scale and produce valid results comparable across a wide range of populations.

In the area of scientific reasoning, there exists a practical tool, the Lawson’s

Classroom Test of Scientific Reasoning, and since its initial development in the late

1970s and early 1980s, the test has undergone several revisions with the current version released in 2000. Although the Lawson’s test has provided much useful information for research and assessment purposes, the test itself hasn’t been systematically validated and several issues have been observed in large scale applications concerning item designs and scalability of the results (details will be provided in later sections). Therefore, there is an urgent need for systematic research on validating the Lawson’s test and further

development of validated standardized assessment tools on scientific reasoning for K-18 education.

This dissertation project establishes a first step to systematically improve the

assessment instrumentation of scientific reasoning. A series of studies has been conducted:

(1) A detailed validation study of the Lawson’s test, which has identified a number of

validity issues including item/choice design issues, item context issues, item

structure and wording issues (e.g. two-tier design), the limited scale of

measurement range, and the ceiling effect for advanced students.

(2) A study to determine the basic measurement features of the Lawson’s test with

large scale data.

(3) A data-mining study of Lawson’s test data, which helps identify learning

progression behaviors of selected scientific reasoning skills. The results also

provide evidence for researchers to evaluate and model the scoring methods of

two-tiered questions used in the Lawson’s test.

(4) A study with randomized testing to investigate the learning progression of the

skill of control of variables (COV), which showed a series of fine grained

intermediate levels of COV skills.

This project produces rich resources for sustained research and development on scientific reasoning. It establishes a valuable baseline for teachers and researchers to apply the Lawson’s test in research and teaching and a solid foundation for researchers to further develop the next-generation assessment instruments on scientific reasoning.

ACKNOWLEDGMENTS

I am extremely grateful to my adviser, Prof. Lei Bao, for all his support since my initial meeting to find out about working with the physics education research group.

I would like to thank Prof. Andrew Heckler, Prof. Fengyuan Yang, and Prof. Evan

Sugarbaker for serving on my committee.

I also would like to thank fellow graduate students in the physics education research group for all the helpful discussions and support.

I thank my family and friends for their unconditional support.

Most importantly, I would like to thank the students for their participation and cooperation with this project.

VITA

2008-Present…………………………… Research Assistant, Department of Physics,

The Ohio State University

2010…………………………………..... M.S. Physics, The Ohio State University

2006…………………………………..... B.S. Physics, Capital Normal University,

Beijing, China

PUBLICATIONS

Jing Han, Li Chen, Yibing Zhang, Shaona Zhou, Lei Bao, “Seeing what students see in

doing the Force Concept Inventory”, Am. J. Phys., In Press.

Lei Bao, Amy Raplinger, Jing Han, Yeounsoo Kim, “Assessment of Students’ Cognitive

Conflicts and Anxiety”, Journal of Research in Science Teaching, Submitted.

Li Chen, Jing Han, Lei Bao, Jing Wang, Yan Tu, “Comparisons of Item Response Theory

Algorithms with Force Concept Inventory Data”, Research in Education Assessment

and Learning, 2 (02), 26-34, (2011).

Shaona Zhou, Jing Han, Nathaniel Pelz, Xiaojun Wang, Liangyu Peng, Hua Xiao, Lei Bao, “Inquiry Style Interactive Virtual Experiments: A Case on Circular Motion”, Eur. J. Phys., 32, 1597 (2011).

Tianfang Cai, Hongmin Zhao, Jing Han, “Clicker in introductory physics class of Chinese University,” Physics and Engineering, Vol. 2, pp. 51-53, 2010 (in Chinese).

Lei Bao, Tianfang Cai, Kathy Koenig, Kai Fang, Jing Han, Jing Wang, Qing Liu, Lin

Ding, Lili Cui, Ying Luo, Yufeng Wang, Lieming Li, Nianle Wu, “Learning and

Scientific Reasoning”, Science, Vol. 323. no. 5914, pp. 586 – 587 (2009).

Lei Bao, Kai Fang, Tianfang Cai, Jing Wang, Lijia Yang, Lili Cui, Jing Han, Lin Ding,

and Ying Luo, “Learning of Content Knowledge and Development of Scientific Reasoning Ability: A Cross Culture Comparison,” Am. J. Phys., 77 (12), 1118-1123 (2009).

Jing Han, Dan Li, “Research in Moral and Law Education in XuanWu District”,

Excellent Thesis of Special Undergraduate Grant for Research CNU, 2006 (in Chinese).

Jing Han, Wu Zheng, “Enhancing the Gauss Theorem Education through Analysis of Universal Gravitational Field Intensity,” Discussion of Physics Education, Vol. 25,

2006 (in Chinese).

FIELDS OF STUDY

Major Field: Physics

TABLE OF CONTENTS

Abstract …………………………………………………………………………………. ii

Acknowledgments ……………………………………………………………………….. v

Vita …………………………………………………………………………………..…...vi

List of Figures ………………………………………………………………………….. xii

List of Tables …………………………………………………………………...... xiv

Chapter 1. Introduction to Research on Scientific Reasoning……………………………. 1

1.1 What is Scientific Reasoning? …………………………………………………… 1
1.2 Why is Scientific Reasoning Important? ………………………………………… 9
1.3 How is Scientific Reasoning Learned? …………………………………………… 12
1.4 How is Scientific Reasoning Assessed? ………………………………………… 13
1.5 Scientific Reasoning, an Important Component of the 21st Century Skills ……… 15
1.5.1 Skills Gap between Schools and Workplaces ………………………………… 17
1.5.2 What are 21st Century Skills? ………………………………………………… 19
1.6 Outline of the Thesis ……………………………………………………………… 22

Chapter 2. Research on Assessment of Scientific Reasoning…………………………... 24

2.1 Theoretical Background of Scientific Reasoning ………………………………… 24
2.2 Existing Research and Tools on Assessment of Scientific Reasoning …………… 28
2.2.1 Group Assessment of Logical Thinking (GALT) ……………………………… 29
2.2.2 The Test of Logical Thinking (TOLT) ………………………………………… 31
2.2.3 Lawson’s Classroom Test of Scientific Reasoning (LCTSR) ………………… 32
2.3 Expanding the Dimensions of Skills for Assessment of Scientific Reasoning …… 36
2.3.1 Control of Variables …………………………………………………………… 37
2.3.2 Proportions and Ratios ………………………………………………………… 42

2.3.3 Probability ……………………………………………………………………… 45
2.3.4 Correlational Reasoning ………………………………………………………… 47
2.3.5 Deductive Reasoning …………………………………………………………… 50
2.3.6 Inductive Reasoning …………………………………………………………… 54
2.3.7 Causal Reasoning ……………………………………………………………… 56

Chapter 3. Validity of the Lawson’s Classroom Test of Scientific Reasoning.…..…… 60

3.1 A Historical Review on the Development of the Lawson’s Test ………………… 60
3.2 Content Evaluation of Lawson’s Test – Item Context Issues …………………… 66
3.3 A Data-Driven Study on the Validity of the Lawson’s Test ……………………… 67
3.4 Quantitative Results – Item Score Analysis ……………………………………… 69
3.5 Quantitative Results – Analysis of Two-Tier Score Patterns …………………… 72
3.6 Qualitative Results – Analysis of Student Interviews …………………………… 75
3.7 Consideration on Two-Tier Question Structure ………………………………… 84
3.8 The Ceiling Effect and Measurement Saturation of the Lawson’s Test ………… 85
3.9 Conclusions ……………………………………………………………………… 88

Chapter 4. The Developmental Metric of Scientific Reasoning………………………... 89

4.1 Context of the Study ……………………………………………………………… 89
4.2 Data Collection …………………………………………………………………… 90
4.3 The Developmental Scales of the Lawson’s Test Scores ………………………… 93
4.3.1 The Learning Evolution Index Curve of the Lawson’s Test …………………… 93
4.3.2 The Developmental Scales of the Six Skill Dimensions of the Lawson’s Test … 98
4.3.3 The Developmental Curve of the Conservation of Mass and Volume ………… 99
4.3.4 The Developmental Curve of the Proportional Reasoning ……………………… 101
4.3.5 The Developmental Curve of the Control of Variables ………………………… 103
4.3.6 The Developmental Curve of the Probabilistic Reasoning ……………………… 105
4.3.7 The Developmental Curve of the Correlation Thinking ………………………… 108
4.3.8 The Developmental Curve of the Hypothetical-deductive Reasoning ………… 110
4.4 Summary …………………………………………………………………………… 112

Chapter 5. Study Learning Progression of Scientific Reasoning Skills through Pattern Analysis of Responses to the Lawson’s Test…………………………………………... 114

5.1 Context of the Study ……………………………………………………………… 114
5.2 Review of Learning Progression in the Context of Scientific Reasoning ………… 114
5.3 Research Design to Study Learning Progress in Scientific Reasoning …………… 123
5.4 Data Collection …………………………………………………………………… 130
5.5 Data Analysis and Results ………………………………………………………… 131
5.5.1 Result 1: Defining a new level in the scoring of the Lawson Test ……………… 136
5.5.2 Result 2: Combined patterns of responses as indicators for performance levels … 139
5.5.3 Result 3: Proposing a three-level scoring system for the Lawson’s Test ……… 145
5.6 Conclusions ………………………………………………………………………… 149

Chapter 6. A Case Study on Fine Grained Learning Progression of Control of Variables………………………………………………………………………………. 152

6.1 Context of the Study ……………………………………………………………… 152
6.2 Review of Research on Control of Variables ……………………………………… 153
6.3 Research Design …………………………………………………………………… 159
6.3.1 Research Questions and Goals …………………………………………………… 159
6.3.2 The Design of the Assessment Instrument ……………………………………… 160
6.3.3 Data Collection …………………………………………………………………… 165
6.4 Data Analysis and Results ………………………………………………………… 166
6.4.1 Impact of giving experimental data in a question on student performance ……… 166
6.4.2 Impact of question context on student performance ……………………………… 170
6.4.3 Impact of embedded relationships between variables on student performance … 174
6.6 Conclusions and Discussions ……………………………………………………… 179

Chapter 7. Summary……………………………………………………………………183

References ………………………………………………………………...…………...187

Appendix A: Group Assessment of Logical Thinking (GALT) ..………...…………... 206

Appendix B: The Test of Logical Thinking (TOLT) ………………………..………...212

Appendix C: Lawson’s Classroom Test of Scientific Reasoning ……....………….. 218

LIST OF FIGURES

Figure Page

Figure 4.1. The developmental trend of Chinese and U.S. students’ total LCTSR scores ………………… 96
Figure 4.2. The developmental trends on conservation of matter and volumes ………………… 103
Figure 4.3. The developmental trends on proportional reasoning ………………… 104
Figure 4.4. The developmental trends on control of variables ………………… 107
Figure 4.5. The developmental trends on probabilistic reasoning ………………… 109
Figure 4.6. The developmental trends on correlation thinking ………………… 112
Figure 4.7. The developmental trends on hypothetical-deductive reasoning ………………… 114
Figure 5.1. Items from the Lawson’s Test used in this study ………………… 131
Figure 5.2. Percentage of grades 3-12 at the six levels of Lawson Test performance ………………… 146
Figure 5.3. Percentage of grade groupings at each of the six levels ………………… 148
Figure 5.4. Traditional, individual, and three-level Lawson Test scoring ………………… 152
Figure 6.1. Test questions on COV with experimental data. Question 1 poses a COV situation using a real-life context. Questions 2 and 3 are in physics contexts and are based on the tasks used in Boudreaux et al. (2008). ………………… 166
Figure 6.2. Mean scores on Test A (data not given) and Test B (data given). The error bars (approximately ±0.04) represent the standard error. ………………… 172
Figure 6.3. The mean scores on Test A (data not given) and Test B (data given) for each context. The real-life context shows a greater difference between the means of Tests A and B than the physics context. The error bars (±0.04) indicate the standard errors. ………………… 176

LIST OF TABLES

Table 3.1. Lawson Test two-tiered response patterns of U.S. college freshmen (N=1699) ………………… 75
Table 3.2. Comparison of Lawson test total percentage scores of U.S. college freshmen (N=1699) calculated with paired-score vs. single-question-score methods ………………… 75
Table 4.1. Summary of Lawson’s Test Data from USA and China. COV (Control of Variables), HD (Hypothetical Deductive) ………………… 94
Table 4.2. The model fit parameters and the root-mean-square deviations (RMSD) of the fit for the mean scores and population ………………… 98
Table 4.3. The six skill dimensions of the Lawson’s test ………………… 101
Table 5.1. Traditional scoring on a two-tier item from the Lawson Test ………………… 128
Table 5.2. Student performance on two easy and two difficult questions ………………… 130
Table 5.3. Distribution of collected student data across different grade levels ………………… 134
Table 5.4. Responses to Lawson Test items P1, P2, F1, and F2 from grades 6-7, 9-10, and college ………………… 138
Table 5.5. Student performance on P1, P2, F1, and F2 from grades 3 to 12 ………………… 140
Table 5.6. College student responses to P1, P2, F1, and F2 on pre- and post-tests ………………… 141
Table 5.7. Percentage of grades 3-12 at the six levels of Lawson Test performance ………………… 145
Table 5.8. Traditional and proposed scoring methods for two-tier items on the Lawson Test ………………… 150
Table 6.1. A summary of different levels of COV skills studied in the literature ………………… 162
Table 6.2. Information about test items ………………… 167
Table 6.3. Percentage of students responding with selected choices of the three questions on Test A (data not given) and Test B (data given) ………………… 181
Table 6.5. A progression of COV skills tested in this study ………………… 185

Chapter 1. Introduction to Research on Scientific Reasoning

1.1 What is Scientific Reasoning?

Scientific reasoning, also referred to as “formal reasoning” (Piaget, 1965) or “critical thinking” (Hawkins and Pea, 1987) in early studies, represents the ability to systematically explore a problem, formulate and test hypotheses, control and manipulate variables, and evaluate experimental outcomes (Zimmerman, 2007; Bao et al., 2009). It represents a set of domain-general skills involved in science inquiry, supporting the experimentation, evidence evaluation, inference, and argumentation that lead to the formation and modification of concepts and theories about the natural and social world.

There exists a large body of research on the multifaceted aspects of scientific reasoning. Zimmerman (2007) provided a comprehensive review of the related work, using Klahr’s (2000, 2005) Scientific Discovery as Dual Search (SDDS) model as the general framework to organize the main empirical findings in three areas: experimentation skills, evidence evaluation skills, and integrated approaches in self-directed experimentation (Klahr, 2000, 2005; Zimmerman, 2007). Kuhn (2002) has argued that the defining feature of scientific thinking is the set of skills involved in differentiating and coordinating theory and evidence (Kuhn, 1989, 2002). The specific skills in scientific reasoning include the isolation and control of variables, producing the full set of factorial combinations in multivariable tasks, selecting an appropriate design or a

conclusive test, generating experimental designs or conclusive tests, record keeping, the inductive skills implicated in generating a theory to account for a pattern of evidence, and

general inference skills involved in reconciling existing beliefs with new evidence that

either confirms or disconfirms those beliefs (Zimmerman, 2007). Elements concerning causal mechanisms (Koslowski, 1996) and epistemological understandings (Chinn and

Malhotra, 2002) have also been carefully examined and debated.

From a more operational perspective, scientific reasoning is assessed (and

operationally defined) in terms of a set of basic reasoning skills that are commonly

needed for students to successfully conduct scientific inquiry, which includes exploring a

problem, formulating and testing hypotheses, manipulating and isolating variables, and

observing and evaluating the consequences. The Lawson’s Test of Scientific Reasoning

(LTSR) provides a solid starting point for assessing scientific reasoning skills (Lawson,

1978, 2000). The test is designed to examine a small set of dimensions including (1)

conservation of matter and volume, (2) proportional reasoning, (3) control of variables,

(4) probability reasoning, (5) correlation reasoning, and (6) hypothetical-deductive

reasoning. These skills are important concrete components of the broadly defined

scientific reasoning ability.

Although there exists a wide range of understandings on what constitutes scientific reasoning, the literature seems to generally agree that scientific reasoning represents an important component of science inquiry. Therefore, a good understanding of the nature of scientific reasoning requires extended knowledge of science inquiry.

Scientific inquiry has its roots in the early research on constructivism and reasoning.

Vygotsky (1978) stated that children learn constructively when new tasks fall into the zone of proximal development. That is, if a task is one that a child can do with an adult’s help, then the child can eventually learn to do this task on their own by following the adult’s example. The idea that children build on existing knowledge is also reflected in

Inhelder and Piaget’s (1958) work with formal reasoning development. Their model describes the levels through which children progress from birth (sensorimotor stage) to adulthood (formal operational stage).

This pioneering work is the foundation for two schools of thought on student learning: cognitive conflict and scaffolding. Cognitive conflict occurs because students often come into the classroom with established beliefs based on their life experiences.

Many of these beliefs are non-scientific with some being strongly held and difficult to change. Therefore, helping students to “change” their non-scientific preconceptions to the expert beliefs has been the main goal of many of the studies on conceptual change.

Through research, it has been found that by explicitly recognizing the discrepancy between their current beliefs and the scientific ones (often referred to as the experience of a cognitive conflict), students can be motivated to change their current beliefs, which starts the processes of conceptual change. Posner et al. (1982) identified four requirements for successful conceptual change. Students must have (1) dissatisfaction with their current conceptions, and they must see the new conception as (2) intelligible,

(3) plausible, and (4) fruitful. Simply put, students need to first recognize that there is a conflict between their current views and the new information to be learned, and if they

are going to reject their old views, the new idea needs to make sense to them. This type of constructivist learning is the basis for courses such as Physics by Inquiry (McDermott

et al., 1996) where conflicts are elicited by the coursework and confronted and resolved

by the student with help from the text, peers, and instructors.

Scaffolding is a fundamentally different process from cognitive conflict. While

cognitive conflict can lead to rapid changes in student conceptions, scaffolding avoids

conflict and uses small steps to build on previous understanding. Coltman, Petyaeva, and

Anghileri (2002) describe scaffolding as the way adults supportively interact with

children during the learning process. The child can solve a problem on their own

(perhaps unintentionally), but the meaning of the achievement or the process that led to it

can be lost on the child unless an adult brings it to the child’s attention. In this way,

children build on their previous knowledge without being forced to confront conflict.

Both of the conceptual change and scaffolding frameworks play important roles in the

current education system regarding the implementation and evaluation of scientific

inquiry, which is generally understood as the process used in developing scientific

knowledge (Schwartz, Lederman, & Crawford, 2004). The stages of this process include

identifying variables, forming a hypothesis, designing an experiment, making

observations, collecting and analyzing data, and drawing conclusions. The process is

cyclic in nature. Once a conclusion has been reached, the original hypothesis can be

revised, which leads to further experimentation. Scientific inquiry is seen in the real-

world work of scientists, but it is also used in student-centered open-ended classroom

activities to teach students how to gain scientific knowledge (Roth & Roychoudhury,

Scientific reasoning skills support the stages of inquiry. Identifying variables and

forming hypotheses are primary skills. The skills underlying experimental design are

identification and control of variables. Observation and data collection require data- taking and data-organization skills as well as identification of hidden variables.

Analyzing data and drawing conclusions are arguably the most complex stages as they require an understanding of correlational and causal relations (with single and multiple variables) as well as the ability to interpret graphical information. At all stages, students need written and oral communication skills to present their ideas coherently.

The operational definition of scientific reasoning includes the necessary skills that support scientific inquiry such as control of variables, hypothetical deductive reasoning,

causal and correlational reasoning, proportions and ratios, deductive and inductive

reasoning, and probabilistic reasoning. This is not a complete list as one can argue that other dimensions could be included, but the literature suggests that these are commonly agreed-upon scientific reasoning skills. Some dimensions have been studied in great detail, and the results of a selection of these studies are summarized below.

In the realm of experimental design, one reasoning skill that needs to be used is control of variables (COV). In a recent study, Boudreaux et al. (2008) found that college students and in-service teachers had difficulties with basic methods in COV which included failure to control variables, assuming that only one variable can influence a system’s behavior, and rejection of entire sets of data due to a few uncontrolled

experiments. Boudreaux et al. concluded that students and teachers typically understand

that it is important to control variables but often encounter difficulties in implementing

the appropriate COV strategies to interpret experimental results. Other research has

analyzed the capabilities of younger students. Chen and Klahr (1999) asked students to

design experiments involving a ball rolling down a ramp to test a given variable and then

state what they could conclude from the outcomes. With increasing complexity by

involving more variables in contexts of ramps, springs, and sinking objects, Penner and

Klahr (1996) and Toth, Klahr, and Chen (2000) had students design and conduct

experiments, justify their choices, and consolidate and summarize their findings.
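
As a concrete illustration of the control-of-variables logic examined in these studies, the short sketch below (a hypothetical Python example, not code from the cited work; the trial attributes are invented) checks whether two ramp trials form a controlled comparison, that is, whether they differ in the target variable and in nothing else.

    from itertools import combinations

    # Hypothetical ramp trials in the style of Chen and Klahr (1999); each trial
    # records the settings of four variables.
    trials = [
        {"surface": "smooth", "steepness": "high", "length": "long", "ball": "rubber"},
        {"surface": "rough",  "steepness": "high", "length": "long", "ball": "rubber"},
        {"surface": "rough",  "steepness": "low",  "length": "long", "ball": "steel"},
    ]

    def is_controlled_comparison(a: dict, b: dict, target: str) -> bool:
        """True when the two trials differ in the target variable and nothing else."""
        differing = [k for k in a if a[k] != b[k]]
        return differing == [target]

    for t1, t2 in combinations(trials, 2):
        ok = is_controlled_comparison(t1, t2, "surface")
        print("valid test of surface" if ok else "confounded comparison", t1, t2)

Only the first pair of trials isolates the surface variable; the other pairings vary several variables at once and therefore cannot support a conclusion about any single one of them.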

Kuhn (2007) tied together COV and causal reasoning. This study had fourth-graders

use computer software to run experiments relating to earthquakes (and ocean voyages).

This study used more variables than previous studies mentioned and asked students to

determine whether each variable was causal, non-causal, or indeterminate. Kuhn found

that students made progress in learning COV skills but struggled when faced with

handling multivariable causality.

Student understanding of probability has been studied as well. Fox and Levav (2004)

focused on conditional probability and found that irrelevant information and problem

wording influenced responses. Denison et al. (2006) found that children as young as four years old can understand random sampling and perform simple probability tasks.

Khazanov (2005) reported that college students hold misconceptions about probability and that this can interfere with their learning of inferential statistics. Interestingly,

Khazanov found that confronting these misconceptions led to better results than traditional instruction.

Correlational reasoning has been widely studied. Many groups have found that prior beliefs about the relationship between two variables have a high influence on student judgment of the correlation between those variables (e.g. Kuhn, Amsel, & O’Loughlin,

1988). Other research shows that subjects have difficulty in handling negative correlations (e.g. Erlick, 1966, and Batanero, Estepa, & Godino, 1997). Furthermore, subjects have a tendency to make causal claims based on a correlational relationship (e.g.

Shaklee & Tucker, 1980). Carlson et al. (2002) found that students have trouble creating and interpreting graphical displays relating to correlation. Batanero, Estepa, &

Godino (1997) found that use of technology seems to improve the strategies that subjects use to analyze correlation.

Hypothetical deductive reasoning is a skill that Lawson (2000) describes as a pattern of ‘‘If…And…Then…And/But…Therefore…’’ seen in experimentation. For example, if a ball is denser than water and the ball is placed in a bucket of water, then it will sink to the bottom; but it is observed that the ball does not sink; therefore the ball is not denser than water. This system of reasoning can be applied to any experiment. Lawson et al.

(2000) studied biology students’ use of hypothetical deductive reasoning and labeled three stages of development of this skill (not able to test hypotheses, able to test hypotheses for observable causal agents, and able to test hypotheses for unobservable causal agents). They found that instruction that targeted hypothetical deductive strategies improved student performance on hypothetical deductive reasoning assessment items.

There was also a positive correlation between hypothetical deductive reasoning ability

and content performance, but Lawson et al. note that content knowledge is not enough to

ensure scientific reasoning ability.
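
To make the “If…And…Then…But…Therefore” pattern concrete, the minimal Python sketch below walks through the ball-and-water example given above; the predict() rule and the recorded observation are illustrative assumptions, not items from the Lawson test.

    def predict(ball_denser_than_water: bool) -> str:
        # IF the ball is denser than water AND it is placed in a bucket of water,
        # THEN it should sink; otherwise it should float.
        return "sinks" if ball_denser_than_water else "floats"

    hypothesis = True               # claim under test: "the ball is denser than water"
    expected = predict(hypothesis)  # -> "sinks"
    observed = "floats"             # BUT the ball is observed to float

    # THEREFORE: the observation contradicts the prediction, so the hypothesis is rejected.
    hypothesis_supported = (observed == expected)
    print("prediction:", expected, "| observation:", observed,
          "| supported:", hypothesis_supported)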

Regarding the development of scientific reasoning skills in formal and informal education settings, the National Research Council’s (1996) National Science Education Standards clearly lay out the scientific methods and skills that students are expected to learn at different grade levels. For example, these standards declare that in fifth

through eighth grades, students should learn how to analyze evidence and data, design

and conduct experiments, and think critically and logically in making connections

between data and explanations.

As science has continued to become fundamental to modern society, there is a

growing need to pass on the essential aspects of scientific inquiry and with it the need to

better impart such knowledge. Previous studies have indicated that scientific reasoning is

critical in enabling the successful management of real-world situations in professions

beyond the classroom. For example, in K-12 education, the development of scientific

reasoning skills has been shown to have a long-term impact on student academic

achievement (Adey & Shayer, 1994). Positive correlations between student scientific

reasoning abilities and measures of students’ gains in learning science content have been

reported (Coletta & Phillips, 2005), and reasoning ability has been shown to be a better

predictor of success in biology courses than prior biology knowledge (Johnson &

Lawson, 1998). The above findings support the consensus of the science education community on the need for K-12 students to develop an adequate level of scientific

reasoning skill along with a solid foundation of content knowledge. Zimmerman (2007)

claims that investigation skills and content knowledge bootstrap one another, creating a relationship that underlies the development of scientific thinking. Research has been conducted to determine how these scientific thinking skills can best be fostered and

which teaching strategies contribute most to learning, retention, and transfer of these skills. Zimmerman found that children are more capable of scientific thinking than was originally thought, and that adults are less capable than assumed. She also states that scientific thinking requires a complex set of cognitive skills, the development of which requires much practice and patience. It is important, then, for educators to understand how scientific reasoning abilities develop.

A great deal of work has been done analyzing student use of scientific reasoning skills, and an understanding of these dimensions is important in defining scientific

reasoning in a broad context. Our current work involves compiling scientific reasoning

assessment questions, data, and resources that can be made available to teachers and

researchers. We are developing a new assessment instrument, “Inquiry for Scientific

Thinking and Reasoning” (iSTAR). A website (www.istarassessment.org) has also been

developed, which is focused on research on scientific reasoning and science learning.

The site contains compilations of existing research, examples of assessment items, and

thorough descriptions of the scientific reasoning dimensions (iSTARAssessment.org).

1.2 Why is Scientific Reasoning Important?

Science inquiry has been widely accepted as the core component of STEM education.

Since scientific reasoning represents a set of skills and abilities that are necessary for

successfully conducting science inquiry tasks, it has also been widely emphasized in science education standards and curricula. Much research has also been conducted to understand how scientific reasoning interacts with other areas of learning. For example, research has shown that scientific reasoning skills have a long-term impact on student academic achievement (Adey & Shayer, 1994). Researchers have found positive correlations between student scientific reasoning abilities and measures of learning gains

in science content (Coletta & Phillips, 2005; Lawson et al., 2000). Another study found

that students who learned probability in an inquiry-based environment outperformed

students who learned in a traditional environment (Vahey, Enyedy, & Gifford, 2000).

Shayer and Adey (1993) performed a study comparing students who received scientific

reasoning-based teaching with those who did not. Three years after the lessons occurred,

the reasoning-based group outperformed the control group on tests in not only science but

also English and mathematics. Shayer and Adey argue that instruction in scientific

reasoning has a permanent impact on general learning ability.

Scientific reasoning skills are also important because they enter every domain of

society. Their place is evident in the educational domain. A desire for students to

acquire scientific thinking skills is driving some curriculum development. National

standards in education outline various skills that students should have at each grade level

(NRC, 1996). While scientific reasoning skills typically fall under the science education

standards, teachers in any classroom can promote creative thinking and inquiry learning.

In any subject, teachers have the option of teaching to the test or using an inquiry-based environment to help students develop a full set of skills that can be used beyond the classroom.

Scientists are not the only people who use scientific reasoning skills on the job. In the workplace domain, employers look for individuals who can learn new tasks and utilize problem solving skills. Scientific reasoning skills are the tools that allow one to obtain new knowledge and think critically. Furthermore, inquiry learning can generate an appreciation for exploration that makes students eager and able to try new things, learn from mistakes, and be their own teachers, which is what employers want. Bauerlein

(2010) reports results from a study that found 89% of employers said written and oral communication skills are the most important skills for employees to have; 81% of employers listed critical thinking and analytical reasoning. Clearly, scientific reasoning skills are necessary to be competitive in the working world.

Finally, in the social domain, those with scientific thinking skills are capable of handling the wealth of information presented to them on a daily basis. Advertisements, political campaigns, and scientific reports made to the general public all use data to convince the consumer, voter, or citizen of a message. It is important to take a step back and analyze the information, and scientific reasoning skills make this possible.

Certain reasoning skills help in everyday decision-making and problem-solving.

Ratio skills are used in determining gas mileage or finding the cheapest brand at the grocery store. Inductive reasoning is used whenever a conclusion is made from limited observations and information. Causal reasoning and probability are used in predicting

weather and assessing insurance rates, among many other things. Hypothetical deductive reasoning skills are used in everyday problem-solving. For example, if you are trying to figure out why your television remote is not working, you may test the hypothesis that the batteries are dead by inserting new batteries. If this solves the problem, the experiment is done; if it does not, a new hypothesis is developed. While one may not be explicitly aware of it, hypothetical deductive reasoning is being used. This is true of many of these skills – they become part of one’s set of abilities and are used automatically.

1.3 How is Scientific Reasoning Learned?

Scientific reasoning ties in very closely with science inquiry. In developing scientific reasoning, research has shown that inquiry-based science instruction can promote scientific reasoning abilities (Adey and Shayer, 1990; Lawson, 1995; Marek and Cavallo,

1997; Benford and Lawson, 2001; Gerber, Cavallo and Marek, 2001). Additionally, studies have shown that students had higher gains on scientific reasoning abilities in inquiry classrooms over non-inquiry classrooms (Bao et al., 2009). Examples of such learning settings include Physics by Inquiry (McDermott et al., 1996), RealTime Physics

(Sokoloff, Thornton, & Laws, 2004), the CUPLE (Comprehensive Unified Physics

Learning Environment) Physics Studio (Wilson, 1994), and The SCALE-UP (Student-

Centered Activities for Large Enrollment Undergraduate Programs) Project (Beichner,

2008). The goal of these classrooms is to engage students in a way that fosters the development of scientific reasoning skills. Such skills are not inherently learned by the

student, and rigorous scientific education is not enough. It is not what is taught, but

rather how it is taught, that makes the difference (Bao et al., 2009). Scientific reasoning

skills need to be directly addressed during the course (Schwartz, Lederman, & Crawford,

2004). In a study of pre-service teachers, Schwartz et al. found that providing explicit

opportunities to reflect on scientific reasoning (using journals) strengthened views on

reasoning. Teachers serve as guides, but having students take time to discuss and reflect on scientific reasoning has a positive impact. The role of the teacher is still important,

though, as instructors with higher levels of scientific reasoning skills are found to be

more effective in using inquiry methods in teaching science courses (Benford & Lawson, 2001).

1.4 How is Scientific Reasoning Assessed?

In order to understand if students are learning scientific reasoning skills, it is

important to have an accurate assessment instrument. Such tools need to be easy to use,

practical, and applicable to a variety of educational settings.

Traditionally, the Piagetian clinical interview is used to assess students' formal

reasoning abilities, but such a method requires experienced interviewers, special

materials and equipment, and is usually time consuming (Inhelder & Piaget, 1958;

Lawson, 1978). A number of researchers have used the Piagetian method as a basis for developing their own measurement tools in assessing students' scientific reasoning abilities. Outcomes of this work include the Group Assessment of Logical Thinking Test

(GALT) (Roadrangka, Yeany, & Padilla, 1982), the Test of Logical Thinking (TOLT)

(Tobin & Capie, 1981), and the Lawson's Classroom Test of Scientific Reasoning

(Lawson, 1978).

Among the various assessment instruments, the Lawson Test has gained wide

popularity in science education communities. In the development of his test, Lawson

(1978) aimed for a balance between the convenience of paper and pencil tests and the

positive factors of interview tasks. Test items were based on several categories of

scientific reasoning: isolation and control of variables, combinatorial reasoning,

correlational reasoning, probabilistic reasoning, and proportional reasoning. The original

format of the test had an instructor perform a demonstration in front of a class, after

which the instructor would pose a question to the entire class and the students would mark their answers in a test booklet. The booklet contained the questions followed by several answer choices. For each of the 15 test items, students had to choose the correct

answer and provide a reasonable explanation in order to receive credit for that item. In its current form, the Lawson test is a 24-item, two-tier, multiple choice test. Treagust

(1995) describes a two-tier item as a question with some possible answers followed by a second question giving possible reasons for the response to the first question. The reasoning options are based on student misconceptions that are discovered via free response tests, interviews, and the literature.
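
Later chapters (e.g., Table 3.2) compare “paired-score” and “single-question-score” methods for such two-tier items. The sketch below is a minimal Python illustration of that distinction, assuming a hypothetical two-item answer key; it is not the scoring code used in this work.

    from typing import Dict, Tuple

    # Hypothetical answer key: item id -> (answer-tier choice, reasoning-tier choice).
    ANSWER_KEY: Dict[str, Tuple[str, str]] = {
        "item1": ("B", "C"),
        "item2": ("A", "D"),
    }

    def paired_score(responses: Dict[str, Tuple[str, str]]) -> float:
        """One point per item pair, awarded only when both tiers are correct."""
        correct = sum(1 for item, resp in responses.items()
                      if ANSWER_KEY.get(item) == resp)
        return correct / len(ANSWER_KEY)

    def single_question_score(responses: Dict[str, Tuple[str, str]]) -> float:
        """Each tier counted as an independent question, so half credit is possible."""
        points = 0
        for item, (ans, why) in responses.items():
            key_ans, key_why = ANSWER_KEY[item]
            points += (ans == key_ans) + (why == key_why)
        return points / (2 * len(ANSWER_KEY))

    student = {"item1": ("B", "C"), "item2": ("A", "B")}  # right answer, wrong reasoning on item2
    print(paired_score(student))           # 0.5
    print(single_question_score(student))  # 0.75

Under paired scoring the second item earns nothing, while under single-question scoring it earns half credit; this difference is the kind of discrepancy examined in the later analysis of two-tier score patterns.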

In physics, many education researchers have been using the Lawson test to study the relations between students’ scientific reasoning abilities and physics learning. In a recent study, Coletta and Phillips (2005 & 2007) reported significant correlations (r ≈ 0.5) between pre-post normalized gain on the Force Concept Inventory and students’

reasoning abilities measured with Lawson’s test. However, research to critically inspect

the existing assessment instruments and to develop new instruments for scientific

reasoning is largely missing in the literature.

Regarding the Lawson test, although some studies (Stefanich et al., 1983; Pratt & Hacker, 1984) have investigated the validity of the 1978 version of the formal reasoning test, there is little work on its 2000 version. Even though the 2000 edition has become a standard assessment tool in physics education research, the test itself has not been systematically validated. Through research, we have also observed several issues concerning the question designs and data interpretations. In Chapter 2, I will give a more in-depth review of the different assessment instruments on scientific reasoning and discuss in detail the validity and reliability of the Lawson’s test.

1.5 Scientific Reasoning, an Important Component of the 21st Century Skills

We live in an ever-changing world – demographic change, rise of automation and

workforce structural change, globalization, and corporate change are some major driving

forces that demand fundamental transformations in education and skills on an individual

level. Across the globe, work is becoming increasingly bi-polar with jobs sorting out into

two clusters - a low-wage, lower-skilled, routine work cluster, going to the lowest global

bidder qualified to do the work, and increasingly to automation; and a fast growing, high-

paying, creative work cluster requiring a combination of complex technical skills like

problem-solving and critical thinking, and strong people skills like collaboration and

clear communication. In the U.S., the demand for non-routine skills (expert thinking and

complex communication) is rising fast, as the need for routine and manual skills falls

(1960-2002).

Advances in digital technology and telecommunications now enable companies to

send work and tasks to be done wherever they can be completed best and most cheaply.

Meanwhile, political and economic changes in developing countries such as India, China and Mexico have freed up many more workers who can adequately perform such jobs. As a result, not only do Americans have to compete for jobs with foreigners in a rising global labor market, but increasing competition will also center on highly skilled workers for more intellectually demanding and higher paying jobs.

Due to technology development and globalization, companies have gone through radical restructuring, with less hierarchy and lighter supervision, where workers experience greater autonomy and personal responsibility. Work has also become much more collaborative, and employees must adapt to new challenges and demands when tackling projects and solving problems.

Consequently, a growing number of educators, business leaders and politicians have called for “21st century skills” to be taught as part of everyone’s education. Global

competition, increased access to technology, digital information and tools are increasing

the importance of 21st century knowledge-and-skills, which are critical for a country’s

economic success. Advocates base their arguments on a widening gap between the

knowledge and skills acquired in school and the knowledge and skills required in 21st century workplaces. That is, today’s curricula do not adequately prepare students to live and work in a technology-based economy and globalized society. Thus, in order to

successfully face career challenges and a globally competitive workforce, schools must be aligned with real world environments by infusing 21st century skills in education practices.

1.5.1 Skills Gap between Schools and Workplaces

Previous studies have demonstrated a huge skill gap between schools and workplace requirements. In the 2005 Skills Gap Report, when manufacturing employers were asked which types of skills their employees would need more of over the next three years, basic employability skills (attendance, timeliness, work ethic, etc.) and technical skills were the areas most commonly selected (53%). Following these were reading/writing/communication skills, with 51% of respondents saying they would need more of these skills over the next three years. Beyond these, a number of related skills characteristic of high-performance workforces will be needed over the next several years, such as the ability to work in teams (47%), strong computer skills (40%), the ability to read and translate diagrams and flow charts (39%), strong supervisory and managerial skills (37%), and innovative/creative abilities (31%).

Moreover, manufacturing employers see training as a business necessity and their spending on training is increasing – not just for executives, but across all employee groups. The types of training that most employees receive are technical and basic skills training. The next tier of training covers problem solving, teamwork, leadership, computer skills, basic or advanced mathematics, basic reading and writing, and interpersonal skills – all standard skills for high-performance workforces.

Another landmark 2006 research study among more than 400 employers, Are They

Really Ready to Work?, (conducted by Corporate Voices for Working Families, the

Conference Board, the Partnership for 21st Century Skills, and the Society for Human

Resource Management), clearly spotlighted employers’ concerns about the lack of

preparedness of new entrants into the workforce regardless of the level of educational attainment. More specifically, the deficiencies are greatest at the high school level, with

42.4% of employers reporting the overall preparation of high school graduates as deficient; 80.9% reporting deficiencies in written communications; 70.3% citing deficiencies in professionalism; and 69.6% reporting deficiencies in critical thinking.

Although preparedness increases with educational level, employers noted significant deficiencies remaining at the four-year college level in written communication (27.8%), leadership (23.8%) and professionalism (18.6%). In addition, employers reported that the top five most important skills are critical thinking and problem solving, information technology, teamwork/collaboration, creativity/innovation, and diversity.

A more recent study, “Across the Great Divide”, released March 2011, surveyed 450 businesses and 751 post-secondary educational institutions and found concerning disparities between the goals of higher education and what businesses sought in workers.

The skill gap exists along the entire learning-career continuum – colleges, businesses and the students all had different expectations of what was needed to prepare a workforce for today’s and tomorrow’s jobs. According to the report, employers indicated they believed the most important goal of a four-year degree was to prepare individuals for "success in the workplace" (56%). On the other hand, educational leaders saw higher education as a

way of providing individuals with "core academic knowledge and skills" (64%). The

study also found that only 15% of the businesses believed hiring those with an associate degree was a good return on investment for their companies.

Both workers and employers believe that the education sector has the primary responsibility to close the workforce readiness gap. Yet, as surveys indicated, a majority of companies do not believe schools are doing a good job preparing students for the workplace. Therefore, continuing contact between schools and businesses is critical to developing a prepared workforce. And it is essential for business leaders, policy makers and educators to work together to address the workforce readiness gap.

1.5.2 What are 21st Century Skills?

So what exactly are 21st century skills? The P21 (Partnership for 21st Century Skills - a group of corporations who partnered with the U.S. Department of Education in 2002)

has created a framework that identifies the key skills for success. Based on their

categorization and definition, ten skills have been identified as the 21st Century skills, in

four groups:

Ways of Thinking

1. Creativity and innovation

2. Critical thinking, problem solving, decision making

3. Learning to learn, Metacognition

Ways of Working

4. Communication

5. Collaboration (teamwork)

Tools for Working

6. Information literacy

7. ICT literacy

Living in the World

8. Citizenship – local and global

9. Life and career

10. Personal & social responsibility – including cultural awareness and competence

The essence of these skills includes collaboration, communication, creativity and

innovation, and critical thinking, coined the 4Cs by P21. Many other researchers and

authors created lists similar to the 4Cs. For example, Tony Wagner from the Harvard

Graduate School of Education interviewed more than 600 chief executive officers, and

asked them the same essential question: “Which qualities will our graduates need in the

21st-century for success in college, careers and citizenship?” Wagner's subsequent Seven

Survival Skills correspond to the 4Cs but also include agility and adaptability, accessing and analyzing information, as well as curiosity and imagination.

There is agreement among all researchers that these skills of collaboration, communication, creativity and critical thinking are necessary and must be integrated into the classrooms. Indeed, states are adopting new standards to ensure these skills are met.

For example, Common Core State Standards have been adopted by most states and several territories in the United States. Common Core State Standards are designed to

provide a national, standardized set of academic standards (organized around 21st century skills) as an alternative to those previously developed by the states on an individual basis.

The Common Core Standards are intended to be more rigorous, demand higher-order thinking, introduce some concepts at an earlier age, and allow for interstate comparisons.

On the other hand, the modern workplace and lifestyle demand that students balance cognitive, personal, and interpersonal abilities, but current education policy discussions have not defined those abilities well, according to a special report released by the

National Research Council of the National Academies of Science in Washington. Based

on the report, 21st century skills generally fall into three categories:

• Cognitive skills, such as critical thinking and analytic reasoning, which in the

context of STEM learning are established as scientific reasoning skills;

• Interpersonal skills, such as teamwork and complex communication; and

• Intrapersonal skills, such as resiliency and conscientiousness (the latter of which

has also been strongly associated with good career earnings and healthy

lifestyles).

A relevant concept that we often hear is “21st century learning skills.” So what is it?

Ted Lai, Director of Information Technology for the Fullerton Elementary School

District puts it this way:

"In a nutshell, these are the skills that will help people be globally competitive in the

21st Century. Especially with our students, these are skills that include not only the

curricular standards but also a host of other essential skills like communication,

collaboration, and creativity. Literacy doesn’t merely refer to the ability to read and

write but also the ability to evaluate and synthesize information, media, and other

technology. At the heart of 21st Century Learning, in my opinion, is the piece on

creating authentic projects and constructing knowledge… essentially making

connections between learning and the real world!"

Clearly, “21st century skills” has become the latest buzzword in education, which has also

re-kindled a long-standing debate about content vs. skills. Among the three major

categories of 21st century skills, scientific reasoning is a core component of the

“Cognitive Skills”. The existing research on scientific reasoning fully supports the

current movement towards training skills rather than content as the goal of 21st century education, and provides practical pedagogy, instruments, and curriculum for developing the 21st century skills.

Although reading, writing, mathematics and science are cornerstones of today’s

education, curricula must go further to include skills such as scientific reasoning, critical

thinking, collaboration and digital literacy that will prepare students for 21st-century employment and ensure students’ success in the real world. Establishing new forms of assessment can begin a fundamental change in how we approach education worldwide.

1.6 Outline of the Thesis

The research discussed in this thesis focuses on the assessment aspect of scientific reasoning, which is organized in five main parts. Chapter 2 gives a detailed review of the

related literature on prior research and existing assessment instruments on scientific

reasoning. In current literature, the Lawson’s test of scientific reasoning is the most widely used quantitative tool in assessing scientific reasoning. However, the test’s validity has not been thoroughly studied. Chapter 3 introduces a study to evaluate the validity of the Lawson’s test. The research has shown a number of test design issues with the current version of the Lawson’s test and also suggested ways to improve the instruments. In Chapter 4, I discuss the study on mapping out a longitudinal developmental scale, from 3rd grade to graduate level, of scientific reasoning measured with

the Lawson’s test. The developmental trends of students from both USA and China are

also compared. Chapter 5 introduces a data-mining study of Lawson’s test data, which

helps identify learning progression behaviors of selected scientific reasoning skills. The

results also provide evidence for researchers to evaluate and model the scoring methods

of two-tiered questions used in the Lawson’s test. Chapter 6 gives another case study that

investigates the learning progression of the skill of control of variables (COV), which

showed a series of fine grained intermediate levels of COV skills. The thesis ends with

Chapter 7, which summarizes the entire scope of the work and makes suggestions for future work and development.

Chapter 2. Research on Assessment of Scientific Reasoning

2.1 Theoretical Background of Scientific Reasoning

Research on scientific reasoning is rooted in the early studies on cognitive

development of “formal reasoning” (Piaget, 1965) and “critical thinking” (Hawkins &

Pea, 1987). Traditionally, the Piagetian clinical interview is used to assess students’

formal reasoning abilities. In Piaget’s cognitive developmental theory, an individual

moves to the next cognitive level when presented with challenges in the environment that

cause him or her to change, to alter his or her mental structures in order to meet those

challenges (Fowler, 1981). Piaget used the word schema to refer to anything that is

generalizable and repeatable in an action (Piaget & Inhelder, 1969). As children grow and

mature, these mental structures are described as organized abstract mental operations

actively constructed by the children.

As their cognitive structures change, so do their adaptation techniques, and these

periods of time in a child’s life are referred to as stages. The first is the sensorimotor stage of the children 2 years of age and younger (Piaget & Inhelder, 1969), an important period of time when the child is constructing all of the necessary cognitive substructures for later periods of development. These constructions, without representation or thought,

are developed through movement and perceptions. The movements and reflexes of the

child in this period form habits that later form intelligence. This happens through 6

successive sub-stages: modification of reflexes, primary circular reactions, secondary

circular reactions, coordination of secondary schemas, tertiary circular reactions, and invention of new means through mental combinations (Miller, 2002). During this stage, three important concepts are believed to be acquired (a) object permanence, when the child understands the object did not cease to exist just because it is hidden from view; (b)

space and time, important to solving “detour” problems; (c) causality, which is when the

child begins to realize cause and effect by his or her own actions and in various other

objects (Piaget & Inhelder, 1969).

The second is the preoperational stage of 2- to 7-year-old children, which transitions from

the sensorimotor period with the development of mental representations through semiotic

function, where one object stands for another (Miller, 2002). Signs and symbols are

learned as similar objects and events that signify real ones. Though mental representation

has advanced from its previous stage, children in this period cannot think in reversible

terms (Piaget & Inhelder, 1969). Miller helps to describe other characteristics of this

level, including rigidity of thought, semilogical reasoning, and limited social cognition.

Rigidity of thought is best described with the example of two identical containers that

have equal amounts of liquid. When the contents of a container are poured into a thinner

and taller container or shorter and wider container, children at this level freeze their

thought on the height and assume the volume is more or less, depending on the height of

the container. The height becomes their only focus, rather than the transition of volume.

If the liquid is poured from one container into another, children focus on the states of the containers rather than the process of pouring the same amount of liquid.

Cognitively, children are unable to reverse direction of the poured liquid and imagine it being poured back into the original container and containing the same amount. They can, however, understand the identity of the liquid, that it may be poured from one container to another and still be the same kind of liquid. In this level, causal relationships are better understood outside of self, as pulling the cord more makes the curtain open more, though they may not be able to explain how it happened. Rather than thinking logically, children in this level reason semi-logically, often explaining natural events by human behavior or as tied to human activities (Miller, 2002).

Most children ages 8 to 11 are often categorized as being in the concrete operational stage in Piaget’s theory of cognitive development. According to Miller (2002, p. 52) the mental representations of children in this concrete operational period come alive with the ability to use operations, “an internalized mental action that is part of an organized structure.” In the example of the liquid in containers, children now understand the process and can reason the liquid is the same amount though in different sized containers. This ability to use operations may come at different times during this period. Concrete children begin to better understand reversibility and conservation. Classifications based on the understanding of sizes of an included class to the entire class are achieved (Piaget &

Inhelder, 1969). Relations and temporal-spatial representations are additional operations evident in concrete operational children (e.g., children can understand differences in

height and length and include the earth’s surface in drawing their perception of things).

All of these operations strengthen gradually over time.

The formal operational period is the fourth and final period of cognitive development in Piaget's theory (Piaget & Inhelder, 1969). This stage, which follows the concrete operational stage, commences at around 11 years of age and continues into adulthood. In this stage, individuals move beyond concrete experiences and begin to think abstractly, reason logically, and draw conclusions from the information available, as well as apply all these processes to hypothetical situations. Rather than simply acknowledging the results of concrete operations, individuals in this final period can provide hypotheses about their relations based on logic and abstract thought. This abstract thought looks more like the scientific method than did thought in previous periods. In the concrete operational period, children could observe operations but lacked the ability to explain the process. In the formal operational period, they are able to problem-solve and imagine multiple outcomes. One of Piaget's common tasks for determining whether a child has reached formal operational thought is the pendulum problem.

The formal operational thinker demonstrates hypothetico-deductive thought by imagining all of the possible rates at which the pendulum may oscillate, observing and keeping track of the possible results, and ultimately arriving at possible conclusions (Piaget & Inhelder, 1969).

As adolescents grow into and through adulthood, formal operations continue to develop and abstract thought is applied to more situations. Miller contends that Piaget ended his periods of developmental logical thought with formal operations; beyond this point, individuals' thought changes only in content and stability rather than in structure.

2.2 Existing Research and Tools on Assessment of Scientific Reasoning

In the early works on measurement of cognitive development, Piaget used multiple

problems to test a child's operations of thought (Piaget & Inhelder, 1969). Miller (2002)

defined Piaget's methodology as the “clinical method,” which involves a chainlike verbal

interaction between the experimenter and the child. In this interaction, the experimenter

asks a question or poses a problem, and the subsequent questions are then asked based on

the response the child gave to the previous question. Piaget developed this interaction in

order to understand the reasoning behind the children's answers.

Cook and Cook (2005) noted that through Piagetian tasks, Piaget could better

understand preoperational children's thinking. He found these children showed centration,

focusing on only one thing at a time rather than thinking of several aspects. This means

they were centered on the static endpoints, the before and after, rather than the process.

The next aspect of logical thinking noticed in Piaget's finding was preoperational

children's lack of a sense of reversibility. The task of liquid conservation is simple to the

logical thinking child. Water from a short and wide container is poured into a tall and

skinny container. A preoperational thinker would focus only on the height of the liquid

and the fact that the water was first low, then it was at a higher level in the second

container; therefore, there must be more water in the second container. Lacking a grasp of reversibility, the preoperational child does not have the true operational thought that would allow him or her to imagine the pour reversed and realize the same amount of water is in both containers. The other two conservation tasks are similar to the liquid task. They each show a beginning state, a transformation, and an ending state where something has changed. The importance of children's operational and newer logical thought "is not so much that children are no longer deceived by the problem, but rather that they have now learned some basic logical rules that become evident in much of their thinking"

(Lefrancois, 2001, p. 383).

Guided by Piagetian tasks, a number of researchers (Lawson, 1978a; Shayer & Adey, 1981; Tisher & Dale, 1975) have developed their own instruments for assessing students' scientific reasoning abilities, such as the Group Assessment of Logical Thinking (GALT) (Roadrangka, Yeany, & Padilla, 1982), the Test of Logical Thinking (TOLT) (Tobin & Capie, 1981), and Lawson's Classroom Test of Scientific Reasoning (LCTSR) (Lawson, 1978). Below, I briefly review the three instruments and their measures. A more detailed discussion of the validity of the Lawson test is given in Chapter 3.

2.2.1 Group Assessment of Logical Thinking (GALT)

Roadrangka, Yeany, and Padilla (1983) compiled reliable and valid test items for the

Group Assessment of Logical Thinking (GALT). In the pilot testing, Piagetian interview tasks were administered to a sub-sample of students for purposes of validation. The 21-item GALT test is given in Appendix A. The first 18 items present multiple-choice problems to be answered by the individual as well as a selection of reasoning choices to support his or her answer. The final three items are scored based on the child's inclusion of all possible answers and on the patterns used to classify those answers.

GALT measures six logical operations: conservation, correlational reasoning, proportional reasoning, controlling variables, probabilistic reasoning, and combinatorial reasoning. The test uses a multiple-choice style to present answers and the possible reasoning behind those answers. The GALT is sufficiently reliable and valid in its ability to distinguish between students at Piagetian stages of development. Its validity was examined by administering the GALT to students and administering Piagetian interview tasks to a sub-sample of those students; a strong correlation, r = .80, was found (Roadrangka et al., 1983). The selection of questions from other reliable and valid instruments helped make this a reliable and valid assessment. The Cronbach's reliability coefficient for internal consistency of the GALT was reported as α = .62–.70 (Bunce & Hutchinson,

One of the six modes measures concrete operations and the other five measure formal

operations (Bunce et al., 1993). The answers to the GALT items 1 to 18 were considered

correct only if the best answer and reason were both correct. For item 19, children must

(1) show a pattern and (2) have no more than one error or omission, and for item 20,

children must also show a pattern in answers given, having no more than two errors or

omissions. To be labeled as concrete operational thinkers, children had to score 0 to 4. A score of 5 to 7 indicated transitional thinkers, and children who scored 8 to 12 were labeled abstract operational thinkers (Roadrangka et al., 1983).

Researchers, predominantly in the field of science education, have utilized the GALT to determine a developmental level against which to gauge student performance, phases in the learning cycle, and cognitive/motivational characteristics. In addition, researchers have administered the GALT to determine the best method of teaching a particular subject based on students' logical thinking ability (Niaz & Robinson, 1992; Allard & Barman, 1994; Kang, Scharmann, Noh, & Koh, 2005). Using the GALT, Allard and Barman (1994) assessed the reasoning of 48 college biology students and found that 54% of these students would benefit from concrete methods of instruction. A further sample of 101 science students in a basic science course showed that 72% of those students would benefit from concrete methods rather than a traditional lecture approach in the classroom.

2.2.2 The Test of Logical Thinking (TOLT)

The Test of Logical Thinking (TOLT) is a 10-item test developed by Tobin and Capie

(1981). It measures five skill dimensions of reasoning: proportional reasoning, controlling variables, probabilistic reasoning, correlational reasoning, and combinational reasoning. A high internal consistency reliability (α = 0.85) and a reasonably strong one-factor solution obtained from factor analysis of performance on the 10 items suggested that the items were measuring a common underlying dimension. The test is included in Appendix B. The items bear many similarities to those used in the GALT and the Lawson test and therefore will not be discussed in detail.

2.2.3 Lawson's Classroom Test of Scientific Reasoning (LCTSR)

Lawson (1978) originally designed his test of formal reasoning to address the need for a reliable, convenient assessment tool that would allow for diagnosis of a student’s developmental level. A valid form of measurement prior to the Lawson Test was the administration of Piagetian tasks. This method, however, is time-consuming and requires

experienced interviewers, special materials, and equipment. A paper and pencil test would be more practical for classroom use, but there are also problems with this method.

Paper-and-pencil tests require reading and writing ability, give test takers no hands-on materials or equipment to motivate them, and lack the relaxed setting of a clinical interview.

In the development of his test, Lawson (1978) aimed for a balance between the convenience of paper and pencil tests and the positive factors of interview tasks. He

studied eighth- through tenth-grade students to determine their scientific reasoning skill

level. Lawson breaks scientific reasoning into several categories: isolation and control of variables, combinatorial reasoning, correlational reasoning, probabilistic reasoning, and proportional reasoning. Test items were based on these dimensions. The original format of the test had an instructor perform a demonstration in front of a class, after which the instructor would pose a question to the entire class and the students would mark their answers in a test booklet. The booklet contained the questions followed by several answer choices. For each of the 15 test items, students had to choose the correct answer and provide a reasonable explanation in order to receive credit for that item.

To establish the validity of his test, Lawson (1978) compared test scores to responses to interview tasks, which were known to reflect the three established levels of reasoning

(concrete, transitional, and formal). He found that the majority of students were classified at the same level by both the test and the interview tasks, but that the classroom test may slightly underestimate student abilities. Validity was further established by referencing previous research on what the test items were supposed to measure, as well as by performing item analysis and principal-components analysis. The reliability of the Lawson test (Ver. 2000) has been evaluated by researchers who have used the test; typical internal consistency in terms of Cronbach's α ranges from 0.61 to 0.78 (Lee & She, 2010).

The most widely used version of Lawson's Classroom Test of Scientific Reasoning was released in the year 2000. It is a 24-item, two-tier, multiple-choice test. Treagust (1995)

describes a two-tier item as a question with some possible answers followed by a second

question giving possible reasons for the response to the first question. The reasoning

options are based on student misconceptions that are discovered via free response tests,

interviews, and the literature.

In the 2000 version, combinational reasoning is replaced with correlational reasoning and hypothetical-deductive reasoning. The test is also converted into a pure multiple-choice format containing 24 items in 12 pairs. The changes in the target skill dimensions and items are summarized in Table 2.1. With a typical two-tier structure, each of the first 10 pairs (items 1-20) begins with a question about a reasoning outcome followed by a question soliciting students' judgment on several statements of reasoning explanations. Items 21-24 are also structured in two pairs, designed to assess students' hypothetical-deductive reasoning skills concerning unobservable entities (Lawson, 2000). Partially because of the different pathways of the hypothesis-testing processes, these two pairs follow different response patterns. In item pair 21-22, the lead question asks for the selection of an experimental design suitable for testing a set of given hypotheses, and the follow-up question asks students to identify the data pattern that would help draw conclusions about the hypotheses. In item pair 23-24, both questions ask students to identify the data pattern that would support conclusions about the given hypotheses.

The Lawson’s test is widely used in the science education community. Based on literature and through our own research we have also observed a number of issues in the

Lawson’s test regarding its question design and validity. In Chapter 3, I will give a more detailed discussion on the observed issues with the Lawson’s test and possible solutions to improve the assessment. Since all the existing instruments were developed over four decades ago and many have limitations and design issues, it is then important to further develop a more up-to-date version of assessment on scientific reasoning for the education and research community. To do so, we first determine an extended set of scientific reasoning skills that need to be assessed with the new instrument. The next section gives a list of skill dimensions and brief reviews of the related research.

Scheme Tested                  | Items (1978) | Items (2000)   | Nature of Task
Conservation of weight         | 1            | 1, 2           | Varying the shapes of two identical balls of clay placed on opposite ends of a balance.
Conservation of volume         | 2            | 3, 4           | Examining the displacement volumes of two cylinders of different densities.
Proportional reasoning         | 3, 4         | 5, 6, 7, 8     | Pouring water between wide and narrow cylinders and predicting levels.
Proportional reasoning         | 5, 6         |                | Moving weights on a beam balance and predicting equilibrium positions.
Control of variables           | 7            | 9, 10          | Designing experiments to test the influence of length of string on the period of a pendulum.
Control of variables           | 8            |                | Designing experiments to test the influence of weight of bob on the period of a pendulum.
Control of variables           | 9, 10        |                | Using a ramp and three metal spheres to examine the influences of sphere weight and release position on collisions.
Control of variables           |              | 11, 12, 13, 14 | Using fruit flies and tubes to examine the influences of red/blue light and gravity on flies' responses.
Combinational reasoning        | 11           |                | Computing combinations of four switches that will turn on a light.
Combinational reasoning        | 12           |                | Listing all possible linear arrangements of four objects representing stores in a shopping center.
Probability                    | 13, 14, 15   | 15, 16, 17, 18 | Predicting chances of withdrawing colored wooden blocks from a sack.
Correlation reasoning          |              | 19, 20         | Predicting whether a correlation exists between the size of the mice and the color of their tails from presented data.
Hypothetic-deductive reasoning |              | 21, 22         | Designing experiments to find out why the water rushes up into the glass after the candle goes out.
Hypothetic-deductive reasoning |              | 23, 24         | Designing experiments to find out why the red blood cells become smaller after adding a few drops of salt water.

Table 2.1. The Comparison of Lawson’s Classroom Test of Formal Reasoning between the 1978 version and the 2000 version.

2.3 Expanding the Dimensions of Skills for Assessment of Scientific Reasoning

In our current research on assessment of scientific reasoning, we focus on a set of basic reasoning skills that are commonly needed for students to systematically conduct scientific inquiry, which includes exploring a problem, formulating and testing hypotheses, manipulating and isolating variables, and observing and evaluating the consequences. Lawson's Classroom Test of Scientific Reasoning (LCTSR) provides a solid starting point for assessing scientific reasoning skills (Lawson, 1978, 2000). The test is designed to examine a small set of dimensions including (1) conservation of matter and volume, (2) proportional reasoning, (3) control of variables, (4) probability reasoning, (5) correlation reasoning, and (6) hypothetical-deductive reasoning. These skills are important concrete components of the broadly defined scientific reasoning ability.

To fully assess students' ability and provide fine-tuned guidance for teachers, we have been working to expand the measurement capability of standardized assessment of scientific reasoning by incorporating sub-categories within the existing skill dimensions as well as new dimensions that are not included in the Lawson test. For example, we have developed questions on conditional probability and Bayesian statistics within the general category of probability reasoning, as well as questions on an extended list of additional skill dimensions such as categorization, combinations, logical reasoning, causal reasoning, and advanced hypothesis forming and testing.
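As a rough illustration of the kind of conditional-probability computation that such items target, consider the following minimal sketch. The scenario and all rates below are hypothetical and chosen only for illustration; they are not taken from any actual test item.

# Hypothetical Bayes computation of the kind targeted by conditional-probability items.
p_condition = 0.01      # prior: 1% of a population has a condition
sensitivity = 0.90      # P(positive test | condition)
false_positive = 0.05   # P(positive test | no condition)

p_positive = p_condition * sensitivity + (1 - p_condition) * false_positive
p_condition_given_positive = p_condition * sensitivity / p_positive
print(round(p_condition_given_positive, 3))  # about 0.154, despite the seemingly accurate test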

In addition, for each skill dimension, multiple questions are designed using a wide variety of scientific and social contexts and with different levels of complexity, so that we can measure students with different backgrounds and strengths from school age through college levels. These new dimensions and designs will improve the measurement capability to target students at a wider range of grade levels and backgrounds, and will also provide more detailed information for researchers and teachers to address the development of scientific reasoning skills and the interactions of these skills with other aspects of learning in STEM education. Based on the literature and our own research, the following dimensions have been identified for assessment of scientific reasoning:

• Control of Variables

• Proportions and Ratios

• Probability

• Correlational Reasoning

• Deductive Reasoning

• Inductive Reasoning

• Causal Reasoning

• Hypothetical-Deductive Reasoning

2.3.1 Control of Variables

In a scientific inquiry process involving many variables, the relationship between the variables needs to be determined. To do so, we form a hypothesis and test it experimentally. When designing these experiments, it is important to design controlled experiments rather than confounded experiments. This means we have to control all other variables in order to analyze the relationship between key variables without

interference. For example, when considering the relationship between age and frequency of delinquent activity, gender has to be treated as a variable to be controlled.

Control of variables is a necessary strategy in designing unconfounded experiments and in determining whether a given experiment is controlled or confounded. Control of

variables strategy is used in creating and conducting experiments. Because variables

interact with each other, the experimenter needs to make inferences in order to

appropriately control variables and interpret the results (Chinn & Hmelo-Silver, 2002). Usually, experimenters focus on the effect of a single variable of interest (Kuhn & Dean, 2005).

Control of variables strategy is used in a logical sense to distinguish controlled and

confounded experiments, which is necessary in determining whether an experiment can

lead to a conclusive result. The logical aspects of control of variables include the ability

to make appropriate inferences from the outcomes of unconfounded experiments and to

understand the inherent indeterminacy of confounded experiments. In short, control of

variables is the fundamental idea underlying the design of unconfounded experiments

from which valid causal inferences can be made (Chen & Klahr, 1999).

Control of variables (COV) is a core construct supporting a wide range of higher-

order scientific thinking skills and it is also an important skill fundamental to

understanding physics concepts and experiments. In a recent study, Boudreaux et al.

(2008) found that college students and in-service teachers had difficulties with basic

methods of COV, which included failure to control variables, assuming that only one variable can influence a system's behavior, and rejecting entire sets of data because of a few uncontrolled experiments. Boudreaux et al. concluded that students and teachers typically understand that it is important to control variables but often encounter difficulties in implementing the appropriate COV strategies to interpret experimental results.

In learning, control of variables is one of the National Research Council's (1996) aspects of "designing and conducting a scientific investigation" (p. 145). Several sub-skills are identified in this category, including "systematic observation, making accurate measurements, and identifying and controlling variables" (p. 145). Control of variables is a component of scientific inquiry, which is broadly understood to mean skill in discovering or constructing knowledge for oneself (Dean & Kuhn, 2007). It is also one of several types of procedural knowledge, or "process skills," that are deemed central to early science instruction (Klahr & Nigam, 2004).

In everyday life, real-world situations are often complicated, involving many different kinds of variables; therefore, when solving real problems, people need to determine which variables influence the outcome. To do so, the variable of interest is changed while the other variables are controlled. For example, if a city planner wants to find out whether temperature affects the comfort level of a city, all variables other than temperature, such as humidity and cloud cover, must be held constant while temperature is varied.

Example of Control of Variables Question

1. This is a modified version of a question in the Lawson test:

Shown are drawings of three strings hanging from a bar. The three strings have metal weights attached to their ends. String 1 and String 3 are the same length. String 2 is shorter. A 10-unit weight is attached to the end of String 1. A 10-unit weight is also attached to the end of String 2. A 5-unit weight is attached to the end of String 3. The strings (and attached weights) can be swung back and forth, and the time it takes to make a swing can be timed.

Suppose you want to find out whether the length of the string has an effect on the time it takes to swing back and forth. Which strings would you use to find out?

a. only one string

b. all three strings

Note: In this problem, there are two variables that may influence the time it takes to swing back and forth: the length of the string and the mass of the attached weight.

Students are asked to determine the relationship between the length of the string and the time it takes to swing back and forth, so the size of the weight needs to be controlled

(held constant). The weights attached to the ends of Strings 1 and 2 are the same, but the lengths of these two strings are different, so they can be chosen to test the relationship between length and swing time.
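To illustrate the same controlled comparison numerically, here is a minimal sketch (not part of the test item) that assumes an idealized simple pendulum with period T = 2π√(L/g); the lengths used are hypothetical values chosen only for illustration.

# Idealized pendulum: the period depends on length but not on the attached mass.
import math

def period(length_m):
    """Period of an ideal simple pendulum in seconds; mass does not appear."""
    g = 9.8  # m/s^2
    return 2 * math.pi * math.sqrt(length_m / g)

# Controlled comparison (like Strings 1 and 2): same weight, different lengths.
print(period(1.0), period(0.5))  # the periods differ, so length matters

# A comparison like Strings 2 and 3 would be confounded: both length and weight
# differ, so a difference in timing could not be attributed to length alone.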

2. This is a revised version of a question from the Lawson test:

Twenty fruit flies are placed in each of four glass tubes. The tubes are sealed. Tubes I and II are partially covered with black paper; Tubes III and IV are not covered. The tubes are placed as shown. Then they are exposed to red light for five minutes. The number of flies in the uncovered part of each tube is shown in the drawing. [Figure: tubes I–IV with the number of flies in the uncovered part of each tube; all tubes are exposed to red light.] This experiment shows that flies respond to (move to or away from):

a. red light but not gravity

b. gravity but not red light

c. both red light and gravity

d. neither red light nor gravity

In this problem, there are two variables that could affect the distribution of the fruit flies: gravity and red light. To investigate the relationship between red light and the distribution of fruit flies, gravity is treated as a controlled variable. Tubes II and IV are compared since gravity is not having an effect on those tubes. The two tubes have a very similar distribution of fruit flies, so we know red light did not have an impact. To test the relationship between gravity and the distribution of fruit flies, red light is the controlled variable. By comparing Tubes I and II (or III and IV), we see that gravity does have an impact on the distribution.

2.3.2 Proportions and Ratios

In mathematics and physics, proportionality is a mathematical relation between two

quantities. There are two different views of this “mathematical relation.” One is based on

ratios, and the other is based on functions.

1. A Ratio Viewpoint

In many books, proportionality is expressed as an equality of two ratios, a/b = c/d. Given the values of any three of the four terms, it is possible to solve for the fourth term.

2. A Functional Viewpoint

Consider the following equation for gravitational force:

F = G m1 m2 / r^2

A scientist would say that the force of gravity between two masses is directly

proportional to the product of the two masses and inversely proportional to the square of the distance between the two masses. From this perspective, proportionality is a functional relationship between variables in a mathematical equation.
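To make the functional view concrete, the scaling behavior can be read directly off the equation above; the following worked lines (added here as an illustration, written in LaTeX notation) show the two cases just described.

\[
F(2m_1, m_2, r) = \frac{G (2m_1) m_2}{r^2} = 2F,
\qquad
F(m_1, m_2, 2r) = \frac{G m_1 m_2}{(2r)^2} = \frac{F}{4}
\]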

Proportional reasoning is associated with the formal operational stage of thought, according to Piaget's theory of intellectual development. In research, proportional reasoning can be conceptualized in the following ways: identification of two extensive variables that are applicable to a problem, and application of the given data and relationships to find (i) an additional value for one extensive variable (missing value problems) or (ii) a comparison of two values of the intensive variable computed from the data (comparison problems) (Karplus et al., 1983).

In learning, proportional reasoning is recognized as a fundamental reasoning construct necessary for mathematics and science achievement (McLaughlin, 2003). In scientific inquiry, we can define useful quantities through proportional reasoning. For example, we define density, speed, and resistance with ratios. Krajcik and Haney (1987) analyzed the American Chemical Society Exam and found that over 50% of the test involved tasks requiring proportional reasoning. This implies that proportional reasoning is the primary reasoning construct required for success in chemistry, and complete development of this skill is crucial for achieving understanding of the many formal concepts associated with the content. Akatugba and Wallace (1999) contend that almost every concept in physics requires a proficient understanding of proportional reasoning, and students who are not capable of this type of reasoning will have difficulty mastering the concepts.

Proportional reasoning is considered a milestone in students’ cognitive development and is at the heart of middle grade mathematics. Proportional reasoning is associated with

Piaget’s formal operational stage of thought. Many Piagetian and neo-Piagetian researchers identify the formal operational stage in subjects by having them perform tasks that require the use of ratios and proportions (Roth & Milkent, 1991).

Proportional reasoning is widely applied in everyday life. For example, gas mileage and unit price are ratios that may be grouped under the general notion of "rates" (Karplus et al., 1983).

Example of Proportion and Ratio Question

1. A fifth grade class has 18 students. At lunch time, the teacher brings in 12 bottles of

orange juice, which fully fill all students’ cups (no juice is left). How many cups can be filled with 16 bottles of orange juice?

a. 20 b. 24 c. 28

d. 30 e. 32 f. other

The question gives that 12 bottles of juice fill 18 cups, so each cup holds 12/18 of a bottle of juice. If there are 16 bottles of orange juice, a proportional relationship can be set up: 12/18 = 16/x. Solving for x yields x = 24.
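A quick numeric check of this proportion (added here only as an illustration; the numbers come directly from the question):

# Check of the proportion 12/18 = 16/x solved for x.
bottles_known, cups_known = 12, 18   # 12 bottles fill 18 cups
bottles_new = 16                     # new number of bottles
cups_new = bottles_new * cups_known / bottles_known
print(cups_new)  # 24.0 -> answer (b)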

2. This question is selected from the Lawson test. Below are drawings of a wide and a

narrow cylinder. The cylinders have equally spaced marks on them. Water is poured

into the wide cylinder up to the 4th mark (see A). This water rises to the 6th mark when

poured into the narrow cylinder (see B).

Both cylinders are emptied (not shown) and water is poured into the wide cylinder up to the 6th mark. How high would this water rise if it were poured into the empty narrow cylinder?

a. to about 8

b. to about 9

c. to about 10

d. to about 12

e. none of these answers is correct

To solve the problem, let the height of the water in the wide cylinder be h_w, the height of the water in the narrow cylinder be h_n, the size of the wide cylinder be S_w, and the size of the narrow cylinder be S_n. The volume of water is conserved, so h_w S_w = h_n S_n. This can be rearranged into a proportional relation h_w / h_n = S_n / S_w. Since S_n and S_w are constants, the ratio between h_w and h_n is constant. In part A, h_w = 4 and h_n = 6. In part B, h_w = 6 and h_n = x. This provides us with a new proportion: 6/x = 4/6. Solving for x yields 9.

2.3.3 Probability

There are two main interpretations of probability, one that could be termed

“objective” and the other “subjective.” A probabilistic situation is a situation in which we

are interested in the fraction of the number of repetitions of a particular process that

produces a particular result when repeated under identical circumstances a large number

of times. The process itself, together with noting the results, is often called an experiment. An outcome is a result of an experiment. An event is an outcome or a set of all outcomes of a designated type. An event's probability is the fraction of the times an event will occur as the outcome of some repeatable process when that process is repeated a large number of times.

The classical interpretation of probability is a theoretical probability based on the physics of the experiment, but does not require the experiment to be performed. For example, we know that the probability of a balanced coin turning up heads is equal to 0.5 without ever performing trials of the experiment. Under the classical interpretation, the probability of an event is defined as the ratio of the number of outcomes favorable to the event divided by the total number of possible outcomes.

Sometimes a situation may be too complex for us to understand its physical nature well enough to calculate probabilities. However, by running a large number of trials and observing the outcomes, we can estimate the probability. This is the empirical probability, based on long-run relative frequencies, and it is defined as the ratio of the number of observed outcomes favorable to the event divided by the total number of observed outcomes. The larger the number of trials, the more accurate the estimate of the probability. If the system can be modeled by computer, then simulations can be performed in place of physical trials.
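As a small sketch of such a simulation (the chosen event, two fair dice summing to 7, is hypothetical and used only to illustrate the idea of estimating an empirical probability):

# Estimate the probability of rolling a sum of 7 with two fair dice by simulation,
# then compare with the classical value 6/36 = 1/6.
import random

def estimate_p_seven(trials=100_000):
    hits = sum(1 for _ in range(trials)
               if random.randint(1, 6) + random.randint(1, 6) == 7)
    return hits / trials

print(estimate_p_seven())  # close to 0.1667; accuracy improves with more trials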

A manager frequently faces situations in which neither classical nor empirical probabilities are useful. For example, in a one-shot situation such as the launch of a unique product, the probability of success can neither be calculated nor estimated from

repeated trials. However, the manager may make an educated guess of the probability. This subjective probability can be thought of as a person's degree of confidence that the event will occur. In the absence of better information upon which to rely, subjective probability may be used to make logically consistent decisions, but the quality of those decisions depends on the accuracy of the subjective estimate.

Example of Probability Question

Three red square pieces of wood, four yellow square pieces, and five blue square

pieces are put into a cloth bag. Four red round pieces, two yellow round pieces, and three

blue round pieces are also put into the bag. All the pieces are then mixed about. Suppose

someone reaches into the bag (without looking and without feeling for a particular shape

piece) and pulls out one piece. What are the chances that the piece is a red round or blue

round piece?

a. cannot be determined

b. 1 chance out of 3

c. 1 chance out of 21

d. 15 chances out of 21

e. 1 chance out of 2
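For reference, the answer follows from a simple count of favorable outcomes; the sketch below (added for illustration) uses exactly the counts given in the question.

# Pieces in the bag: (color, shape) -> count, as given in the question.
pieces = {
    ("red", "square"): 3, ("yellow", "square"): 4, ("blue", "square"): 5,
    ("red", "round"): 4,  ("yellow", "round"): 2,  ("blue", "round"): 3,
}
total = sum(pieces.values())                                      # 21 pieces
favorable = pieces[("red", "round")] + pieces[("blue", "round")]  # 7 pieces
print(favorable, total, favorable / total)  # 7 21 0.333... -> 1 chance out of 3, choice (b)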

2.3.4 Correlational Reasoning

In the scientific inquiry process in multi-variable contexts, some variables are independent of each other while others are dependent. At the societal level, people pay much attention to correlational relationships, such as the correlation between smoking and the chance of getting lung cancer; between drinking tea and losing weight; between the weather and the market; between the physical statures of parents and their offspring; and between the demand for a product and its price. Any two variables may be associated with each other strongly, weakly, or not at all. Correlation is used to describe the degree of dependence between two variables. (Correlation can also exist among more than two variables, but our discussion focuses on the link between two variables.)

Lawson’s Definition about Correlational Reasoning:

Correlational reasoning is defined as the thought pat- terns individuals use to determine the strength of mutual or reciprocal relationships between variables.

Correlational reasoning is fundamental to the establishment of relationships between variables; such relationships are, in turn, basic prediction and to scientific exploration.

(Anton E. Lawson, Helen Adi and Robert Karplus 1979)

Though there are multiple versions of the definition of correlation, two features are typical when we define it:

1. For two variables, there are two different ways to look at their relationship. One is to see whether there is a link between them. The other is to see how the two variables are related, that is, the mechanism of their relationship. Researchers studying correlational reasoning mainly focus on whether people think the presented data show that two variables are related and whether people can make predictions from the data. Correlational reasoning does not require people to see that certain mechanisms or causal relationships exist between the two variables.

2. Correlational reasoning is highly related to conditional probability. That means that when a correlation exists between events A and B, the probability of A can influence the probability of B and vice versa.

Example of Correlational Reasoning Question

Brown was observing the mice that live in his field. He discovered that all of them were either fat or thin. Also, all of them had either black tails or white tails. This made him wonder if there might be a link between the size of the mice and the color of their tails.

So he captured all of the mice in one part of his field and observed them. The picture shows the mice that he captured.

Based on the captured mice, do you think there is a link between the size of the mice and the color of their tails?

A. appears to be a link

B. appears not to be a link

C. cannot make a reasonable guess

This question is asking people to judge whether or not there exists a correlation

between the size of the mice and the color of their tails. We should compare the 4 groups

of mice based on their properties (fat or thin, black or white tail), which gives us the

following table:

                      | Fat mouse | Thin mouse
Mouse with black tail |    12     |     2
Mouse with white tail |     3     |     8

We can see that most of the fat mice have black tails while most of the thin mice have

white tails. Therefore, there exists a correlation between the size of the mouse and the

color of its tail.
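To quantify the association suggested by the table, one common summary statistic for a 2x2 table is the phi coefficient; the sketch below (added for illustration) uses the counts from the table above.

# Phi coefficient for the 2x2 table of mouse size vs. tail color.
from math import sqrt

a, b = 12, 2   # black tail: fat, thin
c, d = 3, 8    # white tail: fat, thin

phi = (a * d - b * c) / sqrt((a + b) * (c + d) * (a + c) * (b + d))
print(round(phi, 2))  # about 0.59 -> a fairly strong positive association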

2.3.5 Deductive Reasoning

Deductive arguments are attempts to show that a conclusion necessarily follows from a set of premises. A deductive argument is valid if the conclusion does follow necessarily from the premises, i.e., if the conclusion must be true provided that the premises are true. A deductive argument is sound if it is valid and its premises are true.

Deductive arguments are valid or invalid, sound or unsound, but are never false or true.

Deductive reasoning is a method of gaining knowledge.

Park and Han (2002) claim that deductive reasoning can be a factor that helps students recognize cognitive conflict and resolve it. They use deductive reasoning to help students learn the direction of the force acting on a moving object. For example, they show these two premises to students:

Premise 1: If an object is moving more and more slowly, then the net force acts on that object in the opposite direction to that of its motion.

Premise 2: A ball which is thrown vertically upward is moving upward more and more slowly.

They then ask what conclusion can be drawn from these premises. Their research shows that using deductive reasoning can help students change their preconceptions.

Deductive reasoning is a basic logic skill and is very useful in our daily life. We make many deductions from what we already know. For example, say you receive a flower as a Christmas gift. You need to put it somewhere. You know all plants need sunshine. Your flower is a plant. Therefore the flower needs sunshine, so you put it beside the window.

Example of Deductive Reasoning Question

You and your friends play a new card game. One side of each card shows an integer number while the other side is either white or gray. After playing for a while, one of your friends discovers that if a card shows an even number on one side, it will always be gray on the other side. Your friend lays out four cards in front of you as shown. [Figure: the four cards show 3, 8, a white face, and a gray face.] If you want to test whether the rule your friend discovered is true or not, which cards should you turn over (choose as few cards as possible)?

a. 3 only

c. 3 and white

d. 3 and gray

e. 8 and white

f. 8 and gray

g. all four cards
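For reference, the falsification logic can be made explicit: a card can break the rule "even number implies gray" only if it could turn out to be an even number with a non-gray back. The sketch below (added for illustration, not part of the test) enumerates which visible faces force a turn.

# Which cards must be turned over to test "even -> gray"?
#  - a visible even number: its hidden side might not be gray;
#  - a visible white side: its hidden side might be an even number;
#  - a visible odd number or gray side can never falsify the rule.
def must_turn(visible):
    if visible.isdigit():                 # a number is showing; the color is hidden
        return int(visible) % 2 == 0      # only an even number can break the rule
    return visible == "white"             # a white face might hide an even number

cards = ["3", "8", "white", "gray"]
print([c for c in cards if must_turn(c)])  # ['8', 'white'] -> turn over 8 and white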

2.3.6 Inductive Reasoning

1) The basic definition of inductive reasoning

"Induction is a major kind of reasoning process in which a conclusion is drawn from particular cases. It is usually contrasted with deduction, the reasoning process in which

the conclusion logically follows from the premises, and in which the conclusion has to be

true if the premises are true. In inductive reasoning, on the contrary, there is no logical movement from premises to conclusion. The premises constitute good reasons for accepting the conclusion. The premises in inductive reasoning are usually based on facts or observations. There is always a possibility, though, that the premises may be true while the conclusion is false, since there is not necessarily a logical relationship between premises and conclusion." (Grolier's 1994 Multimedia Encyclopedia)

Inductive reasoning is used when generating hypotheses, formulating theories and

discovering relationships, and is essential for scientific discovery.

2) The definitions of inductive reasoning in research:

Induction can be defined as the process whereby regularities or order are detected

and, inversely, whereby apparent regularities, seeming generalizations, are disproved or

falsified. This is achieved by finding out, for instance, that all swans observed so far are white or, on the contrary, that at least one single swan has another color. To put it more

generally, one can state that the process of induction takes place by detecting

commonalities through a process of comparing. However, with inductive reasoning it is

not enough to compare whole objects globally to each other. Instead, they have to be

compared with respect to their attributes or to the relations held in common. That is the

reason why all inductive reasoning processes are processes of abstract reasoning.

Example of Inductive Reasoning Question

Question 2: What should be in the ( )? [Figure: a sequence of five pictures followed by a blank slot and answer options A–D; not reproduced here.]

Answer: D

From the first five pictures we can induce that the square spot moves clockwise and that the semicircle and semi-square exchange their positions in turn, so the missing picture should be the one shown in option D.

2.3.7 Causal Reasoning

Causal reasoning is concerned with establishing the presence of causal relationships among events. When causal relationships exist, we have good reason to believe that events of one sort (the causes) are systematically related to events of some other sort (the effects), and it may become possible for us to alter our environment by producing (or preventing) the occurrence of certain kinds of events.

Most studies of students' ability to coordinate theory and evidence focus on what is best described as inductive causal inference (i.e., given a pattern of evidence, what inferences can be drawn?).

If there is a causal relationship between variables x and y, there are several kinds of causes:

1) Necessary causes: If x is a necessary cause of y, then the presence of y necessarily implies the presence of x with a probability of 100%. The presence of x, however, does not imply that y will occur.

2) Sufficient causes: If x is a sufficient cause of y, then the presence of x necessarily implies the presence of y with a probability of 100%. However, another cause z may alternatively cause y; thus the presence of y does not imply the presence of x. For instance, in the case of 'losing one's breath means the death of a person,' losing one's breath is a sufficient cause of the death of a person, but the death of a person is a necessary cause of losing one's breath.

3) Contributory causes: If x is a contributory cause of y, the presence of x makes the presence of y possible, but not with a probability of 100%. In other words, a contributory cause may be neither necessary nor sufficient, but it must contribute to the effect. For instance, in the case of 'having cancer causes the death of a person,' having cancer is a contributory cause. It is neither a necessary nor a sufficient cause of the death of a person: first, having cancer does not sufficiently cause a person to die (some cancers can be treated); second, the death of a person is not necessarily caused by cancer (some other factor, such as a car accident or suicide, may also cause a person to die). Nevertheless, no one can deny that if a person has cancer, it will probably lead to his or her death. The three kinds of causes are summarized formally below.
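One compact way to express the three kinds of causes in the probability language used above is given below (written in LaTeX notation; the probability-raising form for a contributory cause is one common formalization added here for illustration, not a definition stated in the original text).

\[
\text{$x$ necessary for $y$: } P(x \mid y) = 1,
\qquad
\text{$x$ sufficient for $y$: } P(y \mid x) = 1,
\qquad
\text{$x$ contributory to $y$: } P(y \mid x) > P(y \mid \neg x)
\]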

Example of Causal Reasoning Question

A zoologist travels to Africa to study the natural breeding environment of giraffes.

While there, he notices a type of tall tree that produces a special fruit that only grows at the top of the tree. He also notices that giraffes that frequently eat this fruit appear to be stronger and taller than those who cannot reach the fruit. He concludes that the fruit

contains rich nutrients which make the giraffes that eat the fruit grow stronger and taller.

Which one of the following statements do you agree with?

a. When a giraffe frequently eats this special fruit, it grows stronger and taller.

b. The nutrients in the fruit can help the giraffe grow stronger and taller.

c. Both A and B are correct.

d. The result is not sufficient to demonstrate that eating the fruit causes a giraffe to grow

stronger and taller.

e. None of the above statements is reasonable.

In this question, the zoologist observes a positive correlation between the frequency

of eating fruit from a tall tree and the height of the giraffes. He then concludes that

eating the fruit causes the giraffes to be taller. This is a typical "correlation implies causation" fallacy. It is quite possible that tall giraffes eat from the tall trees simply because they are tall. We cannot draw any conclusion about whether or not the fruit makes the giraffes taller.

2.3.8 Hypothetical-Deductive Reasoning

The hypothetical-deductive method (HD method) is a very important method for testing theories or hypotheses. The HD method is one of the most basic methods common to all

scientific disciplines including biology, physics, and chemistry. Its application can be

divided into five stages:

1) Form many hypotheses and evaluate each hypothesis

2) Select a hypothesis to be tested

3) Generate predictions from the hypothesis

4) Use experiments to check whether the predictions are correct

5) If the predictions are correct, then the hypothesis is confirmed. If not, the

hypothesis is disconfirmed.

HD reasoning involves starting with a general theory of all possible factors that might affect an outcome and forming a hypothesis; then deductions are made from that hypothesis to predict what might happen in an experiment.

In scientific inquiry, HD reasoning is very important because, in order to solve science problems, you need to make hypotheses. Many hypotheses can't be tested directly; you have to deduce from a hypothesis and make predictions which can be tested through experiments.

According to Piaget’s theory of intellectual development, HD reasoning appears in the formal operational stage (Inhelder & Piaget, 1958). Lawson et al. (2000) claim that there are two general developmentally-based levels of hypothesis-testing skill. The first level involves skills associated with testing hypotheses about observable causal agents; the second involves testing hypotheses about unobservable entities. The ability to test alternative explanations involving unseen theoretical entities is a fifth stage of intellectual development that goes beyond Piaget’s four stages.

Example of Hypothetical-deductive Reasoning Question

A student put a drop of blood on a microscope slide and then looked at the blood under a microscope. As you can see in the diagram below, the magnified red blood cells look like little round balls. After adding a few drops of salt water to the drop of blood, the student noticed that the cells appeared to become smaller.

This observation raises an interesting question: Why do the red blood cells appear smaller? Here are two possible explanations:

I. Salt ions (Na+ and Cl-) push on the cell membranes and make the cells appear smaller.

II. Water molecules are attracted to the salt ions so the water molecules move out of

the cells and leave the cells smaller.

To test these explanations, the student used some salt water, a very accurate weighing device, and some water-filled plastic bags, and assumed the plastic behaves just like red-blood-cell membranes. The experiment involved carefully weighing a water-filled bag, placing it in a salt solution for ten minutes, and then reweighing the bag.

What result of the experiment would best show that explanation I is probably wrong?

A. the bag loses weight

B. the bag weighs the same

C. the bag appears smaller

What result of the experiment would best show that explanation II is probably wrong?

Answer: B

This question gives students two alternative hypotheses and the experiment to test these two hypotheses. Students need to make a prediction about the results of the experiment according to each hypothesis and consider what result could confirm or disconfirm the hypothesis.

If hypothesis I is right, then the weight of the bag won't change because there are no molecules or ions coming into or going out of the bag. If hypothesis II is right, the bag will lose weight because water molecules move out of the bag. From HD reasoning, we know the answers are A and B.

Chapter 3. Validity Evaluation of Lawson's Classroom Test of Scientific Reasoning

3.1 A Historical Review of the Development of the Lawson Test

Currently, there is increasing interest and activity among many STEM education communities in research on students' scientific reasoning ability, in which the Lawson test is often used as the tool for assessing scientific reasoning. However, research to critically inspect the existing assessment instruments and to develop new instruments for scientific reasoning is largely missing in the literature.

Although there have been studies (e.g., Pratt & Hacker, 1984) investigating the validity of the 1978 version of Lawson's formal reasoning test, there is little work on its 2000 version. Even though the 2000 edition has become a standard assessment tool in physics education research, the test itself has not been systematically validated. Through research, we have also observed several issues concerning the question designs and data interpretations. In this chapter, we first consider the inception and early examination of the validity of the Lawson test as it was originally created in 1978. We then report our research on the validity of the 2000 version of the Lawson test based on large-scale assessment data and follow-up interviews across a large swath of education levels. Our findings show both the validity of and the problems with the current test design for measuring formal reasoning skills.

During the 1960s and 70s, various researchers wanted to create an assessment tool to uncover the level of understanding and formal reasoning students possessed. The impetus came from the fact that the most accurate and informative tool was the clinical-style interview, but this tool was extremely demanding. It required both aptitude in implementation and significant amounts of time; it could also require equipment for demonstrations to help probe the minds of the subjects. A more effective

method was needed so that teachers could accurately determine the standing of the

students in their classes. That method should be something that could be administered by

any trained instructor and scored in an objective manner.

Most tests developed were paper-and-pencil methods which could more easily be

graded (Longeot, 1965; Raven, 1973; Burney, 1974; Tisher & Dale, 1975; Tomlishen-

Keasey, 1975; Tobin & Capie, 1980). Other tests required equipment for the students to work with and pamphlets for them to fill out (Rowell & Hoffman, 1975), but these sorts of tests were time-consuming and needed all of the said instruments; such issues made the implementation of this type of examination more restricted and tended to result in smaller sample sizes. Another version of this sort of exam had an instructor conduct a demonstration for the class, which minimized the time and equipment requirements (Shayer & Wharry, 1975). This helped strike a decent balance between the power of interviews and a format that could be more readily implemented.

The test of interest that came out of this period was Lawson's (1978a), which had been developed out of Piagetian methods and questions of formal reasoning based on Piaget's developmental theory (Piaget, 1965). Lawson argued that most pencil-and-paper tests

tended to examine reading and writing skills more than formal reasoning abilities, and he found the questions of Shayer and Wharry (1975) to be insufficient in variety. The test that he developed utilized questions that others had created in different contexts to test formal reasoning for a variety of operations, including control of variables, combinatorial reasoning, correlational reasoning, probabilistic reasoning, and proportional reasoning.

In his test design, Lawson had fifteen items, all of which involved a demonstration conducted by the instructor to help pose a question to the students. Each pupil had his or her own test pamphlet, which had two parts for each item: a multiple-choice question for the correct prediction in response to the posed question, and a written section for the student to explain the answer. An item was scored as correct only if the prediction was correct and the reasoning satisfactory. Another version of the test was created as well that had only

10 items instead (Lawson, 1978b).

To test his investigative tool, Lawson administered the exam to 513 students from eighth to tenth grade, of whom 72 were randomly picked for a battery of Piagetian tasks in a clinical interview. The comparison between the interview data and the test results would indicate how well the two are correlated. Lawson also had several judges with expertise in Piagetian research assess the quality of his test questions; all six judges agreed that the questions tested concrete and/or formal reasoning. For comparing the interview data and test data, two statistical tools were employed: parametric statistics and principal components analysis. The former found an overall correlation of 0.76 that was statistically significant (p < 0.001). The latter indicated that 66% of the variance in scores would be accounted for if there were three principal factors measured by this examination; with the small number of students, this result was considered

tentative. Overall, the assessments indicated that the Lawson test was able to measure

formal reasoning and to correlate reasonably well with clinical methods.

Lawson (1978a) also created a ranking system to help instructors understand what a score on this test would mean. From his results of over 500 students, Lawson created

three classes of reasoning subjects: concrete, transitional, and formal. The first would

have a score from 0-5, the second 6-11, and the last 12-15. About one third (35.3%) of

those examined were classified at the concrete level, about half (49.5%) transitional, and

the remaining (15.2%) at formal. When compared to interview assessments, more than

half of the 72 interviewees fit into the reasoning levels developed by Lawson. Of those

that did not fit into the schema, the data indicated that the Lawson test may have

underestimated those students.

After publication, the Lawson test began to be used in classrooms, and not long after, other researchers began to assess how well the test was performing. For example, Stephanich et al. (1983) tested the correlation between clinical interviews and students' performance on various assessment tests, including Lawson's. A weaker correlation than that reported by Lawson was found, and the test was said to overestimate reasoning abilities rather than underestimate them. However, this examination had a small sample size (N = 27) and no estimate of statistical significance was provided, so it cannot overturn what was found in the previous study. Nonetheless, an appreciable correlation (0.50) was found between the interviews and the pencil-and-paper examination, so we can say that the test from 1978 seems valid.

Another study comes from Pratt and Hacker (1984), who administered the exam to 150 students to uncover whether the Lawson test measured one or several factors. Taking issue with the factor analysis of Lawson (1978a), the researchers used another model to indicate that the Lawson test measurement was indeed multi-factorial rather than singular (this was even more strongly indicated for the test from Lawson [1978b]), which they took to be a weakness of the test. This sort of examination was repeated by Hacker (1989), who found the same result: Lawson's test is multi-factorial. Other researchers (Harrison, 1986; Reckase, Ackerman, & Carlson, 1988) do not find a multi-factorial examination to be problematic, especially if formal reasoning is multifaceted, so we take Pratt and Hacker (1984) to prove a point that Lawson (1978a) was uncertain of in his original article.

Later, an Israeli study administered the Lawson test to find how strongly these test scores correlated with success in the sciences and mathematics (Hofstein & Mandler, 1985). In their results, it was found that formal reasoning students outperformed transitional and concrete reasoning students, though the latter two levels of ability were not distinguishable. Concerning performance in STEM, only one of the items (probability reasoning) was found to be predictive in all analyzed sciences, but overall the test was a good indicator of success only for biology students. This limitation thus indicates that this formal reasoning test is not sufficient for determining success in STEM overall, and perhaps no formal reasoning test can achieve this result.

More than twenty years after its original edition, Lawson produced a new version of his examination in 2000; this time it was completely multiple-choice and without demonstrations. Also, unlike before, where there were fifteen items, the 2000 test had twelve items in question pairs, making twenty-four questions in total. Another change was that the score was a simple count of the number of right answers rather than the number of items for which both questions had to be answered correctly. Combinational reasoning items were also replaced by more correlational reasoning and hypothetical-deductive reasoning items. A complete list of changes is given in Table 2.1 of Chapter 2.

The first ten items of the 2000 version in its two-question format had, as originally

developed, one question for correctly predicting the outcome to some particular situation,

and the second question was to find the correct reasoning behind that selection in the first

question. The last two items then introduced hypothetical-deductive questions. The first

question of item 11 concerned the experimental design, and the second question asked what outcome would support a stated hypothesis. Item 12 was similar to the prior item,

but both questions concerned the data pattern that would support a stated hypothesis.

The utility of the new version of the Lawson test is most obvious in that a completely multiple-choice exam is more quickly and objectively scored, and thus it is better suited for use by instructors to gauge the reasoning abilities of their pupils.

However, the 2000 Lawson test was not presented in a formal study establishing its efficacy, instead resting on the laurels of its earlier incarnation. In one study since its distribution, a correlation was shown between gains on the Force Concept Inventory (FCI) and the Conceptual Survey of Electricity and Magnetism (CSEM) and Lawson test scores among the community college students (with and without calculus) to whom the various tests were administered (Diff & Tache, 2007). The correlations were small but significant nonetheless, and similar findings have also been reported (Dubson & Pollock, 2006; Coletta & Phillips, 2005, 2007; Coletta et al., 2008), so the new Lawson test still appears to be a useful measure of formal reasoning.

Nonetheless, a proper analysis of the Lawson 2000 exam has not been done. Moreover, no investigation has analyzed the test across a large swath of education levels. This makes the current study necessary for determining what weaknesses the test may have and where problems may develop.

3.2 Content Evaluation of Lawson's Test – Item Context Issues

To begin the validity evaluation of the Lawson's test, a content analysis of some of the questions provides useful insight into how experts or examinees may disagree over the information included in the questions. The Lawson's test items were designed around simple scientific context scenarios that may show up as examples in K-12 science courses, such as pendulums, graduated cylinders, and a candle-burning experiment. Although simple, these contexts do carry a scientific-lab flavor that may intimidate certain students who are weak in science.

In addition, since these contexts may also have been used in students' prior coursework, which cannot be controlled, there is a potential complication of content interference. For example, Q21 to Q24 on the Lawson's test assess students' hypothetical-deductive reasoning ability (i.e., the ability to form and test hypotheses). The contexts used are candle burning (an oxygen and carbon dioxide experiment) and cells in salty water (an osmosis experiment). Through interviews, we have observed a significant number of high school and college students reporting that they responded to these questions by recalling the exact experiments that they had done or observed before and did not try to reason through the problem. Therefore, to these students, the questions become a content test rather than a test of reasoning.

Practically speaking, the content interference cannot be totally removed; however, in order to correctly interpret assessment results, systematic research with detailed interviews is needed to provide valid information on the extent of the effect that content interference may have on students of different age groups and backgrounds.

3.3 A Data-Driven Study on the Validity of the Lawson’s Test

The sections below take a data-driven approach to evaluating the validity of the Lawson's test based on students' quantitative and qualitative responses. For this study with the 2000 version of the Lawson test, data were collected in three forms: (1) large-scale quantitative data from students ranging from 3rd grade to the graduate level; (2) a short free-response explanation added to each item of the Lawson test, given to college students in their freshman year; and (3) think-aloud interviews with freshman college students. The first form provides quantitative data that can flag particular questions needing assessment by other means. The second form provides both qualitative and quantitative data that can more clearly indicate reasoning difficulties or test-design issues. The third form extracts detailed information about students' thought processes in answering the questions. In addition, on a smaller scale, eye-tracking data were collected for students who took the exam on computers rather than with pencil and paper. This information can help indicate what sorts of questions make students more hesitant in answering.

The student populations in the data collection include Chinese students from 3rd grade to the graduate level (N=7131) and U.S. students from 5th grade to the first year of college (N=2777). The Chinese grade school students (N=6258) are from 141 classes in 20 schools from 8 regions around China. The students in China used a version translated by physicists fluent in both languages. The translated versions were also piloted with a small group of undergraduate and graduate students (N~20) to remove language issues. The U.S. grade school data were collected in several Midwestern states from 30 classes of students across 14 private and public schools (N=1078). The schools were selected from a wide range of regional and demographic backgrounds to obtain a representative pool. The college student data were from four U.S. universities (N=1699) and three Chinese universities (N=873). The students tested were first-year science and engineering majors enrolled in calculus-based introductory physics courses (NChina=458, NUS=1370). The tests were administered before any college-level instruction of the relevant content topics.

The four U.S. universities are ranked and their backgrounds are given below (based on the 2007 U.S. News and World Report ranking):

• U1 is a large research-1 state university, U.S. ranking top 60, acceptance rate
• U2 is a large research-1 state university, U.S. ranking top 60, acceptance rate
• U3 is a large tier-4 state university with an acceptance rate of 84%.
• U4 is a large tier-3 state university with an acceptance rate of 69%.

The three Chinese national universities – schools directly funded and supervised by the nation's department of education – are listed with their national rankings below (based on the 2007 ranking from Sina Education News, http://edu.sina.com.cn):

• C1 is a top 30 national university.
• C2 is a top 60 national university.
• C3 is a top 130 national university.

In selecting universities, we targeted those with medium rankings in order to form a more representative pool of the populations. For college students, the Lawson test was administered at the beginning of the fall semester (or quarter), before any college-level instruction. For grade school students, the test was administered at various times throughout the fall semester (or quarter).

3.4 Quantitative Results – Item Score Analysis

To search for questions with potential validity issues, the average score of each test item was computed for students at selected grade levels, providing baseline comparisons of students at various developmental stages. These results are shown in Figure 3.1, which gives the average item scores of U.S. and Chinese college students. In general, the two populations show systematic differences on items measuring different skill dimensions. The Chinese students score higher on proportional reasoning (items 5-8) and hypothetical-deductive reasoning (items 21-24) but lower on probabilistic reasoning (items 15-18) and correlational reasoning (items 19-20). Due to the large sample sizes, the differences between the U.S. and Chinese students are statistically significant; the error bars based on standard errors are small and are omitted from the plots.

Figure 3.1. Average scores of the Lawson test items from Chinese (N=248) and U.S. (N=646) college students.
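As an illustration of the item-score computation described above, the short sketch below shows how per-item average scores could be obtained from a matrix of single-question scores; the data layout and variable names are hypothetical and are not part of the original analysis, which used the actual U.S. and Chinese data sets.

    import numpy as np

    def item_means(scores):
        """Average score of each test item.

        scores: 2-D array of shape (n_students, n_items) containing 0/1
        single-question scores for one population (e.g., U.S. college students).
        """
        return np.asarray(scores, dtype=float).mean(axis=0)

    # Hypothetical example: 5 students answering the 24 Lawson-test questions.
    rng = np.random.default_rng(0)
    fake_scores = rng.integers(0, 2, size=(5, 24))
    print(np.round(item_means(fake_scores), 2))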

The item scores show large "dips" (divergences from items in the same skill dimension) on item 8 for the Chinese students, item 12 for both U.S. and Chinese students, and item 14 for U.S. students. As these results come from students with a high level of education, they prompted an investigation of how apparent the dips are at earlier stages of education. To inspect whether the dips are unique to the college population, we plotted the item scores of three selected grade levels of Chinese students in Figure 3.2.

Figure 3.2. Average scores of the Lawson test items from Chinese students in 7th, 10th, and 12th grades (N7=529, N10=1195, N12=458).

The results suggest that the dip on item 12, relative to the other questions in the same category, is formed as early as 7th grade. This dip remains through the college population and only becomes more pronounced. Among the Chinese students, the dips on items 8 and 14 start to appear in 10th grade and become more obvious through 12th grade. These results indicate that the dips are consistently observed in the developmental trend of students from young to senior ages.

The students' item-level performances are summarized in Figure 3.1, where a few items show abnormal results. On the item pair Q11 & Q12, at least 80% of both U.S. and Chinese college students answered Q11 correctly, but only around a third of each nation's students correctly answered Q12. Similar situations occur in other item pairs such as Q7 & Q8 and Q13 & Q14. Such artifacts would suggest that there is no progression from novice to expert, but this runs contrary both to expectations and to the progress seen in Figure 3.2, where within a question pair one answer improves significantly while the other does not. Moreover, it was the question pairs Q11 & Q12 and Q13 & Q14 that were added after the 1978 version of the Lawson test and were not analyzed as the earlier questions were. This phenomenon led us to investigate the design of these questions and to ask whether it is the wording that confuses students. To answer this question and evaluate the validity of the test, both quantitative and qualitative data were used in this research.

3.5 Quantitative Results – Analysis of Two-Tier Score Patterns

The Lawson 2000 test version is a two-tier test. Educators and researchers (Lee & She, 2009; Coletta & Phillips, 2005, 2007; Hofstein & Mandler, 1985; Lawson, 1978a) usually grade the test using a pair-scoring schema, in which an item is scored as correct (a score of 1) only if the student chooses the right answer for both the question and the corresponding reason. Investigating this two-tier scoring method is important. For example, a large number of students chose the right answer for the wrong reason, or vice versa. Under the two-tier scoring system, these students are all categorized as not knowing how to solve the problem. However, it is possible that such students have the knowledge and reasoning skills in the domain at some level but were misled by question design issues, which led them to answer incorrectly on either the content or the reasoning part. Therefore, to more fully probe students' scientific reasoning abilities and to examine the rationality of the item design, the data were carefully re-analyzed using a single-question scoring schema.

Response   1-2    3-4    5-6    7-8    9-10   11-12  13-14  15-16  17-18  19-20  21-22  23-24
pattern
00         2.9    20.7   31.0   23.2   18.9   41.5   25.0   7.3    12.9   22.2   37.0   19.3
11         94.2   77.8   60.1   43.1   76.3   24.0   35.8   81.1   80.4   64.4   30.7   37.0
01         1.3    0.4    4.5    27.1   2.0    7.1    15.9   10.3   2.9    3.4    20.7   35.3
10         1.6    1.1    4.4    6.6    2.8    27.4   23.4   1.3    3.8    10.0   11.7   8.4
Sum of
01 & 10    2.9    1.5    8.9    33.7   4.8    34.5   39.3   11.6   6.7    13.4   32.4   43.7

Table 3.1. Lawson Test two-tiered response patterns (in percent) of U.S. college freshmen (N=1699).

The results of this analysis are summarized in Table 3.1. There are four possible patterns. The (0,0) pattern means students responded incorrectly to both questions; the (1,1) pattern means they responded correctly to both questions; the (0,1) pattern means they responded incorrectly to the content question but selected reasoning that should lead to the correct answer; and the (1,0) pattern means they responded correctly to the content question but with incorrect reasoning.

Response   Population    Paired   Single
pattern    percentage    score    score
00         21.82         0.00     0.00
11         58.73         58.73    58.73
01         10.91         0.00     5.45*
10         8.54          0.00     4.27*
Total score              58.73    68.47

Table 3.2. Comparison of Lawson test total percentage scores of U.S. college freshmen (N=1699) calculated with paired-score vs. single-question-score methods. *For the 10 and 01 patterns, contributions to the total score are weighted at 0.5.
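To make the two scoring schemes concrete, the sketch below reproduces the total scores in Table 3.2 from the pooled response-pattern percentages: the paired score credits only (1,1) responses, while the single-question score also credits (0,1) and (1,0) patterns at half weight. This is a minimal illustration, not the original analysis code.

    # Pooled two-tier response-pattern percentages from Table 3.2 (U.S. freshmen, N=1699).
    patterns = {"00": 21.82, "11": 58.73, "01": 10.91, "10": 8.54}

    # Paired scoring: a question pair counts only if both questions are correct.
    paired_total = patterns["11"]

    # Single-question scoring: (0,1) and (1,0) patterns each contribute half credit.
    single_total = patterns["11"] + 0.5 * (patterns["01"] + patterns["10"])

    print(f"paired-score total: {paired_total:.2f}%")   # 58.73%
    print(f"single-score total: {single_total:.2f}%")   # ~68.46% (Table 3.2 lists 68.47%)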

The (0,1) and (1,0) patterns were summed to find the percentage of students choosing exactly one correct answer in a question pair. Attaining the right answer for the wrong reason, or the wrong answer for the right reason, suggests that there may be a problem with the way the question was asked. The response patterns for all 12 pairs of questions are given in Table 3.1, and Table 3.2 gives the sum of all four patterns for the complete Lawson's test.

From Table 3.1, the subtotal of the (0,1) and (1,0) patterns for most questions is relatively low, which means the question and reasoning parts form rather consistent pairs. However, there are several high concentrations of this pattern, in columns such as Q7 & Q8, Q11 & Q12, and Q13 & Q14. (As noted earlier, Q21 through Q24 no longer follow the question-then-explanation pattern, so they have been omitted from this analysis even though they also show a high sum of the 01 and 10 patterns.) Remarkably, these inconsistent cases match the problematic items identified earlier. This result further supports our hypothesis that there may be discrepancies in question design.

3.6 Qualitative Results – Analysis of Student Interviews

To further test the hypothesis that the high percentage of (0,1) and (1,0) answers is caused by question design, we gave the Lawson's test to 181 college freshman science and engineering majors and asked them to provide open-ended reports on their reasoning for each question. We also conducted follow-up interviews with a subset of these students (N=66), asking them to go over the test after completing it and explain how they solved each of the questions. These students are from the same pool of college freshmen from whom the quantitative data were collected. In this section, the three question pairs of interest are discussed in detail.

Q7 & Q8 Pair on Proportional Reasoning

7. Water is now poured into the narrow cylinder (described in Item 5 above) up to the 11th mark. How high would this water rise if it were poured into the empty wide cylinder?

a. to about 7 1/2
c. to about 8
d. to about 7 1/3

8. because

a. the ratios must stay the same.
b. one must actually pour the water and observe to find out.
c. the answer can not be determined with the information given.
d. it was 2 less before so it will be 2 less again.
e. you subtract 2 from the wide for every 3 from the narrow.

From the results in Table 3.1, more than a quarter of students (27.1%) gave the incorrect answer to the concrete question but selected the correct reason. Conversely, 6.6% of students answered correctly for an incorrect reason. More specifically, almost 11% of the answers for Q8 were choice "e" even though the keyed answer was "a".

In order to further understand how students respond to this question pair, we asked students to provide open-ended reports on their reasoning and also conducted interviews with a subgroup of these students. Below are two cases selected from the open-ended reasoning.

Sample 1:
7. 1) Answer: d
   2) Explanation:
8. 1) Answer: e
   2) Explanation: Proportions

Here we can see that the student selected the correct answer on the question for the wrong reason. Even with this student's obvious math competence, they picked "e" instead of "a" as the answer for the reasoning part.

In the follow-up interviews, there were several students (5 out of 21) whose answers mirrored the logic of the student in Sample 1. They could correctly solve the problem using knowledge of ratios and mathematics for Q7, but they all picked "e", an incorrect choice for reasoning as defined in the answer key of the Lawson's test. According to the interviewed students, this was mainly because they thought "e" was a more detailed description of the ratio, specific to the question, whereas "a" is just a vague, generally true statement. Therefore, these students chose the answer that they took to be more descriptive out of two choices that appeared to both be correct.

In addition, during the administration of the test, quite a few students raised their hands to ask about the difference between choices "a" and "e". They believed these two choices were fundamentally equivalent. This further reinforces our contention that the wording leads to confusion and can upset an accurate measure of their reasoning abilities.

Sample 2:
7. Answer: B
   Explanation:
8. Answer: A
   Explanation: use ratios

In this case, by contrast, the student failed to obtain the correct ratio in question 7 but picked the correct reason in question 8. The explanation is rather simplistic, merely restating that the question is about a ratio. Apparently, the student was not fluent in properly using a ratio to solve this problem but knew the idea to "use ratios". Therefore, an inference can be made that the student had a general idea that the answer would be related to the concept of ratios, which led them to pick "a" – the vague but true statement about ratios. This student was able to describe the method to solve the problem but did not actually know how to carry out the operations. This indicates that the design of this question is problematic for those who are competent in the required skills but more forgiving to those who are less competent, leading to assessment uncertainties in terms of both false positives and false negatives.

Through the above analysis of this question pair, it is obvious that the question wording is problematic, which has an adverse impact on the validity of the assessment.

Q11 & Q12 and Q13 & Q14 Pairs on Control-of-Variables Reasoning

11. This experiment shows that flies respond to (respond means move to or away from)

12. because

a. most flies are in the upper end of Tube III but spread about evenly in Tube
b. most flies did not go to the bottom of Tubes I and III.
c. the flies need light to see and must fly against gravity.
d. the majority of flies are in the upper ends and in the lighted ends of
e. some flies are in both ends of each tube.

13. In a second experiment, a different kind of fly and blue light were used. The results are shown in the drawing.

a. blue light but not gravity
b. gravity but not blue light
c. both blue light and gravity
d. neither blue light nor gravity

14. because

a. some flies are in both ends of each tube.
b. the flies need light to see and must fly against gravity.
c. the flies are spread about evenly in Tube IV and in the upper end of tube
d. most flies are in the lighted end of Tube II but do not go down in Tubes I
e. some flies are in both ends of each tube.

The results in Table 3.1 indicate that these two question pairs behave similarly. They are both highly concentrated in the (0,1) and (1,0) patterns, and in both pairs roughly a quarter of students (27.4% and 23.4%, respectively) chose the correct answer for the wrong reason. To further understand student reasoning on these question pairs, we analyze qualitative data from open-ended surveys and interviews.

Case 1 (Q11 & Q12 pair):

In this case, the student correctly noted that the flies tend to move to the top of the tubes when they are upright (I and III), but did not realize that one needed to use tubes II (level and with dark paper over one end to block light) and III (upright and without paper) to test both hypotheses – gravity and red light. It was these tubes that needed to act as controls, since tube II is level and independent of any gravitational effect. Tube III had no paper, so it only tested the effect of gravity when compared with tube IV.

Case 2 (Q11 & Q12 pair):

In this case, the student answered the first question of the pair incorrectly but picked the correct reason in the second question. We can also see a contradictory situation here: in the first question the student's reasoning and explanations contradict his/her answer; the reasoning stated that the black paper had no significant effect, yet the answer picked (a) clearly stated that light did have an effect. The more curious situation is the reasoning portion, where the student gave the correct answer (choice a) with the reasoning that "it represents the results". This is true, but it is also true for answer "b". Moreover, the student also wanted to compare a tube (tube I) that was not needed for controlling variables (as did the student in Case 1). In addition, the poor graphical representation of the original Lawson's test was also criticized by students.

Here we can see that the reasoning statements in question 12 need improvement. The current statements give incomplete descriptions that leave students with a sense of uncertainty, and their answers are usually based on guessing whichever option reads best. Clearer and more explicit comparisons of results among the different tubes need to be included in the choices so that the sense of controlling variables under different conditions is clearly manifested. Otherwise, the reasoning will remain implicit, at an intuitive gut-feeling level, without being formally recognized in students' thought processes.

In interviews, similar situations were also found. For example, one student wanted the answer for Q12 to be a combination of "c" and "e", even though the former choice expresses the student's belief about how flies ought to respond rather than something consistent with the data from the experiment. In follow-up questions, that student indicated they compared tube I to III and tube II to IV.

From these data, it appears that students did compare multiple tubes, rather than looking at a single tube, to test a hypothesis. Perhaps this tendency exists because the tubes they compared have only one change between them: tubes I and III are both upright, and their only difference is the presence or absence of black paper on the bottom section of the tube. Since it would seem a single variable has changed between them, COV skills would suggest they are appropriate for deriving valid conclusions. Similarly, tubes II and IV are both level and differ only in the use of black paper over one end. The student in Case 2 also indicated that a comparison between tubes I and IV and between II and IV could be of use, further supporting this interpretation. The problem is that tube I is frequently used by students in making comparisons, which leads us to believe that at a certain developmental level, students tend to build their reasoning on situations in which all variables are varied (tube I), which is in fact a confounded condition not useful for hypothesis testing. This further indicates that students at this level lack formal reasoning about the meaning and utility of COV, but they have started to understand the need for co-variation of variables. This may be an intermediate level in the developmental progression of COV reasoning.

The question pair of Q13 and Q14 has issues very similar to those of Q11 and Q12. For example, the keyed answer (d) to Q14 includes examining tube I even though that is not a proper implementation of COV. This might explain why Q14, which is very similar to Q12, was answered correctly more often than Q12.

In summary, based on the qualitative and quantitative results, we have reason to believe that the choices in the reasoning portion of the fly-COV questions (Q11-Q14) can cause significant uncertainties among students and need to be reworked. In addition, the graphical representations of the questions also need to be improved. These issues have been carefully revised in the corresponding questions used in iSTAR.

3.7 Considerations on the Two-Tier Question Structure

While we have so far concentrated on students answering a question incorrectly because of issues with the representation and wording of the question, there is also the concern that the format of the test may allow students to answer a question correctly without real understanding of what the Lawson test is meant to measure. Because of the two-question format of the exam, students can identify matched pairs of answers across the two questions using simple logic. Therefore, students can use cues in the wording or choices of one question of a pair to help answer the other.

This hypothesis was further supported in interviews with some of the students. For example, about 10% (6 out of 66) changed their answer to Q11 after they had read Q12. Apparently, something in their reading of the latter question helped them realize an issue with their prior answer. About an equal number of students (5 out of 66) changed their answers to both Q11 & Q12 after reading the Q13 & Q14 pair, which is very similar in design. This again indicates that something in the questions cues the test taker toward the answers to other questions. In the last question pair (Q23 & Q24), some interviewees did not understand the first question until reading the second, which further indicates that some (but not all) students use information from the pair of questions to help them understand and answer the questions.

To avoid this problem, one way is to eliminate the two-tier structure, either by combining the two questions into a single question that addresses both content and reasoning, or by retaining only one part of the pair and converting the other into part of the question stem in the form of propositions, presumptions, conditions, or extended information that students would need to use or consider in answering or explaining the question. The other way is to retain the two-tier structure but design better choices for both questions so that multiple combinations of choices can form correct answer pairs – this significantly decreases the success rate of the simple ruling-out strategy that students often use in solving multiple-choice questions. However, the optimal design of such questions and choices is challenging and needs many cycles of research to validate.

In conclusion, we see that the two-tier structure of the Lawson's test does impose an additional validity issue. In our new iSTAR questions, we have revised the fly-COV pairs, retaining the two-tier structure but building each pair as a two-stage question so that the second question is more independent of the student's answer to the first question. This allows a greater variety of reasonable two-choice combinations as possible answers and addresses the interference between the two questions. More details on the related iSTAR questions are discussed in later chapters of this thesis.

3.8 The Ceiling Effect and Measurement Saturation of the Lawson’s Test

The Lawson's test is in general a simple test that measures learners' fundamental reasoning components, and it shows a non-trivial ceiling effect with students at the college level. The average scores of freshman U.S. and Chinese college students in STEM fields are similar, around 75% (US N=1046, China N=332). For non-science majors (US N=1061, China N=175), the average scores of the two populations are also similar, around 60% (Bao et al., 2009).

As part of the research outlined in Bao et al. (2009), to understand the usability of the Lawson's test with different age groups, we conducted further research to measure how scientific reasoning ability develops through the school and college years. We collected data from Chinese students from 3rd grade to the graduate level (NTotal = 6357). The students are from 141 classes in 20 schools from 8 regions around China, and thus form a fairly representative population. The results are plotted in Figure 3.3, which shows the general developmental trend of the Lawson's test scores of Chinese and U.S. students spanning from 3rd grade to the graduate level. The blue dots are grade-average Lawson's test scores. The red line is referred to as a "Learning Evolution Index Curve", obtained by fitting the data with a logistic function similar to the one used in item response theory (see Chapter 4 for more detail on the model fitting).

The error bars shown in the graph are standard deviations of the class mean scores, which give the range of variance one would expect when comparing the mean scores of different classes. The U.S. data were collected in three Midwestern states from 34 schools (NTotal = 3010). We plotted the mean scores of the U.S. data as red circles on top of the Chinese data. We can see that from 5th grade to the first year of college, the U.S. and Chinese data are within one standard deviation of each other, showing a similar developmental scale. In terms of measurement features, the Lawson's test works well with 9th graders, but there is a significant ceiling effect for students from the senior high school level and up.

The developmental data show that students only fully develop these basic reasoning abilities around their college years. Therefore, in order to assess the reasoning abilities of senior high school students, college students, and graduate students in STEM fields, we need to develop questions that involve more advanced reasoning components.

Figure 3.3. The developmental trend of Chinese and U.S. students' Lawson's test scores (Lawson's test total score vs. grade level; China data, U.S. data, and fitted curve).

The problem with this ceiling is two-fold. First, the saturation level is significantly below the 100% mark, even for the graduate-level students, with the maximum average at about 80%. Since the Lawson test is relatively simple, an advanced student should be limited only by the maximum possible score. This lower ceiling strongly indicates that the presentation and wording of the questions contribute to disagreements among well-educated individuals in their interpretation of and reasoning about the questions. Second, the ceiling effect makes it difficult to differentiate a senior in high school from a graduate student in terms of reasoning abilities. This is unexpected, as one would assume that after four years of college, students would have fully developed their basic reasoning skills. The limitations described here for the Lawson's test call for the development of new instruments on scientific reasoning for the assessment of more advanced individuals.

3.9 Conclusions

In this chapter, I reviewed existing work and discussed our own research regarding the validity of the 2000 version of the Lawson Classroom Test of Scientific Reasoning. While the test has been a widely used tool for assessing students' abilities as formal reasoners, multiple validity issues have been identified, including item/choice design issues, item context issues, item structure and wording issues (e.g., the two-tier design), the limited measurement range, and the ceiling effect for advanced students. All of these call for research on revising the Lawson's test and for further development of new instruments that measure scientific reasoning.

Chapter 4. The Developmental Metric of Scientific Reasoning

4.1 Context of the Study

Research on students' scientific reasoning ability has been gaining popularity in recent years. An important area is the assessment of that reasoning ability. The Lawson Classroom Test of Scientific Reasoning (LCTSR) is a readily available instrument and has been used in several recent studies. However, there is no established knowledge about the basic measurement parameters of the Lawson test, such as performance baselines for students of different ages and backgrounds. Using the Lawson test, we have collected data from students in both the U.S. and China, which are analyzed to determine the developmental metric of student reasoning ability spanning from the elementary school level to the graduate level. The results provide quantitative measures of key assessment parameters of the Lawson test, including flooring and ceiling baselines, average variations due to population differences, and developmental gains at different grade levels. These results provide important information for measurement calibration and validation in future studies of student reasoning ability.

Although the validity of the Lawson's test has not been fully established, it is the only readily available quantitative instrument on scientific reasoning and has therefore been widely used. From the assessment point of view, it is important to determine the basic measurement features of this instrument with large-scale data, which will also help further establish its validity and provide baseline results for researchers and teachers to properly interpret their assessment outcomes.

4.2 Data Collection

Using the Lawson's Classroom Test of Scientific Reasoning (LCTSR), we collected data in the U.S. (N=3010) and China (N=6357) from students ranging from the third grade to graduate school. The Chinese students are from 141 classes in 20 schools from eight regions around China; thus, they form a fairly representative population. The U.S. data were collected in three Midwestern states from 34 private and public schools. The Chinese data from grades 3 to 12 were taken between 2007 and 2009. The U.S. data from grades 5 to 12 were taken between 2007 and 2010.

The college student data are from first- and second-year college students in science and engineering majors enrolled in entry-level calculus-based physics courses. These groups of students form the main body of the next-generation technology workforce in both the U.S.A. and China.

Data from four U.S. universities and three Chinese universities are used in this study. The four U.S. universities are labeled U1, U2, U3, and U4. University rankings and backgrounds are given below (based on the 2007 U.S. News and World Report ranking):

• U1 is a large research-1 state university, U.S. ranking top 50, acceptance rate
• U2 is a large research-1 state university, U.S. ranking top 60, acceptance rate
• U3 is a large tier-4 state university with an acceptance rate of 84%.
• U4 is a large tier-3 state university with an acceptance rate of 69%.

The three Chinese universities are labeled C1, C2, and C3. Their national rankings are given below (based on the 2007 ranking from Sina Education News, http://edu.sina.com.cn). (A national university is one that is under the direct control of the department of education.)

• C1 is a top 30 national university.
• C2 is a top 60 national university.
• C3 is a top 130 national university.

In the selection of universities, we targeted the ones with medium rankings in order to form a more representative pool of the population. All data from the college students were taken between 2007 and 2009. A small group of second-year graduate students majoring in physics and engineering from C1 and U1, respectively, also took the Lawson's test as an anchoring mark for fully developed learners in formal education settings.

A summary of student performances on the Lawson's test is given in Table 4.1. The data from the Chinese students span from 3rd grade to graduate school (grades 3 to 17), while the data from the U.S. students span from 5th grade to the sophomore year of college (grades 5 to 14). The summarized results are population means of students at each grade level and include the mean scores on the entire Lawson's test as well as the mean scores on the six individual skill dimensions of the Lawson's test.

USA (N = 3010)
Grade  N     Total        Conservation  Proportion   COV          Probability  Correlation  HD
             S     SD     S     SD      S     SD     S     SD     S     SD     S     SD     S     SD
5      83    0.274 0.119  0.360 0.288   0.123 0.187  0.260 0.200  0.269 0.274  0.381 0.362  0.310 0.263
6      256   0.371 0.138  0.575 0.296   0.222 0.210  0.292 0.203  0.466 0.321  0.458 0.420  0.325 0.227
7      256   0.365 0.162  0.570 0.314   0.221 0.126  0.275 0.232  0.522 0.340  0.309 0.431  0.304 0.235
8      44    0.372 0.162  0.586 0.392   0.215 0.165  0.293 0.171  0.536 0.384  0.527 0.451  0.251 0.219
9      507   0.453 0.172  0.646 0.318   0.246 0.171  0.358 0.232  0.664 0.278  0.473 0.434  0.368 0.248
10     182   0.488 0.199  0.679 0.314   0.264 0.169  0.424 0.251  0.656 0.327  0.458 0.456  0.390 0.248
11     483   0.604 0.161  0.792 0.265   0.333 0.179  0.498 0.239  0.875 0.218  0.654 0.446  0.440 0.266
12     210   0.642 0.195  0.829 0.269   0.395 0.242  0.575 0.247  0.810 0.253  0.667 0.426  0.509 0.293
13     782   0.708 0.164  0.864 0.212   0.629 0.328  0.590 0.256  0.867 0.241  0.709 0.410  0.541 0.292
14     207   0.758 0.162  0.920 0.187   0.739 0.322  0.666 0.243  0.930 0.186  0.749 0.401  0.586 0.301

China (N = 6357)
Grade  N     Total        Conservation  Proportion   COV          Probability  Correlation  HD
             S     SD     S     SD      S     SD     S     SD     S     SD     S     SD     S     SD
3      102   0.212 0.086  0.252 0.245   0.107 0.164  0.214 0.193  0.196 0.207  0.353 0.319  0.225 0.174
4      336   0.239 0.103  0.475 0.330   0.175 0.208  0.161 0.170  0.228 0.225  0.217 0.292  0.220 0.199
5      464   0.260 0.112  0.435 0.329   0.189 0.221  0.216 0.193  0.190 0.230  0.255 0.348  0.294 0.228
6      333   0.310 0.144  0.520 0.351   0.263 0.304  0.226 0.196  0.298 0.324  0.312 0.384  0.282 0.228
7      529   0.338 0.143  0.594 0.363   0.347 0.334  0.252 0.222  0.259 0.289  0.354 0.395  0.271 0.243
8      378   0.409 0.177  0.704 0.330   0.491 0.370  0.318 0.253  0.300 0.339  0.351 0.435  0.306 0.248
9      765   0.468 0.169  0.802 0.292   0.490 0.364  0.363 0.247  0.427 0.364  0.392 0.432  0.348 0.287
10     1195  0.533 0.193  0.834 0.278   0.585 0.374  0.453 0.284  0.424 0.392  0.431 0.453  0.458 0.308
11     1328  0.600 0.176  0.893 0.232   0.701 0.339  0.493 0.266  0.474 0.395  0.522 0.444  0.534 0.309
12     458   0.702 0.169  0.920 0.212   0.784 0.279  0.612 0.271  0.619 0.403  0.560 0.461  0.688 0.303
13     248   0.738 0.159  0.851 0.292   0.773 0.292  0.691 0.227  0.673 0.398  0.655 0.449  0.768 0.267
14     122   0.765 0.156  0.895 0.222   0.777 0.305  0.728 0.231  0.721 0.385  0.664 0.447  0.770 0.279
17     99    0.783 0.136  0.937 0.193   0.851 0.228  0.652 0.260  0.770 0.296  0.778 0.405  0.775 0.241

Table 4.1. Summary of Lawson's test data from the USA and China (S = mean score, SD = standard deviation). COV = Control of Variables, HD = Hypothetical-Deductive.

4.3 The Developmental Scales of the Lawson’s Test Scores

The results in Table 4.1 show a steady developmental trend from young children to fully developed learners. The results on the total scores of the Chinese and U.S. students are also comparable. However, large differences between the two populations are observed on several skill dimensions. To make the comparisons visually straightforward and quantitatively accurate, we model the developmental trend using a logistic function motivated by item response theory (Hambleton & Swaminathan, 1985).

4.3.1 The Learning Evolution Index Curve of the Lawson’s Test

To quantitatively determine the features of the developmental trends, we fit the Lawson's test scores with the logistic function shown in Eq. (4.1):

    y = F + (C − F) / (1 + e^{−α(x − b)})        (4.1)

where

x – Student grade level
y – Student score
F – Floor, the model-predicted lowest score based on the student data
C – Ceiling, the model-predicted highest score based on the student data
α – Discrimination factor, which determines the steepness of the central part of the curve
b – Grade-equivalent difficulty level, which controls the center of the curve.

Figure 4.1. The developmental trend of Chinese and U.S. students' total LCTSR scores (fit parameters: C = 0.829, F = 0.201, α = 0.474, b = 9.680).

Since the Chinese data set is large and covers a wide spectrum of socioeconomic backgrounds, we fit the model described in Eq. 4.1 to the Chinese data and obtained the developmental curve shown in Figure 4.1. The mean LCTSR scores of the U.S. data are scatter-plotted and overlaid on top of the curve. The error bars are the standard deviations of the student scores. We can see that from 5th grade to the sophomore year of college, the U.S. and Chinese data follow each other very well, showing a similar developmental scale.
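A minimal sketch of such a fit is shown below, using scipy.optimize.curve_fit and the Chinese grade-level mean total scores from Table 4.1; because it fits grade means rather than the full individual-student data set, the resulting parameters will differ slightly from those reported in Table 4.2.

    import numpy as np
    from scipy.optimize import curve_fit

    def learning_curve(x, F, C, alpha, b):
        """Logistic developmental model of Eq. (4.1)."""
        return F + (C - F) / (1.0 + np.exp(-alpha * (x - b)))

    # Chinese grade-level mean total scores from Table 4.1.
    grades = np.array([3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 17], dtype=float)
    means = np.array([0.212, 0.239, 0.260, 0.310, 0.338, 0.409, 0.468,
                      0.533, 0.600, 0.702, 0.738, 0.765, 0.783])

    # Starting values: floor near chance, ceiling near the highest observed mean.
    p0 = [0.2, 0.8, 0.5, 9.0]
    (F, C, alpha, b), _ = curve_fit(learning_curve, grades, means, p0=p0)
    print(f"F = {F:.3f}, C = {C:.3f}, alpha = {alpha:.3f}, b = {b:.2f}")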

The results shown in Figure 4.1 suggest that U.S. and Chinese students follow a similar developmental path during the school years. The development occurs most rapidly during grades 8-11. For college students, the results are affected by the selection process, especially in China, where only the top 10% of high school graduates can enter the tier-1 universities. It is interesting to observe that the average score of 12th grade Chinese students is almost identical to that of the U.S. students in non-selective universities.

During the college years, reasoning ability does not change much. This result is consistent with existing studies showing that current education in STEM disciplines often trains students in content understanding but has little impact on reasoning (Bao et al., 2009). Our research has shown that traditional physics courses make little change in students' pre- to post-instruction LCTSR scores (effect size ~0.1), while inquiry-based courses often make a sizeable impact (effect size ~0.6) (Bao et al., 2009). Since the components of scientific reasoning such as control of variables, correlation analysis, and hypothesis testing are explicitly and repeatedly emphasized in inquiry-based learning, it is not surprising that this type of instruction can help develop students' abilities in these areas.
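The effect sizes quoted here are presumably computed in the usual way as the mean pre-to-post gain divided by the standard deviation of the scores (a Cohen's-d-style measure); the sketch below, with made-up numbers, is only meant to show the form of that calculation and is not the original analysis.

    import numpy as np

    def effect_size(pre_scores, post_scores):
        """Cohen's-d-style effect size: mean gain divided by the pooled standard deviation."""
        pre = np.asarray(pre_scores, dtype=float)
        post = np.asarray(post_scores, dtype=float)
        pooled_sd = np.sqrt((pre.std(ddof=1) ** 2 + post.std(ddof=1) ** 2) / 2.0)
        return (post.mean() - pre.mean()) / pooled_sd

    # Made-up pre/post LCTSR fractional scores for a small class.
    pre = [0.55, 0.60, 0.70, 0.65, 0.50]
    post = [0.60, 0.62, 0.75, 0.70, 0.55]
    print(round(effect_size(pre, post), 2))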

For quantitative evaluation of the model fitting, the fitting parameters are summarized in Table 4.2, which also includes the fits for the six individual skill dimensions. The performance of the fits is evaluated in terms of the root-mean-square deviations (RMSD) between the actually measured scores and the model-predicted scores. Two types of RMSD are given. One is the RMSD of the mean scores at each grade level, and the other is the weighted population RMSD calculated with each individual student's measured and predicted scores. As expected, the mean-score RMSD is much smaller than the population RMSD, since the calculation of mean scores removes a significant amount of the variance from the data. Compared to the average standard deviations (SD) of individual student scores, the population RMSD is typically 1/4 to 1/3 of the SD and the mean-score RMSD is about 10% of the SD, which suggests that the model fits the data well.
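The two RMSD measures described above can be written out as in the sketch below; the function and argument names are illustrative placeholders for the quantities defined in the text.

    import numpy as np

    def rmsd(measured, predicted):
        """Root-mean-square deviation between measured and model-predicted scores."""
        measured = np.asarray(measured, dtype=float)
        predicted = np.asarray(predicted, dtype=float)
        return float(np.sqrt(np.mean((measured - predicted) ** 2)))

    # Mean-score RMSD: compares the mean score of each grade level with the model
    # value at that grade.
    def mean_score_rmsd(grade_means, model_at_grades):
        return rmsd(grade_means, model_at_grades)

    # Population RMSD: compares each individual student's score with the model
    # prediction for that student's grade level, so grades with more students
    # weigh more heavily.
    def population_rmsd(student_scores, model_at_student_grades):
        return rmsd(student_scores, model_at_student_grades)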

Parameter                  Total   Conservation  Proportion  COV     Probability  Correlation  HD
C                          0.829   0.940         0.860       0.770   0.790        0.860        0.820
F                          0.201   0.280         0.050       0.200   0.190        0.160        0.240
α                          0.474   0.576         0.504       0.693   0.499        0.334        0.819
b                          9.680   6.740         8.240       10.270  10.390       10.890       10.740
Mean score RMSD            0.014   0.046         0.030       0.025   0.032        0.047        0.023
Weighted population RMSD   0.106   0.079         0.106       0.062   0.092        0.093        0.074
Average SD                 0.148   0.282         0.291       0.232   0.327        0.405        0.255

Table 4.2. The model fit parameters and the root-mean-square deviations (RMSD) of the fit for the mean scores and the population.

The fitting parameters provide quantitative scales for the measurement features of the instrument, namely the Lawson's test. From the fit of the total score, the model predicts a nominal ceiling of 83% and a floor of 20%. The floor is consistent with the multiple-choice design of five-answer questions. The ceiling is only 83%, instead of 100%, a difference of approximately four questions (out of the 24 total). This is consistent with our analysis in Chapter 3, which shows that a number of the Lawson's test questions have design and wording issues that lead to a variety of interpretations and thus to large uncertainties in answering the questions, even for fully developed learners.

The "b" parameter of the fit of the total test score, which gives the grade-level difficulty, is found to be 9.68. This means that students in grades 9 to 10 would score midway between the floor and the ceiling. Since the Lawson's test was indeed designed to measure high school students' scientific reasoning, the observed difficulty level matches the targeted population well.
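As a check on this interpretation, setting x = b in Eq. (4.1) makes the exponential equal to 1, so the predicted score is exactly halfway between floor and ceiling:

    y(b) = F + (C − F)/2 = (F + C)/2 = (0.201 + 0.829)/2 ≈ 0.52,

which is the sense in which grade 9.68 marks the middle of the developmental range.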

The "α" parameter, often called the discrimination parameter in the original IRT model, describes how quickly students' scores change with grade level. For the logistic model, students' scores change most rapidly when the grade level equals the difficulty. Taking the derivative of Eq. 4.1 and evaluating it at x = b with α = 0.47, we get

    dy/dx |_{x=b} = α(C − F) e^{−α(x−b)} / (1 + e^{−α(x−b)})^2 |_{x=b} ≈ 7.5% .        (4.2)

This shows that the most rapid change of the score is approximately 7.5% per year of development, which occurs between grades 9 and 10. Compared to the measured outcomes summarized in Table 4.1, this change is about half of the standard deviation of the student scores. Therefore, the most rapid development of Lawson's test scores happens around the freshman year of high school, with a yearly effect size of approximately 0.5.
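As a quick numerical check, at x = b the exponential in Eq. (4.2) equals 1, so the slope reduces to α(C − F)/4; substituting the total-score fit parameters from Table 4.2 gives

    dy/dx |_{x=b} = α(C − F)/4 = 0.474 × (0.829 − 0.201) / 4 ≈ 0.074,

i.e., roughly 7.5% per grade level, consistent with the value quoted above.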

The quantitative results from the model fitting can help researchers and teachers compare and align their particular measures of student gains in scientific reasoning. The results also show that scientific reasoning is a slow-changing ability in formal education settings without targeted instruction.

Summarizing our results, it is clear that the current style of STEM education, even when carried out at a very demanding level, has little impact on developing students' scientific reasoning abilities. It is not what we teach but how we teach that makes a difference in what students can develop. Ideally, we want students to gain in both content and reasoning; therefore, we need to put more research and effort into developing a balanced method of education, incorporating more inquiry-based learning into educational settings.

4.3.2 The Developmental Scales of the Six Skill Dimensions of the Lawson’s Test

The Lawson's test measures six dimensions of scientific reasoning: conservation of mass and volume, proportional thinking, control of variables, probabilistic thinking, correlational thinking, and hypothetical-deductive reasoning. The mapping between the individual test items and the six skill dimensions is listed in Table 4.3. A quick glance at the results in Table 4.1 shows that although the total Lawson's test scores and the developmental trends of U.S. and Chinese students are similar, large differences exist in student scores on several of the skill dimensions.

Questions                Skill Category
1, 2, 3, 4               Conservation of Mass and Volume
5, 6, 7, 8               Proportional Reasoning
9, 10, 11, 12, 13, 14    Control of Variables
15, 16, 17, 18           Probabilistic Thinking
19, 20                   Correlational Thinking
21, 22, 23, 24           Hypothetical-deductive Reasoning

Table 4.3. The six skill dimensions of the Lawson's test.
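As a small illustration of how the mapping in Table 4.3 can be used, the sketch below computes the six dimension scores from a 24-question response vector under single-question scoring; the function and variable names are hypothetical.

    import numpy as np

    # Item-to-dimension mapping from Table 4.3 (1-based question numbers).
    DIMENSIONS = {
        "Conservation": [1, 2, 3, 4],
        "Proportional": [5, 6, 7, 8],
        "COV": [9, 10, 11, 12, 13, 14],
        "Probability": [15, 16, 17, 18],
        "Correlation": [19, 20],
        "Hypothetical-deductive": [21, 22, 23, 24],
    }

    def dimension_scores(responses):
        """responses: length-24 sequence of 0/1 single-question scores."""
        responses = np.asarray(responses, dtype=float)
        return {name: float(responses[np.array(items) - 1].mean())
                for name, items in DIMENSIONS.items()}

    # Hypothetical example: a student who answered the first 16 questions correctly.
    print(dimension_scores([1] * 16 + [0] * 8))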

To facilitate comparisons of students' performances on the individual skill dimensions, the developmental model described in Eq. 4.1 is fitted to the student data on each of the skill dimensions. The results are discussed in the sections below.

4.3.3 The Developmental Curve of the Conservation of Mass and Volume

In the Lawson's test, questions 1-4 measure students' understanding of conservation of mass and volume. This is a very basic reasoning skill that children develop at a young age, and it is often considered a conceptual construct in the formation of more complicated mental models (Halford, 1993). Therefore, students are expected to have high scores on this dimension at early grade levels. The result of the model fit is plotted in Figure 4.2. As in the fitting of the total scores, the model is fitted to the Chinese data only, and the U.S. mean scores at each grade level are scatter-plotted on top of the fitted curve.

Figure 4.2. The developmental trends on conservation of matter and volume (fit parameters: C = 0.940, F = 0.280, α = 0.576, b = 6.740).

From Figure 4.2, we can see that students start to develop this skill fully early on. The fitting parameters indicate a high ceiling (0.94) and floor (0.28), the highest among all the skill dimensions (see Table 4.2), while the difficulty level is the lowest (6.74). The discrimination parameter is also relatively high among the different skill dimensions, indicating a moderately fast-developing skill.

It is worth noting that although young children develop the fundamental ideas of conservation before kindergarten age, the Lawson's test questions require not only the basic conservation ideas but also the ability to read and comprehend the question narratives and underlying contexts and to apply the ideas to those contexts. Therefore, students need to establish their reading and comprehension skills before they can correctly answer the Lawson's test questions, which may be one of the major contributing factors for the conservation dimension having a difficulty level equivalent to grades 6-7 rather than earlier.

Comparing the U.S. and Chinese data, we can see that the Chinese students are slightly ahead of the U.S. students on this skill dimension during the middle school and high school years. The two populations start to split around 8th grade and re-converge during the college years. The possible causes of this observation involve many factors, including sample selection, the science curriculum in elementary schools, etc. With the current data size and sample distributions, it is difficult to identify conclusive causes for the differences, so this question is not pursued within the scope of this thesis.

4.3.4 The Developmental Curve of the Proportional Reasoning

The ability to handle ratios and proportionality in problem solving is a highly emphasized skill, often considered a fundamental requirement for STEM learners (McLaughlin, 2003). (See Chapter 2 for a detailed review of research on proportions and ratios.) Using the same method, the developmental model is fitted to the student data on the proportional reasoning dimension, and the fitting result is shown in Figure 4.3.

Figure 4.3. The developmental trends on proportional reasoning (fit parameters: C = 0.860, F = 0.050, α = 0.504, b = 8.240).

From Figure 4.3, we can see a quite low floor (0.05), which is below chance. Since proportional reasoning depends heavily on basic math computation skills typically taught in grades 5-7, low scores at the lower grade levels (3-5) are expected. The fact that the floor is below chance may be caused by the mathematical nature of the questions – students will engage in certain types of math computation, rather than guessing, to generate answers. Guessing produces correct answers at the chance level, whereas computation can drive the probability of a correct answer below chance, since lower grade level students more frequently make computation errors and produce wrong answers.

It is also noticeable that the ceiling of the fit is only 0.86. It can be surprising that college and graduate students are not able to score perfectly on these simple math questions. As discussed in Chapter 3, it is believed that the major contributor to this lower-than-expected ceiling is the design of the questions. In particular, Question 8, the reasoning part of the second two-tier question on proportional reasoning, has two choices that experts would consider equivalent. The Chinese students overwhelmingly scored low on this question, as they hesitate to pick the generally true choice as the correct answer.

From the fit, the difficulty level turns out to be 8.2, which is the second easiest among the six dimensions. This is consistent with the Chinese curriculum, which emphasizes math drilling. Comparing the Chinese data with the U.S. data, we can see a clear gap: starting from the 7th grade, the Chinese students outperform the U.S. students by 1 to 2 standard deviations. The gap starts to diminish at college age as the U.S. students become more fluent in basic math computations and the Chinese students hit the ceiling of the Lawson's test.

4.3.5 The Developmental Curve of the Control of Variables

Control of variables is a major category of scientific reasoning. As detailed in Chapter 2, many existing studies on scientific reasoning focus solely on this particular area. The Lawson's test contains six questions on control of variables, showing its emphasis on this skill. The contexts of the questions show a strong influence of biology coursework. However, the contexts are simple enough that learners without established knowledge of biology are expected to be able to understand and reason through the questions.

Figure 4.4. The developmental trends on control of variables (fit parameters: C = 0.770, F = 0.200, α = 0.693, b = 10.270).

The fitting results on control of variables are plotted in Figure 4.4. For this skill, the ceiling is the lowest among all skill dimensions (see Table 4.2). This is likely the result of the item design issues discussed in Chapter 3, where the design of the choices for question 12, the reasoning part of the first fly question pair, was found to be controversial. The floor of the student scores is right around the chance level (0.2), indicating that students without a developed understanding of control of variables tend to guess.

The difficulty level of this skill dimension is between grades 10 and 11, slightly higher than that of the Lawson's test as a whole. The discrimination is the second highest among all skill dimensions, suggesting that COV develops most rapidly around the sophomore year of high school.

Comparing the Chinese and U.S. data, it appears that both populations are similar on this skill dimension – the differences are typically a small fraction of the standard deviations. This implies that culturally embedded educational factors in the two countries did not make a significant impact on students' development of COV skills.

4.3.6 The Developmental Curve of the Probabilistic Reasoning

Understanding probability is an important and fundamental ability for students to correctly interpret scientific data and conduct data analysis, as most scientific experiments involve uncertainties of many kinds, some systematic and others stochastic. In the Lawson's test, four items are devoted to the measurement of basic probabilistic reasoning. The contexts are straightforward, involving simple scenarios of counting objects with specific features and finding the likelihood of certain combinational patterns occurring. Typically, these skills are addressed in middle school math courses. The results of the model fitting for this skill dimension are shown in Figure 4.5.

Figure 4.5. The developmental trends on probabilistic reasoning (fit parameters: C = 0.790, F = 0.190, α = 0.499, b = 10.390).

From Figure 4.5, the most striking result is that the US students outperform the

Chinese students by a big step, at least one standard deviation ahead. The gap started immediately at the beginning of the middle school (6th grade) and sustained ever since all

the way through college. Based on our validity evaluations discussed in Chapter 3, there are no known issues concerning the probability questions in the Lawson’s test, and therefore, the gap must be of an origin residing in the educational and or cultural settings of the two populations.

Discussions with teachers from both countries have revealed traces of evidence that may explain the differences. In the US, probability is a standard component in middle school math curriculum. It is also emphasized by teachers and taught with hands on activities very similar to the contexts of the questions in the Lawson’s test.

While in China, probability is only slightly touched in the curriculum partially due to its simplistic nature in computation. Teachers often assume students would understand the counting based frequency calculation very quickly, spending only a few classes on it, and jump into more complicated topics such as combinations and conditional probabilities. This makes it difficult for students to grasp the main underlying concepts such as randomness and independence. The exposures to students are often in the form of narrowly structured questions that require more complicated calculation but less in conceptual modeling. These are possible causes from the formal education settings that might have contributed to the lower performance of Chinese students on probabilistic reasoning, especially when facing a real life context.

From a more culturally based perspective, in both real life and formal education, students in China are heavily guided in their learning, and nearly all the problems they encounter have precisely defined correct answers. Years of such training often lead students to develop a preference for certainty and, at the same time, a deficit in the proper understanding of uncertainty.

The possible causes of this gap are based on a small number of unstructured discussions with teachers from the US and China. The results may not be representative, but the educators involved recognize them as plausible hypotheses. The common consensus is that this area of reasoning warrants more in-depth study, which is being pursued in our current research.

4.3.7 The Developmental Curve of Correlation Thinking

Correlation thinking bears many similarities to probabilistic reasoning. One way to interpret a correlation is as the conditional probability for a pair of events to co-occur. For example, one can rephrase a statement about the correlation between two events in terms of the likelihood of observing one event given that the other does or does not occur. Likewise, an established understanding of correlation is fundamental to students’ capacity for analyzing experimental data and drawing conclusions.

On the Lawson’s test, there are only two questions on correlation, which use a simple context of counting mice of various forms in a defined area. The model fitting results on the correlation dimension are plotted in Figure 4.6. Due to the small number of questions, the variance of the data is quite large, giving an average standard deviation on the order of 40%.

Figure 4.6. The developmental trends on correlation thinking (fitted parameters: C = 0.860, F = 0.160, α = 0.334, b = 10.890).

The results on the correlation dimension are in general similar to those for probabilistic reasoning. The US students again outperform the Chinese students, but the gap is smaller, averaging half a standard deviation. The possible causes of this gap are also similar. In China, statistical measures such as correlation are not emphasized in the K-12 curriculum, and even when correlation is introduced, students typically learn to calculate it for well-defined problems using the summation formula. Seldom do they encounter a problem that presents a real-world experimental setting and requires them to both identify the relevant variables and propose a possible model relating those variables.

The fitting parameters also suggest that for the Chinese students the difficulty level is the highest among the six skill dimensions, and it is also the slowest changing ability.

4.3.8 The Developmental Curve of Hypothetical-Deductive Reasoning

Hypothetical-deductive reasoning is considered the most complicated ability in the Lawson’s test (and is therefore placed last), representing the final stage of formal reasoning. Nevertheless, it is the most important core skill in scientific reasoning, as hypothesis testing is always the goal of scientific inquiry and the application of scientific methods.

On the Lawson’s test, the last four questions are reserved for this skill dimension.

These are fairly long questions requiring a whole page of reading and parsing. The two questions in a pair are no longer structured as answer-explanation. In the first pair (questions 21-22), the first question provides a number of experimental designs while the second question gives a set of possible experimental outcomes. The two questions need to be coordinated in order to form a consistent pair of design and outcome that can be used to test the provided hypothesis. In the second pair of questions (23-24), the narratives present an experimental setting and two possible hypotheses. The first question asks for a selection of experimental outcomes that would prove the first hypothesis wrong, while the second question asks for experimental outcomes that would negate the second hypothesis. As a result, in order to respond correctly to these questions, students need well-established reading comprehension as well as information processing skills to parse out the useful information from an abundance of co-existing but irrelevant features. The results of the model fitting are given in Figure 4.7.

Figure 4.7. The developmental trends on hypothetical-deductive reasoning (fitted parameters: C = 0.820, F = 0.240, α = 0.819, b = 10.740).

The model fitting results show that hypothesis testing is a more advanced ability that students start to develop in their high school years. It is also the most rapidly changing (most discriminating) ability among all six skill dimensions. On this skill dimension, the ceiling of the Lawson’s test questions is a little over 80%. Two potential causes of this low ceiling have been observed in our research. One is the length of the reading, which often causes students to lose track of the relevant experimental structure and variables and to misinterpret the questions. The other is related to the contextual elements of the questions. Students often tried to use their prior knowledge about red blood cells in question pair 21-22 rather than reasoning through the questions. Question pair 23-24 uses a plastic bag that is semi-permeable, which runs against common sense since most plastic bags encountered in real life are waterproof. Some students therefore thought the designs were implausible, which prevented them from reasoning further.

Comparing the Chinese and US data, we see that the two populations are similar up through 9th grade. The Chinese students start to outperform the US students from 10th grade onward. This might be the result of population selection. In China, compulsory education covers grades 1-9; high school starts at 10th grade, and only about 50% of middle school students move on to attend high school. In the school communities tested in the US, almost all middle school students move on to high school, which starts at 9th grade. Therefore, it is possible that the gap between US and Chinese students after 10th grade is due to the population selection at the onset of high school.

4.4 Summary

This chapter presents a detailed analysis of the developmental trends of scientific reasoning abilities for US and Chinese students measured with the Lawson’s test. A logistic model is developed and fitted to the data. The model fits the data reasonably well, with the RMSD of the mean scores at about 10% of the average standard deviation.
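To make the fitting procedure concrete, the sketch below fits a logistic curve of this kind to grade-level mean scores in Python. The functional form (a floor F, ceiling C, discrimination α, and grade-location difficulty b) is assumed here only to be consistent with the parameters reported with Figures 4.5-4.7; the data points, initial guesses, and function names are illustrative placeholders, not the study’s actual data or code.

# Hypothetical sketch: fitting a four-parameter logistic developmental curve.
# The functional form is assumed from the parameters (C, F, alpha, b) reported
# with Figures 4.5-4.7; the data points below are placeholders, not study data.
import numpy as np
from scipy.optimize import curve_fit

def logistic_curve(grade, C, F, alpha, b):
    # Mean score vs. grade: floor F, ceiling C, slope alpha, difficulty b.
    return F + (C - F) / (1.0 + np.exp(-alpha * (grade - b)))

# Placeholder grade levels and mean scores (fraction correct) for one dimension.
grades = np.array([4, 6, 8, 10, 12, 14])          # 14 stands in for college
mean_scores = np.array([0.21, 0.24, 0.35, 0.52, 0.68, 0.75])

# Initial guesses: ceiling, floor, discrimination, difficulty (grade at midpoint).
p0 = [0.8, 0.2, 0.5, 10.0]
params, cov = curve_fit(logistic_curve, grades, mean_scores, p0=p0)
C, F, alpha, b = params
print(f"C={C:.3f}, F={F:.3f}, alpha={alpha:.3f}, b={b:.3f}")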

The parameters obtained from the model fitting provide quantitative measures for comparing overall scientific reasoning ability and the individual skill dimensions. The results show that the Chinese and US students are similar in their total scores on the Lawson’s test but diverge in five of the six skill dimensions. The actual causes of the differences are still under investigation; however, initial clues suggest that the cultural and educational settings of the two countries contribute substantially to the differences.

The analysis also provides quantitative metrics for the difficulty levels of the Lawson’s test and the individual skill dimensions. The results show that this test is optimized for assessing high school students.

From the model fitting, we also see the common issue of a low ceiling, which is typically at the 80% level. This is consistent with the test design issues identified in the validity evaluation studies discussed in Chapter 3.

Chapter 5. Studying the Learning Progression of Scientific Reasoning Skills through Pattern Analysis of Responses to the Lawson’s Test

5.1 Context of the Study

Scientific reasoning skills have been widely researched, and the Lawson Classroom Test of Scientific Reasoning is widely used in this research. The study discussed in this chapter aims at developing a method for mining data from the Lawson’s Test. By analyzing the patterns of responses to four items from the Lawson’s test, we have determined that students providing the correct answer without the correct reasoning are at an intermediate level of understanding, a construct that is overlooked by traditional scoring of two-tier test items. From this, we were able to identify six levels of performance based on student responses to the Lawson’s Test. Based on the analysis, a new scoring method for the Lawson’s Test is also proposed.

5.2 Review of Learning Progression in the Context of Scientific Reasoning

Scientific reasoning skills are vital to a student’s education. In her extensive review of the scientific reasoning literature, Zimmerman (2007) claims that investigation skills and content knowledge bootstrap one another, creating a relationship that underlies the development of scientific thinking. Research has been conducted to determine how these scientific thinking skills can best be fostered and which teaching strategies contribute most to learning, retention, and transfer of these skills. Zimmerman found that children are more capable of scientific thinking than was originally thought, whereas adults are less so. Additionally, she found that there is a long developmental path followed by those acquiring scientific thinking skills, and student performance varies as they progress along this path. Zimmerman (2007) further stated that scientific thinking requires a complex set of cognitive skills, the development of which requires much practice and patience. It is important, then, for educators to understand how scientific reasoning abilities develop.

The idea that students build their knowledge progressively is well established;

Lawson (1979) states that the developmentalist’s view of intelligence is not that of innate ability but rather how a student’s abilities have progressed over time. Lawson also poses important questions: how does reasoning develop, and does every student develop skills in the same order and at the same rate? Learning progressions, a relatively recent focus for researchers (Duncan & Hmelo-Silver, 2009; Steedle & Shavelson, 2009), are excellent tools that can be used in answering these questions.

Generally defined, a learning progression is a way of describing how students build their knowledge of a certain concept over time (Alonzo & Steedle, 2009). Duncan and Hmelo-Silver (2009) offer a four-part, comprehensive definition. (1) Learning progressions are focused on a few basic ideas and practices that are developed over time. (2) Learning progressions have an upper and lower bound. The upper bound describes what students are expected to know and is determined by academic standards (Alonzo & Steedle, 2009) and research in the content area of the learning progression. The lower bound is determined by the prior knowledge and skills students have as they enter the progression.

(3) Learning progressions are comprised of levels that describe the steps students take between the lower and upper bounds. The levels are determined by examining existing research and by empirical studies of the progression. (4) Learning progressions do not describe learning as it would occur naturally; instruction is required. Learning progressions may seem linear, but they do not assume that all students follow a single path through the progression.

Other researchers have described the traits of a learning progression. The period of time covered by a learning progression can vary from one instructional unit to several years (Alonzo & Steedle, 2009), and the number and size of levels can differ between progressions (Duncan & Hmelo-Silver, 2009). Additionally, learning progressions are at least partly hypothetical in nature, so research is needed to verify the learning progression (Duncan & Hmelo-Silver, 2009; Steedle & Shavelson, 2009). Alonzo and Steedle (2009) argue that longitudinal studies are necessary for validating a learning progression.

Learning progressions can be informed by new research and should be modified accordingly. This leads to an iterative process: define a learning progression, use it to design assessment instruments, use these assessments to modify the learning progression, and so on.

There are multiple well-defined learning progressions already, such as force and motion (Alonzo & Steedle, 2009) and biodiversity (Songer, Kelcey, & Gotwals, 2009), but there are limited studies relating learning progressions to scientific reasoning. In the field of biology, Lawson, Alkhoury, Benford, Clark, and Falconer (2000) proposed a progression describing student reasoning. The levels they named were (1) sensory-motor stage, (2) preoperational stage, (3) descriptive concepts, (4) hypothetical concepts, and (5) theoretical concepts. They ordered the levels based on the difficulty of the tasks associated with each level; descriptive concepts are based on experience, hypothetical concepts require imagining past or future events, and theoretical concepts cannot be derived from observation. A student at a particular level should be able to perform at all lower levels as well. Their study used a modified form (Lawson, 2000) of the original Lawson’s Classroom Test of Scientific Reasoning (Lawson, 1978), so it is highly relevant to our focus on scientific reasoning skills. The progression that was developed serves biology concepts well, as it categorizes these concepts according to difficulty. However, the scientific reasoning aspects of the progression are not fine-grained.

Since national standards indicate that students are expected to learn scientific reasoning skills (National Research Council, 1996), educators need a way to determine what their students should know and at what point in their schooling they should know it. Lawson (1979) states that only material that is appropriate to the developmental level of the students should be utilized. By defining a scientific reasoning learning progression, curricula could be appropriately tailored to a given age group. To define a scientific reasoning learning progression, more work is needed.

Defining a learning progression is not an easy task, particularly because each student learns in a different way (Alonzo & Steedle, 2009), but a common trait among learning progressions is that they are designed based on research evidence (Duncan & Hmelo-Silver, 2009). Typical methodology in researching a learning progression involves both qualitative (interviews, open-ended questions) and quantitative (multiple choice) data.

One such strategy is ordered multiple choice (OMC) (Briggs, Alonzo, Schwab, & Wilson, 2006). OMC items are unique in that they are based on what Briggs et al. call a construct map—a model of student cognitive development. Each answer option depicts a different level of understanding within the construct map. This allows for straightforward diagnosis of the student’s performance. Briggs et al. state OMC items are useful because they provide more diagnostic information than traditional multiple choice items while remaining efficient.

Briggs et al. (2006) state that the key to writing effective OMC items is carefully defining a construct map for the concept at hand. Distracters are written to represent common misconceptions or errors that students would be expected to make. These errors represent different levels of the construct map. To validate OMC items, Briggs et al. used open-ended questions in the same content area.

OMC items have great diagnostic power (Briggs et al., 2006). Briggs et al. describe a chain of reasoning to support this claim: OMC items are based on construct maps; construct maps are based on student cognitive development; cognitive development is based on national standards; thus, OMC items can indicate where students are in a larger content domain, which cannot be achieved with traditional multiple choice.

Alonzo and Steedle (2009) employed the OMC methodology described above in a study of the force and motion learning progression. One of their research goals was to determine the advantages and disadvantages of OMC and how OMC compares to more qualitative methods (interviews and open-ended questions). Three separate studies were carried out, the subjects of which were eighth-grade, seventh-grade, and ninth- through twelfth-grade students, respectively. Students responded to a variety of questions based on force and motion. Their scores determined their positions within the force and motion learning progression. As part of this study, interviews were conducted; several students responded to OMC items using a think-aloud method (saying whatever they were thinking) and other students participated in traditional interviews. This qualitative data determined how accurately the assessment items reflected the students’ knowledge and position on the learning progression. Alonzo and Steedle point out that open-ended questions allow students to be placed at any level of a learning progression, while OMC items reflect distinct, discrete levels. Despite this disadvantage relative to open-ended items, OMC items are easier to score and provide more narrowly defined categorical information on student thinking, which reduces interpretation biases. Alonzo and Steedle’s study showed that OMC items assess students better than open-ended questions do for the force and motion learning progression.

Alonzo and Steedle’s (2009) study used a learning progression that was already well-defined, but not all learning progression research is done this way. Songer, Kelcey, and Gotwals (2009) developed a biodiversity learning progression from scratch. They defined both content and inquiry reasoning progressions that combine to form the learning progression. Their development process can be described by five steps.

The first step was working with experts (in this case, zoologists) to determine the focal points of the learning progression (Songer, Kelcey, & Gotwals, 2009). The goal was to make content “simply complex” (i.e., maintain rigor but make it accessible to fourth- through sixth-graders). In the second step, the focal points were translated into curricular activities. This was done by referencing previous work on scaffolding as well as working with current teachers. The third step involved developing assessment items that corresponded to the content and inquiry reasoning progressions. Pre-tests, embedded assessments, and post-tests were developed using both forward and reverse engineering. The purpose of the test items was to indicate how the students’ knowledge development connected to the progressions. The fourth step was to empirically evaluate the content and inquiry reasoning progressions using the assessment instruments from step three. This was done using both cross-sectional and growth curve analyses. The growth curve analysis was done in a piecewise fashion since the researchers did not want to assume linear growth. The fifth and final step was to take the results of step four, as well as national and state education standards, and develop a three-year learning progression.

The biodiversity learning progression (Songer, Kelcey, & Gotwals, 2009) is fully developed (Duncan & Hmelo-Silver, 2009), but other learning progressions are still in their early stages of development. One example is Duncan, Rogat, and Yarden’s (2009) modern genetics learning progression. The progression, which was developed based on existing research on student understanding in genetics as well as national education standards, has well-defined levels, but it has not yet been validated. There are eight big ideas within the progression that fall into three models (genetic, meiotic, and molecular).

These three models were previously established, but Duncan, Rogat, and Yarden expanded them to include what it means to have understanding at those levels and how that understanding might develop. The learning progression was further organized around two domain-specific questions: how do genes influence how organisms look and function, and why do organisms vary in how they look and function? These two questions provide a meaningful way to categorize important ideas. The learning progression spans grades five through ten. Fifth grade was chosen as the starting level because students have been exposed to some ideas in genetics at this point.

An important point made by Duncan, Rogat, and Yarden (2009) is that the three models named above should all be introduced at the same time. They state that progress is not defined by simply learning what the models are. Rather, developing more sophisticated versions of the models and the relationships between them indicates mental growth. This growth falls into three bands, each of which loosely encompasses two grade levels. The expectations of student performance at each band were developed based on the theoretical framework of the learning progression as well as research done in student learning of genetics. However, these expectations have yet to be tested, and empirical evidence will be necessary to refine and validate the learning progression.

One way Duncan, Rogat, and Yarden (2009) plan on assessing their learning progression is through the use of learning performances. Learning performances combine scientific inquiry practices with scientific concepts to describe the ways in which students should be able to use their knowledge. Duncan, Rogat, and Yarden have developed a set of these learning performances that reflect the big ideas in genetics at each grade level. Variation in student responses will indicate different positions within the learning progression. They acknowledge that the learning performances will need to have sound psychometric properties in order to be reliable. A possible downside to the learning performances is that they may not cover all the levels within the learning progression; that is, an item may reveal placement in levels two and three but not level one.

Described above were three learning progressions in different stages of development.

Any learning progression, regardless of its stage of development, needs to be refined and verified using empirical data. Qualitative data has the advantage of being rich in information, but it is time-consuming to obtain. Quantitative data is readily available, but the methods with which it can be used are still lacking. There are vast amounts of existing data (from standardized tests, for example) that we cannot yet utilize.

One test with potential for defining a scientific reasoning learning progression is Lawson's Classroom Test of Scientific Reasoning (Lawson, 1978). The Lawson Test has been widely used by the education research community (Bao et al., 2009). Additionally, in validating his test, Lawson used both quantitative and qualitative methods. As a result, there is a complete and rich existing data set available, but how can we use it? Is there a way we can extract learning progression information from these resources? Data we have collected in a large-scale assessment using the Lawson Test shows a developmental progression through the grade levels, which suggests that the existing data has great potential to be useful. Data mining could provide an opportunity to define a scientific reasoning learning progression and would renew existing data.

Data mining is not a new technique. It is part of a larger effort referred to as knowledge discovery in databases, which is centered on the investigation of processes, algorithms, and mechanisms for retrieving potential knowledge from data collections (Norton, 1999). At present, there is not an established method for renewing data from the Lawson Test to define a scientific reasoning learning progression. While there are many methodologies available to extract information from data (Norton, 1999), the one that we believe will be most applicable is pattern recognition. We will need to determine how to utilize this method with the existing data to extract information and define learning progression levels. Doing so could be extremely valuable, as data that is no longer in use would become renewable. In addition, we would be able to define a scientific reasoning learning progression based on a valid assessment instrument.

5.3 Research Design to Study Learning Progress in Scientific Reasoning

The overarching goal of this research is to develop a better method for categorizing student scientific reasoning ability. The result will help identify possible learning progression levels based on responses to Lawson’s Test, which will shed light on understanding and revising the scoring method of the Lawson’s test.

In the development of his test, Lawson (1978) aimed for a balance between the convenience of paper-and-pencil tests and the positive features of interview tasks. He classified students using both the classroom test and interview tasks into three levels (concrete, transitional, formal-level). He found that the majority of students were classified at the same level by both the test and interview tasks but that the classroom test may slightly underestimate student abilities. Validity was further established by referencing previous research on what the test items were supposed to measure as well as performing item analysis and principal-components analysis.

The context for our study is the modified form of Lawson's Classroom Test of Scientific Reasoning. It is a 24-item, two-tier, multiple choice test. Treagust (1995) describes a two-tier item as a question with some possible answers followed by a second question giving possible reasons for the response to the first question. The reasoning options are based on student misconceptions that are discovered via free-response tests.

The traditional scoring method for two-tier items, such as those on the Lawson Test, is described by Table 5.1; both the answer to the question and the reasoning need to be correct in order for the student to receive credit (Lawson, 1978; Treagust, 1995).

Answer      Reasoning   Total score
Incorrect   Incorrect   0
Incorrect   Correct     0
Correct     Incorrect   0
Correct     Correct     1

Table 5.1. Traditional scoring on a two-tier item from the Lawson Test.

According to the traditionally used scoring method for two-tier questions, the first three rows of the table represent equivalent skill levels. This leads to a step function in the scoring of a particular problem. We feel we can identify skill levels at a finer grain size. We believe each row of Table 5.1 represents a different level of understanding. Getting both the answer and the reasoning incorrect certainly indicates the lowest skill level while getting both correct indicates the highest, but the skill level when only the answer or only the reasoning is correct is unclear. One goal of our research is to identify whether getting just the answer or just the reasoning correct indicates a different skill level, and which represents a higher skill level. Once this is accomplished, student scores will resemble a ramp from low to high skill levels.

We hypothesize that students may understand the answer to a question before they can fully articulate the reasoning behind their response. This is based on teaching experience; we have seen that students can often recite an answer without being able to describe the reasoning that led them to that answer. There is also existing research proposing that reasoning is preceded by a subconscious bias toward the right answer (Bechara, Damasio, Tranel, & Damasio, 1997). Bechara et al. studied risk-taking behavior by having a control group and a test group (patients with prefrontal cortex damage and decision-making deficiencies) perform a gambling task. They found that the control group began to choose advantageously even before they had realized the correct strategy. This behavior was not seen in the test group at all; in fact, they continued to choose disadvantageously even when they knew the correct strategy. This suggests that there is a subconscious influence on decision-making that develops before reasoning. Bechara et al. proposed that this subconscious bias calls on the individual’s previous experience.

This research suggests to us that students who answer the question correctly but the reasoning incorrectly are at a higher skill level than those who answer both incorrectly or only the reasoning correctly. To study this, we will examine student performance on several questions from the Lawson Test.

When Lawson (1978) first designed his test, he found that the questions typically fell into three levels of difficulty. Similarly, we have chosen items that appear to be easy or difficult for students based on overall performance. The items we have chosen are shown in Figure 5.1 and will be referred to as P1 (pendulum answer, easy), P2 (pendulum reasoning, easy), F1 (flies in a tube answer, difficult), and F2 (flies in a tube reasoning, difficult). See Table 5.2 for a sampling of student performance on these questions. We believe students will understand P1 and P2 before they understand F1 and F2.

Difficulty   Item   Context                Grades 6-7   Grades 9-10   College level
Easy         P1     Pendulum – answer      34%          66%           79%
Easy         P2     Pendulum – reasoning   29%          57%           78%
Difficult    F1     Flies – answer         16%          29%           50%
Difficult    F2     Flies – reasoning      16%          17%           30%
(Entries are the percent correct.)

Table 5.2. Student performance on two easy and two difficult questions.

One method that can be used to analyze data in studies such as this one is Item Response Theory (IRT), which operates under several assumptions, the most relevant of which is local independence (Hambleton, Swaminathan, & Rogers, 1991). Local independence means that responses to any two items on a given test are statistically independent. Clearly IRT will not apply to an analysis of the Lawson Test, as it has a two-tier design and responses to consecutive items are highly dependent. In fact, we rely on the dependency between questions to extract information about student reasoning. Thus, we need to develop a new method to utilize the existing Lawson Test data. We believe that by analyzing the patterns in student responses, we can identify a learning progression.

Figure 5.1. Items from the Lawson’s Test used in this study.

Any learning progression needs to be verified after it has been developed (Alonzo & Steedle, 2009). Steedle and Shavelson (2009) used latent class analysis to verify a force and motion learning progression. Latent class analysis assumes that observed variables depend on an unobservable (latent) variable (Lazarsfeld & Henry, 1968). For example, item responses are observed variables while ability is the latent variable.

According to Lazarsfeld and Henry, some assumptions need to be made about the latent variable since there is no way it can be directly measured. A latent variable is defined, then, by the effects it has on certain indicators. Steedle and Shavelson argue that this method is appropriate because it assumes each student belongs to a particular latent class that accounts for that student’s performance patterns. It is also useful because it provides information about individuals within classes (i.e., ability groupings) and does not make any assumptions about an existing learning progression. Use of this method relies on Bayes’ Theorem, so Steedle and Shavelson developed two models: exploratory (making no learning progression assumptions) and confirmatory (based on their proposed learning progression). They found that the number of latent classes in the confirmatory model was fixed by the number of levels in the proposed learning progression. To determine the effectiveness of their latent class model, they compared the item response learning progression levels (for example, option “A” indicates level 3, etc.) to the latent classes. If a lower latent class had a high probability of selecting a low-level response, it signaled that the latent class model was correct.

While latent class analysis is a useful tool in developing a learning progression, there are multiple reasons why it was not used in this study. First, latent class analysis is not commonly used for two-tier items or grouped items, which is what our study entails.

Lazarsfeld and Henry (1968) state that latent class analysis is primarily used for dichotomous (two-response) items. If this method could be used with two-tier items, and to our knowledge no one has attempted to do so, the process would likely be very complicated. The statistical work that is done with individual questions is already complicated, which is a second reason why we are choosing not to use latent class analysis—it is not easily accessible. The algorithm used in latent class analysis is run through a computer program that provides results; Steedle and Shavelson (2009), for example, estimated their latent class parameters using Markov Chain Monte Carlo methods carried out with Gibbs sampling implemented in WinBUGS version 1.4. The program is something of a black box; the processes that go on inside are unclear. The method we are using, pattern analysis, is much more intuitive. It is a straightforward method that can be easily implemented, since all data organization is done using ordinary spreadsheets. Furthermore, it connects the data to the learning progression more directly; rather than having to interpret latent class analysis matrices, any pattern that is seen is literally the learning progression result.

We will utilize pattern analysis in three different ways by examining (1) a cross-section of all the data across multiple grade levels, (2) transitional behavior with pre- and post-tests for a particular grade level, and (3) the distribution of scores within a given population.

5.4 Data Collection

Data for this study come from the first data set, collected from 2007 to 2009 with students in grades three through twelve in both China and the United States as well as college students from a large Midwestern university in the United States. The results showed that Chinese and US students have similar reasoning abilities (Bao et al., 2009); therefore, the data from both countries are combined. The distribution of collected data across grade levels is given in Table 5.3. All students were given enough time to finish the test. Younger students took 45 to 50 minutes, while college students needed about 30 minutes. The Chinese students used a translated Chinese version of the test.

Grade   3     4     5     6     7     8     9      10     11     12    College
N       102   336   547   588   868   606   1489   1520   2083   847   1823

Table 5.3. Distribution of collected student data across different grade levels.

5.5 Data Analysis and Results

It is hypothesized that students will be able to provide a correct answer before they can provide correct reasoning. This suggests to us possible levels of student performance when analyzing the four items chosen from the Lawson Test. Additional motivation for defining levels comes from item difficulty; since P1 and P2 are easy questions, students will likely answer them correctly before answering F1 and F2 correctly. Since each of the four items can be answered correctly or incorrectly, there are 16 possible response patterns in total. We group these into six levels of performance based on the ideas above. The groupings are shown in Table 5.4.

Responses are coded using “0” and “1”. There are four items, which correspond to two groups of two, each group having an answer and a reasoning component. The two responses for P1 and P2 are listed as the first pair, while those for F1 and F2 are listed as the second pair. The answer is the first digit in each pair, and the reasoning is the second digit. Thus, the code 00-00 means all responses were incorrect, while a code of 11-11 means all responses were correct. A code of 11-10 would mean that P1 and P2 were correct, F1 (the answer) was correct, and F2 (the reasoning) was incorrect.

Level 1 represents entirely incorrect responses and serves as the lower anchor of a possible learning progression. Level 2 includes all responses with P1 incorrect but at least one other item correct. If a student cannot answer P1 correctly, we believe any other items answered correctly have a large chance of being the result of guessing. These responses are not included in Level 1 for reasons that will be discussed shortly. Level 3 includes responses where P1 is correct and P2 is incorrect; that is, students correctly answer the first item but miss the reasoning. A 10-01 response is included in this level as F2 was likely guessed correctly. A 10-11 response is grouped into Level 4 rather than Level 3 because there is a chance that students can miss P2 while still fully understanding F1 and F2. Also, it would be unlikely that a student could guess correctly on both F1 and F2. Other Level 4 responses involve getting both P1 and F1 correct or getting both P1 and P2 correct; it is unclear which of these responses indicates a higher ability, which is why they are included in the same level. Level 5 requires that students answer P1, P2, and F1 correctly; that is, they fully understand the easy items and are partway to understanding the difficult items. Finally, Level 6 represents entirely correct responses and serves as the upper anchor for our potential learning progression.
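As a minimal illustration of how this coding and grouping can be automated, the sketch below builds the “xx-xx” pattern from the four item scores and maps each of the 16 patterns to the six levels listed in Table 5.4. The function and variable names are our own illustration, not the analysis scripts used in the study.

# Sketch: coding responses to P1, P2, F1, F2 and mapping patterns to the six
# performance levels of Table 5.4. Names are illustrative, not from the thesis.

# Level groupings taken from Table 5.4 / Table 5.7.
LEVELS = {
    1: {"00-00"},
    2: {"00-01", "01-00", "01-01", "00-10", "01-10", "00-11", "01-11"},
    3: {"10-00", "10-01"},
    4: {"11-00", "11-01", "10-10", "10-11"},
    5: {"11-10"},
    6: {"11-11"},
}
PATTERN_TO_LEVEL = {code: lvl for lvl, codes in LEVELS.items() for code in codes}

def response_pattern(p1: bool, p2: bool, f1: bool, f2: bool) -> str:
    # Build the "xx-xx" code: first pair is P1/P2, second pair is F1/F2.
    return f"{int(p1)}{int(p2)}-{int(f1)}{int(f2)}"

def performance_level(p1: bool, p2: bool, f1: bool, f2: bool) -> int:
    return PATTERN_TO_LEVEL[response_pattern(p1, p2, f1, f2)]

# Example: P1 and P2 correct, F1 correct, F2 incorrect -> code "11-10" -> Level 5.
assert performance_level(True, True, True, False) == 5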

Level 2 is comprised mostly of guessing responses. It is not combined with Level 1 because a correctly guessed response may indicate some amount of understanding. For instance, a student could have eliminated some options knowing they were incorrect but still had to guess from those remaining. At the same time, “00” responses may indicate a misconception; students may have some ideas relating to the problem, but their ideas are misconceptions, which is separate from pure guessing. These considerations make it difficult to determine the relative skill requirements of Levels 1 and 2. Future research will be aimed at better defining these levels. For the present study, the levels shown in Table 5.4 will be used.

We can make some rough predictions of the patterns that will be seen in the six levels. Since Levels 1 and 2 indicate relatively low skill, the number of students performing at these levels should decrease with age. Level 3 is an intermediate level that should stay about the same for all ages. Levels 4 through 6 indicate meaningful learning, so we expect them to increase with age. Level 4 should increase more rapidly than Levels 5 or 6.

Table 5.4 shows all responses to Lawson Test items P1, P2, F1, and F2. Scores from these four items were separated from the rest of the items. Students were then grouped by grade level and by performance within that grade level. This performance division is based on Lawson’s (1978) results with his original test. When comparing classifications of ability based on interview results to scores on the test, he found that those at the lowest level (concrete reasoning) generally scored 0 to 5 points out of 15, those at the middle level (transitional reasoning) generally scored 6 to 11 points, and those at the highest level (formal reasoning) generally scored 12 to 15 points. These point values represent the lowest 30%, middle 40%, and highest 30% of scores. We divide our populations into similar percentages while making sure to place all students with the same score in the same category. That is, if the lowest 30% includes scores of 0 to 5 but a student with a score of 5 falls into the middle 40%, that student will be grouped with the lowest 30%. For example, in grades 6-7, the division is as follows: lowest 30% (score of 0-5 out of 20), middle 42% (score of 6-9), and highest 28% (score of 10-19). The percentages vary slightly in each grade, but we aim at dividing into a lower 30%, middle 40%, and upper 30%.
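A minimal sketch of this banding step is given below. It assumes only what is described above: a target split of roughly 30%/40%/30% and the rule that all students with the same total score end up in the same band; the exact cut scores used in the study (e.g., 0-5 / 6-9 / 10-19 for grades 6-7) were fixed per grade rather than computed this way, so the helper names and logic are illustrative only.

# Sketch: splitting a grade-level population into low/middle/high bands near a
# 30/40/30 split while keeping students with equal total scores in the same band.
from collections import Counter

def band_cutoffs(scores, low_frac=0.30, high_frac=0.30):
    # Return (low_max, mid_max): the highest score in the low and middle bands.
    n = len(scores)
    counts = Counter(scores)
    low_max, mid_max, cumulative = None, None, 0
    for score in sorted(counts):
        cumulative += counts[score]
        if low_max is None and cumulative >= low_frac * n:
            low_max = score
        if mid_max is None and cumulative >= (1.0 - high_frac) * n:
            mid_max = score
    return low_max, mid_max

def band(score, low_max, mid_max):
    if score <= low_max:
        return "low"
    return "mid" if score <= mid_max else "high"

# Example usage with hypothetical subscores (0-20).
scores = [0, 2, 5, 5, 6, 8, 9, 12, 15, 18]
low_max, mid_max = band_cutoffs(scores)
print(low_max, mid_max, [band(s, low_max, mid_max) for s in scores])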

Our goal in looking at Table 5.4 is to compare how students of different ages and abilities respond to the test items in order to find patterns within the responses. We chose to examine grades 6-7, 9-10, and college. With two years separating each group, any significant changes in performance will be clearer. Since the college-level data came from American students, only American student responses are presented in the college section of Table 5.4.

There are three methods of analysis we will use in this study. First, a single population can be divided into performance-based levels (as in Table 5.4) to see how ability affects responses. Second, a cross-section of all students can be examined for changes in responses with age. Third, pre- and post-tests can be examined to see how one year of learning impacts responses. These methods revealed three main results.

Pattern (Level):                 00-00(1) 00-01(2) 01-00(2) 01-01(2) 00-10(2) 01-10(2) 00-11(2) 01-11(2) 10-00(3) 10-01(3) 11-00(4) 11-01(4) 10-10(4) 10-11(4) 11-10(5) 11-11(6)

Grades 6-7
  Low 30%  (score 0-5,  N=433):  44.5%    11.1%    6.2%     1.8%     10.8%    1.8%     2.8%     0.2%     7.8%     1.6%     6.2%     1.6%     1.6%     0.7%     0.9%     0.2%
  Mid 42%  (score 6-9,  N=616):  47.8%    7.3%     5.8%     1.0%     6.6%     0.8%     1.8%     0.0%     8.6%     1.5%     13.6%    1.6%     1.9%     0.3%     1.1%     0.2%
  High 28% (score 10-19, N=404): 27.2%    5.4%     4.0%     0.0%     4.7%     0.5%     1.2%     0.2%     11.6%    1.5%     28.9%    3.5%     1.7%     0.5%     6.2%     3.0%

Grades 9-10
  Low 33%  (score 0-8,  N=991):  32.3%    4.6%     7.8%     1.1%     5.5%     1.9%     1.4%     0.6%     11.3%    2.0%     20.7%    2.4%     2.6%     1.0%     3.1%     1.6%
  Mid 40%  (score 9-13, N=1211): 16.4%    2.4%     2.6%     0.6%     4.5%     0.9%     1.3%     0.9%     11.8%    1.4%     34.0%    3.5%     3.9%     1.2%     10.2%    4.4%
  High 27% (score 14-20, N=804): 6.0%     0.4%     2.2%     0.7%     2.5%     0.5%     0.4%     0.7%     8.4%     0.6%     32.7%    3.2%     2.6%     1.6%     23.1%    14.3%

College
  Low 29%  (score 0-11, N=523):  27.7%    4.6%     1.1%     0.6%     6.1%     1.9%     2.1%     0.0%     3.4%     0.2%     32.5%    4.2%     1.5%     0.0%     9.2%     4.8%
  Mid 40%  (score 12-16, N=724): 8.8%     1.8%     0.1%     0.3%     3.0%     0.7%     2.3%     0.4%     1.4%     0.7%     32.3%    6.2%     0.8%     0.4%     20.8%    19.9%
  High 31% (score 17-20, N=573): 1.2%     0.0%     0.2%     0.0%     1.9%     0.3%     1.2%     0.7%     0.2%     0.2%     22.1%    2.3%     0.2%     0.0%     33.6%    35.9%

Table 5.4. Responses to Lawson Test items P1, P2, F1, and F2 from grades 6-7, 9-10, and college.

5.5.1 Result 1: Defining a new level in the scoring of the Lawson Test

As described in the research design section, traditional scoring of the Lawson Test allows for only two levels of performance—both the answer and reasoning need to be correct or no credit is given. We believe this does not accurately reflect the possible levels of student understanding. Students who get just the answer or just the reasoning correct may be at a higher level of understanding than those who get both incorrect. We want to determine if a “10” (answer correct) response is at a higher level of understanding than a “01” (reasoning correct) response. To do so, we examine responses to Lawson Test items P1, P2, F1, and F2.

First, we examine college students’ responses and divide the population into performance-based levels as described above. The performance of the three groups within the college student population is shown in the bottom section of Table 5.4. We can narrow our focus to compare “01” and “10” responses by looking specifically at two pairs of columns: 01-00 with 10-00, and 11-01 with 11-10. In both pairs, we see that many more students respond “10”, which leads us to believe that responses of “01” could be due to random guessing. We also note, when comparing 11-01 to 11-10, that as ability level increases, the number of “10” responses increases while the number of “01” responses decreases. This suggests that “10” responses indicate a higher level of ability than “01” responses. This is an important result because a “10” response is traditionally worth zero points. These results appear to indicate that “10” actually represents a higher level of reasoning and should therefore be worth some credit.

Grade   00-00   01-00   10-00   11-00   11-01   11-10   11-11
3       38.2%   5.9%    14.7%   3.9%    2.0%    1.0%    0.0%
4       51.5%   9.5%    6.5%    1.8%    0.6%    0.3%    0.0%
5       43.0%   6.4%    9.0%    7.1%    0.9%    1.1%    0.2%
6       39.5%   6.0%    10.7%   15.6%   2.4%    1.5%    0.5%
7       42.2%   5.1%    8.2%    15.7%   2.0%    3.1%    1.3%
8       25.4%   2.8%    17.0%   25.6%   1.5%    10.1%   2.6%
9       22.6%   3.8%    12.0%   27.3%   3.2%    9.9%    4.8%
10      15.2%   4.6%    9.5%    31.1%   2.9%    12.8%   7.4%
11      12.0%   3.5%    8.5%    32.6%   4.6%    14.5%   10.9%
12      8.5%    1.8%    4.4%    25.4%   4.1%    23.1%   18.5%

Table 5.5. Student performance on P1, P2, F1, and F2 from grades 3 to 12.

Next, we look at a cross-section of grades 3 through 12. Table 5.5 shows responses to P1, P2, F1, and F2. Comparing the 11-01 and 11-10 columns, we see that students answer “10” more frequently as they get older, which corroborates our result that “10” indicates a higher level of reasoning than “00” or “01”.

Pre-test              Post-test
00-00   12.6%         00-00   2.9%
                      01-00   0.6%
                      10-00   0.6%
11-00   40.0%         11-00   18.3%
                      11-01   1.7%
                      11-10   8.6%
                      11-11   5.7%

Table 5.6. College student responses to P1, P2, F1, and F2 on pre- and post-tests.

The 01-00 and 10-00 columns of Table 5.5 also provide valuable information. Consider the “01-00” column; there is no pattern to the responses other than a slight decrease in the older grades. This suggests that a “01” response is likely a guess. We can also compare the ratio of percentages in the “10-00” column to the “01-00” column. Between third and seventh grade, the average ratio is 1.6. At the eighth grade, however, there is a dramatic shift in responses, and the ratio jumps to 6.1 (17.0%/2.8% ≈ 6.1). This indicates a major learning shift and a possible step in a scientific reasoning learning progression. After the ninth grade (ratio 3.2), the ratio decreases, though it remains higher than in the lower grades. This is due to older students moving to a higher level (i.e., answering P1 and P2 correctly).

Finally, we examine pre- and post-test data from college students. Some students (N=175) were given the Lawson Test before and after taking a college physics course. Results are given in Table 5.6 with the same coding as above. The post-test responses were only taken from the 12.6% and 40.0% represented in the pre-test portion of the table.

P1 and P2 tend to be very easy for college students, so there is little information to be gleaned from studying 00-00 responses on the pre-test. F1 and F2, on the other hand, give college students difficulty, so meaningful comparisons can be made by analyzing the changes in 11-00 pre-test responses. Table 5.6 indicates that many students stayed at the 11-00 level, but when learning gains occurred, students moved into the 11-10 or 11-11 categories more than into the 11-01 category. Again, this suggests that “10” responses indicate a higher skill level than “01” responses.

To summarize our first result, three forms of analysis suggest that for an item on the Lawson Test, a correct answer with incorrect reasoning indicates a higher skill level than getting both incorrect. This is a construct that is overlooked by traditional two-tier item scoring. There are important educational implications of this result. Students may be performing at higher levels than teachers realize. By recognizing that providing a correct answer is progress toward full understanding, teachers will know what to look for in their students and will be able to offer proper amounts of encouragement and credit.

5.5.2 Result 2: Combined patterns of responses as indicators for performance levels

The first result indicates that responding to an item with a correct answer and incorrect reasoning represents a different level of understanding than responding to both incorrectly. Knowing that “00”, “01”, and “10” responses cannot be treated equally, we can divide student responses into the six levels shown in Table 5.4 and described above.

We can see in Table 5.4 that the total distribution is sparse; most cells have a very low percentage of students. By looking for areas with high concentrations of students, we can find meaningful patterns. The columns in Table 5.4 that show such concentrations are those for the 00-00, 10-00, 11-00, 11-10, and 11-11 patterns. Note that many of the remaining columns include “01” responses (answer incorrect, reasoning correct). Such responses do not yield any meaningful patterns. They are also more common at younger ages, which provides further evidence that “01” responses indicate guessing.

As described in the first part of the results section, we group responses into six ability levels as shown in Table 5.4. The highest level is 11-11 while the lowest is 00-00. The middle levels are ordered based on the reasoning that students will be able to provide a correct answer before they can provide correct reasoning to the same item (our first result) and that “01” responses are likely due to guessing.

Having grouped possible responses into six levels, scores in each level can be summed to make the data less sparse and patterns more visible. This essentially condenses Table 5.4 and allows for better data analysis. Table 5.7 shows the percentage of responses at each level for grades 3 through 12. Figure 5.2 plots the data represented in Table 5.7 so that patterns and trends in the data can be seen more clearly.

Each level in Table 5.7 and Figure 5.2 shows a distinct progression. Level 1 (00-00) starts with high percentages in the third grade and continually decreases. This is to be expected; many younger students will answer all four items incorrectly, but as students get older, they will answer more items correctly.

Level 2 (the remaining patterns with P1 incorrect) responses decrease with age. This supports our proposition that many of the responses in Level 2 reflect guessing. Older students no longer need to guess, particularly on P1 and P2.

Level 3 (10-0x) percentages remain relatively steady until a significant jump between seventh and eighth grade (p<0.001, effect size = 0.68), after which we see a continual decrease. At eighth grade, there is a peak in students getting P1 (the first answer) right; the subsequent decrease is due to students getting P2 (the reasoning) correct as well. This matches our first result that the answer precedes the reasoning.

Level 4 (11-0x, 10-1x) shows a rapid, steady increase from grades four to six, little change between sixth and seventh grades, then a steady (though less rapid) increase from grades eight to ten, and finally a decrease from grades eleven to twelve. More and more students are able to correctly answer P1 and P2 as they get older, which accounts for the increases. By twelfth grade, though, students begin to answer F1 correctly as well, which accounts for the decrease.

Level 5 (11-10) shows an increase as age increases. Few young students are able to answer this many items correctly. Around eighth grade is when this level begins to take off, and there is a significant increase (p<0.001, effect size=0.50) between the eleventh and twelfth grades. We do not see a decrease in this level for older students as Level 5 is often the highest level reached.

Level (patterns): 1 (00-00); 2 (00-01, 01-00, 01-01, 00-10, 01-10, 00-11, 01-11); 3 (10-00, 10-01); 4 (11-00, 11-01, 10-10, 10-11); 5 (11-10); 6 (11-11)

Grade   Level 1   Level 2   Level 3   Level 4   Level 5   Level 6
3       38.2%     32.4%     17.6%     10.8%     1.0%      0.0%
4       51.5%     34.8%     8.6%      4.8%      0.3%      0.0%
5       43.0%     33.3%     11.9%     10.6%     1.1%      0.2%
6       39.5%     26.5%     12.2%     19.7%     1.5%      0.5%
7       42.2%     23.5%     9.7%      20.3%     3.1%      1.3%
8       25.4%     11.6%     18.8%     31.5%     10.1%     2.6%
9       22.6%     14.1%     13.8%     34.8%     9.9%      4.8%
10      15.2%     15.7%     10.5%     38.6%     12.8%     7.4%
11      12.0%     11.1%     9.6%      41.9%     14.5%     10.9%
12      8.5%      10.2%     5.0%      34.7%     23.1%     18.5%

Table 5.7. Percentage of grades 3-12 at the six levels of Lawson Test performance.

Level 6 (11-11) shows a slow increase in younger grades and a large increase by twelfth grade. We would expect this behavior; students need strong reasoning skills to answer all four items correctly, and these skills are not developed until the later years of school. The percentages reflect this, as it is not until twelfth grade that this level really takes off. Between eleventh and twelfth grades, there is a significant jump in the percentage of students at Level 6 (p<0.001, effect size=0.63).


Figure 5.2. Percentage of grades 3-12 at the six levels of Lawson Test performance.

We can also consider all of the patterns together and the distributions of the levels across the grades. Performing a chi-square test with the Pearson method, we see a significant difference between each grade level (χ2 = 2106.448, p < 0.001), with the exception of sixth to seventh grade (χ2 = 9.765, p = 0.082), which shows only borderline significance. This tells us that scientific reasoning learning gains are made at each grade level.
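For illustration, the snippet below shows how such a comparison of level distributions between two adjacent grades can be run as a Pearson chi-square test; the counts are hypothetical placeholders, not the frequencies underlying Table 5.7 or the statistics reported above.

# Sketch: Pearson chi-square comparison of six-level distributions between two
# adjacent grades. Counts below are hypothetical placeholders, not study data.
import numpy as np
from scipy.stats import chi2_contingency

# Rows: grade g and grade g+1; columns: Levels 1-6 (counts of students).
contingency = np.array([
    [240, 140, 60, 120, 18, 8],   # hypothetical grade g
    [150, 70, 110, 190, 60, 16],  # hypothetical grade g+1
])

chi2, p, dof, expected = chi2_contingency(contingency)
print(f"chi2 = {chi2:.3f}, dof = {dof}, p = {p:.4g}")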

There are also some broader patterns shown in Figure 5.2. There appears to be an important transition point between seventh and eighth grade. Levels 1 and 2 (lower skill) show large decreases and Levels 3, 4, and 5 (higher skill) show large increases between seventh and eighth grade. We believe this is the time when students begin to grasp the ideas in the selected Lawson Test items.

Since Levels 5 and 6 have very low percentages until eighth grade or higher, F1 and F2 appear to be difficult and discriminating questions. They serve as a kind of low-pass filter; they filter out the rapidly changing responses (only a few students with the same answer) while allowing consistent responses (many students with the same answer) to be seen as a pattern.

Table 5.7 and Figure 5.2 give very detailed information, but it may be valuable to take a more general look at the patterns in these data. By grouping students into grades 3-5, 6-7, 8-10, and 11-12 and plotting the percentage of each grade grouping at a particular level, learning progression patterns can be observed. Grouping the grades allows us to see more dramatic shifts in performance. We can predict what trends will be seen. Since Levels 1 and 2 represent lower ability, there should be a sharp decrease from the first grade grouping to the last. Level 3 is an intermediate level, so we would expect it to remain relatively steady over the years. Levels 4-6 represent higher ability, so there should be an increase from the lowest grade grouping to the highest. Figure 5.3 shows that these expectations are indeed correct.


Figure 5.3. Percentage of grade groupings at each of the six levels.

We can also see that some levels show big jumps between grade groupings. For example, Level 1 shows a large gap between grades 6-7 and grades 8-10. Similar jumps occur between grades 6-7 and 8-10 in Levels 4 and 5, as well as between grades 8-10 and 11-12 in Level 6. These jumps indicate that large learning gains occur between those grades. It is important for teachers to recognize when students are making these scientific reasoning developments so that they can promote learning and expect significant changes in their students.

To summarize our second result, examining existing data from the Lawson Test reveals distinct progressions through six different performance levels, with a dramatic jump in ability between the seventh and eighth grades. Progression through these levels indicates that students will be able to provide a correct answer before they can provide correct reasoning.

5.5.3 Result 3: Proposing a three-level scoring system for the Lawson’s Test

The information available in Table 5.7 and Figure 5.2 is highly valuable, but teachers may not need such detailed information when assessing their students. We want to strike a balance between the complexity of the data analysis done above and the simplicity of grading a multiple choice assessment instrument. This is possible by using a new scoring method for the Lawson Test.

As previously discussed, traditional scoring of the Lawson Test allows for only one point on each pair of items (answer and reasoning). We propose that scoring should be restructured to award points to students who can provide the correct answer but not the correct reasoning. Our first result shows that such responses indicate a higher skill level, so this type of response should be recognized.

There are three ways that a two-tier item could logically be scored. First is the traditional method, where both the answer and reasoning need to be correct for credit. Second, each individual item (the answer and the reasoning) could be worth one point, and credit would be awarded for getting either one correct. Third, since “01” responses appear to be due to guessing while “10” responses indicate a higher skill level, one point could be awarded for providing the right answer while two points would be awarded for getting both the answer and the reasoning correct. We refer to this last method as a three-level scoring system because it reflects the ability levels established in the previous result: (1) nothing correct and/or guessing, (2) answer correct, (3) answer and reasoning correct. These scoring methods are summarized in Table 5.8.

To establish a base level of validity for either of the proposed methods, we compare the traditional scoring method to the proposed methods using the data from this study.

Figure 5.4 shows the three scoring methods applied to P1, P2, F1, and F2. Since individual item and three-level scoring allow for a higher point total, the scores have been scaled appropriately.

Answer | Reasoning | Traditional | Individual | Three-level
Incorrect | Incorrect | 0 | 0 | 0
Incorrect | Correct | 0 | 1 | 0
Correct | Incorrect | 0 | 1 | 1
Correct | Correct | 1 | 2 | 2

Table 5.8. Traditional and proposed scoring methods for two-tier items on the Lawson’s Test.
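The following is a small illustration (not taken from the dissertation) of the three scoring rules in Table 5.8 applied to a student's two-tier responses, with scores scaled by the maximum attainable points so the methods are comparable, as done for Figure 5.4. The response pairs used in the example are made up.

```python
# Illustrative sketch of the traditional, individual, and three-level scoring rules.
def score_item(answer_correct, reasoning_correct, method="three-level"):
    """Score one two-tier item under the chosen rule from Table 5.8."""
    if method == "traditional":
        return 1 if (answer_correct and reasoning_correct) else 0
    if method == "individual":
        return int(answer_correct) + int(reasoning_correct)
    if method == "three-level":
        if answer_correct and reasoning_correct:
            return 2
        return 1 if answer_correct else 0
    raise ValueError("unknown scoring method: " + method)

def scaled_test_score(responses, method="three-level"):
    """Percentage score over (answer_correct, reasoning_correct) pairs,
    scaled by the maximum attainable points for that method."""
    max_per_item = 1 if method == "traditional" else 2
    total = sum(score_item(a, r, method) for a, r in responses)
    return 100.0 * total / (max_per_item * len(responses))

# Example: one "11", one "10", and two "00" response pairs.
pairs = [(True, True), (True, False), (False, False), (False, False)]
for m in ("traditional", "individual", "three-level"):
    print(m, scaled_test_score(pairs, m))  # 25.0, 37.5, 37.5
```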

Figure 5.4 shows that the proposed methods are potentially valid. The curves of the individual, three-level, and traditional scoring systems are very similarly shaped. There is, however, a gap between each of the curves that must be taken into account.

First, the gap between either of our two proposed methods and the traditional method is due to the fact that our proposed methods give credit for “01” and “10” responses. Such responses traditionally receive zero points. The gaps reflect students who provide “10” and “01” responses. Second, there is a gap between the individual and the three-level scoring methods. This is because the three-level system eliminates points for guessing (a “01” response) that the individual method includes. Note that the gap between the three-level and individual curves is greater for younger students, particularly for P1 and P2. This is when we see the most guessing. For later grades, when students no longer need to guess on P1 and P2, the gap closes. The curves for F1 and F2 have a larger gap for a longer period of time. These are harder items, so guessing continues into higher grade levels.

From these results, we believe that the three-level scoring system best reflects student ability. There are other reasons for choosing this system. The three-level method allows for better statistical analysis. The average score using the traditional method is 21%, while the average score using the three-level system is 34%. Having higher average scores allows for better analysis; when the average score is very low, it is difficult to distinguish between results and noise that may result from student guessing. The lowest traditional score is less than 5%, which is very small, but the lowest score for the individual method is near 20%, which seems too high. Thus, the three-level system, with a low score of about 10%, seems like the best choice.

[Figure 5.4 panels: “P1 and P2”, “F1 and F2”, and “Average of P1, P2, F1, and F2”, each plotting score (%) against grade (3-12) for the traditional, individual, and three-level scoring methods.]

Figure 5.4. Traditional, individual, and three-level Lawson Test scoring.

Another benefit of our proposed three-level scoring method is that it rewards students who have accomplished something by providing a correct answer. Our results have shown that these students are at a higher skill level, which is something our assessment should reflect. A three-level scoring system is a more accurate assessment, but it also lets the student know that he/she is doing something right, which can provide motivation and a sense of achievement. Traditional scoring does not recognize the accomplishment of providing a correct answer.

To summarize our third result, traditional scoring of the Lawson Test does not accurately reflect student ability. As indicated by the developmental data on the Lawson’s Test (see Chapter 4), reasoning skills develop slowly, and there is an intermediate level that traditional scoring does not recognize. A three-level scoring system more accurately reflects student ability, allows finer-grained data analysis, and provides teachers with a simple way to track progression.

5.6 Conclusions

In this chapter, a data mining method was introduced and used to analyze four items on the Lawson Classroom Test of Scientific Reasoning, revealing three results. First, we acquired some theoretical insight into student responses: students will be able to answer a question correctly before they can provide the correct reasoning. Second, we established six performance levels based on student responses to the four items. These levels revealed a learning progression with distinct patterns at every level.

Third, we proposed a new scoring method for the Lawson Test. In line with our first result, we believe a three-level scoring system (where students get credit for providing correct answers with incorrect reasoning) better reflects student understanding and is therefore more accurate in assessment.

All three results are based on a new insight into a learning progression that was obtained via pattern analysis. The primary goal of this study was to explore a method that could be applied to existing Lawson Test data. The pattern analysis method used has proved successful. It is an easily accessible method; all data were analyzed using spreadsheets. The data can be intuitively analyzed by looking for patterns based on groupings of questions, and these patterns can be interpreted in a straightforward manner.

This method could be employed by teachers in assessing their students and by researchers in developing and verifying learning progressions.
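As a concrete (hypothetical) illustration of this spreadsheet-style pattern analysis, the sketch below assumes each student's responses to the four items are coded as "11", "10", "01", or "00" (answer digit first, reasoning digit second) and simply tallies the pattern codes by grade; the data structure and names are assumptions, not the authors' implementation.

```python
# Sketch of pattern analysis on two-tier response codes, grouped by grade.
from collections import Counter

def pattern_counts_by_grade(records):
    """records: iterable of (grade, [pattern codes for the four items]).
    Returns {grade: Counter of pattern codes} for inspection or plotting."""
    by_grade = {}
    for grade, patterns in records:
        by_grade.setdefault(grade, Counter()).update(patterns)
    return by_grade

# Made-up records: two seventh graders and one tenth grader.
records = [
    (7, ["11", "10", "00", "01"]),
    (7, ["10", "10", "00", "00"]),
    (10, ["11", "11", "10", "11"]),
]
for grade, counts in sorted(pattern_counts_by_grade(records).items()):
    print(grade, dict(counts))
```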

This method of data mining and pattern analysis that we have developed has great potential as vast quantities of previously collected Lawson’s test data can be mined to identify new information about student learning. However, use of this method is context dependent. For instance, a different question design might not allow for levels to be seen as distinctly. The item content and student prior knowledge may affect the patterns.

Even additional items on the Lawson Test may not show the patterns that we have seen.

Each item, then, deserves a detailed analysis, as many factors affect student responses, but such analysis is beyond the scope of this paper. An important message is that this method should not be blindly applied to any question. No analysis method works perfectly in all situations, and this holds true here.

Future work will involve addressing the issues mentioned above. Additional items on the Lawson test as well as other scientific reasoning assessment instruments should be analyzed for their potential power in defining a scientific reasoning learning progression.

Chapter 6. A Case Study on Fine Grained Learning Progression of Control of Variables

6.1 Context of the Study

Scientific reasoning skills are increasingly emphasized in science curricula. This study focuses on a particular skill of scientific reasoning, the control of variables (COV), to identify fine-grained learning progression levels, which can inform teachers and researchers in developing and delivering better-aligned curriculum and assessment.

The main hypothesis of the research is based on observations from our previous research, which suggested that when students were given experimental design and evaluation tasks in teaching and assessment of COV skills, the presence of experimental data often triggers a different mode of reasoning. In particular, when experimental data are given, students in a transitional stage of understanding COV often tend to focus on the plausibility of the experimental data, which is related to but does not directly address the COV skills involved in the experimental design. To quantitatively determine these intermediate levels of understanding in manipulating COV conditions, two forms of assessment (providing and not providing experimental data) were developed to probe how students handle data and how context affects performance. The design of the assessment tool can help identify common student difficulties and reasoning patterns at a finer grain size. Results from this study show that (1) students perform better when no experimental data are provided, (2) students perform better in physics contexts than in real-life contexts, and (3) students potentially have a tendency to equate non-influential variables to non-testable variables. Additional analysis begins to reveal a possible progression of different levels of control of variables skills. The new form of assessment design developed in this study provides a practical means for researchers and teachers to evaluate student learning progression on control of variables.

6.2 Review of Research on Control of Variables

Physics courses provide opportunities to teach scientific reasoning, and the American Association of Physics Teachers has laid out goals for physics education that reflect this fact; categories include the art of experimentation, experimental and analytical skills, conceptual learning, understanding the basis of knowledge in physics, and developing collaborative learning skills (Boudreaux et al. 2008; AAPT, 1998). To better achieve these goals in physics education, an increasing number of reformed physics curricula have been designed with inquiry learning as their focus, which helps students learn both science content and scientific reasoning skills. A non-exhaustive list of such new courses includes Physics by Inquiry (McDermott & Shaffer, 1996), RealTime Physics (Sokoloff, Thornton & Laws, 2004), ISLE (Etkina & Van Heuvelen, 2007), Modeling Instruction (Wells, Hestenes & Swackhamer), and The SCALE-UP (Student-Centered Activities for Large Enrollment Undergraduate Programs) Project (Beichner, 1999; 2008). A common emphasis of these reformed curricula is to engage students in a constructive inquiry learning process, which has been shown to have positive impacts on advancing students’ problem solving abilities, improving conceptual understanding, and reducing failure rates in physics courses. Most importantly, the inquiry-based learning environment in these reformed courses offers students more opportunities to develop their reasoning skills; these opportunities are otherwise unavailable in traditionally taught courses (Etkina & Van Heuvelen, 2007; Beichner & Saul, 2003).

Since scientific reasoning is increasingly emphasized in reformed physics courses, it is important to understand how and why students may be struggling with specific skills in scientific reasoning. For both research and teaching purposes, we need a good knowledge base and assessment tools regarding student difficulties in specific aspects of scientific reasoning. Unfortunately, there has been limited research on scientific reasoning in the context of learning physics. Resources on assessment of specific scientific reasoning skills using physics contexts are also scarce. Instead, research and assessment tools have been more focused on student learning of scientific information, and teachers are often inexperienced in assessing student performance on abilities and skills underlying the surface level content knowledge (Yung, 2001; Hofstein & Lunetta, 2004).

Among the different dimensions in scientific reasoning, control of variables (COV) is a core construct supporting a wide range of higher-order scientific thinking skills. COV is also an important skill fundamental to understanding physics concepts and experiments.

In a recent study, Boudreaux et al. found that college students and in-service teachers had difficulties with basic methods in COV, which included failure to control variables, assuming that only one variable can influence a system’s behavior, and rejection of entire sets of data due to a few uncontrolled experiments (Boudreaux et al. 2008). Boudreaux et al. concluded that students and teachers typically understand that it is important to control variables but often encounter difficulties in implementing the appropriate COV strategies to interpret experimental results.

As a fundamental construct in scientific reasoning, control of variables has been heavily researched by cognitive scientists for more than a decade (Chen & Klahr, 1999; Toth, Klahr & Chen, 2000; Kuhn & Dean 2005; Kuhn, 2007). Their studies have typically centered on the scientific reasoning skills (specifically COV) of elementary school students. The recent work by Boudreaux et al. focused on college students’ and in-service teachers’ understanding of COV in physics contexts (Boudreaux et al., 2008). The existing research has revealed a rich spectrum of COV skills, from simple tests of COV conditions to complex tasks involving multi-variable controls and causal inferences from experimental evidence.

For example, in examinations of simple COV skills, researchers used simple experiments involving few variables (Chen & Klahr, 1999; Toth, Klahr & Chen, 2000).

Second through fourth grade students were presented with a pair of pictures and asked to identify whether they showed a valid or invalid experiment to determine the effect of a particular variable. Chen and Klahr (1999) found that elementary students are capable of learning how to perform COV experiments. Students as young as second grade were able to transfer their COV knowledge when the learning task and the transfer task were the same.

To study more complex COV constructs, Chen and Klahr asked students to design experiments involving a ball rolling down a ramp to test a given variable and then state what they could conclude from the outcomes (Chen & Klahr, 1999). Increasing the complexity by involving more variables in contexts of ramps, springs, and sinking objects, Penner and Klahr (1996) and Toth, Klahr, and Chen (2000) had students design and conduct experiments, justify their choices, and consolidate and summarize their findings. In the context of sinking objects, researchers also probed student understanding of multi-variable influence by asking students what combination of variables would make the fastest sinking object. They found that older students (14-year-olds) performed better than younger students (10-year-olds).

Kuhn focused more on high-end skills regarding students’ abilities in deriving multi-variable causal relations (Kuhn, 2007). Kuhn had fourth-graders use computer software to run experiments relating to earthquakes (and ocean voyages). This study used more variables than the previous studies mentioned and asked students to determine whether each variable was causal, non-causal, or indeterminate. Identifying a causal variable is an intermediate level task, but identifying a non-causal variable is higher on the spectrum of skills because students do not always realize that one can test something even if it does not influence the result. Kuhn found that students made progress in learning COV skills despite lacking direct instruction. Nevertheless, students struggled when faced with handling multivariable causality.

In testing the high-end skill of understanding a multi-variable context, researchers also found inconsistencies in student reasoning, especially with transfer tasks relating to COV; students sometimes described what they thought would be the cause of an outcome using descriptors that did not match their experimental results. For example, a student might describe a certain material as being necessary even though they did not mention it during experimentation. Chen and Klahr noted that students can learn how to do COV experiments but will often deem them unnecessary during transfer tasks (Chen & Klahr, 1999). Kuhn saw that students could correctly design experiments but did not have a good method for handling multivariable causality (Kuhn, 2007).

Turning to older and more educated subjects, Boudreaux et al. studied college students’ and in-service teachers’ understanding of COV. They observed three distinctive abilities at different levels of complexity (Boudreaux et al. 2008). The first and simplest level was the ability to design experimental trials. To test this, students were given a set-up with a specific set of variables and were asked to design an experiment that could test whether a particular variable influenced the outcome and explain their reasoning. The second level was the ability to interpret results when the data warrant a conclusion. Students were presented with a table of trials and data from a COV experiment and asked whether a given variable influenced the behavior of the system. The third level was the ability to interpret results when the experimental design and data do not warrant a conclusion. In this case, students were provided with a table of trials and data that did not represent a COV experiment and were asked if a given variable is influential. Boudreaux et al. showed that students often have more difficulty interpreting data with an inconclusive relation than with a conclusive one (Boudreaux et al. 2008).

From these studies, we can see a rich spectrum of scientific reasoning skills relating to COV in which there is also a possible developmental progression from simple control of a few variables to complex multivariable control and causal analysis. The structure of the complexity can come from several categories of factors including the number of involved variables, structures and context features of the problem or task, the type of embedded relations (conclusive or inconclusive), and control forms (testable and non-testable).

Low-end skills:
• Identifying or recognizing a COV situation
• Designing a COV experiment to test a possible causal relation
• Deciding whether an experimental design involving multiple variables (>2) is a valid COV test of selected variables

Intermediate skills:
• Deciding, given an experimental design, whether a test is NOT a valid COV test
• Inferring from experiment results and designs that a variable, among several, is causally related to the outcome
• Inferring from experiment results and designs that a variable is testable in the design when it is non-causal

High-end skills:
• Being able to reason through experiments and hypotheses by manipulating an integrated network of multivariate causal relations

Table 6.1. A summary of different levels of COV skills studied in the literature

Table 6.1 gives a compact list of several different levels of COV skills that have been commonly studied in the existing literature. However, the existing work cannot provide a complete metric to place the different skills in terms of their developmental levels. This is because only subsets of these skills were researched in individual studies, which makes it difficult to pull together a holistic picture of the developmental progression of the different skills. In this study, we designed an experiment that can probe all the related skills in a single study, allowing us to more accurately map out the relative difficulties of the different skills and investigate how the difficulty of COV tasks is affected by task formats, contexts, and tested relations.

The results of this study can advance our understanding of one fundamental scientific reasoning element that is central to the hands-on and minds-on inquiry-based learning method. Researchers and teachers can gain insight into typical patterns of student reasoning at different developmental stages. The assessment method can also directly facilitate teaching and research in science courses.

6.3 Research Design

6.3.1 Research Questions and Goals

Building on the existing research, we have conducted a study to further investigate student difficulties regarding the understanding and application of COV in physics and real-world contexts. In particular, this research aims to (1) identify, at finer grain sizes, common student difficulties and reasoning patterns in COV under the influence of question context and difficulty, (2) study whether student difficulties reveal a developmental progression from naïve to expert learners, and (3) develop a practical assessment design for evaluating student ability in COV.

6.3.2 The Design of the Assessment Instrument

As described in the previous section, Boudreaux et al. did important work in identifying several common difficulties relating to COV (Boudreaux et al. 2008). For our study, we modified two questions from Boudreaux et al.’s study and included one additional question. These alterations aim to identify student levels of understanding, see if there is a progression through these levels, and determine how context plays a role. The instruments used in this study include two tests, each of which contains three questions on COV experiments. The questions in the two tests are identical except that in one test all questions provided experimental conditions alone while in the other test all questions provided both experimental conditions and data. Figure 6.1 shows one form of the test with three test questions all containing experimental data. The format of the test without data was the same as Figure 6.1 except the rows of data were removed. Table 6.2 outlines the details of the questions.

Item | Context | Posed Question | Correct Answer
Fishing | Real-life | Can a named variable be tested? | Named variable cannot be tested
Spring | Physics | Which variables can be tested? | Two variables can be tested; one is influential, one is not.
Pendulum | Physics | Which variables can be tested? | No variables can be tested

(Each item appears in both Version A, data not given, and Version B, data given.)

Table 6.2. Information about test items

Figure 6.1. Test questions on COV with experimental data. Question 1 poses a COV situation using a real-life context. Questions 2 and 3 are in physics contexts and are based on the tasks used in Boudreaux et al. (2008).

There are three main results we expect to obtain from our research design. The first is a comparison between a test where data were given and a test where there were no data provided. That is, we test how the structure of the task influences performance. When data are not given, the test question is basically probing the reasoning on experimental design (determining whether it is a COV experiment), and therefore is a low-end COV skill. On the other hand, when data are given, students can be drawn into reasoning through possible causal relations between variables and experimental outcomes on top of the COV situations. In such cases, students will be engaged in coordinating evidence with hypotheses in a multivariable COV setting that involves a network of possible causal relations, which is a higher level reasoning skill (Kuhn & Dean, 2005; Klahr & Dunbar, 1988). Therefore, we hypothesize that providing experimental data in a COV task will increase its difficulty. The results from this study will allow us to evaluate this hypothesis and develop an understanding of how task format may affect task difficulty in COV situations.

The first result will tell us how providing data influences performance. For the second result, we want to know if context influences how subjects handle data. In our instrument, the fishing question is a real-life context, while the spring and pendulum questions are physics contexts. By contrasting students’ responses and explanations to these questions, we can study if (or the extent to which) student reasoning is affected by contexts.

The third result centers on two threads of COV skills used in determining if selected variables are testable under a given COV condition and, when testable, if selected variables are causally related to the experimental outcomes. The spring question provided a design for two testable variables (spring length and distance pulled back at release). In Test B, where data were given, one of these variables is influential (influences the experiment outcome) while the other is not.

Boudreaux et al. showed that students performed better on questions that had testable variables (both influential and non-influential) than on questions that had non-testable variables – uncontrolled experiments from which no conclusion can be drawn (Boudreaux et al. 2008). In their design, students were asked if an experiment can “be used to test” whether a variable “influences” the result. To an expert, the statement clearly probes the testability of variables, that is, whether a relation can be tested with the given experiments. However, novice learners may be distracted by the “influence” component of the problem and misinterpret the question. This is also evident from the results reported by Boudreaux et al. (p.164), which suggest that students tend to intertwine causal mechanisms (influential relations) with control of variables (Boudreaux et al. 2008). Therefore, we believe that the posed questions in Boudreaux et al. contain a subtlety of wording that may confound students’ interpretations of the question and do not allow a clear measure of the ability to distinguish between testability and influence.

Moreover, students seem to have a real difficulty with this skill such that they may equate, either explicitly or implicitly, a non-influential variable with a non-testable variable. A careful inspection of results from Boudreaux et al. (p.166) confirms this possibility: when given non-testable conditions, half of the students failed to provide the correct response (non-testable), and most of these students stated that “the data indicate that mass does not affect (influence) the number of swings”, which is an obvious example of conflating non-influential with non-testable (Boudreaux et al. 2008).

In our study, we emphasize measuring the ability to distinguish between testability and influence, which is considered an important construct fundamental to advanced reasoning in COV situations. Since the testability of variables is solely determined by the experimental design without the need of experimental data, providing and not providing experimental data in questions constitute two stages of measurement on this particular construct. Without experimental data, the task is equivalent to an experimental design task that only tests students’ ability to recognize COV conditions, which leads to a conclusion on the testability of variables in a given experiment. With experimental data, the task turns into measuring whether students can distinguish between testability and influence, or, in other words, whether students can resist the distraction of considering the possible causal relations instead of the testability of such relations. Based on these discussions, it is reasonable to hypothesize that tasks not showing experimental data are easier (measuring basic COV skills) than those showing the outcomes (measuring more advanced COV reasoning).

To summarize, in this study we have employed a unique design of contrasting identical COV tasks in two forms, with and without experimental data. In the tasks, we explicitly ask students which variables can be tested. In addition, the questions used in this study allow multiple answers, in which both influential and non-influential relations were embedded. Therefore, the questions can measure the COV reasoning for simultaneously handling multiple variables and relations. This allows us to probe a more complete set of skills, ranging from a low-end skill such as identifying a COV experiment, through an intermediate skill such as identifying a causal variable, to a high-end skill such as identifying a non-causal variable in multi-variable conditions. The data collected from this study will allow us to quantitatively determine the relative difficulty levels of the targeted skills.

6.3.3 Data Collection

The two versions of questions discussed in the previous section were compiled into two tests: Test A contains three multiple choice questions without experimental data, and Test B contains the same three multiple choice questions in the same order but with experimental data given. To obtain more details on student reasoning, we added an open response field after each question labeled with “Please write down your reason”, in which students gave short explanations on their reasoning in solving the questions. Students’ responses to this open-ended question are coded and analyzed with the goal of determining response validity for the questions and probing student reasoning. In particular, we want to find out (1) whether a student attended to the fact that the experimental data were given or not when writing an explanation, (2) what variables a student considered in different contexts, and (3) if a student attended differently to influential and non-influential variables (in the spring question).

The subjects of this study were 314 high school students in tenth grade (with an average age of 16) from two public schools in China, one in the Beijing area with 198 students and the other in Guangdong province with 116 students. Our test results (mean scores) show no statistically significant difference between students from the two schools (p=0.32), and therefore the two groups of students were treated as equivalent and their data were combined in our analysis.

At both schools, the two forms of the test, Tests A and B, were mixed in equal numbers, one after the other, and handed out to students; that is, students in a class randomly received either Test A or Test B. Students were given 15 minutes to complete the test, which seemed to be enough time to finish three questions. Most students took less than 10 minutes to complete the test.

6.4 Data Analysis and Results

6.4.1 Impact of giving experimental data in a question on student performance

As outlined in the Research Design section, there are three main results we aim to get from this study. First we examine the difference between student performance on questions with and without experimental data. The scores from all three questions were summed to determine a main effect. Figure 6.2 shows the results. There is a statistically significant difference (p=0.001, effect size=0.38) between the two groups; those taking Test A (data not given) outperformed those taking Test B (data given). The students who were not given data had a mean score of 56% while the students who were given data had a mean score of 37% (N ~ 150 for each group).
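A minimal sketch, with made-up score distributions, of the kind of comparison reported here: an independent-samples t-test and a pooled-standard-deviation effect size (Cohen's d) between the Test A and Test B groups. This is illustrative only and not the dissertation's analysis code.

```python
# Illustrative two-group comparison: t-test and Cohen's d on hypothetical scores.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
test_a = rng.normal(0.56, 0.25, 150).clip(0, 1)  # hypothetical Test A scores
test_b = rng.normal(0.37, 0.25, 150).clip(0, 1)  # hypothetical Test B scores

t, p = stats.ttest_ind(test_a, test_b)

# Cohen's d with a pooled standard deviation.
pooled_sd = np.sqrt(((len(test_a) - 1) * test_a.var(ddof=1) +
                     (len(test_b) - 1) * test_b.var(ddof=1)) /
                    (len(test_a) + len(test_b) - 2))
d = (test_a.mean() - test_b.mean()) / pooled_sd
print(f"t = {t:.2f}, p = {p:.4f}, Cohen's d = {d:.2f}")
```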

Testing how students handle data aligns with our first research goal of identifying, at finer grain sizes, student difficulties with COV. The results show that students perform better on COV tasks when data are not present. Our result is consistent with the literature, which notes that there is a considerable difference between identifying a COV experiment (a low-end skill by our definition) and coordinating results with an experiment (an intermediate to high-end skill) (Kuhn, 2007). Identifying or designing a simple COV experiment is a basic skill that young students can learn (Chen & Klahr, 1999; Klahr & Nigam, 2004; Dean & Kuhn, 2007). When the question does not provide data and asks if the experiment is a valid COV experiment, the students only need to identify which variables are changing and which are held constant. As is evident from students’ free responses, many students clearly attended to whether variables were changed or held constant across trials and used such features in their reasoning. Over 90% of the students explicitly mentioned the COV strategy in terms of changing and controlling variables. Therefore, questions probing only basic COV reasoning are relatively simple for these students.


Figure 6.2. Mean scores on Test A (data not given) and Test B (data given). The error bars (approximately ±0.04) represent the standard error.

Integrating COV into more advanced situations (such as multivariable causal reasoning) represents a higher level ability. Kuhn performed such a study where students were asked to determine the causal nature of multiple variables by using COV experimentation methods (Kuhn, 2007). In Kuhn’s study, COV supports the multivariable causal reasoning but is no longer the only thing students need to consider.

The same can be said of our study, where students are shown experimental data and asked if an experiment is valid. Students are no longer just deciding if an experiment is valid (even though that is what the question is asking). Rather, we have observed that students tend to go back and forth between the data and the experiment designs, trying to coordinate the two and decide if the variable is influential. In our research design, Test A provides the simple task of identifying a COV experiment, but Test B uses COV as a context for more complicated reasoning. Our data show this to be the case: when students are shown identical questions, those who are also shown data (Test B) perform at a lower level than those who are not shown data (Test A).

The student written responses support this explanation as well. For all questions, we found that student explanations were consistent with their choices and the intended interpretation of the questions. It is observed that when data were absent (Test A), more students appear to reason using the COV strategy. As shown from written explanations, 94% of students taking Test A (data not given) used the COV strategy in their reasoning, although many still produced incorrect answers. One student stated in response to the spring question, “The mass of the bob, which does not change in all three trials, can’t be tested to see whether it affects the outcome or not. However, both the remaining two variables are changing, so they can be tested.” Another student had a similar comment on the pendulum question, stating that “Every pair of two experiments does not satisfy the rule that only one variable varies while the other variables keep constant. So they cannot be tested.”

On the other hand, in Test B when data were given, many students seemed to base their reasoning on those data. For example, one student explained, “If both the mass of the bob and the distance the bob is pulled from its balanced position at the time of release are the same, and the un-stretched lengths of the spring are different, then the numbers of oscillations that occur in 10s also vary, so the un-stretched length of the spring affects the number of oscillations.” As another student noted, “When the length of the string is kept the same, the number of swings varies with changes in the mass of the ball and the angle at release.” Apparently, students were attending to the experimental data and using the data in their reasoning, which can sidetrack them into considering the causal relations between variables and experimental data rather than the COV design of the experiment.

Of the students who took Test B (data given), 59% had similar reasoning in their written responses. In other words, only 41% of students taking Test B used the desired COV strategies in their reasoning, which is less than half of the number of students with correct reasoning taking Test A.

Apparently, students are attracted to the data, which distracts them into thinking about the possible causal relations (influence) associated with the variables instead of the testability of the variables. Based on the literature and the results from this study, we consider that a student is at a higher skill level if he/she can resist the distracter and still perform sound COV reasoning. This technique of providing distracters is similar to what has already been widely used by physics teachers and others writing physics tests. The Force Concept Inventory (FCI), for instance, tries to distract test-takers by providing answer options that play to common sense (Hestenes, Wells & Swackhamer). A student who does well on the FCI is one who ignores (or considers and rejects) the tempting answers and instead thinks through the problem using sound physics reasoning.

An additional finding is that the new question design method of giving identical questions with and without experimental data seems to work well in probing and distinguishing basic and advanced levels of COV skills in terms of the testability and influence of variables.

6.4.2 Impact of question context on student performance

Our second result compares performance based on the context of the question, which is our third research goal discussed in the introduction. Figure 6.3 shows student performance on real-life context and physics context problems. The real-life context data are from question 1 (fishing); the physics context data are averages of questions 2 (spring) and 3 (pendulum). We continue to see the trend from our first result that students who were not given data (Test A) outperform students who were given data (Test B), but we now turn our focus to the context. On the real-life context problem, those not given data (N=154) had a mean of 58% while those given data (N=149) had a mean of 30%; on the physics context problems, those not given data (N=149) had a mean of 55% while those given data (N=150) had a mean of 41%.

We can see that students taking Test A (data not given) perform at essentially the same level in both contexts, but students taking Test B (data given) perform significantly better (p<0.05, effect size=0.23) on physics context questions than on real-life context questions. The relative performance on Tests A and B in the two contexts is an important piece of data. In the real-life context, the separation is 28%, while in the physics context, the separation is 14%. The difference between the two separations is statistically significant (p=0.001). Why does student performance vary so much depending on context?


Figure 6.3. The mean scores on Test A (data not given) and Test B (data given) for each context. The real-life context shows a greater difference between the means of Tests A and B than the physics context. The error bars (±0.04) indicate the standard errors.

By carefully studying student written responses, we see that students seem to use different reasoning when looking at a real-life scenario compared to a physics scenario. Two factors appear to impact reasoning. The first is that real-life contexts trigger real-world knowledge, which biases student reasoning. Students may insist on what they believe to be true, regardless of what the question at hand states (Caramazza, McCloskey & Green, 1981). For example, if a student believes firmly that a thin fish hook will work better, he/she will answer the question so that his/her belief remains intact. Boudreaux et al. (p. 166) described this as a “failure to distinguish between expectations and evidence” and noticed that many students had this behavior. In a physics context, prior knowledge is more limited, and the knowledge that students do have is more likely to be something learned formally, not well-established, and less tied to their intuitive beliefs (Boudreaux et al., 2008). Therefore, the physics knowledge is less tempting for students to fall back on, so they may be more likely to use their COV reasoning skills to answer the question.

The second factor that appears to influence reasoning when faced with a real-life question is a tendency for students to consider additional variables other than those given in the problem. The written responses to the fishing question make this clear. One student indicated several additional variables he considered in his response, including the types of fish, cleverness of fish, water flow, fish’s fear of hook, difficulty in baiting the hook, and the physical properties of the hook. Note that the only variables provided were hook size, fisherman location, and fishing rod length, and students were instructed to ignore other variables. It is evident that a familiar real-life context triggers students into considering an extended set of variables that are well associated with the question context through life experience. A total of 17 additional variables (other than those given in the question) were named in the written responses to the fishing question. This can be contrasted with the physics context questions, where students indicated far fewer additional variables; only 5 additional variables total were mentioned for the physics context problems (mainly material properties or air resistance).

The consideration of additional variables may be connected to the open-endedness of a question. Compared to a real-life situation, a physics context is often pre-processed to have extraneous information removed, and therefore is more constrained and close-ended. For example, students know through formal education that certain variables, such as color, do not matter in a physics question on mechanics. In this sense, the variables are pre-processed in a way consistent with physics domain knowledge, which filters the variables to present a cleaner, more confined question context. A real-life context problem, on the other hand, is very open-ended. The variables are either not pre-processed or are pre-processed in a wide variety of ways depending on the individuals’ personal experiences. Therefore, there is a richer set of possibilities that can come to the student’s mind for consideration. This makes the task much less confined as there is a more diverse set of variables for the student to manipulate.

Our explanation of open-endedness is similar to what Hammer, Elby, Scherr, and Redish refer to as cognitive framing (Hammer, Elby, Scherr & Redish, 2004). Framing is a pre-established way, developed from life experiences, in which an individual interprets a specific situation. In Hammer et al.’s words, a frame is a “set of expectations” and rules of operation that a student will use when solving a problem and that affects how the student handles that problem. This applies to what we see in our study. In a real-life context, the frame a student has is affected by a lifetime of experiences, which explains why so many extra variables make their way into students’ explanations. In a physics context, students are accustomed to using traditionally recognized physics methods, so other thoughts are less likely to be triggered. Typically, physics classes train students to use only the variables given in the problem, which confines the task but also gradually habituates students into a “plug-and-chug” type of problem-solving method, limiting their abilities in open-ended explorations.

It can be inferred from the data that context (real-life vs. physics) does have an impact on student reasoning (reflected in the number and types of variables students call on), particularly when the question provides experimental results. This is supported by analysis of student performance and written responses on Test B. However, we do not have enough evidence to clearly determine how context affects reasoning; this is an important topic that warrants future research.

6.4.3 Impact of embedded relationships between variables on student performance

The third result of this study looks at how students handle variables that are testable versus non-testable and influential versus non-influential, which aligns with our first research goal of defining COV reasoning skills and difficulties at smaller grain sizes. The testability and influence of variables are two threads of features and relations in a COV experiment. The experimental designs provided in the question are enough to determine which variables are testable; in such cases, students need to identify which experimental trials are controlled or not. If a variable is testable, experimental data provided by the question will help determine if that variable has an influence on the result. As discussed previously, the relationship between these two threads is complicated, and the threads need to be carefully distinguished, particularly when analyzing student reasoning on handling variables. We consider distinguishing between the two threads to be an area of difficulty; students may equate non-influential variables with non-testable variables.

Boudreaux et al. studied a related issue, but their design did not distinguish the different types of student reasoning regarding variable testability and influence. In our study, the new features in the question design will allow us to distinguish between testability and influence. For example, the two test forms (giving and not giving data) can help in discriminating advanced COV reasoning from basic COV strategies. In addition, the three questions form a progressive ladder that measures the different levels of related skills. The spring question in Test B (data given) is at the last stage of the measurement scale, designed to probe the ability of distinguishing between testability and influence. It uses multivariable conditions and allows multiple responses. Permitting multiple responses also leads to a more open-ended testing approach.

The spring question in Test B probes if students are sensitive to the differences between non-influential and non-testable variables. Among the three variables involved in the spring question, two (un-stretched length and distance pulled from equilibrium) are testable, and the third (mass) is not. Of the two testable variables, one (un-stretched length) is influential and the other is not. There are three levels of skill tested in this question. The lowest level is to recognize that mass is not a testable variable. Students who succeed on this level have basic COV skills. At the second level, students are able to correctly recognize the influential variable (un-stretched length) as testable. However, these students often miss the non-influential variable that is also testable, indicating a possibility of equating non-testable and non-influential. At the third and highest level, students are able to correctly recognize all testable variables regardless of their influence or non-influence.
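A simple illustration of how a response on the Test B spring question could be mapped to these three skill levels. The choice labels follow those reported in Table 6.3; the exact option wording and the fallback behavior for other choices are assumptions made for this sketch, not the study's coding scheme.

```python
# Assumed mapping from a spring-question (Test B) choice to a COV skill level.
def spring_skill_level(choice):
    """Return 1-3 for the skill level implied by the chosen option.
    'd' = both testable variables identified (level 3),
    'a' = only the influential variable judged testable (level 2),
    'b' = only the non-influential variable judged testable (level 2, atypical),
    anything else falls back to level 1 (basic or incorrect COV reasoning)."""
    if choice == "d":
        return 3  # distinguishes testability from influence
    if choice in ("a", "b"):
        return 2  # recognizes a testable variable but may conflate the two threads
    return 1

print([spring_skill_level(c) for c in ("a", "b", "d", "h")])  # [2, 2, 3, 1]
```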

Table 6.3 shows that students perform significantly better on the spring question (choice d) when no data are given (p=0.002, effect size=0.37). Among the students taking Test B (data given), 26% chose the influential variable as the only testable variable while only 3% of students taking Test A (data not given) made the same choice. This suggests that students can better pinpoint a testable variable when data are not provided (when there is no interference from considering influence). Therefore, the three questions in Test A constitute the measure for the lower-end basic COV skills.

With Test B, more advanced reasoning can be probed. On the spring question in Test B, a significant portion of students (26%) picked choice a, which corresponds to the second level of COV skills tested in this question. These students are suspected to have used incorrect reasoning in equating non-testable and non-influential variables. The results also show that a little over one third of the students achieved the highest level of COV skills tested in this question: these students are able to engage in correct COV reasoning in complicated multi-variable conditions that include both causal and non-causal relations.

Questions (Choice) | COV Skills | Test A (no data) | Test B (with data)
Fishing (a, correct) | Deciding if a variable is testable when it is testable | 58% | 30%
Pendulum (h, correct) | Deciding if any of several variables are testable when all are non-testable | 55% | 46%
Spring (a, partially correct) | Being able to decide if any of several variables are testable when some variables are influential (only in Test B) | 3% | 26%
Spring (b, partially correct) | Being able to decide if any of several variables are testable when some variables are non-influential (only in Test B) | 5% | 5%
Spring (d, correct) | Being able to decide if any of several variables are testable when some variables are influential and some non-influential (only in Test B) | 54% | 36%

Table 6.3. Percentage of students responding with selected choices of the three questions on Test A (data not given) and Test B (data given). For the pendulum question, none of the variables were testable, and there is no significant difference between Test A and B. For the spring question, two variables were testable, one of which was influential, and there was a significant difference between Test A and B.

The quantitative data in Table 6.3 point to a group of students who may conflate testability with influence. To find evidence about students’ reasoning, we analyze the written responses to the spring question in Test B, which are truly enlightening. In their explanations, some students explicitly used the fact that a particular variable was not influential as the reason not to choose the corresponding answer: “The number of oscillations has nothing to do with the distance the bob is pulled from its balanced position at the time of release.” These students often chose the variable that was influential as the only testable variable (choice a). There were also students who clearly stated that non-influential means non-testable: “The mass of the bob and the un-stretched length are kept constant, but the distance the bob is pulled from its equilibrium position at the time of release varies in the second and third trials, and the results in these two experiments do not change. Therefore, we cannot obviously test the influence of the distance the bob is pulled from its equilibrium position on the number of oscillations.”

On the spring question in Test B, 21% of the students showed reasoning similar to the two examples discussed above. Nearly all the students who picked the correct answer (choice d) used correct reasoning to choose both the influential and the non-influential variables as testable. The 21% using improper reasoning is a large number, but it can go undetected if only one of the two versions of the questions is used, as it can appear to be a mistake in COV reasoning rather than an important misconception. By designing two versions of the test, this result can be more clearly revealed by contrasting results between Tests A and B. For example, in Test B (data given), over a quarter of students missed the non-influential testable variable, while only 3% of the students taking Test A (data not given) missed the same variable. This result suggests that the new format of giving and not giving data in questions can be a useful assessment design for measuring student reasoning at a finer grain size.

The results from the fishing and pendulum questions are also included in Table 6.3. The results on the fishing question largely reflect the impact of context features, which have been discussed in the previous section. With the pendulum question, since none of the variables are testable, there is less interference from the influence relations, which resulted in a non-significant difference between Tests A and B (p=0.119, effect size=0.18). It appears that students can identify a non-COV experiment with less distraction from experimental data, which is confirmed by students’ written responses: “There is more than one variable changing in each trial. It can be tested only if one variable varies and other variables keep constant” and “Two variables are different in each pair of experiments, so all of the variables are non-testable.” Of the students tested, 41% in Test A and 36% in Test B showed similar explanations.

6.6 Conclusions and Discussions

COV is a very important aspect of scientific reasoning. Current research tends to address broad definitions within COV, but it is also necessary to identify COV skills at a smaller grain size. Based on the literature and our results, we begin to see a progression of different levels of COV skills and their assessment settings, which are listed in Table 6.5. We have made a rough estimate of difficulty and ordered the skills accordingly. The actual order remains up for debate and calls for further research to make it possible to map the progression from naïve to expert learners.

From this study, we found that students perform better on COV tasks when experimental data are not provided. Providing data seems to trigger students into thinking beyond the testability of the variables and attempting to determine if variables are influential. The reason for this behavior could be that students may mix the concepts of testability and influence; in particular, students seem to have a tendency to equate non-influential variables to non-testable variables.

COV Skill | Possible Level | Test Version and Items
Deciding if a variable is testable when it is testable (without experimental data) | Low-end | Test A, Fishing
Deciding if a set of variables are testable when none is testable (without experimental data) | Low-end | Test A, Pendulum
Deciding if an experimental design involving multiple testable and non-testable variables (>2) is a valid COV experiment (without experimental data) | Low-end to Intermediate | Test A, Spring
Deciding if a variable is testable when it is not testable (with experimental data) | Low-end to Intermediate | Test B, Pendulum
Deciding if a variable is testable when it is testable and influential in a real-life context (with experimental data) | Intermediate | Test B, Fishing
Deciding if any of several variables are testable when some variables are influential (with experimental data given) | Intermediate | Test B, Spring
Deciding if any of several variables are testable when some variables are non-influential (with experimental data) | High-end | Test B, Spring

Table 6.5. A progression of COV skills tested in this study.

When a task is perceived as asking for influence (rather than testability), it becomes more demanding, as it requires coordinating evidence and hypothesis, which is a high-end skill that students may not have mastered despite having a basic understanding of the COV method. For example, multivariable causal reasoning can be tested in a COV context, but in this case, COV is not the thought process being tested. Rather, COV needs to be understood first and can then be used as the context for testing another skill. This is supported by Kuhn’s argument that COV is not the only challenge students face during scientific reasoning (Kuhn, 2007). COV does, however, deserve much of the focus Klahr and others (Chen & Klahr, 1999, for example) give it because it is the foundation that supports higher level skills.

Furthermore, it is found that students perform better on physics context questions than on real-life context questions. When faced with a real-life question, students tend to call on additional variables (other than those given) more so than in a physics context. We believe this is influenced by the open-endedness of the questions and how students’ knowledge is pre-processed. Students are trained in answering physics context questions in school, so their knowledge is pre-processed in a way that limits the variables they will consider. Real-life knowledge, on the other hand, is pre-processed by a lifetime of experiences, so there are many more variables for a student to consider.

The instrument developed in this study has helped identify a unique assessment structure – providing experimental data versus not providing data. Our results show that giving data triggers a different thought process, one that can exclude COV knowledge. Thus, a test that provides experimental data will separate out students more, as it is fundamentally more difficult. This is important for teachers who want to assess their students’ knowledge. Designing and evaluating COV experiments are low-end skills, but doing the same tasks when considering experimental data is a high-end skill. Teachers need to be conscious of this and make sure they are writing tests aimed at the proper level. Young students should be tested on low-end skills, so data should not be provided. Older students need to be tested for high-end skills, so data can be provided. The ages at which these skills should be developed have not yet been determined, but doing so is the goal of future research.

Chapter 7. Summary

As part of the ongoing education research on scientific reasoning, this dissertation project conducted a series of studies on the assessment of scientific reasoning. In current literature, the Lawson’s test of scientific reasoning is the most widely used quantitative tool in assessing scientific reasoning. However, the test’s validity has not been thoroughly studied.

This dissertation project started with an in-depth study to evaluate the validity of the Lawson’s test. The research has shown multiple test design issues with the current version of the Lawson’s test and also suggested ways to improve the instrument. For example, the choices for question 8 seem to have two correct answers, which has resulted in many students scoring low on this question even though interviews showed they had correct understanding and reasoning. Similar issues also exist in several other items. As a result, the Lawson’s test has a low ceiling of approximately 80% even for well-developed learners.

Although there are validity concerns with the Lawson’s test, since it is widely used in almost all fields of STEM education, the existing data provide a rich resource for educators and researchers to compare and evaluate their own students and research outcomes. Therefore, a collection of well-established baseline results would greatly aid educators and researchers in using the Lawson’s test in teaching and research. To this end, the work discussed in Chapter 4 provides a detailed analysis of the developmental trends of scientific reasoning abilities for US and Chinese students measured with the Lawson’s test.

A logistic model is developed and fitted to the data. The model fits the data reasonably well, with a mean-score RMSD of approximately 10% of the average standard deviation. The parameters obtained from the model fitting provide quantitative measures for comparing overall scientific reasoning ability and the individual skill dimensions. The results show that the Chinese and US students are similar in their total scores on the Lawson’s test but diverge in five of the six skill dimensions. The actual causes of the differences are still under investigation; however, initial clues suggest that the cultural and educational settings of the two countries contribute substantially to the differences.
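
To make this model-fitting step concrete, the sketch below fits a generic logistic growth curve to hypothetical group mean scores and reports the RMSD relative to the average standard deviation. The functional form, the parameter names, and the data arrays are illustrative assumptions for this sketch, not the actual model or data analyzed in Chapter 4.

```python
# A minimal sketch of fitting a logistic developmental curve to mean test scores.
# The functional form, parameter names, and data below are illustrative assumptions,
# not the dissertation's actual model or data.
import numpy as np
from scipy.optimize import curve_fit

def logistic(age, s_min, s_max, k, age0):
    """Mean score modeled as a logistic function of age."""
    return s_min + (s_max - s_min) / (1.0 + np.exp(-k * (age - age0)))

# Hypothetical group means: age (years), mean score (fraction correct), and std. dev.
ages = np.array([11, 12, 13, 14, 15, 16, 17, 18], dtype=float)
mean_scores = np.array([0.25, 0.30, 0.38, 0.47, 0.55, 0.61, 0.65, 0.68])
std_devs = np.array([0.15, 0.16, 0.17, 0.18, 0.18, 0.17, 0.17, 0.16])

params, _ = curve_fit(logistic, ages, mean_scores, p0=[0.2, 0.7, 1.0, 14.0])
fitted = logistic(ages, *params)

# Goodness of fit: RMSD of the mean scores, reported relative to the average std. dev.
rmsd = np.sqrt(np.mean((fitted - mean_scores) ** 2))
print("fitted parameters (s_min, s_max, k, age0):", np.round(params, 3))
print(f"RMSD = {rmsd:.3f} ({rmsd / std_devs.mean():.1%} of the average standard deviation)")
```

In a fit of this kind, the lower and upper asymptotes, growth rate, and midpoint age play the role of the quantitative measures used to compare groups and skill dimensions.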

In most existing research, the Lawson’s test data has been analyzed using total scores, which significantly limits the amount of information that can be extracted. The

Lawson’s test has a unique design using a two-tiered structure, which can produce a wealth of information regarding students’ abilities in coordinating conclusions and explanations. Such information is not well utilized with the existing score based analysis.

The work in Chapter 5 aims at developing a data mining method using combinational

patterns of responses to study the Lawson’s test data. The method can help identify fine-grained learning progression behaviors of selected scientific reasoning skills.

Four items on the Lawson’s test have been studied using this method, which revealed three results. First, we gained theoretical insight into student responses: students are typically able to answer a question correctly before they can provide the correct reasoning.

Second, we established six performance levels based on student responses to the four items. These levels revealed a learning progression with distinct patterns being seen at all levels. Third, we proposed a new scoring method for the Lawson Test. In line with our first result, we believe a three-level scoring system (where students get credit for providing correct answers with incorrect reasoning) better reflects student understanding and is therefore more accurate in assessment. This pattern analysis method seems to have great potential as vast quantities of previously collected Lawson’s test data can be mined to identify new information about student learning.
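
A minimal sketch of how two-tier responses can be coded as patterns and then scored with the proposed three-level scheme is given below. The item keys and the student responses are hypothetical, and the coding is a simplified stand-in for the full pattern analysis in Chapter 5.

```python
# A minimal sketch of two-tier response coding and the three-level scoring scheme
# described above. Item keys and student responses are hypothetical.
from collections import Counter

# Each two-tier item has an answer key and a reasoning key.
answer_keys = {"Q1": "B", "Q2": "A", "Q3": "D", "Q4": "C"}
reasoning_keys = {"Q1": "3", "Q2": "1", "Q3": "2", "Q4": "5"}

def response_pattern(item, answer, reasoning):
    """Code a response as a pair of 0/1 flags: (answer correct, reasoning correct)."""
    return (int(answer == answer_keys[item]), int(reasoning == reasoning_keys[item]))

def three_level_score(pattern):
    """0 = both wrong, 1 = correct answer with incorrect reasoning, 2 = both correct.
    (A correct reason paired with a wrong answer is rare and scored 0 here.)"""
    answer_ok, reasoning_ok = pattern
    if answer_ok and reasoning_ok:
        return 2
    if answer_ok:
        return 1
    return 0

# Hypothetical responses from one student: item -> (answer choice, reasoning choice)
student = {"Q1": ("B", "3"), "Q2": ("A", "4"), "Q3": ("B", "2"), "Q4": ("C", "5")}

patterns = {item: response_pattern(item, a, r) for item, (a, r) in student.items()}
scores = {item: three_level_score(p) for item, p in patterns.items()}
print(patterns)                     # e.g. {'Q1': (1, 1), 'Q2': (1, 0), ...}
print(Counter(patterns.values()))   # pattern frequencies, the raw material for mining
print(sum(scores.values()), "of", 2 * len(scores))
```

Awarding one point for a correct answer with incorrect reasoning reflects the first result above: correct answers tend to precede correct reasoning in the learning progression.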

As another effort to identify fine-grained levels for more precise assessment of students’ scientific reasoning skills, Chapter 6 presents a case study on the learning progression of the ability to control variables. In contrast to the data mining method, this study uses a randomized testing approach to obtain and validate fine-grained levels of students’ ability in using control of variables. It found that students perform better on

COV tasks when experimental data are not provided. Providing data seems to trigger students into thinking beyond the testability of the variables and attempting to determine if variables are influential. The reason for this behavior could be that students may mix the concepts of testability and influence; in particular, students seem to have a tendency

to equate non-influential variables to non-testable variables. The new form of

assessment design developed in this study provides a practical means for researchers and

teachers to evaluate student learning progression on control of variables.

Summarizing the overall scope of this dissertation work, the project started with a

detailed review of the research in the area of scientific reasoning with an emphasis on assessment. A significant amount of effort has been devoted to evaluating the validity of the Lawson’s test and to establishing a solid baseline of students’ performances

on this test. Building on the work with the Lawson’s test, two further studies have been

conducted, one on a data mining method and the other on a question design method, which pave the way for ongoing research that moves beyond the Lawson’s test to develop a new generation of scientific reasoning assessment instruments.

References

21st Century Skills Standards. (2007). Retrieved from

http://ksweo.files.wordpress.com/2010/12/21st_century_skills_standards.pdf

‘21st Century’ Skills: Not New, but a Worthy Challenge. (2010). American Educator, 34(1),

Abdullah S., & Shariff A. (2008). The effects of inquiry-based computer simulation with

cooperative learning on scientific thinking and conceptual understanding of gas laws.

Eurasia Journal of Mathematics, Science and Technology Education, 4(4), 387-398.

Adey, P., & Shayer, M. (1990). Accelerating the development of formal thinking in

middle and high school students. Journal of Research in Science Teaching, 27, 267-

Adey, P., & Shayer, M. (1994). Really Raising Standards. London: Routledge.

Alonzo, A. C., & Steedle, J. T. (2009). Developing and assessing a force and motion

learning progression. Science Education, 93(3), 389-421.

American Association of Colleges and Universities. (2007). College Learning for the

New Global Century. Washington, DC: AACU.

American Association of Physics Teachers. (1998). Goals of the introductory physics

laboratory, American Journal of Physics, 66(6), 483-485.

Araya, D., & Peters M. A. (2010). Education in the Creative Economy: Knowledge and

Learning in the Age of Innovation. New York: Peter Lang Publishers.

Are They Really Ready to Work? Employers’ Perspectives on the Basic Knowledge and

Applied Skills of New Entrants to the 21st-Century U.S. Workforce. (2006b).

Retrieved from http://www.p21.org/storage/documents/FINAL_REPORT_PDF09-

Ates, S., & Cataloglu, E. (2007). The effects of students' reasoning abilities on conceptual

understandings and problem-solving skills in introductory mechanics. Eur. J.

Phys., 28, 1161-1171.

Bao, L., Cai, T., Koenig, K., Fang, K., Han, J., Wang, J., … Wu, N. (2009). Learning and

Scientific Reasoning. Science, 323(5914), 586-587.

Batanero, C., Estepa, A., & Godino, J. D. (1997). Evolution of students’ understanding of

statistical association in a computer based teaching environment. Research on the role

of technology in teaching and learning statistics: Proceedings of the 1996 IASE

Round Table Conference. J. B. Garfield & G. Burrill (Eds.). 191-205. Voorburg,

Netherlands: International Statistical Institute.

Bauerlein, M. (2010). Employers want 18th-century skills. The Chronicle of Higher

Education. Retrieved from http://chronicle.com/blogs/brainstorm/employers-want-

18th-century-skills/21687

Bechara, A., Damasio, H., Tranel, D., & Damasio, A. R. (1997). Deciding

advantageously before knowing the advantageous strategy. Science, 275, 1293-1295.

Beers, S. (2011). Teaching 21st Century Skills: An ASCD Action Tool. Association for

Supervision & Curriculum Development.

Beichner, R. J. (1999). Student-Centered Activities for Large-Enrollment University

Physics (SCALE UP). Presented at the Sigma Xi Forum: Reshaping Undergraduate

Science and Engineering Education: Tools for Better Learning. Minneapolis, MN

Retrieved from ftp://ftp.ncsu.edu/pub/ncsu/beichner/RB/SigmaXi.pdf

Beichner, R. (2008). The SCALE-UP project: a student-centered, active learning

environment for undergraduate programs. An invited white paper for the National

Academy of Sciences, Retrieved from

http://www7.nationalacademies.org/bose/Beichner_CommissionedPaper.pdf

Beichner, R. J., & Saul, J. M. (2003). Introduction to the SCALE-UP (student-centered

activities for large enrollment undergraduate programs) project. Proceedings of the

International School of Physics.

Bellanca, J. (2010). Enriched learning projects: A practical pathway to 21st century Skills.

Bloomington, IN: Solution Tree.

Bellanca, J., & Brandt, R. (2010). 21st Century skills: Rethinking how students learn. Bloomington, IN: Solution Tree Press.

Benford, R., & Lawson, A. E. (2001). Relationships between effective inquiry use and the

development of scientific reasoning skills in college biology labs. MS Thesis, Arizona

State University. (ERIC Accession No.: ED456157)

Bloom, B. S. (1956). Taxonomy of educational objectives: The classification of

educational Goals. Susan Fauer Company, Inc.

Boudreaux A., Shaffer P. S., Heron P. R. L., & McDermott L. C. (2008). Student

understanding of control of variables: deciding whether or not a variable influences

the behavior of a system. American Journal of Physics 76 (2), 163-170.

Bridgeland, J., Milano, J. & Rosenblum, E. (2011). Across the great divide: Perspectives

of CEOs and college presidents on America’s higher education and skills gap.

Retrieved from

http://www.corporatevoices.org/system/files/Across%20the%20Divide%20Final%20

Briggs, D. C., Alonzo, A. C., Schwab, C., & Wilson, M. (2006). Diagnostic Assessment

with Ordered Multiple-Choice Items. Educational Assessment, 11(1), 33-63.

Bybee, R., & Fuchs, B. (2006). Preparing the 21st century workforce: a new reform in

science and technology education. Journal of Research in Science Teaching,

43(4), 349-352.

Caramazza A., McCloskey M., & Green B. (1981). Naive beliefs in “sophisticated”

subjects: misconceptions about trajectories of objects. Cognition 9 (2), 117-123.

Carlson, M., Jacobs, S., Coe, E., Larsen, S., & Hsu, E. (2002). Applying covariational

reasoning while modeling dynamic events. Journal for Research in Mathematics

Education, 33(5), 352-378.

Chen, Z., & Klahr, D. (1999). All other things being equal: acquisition and transfer of the

control of variables strategy. Child Development, 70(5), 1098-1120.

Cognitive and Ethical Growth: The Making of Meaning. (1981). The Modern American

College. Arthur W. Chickering & Associates (Eds.). 76-116. San Francisco, CA: Jossey-Bass.

Coletta, V. P. & Phillips, J. A. (2005). Interpreting FCI scores: normalized gain,

reinstruction scores, and scientific reasoning ability. American Journal of Physics,

73(12), 1172-1179.

Coltman, P., Petyaeva, D., & Anghileri, J. (2002). Scaffolding learning through

meaningful tasks and adult interaction. Early Years, 22(1), 39-49.

Comparing Frameworks for 21st Century Skill. (2010b). 21st Century Skills. J. Bellanca,

& R. Brandt (Eds.). 51-76. Bloomington, IN: Solution Tree Press.

Cook, J. L., & Cook, G. (2005). Child development principles and perspectives. Boston:

Dean, D., & Kuhn, D. (2007) Direct instruction vs. discovery: the long view. Science

Education 91(3), 384-397.

Dede, C. (2009). Technologies that facilitate generating knowledge and possibly wisdom:

A response to ‘Web 2.0 and classroom research.’ Educational Researcher 38(4), 60-

Deloitte, The Manufacturing Institute, and Oracle. (2009). People & Profitability – A

Time for Change, A 2009 Management Practices Survey of the Manufacturing

Industry. Retrieved from http://www.deloitte.com/assets/Dcom-

UnitedStates/Local%20Assets/Documents/us_pip_peoplemanagementreport_100509.

Demirtas Z. (2011). Scientific reasoning skills of high school students’ relationship

gender and their academic success. International Journal of Human Sciences, 8(1),

Denison, S., Konopczynski, K., Garcia, V., & Xu, F. (2006). Probabilistic reasoning in

preschoolers: random sampling and base rate. Paper presented at: The 28th Annual

Conference of the Cognitive Science Society. Vancouver, British Columbia. Retrieved

from http://csjarchive.cogsci.rpi.edu/proceedings/2006/docs/p1216.pdf

Duncan, R. G., & Hmelo-Silver, C. E. (2009). Learning progressions: aligning

curriculum, instruction, and assessment. Journal of Research in Science Teaching,

46(6), 606-609.

Duncan, R. G., Rogat, A. D., & Yarden, A. (2009). A learning progression for deepening

students’ understandings of modern genetics across the 5th-10th grades. Journal of

Research in Science Teaching, 46(6), 655-674.

Educational Testing Service. (2007). Digital Transformation: A Framework for ICT

Literacy. Princeton, NJ: ETS.

Eisen, P., Jasinowski, J. J., & Kleinert, R. (2005). 2005 Skills Gap Report – A Survey of

the American Manufacturing Workforce. Retrieved from

http://www.themanufacturinginstitute.org/~/media/738F5D310119448DBB03DF300

45084EF/2005_Skills_Gap_Report.pdf

Erlick, D. E. (1966). Human estimates of statistical relatedness. Psychonomic Science, 5,

Etkina, E., & Heuvelen, A. V. (2007) Investigative Science Learning Environment – A

Science Process Approach to Learning Physics. Research Based Reform of University

Physics. E. F. Redish, & P. Cooney, (Eds.). Wiley, NY.

Gardner, H. (1983). Frames of Mind: The Theory of Multiple Intelligences. New York,

NY: Basic Books.

Gardner, H., & T. Hatch. (1989). Multiple intelligences go to school: educational

implications of the theory of multiple intelligences. Educational Researcher, 18(8): 4-

Gerber, B. L., Cavallo, A. M., & Marek, E. A. (2001). Relationships among informal

learning environments, teaching procedures and scientific reasoning ability.

International Journal of Science Education, 23(5), 535-549.

Grunwald and Associates. (2010). Educators, technology and 21st century skills:

Dispelling five myths. Retrieved from http://www.waldenu.edu/Documents/Degree-

Programs/Report_Summary_-_Dispelling_Five_Myths.pdf

Halford, Graeme S. (1993). Children's understanding: The development of mental models.

Hillsdale, NJ: L. Erlbaum Associates

Hambleton, R. K., & Swaminathan, H. (1985). Item response theory: principles and

applications. Norwell, MA: Kluwer Academic Publishers.

Hambleton, R. K., Swaminathan,H., & Rogers, H. J. (1991). Fundamentals of item

response theory. Newbury Park, CA: Sage Publications, Inc.

Hammer D., Elby, A., Scherr, R. E., & Redish, E. F. (2004). Resources, framing, and

transfer. Transfer of learning: research and perspectives. J. Mestre (Ed.). Greenwich,

CT: Information Age Publishing.

Hawkins, J., & Pea, R. D. (1987). Tools for bridging everyday and scientific thinking.

Hestenes, D., Wells, M., & Swackhamer, G. (1992). Force concept inventory. The

Physics Teacher, 30, 141-158.

Hofer, B. K., & Pintrich, P. R. (1997). The Development of Epistemological Theories:

Beliefs about Knowledge and Knowing and Their Relation to Learning. Review of

Educational Research, 67(1), 88-140.

Hofstein A., & Lunetta V. N. (2004). The laboratory in science education: foundations

for the twenty-first century. Science Education, 88(1), 28-54.

Inhelder, B., & Piaget, J. (1958). The growth of logical thinking from childhood to adolescence. New York, NY: Basic Books.

Institute for the Future. (2011). Future Work Skills 2020. Retrieved from http://apolloresearchinstitute.com/sites/default/files/future_work_skills_2020_full_research_report_final_1.pdf

Intelligence Reframed: Multiple Intelligences for the 21st Century. (1999). New York, NY: Basic Books.

iSTAR Assessment. (2010). iSTAR Assessment: inquiry for scientific thinking and reasoning. Retrieved from http://www.istarassessment.org/

Jacobs, H. H. (2010). Curriculum 21: Essential Education for a Changing World.

Association for Supervision & Curriculum Development.

Johnson, M. A., & Lawson, A. E. (1998). What are the relative effects of reasoning

ability and prior knowledge on biology achievement in expository and inquiry classes?

Journal of Research in Science Teaching, 35(1), 89-103.

Judy, W. R., & D’Amico, C. (1997). Workforce 2020: Work and Workers in the 21st

Century. Indianapolis, Indiana: Hudson Institute.

Karoly, L. A. (2004). The 21st century at work: Forces shaping the future workforce and

workplace in the United States. Santa Monica, CA: RAND Corporation.

Khazanov, L. (2005). An investigation of approaches and strategies for resolving students’

misconceptions about probability in introductory college statistics. Proceedings of the

AMATYC 31st Annual Conference, San Diego, California, 40-48.

Klahr, D., & Nigam, M. (2004). The equivalence of learning paths in early science

instruction: effects of direct instruction and discovery learning. Psychological

Science, 15(10), 661-667.

Klahr, D., & Dunbar, K., (1998). Dual space search during scientific reasoning. Cognitive

Science 12(1), 1-48.

Kuhn D. (2007). Reasoning about multiple variables: control of variables is not the only

challenge. Science Education, 91(5), 710-726.

Kuhn, D., & Dean, D. (2005). Is developing scientific thinking all about learning to

control variables? Psychological Science, 16(11), 866-870.

Kuhn, D., Amsel, E., & O’Loughlin, M. (1988). The development of scientific thinking

skills. Orlando, FL: Academic Press.

Laskey, M. L., & Carole J. H. (2010). Self-Regulated Learning, Metacognition, and Soft

Skills: The 21st Century Learner.

Lawson A.E., Adi, H., & Karplus, R. (1979). Development of correlational reasoning in

secondary schools: do biology courses make a difference? The American Biology

Teacher, 41, 420-425.

Lawson, A. E. (1978). The development and validation of a classroom test of formal

reasoning. Journal of Research in Science Teaching, 15(1), 11-24.

Lawson, A. E. (1979). The developmental learning paradigm. Journal of Research in

Science Teaching, 16(6), 501-515.

Lawson, A. E. (1995). Science Teaching and the Development of Thinking. Belmont, CA:

Wadsworth Publishing Company.

Lawson, A. E. (2000). The generality of hypothetico-deductive reasoning: making

scientific thinking explicit. The American Biology Teacher, 62(7), 482-495.

Lawson, A. E., Clark, B., Cramer-Meldrum, E., Falconer, K. A., Sequist, J. M., & Kwon,

Y. J. (2000). Development of scientific reasoning in college biology: do two levels of

general hypothesis-testing skills exist? Journal of Research in Science Teaching,

37(1), 81-101.

Lazarsfeld, P. F., & Henry, N. W. (1968). Latent structure analysis. Boston: Houghton

Mifflin Company.

Learning Point Associates. (2005). Transforming Education for the 21st Century: An

Economic Imperative. Chicago, IL: Dede, C., Korte, S., Nelson, R., Valdez, G., &

Ledward, B. C. and Hirata, D. (2011). An Overview of 21st Century Skills. Summary of

21st Century Skills for Students and Teachers, by Pacific Policy Research Center.

Honolulu: Kamehameha Schools–Research & Evaluation.

Lee, C.Q., & She, H.C. (2009). Facilitating Students’ Conceptual Change and Scientific

Reasoning Involving the Unit of Combustion. Res. Sci. Educ, 40, 479-504. doi:

10.1007/s11165-009-9130-4

Lee, C. Q., & She, H. C. (2010). Facilitating students' conceptual change and scientific reasoning involving the unit of combustion. Research in Science Education, 40(4), 479-504.

Levy, F., & Murnane, R. J. (2004). The New Division of Labor: How Computers Are

Creating the Next Job Market. Princeton, NJ: Princeton University Press.

Marek, E. A. & Cavallo, A. M. L. (1997). The learning cycle and elementary school

science. Portsmouth, NH: Heinemann.

McDermott, L. C., Shaffer, P. S., & the Physics Education Group at the University of

Washington. (1996). Physics by Inquiry, Volumes I & II. New York, NY: John Wiley

and Sons, Inc.

Measuring Skills for 21st-Century Learning. (2009). The Phi Delta Kappan. 90(9), 630-

National Academies of Sciences. (2005). The development of scientific reasoning: What

psychologists contribute to an understanding of elementary science learning. Final

Draft of a Report to the National Research Council’s Board of Science Education,

Consensus Study on Learning Science, Kindergarten through Eighth Grade. Normal,

IL: Zimmerman, C.

National Center on Education and the Economy. (2007). Tough Choices or Tough Times:

The Report of the New Commission on the Skills of the American Workforce. San

Francisco, CA: Jossey-Bass.

National Research Council (NRC), (1996). National science education standards.

Washington, DC: National Academies Press.

National Research Council. (2012). Education for Life and Work: Developing

Transferable Knowledge and Skills in the 21st Century. Washington, DC: The

National Academies Press.

North Central Regional Educational Laboratory and the Metiri Group. (2003). EnGauge

21st Century Skills: Literacy in the Digital Age. Chicago, IL: NCREL.

Norton, M. J. (1999). Knowledge discovery in databases. Library Trends, 48(1), 9-21.

Organization for Economic Co-operation and Development (OECD). (2004). Innovation

in the Knowledge Economy: Implications for Education and Learning. Paris, France.

P21 Framework Definitions. (2009). Retrieved from

http://www.p21.org/storage/documents/P21_Framework_Definitions.pdf

Pacific Policy Research Center. (2010). 21st Century Skills for Students and Teachers.

Honolulu: Kamehameha Schools, Research & Evaluation.

Paige, J. (2009). “The 21st Century Skills Movement.” Educational Leadership, 9(67),

Partnership for 21st Century Skills. (2006a). A State Leader’s Action Guide to 21st

Century Skills: A New Vision for Education. Tucson, AZ.

Paul, R., & L. Elder. (2006). Critical thinking: the nature of critical and creative thought.

Journal of Developmental Education, 30(2), 34-5.

Penner, D. E., & Klahr, D. (1996). The interaction of domain-specific knowledge and

domain-general discovery strategies: a study with sinking objects. Child

Development 67(6), 2709-2727

Perry, William G., Jr. (1970). Forms of Intellectual and Ethical Development in the

College Years: A Scheme. New York, NY: Holt, Rinehart, and Winston.

Pratt, C., & Hacker, R. G. (1984). Is Lawson’s classroom test of formal reasoning valid?

Educational and Psychological Measurement, 44.

Reconceptualizing Technology Integration to Meet the Challenges of Educational

Transformation. (2011). Journal of Curriculum and Instruction, 5(1), 4-16.

Roadrangka, V., Yeany, R. H., & Padilla, M. J. (1982). GALT. Group test of logical

thinking. University of Georgia, Athens, GA.

Roth, W.-M., & Roychoudhury, A. (1993). The development of science process skills in

authentic contexts. Journal of Research in Science Teaching, 30(2), 127-152.

Rotherham, Andrew J., & Willingham, D. (2009). 21st Century Skills: The Challenges

Ahead. Educational Leadership, 67(1), 16-21.

Schwartz, R. S., Lederman, N. G., & Crawford, B. A. (2004). Developing views of nature

of science in an authentic context: an explicit approach to bridging the gap between

nature of science and scientific inquiry. Science Education, 88, 610-645.

Shaklee, H., & Tucker, D. (1980). A rule analysis of judgments of covariation between

events. Memory and Cognition, 8, 459-467.

Shayer, M., & Adey, P. (1981). Towards a science of science teaching. London:

Shayer, M., & Adey, P. S. (1993). Accelerating the development of formal thinking in

middle and high school students IV: three years after a two-year intervention. Journal

of Research in Science Teaching, 30(4), 351-366.

Silva, E. (2008). Measuring Skills for the 21st Century. Washington, DC: Education

Sector. Retrieved from http://www.educationsector.org/usr_doc/MeasuringSkills.pdf

Sokoloff, D. R., Laws, P. W., & Thornton, R. K. (2004). RealTime Physics: Active

Learning Laboratories, Modules 1-4. (2nd ed.). Hoboken, NJ: Wiley.

Songer, N. B., Kelcey, B., & Gotwals, A. W. (2009). How and when does complex

reasoning occur? Empirically driven development of a learning progression focused

on complex reasoning about biodiversity. Journal of Research in Science Teaching,

46(6), 610-631.

Steedle, J. T. & Shavelson, R. J. (2009). Supporting valid interpretations of learning

progression level diagnoses. Journal of Research in Science Teaching, 46(6), 699-715.

Stone, K. B., Kaminski, K., & Gloeckner, G. (2009). Closing the Gap:

Education Requirements of the 21st Century Production Workforce. Journal of

Industrial Teacher Education, 45(3), 5-33.

Technological Supports for Acquiring 21st Century Skills. (2010a). International

encyclopedia of education, (3rd ed.). E. Baker, & B. McGaw (Eds.). Oxford, England:

The Intellectual and Policy Foundations of the 21st Century Skills Framework. (2007).

Retrieved from http://youngspirit.org/docs/21stcentury.pdf

Thoma, George A. (1993). The Perry Framework and Tactics for Teaching Critical

Thinking in Economics. The Journal of Economic Education, 24(2), 128-136.

Thoman, E., & Jolls, T. (2003). Literacy for the 21st Century: An Overview and

Orientation Guide to Media Literacy Education. Los Angeles, CA: Center for Media Literacy.

Tisher, R.P., & Dale, L.G. (1975). Understanding in science test. Victoria: Australian

Council for Educational Research.

Tobin, K. G. & Capie, W. (1981). Development and validation of a group test of logical

thinking. Educational and Psychological Measurement, 41(2), 413-414.

Toth, E. E., Klahr, D., & Chen, Z. (2000). Bridging research and practice: a cognitively

based classroom intervention for teaching experimentation skills to elementary school

children. Cognition and Instruction, 18(4), 423-459.


Treagust, D. F. (1995). Diagnostic assessment of students' science knowledge. Learning

science in the schools: research reforming practice. S. M. Glynn & R. Duit (Eds.).

327-346. Mahwah, NJ: Lawrence Erlbaum Associates.

Trilling, B., & Fadel. C. (2009). 21st Century Skills: Learning for Life in Our Times.

Publisher: Jossey-Bass.

U.S. Department of Education. (2010). Transforming American education: Learning

powered by technology [National Educational Technology Plan 2010]. Washington,

DC: Office of Educational Technology, U.S. Department of Education. Retrieved

from http://www.ed.gov/technology/netp-2010

Vahey, P., Enyedy, N., & Gifford, B. (2000). Learning probability through the use of a

collaborative, inquiry-based simulation environment. Journal of Interactive Learning

Research, 11(1).

Voogt, J., & Pareja Roblin, N. (2010). 21st Century Skills. Enschede: University of Twente.

Vygotsky, L. S. (1978). Mind and society: the development of higher psychological

processes. Cambridge, MA: Harvard University Press.

Wells M., Hestenes, D., & Swackhamer, G. (1995). A Modeling Method for High School

Physics Instruction, Am. J. Phys. 63(7), 606-619.

Daggett, W. R. (2012). Jobs and the Skills Gap. Retrieved from

http://www.leadered.com/pdf/Job-Skills%20Gap%20White%20PaperPDF.pdf

Wilson, J. M. (1994). The CUPLE physics studio. The Physics Teacher, 32(9), 518-523.

Yung, B. H. W. (2001). Three views of fairness in a school-based assessment scheme of

practical work in biology. International Journal of Science Education, 23(10), 985-

Zakaras, M. (2012). Global Competency Isn’t Just a Buzz Word. Retrieved from

http://startempathy.org/blog/2012/08/global-competency-isnt-just-buzz-word

Zimmerman, C. (2007). The development of scientific thinking skills in elementary and

middle school. Developmental Review 27, 172-223.

Zohar, A., & Dori. Y.J. (2012). Metacognition in Science Education: Trends in Current

Research. Science & Technology Education Library 40. doi: 10.1007/978-94-007-

2132-6. Retrieved from http://www.springerlink.com/content/978-94-007-2131-

9/?MUD=MP&sort=p_OnlineDate&sortorder=desc

Appendix A: Group Assessment of Logical Thinking (GALT) (An online version)

Item 1 - Piece of Clay Tom has two balls of clay. They are the same size and shape. When he places them on the balance, they weigh the same.

The balls of clay are removed from the balance pans. Clay 2 is flattened like a pancake.

Which of these statements is true?

The pancake-shaped clay weighs more.

The two pieces weigh the same.

The ball weighs more.

What is your reason for this Answer?

1. You did not add or take away any clay. 2. When clay 2 was flattened like a pancake, it had a greater area. 3. When something is flattened, it loses weight. 4. Because of its density, the round ball had more clay in it.

Item 3 - Glass Size The drawing shows two glasses, a small one and a large one. It also shows two jars, a small one and a large one.

It takes 15 small glasses of water or 9 large glasses of water to fill the large jar. It takes 10 small glasses of water to fill the small jar. How many large glasses of water does it take to fill the same small jar?

other

What is your reason for this Answer?

It takes five less small glasses of water to fill the small jar. So it will take five less large glasses of water to fill the same jar.

The ratio of small to large glasses will always be 5 to 3.

The small glass is half the size of the large glass. So it will take about half the number of small glasses of water to fill up the same small jar.

There is no way of predicting

Item 5 - Pendulum Length Three strings are hung from a bar. Strings #1 and #3 are of equal length. String #2 is longer. Charlie attaches a 5-unit weight at the end of string #2 and at the end of #3. A 10-unit weight is attached at the end of string #1. Each string with a weight can be swung.

Charlie wants to find out if the length of the string has an effect on the amount of time it takes the string to swing back and forth. Which strings should he use for his experiment?

strings #1 and #2

strings #1 and #3

strings #2 and #3

strings #1, #2, and #3

string #2 only What is your reason for this Answer?

The length of the strings should be the same. The weights should be different

Different lengths with different weights should be tested

All strings and their weights should be tested against all others.

Only the longest string should be tested. The experiment is concerned with length not weight.

Everything needs to be the same except the length so you can tell if length makes a difference.

Item 7 - Squares and Diamonds In a cloth sack there are

All of the square pieces are the same size and shape. The diamond pieces are also the same size and shape. One piece is pulled out of the sack. What are the chances that it is a spotted piece?

1 out of 21

other

What is your reason for this Answer?

There are twenty-one pieces in the cloth sack. One spotted piece must be chosen from these.

One spotted piece needs to be selected from a total of seven spotted pieces.

Seven of the twenty-one pieces are spotted pieces

There are three sets in the cloth sack. One of them is spotted.

One fourth of the square pieces and 4/9 of the diamond pieces are spotted.

Item 9 - The Mice A farmer observed the mice that lived in his field. He found that the mice were either fat or thin. Also, the mice had either black or white tails. This made him wonder if there might be a relation between the size of a mouse and the color of its tail. So he decided to capture all of the mice in one part of his field and observe them. The mice that he captured are shown below.

Do you think there is a relation between the size of the mice and the color of their tails (that is, is one size of mouse more likely to have a certain color tail and vice versa)?

yes

no

What is your reason for this Answer?

8/11 of the fat mice have black tails and 3/4 of the thin mice have white tails.

Fat and thin mice can have either a black or a white tail.

Not all fat mice have black tails. Not all thin mice have white tails.

18 mice have black tails and 12 have white tails.

22 mice are fat and 8 mice are thin.

Item 11 - The Dance After supper, some students decide to go dancing. There are three young men: ALBERT (A), BOB (B), and CHARLES (C), and three young women: LOUISE (L), MARY (M), AND NANCY (N).

Albert Bob Charles Louise Mary Nancy

(A) (B) (C) (L) (M) (N)

One possible pair of dance partners is AL, which means ALBERT and LOUISE. In the box below, list all of the possible man-woman couples of dancers. Only man-woman dance couples are allowed. The first possible couple is done for you.

Well, that’s it! You may want to carefully review your answers and make sure that you answered ALL the questions. Remember, in order for an item to be scored as correct, both the answer and the reason must be correct.

Appendix B: The Test of Logical Thinking (TOLT) Questions and Reasoning A series of eight problems is presented. Each problem will lead to a question. Record the answer you have chosen and reason for selecting that answer.

1. Orange Juice Four large oranges are squeezed to make six glasses of juice. How much juice can be made from six oranges? a. 7 glasses b. 8 glasses c. 9 glasses d. 10 e. other Reason: 1. The number of glasses compared to the number of oranges will always be in the ratio 3 to 2. 2. With more oranges, the difference will be less. 3. The difference in the numbers will always be two. 4. With four oranges the difference was 2. With six oranges the difference would be two more. 5. There is no way of predicting.

2. Orange Juice How many oranges are needed to make 13 glasses of juice? a. 6 1/2 oranges b. 8 2/3 oranges c. 9 oranges d. 11 oranges e. other

Reason: 1. The number of oranges compared to the number of glasses will always be in the ratio of 2 to 3 2. If there are seven more glasses, then five more oranges are needed. 3. The difference in the numbers will always be two. 4. The number of oranges will always be half the number of glasses. 5. There is no way of predicting the number of oranges.

3. The Pendulum's Length

Suppose you wanted to do an experiment to find out if changing the length of a pendulum changed the amount of time it takes to swing back and forth. Which pendulums would you use for the experiment? a. 1 and 4 b. 2 and 4 c. 1 and 3 d. 2 and 5 e. all Reason: 1. The longest pendulum should be tested against the shortest pendulum. 2. All pendulums need to be tested against one another. 3. As the length is increased the number of washers should be decreased. 4. The pendulums should be the same length but the number of washers should be different. 5. The pendulums should be different lengths but the numbers of washers should be the same.

4. The Pendulum's Weight Suppose you wanted to do an experiment to find out if changing the weight on the end of the string changed the amount of time the pendulum takes to swing back and forth. Which pendulums would you use for the experiment? a. 1 and 4 b. 2 and 4 c. 1 and 3 d. 2 and 5 e. all

Reason: 1. The heaviest weight should be compared to the lightest weight. 2. All pendulums need to be tested against one another. 3. As the number of washers is increased the pendulum should be shortened. 4. The number of washers should be different but the pendulums should be the same length. 5. The number of washers should be the same but the pendulums should be different lengths.

5. The Vegetable Seeds A gardener bought a package containing 3 squash seeds and 3 bean seeds. If just one seed is selected from the package, what are the chances that it is a bean seed? a. 1 out of 2 b. 1 out of 3 c. 1 out of 4 d. 1 out of 6 e. 4 out of 6 Reason: 1. Four selections are needed because the three squash seeds could have been chosen in a row. 2. There are six seeds from which one bean seed must be chosen. 3. One bean seed needs to be selected from a total of three. 4. One half of the seeds are bean seeds. 5. In addition to a bean seed, three squash seeds could be selected from a total of six.

6. The Flower Seeds A gardener bought a package of 21 mixed seeds. The package contents listed:

3 short red flowers
4 short yellow flowers
5 short orange flowers
4 tall red flowers
2 tall yellow flowers
3 tall orange flowers

If just one seed is planted, what are the chances that the plant that grows will have red flowers? a. 1 out of 2 b. 1 out of 3 c. 1 out of 7 d. 1 out of 21 e. other

Reason: 1. One seed has to be chosen from among those that grow red, yellow or orange flowers. 2. 1/4 of the short and 4/9 of the tall are red. 3. It does not matter whether a tall or a short is picked. One red seed needs to be picked from a total of seven red seeds. 4. One red seed must be selected from a total of 21 seeds. 5. Seven of the twenty one seeds will produce red flowers.

7. The Mice The mice shown represent a sample of mice captured from a part of a field. Are fat mice more likely to have black tails and thin mice more likely to have white tails? a. Yes b. No

Reason: 1. 8/11 of the fat mice have black tails and 3/4 of the thin mice have white tails. 2. Some of the fat mice have white tails and some of the thin mice have white tails. 3. 18 mice out of thirty have black tails and 12 have white tails. 4. Not all of the fat mice have black tails and not all of the thin mice have white tails. 5. 6/12 of the white tailed mice are fat.

8. The Fish Are fat fish more likely to have broad stripes than thin fish? a. Yes b. No

Reason: 1. Some fat fish have broad stripes and some have narrow stripes. 2. 3/7 of the fat fish have broad stripes. 3. 12/28 are broad striped and 16/28 are narrow striped. 4. 3/7 of the fat fish have broad stripes and 9/21 of the thin fish have broad stripes. 5. Some fish with broad stripes are thin and some are fat.

9. The Student Council Three students from grades 10, 11, 12 were elected to the student council. A three member committee is to be formed with one person from each grade. All possible combinations must be considered before a decision can be made. Two possible combinations are Tom, Jerry and Dan (TJD) and Sally, Anne and Martha (SAM). List all other possible combinations in the spaces provided. More spaces are provided on the answer sheet than you will need. STUDENT COUNCIL

Grade 10: Tom (T), Sally (S), Bill (B)
Grade 11: Jerry (J), Anne (A), Connie (C)
Grade 12: Dan (D), Martha (M), Gwen (G)

10. The Shopping Center In a new shopping center, 4 store locations are going to be opened on the ground level. A BARBER SHOP (B), a DISCOUNT STORE (D), a GROCERY STORE (G), and a COFFEE SHOP (C) want to move in there. Each one of the stores can choose any one of four locations. One way that the stores could occupy the 4 locations is BDGC. List all other possible ways that the stores can occupy the 4 locations. More spaces are provided on the answer sheet than you will need.

Appendix C: Lawson’s Classroom Test of Scientific Reasoning


CBE Life Sciences Education, 17(1), Spring 2018

Understanding the Complex Relationship between Critical Thinking and Science Reasoning among Undergraduate Thesis Writers

Jason E. Dowd

† Department of Biology, Duke University, Durham, NC 27708

Robert J. Thompson, Jr.

‡ Department of Psychology and Neuroscience, Duke University, Durham, NC 27708

Leslie A. Schiff

§ Department of Microbiology and Immunology, University of Minnesota, Minneapolis, MN 55455

Julie A. Reynolds


This study empirically examines the relationship between students’ critical-thinking skills and scientific reasoning as reflected in undergraduate thesis writing in biology. Writing offers a unique window into studying this relationship, and the findings raise potential implications for instruction.

Developing critical-thinking and scientific reasoning skills are core learning objectives of science education, but little empirical evidence exists regarding the interrelationships between these constructs. Writing effectively fosters students’ development of these constructs, and it offers a unique window into studying how they relate. In this study of undergraduate thesis writing in biology at two universities, we examine how scientific reasoning exhibited in writing (assessed using the Biology Thesis Assessment Protocol) relates to general and specific critical-thinking skills (assessed using the California Critical Thinking Skills Test), and we consider implications for instruction. We find that scientific reasoning in writing is strongly related to inference, while other aspects of science reasoning that emerge in writing (epistemological considerations, writing conventions, etc.) are not significantly related to critical-thinking skills. Science reasoning in writing is not merely a proxy for critical thinking. In linking features of students’ writing to their critical-thinking skills, this study 1) provides a bridge to prior work suggesting that engagement in science writing enhances critical thinking and 2) serves as a foundational step for subsequently determining whether instruction focused explicitly on developing critical-thinking skills (particularly inference) can actually improve students’ scientific reasoning in their writing.

INTRODUCTION

Critical-thinking and scientific reasoning skills are core learning objectives of science education for all students, regardless of whether or not they intend to pursue a career in science or engineering. Consistent with the view of learning as construction of understanding and meaning ( National Research Council, 2000 ), the pedagogical practice of writing has been found to be effective not only in fostering the development of students’ conceptual and procedural knowledge ( Gerdeman et al. , 2007 ) and communication skills ( Clase et al. , 2010 ), but also scientific reasoning ( Reynolds et al. , 2012 ) and critical-thinking skills ( Quitadamo and Kurtz, 2007 ).

Critical thinking and scientific reasoning are similar but different constructs that include various types of higher-order cognitive processes, metacognitive strategies, and dispositions involved in making meaning of information. Critical thinking is generally understood as the broader construct ( Holyoak and Morrison, 2005 ), comprising an array of cognitive processes and dispositions that are drawn upon differentially in everyday life and across domains of inquiry such as the natural sciences, social sciences, and humanities. Scientific reasoning, then, may be interpreted as the subset of critical-thinking skills (cognitive and metacognitive processes and dispositions) that 1) are involved in making meaning of information in scientific domains and 2) support the epistemological commitment to scientific methodology and paradigm(s).

Although there has been an enduring focus in higher education on promoting critical thinking and reasoning as general or “transferable” skills, research evidence provides increasing support for the view that reasoning and critical thinking are also situational or domain specific ( Beyer et al. , 2013 ). Some researchers, such as Lawson (2010) , present frameworks in which science reasoning is characterized explicitly in terms of critical-thinking skills. There are, however, limited coherent frameworks and empirical evidence regarding either the general or domain-specific interrelationships of scientific reasoning, as it is most broadly defined, and critical-thinking skills.

The Vision and Change in Undergraduate Biology Education Initiative provides a framework for thinking about these constructs and their interrelationship in the context of the core competencies and disciplinary practice they describe ( American Association for the Advancement of Science, 2011 ). These learning objectives aim for undergraduates to “understand the process of science, the interdisciplinary nature of the new biology and how science is closely integrated within society; be competent in communication and collaboration; have quantitative competency and a basic ability to interpret data; and have some experience with modeling, simulation and computational and systems level approaches as well as with using large databases” ( Woodin et al. , 2010 , pp. 71–72). This framework makes clear that science reasoning and critical-thinking skills play key roles in major learning outcomes; for example, “understanding the process of science” requires students to engage in (and be metacognitive about) scientific reasoning, and having the “ability to interpret data” requires critical-thinking skills. To help students better achieve these core competencies, we must better understand the interrelationships of their composite parts. Thus, the next step is to determine which specific critical-thinking skills are drawn upon when students engage in science reasoning in general and with regard to the particular scientific domain being studied. Such a determination could be applied to improve science education for both majors and nonmajors through pedagogical approaches that foster critical-thinking skills that are most relevant to science reasoning.

Writing affords one of the most effective means for making thinking visible ( Reynolds et al. , 2012 ) and learning how to “think like” and “write like” disciplinary experts ( Meizlish et al. , 2013 ). As a result, student writing affords the opportunities to both foster and examine the interrelationship of scientific reasoning and critical-thinking skills within and across disciplinary contexts. The purpose of this study was to better understand the relationship between students’ critical-thinking skills and scientific reasoning skills as reflected in the genre of undergraduate thesis writing in biology departments at two research universities, the University of Minnesota and Duke University.

In the following subsections, we discuss in greater detail the constructs of scientific reasoning and critical thinking, as well as the assessment of scientific reasoning in students’ thesis writing. In subsequent sections, we discuss our study design, findings, and the implications for enhancing educational practices.

Critical Thinking

The advances in cognitive science in the 21st century have increased our understanding of the mental processes involved in thinking and reasoning, as well as memory, learning, and problem solving. Critical thinking is understood to include both a cognitive dimension and a disposition dimension (e.g., reflective thinking) and is defined as “purposeful, self-regulatory judgment which results in interpretation, analysis, evaluation, and inference, as well as explanation of the evidential, conceptual, methodological, criteriological, or contextual considerations upon which that judgment is based” ( Facione, 1990, p. 3 ). Although various other definitions of critical thinking have been proposed, researchers have generally coalesced on this consensus expert view ( Blattner and Frazier, 2002 ; Condon and Kelly-Riley, 2004 ; Bissell and Lemons, 2006 ; Quitadamo and Kurtz, 2007 ) and the corresponding measures of critical-thinking skills ( August, 2016 ; Stephenson and Sadler-McKnight, 2016 ).

Both the cognitive skills and dispositional components of critical thinking have been recognized as important to science education ( Quitadamo and Kurtz, 2007 ). Empirical research demonstrates that specific pedagogical practices in science courses are effective in fostering students’ critical-thinking skills. Quitadamo and Kurtz (2007) found that students who engaged in a laboratory writing component in the context of a general education biology course significantly improved their overall critical-thinking skills (and their analytical and inference skills, in particular), whereas students engaged in a traditional quiz-based laboratory did not improve their critical-thinking skills. In related work, Quitadamo et al. (2008) found that a community-based inquiry experience, involving inquiry, writing, research, and analysis, was associated with improved critical thinking in a biology course for nonmajors, compared with traditionally taught sections. In both studies, students who exhibited stronger presemester critical-thinking skills exhibited stronger gains, suggesting that “students who have not been explicitly taught how to think critically may not reach the same potential as peers who have been taught these skills” ( Quitadamo and Kurtz, 2007 , p. 151).

Recently, Stephenson and Sadler-McKnight (2016) found that first-year general chemistry students who engaged in a science writing heuristic laboratory, which is an inquiry-based, writing-to-learn approach to instruction ( Hand and Keys, 1999 ), had significantly greater gains in total critical-thinking scores than students who received traditional laboratory instruction. Each of the four components—inquiry, writing, collaboration, and reflection—has been linked to critical thinking ( Stephenson and Sadler-McKnight, 2016 ). Like the other studies, this work highlights the value of targeting critical-thinking skills and the effectiveness of an inquiry-based, writing-to-learn approach to enhance critical thinking. Across studies, authors advocate adopting critical thinking as the course framework ( Pukkila, 2004 ) and developing explicit examples of how critical thinking relates to the scientific method ( Miri et al. , 2007 ).

In these examples, the important connection between writing and critical thinking is highlighted by the fact that each intervention involves the incorporation of writing into science, technology, engineering, and mathematics education (either alone or in combination with other pedagogical practices). However, critical-thinking skills are not always the primary learning outcome; in some contexts, scientific reasoning is the primary outcome that is assessed.

Scientific Reasoning

Scientific reasoning is a complex process that is broadly defined as “the skills involved in inquiry, experimentation, evidence evaluation, and inference that are done in the service of conceptual change or scientific understanding” ( Zimmerman, 2007 , p. 172). Scientific reasoning is understood to include both conceptual knowledge and the cognitive processes involved with generation of hypotheses (i.e., inductive processes involved in the generation of hypotheses and the deductive processes used in the testing of hypotheses), experimentation strategies, and evidence evaluation strategies. These dimensions are interrelated, in that “experimentation and inference strategies are selected based on prior conceptual knowledge of the domain” ( Zimmerman, 2000 , p. 139). Furthermore, conceptual and procedural knowledge and cognitive process dimensions can be general and domain specific (or discipline specific).

With regard to conceptual knowledge, attention has been focused on the acquisition of core methodological concepts fundamental to scientists’ causal reasoning and metacognitive distancing (or decontextualized thinking), which is the ability to reason independently of prior knowledge or beliefs ( Greenhoot et al. , 2004 ). The latter involves what Kuhn and Dean (2004) refer to as the coordination of theory and evidence, which requires that one question existing theories (i.e., prior knowledge and beliefs), seek contradictory evidence, eliminate alternative explanations, and revise one’s prior beliefs in the face of contradictory evidence. Kuhn and colleagues (2008) further elaborate that scientific thinking requires “a mature understanding of the epistemological foundations of science, recognizing scientific knowledge as constructed by humans rather than simply discovered in the world,” and “the ability to engage in skilled argumentation in the scientific domain, with an appreciation of argumentation as entailing the coordination of theory and evidence” ( Kuhn et al. , 2008 , p. 435). “This approach to scientific reasoning not only highlights the skills of generating and evaluating evidence-based inferences, but also encompasses epistemological appreciation of the functions of evidence and theory” ( Ding et al. , 2016 , p. 616). Evaluating evidence-based inferences involves epistemic cognition, which Moshman (2015) defines as the subset of metacognition that is concerned with justification, truth, and associated forms of reasoning. Epistemic cognition is both general and domain specific (or discipline specific; Moshman, 2015 ).

There is empirical support for the contributions of both prior knowledge and an understanding of the epistemological foundations of science to scientific reasoning. In a study of undergraduate science students, advanced scientific reasoning was most often accompanied by accurate prior knowledge as well as sophisticated epistemological commitments; additionally, for students who had comparable levels of prior knowledge, skillful reasoning was associated with a strong epistemological commitment to the consistency of theory with evidence ( Zeineddin and Abd-El-Khalick, 2010 ). These findings highlight the importance of the need for instructional activities that intentionally help learners develop sophisticated epistemological commitments focused on the nature of knowledge and the role of evidence in supporting knowledge claims ( Zeineddin and Abd-El-Khalick, 2010 ).

Scientific Reasoning in Students’ Thesis Writing

Pedagogical approaches that incorporate writing have also focused on enhancing scientific reasoning. Many rubrics have been developed to assess aspects of scientific reasoning in written artifacts. For example, Timmerman and colleagues (2011) , in the course of describing their own rubric for assessing scientific reasoning, highlight several examples of scientific reasoning assessment criteria ( Haaga, 1993 ; Tariq et al. , 1998 ; Topping et al. , 2000 ; Kelly and Takao, 2002 ; Halonen et al. , 2003 ; Willison and O’Regan, 2007 ).

At both the University of Minnesota and Duke University, we have focused on the genre of the undergraduate honors thesis as the rhetorical context in which to study and improve students’ scientific reasoning and writing. We view the process of writing an undergraduate honors thesis as a form of professional development in the sciences (i.e., a way of engaging students in the practices of a community of discourse). We have found that structured courses designed to scaffold the thesis-writing process and promote metacognition can improve writing and reasoning skills in biology, chemistry, and economics ( Reynolds and Thompson, 2011 ; Dowd et al. , 2015a , b ). In the context of this prior work, we have defined scientific reasoning in writing as the emergent, underlying construct measured across distinct aspects of students’ written discussion of independent research in their undergraduate theses.

The Biology Thesis Assessment Protocol (BioTAP) was developed at Duke University as a tool for systematically guiding students and faculty through a “draft–feedback–revision” writing process, modeled after professional scientific peer-review processes ( Reynolds et al. , 2009 ). BioTAP includes activities and worksheets that allow students to engage in critical peer review and provides detailed descriptions, presented as rubrics, of the questions (i.e., dimensions, shown in Table 1 ) upon which such review should focus. Nine rubric dimensions focus on communication to the broader scientific community, and four rubric dimensions focus on the accuracy and appropriateness of the research. These rubric dimensions provide criteria by which the thesis is assessed, and therefore allow BioTAP to be used as an assessment tool as well as a teaching resource ( Reynolds et al. , 2009 ). Full details are available at www.science-writing.org/biotap.html .

Theses assessment protocol dimensions

In previous work, we have used BioTAP to quantitatively assess students’ undergraduate honors theses and explore the relationship between thesis-writing courses (or specific interventions within the courses) and the strength of students’ science reasoning in writing across different science disciplines: biology ( Reynolds and Thompson, 2011 ); chemistry ( Dowd et al. , 2015b ); and economics ( Dowd et al. , 2015a ). We have focused exclusively on the nine dimensions related to reasoning and writing (questions 1–9), as the other four dimensions (questions 10–13) require topic-specific expertise and are intended to be used by the student’s thesis supervisor.

Beyond considering individual dimensions, we have investigated whether meaningful constructs underlie students’ thesis scores. We conducted exploratory factor analysis of students’ theses in biology, economics, and chemistry and found one dominant underlying factor in each discipline; we termed the factor “scientific reasoning in writing” ( Dowd et al. , 2015a , b , 2016 ). That is, each of the nine dimensions could be understood as reflecting, in different ways and to different degrees, the construct of scientific reasoning in writing. The findings indicated evidence of both general and discipline-specific components to scientific reasoning in writing that relate to epistemic beliefs and paradigms, in keeping with broader ideas about science reasoning discussed earlier. Specifically, scientific reasoning in writing is more strongly associated with formulating a compelling argument for the significance of the research in the context of current literature in biology, making meaning regarding the implications of the findings in chemistry, and providing an organizational framework for interpreting the thesis in economics. We suggested that instruction, whether occurring in writing studios or in writing courses to facilitate thesis preparation, should attend to both components.
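
As an illustration of this kind of analysis, the sketch below runs a one-factor exploratory factor analysis over simulated scores on the nine BioTAP dimensions using scikit-learn. The simulated data, the single-factor setup, and the use of FactorAnalysis are assumptions made for the sketch, not the study's actual procedure or results.

```python
# A minimal sketch of a one-factor exploratory factor analysis over nine rubric
# dimensions, of the kind described above. The scores are simulated stand-ins,
# not the study's data; sklearn's FactorAnalysis is used as a generic EFA tool.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
n_students, n_dimensions = 65, 9

# Simulate rubric scores (centered near 3) driven by one underlying ability plus noise.
ability = rng.normal(size=(n_students, 1))
loadings_true = rng.uniform(0.5, 1.0, size=(1, n_dimensions))
scores = 3 + ability @ loadings_true + rng.normal(scale=0.5, size=(n_students, n_dimensions))

fa = FactorAnalysis(n_components=1)
fa.fit(scores)

# Loadings of each rubric dimension on the single extracted factor
# ("scientific reasoning in writing" in the study's terminology).
for i, loading in enumerate(fa.components_[0], start=1):
    print(f"Question {i}: loading = {loading:+.2f}")
```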

Research Question and Study Design

The genre of thesis writing combines the pedagogies of writing and inquiry found to foster scientific reasoning ( Reynolds et al. , 2012 ) and critical thinking ( Quitadamo and Kurtz, 2007 ; Quitadamo et al. , 2008 ; Stephenson and Sadler-­McKnight, 2016 ). However, there is no empirical evidence regarding the general or domain-specific interrelationships of scientific reasoning and critical-thinking skills, particularly in the rhetorical context of the undergraduate thesis. The BioTAP studies discussed earlier indicate that the rubric-based assessment produces evidence of scientific reasoning in the undergraduate thesis, but it was not designed to foster or measure critical thinking. The current study was undertaken to address the research question: How are students’ critical-thinking skills related to scientific reasoning as reflected in the genre of undergraduate thesis writing in biology? Determining these interrelationships could guide efforts to enhance students’ scientific reasoning and writing skills through focusing instruction on specific critical-thinking skills as well as disciplinary conventions.

To address this research question, we focused on undergraduate thesis writers in biology courses at two institutions, Duke University and the University of Minnesota, and examined the extent to which students’ scientific reasoning in writing, assessed in the undergraduate thesis using BioTAP, corresponds to students’ critical-thinking skills, assessed using the California Critical Thinking Skills Test (CCTST; August, 2016 ).

Study Sample

The study sample was composed of students enrolled in courses designed to scaffold the thesis-writing process in the Department of Biology at Duke University and the College of Biological Sciences at the University of Minnesota. Both courses complement students’ individual work with research advisors. The course is required for thesis writers at the University of Minnesota and optional for those at Duke University. Completing a thesis is not required of all students, but it is required to graduate with honors; at the University of Minnesota, such students are enrolled in an honors program within the college. In total, 28 students were enrolled in the course at Duke University and 44 at the University of Minnesota. Of those students, two did not consent to participate in the study, and five did not validly complete the CCTST (i.e., attempted fewer than 60% of items or completed the test in less than 15 minutes). Thus, our overall rate of valid participation is 90%, with 27 students from Duke University and 38 from the University of Minnesota. We found no statistically significant differences in thesis assessment between students with valid and invalid CCTST scores. Therefore, for most of this study we focus on the 65 students who consented to participate and for whom we have complete and valid data. Additionally, when asking students for their consent to participate, we allowed them to choose whether to provide or decline access to academic and demographic background data. Of the 65 participating students, 52 granted access to such data, so additional analyses involving academic and background data focus on those 52 students. We note that the 13 students who participated but declined to share additional data scored slightly lower on the CCTST than the other 52 (perhaps suggesting that they differ on other measures as well, though we cannot determine this with certainty). Among the 52 students, 60% identified as female and 10% identified as being from underrepresented ethnicities.

In both courses, students completed the CCTST online, either in class or on their own, late in the Spring 2016 semester. This is the same assessment that was used in prior studies of critical thinking ( Quitadamo and Kurtz, 2007 ; Quitadamo et al. , 2008 ; Stephenson and Sadler-McKnight, 2016 ). It is “an objective measure of the core reasoning skills needed for reflective decision making concerning what to believe or what to do” ( Insight Assessment, 2016a ). In the test, students are asked to read and consider information as they answer multiple-choice questions. The questions are intended to be appropriate for all users, so there is no expectation of prior disciplinary knowledge in biology (or any other subject). Although actual test items are protected, sample items are available on the Insight Assessment website ( Insight Assessment, 2016b ). We have included one sample item in the Supplemental Material.

The CCTST is based on a consensus definition of critical thinking, measures cognitive and metacognitive skills associated with critical thinking, and has been evaluated for validity and reliability at the college level ( August, 2016 ; Stephenson and Sadler-McKnight, 2016 ). In addition to providing an overall critical-thinking score, the CCTST assesses seven dimensions of critical thinking: analysis, interpretation, inference, evaluation, explanation, induction, and deduction. Scores on each dimension are calculated based on students’ performance on items related to that dimension. Analysis focuses on identifying assumptions, reasons, and claims and examining how they interact to form arguments. Interpretation, related to analysis, focuses on determining the precise meaning and significance of information. Inference focuses on drawing conclusions from reasons and evidence. Evaluation focuses on assessing the credibility of sources of information and the claims they make. Explanation, related to evaluation, focuses on describing the evidence, assumptions, or rationale for beliefs and conclusions. Induction focuses on drawing inferences about what is probably true based on evidence. Deduction focuses on drawing conclusions about what must be true when the context completely determines the outcome. These are not independent dimensions; the fact that they are related supports their collective interpretation as critical thinking. Together, the CCTST dimensions provide a basis for evaluating students’ overall strength in using reasoning to form reflective judgments about what to believe or what to do ( August, 2016 ). Each of the seven dimensions and the overall CCTST score are measured on a scale of 0–100, where higher scores indicate superior performance. Scores correspond to superior (86–100), strong (79–85), moderate (70–78), weak (63–69), or not manifested (62 and below) skills.
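
The qualitative bands attached to these scores follow directly from the published cut points. As a minimal illustration (in Python, with a function name of our own choosing rather than anything provided by Insight Assessment), the banding can be expressed as:

```python
def cctst_band(score):
    """Map a 0-100 CCTST score to its qualitative band (cut points as listed above)."""
    if score >= 86:
        return "superior"
    if score >= 79:
        return "strong"
    if score >= 70:
        return "moderate"
    if score >= 63:
        return "weak"
    return "not manifested"

print(cctst_band(84))  # -> "strong"
```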

Scientific Reasoning in Writing

At the end of the semester, students’ final, submitted undergraduate theses were assessed using BioTAP, which consists of nine rubric dimensions that focus on communication to the broader scientific community and four additional dimensions that focus on the exhibition of topic-specific expertise ( Reynolds et al. , 2009 ). These dimensions, framed as questions, are displayed in Table 1 .

Student theses were assessed on questions 1–9 of BioTAP using the same procedures described in previous studies ( Reynolds and Thompson, 2011 ; Dowd et al. , 2015a , b ). In this study, six raters were trained in the valid, reliable use of BioTAP rubrics. Each dimension was rated on a five-point scale: 1 indicates the dimension is missing, incomplete, or below acceptable standards; 3 indicates that the dimension is adequate but not exhibiting mastery; and 5 indicates that the dimension is excellent and exhibits mastery (intermediate ratings of 2 and 4 are appropriate when different parts of the thesis make a single category challenging). After training, two raters independently assessed each thesis and then discussed their independent ratings with one another to form a consensus rating. The consensus score is not an average score, but rather an agreed-upon, discussion-based score. On a five-point scale, raters independently assessed dimensions to be within 1 point of each other 82.4% of the time before discussion and formed consensus ratings 100% of the time after discussion.
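
The within-one-point agreement statistic reported above can be reproduced from paired ratings in a few lines; the sketch below uses invented ratings from two raters across the nine dimensions, not data from the study.

```python
import numpy as np

def within_one_point(rater_a, rater_b):
    """Fraction of paired dimension ratings that differ by at most 1 point."""
    a, b = np.asarray(rater_a), np.asarray(rater_b)
    return float(np.mean(np.abs(a - b) <= 1))

# Hypothetical ratings of one thesis on BioTAP questions 1-9 by two raters
rater_a = [5, 4, 3, 5, 4, 3, 5, 2, 4]
rater_b = [4, 4, 4, 5, 3, 3, 4, 4, 4]
print(within_one_point(rater_a, rater_b))  # 8 of 9 pairs within 1 point, about 0.89
```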

In this study, we consider both categorical (mastery/nonmastery, where a score of 5 corresponds to mastery) and numerical treatments of individual BioTAP scores to better relate the manifestation of critical thinking in BioTAP assessment to all of the prior studies. For comprehensive/cumulative measures of BioTAP, we focus on the partial sum of questions 1–5, as these questions relate to higher-order scientific reasoning (whereas questions 6–9 relate to mid- and lower-order writing mechanics [ Reynolds et al. , 2009 ]), and the factor scores (i.e., numerical representations of the extent to which each student exhibits the underlying factor), which are calculated from the factor loadings published by Dowd et al. (2016) . We do not focus on questions 6–9 individually in statistical analyses, because we do not expect critical-thinking skills to relate to mid- and lower-order writing skills.
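
To make the two cumulative measures concrete, the sketch below shows one way they could be computed from a matrix of dimension scores. The loadings are placeholders (the published values from Dowd et al., 2016, are not reproduced here), and the weighted-sum calculation is a coarse approximation of a factor score rather than the exact estimator used in the analysis.

```python
import numpy as np

def partial_sum_q1_5(scores_row):
    """Sum of BioTAP questions 1-5, the higher-order reasoning dimensions."""
    return float(np.sum(scores_row[:5]))

def approx_factor_scores(score_matrix, loadings):
    """Weighted-sum approximation of factor scores: standardize each dimension
    across students, then weight it by its factor loading."""
    X = np.asarray(score_matrix, dtype=float)
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    return Z @ np.asarray(loadings, dtype=float)

# Hypothetical BioTAP scores (questions 1-9) for three students
scores = np.array([
    [5, 4, 3, 5, 4, 3, 5, 4, 4],
    [3, 3, 4, 4, 3, 4, 4, 3, 3],
    [4, 5, 5, 4, 5, 4, 3, 5, 4],
])
loadings = np.full(9, 0.6)  # placeholder loadings, not the published values

print([partial_sum_q1_5(row) for row in scores])
print(approx_factor_scores(scores, loadings))
```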

The final, submitted thesis reflects the student’s writing, the student’s scientific reasoning, the quality of feedback provided to the student by peers and mentors, and the student’s ability to incorporate that feedback into his or her work. Therefore, our assessment is not the same as an assessment of unpolished, unrevised samples of students’ written work. While one might imagine that such an unpolished sample may be more strongly correlated with critical-thinking skills measured by the CCTST, we argue that the complete, submitted thesis, assessed using BioTAP, is ultimately a more appropriate reflection of how students exhibit science reasoning in the scientific community.

Statistical Analyses

We took several steps to analyze the collected data. First, to provide context for subsequent interpretations, we generated descriptive statistics for the CCTST scores of the participants based on the norms for undergraduate CCTST test takers. To determine the strength of relationships among CCTST dimensions (including overall score) and the BioTAP dimensions, partial-sum score (questions 1–5), and factor score, we calculated Pearson’s correlations for each pair of measures. To examine whether falling on one side of the nonmastery/mastery threshold (as opposed to a linear scale of performance) was related to critical thinking, we grouped BioTAP dimensions into categories (mastery/nonmastery) and conducted Student’s t tests to compare the mean scores of the two groups on each of the seven dimensions and the overall score of the CCTST. Finally, for the strongest relationship that emerged, we included additional academic and background variables as covariates in multiple linear-regression analysis to explore questions about how much observed relationships between critical-thinking skills and science reasoning in writing might be explained by variation in these other factors.
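
The correlation and group-comparison steps described above map onto standard SciPy routines. The sketch below runs both on simulated stand-ins for the real variables (the data are random placeholders, so the printed statistics are not those reported in the Results).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 65  # students with valid CCTST scores

# Simulated stand-ins for the real measures
cctst_overall = rng.normal(80, 7, n)                     # overall CCTST score
biotap_partial = rng.integers(10, 26, n).astype(float)   # partial sum of questions 1-5
mastery_q5 = rng.integers(0, 2, n).astype(bool)          # mastery (score of 5) on question 5

# Pearson's correlation between a BioTAP measure and a CCTST measure
r, p = stats.pearsonr(biotap_partial, cctst_overall)
print(f"r = {r:.2f}, p = {p:.3f}")

# Student's t test comparing CCTST scores of mastery and nonmastery groups
t, p = stats.ttest_ind(cctst_overall[mastery_q5], cctst_overall[~mastery_q5])
print(f"t = {t:.2f}, p = {p:.3f}")
```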

Although BioTAP scores represent discrete, ordinal bins, the five-point scale is intended to capture an underlying continuous construct (from inadequate to exhibiting mastery). It has been argued that five categories is an appropriate cutoff for treating ordinal variables as pseudo-continuous ( Rhemtulla et al. , 2012 )—and therefore using continuous-variable statistical methods (e.g., Pearson’s correlations)—as long as the underlying assumption that ordinal scores are linearly distributed is valid. Although we have no way to statistically test this assumption, we interpret adequate scores to be approximately halfway between inadequate and mastery scores, resulting in a linear scale. In part because this assumption is subject to disagreement, we also consider and interpret a categorical (mastery/nonmastery) treatment of BioTAP variables.

We corrected for multiple comparisons using the Holm-Bonferroni method ( Holm, 1979 ). At the most general level, where we consider the single, comprehensive measures for BioTAP (partial-sum and factor score) and the CCTST (overall score), there is no need to correct for multiple comparisons, because the multiple, individual dimensions are collapsed into single dimensions. When we considered individual CCTST dimensions in relation to comprehensive measures for BioTAP, we accounted for seven comparisons; similarly, when we considered individual dimensions of BioTAP in relation to overall CCTST score, we accounted for five comparisons. When all seven CCTST and five BioTAP dimensions were examined individually and without prior knowledge, we accounted for 35 comparisons; such a rigorous threshold is likely to reject weak and moderate relationships, but it is appropriate if there are no specific pre-existing hypotheses. All p values are presented in tables for complete transparency, and we carefully consider the implications of our interpretation of these data in the Discussion section.
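
For readers unfamiliar with the step-down procedure, the following is a minimal sketch of Holm (1979) as we understand it; the example p values are invented and chosen to echo the 0.05/35 cutoff of about 0.00143 mentioned below.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Holm (1979) step-down procedure.

    Returns booleans (in the original order) marking which hypotheses are
    rejected while controlling the family-wise error rate at alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        # The cutoff is alpha / (m - rank); stop at the first p value that fails.
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break
    return reject

# With 35 comparisons, the smallest p value must fall below 0.05 / 35 (about 0.00143)
pvals = [0.005] + [0.2] * 34
print(any(holm_bonferroni(pvals)))  # False: p = 0.005 does not survive correction
```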

CCTST scores for students in this sample ranged from the 39th to 99th percentile of the general population of undergraduate CCTST test takers (mean percentile = 84.3, median = 85th percentile; Table 2 ); these percentiles reflect overall scores that range from moderate to superior. Scores on individual dimensions and overall scores were sufficiently normal and far enough from the ceiling of the scale to justify subsequent statistical analyses.

Table 2. Descriptive statistics of CCTST dimensions a

a Scores correspond to superior (86–100), strong (79–85), moderate (70–78), weak (63–69), or not manifested (62 and lower) skills.

The Pearson’s correlations between students’ cumulative scores on BioTAP (the factor score based on loadings published by Dowd et al. , 2016 , and the partial sum of scores on questions 1–5) and students’ overall scores on the CCTST are presented in Table 3 . We found that the partial-sum measure of BioTAP was significantly related to the overall measure of critical thinking ( r = 0.27, p = 0.03), while the BioTAP factor score was marginally related to overall CCTST ( r = 0.24, p = 0.05). When we looked at relationships between comprehensive BioTAP measures and scores for individual dimensions of the CCTST ( Table 3 ), we found significant positive correlations between both the BioTAP partial-sum and factor scores and CCTST inference ( r = 0.45, p < 0.001, and r = 0.41, p < 0.001, respectively). Although some other relationships have p values below 0.05 (e.g., the correlations between BioTAP partial-sum scores and CCTST induction and interpretation scores), they are not significant when we correct for multiple comparisons.

Table 3. Correlations between dimensions of CCTST and dimensions of BioTAP a

a In each cell, the top number is the correlation, and the bottom, italicized number is the associated p value. Correlations that are statistically significant after correcting for multiple comparisons are shown in bold.

b This is the partial sum of BioTAP scores on questions 1–5.

c This is the factor score calculated from factor loadings published by Dowd et al. (2016) .

When we expanded comparisons to include all 35 potential correlations among individual BioTAP and CCTST dimensions—and, accordingly, corrected for 35 comparisons—we did not find any additional statistically significant relationships. The Pearson’s correlations between students’ scores on each dimension of BioTAP and students’ scores on each dimension of the CCTST range from −0.11 to 0.35 ( Table 3 ); although the relationship between discussion of implications (BioTAP question 5) and inference appears to be relatively large ( r = 0.35), it is not significant ( p = 0.005; the Holm-Bonferroni cutoff is 0.00143). We found no statistically significant relationships between BioTAP questions 6–9 and CCTST dimensions (unpublished data), regardless of whether we correct for multiple comparisons.

The results of Student’s t tests comparing scores on each dimension of the CCTST of students who exhibit mastery with those of students who do not exhibit mastery on each dimension of BioTAP are presented in Table 4 . Focusing first on the overall CCTST scores, we found that the difference between those who exhibit mastery and those who do not in discussing implications of results (BioTAP question 5) is statistically significant ( t = 2.73, p = 0.008, d = 0.71). When we expanded t tests to include all 35 comparisons—and, like above, corrected for 35 comparisons—we found a significant difference in inference scores between students who exhibit mastery on question 5 and students who do not ( t = 3.41, p = 0.0012, d = 0.88), as well as a marginally significant difference in these students’ induction scores ( t = 3.26, p = 0.0018, d = 0.84; the Holm-Bonferroni cutoff is p = 0.00147). Cohen’s d effect sizes, which reveal the strength of the differences for statistically significant relationships, range from 0.71 to 0.88.
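
The effect sizes reported here are Cohen’s d for two independent groups, computed with the pooled standard deviation; the group scores below are invented solely to show the calculation.

```python
import numpy as np

def cohens_d(group1, group2):
    """Cohen's d for two independent groups, using the pooled standard deviation."""
    x, y = np.asarray(group1, float), np.asarray(group2, float)
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2)
    return (x.mean() - y.mean()) / np.sqrt(pooled_var)

# Hypothetical CCTST inference scores: mastery vs. nonmastery on BioTAP question 5
mastery = [88, 84, 90, 86, 92, 85]
nonmastery = [80, 78, 83, 79, 84, 77]
print(round(cohens_d(mastery, nonmastery), 2))
```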

Table 4. The t statistics and effect sizes of differences in dimensions of CCTST across dimensions of BioTAP a

a In each cell, the top number is the t statistic for the comparison, the middle, italicized number is the associated p value, and the bottom number is the effect size. Differences that are statistically significant after correcting for multiple comparisons are shown in bold.

Finally, we more closely examined the strongest relationship that we observed, which was between the CCTST dimension of inference and the BioTAP partial-sum composite score (shown in Table 3 ), using multiple regression analysis ( Table 5 ). Focusing on the 52 students for whom we have background information, we looked at the simple relationship between BioTAP and inference (model 1), a robust background model including multiple covariates that one might expect to explain some part of the variation in BioTAP (model 2), and a combined model including all variables (model 3). As model 3 shows, the covariates explain very little variation in BioTAP scores, and the relationship between inference and BioTAP persists even in the presence of all of the covariates.

Table 5. Partial sum (questions 1–5) of BioTAP scores ( n = 52)

** p < 0.01.

*** p < 0.001.
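
The three nested models summarized in Table 5 could be fit with ordinary least squares as sketched below. The covariates shown (GPA and gender) are examples of the kind of academic and demographic background variables described above, not the exact set used in the study, and the data are simulated.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 52  # students who granted access to background data

df = pd.DataFrame({
    "biotap_partial": rng.normal(20, 3, n),  # partial sum of questions 1-5 (simulated)
    "inference": rng.normal(82, 8, n),       # CCTST inference score (simulated)
    "gpa": rng.normal(3.5, 0.3, n),          # example background covariate
    "female": rng.integers(0, 2, n),         # example demographic covariate
})

m1 = smf.ols("biotap_partial ~ inference", data=df).fit()                 # model 1
m2 = smf.ols("biotap_partial ~ gpa + female", data=df).fit()              # model 2
m3 = smf.ols("biotap_partial ~ inference + gpa + female", data=df).fit()  # model 3
print(m1.rsquared, m2.rsquared, m3.rsquared)
```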

The aim of this study was to examine the extent to which the various components of scientific reasoning—manifested in writing in the genre of undergraduate thesis and assessed using BioTAP—draw on general and specific critical-thinking skills (assessed using CCTST) and to consider the implications for educational practices. Although science reasoning involves critical-thinking skills, it also relates to conceptual knowledge and the epistemological foundations of science disciplines ( Kuhn et al. , 2008 ). Moreover, science reasoning in writing , captured in students’ undergraduate theses, reflects habits, conventions, and the incorporation of feedback that may alter evidence of individuals’ critical-thinking skills. Our findings, however, provide empirical evidence that cumulative measures of science reasoning in writing are nonetheless related to students’ overall critical-thinking skills ( Table 3 ). The particularly significant roles of inference skills ( Table 3 ) and the discussion of implications of results (BioTAP question 5; Table 4 ) provide a basis for more specific ideas about how these constructs relate to one another and what educational interventions may have the most success in fostering these skills.

Our results build on previous findings. The genre of thesis writing combines pedagogies of writing and inquiry found to foster scientific reasoning ( Reynolds et al. , 2012 ) and critical thinking ( Quitadamo and Kurtz, 2007 ; Quitadamo et al. , 2008 ; Stephenson and Sadler-McKnight, 2016 ). Quitadamo and Kurtz (2007) reported that students who engaged in a laboratory writing component in a general education biology course significantly improved their inference and analysis skills, and Quitadamo and colleagues (2008) found that participation in a community-based inquiry biology course (that included a writing component) was associated with significant gains in students’ inference and evaluation skills. The shared focus on inference is noteworthy, because these prior studies actually differ from the current study; the former considered critical-­thinking skills as the primary learning outcome of writing-­focused interventions, whereas the latter focused on emergent links between two learning outcomes (science reasoning in writing and critical thinking). In other words, inference skills are impacted by writing as well as manifested in writing.

Inference focuses on drawing conclusions from argument and evidence. According to the consensus definition of critical thinking, the specific skill of inference includes several processes: querying evidence, conjecturing alternatives, and drawing conclusions. All of these activities are central to the independent research at the core of writing an undergraduate thesis. Indeed, a critical part of what we call “science reasoning in writing” might be characterized as a measure of students’ ability to infer and make meaning of information and findings. Because the cumulative BioTAP measures distill underlying similarities and, to an extent, suppress unique aspects of individual dimensions, we argue that it is appropriate to relate inference to scientific reasoning in writing . Even when we control for other potentially relevant background characteristics, the relationship is strong ( Table 5 ).

In taking the complementary view and focusing on BioTAP, when we compared students who exhibit mastery with those who do not, we found that the specific dimension of “discussing the implications of results” (question 5) differentiates students’ performance on several critical-thinking skills. To achieve mastery on this dimension, students must make connections between their results and other published studies and discuss the future directions of the research; in short, they must demonstrate an understanding of the bigger picture. The specific relationship between question 5 and inference is the strongest observed among all individual comparisons. Altogether, perhaps more than any other BioTAP dimension, this aspect of students’ writing provides a clear view of the role of students’ critical-thinking skills (particularly inference and, marginally, induction) in science reasoning.

While inference and discussion of implications emerge as particularly strongly related dimensions in this work, we note that the strongest contribution to “science reasoning in writing in biology,” as determined through exploratory factor analysis, is “argument for the significance of research” (BioTAP question 2, not question 5; Dowd et al. , 2016 ). Question 2 is not clearly related to critical-thinking skills. These findings are not contradictory, but rather suggest that the epistemological and disciplinary-specific aspects of science reasoning that emerge in writing through BioTAP are not completely aligned with aspects related to critical thinking. In other words, science reasoning in writing is not simply a proxy for those critical-thinking skills that play a role in science reasoning.

In a similar vein, the content-related, epistemological aspects of science reasoning, as well as the conventions associated with writing the undergraduate thesis (including feedback from peers and revision), may explain the lack of significant relationships between some science reasoning dimensions and some critical-thinking skills that might otherwise seem counterintuitive (e.g., BioTAP question 2, which relates to making an argument, and the critical-thinking skill of argument). It is possible that an individual’s critical-thinking skills may explain some variation in a particular BioTAP dimension, but other aspects of science reasoning and practice exert much stronger influence. Although these relationships do not emerge in our analyses, the lack of significant correlation does not mean that there is definitively no correlation. Correcting for multiple comparisons suppresses type 1 error at the expense of exacerbating type 2 error, which, combined with the limited sample size, constrains statistical power and makes weak relationships more difficult to detect. Ultimately, though, the relationships that do emerge highlight places where individuals’ distinct critical-thinking skills emerge most coherently in thesis assessment, which is why we are particularly interested in unpacking those relationships.

We recognize that, because only honors students submit theses at these institutions, this study sample is composed of a selective subset of the larger population of biology majors. Although this is an inherent limitation of focusing on thesis writing, links between our findings and the results of other studies (with different populations) suggest that the observed relationships may occur more broadly. The goal of improved science reasoning and critical thinking is shared among all biology majors, particularly those engaged in capstone research experiences. So while the implications of this work apply most directly to honors thesis writers, we provisionally suggest that further study of these implications could benefit all students.

There are several important implications of this study for science education practices. Students’ inference skills relate to the understanding and effective application of scientific content. The fact that we find no statistically significant relationships between BioTAP questions 6–9 and CCTST dimensions suggests that such mid- to lower-order elements of BioTAP ( Reynolds et al. , 2009 ), which tend to be more structural in nature, do not focus on aspects of the finished thesis that draw strongly on critical thinking. In keeping with prior analyses ( Reynolds and Thompson, 2011 ; Dowd et al. , 2016 ), these findings further reinforce the notion that disciplinary instructors, who are most capable of teaching and assessing scientific reasoning and perhaps least interested in the more mechanical aspects of writing, may nonetheless be best suited to effectively model and assess students’ writing.

The goal of the thesis writing course at both Duke University and the University of Minnesota is not merely to improve thesis scores but to move students’ writing into the category of mastery across BioTAP dimensions. Recognizing that students with differing critical-thinking skills (particularly inference) are more or less likely to achieve mastery in the undergraduate thesis (particularly in discussing implications [question 5]) is important for developing and testing targeted pedagogical interventions to improve learning outcomes for all students.

The competencies characterized by the Vision and Change in Undergraduate Biology Education Initiative provide a general framework for recognizing that science reasoning and critical-thinking skills play key roles in major learning outcomes of science education. Our findings highlight places where science reasoning–related competencies (like “understanding the process of science”) connect to critical-thinking skills and places where critical thinking–related competencies might be manifested in scientific products (such as the ability to discuss implications in scientific writing). We encourage broader efforts to build empirical connections between competencies and pedagogical practices to further improve science education.

One specific implication of this work for science education is to focus on providing opportunities for students to develop their critical-thinking skills (particularly inference). Of course, as this correlational study is not designed to test causality, we do not claim that enhancing students’ inference skills will improve science reasoning in writing. However, as prior work shows that science writing activities influence students’ inference skills ( Quitadamo and Kurtz, 2007 ; Quitadamo et al. , 2008 ), there is reason to test such a hypothesis. Nevertheless, the focus must extend beyond inference as an isolated skill; rather, it is important to relate inference to the foundations of the scientific method ( Miri et al. , 2007 ) in terms of the epistemological appreciation of the functions and coordination of evidence ( Kuhn and Dean, 2004 ; Zeineddin and Abd-El-Khalick, 2010 ; Ding et al. , 2016 ) and disciplinary paradigms of truth and justification ( Moshman, 2015 ).

Although this study is limited to the domain of biology at two institutions with a relatively small number of students, the findings represent a foundational step in the direction of achieving success with more integrated learning outcomes. Hopefully, it will spur greater interest in empirically grounding discussions of the constructs of scientific reasoning and critical-thinking skills.

This study contributes to the efforts to improve science education, for both majors and nonmajors, through an empirically driven analysis of the relationships between scientific reasoning reflected in the genre of thesis writing and critical-thinking skills. This work is rooted in the usefulness of BioTAP as a method 1) to facilitate communication and learning and 2) to assess disciplinary-specific and general dimensions of science reasoning. The findings support the important role of the critical-thinking skill of inference in scientific reasoning in writing, while also highlighting ways in which other aspects of science reasoning (epistemological considerations, writing conventions, etc.) are not significantly related to critical thinking. Future research into the impact of interventions focused on specific critical-thinking skills (i.e., inference) for improved science reasoning in writing will build on this work and its implications for science education.

Supplementary Material

Acknowledgments.

We acknowledge the contributions of Kelaine Haas and Alexander Motten to the implementation and collection of data. We also thank Mine Çetinkaya-Rundel for her insights regarding our statistical analyses. This research was funded by National Science Foundation award DUE-1525602.

  • American Association for the Advancement of Science. (2011). Vision and change in undergraduate biology education: A call to action. Washington, DC. Retrieved September 26, 2017, from https://visionandchange.org/files/2013/11/aaas-VISchange-web1113.pdf
  • August D. (2016). California Critical Thinking Skills Test user manual and resource guide. San Jose: Insight Assessment/California Academic Press.
  • Beyer C. H., Taylor E., Gillmore G. M. (2013). Inside the undergraduate teaching experience: The University of Washington’s growth in faculty teaching study. Albany, NY: SUNY Press.
  • Bissell A. N., Lemons P. P. (2006). A new method for assessing critical thinking in the classroom. BioScience, (1), 66–72. https://doi.org/10.1641/0006-3568(2006)056[0066:ANMFAC]2.0.CO;2
  • Blattner N. H., Frazier C. L. (2002). Developing a performance-based assessment of students’ critical thinking skills. Assessing Writing, (1), 47–64.
  • Clase K. L., Gundlach E., Pelaez N. J. (2010). Calibrated peer review for computer-assisted learning of biological research competencies. Biochemistry and Molecular Biology Education, (5), 290–295.
  • Condon W., Kelly-Riley D. (2004). Assessing and teaching what we value: The relationship between college-level writing and critical thinking abilities. Assessing Writing, (1), 56–75. https://doi.org/10.1016/j.asw.2004.01.003
  • Ding L., Wei X., Liu X. (2016). Variations in university students’ scientific reasoning skills across majors, years, and types of institutions. Research in Science Education, (5), 613–632. https://doi.org/10.1007/s11165-015-9473-y
  • Dowd J. E., Connolly M. P., Thompson R. J., Jr., Reynolds J. A. (2015a). Improved reasoning in undergraduate writing through structured workshops. Journal of Economic Education, (1), 14–27. https://doi.org/10.1080/00220485.2014.978924
  • Dowd J. E., Roy C. P., Thompson R. J., Jr., Reynolds J. A. (2015b). “On course” for supporting expanded participation and improving scientific reasoning in undergraduate thesis writing. Journal of Chemical Education, (1), 39–45. https://doi.org/10.1021/ed500298r
  • Dowd J. E., Thompson R. J., Jr., Reynolds J. A. (2016). Quantitative genre analysis of undergraduate theses: Uncovering different ways of writing and thinking in science disciplines. WAC Journal, 36–51.
  • Facione P. A. (1990). Critical thinking: A statement of expert consensus for purposes of educational assessment and instruction. Research findings and recommendations. Newark, DE: American Philosophical Association. Retrieved September 26, 2017, from https://philpapers.org/archive/FACCTA.pdf
  • Gerdeman R. D., Russell A. A., Worden K. J. (2007). Web-based student writing and reviewing in a large biology lecture course. Journal of College Science Teaching, (5), 46–52.
  • Greenhoot A. F., Semb G., Colombo J., Schreiber T. (2004). Prior beliefs and methodological concepts in scientific reasoning. Applied Cognitive Psychology, (2), 203–221. https://doi.org/10.1002/acp.959
  • Haaga D. A. F. (1993). Peer review of term papers in graduate psychology courses. Teaching of Psychology, (1), 28–32. https://doi.org/10.1207/s15328023top2001_5
  • Halonen J. S., Bosack T., Clay S., McCarthy M., Dunn D. S., Hill G. W., Whitlock K. (2003). A rubric for learning, teaching, and assessing scientific inquiry in psychology. Teaching of Psychology, (3), 196–208. https://doi.org/10.1207/S15328023TOP3003_01
  • Hand B., Keys C. W. (1999). Inquiry investigation. Science Teacher, (4), 27–29.
  • Holm S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, (2), 65–70.
  • Holyoak K. J., Morrison R. G. (2005). The Cambridge handbook of thinking and reasoning. New York: Cambridge University Press.
  • Insight Assessment. (2016a). California Critical Thinking Skills Test (CCTST). Retrieved September 26, 2017, from www.insightassessment.com/Products/Products-Summary/Critical-Thinking-Skills-Tests/California-Critical-Thinking-Skills-Test-CCTST
  • Insight Assessment. (2016b). Sample thinking skills questions. Retrieved September 26, 2017, from www.insightassessment.com/Resources/Teaching-Training-and-Learning-Tools/node_1487
  • Kelly G. J., Takao A. (2002). Epistemic levels in argument: An analysis of university oceanography students’ use of evidence in writing. Science Education, (3), 314–342. https://doi.org/10.1002/sce.10024
  • Kuhn D., Dean D., Jr. (2004). Connecting scientific reasoning and causal inference. Journal of Cognition and Development, (2), 261–288. https://doi.org/10.1207/s15327647jcd0502_5
  • Kuhn D., Iordanou K., Pease M., Wirkala C. (2008). Beyond control of variables: What needs to develop to achieve skilled scientific thinking? Cognitive Development, (4), 435–451. https://doi.org/10.1016/j.cogdev.2008.09.006
  • Lawson A. E. (2010). Basic inferences of scientific reasoning, argumentation, and discovery. Science Education, (2), 336–364. https://doi.org/10.1002/sce.20357
  • Meizlish D., LaVaque-Manty D., Silver N., Kaplan M. (2013). Think like/write like: Metacognitive strategies to foster students’ development as disciplinary thinkers and writers. In Thompson R. J. (Ed.), Changing the conversation about higher education (pp. 53–73). Lanham, MD: Rowman & Littlefield.
  • Miri B., David B.-C., Uri Z. (2007). Purposely teaching for the promotion of higher-order thinking skills: A case of critical thinking. Research in Science Education, (4), 353–369. https://doi.org/10.1007/s11165-006-9029-2
  • Moshman D. (2015). Epistemic cognition and development: The psychology of justification and truth. New York: Psychology Press.
  • National Research Council. (2000). How people learn: Brain, mind, experience, and school (Expanded ed.). Washington, DC: National Academies Press.
  • Pukkila P. J. (2004). Introducing student inquiry in large introductory genetics classes. Genetics, (1), 11–18. https://doi.org/10.1534/genetics.166.1.11
  • Quitadamo I. J., Faiola C. L., Johnson J. E., Kurtz M. J. (2008). Community-based inquiry improves critical thinking in general education biology. CBE—Life Sciences Education, (3), 327–337. https://doi.org/10.1187/cbe.07-11-0097
  • Quitadamo I. J., Kurtz M. J. (2007). Learning to improve: Using writing to increase critical thinking performance in general education biology. CBE—Life Sciences Education, (2), 140–154. https://doi.org/10.1187/cbe.06-11-0203
  • Reynolds J. A., Smith R., Moskovitz C., Sayle A. (2009). BioTAP: A systematic approach to teaching scientific writing and evaluating undergraduate theses. BioScience, (10), 896–903. https://doi.org/10.1525/bio.2009.59.10.11
  • Reynolds J. A., Thaiss C., Katkin W., Thompson R. J. (2012). Writing-to-learn in undergraduate science education: A community-based, conceptually driven approach. CBE—Life Sciences Education, (1), 17–25. https://doi.org/10.1187/cbe.11-08-0064
  • Reynolds J. A., Thompson R. J. (2011). Want to improve undergraduate thesis writing? Engage students and their faculty readers in scientific peer review. CBE—Life Sciences Education, (2), 209–215. https://doi.org/10.1187/cbe.10-10-0127
  • Rhemtulla M., Brosseau-Liard P. E., Savalei V. (2012). When can categorical variables be treated as continuous? A comparison of robust continuous and categorical SEM estimation methods under suboptimal conditions. Psychological Methods, (3), 354–373. https://doi.org/10.1037/a0029315
  • Stephenson N. S., Sadler-McKnight N. P. (2016). Developing critical thinking skills using the science writing heuristic in the chemistry laboratory. Chemistry Education Research and Practice, (1), 72–79. https://doi.org/10.1039/C5RP00102A
  • Tariq V. N., Stefani L. A. J., Butcher A. C., Heylings D. J. A. (1998). Developing a new approach to the assessment of project work. Assessment and Evaluation in Higher Education, (3), 221–240. https://doi.org/10.1080/0260293980230301
  • Timmerman B. E. C., Strickland D. C., Johnson R. L., Payne J. R. (2011). Development of a “universal” rubric for assessing undergraduates’ scientific reasoning skills using scientific writing. Assessment and Evaluation in Higher Education, (5), 509–547. https://doi.org/10.1080/02602930903540991
  • Topping K. J., Smith E. F., Swanson I., Elliot A. (2000). Formative peer assessment of academic writing between postgraduate students. Assessment and Evaluation in Higher Education, (2), 149–169. https://doi.org/10.1080/713611428
  • Willison J., O’Regan K. (2007). Commonly known, commonly not known, totally unknown: A framework for students becoming researchers. Higher Education Research and Development, (4), 393–409. https://doi.org/10.1080/07294360701658609
  • Woodin T., Carter V. C., Fletcher L. (2010). Vision and Change in Biology Undergraduate Education: A Call for Action—Initial responses. CBE—Life Sciences Education, (2), 71–73. https://doi.org/10.1187/cbe.10-03-0044
  • Zeineddin A., Abd-El-Khalick F. (2010). Scientific reasoning and epistemological commitments: Coordination of theory and evidence among college science students. Journal of Research in Science Teaching, (9), 1064–1093. https://doi.org/10.1002/tea.20368
  • Zimmerman C. (2000). The development of scientific reasoning skills. Developmental Review, (1), 99–149. https://doi.org/10.1006/drev.1999.0497
  • Zimmerman C. (2007). The development of scientific thinking skills in elementary and middle school. Developmental Review, (2), 172–223. https://doi.org/10.1016/j.dr.2006.12.001

Scientific Reasoning

Scientific reasoning is the foundation supporting the entire structure of logic underpinning scientific research.


It is impossible to explore the entire process in any detail here, because its exact nature varies between scientific disciplines.

Despite these differences, four basic foundations underlie the idea, pulling together the cycle of scientific reasoning.


Observation

Most research has real-world observation as its initial foundation. Looking at natural phenomena is what leads a researcher to question what is going on and begin to formulate scientific questions and hypotheses.

Any theory, and any prediction derived from it, will need to be tested against observable data.


Theories and Hypotheses

This is where the scientist proposes the possible reasons behind the phenomenon, the laws of nature governing the behavior.

Scientific research uses various scientific reasoning processes to arrive at a viable research problem and hypothesis. A theory is generally broken down into individual hypotheses, or problems, and tested gradually.

Predictions

A good researcher has to predict the results of their research, stating their idea about the outcome of the experiment, often in the form of an alternative hypothesis.

Scientists usually test the predictions of a theory or hypothesis, rather than the theory itself. If the predictions are found to be incorrect, then the theory is incorrect, or in need of refinement.

Data collection is the applied part of science, and the results of real-world observations are tested against the predictions.

If the observations match the predictions, the theory is strengthened. If not, the theory needs to be changed. A range of statistical tests is used to test predictions, although many observation-based scientific disciplines cannot use statistics.

The Virtuous Cycle

This process is cyclical: as experimental results support or refute hypotheses, those conclusions are applied back to real-world observations, and future scientists can build upon these observations to generate further theories.

Differences

Whilst the scientific reasoning process is a solid foundation for the scientific method, there are variations between disciplines.

For example, social science, with its reliance on case studies, tends to emphasize the observation phase, using this to define research problems and questions.

Physical sciences, on the other hand, tend to start at the theory stage, building on previous studies, and observation is probably the least important stage of the cycle.

Many theoretical physicists spend their entire career building theories, without leaving their office. Observation is, however, always used as the final proof.


Martyn Shuttleworth (May 7, 2008). Scientific Reasoning. Retrieved Apr 23, 2024 from Explorable.com: https://explorable.com/scientific-reasoning

The text in the Explorable.com article above is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) license; it may be copied, shared, and adapted with appropriate credit and a link back to the source page.


A new framework for teaching scientific reasoning to students from application-oriented sciences

  • Paper in General Philosophy of Science
  • Open access
  • Published: 02 June 2021
  • Volume 11, article number 56 (2021)


  • Krist Vaesen (ORCID: orcid.org/0000-0002-7496-7463)
  • Wybo Houkes


About three decades ago, the late Ronald Giere introduced a new framework for teaching scientific reasoning to science students. Giere’s framework presents a model-based alternative to the traditional statement approach—in which scientific inferences are reconstructed as explicit arguments, composed of (single-sentence) premises and a conclusion. Subsequent research in science education has shown that model-based approaches are particularly effective in teaching science students how to understand and evaluate scientific reasoning. One limitation of Giere’s framework, however, is that it covers only one type of scientific reasoning, namely the reasoning deployed in hypothesis-driven research practices. In this paper, we describe an extension of the framework. More specifically, we develop an additional model-based scheme that captures reasoning in application-oriented practices (which are very well represented in contemporary science). Our own teaching experience suggests that this extended framework is able to engage a wider audience than Giere’s original. With an eye on going beyond such anecdotal evidence, we invite our readers to test out the framework in their own teaching.


1 Introduction

The late Ronald Giere wrote a widely used textbook, entitled Understanding Scientific Reasoning , meant to introduce lower-division students to scientific reasoning. Throughout its four editions, the book was designed to impart to students the ability to understand and evaluate bits of scientific reasoning, as instantiated in popular press articles, semi-professional technical reports and scholarly publications. Given this aim, the book avoids in-depth historical reflection on the philosophy of science, or on the evaluative framework it adopts. Rather, in every edition, Giere simply introduces his framework, and then moves on to how it can be used.

Giere’s framework changed over time, though. In the first ( 1979 ) and second (1984) editions of the book, it fits the traditional statement approach, which Giere traces back to Mill’s A System of Logic ( 1843 ). This was in line, as he reported afterwards (Giere, 2001 ), with what he took to be the approach in the vast majority of textbooks in logic and reasoning. The statement approach assumes that

“the evaluation of any particular bit of reasoning is done by first reconstructing that reasoning as an explicit argument , with premises and a conclusion, and then examining the reconstructed argument to see if it exhibits the characteristic form of a good argument, whether deductive or inductive ” (Giere, 2001 , p. 21, italics added).

The basic aim of the statement approach is to determine whether one or more statements or linguistic expressions (viz., the conclusion of an explicit argument) are true or false or, at least, to determine whether it is reasonable to take the statements to be true or false on the basis of other statements. In the third (1991) and fourth (2005) editions, Giere abandons this approach in favour of a model-based approach. This reflects a then growing concern among philosophers of science that modern scientific claims simply do not translate well into statements, leading to ill-fitting or impoverished reconstructions. For instance, the behavior of complex systems such as coupled harmonic oscillators or of randomly breeding populations of predators and prey is typically represented by mathematical models; brain processes or processes within organizations are commonly represented by diagrams; and the study of turbulence and energy systems tends to be informed by the study of scale models. Even if one were to succeed in turning these different types of models into sets of linguistic expressions, it would, according to Giere, be pointless to assess the truth of such expressions: such expressions are true, by definition, of the models, but not of the world. Also, Giere contends that, since models are abstract objects, the relationship between model and world is not one of truth, but rather one of fit. Scientists are primarily engaged in assessing the fit between models and target systems in the real world, i.e., in assessing whether their models are sufficiently similar to target systems to study the behavior of the latter by means of the former.

Giere indicates that his model-based approach better resonates with students. This matches our own experience in teaching scientific reasoning. There is also more systematic evidence for the advantages of model-based approaches. For one, there is widespread consensus among researchers that model-based reasoning more accurately describes actual scientific cognition and practice than argumentative reasoning (Clement, 2008 ; Gilbert et al., 1998 ; Halloun, 2004 ; Justi & Gilbert, 1999 ; Passmore & Stewart, 2002 ; Taylor et al., 2003 ). Cognitive scientists even have proposed that human cognition in general (not just scientific cognition) is best described in terms of mental modelling (Johnson-Laird, 1983 , 2006 ; Nersessian, 2002 , 2008 ). Furthermore, in their typical science courses, students are introduced to theories by means of models rather than arguments. Model-based approaches, thus, tap into customary modes of thinking among science students and, accordingly, appear effective in science instruction (Böttcher & Meisert, 2011 ; Gobert & Clement, 1999 ; Gobert & Pallant, 2004 ; Gobert, 2005 ; Matthews, 2007 ). Finally, from a more evaluative perspective, statement approaches struggle to accommodate all the information that is relevant to evaluating a piece of scientific reasoning; model-based assessments fare much better in comparison. The principal object of analysis in a statement approach is a hypothesis, which is typically expressed in a single statement. In Giere’s framework, in contrast, the object of analysis is a model. Associated with a model is not just one or more hypotheses, but also crucial background information, such as auxiliary assumptions (i.e., assumptions that are assumed to hold but that are secondary to the hypotheses under investigation) and boundary conditions (i.e., the conditions that need to be satisfied for a proper empirical test of the model and its hypotheses). Additionally, in Giere’s framework a model is explicitly evaluated relative to competing models. Here, Giere does not distinguish between different types of models: he presents a framework that is meant to apply to mathematical models, scale models, and diagrams alike, focusing on their shared role in scientific reasoning.

In Section  2 , we will discuss Giere’s model-based framework in more detail, focusing on its role as an instrument to instruct students how to go about evaluating instances of scientific reasoning. In doing so, we will identify a serious limitation: the framework captures only one mode of reasoning, namely the reasoning employed in hypothesis-driven research. As teachers at a technical university, we have experienced that this makes Giere’s framework unsuitable for our particular audience. In the research practices which most of our students are educated in, hypothesis-testing is typically embedded in application-oriented epistemic activities. To capture this embedding and thus improve the usefulness to our audience, we developed an extension of Giere’s framework. Section  3 introduces this extended model-based framework for assessing application-oriented research. Section  4 discusses the wider applicability of our extended framework. Since much of contemporary science is application-oriented rather than hypothesis-driven, we submit that our framework will also benefit teachers that work outside the confines of a technical university.

2 Giere’s framework

In Giere’s model-based alternative to the reconstructive statement approach, the primary purpose of observation, experimentation and scientific reasoning is to assess the fit between models and real-world target systems. Giere developed a representation of this decision process (see Fig.  1 , and the caption that accompanies it) to aid students in evaluating (popular) scientific reports; and here, we will use this representation together with an example to outline his framework.

Figure 1. Steps in analysing hypothesis-driven approaches. Step 1—Real world: identification of the target system, i.e., the aspect of the world that is to be captured by the model. Step 2—Model: development of a model, which is to be assessed for its fit with the target system. Step 3—Prediction: deduction of predictions from the model. Step 4—Data: data collection from the target system, in order to establish the (non-)agreement between data and predictions. Step 5/6—Negative/Positive evidence: evaluation of model-target fit, based on the (non-)agreement between data and prediction

Consider the following example, which is accessible for a broad audience. Epidemiologists might, as per Step 1 , identify an aspect of a real-world system that they want to better understand, for instance, the development over time of COVID-19 infections, recoveries and deaths. In Step 2 , they develop an epidemiological model that they hope adequately captures these trends. Figure  2 presents the graphical and mathematical expressions of one such model. The graph shows various independent variables, and their interactions, and how these drive the dependent variables, viz., number of individuals susceptible to infection (S), number of infected individuals (I), number of recovered individuals (R), number of deaths (D). The mathematical expressions summarize in which ways S, I, R and D are dependent on the independent variables.

Figure 2. Epidemiological model of COVID-19 (taken from Vega, 2020 ). The graph shows the (interactions among) independent variables, and how they affect the dependent variables (“Susceptible”, “Infected”, “Recovered” and “Deaths”). The equations (top-right) express these interdependencies in a mathematical form

In addition to this graphical representation and mathematical expression, the model comprises auxiliary assumptions and boundary conditions. As an example of the former, the model assumes that the strain that the pandemic puts on hospitals (Hospital Capacity Strain Index) is determined by the number of serious infection cases and the hospital capacity (expressed in number of beds), where the latter is held constant. The model thus ignores other factors that arguably affect the strain on hospitals, including lower hospital capacity due to infections among or strain on hospital personnel, the duration and frequency of pandemic waves, additional care capacity through governmental or private investment, the occurrence of an additional pandemic or epidemic, and so forth. Another auxiliary assumption is that the model population is unstructured (e.g., in terms of contact clusters, age cohorts, and so forth). As to the model’s boundary conditions, the model will fit the target system only to the extent that its initial conditions reflect the target system’s initial conditions (e.g., that population sizes in both the model and the target system are 100,000, and that the fractions of the populations susceptible to infection are 13%).
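
To give students a concrete sense of how such a compartment model behaves, the sketch below integrates a generic susceptible-infected-recovered-deaths (SIRD) system with SciPy. It is a deliberately simplified stand-in, not the model of Vega (2020): the equations, parameter values, and initial conditions are illustrative assumptions only.

```python
import numpy as np
from scipy.integrate import solve_ivp

def sird(t, y, beta, gamma, mu):
    """Generic SIRD compartment model: susceptible, infected, recovered, deaths."""
    S, I, R, D = y
    N = S + I + R + D
    dS = -beta * S * I / N
    dI = beta * S * I / N - gamma * I - mu * I
    dR = gamma * I
    dD = mu * I
    return [dS, dI, dR, dD]

# Illustrative initial conditions and rates (not taken from Vega, 2020)
y0 = [99_990, 10, 0, 0]        # population of 100,000 with 10 initial infections
params = (0.3, 0.1, 0.01)      # beta (transmission), gamma (recovery), mu (death)

sol = solve_ivp(sird, (0, 180), y0, args=params, t_eval=np.linspace(0, 180, 181))
print(f"Peak number of infected individuals: {sol.y[1].max():.0f}")
```

Lowering beta in this sketch plays the role of the lockdown scenarios of the sort plotted in Fig. 3.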

Ultimately, the epidemiologists wish to assess the fit between the real-world target (as identified in Step 1 ) and the model (as developed in Step 2 ). In order to do so, they, in Step 3 , derive testable hypotheses or predictions from the model. A couple of predictions are presented in Fig.  3 .

Figure 3. Number of infections over time, as predicted by the model of Vega (2020). The green line represents infections in a scenario without lockdown; the red and blue lines capture what would happen under different lockdown scenarios.

Subsequently, in Step 4 , the predictions are empirically tested, using data from the real-world target system. Here the epidemiologists might use as evidence the infection records of the first wave of COVID-19. Finally, in Steps 5 and 6 , the agreement between this evidence and predictions (of Fig.  3 ) informs the epidemiologists’ evaluation of the fit between the model and real-world COVID-19 infection patterns. Negative evidence suggests a poor fit (Step 5); positive evidence, in the absence of plausible competing models, suggests a good fit (Step 6).
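The verdict in Steps 5 and 6 turns on whether the observed data agree with the model’s predictions. One minimal way to operationalize “agreement”—purely illustrative, and only one of many possible choices—is to compare predicted and observed infection counts against a tolerance on the mean relative error. The infection figures below are hypothetical placeholders, not real first-wave data.

```python
def agrees(predicted, observed, tolerance=0.2):
    """Crude agreement check: mean relative error below a chosen tolerance.

    The tolerance is an illustrative threshold; in practice the acceptable
    deviation depends on data quality and on the plausibility of rival models.
    """
    errors = [abs(p - o) / max(o, 1) for p, o in zip(predicted, observed)]
    return sum(errors) / len(errors) < tolerance

# Hypothetical weekly infection counts, for illustration only.
predicted_infections = [120, 480, 1900, 5200, 7400, 6100, 3900]
observed_infections  = [100, 430, 2100, 4800, 8100, 6800, 3500]

if agrees(predicted_infections, observed_infections):
    print("Positive evidence (Step 6): data agree with the predictions.")
else:
    print("Negative evidence (Step 5): data disagree with the predictions.")
```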

Having reconstructed the decision-making process of the epidemiologists along these lines, students are in a good position to evaluate it. They may formulate critical questions concerning the deduction of predictions from the model, and the inductive inferences in Steps 5 and 6. Regarding the former, students should evaluate whether the predictions indeed follow from the model. Is it really the case that the prediction should hold, given the model’s auxiliary assumptions and boundary conditions? And is the prediction sufficiently surprising, precise and singular? As to the inductive inferences, to what extent are the epidemiologists’ conclusions derived in accordance with the decision tree of Steps 5 and 6? Are the epidemiologists concluding too much or too little? Do they sufficiently acknowledge remaining uncertainties, resulting from, e.g., observational biases, low numbers of observations, deviations of observations from predictions, and the plausibility of competing models?

Giere’s and our own experience in using this framework confirms what research in science education has suggested about model-based approaches in general (Böttcher & Meisert, 2011; Gobert & Clement, 1999; Gobert & Pallant, 2004; Gobert, 2005; Matthews, 2007): many students find it relatively easy to internalize model-based frameworks (such as Giere’s) for evaluating reports of scientific findings. But our own experience in teaching scientific reasoning in the specific context of a technical university indicates that Giere’s framework does not cater to everyone: for students from some programs (e.g., chemistry), internalizing the framework appears easier than for students from other programs (e.g., mechanical engineering); and in explaining the framework, we find it easier to evaluate relevant examples from some fields of inquiry than from others.

This differentiation in comprehensibility of the framework brings to mind a conventional distinction between ‘fundamental’ and ‘applied’ fields of inquiry, where it is maintained that the former produce most of the theoretical knowledge that is utilized for practical problem-solving in the latter (e.g., Bunge, 1966 ). Since pitching this distinction at the level of fields or disciplines seems unsustainable (following criticisms reviewed in, e.g., Kant & Kerr, 2019 and Houkes & Meijers, 2021 ), we opt for a differentiation at the level of research practices instead.

This differentiation builds on Chang’s ( 2011 , 2012 ) practice-oriented view, which distinguishes several hierarchically ordered levels of analysis: mental and physical operations , which constitute epistemic activities that in turn make up a research practice . Here, research practices are characterized by their aim, while epistemic activities are rule-governed, routinized sets of operations that contribute to knowledge production in light of these aims. Lavoisier’s revolutionary way of doing chemistry, for example, can be understood as an innovative practice in its time, constituted by activities such as collecting gases, classifying compounds, and measuring weights, which comprise various operations.

In line with Chang’s practice-oriented analysis, hypothesis testing may be taken as an epistemic activity that is more central to some research practices—such as the episode from epidemiology that was reconstructed earlier in this section—than to others. Some practices are aimed at generating precisely the theoretical knowledge that may be gained through systematic hypothesis-testing; in other practices, however, the results of hypothesis testing are instrumental to more encompassing aims. Giere’s original framework may fit the mental model of students who have been primarily exposed to or educated in the former type of practices; but it is at best an incomplete fit for the latter, more encompassing type of practice. This diagnosis suggests a natural solution: to capture these more encompassing practices, we need to extend Giere’s framework.

3 A framework for assessing application-oriented research

3.1 Reconstruction of application-oriented scientific reasoning

Research in the engineering sciences has received only limited attention from philosophers of science. Conventionally, it is understood as research that applies scientific methods and/or theories to the attainment of practical goals (Bunge, 1966 ). Consequently, also in the self-understanding and -presentation of many practitioners, these fields of inquiry go by labels such as ‘instrumental’ or ‘applied’ research, or are characterized in terms of ‘design’ rather than research. However, in-depth analyses of historical and contemporary episodes (e.g., Constant, 1999 ; Kroes, 1992 ; Vincenti, 1990 ) reveal that they involve more than merely deriving solutions to specific practical problems from fundamental theories. In particular, they can be understood as knowledge-producing activities in their own right. Some have claimed that application-oriented research involves a special type of ‘designerly’ or ‘engineering’ knowledge (as in, e.g., Cross, 1982 ). This knowledge has been characterized as ‘prescriptive’ (Meijers & Kroes, 2013 ), as consisting of ‘technical norms’ (Niiniluoto, 1993 ), or ‘technological rules’ (Houkes & Meijers, 2021 )—but it seems fair to say that these characterizations require further specification, in their content, scope and impact on differentiating application-oriented research (see, e.g., Kant & Kerr, 2019 ).

For our present purposes, we only need to assume that practices in the engineering sciences, in which most of our students are trained, involve epistemic activities such as hypothesis testing and are therefore genuinely knowledge-producing. Furthermore, we submit that the knowledge thus produced often has a wider scope than specific, ‘local’ practical problems (e.g., Adam et al., 2006; Wilholt, 2006), even though these practices might be strongly geared towards solving such problems. Given this, we label such practices ‘application-oriented’. We submit that fields of inquiry such as mechanical engineering or fusion research frequently involve application-oriented practices, without thereby expressing commitment to any of the characterizations mentioned above. Still, the framework presented below is compatible with these characterizations: it can be supplemented with (suitably developed) analyses of, e.g., technical norms or prescriptive knowledge in application-oriented practices.

Figure 4 represents our application-oriented framework. Whereas Giere’s framework starts with a real-world phenomenon, which the researcher then wishes to capture with a suitable model, application-oriented approaches typically start with a model, which serves as a stand-in for a not-yet-existent target system (e.g., a software package, a novel protein). Put differently, whereas in hypothesis-driven approaches models are supposed to describe the world and to deliver theoretical knowledge, in application-oriented research, models are intended to describe how the world might or should be and to yield practical knowledge. Our general assumption is that, in the epistemic activities captured by the application-oriented framework, researchers do not, without prior study, start building their real-world target system (or artifact for short). Rather, they first develop a model of the artifact (e.g., a blueprint, scale model, or computer model) and study the behavior of the model (Model Phase in Fig. 4). Only if the model behaves as it should does the researcher take the next step of actually producing and testing (Artifact Phase in Fig. 4) the artifact for which the model served as a stand-in.

Figure 4. Steps in analysing application-oriented research. Problem Definition Phase: Step 0—Design specs: definition of the design specifications the artifact has to meet. Model Phase: Step 1—Model: development of a model that acts as a stand-in for the artifact to be produced. Step 2—Prediction: derivation of predictions from the model, where predictions align with the design specs identified in Step 0. Step 3—Model data: collection of model data, and assessment of the (non-)agreement between model data and predictions. In case of agreement, and in case of reasonable analogy between model and artifact, the Artifact Phase starts: Step 4—Artifact: development of the artifact based on the model. Step 5—Prediction: deduction of predictions from the artifact, where predictions are identical to the design specs of Step 0. Step 6—Artifact data: collection of artifact data, and assessment of the (non-)agreement between artifact data and predictions/design specs. The “New” symbols refer to procedures that are not shared with hypothesis-driven approaches.

As can be seen in Fig.  4 , application-oriented research in fact involves a phase prior to model-building, denoted by “Problem definition phase”. In this phase ( Step 0 ), the design specs are determined, i.e., the properties or dispositions that the artifact ultimately ought to exhibit (e.g., the intended cost, functionalities and efficiency of a software program; the intended structure of a protein).

The purpose of the Model Phase is to develop one or more models that meet the researcher’s predefined specs. To the extent that they do, and to the extent that the models are an adequate analogue for the artifact yet to be built (see Section 3.2), the researcher moves to the Artifact Phase. Here one goes through the same building and testing cycle as in the Model Phase, but this time the object of analysis is the artifact rather than the model. Frameworks in design methodology, such as the ‘basic design cycle’ (Roozenburg & Eekels, 1995) and the ‘Function-Behavior-Structure’ model (Gero, 1990), also represent an iterative process of determining and satisfying design specs, but without bringing out the role of model-building and thus also without explicitly distinguishing the Model Phase and the Artifact Phase.

Each of these cycles bears a strong similarity to the cycle in Giere’s framework. In the Model Phase, a model is developed (Step 1) from which various predictions are derived (Step 2). Arguably, the most salient predictions are those that pertain to the design specs identified in Step 0, i.e., predictions concerning the extent to which the model will satisfy these specs. In order to assess this, one collects relevant data from the model (Step 3), and evaluates whether the data agree with the predictions (i.e., the design specs). If they do not, one might reiterate the cycle; if they do, the artifact is built based on the model (Step 4).

At Step 4, one enters the Artifact Phase, characterized by a similar cycle: the artifact is produced (Step 4); one formulates predictions about the artifact, viz., whether it exhibits the desired design specs (Step 5); and one collects data that allow one to test these predictions (Step 6). In case the data agree with the design specs, the artifact is accepted; otherwise, it is adjusted or Steps 1–5 are reiterated. Note that the design specs of the model (Step 2) and the artifact (Step 5) might be quantitatively or qualitatively different. While the latter might simply be taken from Step 0, the former should fit the specific context of the model world.
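The two nested build-and-test cycles of Fig. 4 can be summarized as a single control flow. The sketch below is only our own schematic rendering of the framework: every argument (building the model, collecting data, judging the analogy, and so on) is a hypothetical placeholder callable that a concrete research practice would have to supply.

```python
def application_oriented_cycle(specs_satisfied, build_model, collect_model_data,
                               analogy_is_adequate, build_artifact,
                               collect_artifact_data, max_iterations=10):
    """Schematic control flow for the framework of Fig. 4; all arguments are
    hypothetical placeholder callables supplied by the research practice."""
    for _ in range(max_iterations):
        # Model Phase (Steps 1-3): build a model and test it against the specs.
        model = build_model()                            # Step 1
        model_data = collect_model_data(model)           # Steps 2-3: predict and measure
        if not specs_satisfied(model_data):              # compare data with design specs
            continue                                     # revise the model, reiterate
        # Analogical inference: is the model an adequate stand-in for the artifact?
        if not analogy_is_adequate(model):
            continue
        # Artifact Phase (Steps 4-6): produce the artifact and test it.
        artifact = build_artifact(model)                 # Step 4
        artifact_data = collect_artifact_data(artifact)  # Steps 5-6
        if specs_satisfied(artifact_data):
            return artifact                              # design specs met: accept
    return None                                          # no acceptable artifact within budget
```

The explicit check between the Model Phase and the Artifact Phase marks the analogical inference discussed in Section 3.2.3; it is what distinguishes this cycle from a single Giere-style loop.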

To illustrate the application-oriented framework, consider another example from the biomedical sciences, one that also pertains to COVID-19. Let us assume that a researcher’s task is to design a novel protein that is able to bind to SARS-CoV-2. Given the known structure of the binding configuration of the virus, one can define, in the Problem Definition Phase, the structure that the new protein ought to have and the types of (energetic) conditions under which the protein needs to remain stable (Step 0). During the Model Phase, in Step 1, a computer model of a candidate protein is developed (see Fig. 5). Next, in Step 2, some testable predictions are derived concerning the candidate protein’s structure and stability. Tests of its structure and stability rely on model data (Step 3). Regarding stability, for instance, the researcher (or rather the computer) needs to calculate the protein’s thermodynamic free energy; a configuration with low thermodynamic free energy is more likely to effectively exhibit the requisite structure than a configuration with high thermodynamic free energy.

Figure 5. A computer model for designing new proteins.

In case there is agreement between model data and model predictions, and the analogy between model and artifact (the real-world protein) is adequate (see below), the researcher moves to the Artifact Phase and actually produces the protein (Step 4), typically by means of gene-manipulation techniques. Steps 5 and 6 then test the real-world protein’s compliance with the design specs (e.g., structure, stability under real-life conditions).
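In this example, Step 3 boils down to checking whether the model data meet the design specs fixed in Step 0. A toy version of such a check is sketched below; the spec values, the free-energy figure and the motif names are all invented for illustration, and a real free-energy calculation would of course come from a molecular-modelling package rather than from a few lines of script.

```python
# Toy check of model data (Step 3) against design specs (Step 0).
# All numbers and names are invented for illustration only.

design_specs = {
    "max_free_energy_kcal_per_mol": -5.0,    # stability threshold (assumed)
    "required_motif": "RBD-binding",         # structural requirement (assumed)
}

model_data = {
    "free_energy_kcal_per_mol": -7.2,        # value assumed to come from the model
    "motifs": {"RBD-binding", "helix-A"},
}

stable = (model_data["free_energy_kcal_per_mol"]
          <= design_specs["max_free_energy_kcal_per_mol"])
structure_ok = design_specs["required_motif"] in model_data["motifs"]

if stable and structure_ok:
    print("Model meets the specs: consider moving to the Artifact Phase (Step 4).")
else:
    print("Model fails the specs: revise the model (back to Step 1).")
```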

3.2 Evaluation of application-oriented scientific reasoning

Part of the value of a reconstruction along the lines of Fig.  4 lies in the critical questions to which it gives rise. It does so at four levels: the Problem Definition Phase , the Model Phase , the Artifact Phase , and at the level of the analogical inference connecting these last two phases. The evaluation of the Model Phase and the Artifact Phase are virtually identical to the evaluation involved in Giere’s hypothesis-driven framework; genuinely new evaluative steps (as indicated in Fig.  4 ) pertain to the Problem Definition Phase and the analogical inference that connects the Model and the Artifact Phases .

3.2.1 Problem definition phase (new evaluative step)

An application-oriented approach might fail as an epistemic activity well before any model or artifact is built. There are plenty of frameworks that may be used to judge the rationality of the researcher’s design specs. The researcher’s intended design specs might be—in terms of the well-known SMART framework—insufficiently Specific, Measurable, Acceptable, Realistic or Time-related. Alternatively, students could assess the design specs along the lines of the evaluative framework of Edvardsson and Hansson (2005). This framework resembles the SMART framework, but differs in some of its details. According to it, design specs are rational, i.e., achievement-inducing, just in case they meet four criteria: precision, evaluability, approachability and motivity. Precision and evaluability are very similar to, respectively, the SMART criteria Specific and Measurable; and approachability is a combination of SMART’s Realistic and Time-related. Motivity, finally, refers to the degree to which a goal/design spec induces commitment in those involved in reaching it. Decision theory and the work of Millgram and Thagard (1996) suggest another evaluative criterion: the (in)coherence among design specs. Teachers might find still other frameworks useful; but in any case, the merit of our proposed teaching approach is that it forces students to reflect on the rationality of the problem definition phase.
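Teachers who want students to apply such criteria systematically could hand them out as a simple checklist. The sketch below encodes the four Edvardsson and Hansson (2005) criteria as yes/no questions; the criterion names come from the text above, but phrasing them as checklist questions and scoring them this way is our own illustrative device, not the authors’ method.

```python
# Checklist for the rationality of a design spec, loosely following the four
# criteria of Edvardsson & Hansson (2005) as summarized above. The question
# wording is an illustrative rendering of ours, not the authors' own.

CRITERIA = {
    "precision":       "Is the spec stated precisely rather than vaguely?",
    "evaluability":    "Can achievement of the spec be measured or otherwise assessed?",
    "approachability": "Is the spec realistically attainable within the available time?",
    "motivity":        "Does the spec induce commitment in those who must realize it?",
}

def evaluate_spec(spec, answers):
    """Print which criteria a design spec fails, given yes/no answers."""
    failed = [name for name in CRITERIA if not answers.get(name, False)]
    if failed:
        print(f"'{spec}' is questionable on: {', '.join(failed)}")
    else:
        print(f"'{spec}' passes all four criteria.")

# Example use with made-up answers for a made-up spec.
evaluate_spec(
    "The protein must remain stable at body temperature",
    {"precision": True, "evaluability": True, "approachability": True, "motivity": False},
)
```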

3.2.2 Model phase

Given the strong parallel between Giere’s framework and our Model Phase, the latter can largely be evaluated according to Giere’s evaluation principles. Students should first determine whether the prediction of Step 2 indeed reasonably follows from the model. Is it feasible for the model to meet the design specs, given its auxiliary assumptions and boundary conditions? For instance, the researcher in the example might choose to redesign an existing protein rather than to model a protein from scratch. Accordingly, is it reasonable to think that the existing protein, given its structure and other properties (auxiliary assumptions), can ever be redesigned in such a way that it will behave as desired? Further, the prediction has to be assessed in terms of surprisingness, i.e., the degree to which the prediction goes beyond common sense; precision, i.e., the degree to which the prediction is specific rather than vague; and singularity, i.e., the degree to which the prediction is not a conjunction of predictions.

Next, students are to evaluate the (lack of) agreement between the predicted design specs (Step 2) and the data from Step 3. Such evaluation involves assessing the quality of the data (e.g., number of observations, (in)variance of data across different conditions, deviations of data from predicted values), and informs the decision whether to build a new model (in case of non-agreement) or to move to the Artifact Phase (in case of agreement). The latter decision is also informed by analogical reasoning.

3.2.3 Analogical reasoning (new evaluative step)

The model only forms a proper basis for the development of the artifact if the two are sufficiently similar in relevant respects. This gives rise to assessing an analogical inference of the following form: one observes that, in virtue of the model’s properties a, b, c, the model meets design spec x; accordingly, an artifact that, analogously to the model, has properties a, b, c will probably also meet design spec x.

The strength of this inference depends on the extent and relevance of the similarities and dissimilarities between model and artifact. Students should first identify all relevant similarities and dissimilarities, where the criteria of relevance are set by the design specs. For instance, in order to justify the translation from model to real-world protein, a comparison between the surrounding environment of the model protein and the surrounding environment of the real-world protein is clearly relevant. Similarity in color, in contrast, says nothing about the real-world protein’s stability. Furthermore, students must assess the degree and number of these relevant similarities, and do the same for the relevant dissimilarities.

Finally, students need to identify other, independent models (e.g., scale model, other computer models) of the artifact to be produced, and assess the relevant (dis)similarities between these models and the artifact. It would strengthen the analogical inference if such existing models point in the same direction as the model under study. Conversely, their confidence in the analogical inference should decrease when other models that, in the relevant ways, are similar to the artifact do not satisfy the design specs.
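One way to make this evaluation concrete for students is to tabulate the relevant similarities and dissimilarities and weigh them. The scoring scheme below is a didactic invention of our own rather than a method from the literature: the property names, weights and acceptance threshold are placeholders chosen only to show how relevance-weighted similarity can be made explicit.

```python
# Toy scoring of the analogical inference from model to artifact.
# Properties, relevance weights, and the acceptance threshold are illustrative only.

relevant_properties = {
    # property: (shared between model and artifact?, relevance weight 0-1)
    "aqueous environment":     (True,  0.9),
    "temperature range":       (True,  0.8),
    "solvent ionic strength":  (False, 0.7),
    "colour of the rendering": (True,  0.0),   # irrelevant to the design specs
}

score = sum(weight for shared, weight in relevant_properties.values() if shared)
max_score = sum(weight for _, weight in relevant_properties.values())

print(f"Relevance-weighted similarity: {score:.1f} out of {max_score:.1f}")
if score / max_score < 0.75:   # arbitrary threshold, for illustration
    print("Analogical inference looks weak: reconsider before the Artifact Phase.")
else:
    print("Analogical inference looks acceptable.")
```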

3.2.4 Artifact phase

Assessment of the Artifact Phase largely follows the procedure described above for the Model Phase. Only the goals of the two phases differ. Whereas the goal of the Model Phase is to produce information that is relevant for further steps in the research process, the final step of the Artifact Phase ends the entire exercise. Ideally, this phase results in an artifact that meets the design specs identified in Step 0.

4 Conclusion and discussion

We familiarize all our science students with both frameworks, for the simple reason that students typically encounter both types of research in their curriculum and later professional career. After all, parts of hypothesis-driven research are in fact application-oriented (e.g., the design of experiments and equipment, the tailoring of computer code), and parts of application-oriented research are hypothesis-driven (e.g., the incorporation of hypothesis-driven theories into the design exercise). Further, in many disciplines both approaches peacefully coexist. As our examples of COVID-19 research show, the biomedical sciences comprise different research practices and corresponding epistemic activities; some of these are more hypothesis-driven, others more application-oriented.

Teaching the two frameworks together is also very efficient. There are a number of elements that recur in both approaches; learning about one approach facilitates learning about the other. Given such similarities, it is also easier for students to see and appreciate the crucial differences between the frameworks. These features are a great plus. Standard reconstructions of the research processes in the sciences and the engineering sciences give the impression that there is only limited overlap between the two (see, e.g., Fig.  6 ). We have seen, however, that application-oriented research relies on scientific tools, methods and modes of reasoning. Our framework makes this explicit.

Figure 6. Hill’s (1970) comparison of scientific reasoning (left) and reasoning in the engineering sciences (figure taken from Hughes, 2009).

On a final note, our new framework has been developed mainly in response to perceived shortcomings for students at a technical university. There are, however, reasons to suppose that it could be deployed in courses for students from a wide range of disciplines. It has often been noted that research in many disciplines has shifted away from ‘fundamental’ issues towards more ‘applied’ ones; that research efforts have become more interdisciplinary in response to financial and societal incentives; and that new fields of inquiry (e.g., biomedical research, education research, social and economic policy research, management science, intervention research, experimental design research) tend to be oriented towards particular areas of application. Many different interpretations of this shift have been given, for instance in terms of changing ‘modes’ of knowledge production (Gibbons et al., 1994); a ‘triple helix’ of private, public and academic partners (Etzkowitz & Leydesdorff, 2000); or a ‘commodification’ of academic research (Radder, 2010). If we accept that the shift is taking place in some form, it follows that an increasing number of students will, in the course of their educational programs, be primarily exposed to application-oriented research practices and epistemic activities. Thus, if our experiences generalize, philosophy of science teachers might well find that our extended version of Giere's model-based framework is more comprehensible and useful for ever more students than Giere's original version, let alone a statement-based approach.

Of course, whether our experiences in fact generalize remains to be seen. In building on Giere's original framework, we have provisionally adopted his (implicit) assumption that different types of models – e.g., mathematical models, scale models, and diagrams – play sufficiently similar roles in scientific reasoning to be treated alike. A more differentiated approach may well be called for. Likewise, we have supposed that our framework for reconstructing application-oriented research is compatible with different proposals regarding the distinctive knowledge generated by such research (e.g., in the form of technical norms or other prescriptive knowledge). This supposition may well be incorrect, i.e., it may turn out to gloss over distinctive features that would allow students from some programs to gain insight into research activities in their chosen disciplines. Such shortcomings can, however, best be identified in practice rather than discussed in the abstract. We therefore invite other teachers to study more systematically the merits and downsides of our application-oriented version of Giere's model-based framework.

Data availability

Not applicable.

Code availability

Not applicable.

References

Adam, M., Carrier, M., & Wilholt, T. (2006). Moderate emergentism. Science and Public Policy, 33 , 435–444.

Böttcher, F., & Meisert, A. (2011). Argumentation in science education: A model-based framework. Science & Education, 20 , 103–140.

Bunge, M. (1966). Technology as applied science. Technology and Culture, 7 (3), 329–347.

Cross, N. (1982). Designerly Ways of Knowing. Design Studies, 3 (4), 221–227.

Chang, H. (2011). The philosophical grammar of scientific practice. International Studies in the Philosophy of Science, 25 , 205–221.

Chang, H. (2012). Is Water H2O? Springer.

Clement, J. J. (2008). Creative model construction in scientists and students: The role of imagery, analogy, and mental simulation . Springer.

Constant, E. W. (1999). Reliable knowledge and unreliable stuff. Technology and Culture, 40 , 324–357.

Edvardsson, K., & Hansson, S. O. (2005). When is a goal rational? Social Choice and Welfare, 24(2), 343–361.

Etzkowitz, H., & Leydesdorff, L. (2000). The dynamics of innovation. Research Policy, 29 , 109–123.

Gero, J. S. (1990). Design prototypes: A knowledge representation scheme for design. AI Magazine, 11 (4), 26–36.

Gibbons, M., Limoges, C., Nowotny, H., Schwartzman, S., Scott, P., & Trow, M. (1994). The New Production of Knowledge . SAGE.

Giere, R. N. (1979, 1984, 1991, 2005). Understanding scientific reasoning (1st–4th eds.). Holt, Rinehart, and Winston.

Giere, R. N. (2001). A new framework for teaching scientific reasoning. Argumentation, 15 (1), 21–33.

Gilbert, J. K., Boulter, C., & Rutherford, M. (1998). Models in explanations, part 1: Horses for courses? International Journal of Science Education, 20 (1), 83–97.

Gobert, J. D., & Clement, J. J. (1999). Effect of student-generated diagram versus student-generated summaries on conceptual understanding of causal and dynamic knowledge in plate tectonics. Journal of Research in Science Teaching, 26 (1), 39–53.

Gobert, J. D., & Pallant, A. (2004). Fostering students’ epistemologies of models via authentic model-based tasks. Journal of Science Education and Technology, 13(1).

Gobert, J. D. (2005). The effects of different learning tasks on model-building in plate tectonics: Diagramming versus explaining. Journal of Geoscience Education, 53 (4), 444–455.

Halloun, I. A. (2004). Modeling theory in science education . Kluwer Academic Publishers.

Hill, P. (1970). The Science of Engineering Design . Holt, Rinehart and Winston.

Houkes, W., & Meijers, A. W. M. (2021). Engineering knowledge. Forthcoming in S. Vallor (Ed.), The Oxford Handbook of Philosophy of Technology. Oxford University Press.

Hughes, J. (2009). Practical reasoning and engineering. In Meijers A (ed) Handbook of the Philosophy of Science. Volume 9: Philosophy of Technology and Engineering Sciences . Elsevier.

Johnson-Laird, P. N. (1983). Mental models: Towards a cognitive science of language, inference, and consciousness . Cambridge University Press.

Johnson-Laird, P.N. (2006). Mental models, sentential reasoning, and illusory inferences. In Held C, Knauff M, Vosgerau G, et al. (eds.) Mental models and the mind . Elsevier.

Justi, R., & Gilbert, J. K. (1999). History and philosophy of science through models: The case of chemical kinetics. Science & Education, 8 , 287–307.

Kant, V., & Kerr, E. (2019). Taking stock of engineering epistemology: Multidisciplinary perspectives. Philosophy & Technology, 32 , 685–726.

Kroes, P. A. (1992). On the role of design in engineering theories. In P. A. Kroes & M. Bakker (Eds.), Technological development and science in the industrial age (pp. 69–98). Kluwer.

Matthews, M. R. (2007). Models in science and in science education: An introduction. Science & Education, 16 , 647–652.

Meijers, A. W. M., & Kroes, P. A. (2013). Extending the scope of the theory of knowledge. In M. De Vries, S. O. Hansson, & A. W. M. Meijers (Eds.), Norms in Technology (pp. 15–34). Springer.

Mill, J. S. (1843). A system of logic, ratiocinative and inductive, being a connected view of the principles of evidence, and the methods of scientific investigation . Harper & Brothers.

Millgram, E., & Thagard, P. (1996). Deliberative coherence. Synthese, 108 , 63–88.

Nersessian, N.J. (2002). The cognitive basis of model-based reasoning in science. In Carruthers P, Stich S, Siegal M (Eds.), The cognitive basis of science . Cambridge University Press.

Nersessian, N. J. (2008). Creating scientific concepts . MIT Press.

Niiniluoto, I. (1993). The aim and structure of applied research. Erkenntnis, 38 , 1–21.

Passmore, C., & Stewart, J. (2002). A modeling approach to teaching evolutionary biology in high schools. Journal of Research in Science Teaching, 39 (3), 185–204.

Radder, H. (Ed.) (2010). The commodification of scientific research . University of Pittsburgh Press.

Roozenburg, N. F. M., & Eekels, J. (1995). Product design: Fundamentals and methods . Wiley.

Taylor, I., Barker, M., & Jones, A. (2003). Promoting mental model building in astronomy education. International Journal of Science Education, 25 (10), 1205–1225.

Vega, D. I. (2020). Lockdown, one, two, none, or smart. Modeling containing COVID-19 infection. A conceptual model. Science of the Total Environment, 730, 138917.

Vincenti, W. G. (1990). What engineers know and how they know it . Johns Hopkins University Press.

Wilholt, T. (2006). Design rules: Industrial research and epistemic merit. Philosophy of Science, 73 (1), 66–89.

Author information

Authors and Affiliations

Philosophy & Ethics, School of Innovation Sciences, Eindhoven University of Technology, P.O. Box 513, 5600 MB, Eindhoven, The Netherlands

Krist Vaesen & Wybo Houkes

Corresponding author

Correspondence to Krist Vaesen .

Ethics declarations

Conflicts of interest/competing interests

Additional information

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This article belongs to the Topical Collection: Teaching philosophy of science to students from other disciplines

Guest Editors: Sara Green, Joeri Witteveen

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .

About this article

Vaesen, K., Houkes, W. A new framework for teaching scientific reasoning to students from application-oriented sciences. Euro Jnl Phil Sci 11 , 56 (2021). https://doi.org/10.1007/s13194-021-00379-0

Received: 03 December 2020

Accepted: 13 May 2021

Published: 02 June 2021

DOI: https://doi.org/10.1007/s13194-021-00379-0

Keywords

  • Science education
  • Model-based reasoning
  • Hypothesis-driven research
  • Application-oriented research
  • Epistemic activities
  • Ronald Giere

1.2: The Science of Biology - Scientific Reasoning

Learning Objectives

  • Compare and contrast theories and hypotheses

The Process of Science

Science (from the Latin scientia, meaning “knowledge”) can be defined as knowledge that covers general truths or the operation of general laws, especially when acquired and tested by the scientific method. The steps of the scientific method will be examined in detail later, but one of the most important aspects of this method is the testing of hypotheses (testable statements) by means of repeatable experiments. Although using the scientific method is inherent to science, it is not by itself sufficient to determine what science is. This is because it is relatively easy to apply the scientific method to disciplines such as physics and chemistry, but when it comes to disciplines like archaeology, paleoanthropology, psychology, and geology, the scientific method becomes less applicable as it becomes more difficult to repeat experiments.

These areas of study are still sciences, however. Consider archaeology: even though one cannot perform repeatable experiments, hypotheses may still be supported. For instance, an archaeologist can hypothesize that an ancient culture existed based on finding a piece of pottery. Further hypotheses could be made about various characteristics of this culture. These hypotheses may be found to be plausible (supported by data) and tentatively accepted, or may be falsified and rejected altogether (due to contradictions from data and other findings). A group of related hypotheses that have not been disproven may eventually lead to the development of a verified theory. A theory is a tested and confirmed explanation for observations or phenomena that is supported by a large body of evidence. Science may be better defined as fields of study that attempt to comprehend the nature of the universe.

Scientific Reasoning

One thing is common to all forms of science: an ultimate goal “to know.” Curiosity and inquiry are the driving forces for the development of science. Scientists seek to understand the world and the way it operates. To do this, they use two methods of logical thinking: inductive reasoning and deductive reasoning.

Inductive reasoning is a form of logical thinking that uses related observations to arrive at a general conclusion. This type of reasoning is common in descriptive science. A life scientist such as a biologist makes observations and records them. These data can be qualitative or quantitative and the raw data can be supplemented with drawings, pictures, photos, or videos. From many observations, the scientist can infer conclusions (inductions) based on evidence. Inductive reasoning involves formulating generalizations inferred from careful observation and the analysis of a large amount of data. Brain studies provide an example. In this type of research, many live brains are observed while people are doing a specific activity, such as viewing images of food. The part of the brain that “lights up” during this activity is then predicted to be the part controlling the response to the selected stimulus; in this case, images of food. The “lighting up” of the various areas of the brain is caused by excess absorption of radioactive sugar derivatives by active areas of the brain. The resultant increase in radioactivity is observed by a scanner. Then researchers can stimulate that part of the brain to see if similar responses result.

Deductive reasoning or deduction is the type of logic used in hypothesis-based science. In deductive reasoning, the pattern of thinking moves in the opposite direction as compared to inductive reasoning. Deductive reasoning is a form of logical thinking that uses a general principle or law to forecast specific results. From those general principles, a scientist can extrapolate and predict the specific results that would be valid as long as the general principles are valid. Studies in climate change can illustrate this type of reasoning. For example, scientists may predict that if the climate becomes warmer in a particular region, then the distribution of plants and animals should change. Such predictions have been made and tested, and many of the predicted changes have been observed, such as the modification of arable areas for agriculture correlated with changes in the average temperatures.

Both types of logical thinking are related to the two main pathways of scientific study: descriptive science and hypothesis-based science. Descriptive (or discovery) science, which is usually inductive, aims to observe, explore, and discover, while hypothesis-based science, which is usually deductive, begins with a specific question or problem and a potential answer or solution that can be tested. The boundary between these two forms of study is often blurred and most scientific endeavors combine both approaches. The fuzzy boundary becomes apparent when thinking about how easily observation can lead to specific questions. For example, a gentleman in the 1940s observed that the burr seeds that stuck to his clothes and his dog’s fur had a tiny hook structure. Upon closer inspection, he discovered that the burrs’ gripping device was more reliable than a zipper. He eventually developed a company and produced the hook-and-loop fastener popularly known today as Velcro. Descriptive science and hypothesis-based science are in continuous dialogue.

  • A hypothesis is a statement/prediction that can be tested by experimentation.
  • A theory is an explanation for a set of observations or phenomena that is supported by extensive research and that can be used as the basis for further research.
  • Inductive reasoning draws on observations to infer logical conclusions based on the evidence.
  • Deductive reasoning is hypothesis-based logical reasoning that deduces conclusions from test results.
  • theory : a well-substantiated explanation of some aspect of the natural world based on knowledge that has been repeatedly confirmed through observation and experimentation
  • hypothesis : a tentative conjecture explaining an observation, phenomenon, or scientific problem that can be tested by further observation, investigation, and/or experimentation

Styles of Scientific Reasoning

Science is a process for making meaning and deriving understanding about our world and ourselves. It is fundamentally a human endeavor, and thus is fallible in all of the ways humans are fallible. Also like humans, science is multifaceted, creative, collaborative, and more. Look for examples of science that has been done using each of these approaches, and think about how you make meaning in the world around you!

Consider this…

What do you think caused Townville to flood?

To answer this question you are using skills of observation . As this is merely a cartoon, you are limited to what you can see of what is shown. You may want to imagine what additional information you could gather if you were answering this question on site — What other senses could you use? Where would you look more closely? What would you want to know about the bigger picture? etc…

To state what you think caused the flood is a form of stating a hypothesis . While there are many ways of conducting scientific exploration, scientists usually have some expectation of what they’re studying and what they hope to find.

How could you study this?

To move from a conjecture (just a statement you think could be true) into a testable expectation or hypothesis, consider what about this scenario you could examine and support with evidence. What would you expect to find if your expectation/hypothesis is true? AND What would you expect to find if your conjecture/hypothesis is false? This second question is important as it can guide you toward questions that will give you new information about the world regardless of whether you were right or wrong in your thinking. Be humble in your thinking. Leave space to be surprised and to experience awe at what you may uncover about the way the world works.

Try to make a list for yourself of several ways you could study your conjecture/hypothesis. Can you draw on different skills that you have to approach the problem in multiple ways? Imagine for a moment that you have unlimited resources. Also consider what you could do with just the resources and knowledge you have right now. You don’t always need money and fancy equipment to  make meaning about the world around you!

Ways of Knowing

If you learned that a controlled experiment or null hypothesis was essential for good science, I’m here to burst your bubble — they’re not. In fact, there are many ways of knowing and deriving understanding in science. We are exploring a framework of six styles of scientific reasoning, based on the work of Per Kind and Jonathan Osborne (journal link or download here). While sometimes overlapping, these highlight distinct approaches to making meaning about our world and ourselves.

Consider the ways you would study your conjecture/hypothesis for the flooding of Townville.

Did you propose strategies that all fit into one style of scientific reasoning?

Now seeing this list, can you propose additional approaches that use other styles? Consider whether it is helpful to consider the various styles, or whether it’s still hard to think about making meaning in new ways.

Why do you think you stayed within one style? This could be because you’ve identified a lens through which you like to study the world, or it could reflect a lack of experience with other styles of reasoning. This awareness about yourself can help you in your scientific studies going forward. Find mentors who can help you deepen your skills in a particular style of interest, as well as mentors who can broaden your thinking into new styles and approaches.

Did you propose strategies that span a wide range of styles of scientific reasoning?

Great—you’re demonstrating breadth and creativity in your thinking. Now, how would you decide which experiment to actually pursue? In this contrived example, you may find yourself drawn to a particular approach that is ethically or realistically feasible (no time travel or flooding of real towns, for example) or to something that requires fewer resources (probably no building of a scale model of the flood plain, though it has been done before!). These are real constraints scientists face and are helpful to recognize and practice—often these constraints help to spur creativity within the realm of the most possible. If you still can’t decide where to start, here are a few more practical tips:

  • Start small so that you can fail fast, reflect and learn from your failure, and then iterate to something more meaningful for answering the question. Don’t expect your first idea will be your best idea. It never is!
  • Consider what specific question you’d be asking and trying to answer in your work. Make sure the answer you’re looking for will help to answer a question you actually want/need to know the answer to.
  • Consider designing experiments/models/etc. that will give you information toward answering your question regardless of whether you achieve your expected result or not. Plan ahead for different possible outcomes, observations, or scenarios and consider how each could add to your current understanding. And know that sometimes you will still be surprised!


Scientific Objectivity

Scientific objectivity is a property of various aspects of science. It expresses the idea that scientific claims, methods, results—and scientists themselves—are not, or should not be, influenced by particular perspectives, value judgments, community bias or personal interests, to name a few relevant factors. Objectivity is often considered to be an ideal for scientific inquiry, a good reason for valuing scientific knowledge, and the basis of the authority of science in society.

Many central debates in the philosophy of science have, in one way or another, to do with objectivity: confirmation and the problem of induction; theory choice and scientific change; realism; scientific explanation; experimentation; measurement and quantification; statistical evidence; reproducibility; evidence-based science; feminism and values in science. Understanding the role of objectivity in science is therefore integral to a full appreciation of these debates. As this article testifies, the reverse is true too: it is impossible to fully appreciate the notion of scientific objectivity without touching upon many of these debates.

The ideal of objectivity has been criticized repeatedly in philosophy of science, questioning both its desirability and its attainability. This article focuses on the question of how scientific objectivity should be defined, whether the ideal of objectivity is desirable, and to what extent scientists can achieve it.

1. Introduction

Objectivity is a value. To call a thing objective implies that it has a certain importance to us and that we approve of it. Objectivity comes in degrees. Claims, methods, results, and scientists can be more or less objective, and, other things being equal, the more objective, the better. Using the term “objective” to describe something often carries a special rhetorical force with it. The admiration of science among the general public and the authority science enjoys in public life stems to a large extent from the view that science is objective or at least more objective than other modes of inquiry. Understanding scientific objectivity is therefore central to understanding the nature of science and the role it plays in society.

If what is so great about science is its objectivity, then objectivity should be worth defending. The close examinations of scientific practice that philosophers of science have undertaken in the past fifty years have shown, however, that several conceptions of the ideal of objectivity are either questionable or unattainable. The prospects for a science providing a non-perspectival “view from nowhere” or for proceeding in a way uninformed by human goals and values are fairly slim, for example.

This article discusses several proposals to characterize the idea and ideal of objectivity in such a way that it is both strong enough to be valuable, and weak enough to be attainable and workable in practice. We begin with a natural conception of objectivity: faithfulness to facts . We motivate the intuitive appeal of this conception, discuss its relation to scientific method and discuss arguments challenging both its attainability as well as its desirability. We then move on to a second conception of objectivity as absence of normative commitments and value-freedom , and once more we contrast arguments in favor of such a conception with the challenges it faces. A third conception of objectivity which we discuss at length is the idea of absence of personal bias .

Finally there is the idea that objectivity is anchored in scientific communities and their practices . After discussing three case studies from economics, social science and medicine, we address the conceptual unity of scientific objectivity : Do the various conceptions have a common valid core, such as promoting trust in science or minimizing relevant epistemic risks? Or are they rivaling and only loosely related accounts? Finally we present some conjectures about what aspects of objectivity remain defensible and desirable in the light of the difficulties we have encountered.

2. Objectivity as Faithfulness to Facts

The basic idea of this first conception of objectivity is that scientific claims are objective in so far as they faithfully describe facts about the world. The philosophical rationale underlying this conception of objectivity is the view that there are facts “out there” in the world and that it is the task of scientists to discover, analyze, and systematize these facts. “Objective” then becomes a success word: if a claim is objective, it correctly describes some aspect of the world.

In this view, science is objective to the degree that it succeeds at discovering and generalizing facts, abstracting from the perspective of the individual scientist. Although few philosophers have fully endorsed such a conception of scientific objectivity, the idea figures recurrently in the work of prominent twentieth-century philosophers of science such as Carnap, Hempel, Popper, and Reichenbach.

2.1 The View from Nowhere

Humans experience the world from a perspective. The contents of an individual’s experiences vary greatly with his perspective, which is affected by his personal situation, and the details of his perceptual apparatus, language and culture. While the experiences vary, there seems to be something that remains constant. The appearance of a tree will change as one approaches it but—according to common sense and most philosophers—the tree itself doesn’t. A room may feel hot or cold for different persons, but its temperature is independent of their experiences. The object in front of me does not disappear just because the lights are turned off.

These examples motivate a distinction between qualities that vary with one’s perspective, and qualities that remain constant through changes of perspective. The latter are the objective qualities. Thomas Nagel explains that we arrive at the idea of objective qualities in three steps (Nagel 1986: 14). The first step is to realize (or postulate) that our perceptions are caused by the actions of things around us, through their effects on our bodies. The second step is to realize (or postulate) that since the same qualities that cause perceptions in us also have effects on other things and can exist without causing any perceptions at all, their true nature must be detachable from their perspectival appearance and need not resemble it. The final step is to form a conception of that “true nature” independently of any perspective. Nagel calls that conception the “view from nowhere”, Bernard Williams the “absolute conception” (Williams 1985 [2011]). It represents the world as it is, unmediated by human minds and other “distortions”.

This absolute conception lies at the basis of scientific realism (for a detailed discussion, see the entry on scientific realism ) and it is attractive in so far as it provides a basis for arbitrating between conflicting viewpoints (e.g., two different observations). Moreover, the absolute conception provides a simple and unified account of the world. Theories of trees will be very hard to come by if they use predicates such as “height as seen by an observer” and a hodgepodge if their predicates track the habits of ordinary language users rather than the properties of the world. To the extent, then, that science aims to provide explanations for natural phenomena, casting them in terms of the absolute conception would help to realize this aim. A scientific account cast in the language of the absolute conception may not only be able to explain why a tree is as tall as it is but also why we see it in one way when viewed from one standpoint and in a different way when viewed from another. As Williams (1985 [2011: 139]) puts it,

[the absolute conception] nonvacuously explain[s] how it itself, and the various perspectival views of the world, are possible.

A third reason to find the view from nowhere attractive is that if the world came in structures as characterized by it and we did have access to it, we could use our knowledge of it to ground predictions (which, to the extent that our theories do track the absolute structures, will be borne out). A fourth and related reason is that attempts to manipulate and control phenomena can similarly be grounded in our knowledge of these structures. To attain any of the four purposes—settling disagreements, explaining the world, predicting phenomena, and manipulation and control—the absolute conception is at best sufficient but not necessary. We can, for instance, settle disagreements by imposing the rule that the person with higher social rank or greater experience is always right. We can explain the world and our image of it by means of theories that do not represent absolute structures and properties, and there is no need to get things (absolutely) right in order to predict successfully. Nevertheless, there is something appealing in the idea that factual disagreements can be settled by the very facts themselves, and that explanations and predictions are grounded in what’s really there rather than in a distorted image of it.

No matter how desirable, our ability to use scientific claims to represent facts about the world depends on whether these claims can unambiguously be established on the basis of evidence, and of evidence alone. Alas, the relation between evidence and scientific hypothesis is not straightforward. Subsections 2.2 and 2.3 will look at two challenges to the idea that even the best scientific method will yield claims that describe an aperspectival view from nowhere. Section 5.2 will deal with socially motivated criticisms of the view from nowhere.

2.2 Theory-Ladenness and Incommensurability

According to a popular picture, all scientific theories are false and imperfect. Yet, as we add true and eliminate false beliefs, our best scientific theories become more truthlike (e.g., Popper 1963, 1972). If this picture is correct, then scientific knowledge grows by gradually approaching the truth and it will become more objective over time, that is, more faithful to facts. However, scientific theories often change, and sometimes several theories compete for the place of the best scientific account of the world.

It is inherent in the above picture of scientific objectivity that observations can, at least in principle, decide between competing theories. If they did not, the conception of objectivity as faithfulness would be pointless, as we would not be in a position to verify it. This position has been adopted by Karl R. Popper, Rudolf Carnap and other leading figures in (broadly) empiricist philosophy of science. Many philosophers have argued that the relation between observation and theory is far more complex and that influences can actually run both ways (e.g., Duhem 1906 [1954]; Wittgenstein 1953 [2001]). The most lasting criticism, however, was delivered by Thomas S. Kuhn (1962 [1970]) in his book “The Structure of Scientific Revolutions”.

Kuhn’s analysis is built on the assumption that scientists always view research problems through the lens of a paradigm, defined by a set of relevant problems, axioms, methodological presuppositions, techniques, and so forth. Kuhn provided several historical examples in favor of this claim. Scientific progress—and the practice of normal, everyday science—happens within a paradigm that guides the individual scientists’ puzzle-solving work and that sets the community standards.

Can observations undermine such a paradigm, and speak for a different one? Here, Kuhn famously stresses that observations are “theory-laden” (cf. also Hanson 1958): they depend on a body of theoretical assumptions through which they are perceived and conceptualized. This hypothesis has two important aspects.

First, the meaning of observational concepts is influenced by theoretical assumptions and presuppositions. For example, the concepts “mass” and “length” have different meanings in Newtonian and relativistic mechanics; so does the concept “temperature” in thermodynamics and statistical mechanics (cf. Feyerabend 1962). In other words, Kuhn denies that there is a theory-independent observation language. The “faithfulness to reality” of an observation report is always mediated by a theoretical überbau, disabling the role of observation reports as an impartial, merely fact-dependent arbiter between different theories.

Second, not only the observational concepts, but also the perception of a scientist depends on the paradigm she is working in.

Practicing in different worlds, the two groups of scientists [who work in different paradigms, J.R./J.S.] see different things when they look from the same point in the same direction. (Kuhn 1962 [1970: 150])

That is, our own sense data are shaped and structured by a theoretical framework, and may be fundamentally distinct from the sense data of scientists working in another one. Where a geocentric astronomer like Tycho Brahe sees a sun setting behind the horizon, a Copernican astronomer like Johannes Kepler sees the horizon moving up to a stationary sun. If this picture is correct, then it is hard to assess which theory or paradigm is more faithful to the facts, that is, more objective.

The thesis of the theory-ladenness of observation has also been extended to the incommensurability of different paradigms or scientific theories, problematized independently by Thomas S. Kuhn (1962 [1970]) and Paul Feyerabend (1962). Literally, this concept means “having no measure in common”, and it figures prominently in arguments against a linear and standpoint-independent picture of scientific progress. For instance, the Special Theory of Relativity appears to be more faithful to the facts and therefore more objective than Newtonian mechanics because it reduces, for low speeds, to the latter, and it accounts for some additional facts that are not predicted correctly by Newtonian mechanics. This picture is undermined, however, by two central aspects of incommensurability. First, not only do the observational concepts in both theories differ, but the principles for specifying their meaning may be inconsistent with each other (Feyerabend 1975: 269–270). Second, scientific research methods and standards of evaluation change with the theories or paradigms. Not all puzzles that could be tackled in the old paradigm will be solved by the new one—this is the phenomenon of “Kuhn loss”.

A meaningful use of objectivity presupposes, according to Feyerabend, that we perceive and describe the world from a specific perspective, e.g., when we try to verify the referential claims of a scientific theory. Only within a given scientific worldview can the concept of objectivity be applied meaningfully. That is, scientific method cannot free itself from the particular scientific theory to which it is applied; the door to standpoint-independence is locked. As Feyerabend puts it:

our epistemic activities may have a decisive influence even upon the most solid piece of cosmological furniture—they make gods disappear and replace them by heaps of atoms in empty space. (1978: 70)

Kuhn’s and Feyerabend’s theses about the theory-ladenness of observation, and their implications for the objectivity of scientific inquiry, have since been much debated, and have often been misunderstood in a social constructivist sense. Kuhn therefore later returned to the topic of scientific objectivity, which he characterizes in terms of the shared cognitive values of a scientific community. We discuss Kuhn’s later view in section 3.1. For more thorough coverage, see the entries on theory and observation in science, the incommensurability of scientific theories, and Thomas S. Kuhn.

Scientific theories are tested by comparing their implications with the results of observations and experiments. Unfortunately, neither positive results (when the theory’s predictions are borne out in the data) nor negative results (when they are not) allow unambiguous inferences about the theory. A positive result can obtain even though the theory is false, due to some alternative that makes the same predictions. Finding suspect Jones’ fingerprints on the murder weapon is consistent with his innocence because he might have used it as a kitchen knife. A negative result might be due not to the falsehood of the theory under test but due to the failing of one or more auxiliary assumptions needed to derive a prediction from the theory. Testing, let us say, the implications of Newton’s laws for movements in our planetary system against observations requires assumptions about the number of planets, the sun’s and the planets’ masses, the extent to which the earth’s atmosphere refracts light beams, how telescopes affect the results and so on. Any of these may be false, explaining an inconsistency. The locus classicus for these observations is Pierre Duhem’s The Aim and Structure of Physical Theory (Duhem 1906 [1954]). Duhem concluded that there was no “crucial experiment”, an experiment that conclusively decides between two alternative theories, in physics (1906 [1954: 188ff.]), and that physicists had to employ their expert judgment or what Duhem called “good sense” to determine what an experimental result means for the truth or falsehood of a theory (1906 [1954: 216ff.]).

In other words, there is a gap between the evidence and the theory supported by it. It is important to note that the alleged gap is more profound than the gap between the premisses of any inductive argument and its conclusion, say, the gap between “All hitherto observed ravens have been black” and “All ravens are black”. The latter gap could be bridged by an agreed upon rule of inductive reasoning. Alas, all attempts to find an analogous rule for theory choice have failed (e.g., Norton 2003). Various philosophers, historians, and sociologists of science have responded that theory appraisal is “a complex form of value judgment” (McMullin 1982: 701; see also Kuhn 1977; Hesse 1980; Bloor 1982).

In section 3.1 below we will discuss the nature of the value judgments in more detail. For now the important lesson is that if these philosophers, historians, and sociologists are correct, the “faithfulness to facts” ideal is untenable. As the scientific image of the world is a joint product of the facts and scientists’ value judgments, that image cannot be said to be aperspectival. Science does not eschew the human perspective. There are of course ways to escape this conclusion. If, as John Norton (2003; ms.—see Other Internet Resources) has argued, it is material facts that power and justify inductive inferences, and not value judgments, we can avoid the negative conclusion regarding the view from nowhere. Unsurprisingly, Norton is also critical of the idea that evidence generally underdetermines theory (Norton 2008). However, there are good reasons to mistrust Norton’s optimism regarding the eliminability of values and other non-factual elements in inductive inferences (Reiss 2020).

There is another, closely related concern. Most of the earlier critics of “objective” verification or falsification focused on the relation between evidence and scientific theories. There is a sense in which the claim that this relation is problematic is not so surprising. Scientific theories contain highly abstract claims that describe states of affairs far removed from the immediacy of sense experience. This is for a good reason: sense experience is necessarily perspectival, so to the extent to which scientific theories are to track the absolute conception, they must describe a world different from that of sense experience. But surely, one might think, the evidence itself is objective. So even if we do have reasons to doubt that abstract theories faithfully represent the world, we should stand on firmer grounds when it comes to the evidence against which we test abstract theories.

Theories are seldom tested against brute observations, however. Simple generalizations such as “all swans are white” are directly learned from observations (say, of the color of swans) but they do not represent the view from nowhere (for one thing, the view from nowhere doesn’t have colors). Genuine scientific theories are tested against experimental facts or phenomena, which are themselves unobservable to the unaided senses. Experimental facts or phenomena are instead established using intricate procedures of measurement and experimentation.

We therefore need to ask whether the results of scientific measurements and experiments can be aperspectival. In an important debate in the 1980s and 1990s some commentators answered that question with a resounding “no”, which was then rebutted by others. The debate concerns the so-called “experimenter’s regress” (Collins 1985). Collins, a prominent sociologist of science, claims that in order to know whether an experimental result is correct, one first needs to know whether the apparatus producing the result is reliable. But one doesn’t know whether the apparatus is reliable unless one knows that it produces correct results in the first place, and so on ad infinitum. Collins’ main case concerns attempts to detect gravitational waves, which were the subject of considerable controversy among physicists in the 1970s.

Collins argues that the circle is eventually broken not by the “facts” themselves but rather by factors having to do with the scientist’s career, the social and cognitive interests of his community, and the expected fruitfulness for future work. It is important to note that in Collins’s view these factors do not necessarily make scientific results arbitrary. But what he does argue is that the experimental results do not represent the world according to the absolute conception. Rather, they are produced jointly by the world, scientific apparatuses, and the psychological and sociological factors mentioned above. The facts and phenomena of science are therefore necessarily perspectival.

In a series of contributions, Allan Franklin, a physicist-turned-philosopher of science, has tried to show that while there are indeed no algorithmic procedures for establishing experimental facts, disagreements can nevertheless be settled by reasoned judgment on the basis of bona fide epistemological criteria such as experimental checks and calibration, elimination of possible sources of error, using apparatuses based on well-corroborated theory and so on (Franklin 1994, 1997). Collins responds that “reasonableness” is a social category that is not drawn from physics (Collins 1994).

The main issue for us in this debate is whether there are any reasons to believe that experimental results provide an aperspectival view on the world. According to Collins, experimental results are co-determined by the facts as well as social and psychological factors. According to Franklin, whatever else influences experimental results other than facts is not arbitrary but instead based on reasoned judgment. What he has not shown is that reasoned judgment guarantees that experimental results reflect the facts alone and are therefore aperspectival in any interesting sense. Another important challenge for the aperspectival account comes from feminist epistemology and other accounts that stress the importance of the construction of scientific knowledge through epistemic communities. These accounts are reviewed in section 5.

3. Objectivity as Absence of Normative Commitments and the Value-Free Ideal

In the previous section we have presented arguments against the view of objectivity as faithfulness to facts and an impersonal “view from nowhere”. An alternative view is that science is objective to the extent that it is value-free. Why would we identify objectivity with value-freedom or regard the latter as a prerequisite for the former? Part of the answer is empiricism. If science is in the business of producing empirical knowledge, and if differences about value judgments cannot be settled by empirical means, values should have no place in science. In the following we will try to make this intuition more precise.

Before addressing what we will call the “value-free ideal”, it will be helpful to distinguish four stages at which values may affect science. They are: (i) the choice of a scientific research problem; (ii) the gathering of evidence in relation to the problem; (iii) the acceptance of a scientific hypothesis or theory as an adequate answer to the problem on the basis of the evidence; (iv) the proliferation and application of scientific research results (Weber 1917 [1949]).

Most philosophers of science would agree that the role of values in science is contentious only with respect to dimensions (ii) and (iii): the gathering of evidence and the acceptance of scientific theories. It is almost universally accepted that the choice of a research problem is often influenced by interests of individual scientists, funding parties, and society as a whole. This influence may make science more shallow and slow down its long-run progress, but it has benefits, too: scientists will focus on providing solutions to those intellectual problems that are considered urgent by society and they may actually improve people’s lives. Similarly, the proliferation and application of scientific research results is evidently affected by the personal values of journal editors and end users, and little can be done about this. The real debate is about whether or not the “core” of scientific reasoning—the gathering of evidence and the assessment and acceptance of scientific theories—is, and should be, value-free.

We have introduced the problem of the underdetermination of theory by evidence above. The problem does not stop, however, at values being required for filling the gap between theory and evidence. A further complication is that these values can conflict with each other. Consider the classical problem of fitting a mathematical function to a data set. The researcher often has the choice between using a complex function, which makes the relationship between the variables less simple but fits the data more accurately, or postulating a simpler relationship that is less accurate. Simplicity and accuracy are both important cognitive values, and trading them off requires a careful value judgment. However, philosophers of science tend to regard value-ladenness in this sense as benign. Cognitive values (sometimes also called “epistemic” or “constitutive” values) such as predictive accuracy, scope, unification, explanatory power, simplicity and coherence with other accepted theories are taken to be indicative of the truth of a theory and therefore provide reasons for preferring one theory over another (McMullin 1982, 2009; Laudan 1984; Steel 2010). Kuhn (1977) even claims that cognitive values define the shared commitments of science, that is, the standards of theory assessment that characterize the scientific approach as a whole. Note that not every philosopher entertains the same list of cognitive values: subjective differences in ranking and applying cognitive values do not vanish, a point Kuhn made emphatically.
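
To make the trade-off concrete, here is a minimal Python sketch (the data and the choice of polynomial degrees are made up for illustration): the complex model fits the data more accurately, but nothing in the numbers themselves tells us how much accuracy is worth how much simplicity.

    import numpy as np

    rng = np.random.default_rng(1)

    # Hypothetical data: a roughly linear relationship plus noise.
    x = np.linspace(0, 1, 20)
    y = 2 * x + rng.normal(scale=0.2, size=x.size)

    # A simple model (straight line) and a more complex one (degree-9 polynomial).
    simple_fit = np.polyfit(x, y, deg=1)
    complex_fit = np.polyfit(x, y, deg=9)

    def sum_of_squared_errors(coefficients):
        return float(np.sum((np.polyval(coefficients, x) - y) ** 2))

    # The complex fit is more accurate on the data but far less simple; which fit
    # is "better" depends on how the two cognitive values are weighed.
    print(sum_of_squared_errors(simple_fit), sum_of_squared_errors(complex_fit))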

In most views, the objectivity and authority of science is not threatened by cognitive values, but only by non-cognitive or contextual values. Contextual values are moral, personal, social, political and cultural values such as pleasure, justice and equality, conservation of the natural environment and diversity. The most notorious cases of improper uses of such values involve travesties of scientific reasoning, where the intrusion of contextual values led to an intolerant and oppressive scientific agenda with devastating epistemic and social consequences. In the Third Reich, a large part of contemporary physics, such as the theory of relativity, was condemned because its inventors were Jewish; in the Soviet Union, biologist Nikolai Vavilov was sentenced to death (and died in prison) because his theories of genetic inheritance did not match Marxist-Leninist ideology. Both states tried to foster a science that was motivated by political convictions (“Deutsche Physik” in Nazi Germany, Lysenko’s Lamarckian theory of inheritance and denial of genetics), leading to disastrous epistemic and institutional effects.

Less spectacular, but arguably more frequent are cases where research is biased toward the interests of the sponsors, such as tobacco companies, food manufacturers and large pharmaceutical firms (e.g., Resnik 2007; Reiss 2010). This preference bias, defined by Wilholt (2009) as the infringement of conventional standards of the research community with the aim of arriving at a particular result, is clearly epistemically harmful. Especially for sensitive high-stakes issues such as the approval of medical drugs or the consequences of anthropogenic global warming, it seems desirable that research scientists assess theories without being influenced by such considerations. This is the core idea of the

Value-Free Ideal (VFI): Scientists should strive to minimize the influence of contextual values on scientific reasoning, e.g., in gathering evidence and assessing/accepting scientific theories.

According to the VFI, scientific objectivity is characterized by absence of contextual values and by exclusive commitment to cognitive values in stages (ii) and (iii) of the scientific process. See Dorato (2004: 53–54), Ruphy (2006: 190) or Biddle (2013: 125) for alternative formulations.

For value-freedom to be a reasonable ideal, it must not be a goal beyond reach: it must be attainable, at least to some degree. This claim is expressed by the

Value-Neutrality Thesis (VNT): Scientists can—at least in principle—gather evidence and assess/accept theories without making contextual value judgments.

Unlike the VFI, the VNT is not normative: its subject is whether the judgments that scientists make are, or could possibly be, free of contextual values. Similarly, Hugh Lacey (1999) distinguishes three principal components or aspects of value-free science: impartiality, neutrality and autonomy. Impartiality means that theories are solely accepted or appraised in virtue of their contribution to the cognitive values of science, such as truth, accuracy or explanatory power. This excludes the influence of contextual values, as stated above. Neutrality means that scientific theories make no value statements about the world: they are concerned with what there is, not with what there should be. Finally, scientific autonomy means that the scientific agenda is shaped by the desire to increase scientific knowledge, and that contextual values have no place in scientific method.

These three interpretations of value-free science can be combined with each other, or used individually. All of them, however, are subject to criticisms that we examine below. Denying the VNT, or the attainability of Lacey’s three criteria for value-free science, poses a challenge for scientific objectivity: one can either conclude that the ideal of objectivity should be rejected, or develop a conception of objectivity that differs from the VFI.

Lacey’s characterization of value-free science and the VNT were once mainstream positions in philosophy of science. Their widespread acceptance was closely connected to Reichenbach’s famous distinction between context of discovery and context of justification. Reichenbach first made this distinction with respect to the epistemology of mathematics:

the objective relation from the given entities to the solution, and the subjective way of finding it, are clearly separated for problems of a deductive character […] we must learn to make the same distinction for the problem of the inductive relation from facts to theories. (Reichenbach 1938: 36–37)

The standard interpretation of this statement marks contextual values, which may have contributed to the discovery of a theory, as irrelevant for justifying the acceptance of a theory, and for assessing how evidence bears on theory—the relation that is crucial for the objectivity of science. Contextual values are restricted to a matter of individual psychology that may influence the discovery, development and proliferation of a scientific theory, but not its epistemic status.

This distinction played a crucial role in post-World War II philosophy of science. It presupposes, however, a clear-cut distinction between cognitive values on the one hand and contextual values on the other. While this may be prima facie plausible for disciplines such as physics, there is an abundance of contextual values in the social sciences, for instance, in the conceptualization and measurement of a nation’s wealth, or in different ways to measure the inflation rate (cf. Dupré 2007; Reiss 2008). More generally, three major lines of criticism can be identified.

First, Helen Longino (1996) has argued that traditional cognitive values such as consistency, simplicity, breadth of scope and fruitfulness are not purely cognitive or epistemic after all, and that their use imports political and social values into contexts of scientific judgment. According to her, the use of cognitive values in scientific judgments is not always, not even normally, politically neutral. She proposes to juxtapose these values with feminist values such as novelty, ontological heterogeneity, mutuality of interaction, applicability to human needs and diffusion of power, and argues that the use of the traditional value instead of its alternative (e.g., simplicity instead of ontological heterogeneity) can lead to biases and adverse research results. Longino’s argument here is different from the one discussed in section 3.1. It casts the very distinction between cognitive and contextual values into doubt.

The second argument against the possibility of value-free science is semantic and attacks the neutrality of scientific theories: fact and value are frequently entangled because of the use of so-called “thick” ethical concepts in science (Putnam 2002)—i.e., ethical concepts that have mixed descriptive and normative content. For example, a description such as “dangerous technology” involves a value judgment about the technology and the risks it implies, but it also has a descriptive content: it is uncertain and hard to predict whether using that technology will really trigger those risks. If the use of such terms, where facts and values are inextricably entangled, is inevitable in scientific reasoning, it is impossible to describe hypotheses and results in a value-free manner, undermining the value-neutrality thesis.

Indeed, John Dupré has argued that thick ethical terms are ineliminable from science, at least certain parts of it (Dupré 2007). Dupré’s point is essentially that scientific hypotheses and results concern us because they are relevant to human interests, and thus they will necessarily be couched in a language that uses thick ethical terms. While it will often be possible to translate ethically thick descriptions into neutral ones, the translation cannot be made without losses, and these losses obtain precisely because human interests are involved (see section 6.2 for a case study from social science). According to Dupré, then, many scientific statements are value-free only because their truth or falsity does not matter to us:

Whether electrons have a positive or a negative charge and whether there is a black hole in the middle of our galaxy are questions of absolutely no immediate importance to us. The only human interests they touch (and these they may indeed touch deeply) are cognitive ones, and so the only values that they implicate are cognitive values. (2007: 31)

A third challenge to the VNT, and perhaps the most influential one, was raised first by Richard Rudner in his influential article “The Scientist Qua Scientist Makes Value Judgments” (Rudner 1953). Rudner disputes the core of the VNT and the context of discovery/justification distinction: the idea that the acceptance of a scientific theory can in principle be value-free. First, Rudner argues that

no analysis of what constitutes the method of science would be satisfactory unless it comprised some assertion to the effect that the scientist as scientist accepts or rejects hypotheses. (1953: 2)

This assumption stems from industrial quality control and other application-oriented research. In such contexts, it is often necessary to accept or to reject a hypothesis (e.g., the efficacy of a drug) in order to make effective decisions.

Second, he notes that no scientific hypothesis is ever confirmed beyond reasonable doubt—some probability of error always remains. When we accept or reject a hypothesis, there is always a chance that our decision is mistaken. Hence, our decision is also “a function of the importance, in the typically ethical sense, of making a mistake in accepting or rejecting a hypothesis” (1953: 2): we are balancing the seriousness of two possible errors (erroneous acceptance/rejection of the hypothesis) against each other. This corresponds to type I and type II error in statistical inference.

The decision to accept or reject a hypothesis involves a value judgment (at least implicitly) because scientists have to judge which of the consequences of an erroneous decision they deem more palatable: (1) some individuals die of the side effects of a drug erroneously judged to be safe; or (2) other individuals die of a condition because they did not have access to a treatment that was erroneously judged to be unsafe. Hence, ethical judgments and contextual values necessarily enter the scientist’s core activity of accepting and rejecting hypotheses, and the VNT stands refuted. Closely related arguments can be found in Churchman (1948) and Braithwaite (1953). Hempel (1965: 91–92) gives a modified account of Rudner’s argument by distinguishing between judgments of confirmation, which are free of contextual values, and judgments of acceptance. Since even strongly confirming evidence cannot fully prove a universal scientific law, we have to live with a residual “inductive risk” in inferring that law. Contextual values influence scientific methods by determining the acceptable amount of inductive risk (see also Douglas 2000).
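
Rudner’s point can be put in a small numerical sketch (all numbers are hypothetical): the same degree of confirmation licenses acceptance under one assignment of losses and rejection under another, so the decision to accept encodes a judgment about which error is worse.

    # Hypothetical numbers, for illustration only: how inductive risk turns
    # acceptance into a value-laden decision.
    p_safe = 0.95                    # degree of confirmation of "the drug is safe"

    loss_accept_if_unsafe = 300.0    # assumed harm of approving an unsafe drug
    loss_reject_if_safe = 10.0       # assumed harm of withholding a safe drug

    expected_loss_accept = (1 - p_safe) * loss_accept_if_unsafe   # 15.0
    expected_loss_reject = p_safe * loss_reject_if_safe           # 9.5

    # With these losses the hypothesis is rejected; lowering the assumed harm of
    # erroneous acceptance to 100 flips the verdict, on exactly the same evidence.
    print("accept" if expected_loss_accept < expected_loss_reject else "reject")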

But how general are Rudner’s objections? Apparently, his result holds true of applied science, but not necessarily of fundamental research. For the latter domain, two major lines of rebuttals have been proposed. First, Richard Jeffrey (1956) notes that lawlike hypotheses in theoretical science (e.g., the gravitational law in Newtonian mechanics) are characterized by their general scope and not confined to a particular application. Obviously, a scientist cannot fine-tune her decisions to their possible consequences in a wide variety of different contexts. So she should just refrain from the essentially pragmatic decision to accept or reject hypotheses. By restricting scientific reasoning to gathering and interpreting evidence, possibly supplemented by assessing the probability of a hypothesis, Jeffrey tries to save the VNT in fundamental scientific research, and the objectivity of scientific reasoning.

Second, Isaac Levi (1960) observes that scientists commit themselves to certain standards of inference when they become a member of the profession. This may, for example, lead to the statistical rejection of a hypothesis when the observed significance level is smaller than 5%. These community standards may eliminate any room for contextual ethical judgment on behalf of the scientist: they determine when she should accept a hypothesis as established. Value judgments may be implicit in how a scientific community sets standards of inference (compare section 5.1), but not in the daily work of an individual scientist (cf. Wilholt 2013).

Both defenses of the VNT focus on the impact of values in theory choice, either by denying that scientists actually choose theories (Jeffrey), or by referring to community standards and restricting the VNT to the individual scientist (Levi). Douglas (2000: 563–565) points out, however, that the “acceptance” of scientific theories is only one of several places for values to enter scientific reasoning, albeit an especially prominent and explicit one. Many decisions in the process of scientific inquiry may conceal implicit value judgments: the design of an experiment, the methodology for conducting it, the characterization of the data, the choice of a statistical method for processing and analyzing data, the interpretation of the findings, etc. None of these methodological decisions could be made without consideration of the possible consequences that could occur. Douglas gives, as a case study, a series of experiments where carcinogenic effects of dioxin exposure on rats were probed. Contextual values such as safety and risk aversion affected the conducted research at various stages: first, in the classification of pathological samples as benign or cancerous (over which there was much expert disagreement), and second, in the extrapolation from the high-dose experimental conditions to the more realistic low-dose conditions. In both cases, the choice of a conservative classification or model had to be weighed against the adverse consequences for society that could result from underestimating the risks (see also Biddle 2013).

These diagnoses cast a gloomy light on attempts to divide scientific labor between gathering evidence and determining the degree of confirmation (value-free) on the one hand and accepting scientific theories (value-laden) on the other. The entire process of conceptualizing, gathering and interpreting evidence is so entangled with contextual values that no neat division, as Jeffrey envisions, will work outside the narrow realm of statistical inference—and even there, doubts may be raised (see section 4.2).

Philip Kitcher (2011a: 31–40; see also Kitcher 2011b) gives an alternative argument, based on his idea of “significant truths”. There are simply too many truths that are of no interest whatsoever, such as the total number of offside positions in a low-level football competition. Science, then, doesn’t aim at truth simpliciter but rather at something more narrow: truth worth pursuing from the point of view of our cognitive, practical and social goals. Any truth that is worth pursuing in this sense is what he calls a “significant truth”. Clearly, it is value judgments that help us decide whether or not any given truth is significant.

Kitcher goes on to observe that the process of scientific investigation cannot neatly be divided into a stage in which the research question is chosen, one in which the evidence is gathered and one in which a judgment about the question is made on the basis of the evidence. Rather, the sequence is multiply iterated, and at each stage, the researcher has to decide whether previous results warrant pursuit of the current line of research, or whether she should switch to another avenue. Such choices are laden with contextual values.

Values in science also interact, according to Kitcher, in a non-trivial way. Assume we endorse predictive accuracy as an important goal of science. However, there may not be a convincing strategy to reach this goal in some domain of science, for instance because that domain is characterized by strong non-linear dependencies. In this case, predictive accuracy might have to yield to achieving other values, such as consistency with theories in neighboring domains. Conversely, changing social goals lead to re-evaluations of scientific knowledge and research methods.

Science, then, cannot be value-free because no scientist ever works exclusively in the supposedly value-free zone of assessing and accepting hypotheses. Evidence is gathered and hypotheses are assessed and accepted in the light of their potential for application and fruitful research avenues. Both cognitive and contextual value judgments guide these choices and are themselves influenced by their results.

The discussion so far has focused on the VNT, that is, on the practical attainability of the VFI, but little has been said about whether a value-free science is desirable in the first place. This subsection discusses this topic with special attention to informing and advising public policy from a scientific perspective. While the VFI, and many arguments for and against it, can be applied to science as a whole, the interface of science and public policy is the place where the intrusion of values into science is especially salient, and where it is surrounded by the greatest controversy. In the 2009 “Climategate” affair, leaked emails from climate scientists raised suspicions that they were pursuing a particular socio-political agenda that affected their research in an improper way. Later inquiries and reports absolved them from charges of misconduct, but the suspicions alone did much to damage the authority of science in the public arena.

Indeed, many debates at the interface of science and public policy are characterized by disagreements on propositions that combine a factual basis with specific goals and values. Take, for instance, the view that growing transgenic crops carries too much risk in terms of biosecurity, or the view that global warming should be addressed by phasing out fossil fuels immediately. The critical question in such debates is whether there are theses \(T\) such that one side in the debate endorses \(T\), the other side rejects it, the evidence is shared, and both sides have good reasons for their respective positions.

According to the VFI, scientists should uncover an epistemic, value-free basis for resolving such disagreements and restrict the dissent to the realm of value judgments. Even if the VNT should turn out to be untenable, and a strict separation to be impossible, the VFI may have an important function for guiding scientific research and for minimizing the impact of values on an objective science. In the philosophy of science, one camp of scholars defends the VFI as a necessary antidote to individual and institutional interests, such as Hugh Lacey (1999, 2002), Ernan McMullin (1982) and Sandra Mitchell (2004), while others adopt a critical attitude, such as Helen Longino (1990, 1996), Philip Kitcher (2011a) and Heather Douglas (2009). The criticisms we discuss mainly concern the desirability or the conceptual (un)clarity of the VFI.

First, it has been argued that the VFI is not desirable at all. Feminist philosophers (e.g., Harding 1991; Okruhlik 1994; Lloyd 2005) have argued that science often carries heavy androcentric values, for instance in biological theories about sex, gender and rape. The charge against these values is not so much that they are contextual rather than cognitive, but that they are unjustified. Moreover, if scientists did follow the VFI rigidly, policy-makers would pay even less attention to them, with a detrimental effect on the decisions they take (Cranor 1993). Given these shortcomings, the VFI has to be rethought if it is supposed to play a useful role for guiding scientific research and leading to better policy decisions. Section 4.3 and section 5.2 elaborate on this line of criticism in the context of scientific community practices, and a science in the service of society.

Second, the autonomy of science often fails in practice due to the presence of external stakeholders, such as funding agencies and industry lobbies. To save the epistemic authority of science, Douglas (2009: 7–8) proposes to detach it from its autonomy by reformulating the VFI and distinguishing between direct and indirect roles of values in science. Contextual values may legitimately affect the assessment of evidence by indicating the appropriate standard of evidence, the representation of complex processes, the severity of consequences of a decision, the interpretation of noisy datasets, and so on (see also Winsberg 2012). This concerns, above all, policy-related disciplines such as climate science or economics that routinely perform scientific risk analyses for real-world problems (cf. also Shrader-Frechette 1991). Values should, however, not act as “reasons in themselves”, that is, as evidence or as defeaters for evidence (direct role, illegitimate), but only as “helping to decide what should count as a sufficient reason for a choice” (indirect role, legitimate). This prohibition against values replacing or dismissing scientific evidence is called detached objectivity by Douglas, but it is complemented by various other aspects that relate to a reflective balancing of various perspectives and the procedural, social aspects of science (2009: ch. 6).

That said, Douglas’ proposal is not very concrete when it comes to implementation, e.g., regarding the way diverse values should be balanced. Compromising in the middle cannot be the solution (Weber 1917 [1949]). First, no standpoint is, just in virtue of being in the middle, evidentially supported vis-à-vis more extreme positions. Second, these middle positions are also, from a practical point of view, the least functional when it comes to advising policy-makers.

Moreover, the distinction between direct and indirect roles of values in science may not be sufficiently clear-cut to police the legitimate use of values in science, and to draw the necessary borderlines. Assume that a scientist considers, for whatever reason, the consequences of erroneously accepting hypothesis \(H\) undesirable. Therefore he uses a statistical model whose results are likely to favor ¬\(H\) over \(H\). Is this a matter of reasonable conservativeness? Or doesn’t it amount to reasoning to a foregone conclusion, and to treating values as evidence (cf. Elliott 2011: 320–321)?

The most recent literature on values and evidence in science presents us with a broad spectrum of opinions. Steele (2012) and Winsberg (2012) agree that probabilistic assessments of uncertainty involve contextual value judgments. While Steele defends this point by analyzing the role of scientists as policy advisors, Winsberg points to the influence of contextual values in the selection and representation of physical processes in climate modeling. Betz (2013) argues, by contrast, that scientists can largely avoid making contextual value judgments if they carefully express the uncertainty involved with their evidential judgments, e.g., by using a scale ranging from purely qualitative evidence (such as expert judgment) to precise probabilistic assessments. The issue of value judgments at earlier stages of inquiry is not addressed by this proposal; however, disentangling evidential judgments and judgments involving contextual values at the stage of theory assessment may be a good thing in itself.

Thus, should we or should we not be worried about values in scientific reasoning? While the interplay of values and evidential considerations need not be pernicious, it is unclear whether it adds to the success or the authority of science. How are we going to ensure that the permissive attitude towards values in setting evidential standards etc. is not abused? In the absence of a general theory about which contextual values are beneficial and which are pernicious, the VFI may still serve as a first-order approximation to a sound, transparent and objective science.

4. Objectivity as Freedom from Personal Biases

This section deals with scientific objectivity as a form of intersubjectivity—as freedom from personal biases. According to this view, science is objective to the extent that personal biases are absent from scientific reasoning, or that they can be eliminated in a social process. Perhaps all science is necessarily perspectival. Perhaps we cannot sensibly draw scientific inferences without a host of background assumptions, which may include assumptions about values. Perhaps all scientists are biased in some way. But objective scientific results do not, or so the argument goes, depend on researchers’ personal preferences or experiences—they are the result of a process where individual biases are gradually filtered out and replaced by agreed upon evidence. That, among other things, is what distinguishes science from the arts and other human activities, and scientific knowledge from a fact-independent social construction (e.g., Haack 2003).

Paradigmatic ways to achieve objectivity in this sense are measurement and quantification. What has been measured and quantified has been verified relative to a standard. The truth, say, that the Eiffel Tower is 324 meters tall is relative to a standard unit and conventions about how to use certain instruments, so it is neither aperspectival nor free from assumptions, but it is independent of the person making the measurement.

We will begin with a discussion of objectivity, so conceived, in measurement, discuss the ideal of “mechanical objectivity” and then investigate to what extent freedom from personal biases can be implemented in statistical evidence and inductive inference—arguably the core of scientific reasoning, especially in quantitatively oriented sciences. Finally, we discuss Feyerabend’s radical criticism of a rational scientific method that can be mechanically applied, and his defense of the epistemic and social benefits of personal “bias” and idiosyncrasy.

Measurement is often thought to epitomize scientific objectivity, most famously captured in Lord Kelvin’s dictum

when you cannot express it in numbers, your knowledge is of a meagre and unsatisfactory kind; it may be the beginning of knowledge, but you have scarcely, in your thoughts, advanced to the stage of science, whatever the matter may be. (Kelvin 1883: 73)

Measurement can certainly achieve some independence of perspective. Yesterday’s weather in Durham UK may have been “really hot” to the average North Eastern Brit and “very cold” to the average Mexican, but they’ll both accept that it was 21°C. Clearly, however, measurement does not result in a “view from nowhere”, nor are typical measurement results free from presuppositions. Measurement instruments interact with the environment, and so results will always be a product of both the properties of the environment we aim to measure as well as the properties of the instrument. Instruments, thus, provide a perspectival view on the world (cf. Giere 2006).

Moreover, making sense of measurement results requires interpretation. Consider temperature measurement. Thermometers function by relating an unobservable quantity, temperature, to an observable quantity, expansion (or length) of a fluid or gas in a glass tube; that is, thermometers measure temperature by assuming that length is a function of temperature: length = \(f\)(temperature). The function \(f\) is not known a priori, and it cannot be tested either (because it could in principle only be tested using a veridical thermometer, and the veridicality of the thermometer is just what is at stake here). Making a specific assumption, for instance that \(f\) is linear, solves that problem by fiat. But this “solution” does not take us very far because different thermometric substances (e.g., mercury, air or water) yield different results for the points intermediate between the two fixed points 0°C and 100°C, and so they can’t all expand linearly.
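
The difficulty can be illustrated with a toy calculation (the expansion laws below are invented, not data about real fluids): two substances calibrated to agree at the fixed points, and each read off under the linearity assumption, still disagree at intermediate temperatures, so at most one of them can in fact expand linearly with temperature.

    # Toy illustration with invented expansion laws (not real thermometric data).
    def length_a(t):
        # Substance A: expands linearly with the true temperature.
        return 10.0 + 0.02 * t

    def length_b(t):
        # Substance B: expands slightly non-linearly.
        return 10.0 + 0.02 * t + 0.00002 * t * (100.0 - t)

    def reading(length, t):
        # Read off a temperature by assuming linear expansion between the fixed points.
        l0, l100 = length(0.0), length(100.0)
        return 100.0 * (length(t) - l0) / (l100 - l0)

    print(reading(length_a, 50.0))   # 50.0
    print(reading(length_b, 50.0))   # 52.5: the two thermometers disagree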

According to Hasok Chang’s account of early thermometry (Chang 2004), the problem was eventually solved by using a “principle of minimalist overdetermination”, the goal of which was to find a reliable thermometer while making as few substantial assumptions (e.g., about the form for \(f\)) as possible. It was argued that if a thermometer was to be reliable, different tokens of the same thermometer type should agree with each other, and the results of air thermometers agreed the most. “Minimal” doesn’t mean zero, however, and indeed this procedure makes an important presupposition (in this case a metaphysical assumption about the one-valuedness of a physical quantity). Moreover, the procedure yielded at best a reliable instrument, not necessarily one that was best at tracking the uniquely real temperature (if there is such a thing).

What Chang argues about early thermometry is true of measurements more generally: they are always made against a backdrop of metaphysical presuppositions, theoretical expectations and other kinds of belief. Whether or not any given procedure is regarded as adequate depends to a large extent on the purposes pursued by the individual scientist or group of scientists making the measurements. Especially in the social sciences, this often means that measurement procedures are laden with normative assumptions, i.e., values.

Julian Reiss (2008, 2013) has argued that economic indicators such as consumer price inflation, gross domestic product and the unemployment rate are value-laden in this sense. Consumer-price indices, for instance, assume that if a consumer prefers a bundle \(x\) over an alternative \(y\), then \(x\) is better for her than \(y\), which is as ethically charged as it is controversial. National income measures assume that nations that exchange a larger share of goods and services on markets are richer than nations where the same goods and services are provided by the government or within households, which too is ethically charged and controversial.

While measurement procedures are not free of assumptions and values, the goal of many of them remains to reduce the influence of personal biases and idiosyncrasies. The Nixon administration, famously, indexed social security payments to the consumer-price index in order to eliminate the dependence of social security recipients on the whims of party politics: to make increases automatic instead of a result of political negotiations (Nixon 1969). Lorraine Daston and Peter Galison refer to this as mechanical objectivity. They write:

Finally, we come to the full-fledged establishment of mechanical objectivity as the ideal of scientific representation. What we find is that the image, as standard bearer of objectivity, is tied to a relentless search to replace individual volition and discretion in depiction by the invariable routines of mechanical reproduction. (Daston and Galison 1992: 98)

Mechanical objectivity reduces the importance of human contributions to scientific results to a minimum, and therefore enables science to proceed on a large scale where bonds of trust between individuals can no longer hold (Daston 1992). Trust in mechanical procedures thus replaces trust in individual scientists.

In his book Trust in Numbers, Theodore Porter pursues this line of thought in great detail. In particular, on the basis of case studies involving British actuaries in the mid-nineteenth century, of French state engineers throughout the century, and of the US Army Corps of Engineers from 1920 to 1960, he argues for two causal claims. First, measurement instruments and quantitative procedures originate in commercial and administrative needs and affect the ways in which the natural and social sciences are practiced, not the other way around. The mushrooming of instruments such as chemical balances, barometers, and chronometers was largely a result of social pressures and the demands of democratic societies. Administering large territories or controlling diverse people and processes is not always possible on the basis of personal trust and thus “objective procedures” (which do not require trust in persons) took the place of “subjective judgments” (which do). Second, he argues that quantification is a technology of distrust and weakness, and not of strength. It is weak administrators who do not have the social status, political support or professional solidarity to defend their experts’ judgments; they therefore subject decisions to public scrutiny, which means that these decisions must be made in a publicly accessible form.

This is the situation in which scientists who work in areas where the science/policy boundary is fluid find themselves:

The National Academy of Sciences has accepted the principle that scientists should declare their conflicts of interest and financial holdings before offering policy advice, or even information to the government. And while police inspections of notebooks remain exceptional, the personal and financial interests of scientists and engineers are often considered material, especially in legal and regulatory contexts. Strategies of impersonality must be understood partly as defenses against such suspicions […]. Objectivity means knowledge that does not depend too much on the particular individuals who author it. (Porter 1995: 229)

Measurement and quantification help to reduce the influence of personal biases and idiosyncrasies and they reduce the need to trust the scientist or government official, but often at a cost. Standardizing scientific procedures becomes difficult when their subject matters are not homogeneous, and few domains outside fundamental physics are. Attempts to quantify procedures for treatment and policy decisions, as we find them in evidence-based practices, are currently being transferred to a variety of sciences such as medicine, nursing, psychology, education and social policy. However, they often lack a certain degree of responsiveness to the peculiarities of their subjects and the local conditions to which they are applied (see also section 5.3).

Moreover, the measurement and quantification of characteristics of scientific interest is only half of the story. We also want to describe relations between the quantities and make inferences using statistical analysis. Statistics thus helps to quantify further aspects of scientific work. We will now examine whether or not statistical analysis can proceed in a way free from personal biases and idiosyncrasies—for more detail, see the entry on philosophy of statistics.

4.2 Statistical Evidence

The appraisal of scientific evidence is traditionally regarded as a domain of scientific reasoning where the ideal of scientific objectivity has strong normative force, and where it is also well-entrenched in scientific practice. Episodes such as Galileo’s observations of the moons of Jupiter, Lavoisier’s calcination experiments, and Eddington’s observation of the 1919 eclipse are found in all philosophy of science textbooks because they exemplify how evidence can be persuasive and compelling to scientists with different backgrounds. The crucial question is therefore: can we identify an “objective” concept of scientific evidence that is independent of the personal biases of the experimenter and interpreter?

Inferential statistics—the field that investigates the validity of inferences from data to theory—tries to answer this question. It is extremely influential in modern science, pervading experimental research as well as the assessment and acceptance of our most fundamental theories. For instance, a statistical argument helped to establish the recent discovery of the Higgs Boson. We now compare the main theories of statistical evidence with respect to the objectivity of the claims they produce. They mainly differ with respect to the role of an explicitly subjective interpretation of probability.

Bayesian inference quantifies scientific evidence by means of probabilities that are interpreted as a scientist’s subjective degrees of belief. The Bayesian thus leaves behind Carnap’s (1950) idea that probability is determined by a logical relation between sentences. For example, the prior degree of belief in hypothesis \(H\), written \(p(H)\), can in principle take any value in the interval \([0,1]\). Simultaneously held degrees of belief in different hypotheses are, however, constrained by the laws of probability. After learning evidence \(E\), the degree of belief in \(H\) is changed from its prior probability \(p(H)\) to the conditional degree of belief \(p(H \mid E)\), commonly called the posterior probability of \(H\). Both quantities can be related to each other by means of Bayes’ Theorem.
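
In the notation just introduced, Bayes’ Theorem takes the familiar form

\[ p(H \mid E) = \frac{p(E \mid H)\, p(H)}{p(E \mid H)\, p(H) + p(E \mid \neg H)\, p(\neg H)}, \]

so that the posterior probability of \(H\) is fixed by the prior probabilities of \(H\) and \(\neg H\) together with the likelihoods \(p(E \mid H)\) and \(p(E \mid \neg H)\).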

These days, the Bayesian approach is extremely influential in philosophy and rapidly gaining ground across all scientific disciplines. For quantifying evidence for a hypothesis, Bayesian statisticians almost uniformly use the Bayes factor, that is, the ratio of posterior to prior odds in favor of a hypothesis. The Bayes factor in favor of hypothesis \(H\) against its negation \(\neg H\) in the light of evidence \(E\) can be written as

\[ BF(H, \neg H \mid E) = \frac{p(H \mid E)}{p(\neg H \mid E)} \Big/ \frac{p(H)}{p(\neg H)} = \frac{p(E \mid H)}{p(E \mid \neg H)}, \]

or in other words, as the likelihood ratio between \(H\) and \(\neg H\). The Bayes factor reduces to the likelihoodist conception of evidence (Royall 1997) for the case of two competing point hypotheses. For further discussion of Bayesian measures of evidence, see Good (1950), Sprenger and Hartmann (2019: ch. 1) and the entry on confirmation and evidential support.
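
As a minimal numerical sketch (the data are made up, and the two point hypotheses are chosen arbitrarily), consider a binomial experiment in which the Bayes factor reduces to a likelihood ratio:

    from scipy.stats import binom

    # Made-up data: 7 successes in 10 trials.  Two point hypotheses about the
    # success probability are compared; for point hypotheses the Bayes factor
    # is simply the likelihood ratio and does not depend on the priors.
    k, n = 7, 10
    h1, h2 = 0.5, 0.7

    bayes_factor = binom.pmf(k, n, h2) / binom.pmf(k, n, h1)
    print(round(bayes_factor, 2))    # roughly 2.28: the data favor 0.7 over 0.5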

Unsurprisingly, the idea of measuring scientific evidence in terms of subjective probability has met resistance. For example, the statistician Ronald A. Fisher (1935: 6–7) argued that measuring psychological tendencies cannot be relevant to scientific inquiry, nor sustain claims to objectivity. Indeed, how should scientific objectivity square with subjective degree of belief? Bayesians have responded to this challenge in various ways:

Howson (2000) and Howson and Urbach (2006) consider the objection misplaced. In the same way that deductive logic does not judge the correctness of the premises but just advises you what to infer from them, Bayesian inductive logic provides rational rules for representing uncertainty and making inductive inferences. Choosing the premises (e.g., the prior distributions) “objectively” falls outside the scope of Bayesian analysis.

Convergence or merging-of-opinion theorems guarantee that under certain circumstances, agents with very different initial attitudes who observe the same evidence will obtain similar posterior degrees of belief in the long run. However, they are asymptotic results without direct implications for inference with real-life datasets (see also Earman 1992: ch. 6). In such cases, the choice of the prior matters, and it may be beset with idiosyncratic bias and manifest social values.

Adopting a more modest stance, Sprenger (2018) accepts that Bayesian inference does not achieve the goal of objectivity in the sense of intersubjective agreement (concordant objectivity), or being free of personal values, bias and subjective judgment. However, he argues that competing schools of inference such as frequentist inference face this problem to the same degree, perhaps even worse. Moreover, some features of Bayesian inference (e.g., the transparency about prior assumptions) fit recent, socially oriented conceptions of objectivity that we discuss in section 5.

A radical Bayesian solution to the problem of personal bias is to adopt a principle that radically constrains an agent’s rational degrees of belief, such as the Principle of Maximum Entropy (MaxEnt—Jaynes 1968; Williamson 2010). According to MaxEnt, degrees of belief must be probabilistic and in sync with empirical constraints, but conditional on these constraints, they must be equivocal, that is, as middling as possible. This latter constraint amounts to maximizing the entropy of the probability distribution in question. The MaxEnt approach eliminates various sources of subjective bias at the expense of narrowing down the range of rational degrees of belief. An alternative objective Bayesian solution consists in so-called “objective priors”: prior probabilities that do not represent an agent’s factual attitudes, but are determined by principles of symmetry, mathematical convenience or maximizing the influence of the data on the posterior (e.g., Jeffreys 1939 [1980]; Bernardo 2012).
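
A minimal sketch of the MaxEnt recipe on a toy problem (the constraint value of 4.5 is purely illustrative): among all probability assignments over the six faces of a die that satisfy a given mean, choose the one with maximal entropy.

    import numpy as np
    from scipy.optimize import minimize

    # Toy MaxEnt problem: degrees of belief over the faces of a die, constrained
    # to be probabilistic and to yield a mean face value of 4.5 (illustrative).
    faces = np.arange(1, 7)

    def negative_entropy(p):
        p = np.clip(p, 1e-12, 1.0)
        return float(np.sum(p * np.log(p)))

    constraints = [
        {"type": "eq", "fun": lambda p: np.sum(p) - 1.0},    # normalization
        {"type": "eq", "fun": lambda p: p @ faces - 4.5},    # mean constraint
    ]
    result = minimize(negative_entropy, x0=np.full(6, 1 / 6),
                      bounds=[(0.0, 1.0)] * 6, constraints=constraints)

    # The entropy-maximizing distribution is as equivocal as the constraint allows:
    # it tilts smoothly toward the higher faces rather than singling one of them out.
    print(np.round(result.x, 3))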

Thus, Bayesian inference, which analyzes statistical evidence from the vantage point of rational belief, provides only a partial answer to the question of how to secure scientific objectivity against personal idiosyncrasy.

The frequentist conception of evidence is based on the idea of the statistical test of a hypothesis . Under the influence of the statisticians Jerzy Neyman and Egon Pearson, tests were often regarded as rational decision procedures that minimize the relative frequency of wrong decisions in a hypothetical series of repetitions of a test (hence the name “frequentism”). Rudner’s argument in section 3.2 pointed out the limits of this conception of hypothesis tests: the choice of thresholds for acceptance and rejection (i.e., the acceptable type I and II error rates) may reflect contextual value judgments and personal bias. Moreover, the losses associated with erroneously accepting or rejecting a hypothesis depend on the context of application, which may be unknown to the experimenter.

Alternatively, scientists can restrict themselves to a purely evidential interpretation of hypothesis tests and leave decisions to policy-makers and regulatory agencies. The statistician and biologist R.A. Fisher (1935, 1956) proposed what later became the orthodox quantification of evidence in frequentist statistics. Suppose a “null” or default hypothesis \(H_0\) denotes that an intervention has zero effect. If the observed data are “extreme” under \(H_0\)—i.e., if, assuming \(H_0\), it would have been highly likely to observe a result that agrees better with \(H_0\) than the data actually found—the data provide evidence against the null hypothesis and for the efficacy of the intervention. The epistemological rationale is connected to the idea of severe testing (Mayo 1996): if the intervention were ineffective, we would, in all likelihood, have found data that agree better with the null hypothesis. The strength of evidence against \(H_0\) is measured by the \(p\)-value : the lower it is, the more strongly evidence \(E\) speaks against the null hypothesis \(H_0\).
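For a concrete (and entirely hypothetical) example of this notion of evidence, the following sketch computes a one-sided \(p\)-value for binomial data under the null hypothesis that an intervention has no effect:

```python
from scipy.stats import binom

# Null hypothesis H0: the intervention has no effect, so the "success"
# probability is 0.5.  Hypothetical data: 62 successes in 100 trials.
n, k = 100, 62

# One-sided p-value: the probability, under H0, of a result at least as
# extreme as (i.e., agreeing at least as badly with H0 as) the one observed.
p_value = binom.sf(k - 1, n, 0.5)   # P(X >= k | H0)
print(f"p = {p_value:.4f}")          # about 0.01: conventionally "significant"
```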

Unlike Bayes factors, this concept of statistical evidence does not depend on personal degrees of belief. However, this does not necessarily mean that \(p\)-values are more objective. First, \(p\)-values are usually classified as “non-significant” (\(p > .05\)), “significant” (\(p < .05\)), “highly significant”, and so on. Not only are these thresholds and labels largely arbitrary, they also promote publication bias : non-significant findings are often classified as “failed studies” (i.e., the efficacy of the intervention could not be shown), are rarely published and end up in the proverbial “file drawer”. Much valuable research is thus suppressed. Conversely, significant findings may often occur when the null hypothesis is actually true, especially when researchers have been “hunting for significance”. In fact, researchers have an incentive to keep their \(p\)-values low: the stronger the evidence, the more convincing the narrative, the greater the impact—and the higher the chance of a good publication and career-relevant rewards. Moving the goalposts by “p-hacking” outcomes—for example, by eliminating outliers, reporting selectively or restricting the analysis to a subgroup—evidently biases the research results and compromises the objectivity of experimental research.
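A simple simulation (with entirely hypothetical numbers, not drawn from the studies cited in this section) illustrates how one such practice—reporting whichever subgroup happens to yield significance—inflates the rate of false positive findings:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
n_studies, n_subgroups, n_per_group = 5_000, 5, 40
false_positives = 0

for _ in range(n_studies):
    # The null hypothesis is true by construction: "treatment" and "control"
    # are drawn from the same distribution, so any effect is an artefact.
    treatment = rng.normal(size=(n_subgroups, n_per_group))
    control = rng.normal(size=(n_subgroups, n_per_group))
    # "Hunting for significance": test every subgroup and report any hit.
    if any(ttest_ind(t, c, equal_var=False).pvalue < 0.05
           for t, c in zip(treatment, control)):
        false_positives += 1

print(false_positives / n_studies)   # roughly 0.2 -- far above the nominal 0.05
```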

In particular, such questionable research practices (QRP) increase the type I error rate—the rate at which true null hypotheses are erroneously rejected, i.e., at which spurious effects are accepted—substantially above its nominal 5% level and contribute to publication bias (Bakker et al. 2012). Ioannidis (2005) concludes that “most published research findings are false”: they are the combined result of a low base rate of effective causal interventions, the file drawer effect and the widespread presence of questionable research practices. The frequentist logic of hypothesis testing aggravates the problem because it provides a framework where all these biases can easily enter (Ziliak and McCloskey 2008; Sprenger 2016). These radical conclusions are also supported by empirical findings: in many disciplines, researchers fail to replicate findings by other scientific teams. See section 5.1 for more detail.
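The arithmetic behind Ioannidis’s conclusion can be sketched with a short (purely illustrative) calculation of the share of “significant” findings that reflect a real effect, given a low base rate of true hypotheses:

```python
# Hypothetical numbers chosen only for illustration.
prior = 0.10   # base rate: 10% of tested hypotheses describe a real effect
power = 0.60   # probability that a real effect yields a significant result

def share_of_true_findings(alpha):
    """Positive predictive value: P(real effect | significant result)."""
    true_positives = power * prior
    false_positives = alpha * (1 - prior)
    return true_positives / (true_positives + false_positives)

print(share_of_true_findings(0.05))   # about 0.57 even without any QRPs
print(share_of_true_findings(0.20))   # about 0.25 once QRPs inflate the error rate
```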

Summing up our findings, neither of the two major frameworks of statistical inference manages to eliminate all sources of personal bias and idiosyncrasy. The Bayesian considers subjective assumptions to be an irreducible part of scientific reasoning and sees no harm in making them explicit. The frequentist conception of evidence based on \(p\)-values avoids these explicitly subjective elements, but at the price of a misleading impression of objectivity and frequent abuse in practice. A defense of frequentist inference should, in our opinion, stress that the relatively rigid rules for interpreting statistical evidence facilitate communication and assessment of research results in the scientific community—something that is harder to achieve for a Bayesian. We now turn from specific methods for stating and interpreting evidence to a radical criticism of the idea that there is a rational scientific method.

In his writings of the 1970s, Paul Feyerabend launched a profound attack on the rationality and objectivity of scientific method. His position is exceptional in the philosophical literature since, traditionally, the threat to objective and successful science is located in contextual rather than epistemic values. Feyerabend turns this view upside down: it is the “tyranny” of rational method, and the emphasis on epistemic rather than contextual values, that prevents us from having a science in the service of society. Moreover, he welcomes a diversity of personal, even idiosyncratic, perspectives, thus denying the idea that freedom from personal “bias” is epistemically and socially beneficial.

The starting point of Feyerabend’s criticism of rational method is the thesis that strict epistemic rules such as those expressed by the VFI only suppress an open exchange of ideas, extinguish scientific creativity and prevent a free and truly democratic science. In his classic “Against Method” (1975: chs. 8–13), Feyerabend elaborates on this criticism by examining a famous episode in the history of science. When the Catholic Church objected to Galilean mechanics, it had the better arguments by the standards of seventeenth-century science. The Church’s conservative position was scientifically well supported: Galilei’s telescopes were unreliable for celestial observations, and many well-established phenomena (the absence of fixed-star parallax, the invariance of the laws of motion) could not yet be explained in the heliocentric system. With hindsight, Galilei managed to achieve groundbreaking scientific progress just because he deliberately violated rules of scientific reasoning. Hence Feyerabend’s dictum “Anything goes”: no methodology whatsoever is able to capture the creative and often irrational ways by which science deepens our understanding of the world. Good scientific reasoning cannot be captured by a rational method, contrary to what Carnap, Hempel and Popper postulated.

The drawbacks of an objective, value-free and method-bound view on science and scientific method are not only epistemic. Such a view narrows down our perspective and makes us less free, open-minded, creative, and ultimately, less human in our thinking (Feyerabend 1975: 154). It is therefore neither possible nor desirable to have an objective, value-free science (cf. Feyerabend 1978: 78–79). As a consequence, Feyerabend sees traditional forms of inquiry about our world (e.g., Chinese medicine) on a par with their Western competitors. He denounces appeals to “objective” standards as rhetorical tools for bolstering the epistemic authority of a small intellectual elite (=Western scientists), and as barely disguised statements of preference for one’s own worldview:

there is hardly any difference between the members of a “primitive” tribe who defend their laws because they are the laws of the gods […] and a rationalist who appeals to “objective” standards, except that the former know what they are doing while the latter does not. (1978: 82)

In particular, when discussing other traditions, we often project our own worldview and value judgments into them instead of making an impartial comparison (1978: 80–83). There is no purely rational justification for dismissing other perspectives in favor of the Western scientific worldview—the insistence on our Western approach may be as justified as insisting on absolute space and time after the Theory of Relativity.

The Galilei example also illustrates that personal perspective and idiosyncratic “bias” need not be bad for science. Feyerabend argues further that scientific research is accountable to society and should be kept in check by democratic institutions, and laymen in particular. Their particular perspectives can help to determine the funding agenda and to set ethical standards for scientific inquiry, but also be useful for traditionally value-free tasks such as choosing an appropriate research method and assessing scientific evidence. Feyerabend’s writings on this issue were much influenced by witnessing the Civil Rights Movement in the U.S. and the increasing emancipation of minorities, such as Blacks, Asians and Hispanics.

All this is not meant to say that truth loses its function as a normative concept, nor that all scientific claims are equally acceptable. Rather, Feyerabend advocates an epistemic pluralism that accepts diverse approaches to acquiring knowledge. Rather than defending a narrow and misleading ideal of objectivity, science should respect the diversity of values and traditions that drive our inquiries about the world (1978: 106–107). This would put science back into the role it had during the scientific revolution or the Enlightenment: as a liberating force that fought intellectual and political oppression by the sovereign, the nobility or the clergy. Objections to this view are discussed at the end of section 5.2 .

5. Objectivity as a Feature of Scientific Communities and Their Practices

This section addresses various accounts that regard scientific objectivity essentially as a function of social practices in science and the social organization of the scientific community. All these accounts reject the characterization of scientific objectivity as a function of correspondence between theories and the world, as a feature of individual reasoning practices, or as pertaining to individual studies and experiments (see also Douglas 2011). Instead, they evaluate the objectivity of a collective of studies, as well as the methods and community practices that structure and guide scientific research. More precisely, they adopt a meta-analytic perspective for assessing the reliability of scientific results (section 5.1), and they construct objectivity from a feminist perspective: as an open interchange of mutual criticism, or as being anchored in the “situatedness” of our scientific practices and the knowledge we gain ( section 5.2 ).

The collectivist perspective is especially useful when an entire discipline enters a stage of crisis: its members become convinced that a significant proportion of findings are not trustworthy. A contemporary example of such a situation is the replication crisis , which was briefly mentioned in the previous section and concerns the reproducibility of scientific knowledge claims in a variety of different fields (most prominently: psychology, biology, medicine). Large-scale replication projects have found that many findings long considered an integral part of scientific knowledge failed to replicate in settings that were designed to mimic the original experiment as closely as possible (e.g., Open Science Collaboration 2015). Successful attempts at replicating an experimental result have long been argued to provide evidence of freedom from particular kinds of artefacts and thus of the trustworthiness of the result. Compare the entry on experiment in physics . Likewise, failure to replicate indicates that either the original finding, the result of the replication attempt, or both, are biased—though see John Norton’s (ms., ch. 3—see Other Internet Resources) arguments that the evidential value of (failed) replications crucially depends on researchers’ material background assumptions.

When replication failures in a discipline are particularly significant, one may conclude that the published literature lacks objectivity—at a minimum the discipline fails to inspire trust that its findings are more than artefacts of the researchers’ efforts. Conversely, when observed effects can be replicated in follow-up experiments, a kind of objectivity is reached that goes beyond the ideas of freedom from personal bias, mechanical objectivity, and subject-independent measurement, discussed in section 4.1 .

Freese and Peterson (2018) call this idea statistical objectivity . It is grounded in the view that even the most scrupulous and diligent researchers cannot achieve full objectivity all by themselves. The term “objectivity” instead applies to a collection or population of studies, with meta-analysis (a formal method for aggregating the results of a range of studies) as the “apex of objectivity” (Freese and Peterson 2018: 304; see also Stegenga 2011, 2018). In particular, aggregating studies from different researchers may provide evidence of systematic bias and questionable research practices (QRP) in the published literature. This diagnostic function of meta-analysis for detecting violations of objectivity is enhanced by statistical techniques such as the funnel plot and the \(p\)-curve (Simonsohn et al. 2014).
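As an illustration of the aggregation step (the study effects and standard errors below are invented, and the simple fixed-effect pooling stands in for the richer methods discussed by the authors cited above):

```python
import numpy as np

# Hypothetical effect estimates and standard errors from five studies.
effects = np.array([0.30, 0.12, 0.45, 0.05, 0.22])
ses = np.array([0.10, 0.15, 0.20, 0.08, 0.12])

# Fixed-effect (inverse-variance) meta-analysis: pool the estimates,
# weighting each study by the precision of its estimate.
weights = 1.0 / ses**2
pooled = np.sum(weights * effects) / np.sum(weights)
pooled_se = np.sqrt(1.0 / np.sum(weights))
print(f"pooled effect = {pooled:.3f} +/- {1.96 * pooled_se:.3f}")

# A funnel plot of these numbers (effect size against precision) would then
# be inspected for the asymmetry that signals publication bias.
```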

Apart from this epistemic dimension, research on statistical objectivity also has an activist dimension: methodologists urge researchers to make publicly available essential parts of their research before the data analysis starts, and to make their methods and data sources more transparent. For example, it is conjectured that the replicability (and thus objectivity) of science will increase by making all data available online, by preregistering experiments, and by using the registered reports model for journal articles (i.e., the journal decides on publication before data collection on the basis of the significance of the proposed research as well as the experimental design). The idea is that transparency about the data set and the experimental design will make it easier to stage a replication of an experiment and to assess its methodological quality. Moreover, publicly committing to a data analysis plan beforehand will lower the rate of QRPs and of attempts to accommodate data to hypotheses rather than making proper predictions.

All in all, statistical objectivity moves the discussion of objectivity to the level of populations of studies. There, it takes up and modifies several conceptions of objectivity that we have seen before: most prominently, freedom from subjective bias, where the concern shifts to collective bias and pernicious conventions, and the subject-independent measurement of a physical quantity, which is replaced by the reproducibility of effects.

Traditional notions of objectivity as faithfulness to facts or freedom from contextual values have also been challenged from a feminist perspective. These critiques can be grouped into three major research programs: feminist epistemology, feminist standpoint theory and feminist postmodernism (Crasnow 2013). The program of feminist epistemology explores the impact of sex and gender on the production of scientific knowledge. More precisely, feminist epistemology highlights the epistemic risks resulting from the systematic exclusion of women from the ranks of scientists, and the neglect of women as objects of study. Prominent case studies are the neglect of the female orgasm in biology, testing medical drugs on male participants only, focusing on male specimens when studying the social behavior of primates, and explaining human mating patterns by means of imaginary neolithic societies (e.g., Hrdy 1977; Lloyd 1993, 2005). See also the entry on feminist philosophy of biology .

Often but not always, feminist epistemologists go beyond pointing out what they regard as androcentric bias and reject the value-free ideal altogether—with an eye on the social and moral responsibility of scientific inquiry. They try to show that a value-laden science can also meet important criteria for being epistemically reliable and objective (e.g., Anderson 2004; Kourany 2010). A classical representative of such efforts is Longino’s (1990) contextual empiricism . She reinforces Popper’s insistence that “the objectivity of scientific statements lies in the fact that they can be inter-subjectively tested” (1934 [2002]: 22), but unlike Popper, she conceives scientific knowledge essentially as a social product. Thus, our conception of scientific objectivity must directly engage with the social process that generates knowledge. Longino assigns a crucial function to social systems of criticism in securing the epistemic success of science. Specifically, she develops an epistemology which regards a method of inquiry as “objective to the degree that it permits transformative criticism ” (Longino 1990: 76). For an epistemic community to achieve transformative criticism, there must be:

avenues for criticism : criticism is an essential part of scientific institutions (e.g., peer review);

shared standards : the community must share a set of cognitive values for assessing theories (more on this in section 3.1 );

uptake of criticism : criticism must be able to transform scientific practice in the long run;

equality of intellectual authority : intellectual authority must be shared equally among qualified practitioners.

Longino’s contextual empiricism can be understood as a development of John Stuart Mill’s view that beliefs should never be suppressed, independently of whether they are true or false. Even the most implausible beliefs might be true, and even if they are false, they might contain a grain of truth which is worth preserving or helps to better articulate true beliefs (Mill 1859 [2003: 72]). The underlying intuition is supported by recent empirical research on the epistemic benefits of a diversity of opinions and perspectives (Page 2007). By stressing the social nature of scientific knowledge, and the importance of criticism (e.g., with respect to potential androcentric bias and inclusive practice), Longino’s account fits into the broader project of feminist epistemology.

Standpoint theory undertakes a more radical attack on traditional scientific objectivity. This view develops Marxist ideas to the effect that epistemic position is related to, and a product of, social position. Feminist standpoint theory builds on these ideas but focuses on gender, racial and other social relations. Feminist standpoint theorists and proponents of “situated knowledge” such as Donna Haraway (1988), Sandra Harding (1991, 2015a, 2015b) and Alison Wylie (2003) deny the internal coherence of a view from nowhere: all human knowledge is at base human knowledge and therefore necessarily perspectival. But they argue more than that. Not only is perspectivality the human condition, it is also a good thing to have. This is because perspectives, especially the perspectives of underprivileged classes and groups in society, come along with epistemic benefits. These ideas are controversial but they draw attention to the possibility that attempts to rid science of perspectives might not only be futile but also costly: they prevent scientists from having the epistemic benefits certain standpoints afford and from developing knowledge for marginalized groups in society. The perspectival stance can also explain why criteria for objectivity often vary with context: the relative importance of epistemic virtues is a matter of goals and interests—in other words, standpoint.

By endorsing a perspectival stance, feminist standpoint theory rejects classical elements of scientific objectivity such as neutrality and impartiality (see section 3.1 above). This is a notable difference from feminist epistemology, which is in principle (though not always in practice) compatible with traditional views of objectivity. Feminist standpoint theory is also a political project. For example, Harding (1991, 1993) demands that scientists, their communities and their practices—in other words, the ways through which knowledge is gained—be investigated as rigorously as the object of knowledge itself. This idea, which she refers to as “strong objectivity”, replaces the “weak” conception of objectivity in the empiricist tradition: value-freedom, impartiality, rigorous adherence to methods of testing and inference. Like Feyerabend, Harding integrates a transformation of epistemic standards in science into a broader political project of rendering science more democratic and inclusive. On the other hand, she is exposed to similar objections (see also Haack 2003). Isn’t it grossly exaggerated to identify class, race and gender as important factors in the construction of physical theories? Doesn’t the feminist approach—like social constructivist approaches—lose sight of the particular epistemic qualities of science? Should non-scientists really have as much authority as trained scientists? To whom does the condition of equally shared intellectual authority apply? Nor is it clear—especially in times of fake news and filter bubbles—whether it is always a good idea to subject scientific results to democratic approval. There is no guarantee (arguably there are few good reasons to believe) that democratized or standpoint-based science leads to more reliable theories, or better decisions for society as a whole.

6. Issues in the Special Sciences

So far, everything we have discussed was meant to apply across all, or at least most, of the sciences. In this section we look at a number of specific issues that arise in the social sciences, in economics, and in evidence-based medicine.

There is a long tradition in the philosophy of social science maintaining that there is a gulf in terms of both goals as well as methods between the natural and the social sciences. This tradition, associated with thinkers such as the neo-Kantians Heinrich Rickert and Wilhelm Windelband, the hermeneuticist Wilhelm Dilthey, the sociologist-economist Max Weber, and the twentieth-century hermeneuticists Hans-Georg Gadamer and Michael Oakeshott, holds that unlike the natural sciences whose aim it is to establish natural laws and which proceed by experimentation and causal analysis, the social sciences seek understanding (“ Verstehen ”) of social phenomena, the interpretive examination of the meanings individuals attribute to their actions (Weber 1904 [1949]; Weber 1917 [1949]; Dilthey 1910 [1986]; Windelband 1915; Rickert 1929; Oakeshott 1933; Gadamer 1960 [1989]). See also the entries on hermeneutics and Max Weber .

Understood this way, social science lacks objectivity in more than one sense. One of the more important debates about objectivity in the social sciences concerns the role that value judgments play and, importantly, whether value-laden research entails claims about the desirability of actions. Max Weber held that the social sciences are necessarily value-laden. However, they can achieve some degree of objectivity by keeping out the social researcher’s views about whether agents’ goals are commendable. In a similar vein, contemporary economics can be said to be value-laden because it predicts and explains social phenomena on the basis of agents’ preferences. Nevertheless, economists are adamant that they are not in the business of telling people what they ought to value. Modern economics is thus said to be objective in the Weberian sense of “absence of researchers’ values”—a conception that we discussed in detail in section 3 .

In his widely cited essay “‘Objectivity’ in Social Science and Social Policy” (Weber 1904 [1949]), Weber argued that the idea of an aperspectival social science was meaningless:

There is no absolutely objective scientific analysis of […] “social phenomena” independent of special and “one-sided” viewpoints according to which expressly or tacitly, consciously or unconsciously they are selected, analyzed and organized for expository purposes. (1904 [1949: 72])

All knowledge of cultural reality, as may be seen, is always knowledge from particular points of view. (1904 [1949: 81])

The reason for this is twofold. First, social reality is too complex to admit of full description and explanation. So we have to select. But, perhaps in contrast to the natural sciences, we cannot just select those aspects of the phenomena that fall under universal natural laws and treat everything else as “unintegrated residues” (1904 [1949: 73]). This is because, second, in the social sciences we want to understand social phenomena in their individuality, that is, in their unique configurations that have significance for us.

Values solve a selection problem. They tell us what research questions we ought to address because they inform us about the cultural importance of social phenomena:

Only a small portion of existing concrete reality is colored by our value-conditioned interest and it alone is significant to us. It is significant because it reveals relationships which are important to us due to their connection with our values. (1904 [1949: 76])

It is important to note that Weber did not think that social and natural science were different in kind, as Dilthey and others did. Social science too examines the causes of phenomena of interest, and natural science too often seeks to explain natural phenomena in their individual constellations. The role of causal laws is different in the two fields, however. Whereas establishing a causal law is often an end in itself in the natural sciences, in the social sciences laws play an attenuated and accompanying role as mere means to explain cultural phenomena in their uniqueness.

Nevertheless, for Weber social science remains objective in at least two ways. First, once research questions of interest have been settled, answers about the causes of culturally significant phenomena do not depend on the idiosyncrasies of an individual researcher:

But it obviously does not follow from this that research in the cultural sciences can only have results which are “subjective” in the sense that they are valid for one person and not for others. […] For scientific truth is precisely what is valid for all who seek the truth. (Weber 1904 [1949: 84], emphasis original)

The claims of social science can therefore be objective in our third sense ( see section 4 ). Moreover, by determining that a given phenomenon is “culturally significant” a researcher reflects on whether or not a practice is “meaningful” or “important”, and not whether or not it is commendable: “Prostitution is a cultural phenomenon just as much as religion or money” (1904 [1949: 81]). An important implication of this view came to the fore in the so-called “ Werturteilsstreit ” (quarrel concerning value judgments) of the early 1900s. In this debate, Weber maintained against the “socialists of the lectern” around Gustav Schmoller the position that social scientists qua scientists should not be directly involved in policy debates because it was not the aim of science to examine the appropriateness of ends. Given a policy goal, a social scientist could make recommendations about effective strategies to reach the goal; but social science was to be value-free in the sense of not taking a stance on the desirability of the goals themselves. This leads us to our conception of objectivity as freedom from value judgments.

Contemporary mainstream economists hold a view concerning objectivity that mirrors Max Weber’s (see above). On the one hand, it is clear that value judgments are at the heart of economic theorizing. “Preferences” are a key concept of rational choice theory, the main theory in contemporary mainstream economics. Preferences are evaluations. If an individual prefers \(A\) to \(B\), she values \(A\) higher than \(B\) (Hausman 2012). Thus, to the extent that economists predict and explain market behavior in terms of rational choice theory, they predict and explain market behavior in a way laden with value judgments.

However, economists are not themselves supposed to take a stance about whether or not whatever individuals value is also “objectively” good in a stronger sense:

[…] that an agent is rational from [rational choice theory]’s point of view does not mean that the course of action she will choose is objectively optimal. Desires do not have to align with any objective measure of “goodness”: I may want to risk swimming in a crocodile-infested lake; I may desire to smoke or drink even though I know it harms me. Optimality is determined by the agent’s desires, not the converse. (Paternotte 2011: 307–8)

In a similar vein, Gul and Pesendorfer write:

However, standard economics has no therapeutic ambition, i.e., it does not try to evaluate or improve the individual’s objectives. Economics cannot distinguish between choices that maximize happiness, choices that reflect a sense of duty, or choices that are the response to some impulse. Moreover, standard economics takes no position on the question of which of those objectives the agent should pursue. (Gul and Pesendorfer 2008: 8)

According to the standard view, all that rational choice theory demands is that people’s preferences are (internally) consistent; it has no business telling people what they ought to prefer, or whether their preferences are consistent with external norms or values. Economics is thus value-laden, but laden with the values of the agents whose behavior it seeks to predict and explain, and not with the values of those who seek to predict and explain this behavior.
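A toy sketch of what “(internal) consistency” amounts to—read here as the absence of preference cycles, with a made-up agent and options rather than anything from the literature cited above—may be helpful:

```python
def is_consistent(prefs):
    """Check that a strict preference relation contains no cycles,
    i.e., that no option ends up preferred to itself."""
    closure = set(prefs)                     # transitive closure of the relation
    changed = True
    while changed:
        changed = False
        for (a, b) in list(closure):
            for (c, d) in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return all(a != b for (a, b) in closure)

# Hypothetical strict preferences, encoded as (preferred, dispreferred) pairs.
print(is_consistent({("A", "B"), ("B", "C")}))              # True: no cycle
print(is_consistent({("A", "B"), ("B", "C"), ("C", "A")}))  # False: A > B > C > A
```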

Whether or not social science, and economics in particular, can be objective in this—Weber’s and the contemporary economists’—sense is controversial. On the one hand, there are some reasons to believe that rational choice theory (which is at work not only in economics but also in political science and other social sciences) cannot be applied to empirical phenomena without referring to external norms or values (Sen 1993; Reiss 2013).

On the other hand, it is not clear that economists and other social scientists qua social scientists shouldn’t participate in a debate about social goals. For one thing, trying to do welfare analysis in the standard Weberian way tends to obscure rather than to eliminate normative commitments (Putnam and Walsh 2007). Obscuring value judgments can be detrimental to the social scientist as policy adviser because it will hamper rather than promote trust in social science. For another, economists are in a prime position to contribute to ethical debates, for a variety of reasons, and should therefore take this responsibility seriously (Atkinson 2001).

The same demands that called for “mechanical objectivity” in the natural sciences and for quantification in the social and policy sciences in the nineteenth and mid-twentieth centuries are responsible for a more recent movement in biomedical research, which has, more recently still, swept into contemporary social science and policy. Early proponents of so-called “evidence-based medicine” made plain their intention to downplay the “human element” in medicine:

Evidence-based medicine de-emphasizes intuition, unsystematic clinical experience, and pathophysiological rationale as sufficient grounds for clinical decision making and stresses the examination of evidence from clinical research. (Guyatt et al. 1992: 2420)

To call the new movement “evidence-based” is, strictly speaking, a misnomer, as intuition, clinical experience and pathophysiological rationale can certainly constitute evidence. But proponents of evidence-based practices have a much narrower concept of evidence in mind: analyses of the results of randomized controlled trials (RCTs). This movement is now very strong in biomedical research, development economics and a number of areas of social science, especially psychology, education and social policy, above all in the English-speaking world.

The goal is to replace subjective (biased, error-prone, idiosyncratic) judgments by mechanically objective methods. But, as in other areas, attempting to mechanize inquiry can lead to reduced accuracy and utility of the results.

Causal relations in the social and biomedical sciences hold on account of highly complex arrangements of factors and conditions. Whether, for instance, a substance is toxic depends on details of the metabolic system of the population ingesting it; whether an educational policy is effective depends on the constellation of factors that affect the students’ learning progress. If an RCT was conducted successfully, the conclusion about the effectiveness of the treatment (or toxicity of a substance) under test is certain for the particular arrangement of factors and conditions of the trial (Cartwright 2007). But unlike the RCT itself, many of whose aspects can be (relatively) mechanically implemented, applying the result to a new setting (recommending a treatment to a patient, for instance) always involves subjective judgments of the kind proponents of evidence-based practices seek to avoid—such as judgments about the similarity of the test to the target or policy population.

On the other hand, RCTs can be regarded as a “debiasing procedure” because they prevent researchers from allocating treatments to patients according to their personal interests, so that the healthiest (or smartest or…) subjects get the researcher’s favorite therapy. While unbalanced allocations can certainly happen by chance, randomization still provides some warrant that the allocation was not done on purpose with a view to promoting somebody’s interests. A priori, the experimental procedure is thus more impartial with respect to the interests at stake. It has thus been argued that RCTs in medicine, while no guarantor of the best outcomes, were adopted by the U.S. Food and Drug Administration (FDA) to different degrees during the 1960s and 1970s in order to regain public trust in its decisions about treatments, which it had lost due to the thalidomide and other scandals (Teira and Reiss 2013; Teira 2010). It is important to notice, however, that randomization is at best effective with respect to one kind of bias, viz. selection bias. Other important epistemic concerns are not addressed by the procedure but should not be ignored (Worrall 2002).
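A minimal sketch of the debiasing point (the subject identifiers are hypothetical, and real trials use more sophisticated allocation schemes such as blocking or stratification):

```python
import random

random.seed(42)
subjects = [f"patient_{i:02d}" for i in range(1, 21)]   # hypothetical subject IDs

# Randomized allocation: neither the researcher's interests nor the patients'
# prognoses determine who receives the treatment.
random.shuffle(subjects)
treatment_group, control_group = subjects[:10], subjects[10:]

print("treatment:", treatment_group)
print("control:  ", control_group)
```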

In sections 2–5, we have encountered various concepts of scientific objectivity and their limitations. This prompts the question of how unified (or disunified) scientific objectivity is as a concept: Is there something substantive shared by all of these analyses? Or is objectivity, as Heather Douglas (2004) puts it, an “irreducibly complex” concept?

Douglas defends pluralism about scientific objectivity and distinguishes three areas of application of the concept: (1) interaction of humans with the world, (2) individual reasoning processes, (3) social processes in science. Within each area, there are various distinct senses which are again irreducible to each other and do not have a common core meaning. This does not mean that the senses are unrelated; they share a complex web of relationships and can also support each other—for example, eliminating values from reasoning may help to achieve procedural objectivity. For Douglas, reducing objectivity to a single core meaning would be a simplification without benefits; instead of a complex web of relations between different senses of objectivity we would obtain an impoverished concept out of touch with scientific practice. Similar arguments and pluralist accounts can be found in Megill (1994), Janack (2002) and Padovani et al. (2015)—see also Axtell (2016).

It has been argued, however, that pluralist approaches give up too quickly on the idea that the different senses of objectivity share one or several important common elements. As we have seen in sections 4.1 and 5.1 , scientific objectivity and trust in science are closely connected. Scientific objectivity is desirable because, to the extent that science is objective, we have reason to trust scientists, their results and recommendations (cf. Fine 1998: 18). Thus, perhaps what unifies the different senses of objectivity is that each describes a feature of scientific practice that is able to inspire trust in science.

Building on this idea, Inkeri Koskinen has recently argued that it is in fact not trust but reliance that we are after (Koskinen forthcoming). Trust is something that can be betrayed, but only individuals can betray trust, whereas objectivity also pertains to institutions, practices, results, and so on. We call scientific institutions, practices, results, etc. objective to the extent that we have reasons to rely on them. The analysis does not stop here, however. There is a distinct view about objectivity that lies behind Daston and Galison’s historical epistemology of the concept and has been defended by Ian Hacking: that objectivity is not a—positive—virtue but rather the absence of this or that vice (Hacking 2015: 26). Speaking of objectivity in imaging, for instance, Daston and Galison write that the goal is to

let the specimen appear without that distortion characteristic of the observer’s personal tastes, commitments, or ambitions. (Daston and Galison 2007: 121)

Koskinen picks up this idea of objectivity as absence of vice and argues that it is specifically for the averting of epistemic risks that the term is reserved. Epistemic risks comprise “any risk of epistemic error that arises anywhere during knowledge practices” (Biddle and Kukla 2017: 218), such as the risk of having mistaken beliefs, the risk of errors in reasoning and risks related to operationalization, concept formation, and model choice. Koskinen argues that only those epistemic risks that relate to failings of scientists as human beings are relevant to objectivity (Koskinen forthcoming: 13):

For instance, when the results of an experiment are incorrect because of malfunctioning equipment, we do not worry about objectivity—we just say that the results should not be taken into account. [...] So it is only when the epistemic risk is related to our own failings, and is hard to avert, that we start talking about objectivity. Illusions, subjectivity, idiosyncrasies, and collective biases are important epistemic risks arising from our imperfections as epistemic agents.

Koskinen understands her account as a response to Hacking’s (2015) criticism that we should stop talking about objectivity altogether. According to Hacking, “objectivity” is an “elevator” or second-level word, similar to “true” or “real”—“Instead of saying that the cat is on the mat, we move up one story and say that it is true that the cat is on the mat” (2015: 20). He recommends sticking to ground-level questions and worrying about whether specific sources of error have been controlled. (A similar request to eliminate the labels “objective” and “subjective” in statistical inference has been advanced by Gelman and Hennig (2017).) By focusing on averting specific epistemic risks, Koskinen’s account does precisely that. Koskinen argues that a unified account of objectivity as averting epistemic risks takes into account Hacking’s negative stance and at the same time explains important features of the concept—for example, why objectivity does not imply certainty and why it varies with context.

The strong point of this account is that none of the threats to any particular analysis puts scientific objectivity at risk. We can (and in fact do) rely on scientific practices that represent the world from a perspective and where non-epistemic values affect outcomes and decisions. What is left open by Koskinen’s account is the normative question of what a scientist who cares about her experiments and inferences being objective should actually do. That is, the philosophical ideas we have reviewed in this section stay mainly on the descriptive level and do not give actual guidelines to working scientists. Connecting the abstract philosophical analysis to day-to-day work in science remains an open problem.

So is scientific objectivity desirable? Is it attainable? That, as we have seen, depends crucially on how the term is understood. We have looked in detail at four different conceptions of scientific objectivity: faithfulness to facts, value-freedom, freedom from personal biases, and features of community practices. In each case, there are at least some reasons to believe that either science cannot deliver full objectivity in this sense, or that it would not be a good thing to try to do so, or both. Does this mean we should give up the idea of objectivity in science?

We have shown that it is hard to define scientific objectivity in terms of a view from nowhere, value freedom, or freedom from personal bias. It is a lot harder to say anything positive about the matter. Perhaps it is related to a thorough critical attitude concerning claims and findings, as Popper thought. Perhaps it is the fact that many voices are heard, equally respected and subjected to accepted standards, as Longino defends. Perhaps it is something else altogether, or a combination of several factors discussed in this article.

However, one should not (as yet) throw out the baby with the bathwater. Like those who defend a particular explication of scientific objectivity, the critics struggle to explain what makes science objective, trustworthy and special. For instance, our discussion of the value-free ideal (VFI) revealed that alternatives to the VFI are at least as problematic as the VFI itself, and that the VFI may, with all its inadequacies, still be a useful heuristic for fostering scientific integrity and objectivity. Similarly, although entirely “unbiased” scientific procedures may be impossible, there are many mechanisms scientists can adopt for protecting their reasoning against undesirable forms of bias, e.g., choosing an appropriate method of statistical inference, being transparent about different stages of the research process and avoiding certain questionable research practices.

Whatever it is, it should come as no surprise that finding a positive characterization of what makes science objective is hard. If we knew an answer, we would have done no less than solve the problem of induction (because we would know what procedures or forms of organization are responsible for the success of science). Work on this problem is an ongoing project, and so is the quest for understanding scientific objectivity.

  • Anderson, Elizabeth, 2004, “Uses of Value Judgments in Science: A General Argument, with Lessons from a Case Study of Feminist Research on Divorce”, Hypatia , 19(1): 1–24. doi:10.1111/j.1527-2001.2004.tb01266.x
  • Atkinson, Anthony B., 2001, “The Strange Disappearance of Welfare Economics”, Kyklos , 54(2‐3): 193–206. doi:10.1111/1467-6435.00148
  • Axtell, Guy, 2016, Objectivity , Cambridge: Polity Press.
  • Bakker, Marjan, Annette van Dijk, and Jelte M. Wicherts, 2012, “The Rules of the Game Called Psychological Science”, Perspectives on Psychological Science , 7(6): 543–554. doi:10.1177/1745691612459060
  • Bernardo, J.M., 2012, “Integrated Objective Bayesian Estimation and Hypothesis Testing”, in Bayesian Statistics 9: Proceedings of the Ninth Valencia Meeting , J.M. Bernardo et al. (eds.), Oxford: Oxford University Press, 1–68.
  • Betz, Gregor, 2013, “In Defence of the Value Free Ideal”, European Journal for Philosophy of Science , 3(2): 207–220. doi:10.1007/s13194-012-0062-x
  • Biddle, Justin B., 2013, “State of the Field: Transient Underdetermination and Values in Science”, Studies in History and Philosophy of Science Part A , 44(1): 124–133. doi:10.1016/j.shpsa.2012.09.003
  • Biddle, Justin B. and Rebecca Kukla, 2017, “The Geography of Epistemic Risk”, in Exploring Inductive Risk: Case Studies of Values in Science , Kevin C. Elliott and Ted Richards (eds.), New York: Oxford University Press, 215–238.
  • Bloor, David, 1982, “Durkheim and Mauss Revisited: Classification and the Sociology of Knowledge”, Studies in History and Philosophy of Science Part A , 13(4): 267–297. doi:10.1016/0039-3681(82)90012-7
  • Braithwaite, R. B., 1953, Scientific Explanation , Cambridge: Cambridge University Press.
  • Carnap, Rudolf, 1950 [1962], Logical Foundations of Probability , second edition, Chicago: University of Chicago Press.
  • Cartwright, Nancy, 2007, “Are RCTs the Gold Standard?”, BioSocieties , 2(1): 11–20. doi:10.1017/S1745855207005029
  • Chang, Hasok, 2004, Inventing Temperature: Measurement and Scientific Progress , Oxford: Oxford University Press. doi:10.1093/0195171276.001.0001
  • Churchman, C. West, 1948, Theory of Experimental Inference , New York: Macmillan.
  • Collins, H. M., 1985, Changing Order: Replication and Induction in Scientific Practice , Chicago, IL: University of Chicago Press.
  • –––, 1994, “A Strong Confirmation of the Experimenters’ Regress”, Studies in History and Philosophy of Science Part A , 25(3): 493–503. doi:10.1016/0039-3681(94)90063-9
  • Cranor, Carl F., 1993, Regulating Toxic Substances: A Philosophy of Science and the Law , New York: Oxford University Press.
  • Crasnow, Sharon, 2013, “Feminist Philosophy of Science: Values and Objectivity: Feminist Philosophy of Science”, Philosophy Compass , 8(4): 413–423. doi:10.1111/phc3.12023
  • Daston, Lorraine, 1992, “Objectivity and the Escape from Perspective”, Social Studies of Science , 22(4): 597–618. doi:10.1177/030631292022004002
  • Daston, Lorraine and Peter Galison, 1992, “The Image of Objectivity”, Representations , 40(special issue: Seeing Science): 81–128. doi:10.2307/2928741
  • –––, 2007, Objectivity , Cambridge, MA: MIT Press.
  • Dilthey, Wilhelm, 1910 [1981], Der Aufbau der geschichtlichen Welt in den Geisteswissenschaften , Frankfurt am Main: Suhrkamp.
  • Dorato, Mauro, 2004, “Epistemic and Nonepistemic Values in Science”, in Machamer and Wolters 2004: 52–77.
  • Douglas, Heather E., 2000, “Inductive Risk and Values in Science”, Philosophy of Science , 67(4): 559–579. doi:10.1086/392855
  • –––, 2004, “The Irreducible Complexity of Objectivity”, Synthese , 138(3): 453–473. doi:10.1023/B:SYNT.0000016451.18182.91
  • –––, 2009, Science, Policy, and the Value-Free Ideal , Pittsburgh, PA: University of Pittsburgh Press.
  • –––, 2011, “Facts, Values, and Objectivity”, Jarvie and Zamora Bonilla 2011: 513–529.
  • Duhem, Pierre Maurice Marie, 1906 [1954], La théorie physique. Son objet et sa structure , Paris: Chevalier et Riviere; translated by Philip P. Wiener, The Aim and Structure of Physical Theory , Princeton, NJ: Princeton University Press, 1954.
  • Dupré, John, 2007, “Fact and Value”, in Kincaid, Dupré, and Wylie 2007: 24–71.
  • Earman, John, 1992, Bayes or Bust? A Critical Examination of Bayesian Confirmation Theory , Cambridge, MA: The MIT Press.
  • Elliott, Kevin C., 2011, “Direct and Indirect Roles for Values in Science”, Philosophy of Science , 78(2): 303–324. doi:10.1086/659222
  • Feyerabend, Paul K., 1962, “Explanation, Reduction and Empiricism”, in H. Feigl and G. Maxwell (ed.), Scientific Explanation, Space, and Time , (Minnesota Studies in the Philosophy of Science, 3), Minneapolis, MN: University of Minnesota Press, pp. 28–97.
  • –––, 1975, Against Method , London: Verso.
  • –––, 1978, Science in a Free Society , London: New Left Books.
  • Fine, Arthur, 1998, “The Viewpoint of No-One in Particular”, Proceedings and Addresses of the American Philosophical Association , 72(2): 7. doi:10.2307/3130879
  • Fisher, Ronald Aylmer, 1935, The Design of Experiments , Edinburgh: Oliver and Boyd.
  • –––, 1956, Statistical Methods and Scientific Inference , New York: Hafner.
  • Franklin, Allan, 1994, “How to Avoid the Experimenters’ Regress”, Studies in History and Philosophy of Science Part A , 25(3): 463–491. doi:10.1016/0039-3681(94)90062-0
  • –––, 1997, “Calibration”, Perspectives on Science , 5(1): 31–80.
  • Freese, Jeremy and David Peterson, 2018, “The Emergence of Statistical Objectivity: Changing Ideas of Epistemic Vice and Virtue in Science”, Sociological Theory , 36(3): 289–313. doi:10.1177/0735275118794987
  • Gadamer, Hans-Georg, 1960 [1989], Wahrheit und Methode , Tübingen: Mohr. Translated as Truth and Method , 2nd edition, Joel Weinsheimer and Donald G. Marshall (trans), New York, NY: Crossroad, 1989.
  • Gelman, Andrew and Christian Hennig, 2017, “Beyond Subjective and Objective in Statistics”, Journal of the Royal Statistical Society: Series A (Statistics in Society) , 180(4): 967–1033. doi:10.1111/rssa.12276
  • Giere, Ronald N., 2006, Scientific Perspectivism , Chicago, IL: University of Chicago Press.
  • Good, Irving John, 1950, Probability and the Weighing of Evidence , London: Charles Griffin.
  • Gul, Faruk and Wolfgang Pesendorfer, 2008, “The Case for Mindless Economics”, in The Foundations of Positive and Normative Economics: a Handbook , Andrew Caplin and Andrew Schotter (eds), New York, NY: Oxford University Press, pp. 3–39.
  • Guyatt, Gordon, John Cairns, David Churchill, Deborah Cook, Brian Haynes, Jack Hirsh, Jan Irvine, Mark Levine, Mitchell Levine, Jim Nishikawa, et al., 1992, “Evidence-Based Medicine: A New Approach to Teaching the Practice of Medicine”, JAMA: The Journal of the American Medical Association , 268(17): 2420–2425. doi:10.1001/jama.1992.03490170092032
  • Haack, Susan, 2003, Defending Science—Within Reason: Between Scientism and Cynicism , Amherst, NY: Prometheus Books.
  • Hacking, Ian, 1965, Logic of Statistical Inference , Cambridge: Cambridge University Press. doi:10.1017/CBO9781316534960
  • –––, 2015, “Let’s Not Talk About Objectivity”, in Padovani, Richardson, and Tsou 2015: 19–33. doi:10.1007/978-3-319-14349-1_2
  • Hanson, Norwood Russell, 1958, Patterns of Discovery: An Inquiry into the Conceptual Foundations of Science , Cambridge: Cambridge University Press.
  • Haraway, Donna, 1988, “Situated Knowledges: The Science Question in Feminism and the Privilege of Partial Perspective”, Feminist Studies , 14(3): 575–599. doi:10.2307/3178066
  • Harding, Sandra, 1991, Whose Science? Whose Knowledge? Thinking from Women’s Lives , Ithaca, NY: Cornell University Press.
  • –––, 1993, “Rethinking Standpoint Epistemology: What is Strong Objectivity?”, in Feminist Epistemologies , Linda Alcoff and Elizabeth Potter (ed.), New York, NY: Routledge, 49–82.
  • –––, 2015a, Objectivity and Diversity: Another Logic of Scientific Research , Chicago: University of Chicago Press.
  • –––, 2015b, “After Mr. Nowhere: What Kind of Proper Self for a Scientist?”, Feminist Philosophy Quarterly , 1(1): 1–22. doi:10.5206/fpq/2015.1.2
  • Hausman, Daniel M., 2012, Preference, Value, Choice, and Welfare , New York: Cambridge University Press. doi:10.1017/CBO9781139058537
  • Hempel, Carl G., 1965, Aspects of Scientific Explanation , New York: The Free Press.
  • Hesse, Mary B., 1980, Revolutions and Reconstructions in the Philosophy of Science , Bloomington, IN: University of Indiana Press.
  • Howson, Colin, 2000, Hume’s Problem: Induction and the Justification of Belief , Oxford: Oxford University Press.
  • Howson, Colin and Peter Urbach, 2006, Scientific Reasoning: The Bayesian Approach , third edition, La Salle, IL: Open Court.
  • Hrdy, Sarah Blaffer, 1977, The Langurs of Abu: Female and Male Strategies of Reproduction , Cambridge, MA: Harvard University Press.
  • Ioannidis, John P. A., 2005, “Why Most Published Research Findings Are False”, PLoS Medicine , 2(8): e124. doi:10.1371/journal.pmed.0020124
  • Janack, Marianne, 2002, “Dilemmas of Objectivity”, Social Epistemology , 16(3): 267–281. doi:10.1080/0269172022000025624
  • Jarvie, Ian C. and Jesús P. Zamora Bonilla (eds.), 2011, The SAGE Handbook of the Philosophy of Social Sciences , London: SAGE.
  • Jaynes, Edwin T., 1968, “Prior Probabilities”, IEEE Transactions on Systems Science and Cybernetics , 4(3): 227–241. doi:10.1109/TSSC.1968.300117
  • Jeffrey, Richard C., 1956, “Valuation and Acceptance of Scientific Hypotheses”, Philosophy of Science , 23(3): 237–246. doi:10.1086/287489
  • Jeffreys, Harold, 1939 [1980], Theory of Probability , third edition, Oxford: Oxford University Press.
  • Kelvin, Lord (William Thomson), 1883, “Electrical Units of Measurement”, Lecture to the Institution of Civil Engineers on 3 May 1883, reprinted in 1889, Popular Lectures and Addresses , Vol. I, London: MacMillan and Co., p. 73.
  • Kincaid, Harold, John Dupré, and Alison Wylie (eds.), 2007, Value-Free Science?: Ideals and Illusions , Oxford: Oxford University Press. doi:10.1093/acprof:oso/9780195308969.001.0001
  • Kitcher, Philip, 2011a, Science in a Democratic Society , Amherst, NY: Prometheus Books.
  • –––, 2011b, The Ethical Project , Cambridge, MA: Harvard University Press.
  • Koskinen, Inkeri, forthcoming, “Defending a Risk Account of Scientific Objectivity”, The British Journal for the Philosophy of Science , first online: 3 August 2018. doi:10.1093/bjps/axy053
  • Kourany, Janet A., 2010, Philosophy of Science after Feminism , Oxford: Oxford University Press.
  • Kuhn, Thomas S., 1962 [1970], The Structure of Scientific Revolutions , second edition, Chicago: University of Chicago Press.
  • –––, 1977, “Objectivity, Value Judgment, and Theory Choice”, in his The Essential Tension. Selected Studies in Scientific Tradition and Change , Chicago: University of Chicago Press: 320–39.
  • Lacey, Hugh, 1999, Is Science Value-Free? Values and Scientific Understanding , London: Routledge.
  • –––, 2002, “The Ways in Which the Sciences Are and Are Not Value Free”, in In the Scope of Logic, Methodology and Philosophy of Science: Volume Two of the 11th International Congress of Logic, Methodology and Philosophy of Science, Cracow, August 1999 , Peter Gärdenfors, Jan Woleński, and Katarzyna Kijania-Placek (eds.), Dordrecht: Springer Netherlands, 519–532. doi:10.1007/978-94-017-0475-5_9
  • Laudan, Larry, 1984, Science and Values: An Essay on the Aims of Science and Their Role in Scientific Debate , Berkeley/Los Angeles: University of California Press.
  • Levi, Isaac, 1960, “Must the Scientist Make Value Judgments?”, The Journal of Philosophy , 57(11): 345–357. doi:10.2307/2023504
  • Lloyd, Elisabeth A., 1993, “Pre-Theoretical Assumptions in Evolutionary Explanations of Female Sexuality”, Philosophical Studies , 69(2–3): 139–153. doi:10.1007/BF00990080
  • –––, 2005, The Case of the Female Orgasm: Bias in the Science of Evolution , Cambridge, MA: Harvard University Press.
  • Longino, Helen E., 1990, Science as Social Knowledge: Values and Objectivity in Scientific Inquiry , Princeton, NJ: Princeton University Press.
  • –––, 1996, “Cognitive and Non-Cognitive Values in Science: Rethinking the Dichotomy”, in Feminism, Science, and the Philosophy of Science , Lynn Hankinson Nelson and Jack Nelson (eds.), Dordrecht: Springer Netherlands, 39–58. doi:10.1007/978-94-009-1742-2_3
  • Machamer, Peter and Gereon Wolters (eds.), 2004, Science, Values and Objectivity , Pittsburgh: Pittsburgh University Press.
  • Mayo, Deborah G., 1996, Error and the Growth of Experimental Knowledge , Chicago & London: The University of Chicago Press.
  • McMullin, Ernan, 1982, “Values in Science”, PSA: Proceedings of the Biennial Meeting of the Philosophy of Science Association 1982 , 3–28.
  • –––, 2009, “The Virtues of a Good Theory”, in The Routledge Companion to Philosophy of Science , Martin Curd and Stathis Psillos (eds), London: Routledge.
  • Megill, Allan, 1994, “Introduction: Four Senses of Objectivity”, in Rethinking Objectivity , Allan Megill (ed.), Durham, NC: Duke University Press, 1–20.
  • Mill, John Stuart, 1859 [2003], On Liberty , New Haven and London: Yale University Press.
  • Mitchell, Sandra D., 2004, “The Prescribed and Proscribed Values in Science Policy”, in Machamer and Wolters 2004: 245–255.
  • Nagel, Thomas, 1986, The View From Nowhere , New York, NY: Oxford University Press.
  • Nixon, Richard, 1969, “Special Message to the Congress on Social Security”, 25 September 1969. [ Nixon 1969 available online ]
  • Norton, John D., 2003, “A Material Theory of Induction”, Philosophy of Science , 70(4): 647–670. doi:10.1086/378858
  • –––, 2008, “Must Evidence Underdetermine Theory?”, in The Challenge of the Social and the Pressure of Practice , Martin Carrier, Don Howard and Janet Kourany (eds), Pittsburgh, PA: Pittsburgh University Press: 17–44.
  • Oakeshott, Michael, 1933, Experience and Its Modes , Cambridge: Cambridge University Press.
  • Okruhlik, Kathleen, 1994, “Gender and the Biological Sciences”, Canadian Journal of Philosophy Supplementary Volume , 20: 21–42. doi:10.1080/00455091.1994.10717393
  • Open Science Collaboration, 2015, “Estimating the Reproducibility of Psychological Science”, Science , 349(6251): aac4716. doi:10.1126/science.aac4716
  • Padovani, Flavia, Alan Richardson, and Jonathan Y. Tsou (eds.), 2015, Objectivity in Science: New Perspectives from Science and Technology Studies , (Boston Studies in the Philosophy and History of Science 310), Cham: Springer International Publishing. doi:10.1007/978-3-319-14349-1
  • Page, Scott E., 2007, The Difference: How the Power of Diversity Creates Better Groups, Firms, Schools, and Societies , Princeton, NJ: Princeton University Press.
  • Paternotte, Cédric, 2011, “Rational Choice Theory”, in The SAGE Handbook of The Philosophy of Social Sciences , Jarvie and Zamora Bonilla 2011: 307–321.
  • Popper, Karl R., 1934 [2002], Logik der Forschung , Vienna: Julius Springer. Translated as The Logic of Scientific Discovery , London: Routledge.
  • –––, 1963, Conjectures and Refutations: The Growth of Scientific Knowledge , New York: Harper.
  • –––, 1972, Objective Knowledge: An Evolutionary Approach , Oxford: Oxford University Press.
  • Porter, Theodore M., 1995, Trust in Numbers: The Pursuit of Objectivity in Science and Public Life , Princeton, NJ: Princeton University Press.
  • Putnam, Hilary, 2002, The Collapse of the Fact/Value Dichotomy and Other Essays , Cambridge, MA: Harvard University Press.
  • Putnam, Hilary and Vivian Walsh, 2007, “A Response to Dasgupta”, Economics and Philosophy , 23(3): 359–364. doi:10.1017/S026626710700154X
  • Reichenbach, Hans, 1938, “On Probability and Induction”, Philosophy of Science , 5(1): 21–45. doi:10.1086/286483
  • Reiss, Julian, 2008, Error in Economics: The Methodology of Evidence-Based Economics , London: Routledge.
  • –––, 2010, “In Favour of a Millian Proposal to Reform Biomedical Research”, Synthese , 177(3): 427–447. doi:10.1007/s11229-010-9790-7
  • –––, 2013, Philosophy of Economics: A Contemporary Introduction , New York, NY: Routledge.
  • –––, 2020, “What Are the Drivers of Induction? Towards a Material Theory+”, Studies in History and Philosophy of Science Part A , 83: 8–16.
  • Resnik, David B., 2007, The Price of Truth: How Money Affects the Norms of Science , Oxford: Oxford University Press.
  • Rickert, Heinrich, 1929, Die Grenzen der naturwissenschaftlichen Begriffsbildung. Eine logische Einleitung in die historischen Wissenschaften , 6th edition, Tübingen: Mohr Siebeck. First edition published in 1902.
  • Royall, Richard, 1997, Scientific Evidence: A Likelihood Paradigm , London: Chapman & Hall.
  • Rudner, Richard, 1953, “The Scientist qua Scientist Makes Value Judgments”, Philosophy of Science , 20(1): 1–6. doi:10.1086/287231
  • Ruphy, Stéphanie, 2006, “‘Empiricism All the Way down’: A Defense of the Value-Neutrality of Science in Response to Helen Longino’s Contextual Empiricism”, Perspectives on Science , 14(2): 189–214. doi:10.1162/posc.2006.14.2.189
  • Sen, Amartya, 1993, “Internal Consistency of Choice”, Econometrica , 61(3): 495–521.
  • Shrader-Frechette, K. S., 1991, Risk and Rationality , Berkeley/Los Angeles: University of California Press.
  • Simonsohn, Uri, Leif D. Nelson, and Joseph P. Simmons, 2014, “P-Curve: A Key to the File-Drawer”, Journal of Experimental Psychology: General , 143(2): 534–547. doi:10.1037/a0033242
  • Sprenger, Jan, 2016, “Bayesianism vs. Frequentism in Statistical Inference”, in The Oxford Handbook of Probability and Philosophy , Alan Hájek and Christopher Hitchcock (eds), Oxford: Oxford University Press.
  • –––, 2018, “The Objectivity of Subjective Bayesianism”, European Journal for Philosophy of Science , 8(3): 539–558. doi:10.1007/s13194-018-0200-1
  • Sprenger, Jan and Stephan Hartmann, 2019, Bayesian Philosophy of Science , Oxford: Oxford University Press. doi:10.1093/oso/9780199672110.001.0001
  • Steel, Daniel, 2010, “Epistemic Values and the Argument from Inductive Risk”, Philosophy of Science , 77(1): 14–34. doi:10.1086/650206
  • Steele, Katie, 2012, “The Scientist qua Policy Advisor Makes Value Judgments”, Philosophy of Science, 79(5): 893–904. doi:10.1086/667842
  • Stegenga, Jacob, 2011, “Is Meta-Analysis the Platinum Standard of Evidence?”, Studies in History and Philosophy of Science Part C: Studies in History and Philosophy of Biological and Biomedical Sciences , 42(4): 497–507. doi:10.1016/j.shpsc.2011.07.003
  • –––, 2018, Medical Nihilism , Oxford: Oxford University Press. doi:10.1093/oso/9780198747048.001.0001
  • Teira, David, 2010, “Frequentist versus Bayesian Clinical Trials”, in Philosophy of Medicine , Fred Gifford (ed.), (Handbook of the Philosophy of Science 16), Amsterdam: Elsevier, 255–297. doi:10.1016/B978-0-444-51787-6.50010-6
  • Teira, David and Julian Reiss, 2013, “Causality, Impartiality and Evidence-Based Policy”, in Mechanism and Causality in Biology and Economics , Hsiang-Ke Chao, Szu-Ting Chen, and Roberta L. Millstein (eds.), (History, Philosophy and Theory of the Life Sciences 3), Dordrecht: Springer Netherlands, 207–224. doi:10.1007/978-94-007-2454-9_11
  • Weber, Max, 1904 [1949], “Die ‘Objektivität’ sozialwissenschaftlicher und sozialpolitischer Erkenntnis”, Archiv für Sozialwissenschaft und Sozialpolitik , 19(1): 22–87. Translated as “‘Objectivity’ in Social Science and Social Policy”, in Weber 1949: 50–112.
  • –––, 1917 [1949], “Der Sinn der ‘Wertfreiheit’ der soziologischen und ökonomischen Wissenschaften”. Reprinted in Gesammelte Aufsätze zur Wissenschaftslehre , Tübingen: UTB, 1988, 451–502. Translated as “The Meaning of ‘Ethical Neutrality’ in Sociology and Economics” in Weber 1949: 1–49.
  • –––, 1949, The Methodology of the Social Sciences , Edward A. Shils and Henry A. Finch (trans/eds), New York, NY: Free Press.
  • Wilholt, Torsten, 2009, “Bias and Values in Scientific Research”, Studies in History and Philosophy of Science Part A , 40(1): 92–101. doi:10.1016/j.shpsa.2008.12.005
  • –––, 2013, “Epistemic Trust in Science”, The British Journal for the Philosophy of Science , 64(2): 233–253. doi:10.1093/bjps/axs007
  • Williams, Bernard, 1985 [2011], Ethics and the Limits of Philosophy , Cambridge, MA: Harvard University Press. Reprinted London and New York, NY: Routledge, 2011.
  • Williamson, Jon, 2010, In Defence of Objective Bayesianism , Oxford: Oxford University Press. doi:10.1093/acprof:oso/9780199228003.001.0001
  • Windelband, Wilhelm, 1915, Präludien. Aufsätze und Reden zur Philosophie und ihrer Geschichte , fifth edition, Tübingen: Mohr Siebeck.
  • Winsberg, Eric, 2012, “Values and Uncertainties in the Predictions of Global Climate Models”, Kennedy Institute of Ethics Journal , 22(2): 111–137. doi:10.1353/ken.2012.0008
  • Wittgenstein, Ludwig, 1953 [2001], Philosophical Investigations , G. E. M. Anscombe (trans.), London: Blackwell.
  • Worrall, John, 2002, “What Evidence in Evidence‐Based Medicine?”, Philosophy of Science , 69(S3): S316–S330. doi:10.1086/341855
  • Wylie, Alison, 2003, “Why Standpoint Matters”, in Science and Other Cultures: Issues in Philosophies of Science and Technology , Robert Figueroa and Sandra Harding (eds), New York, NY and London: Routledge, pp. 26–48.
  • Ziliak, Stephen Thomas and Deirdre N. McCloskey, 2008, The Cult of Statistical Significance: How the Standard Error Costs Us Jobs, Justice and Lives , Ann Arbor, MI: University of Michigan Press.
How to cite this entry. Preview the PDF version of this entry at the Friends of the SEP Society. Look up topics and thinkers related to this entry at the Internet Philosophy Ontology Project (InPhO). Enhanced bibliography for this entry at PhilPapers, with links to its database.
  • Norton, John, manuscript, The Material Theory of Induction , retrieved on 9 January 2020.
  • Objectivity, entry by Dwayne H. Mulder in the Internet Encyclopedia of Philosophy.

Bayes’ Theorem | confirmation | feminist philosophy, interventions: epistemology and philosophy of science | feminist philosophy, interventions: philosophy of biology | Feyerabend, Paul | hermeneutics | incommensurability: of scientific theories | Kuhn, Thomas | logic: inductive | physics: experiment in | science: theory and observation in | scientific realism | statistics, philosophy of | underdetermination, of scientific theories | Weber, Max

Copyright © 2020 by Julian Reiss <reissj@me.com> and Jan Sprenger <jan.sprenger@unito.it>
