The Fundamental Flaw in Research Assessment Systems

1 November 2011

A popular idea in government circles at the moment is that the quantity and quality of research produced in a country can be improved by introducing what could be called a research assessment system.
The pioneering system of this kind was the Research Assessment Exercise, or RAE, introduced in the UK by Margaret Thatcher as long ago as 1986. The RAE continued in the UK until 2008, when it was replaced by a new research assessment system, known as the REF. Now there are plans to introduce research assessment systems in other countries such as France and Italy. Naturally all these research assessment systems differ from each other in detail, but they do have a common feature which can be used to define what is meant by such a system. I will give this definition in the next paragraph, and then argue that research assessment systems, as thus characterised, have a fundamental flaw. As a result of this flaw, the effect of introducing such a system is to lower the quality of research produced, instead of raising it.

I will define a research assessment system as a system in which groups of researchers are assessed at intervals. If the assessment is good, the group retains its funding or gets more, while, if the assessment is bad, the group’s funds are reduced or perhaps removed altogether.

Now a research assessment system might seem, at first sight, to be an obvious and common sense procedure. We want to produce good research. So let us first find out who the good researchers are by an assessment, and then give funding to good researchers while removing it from bad researchers. In this way we will obviously improve the quality of the research produced. What could be wrong with such a system?

In this short article, I will examine the problem in the context of research in the natural sciences. The situation is somewhat different in other areas. For example, in my (2011), I argue that the damaging effect of research assessment systems is greater in economics than in the natural sciences, but there is not space to go into this problem here.

The fundamental flaw in a research assessment system for the natural sciences is shown by a study of the history of science. Such a study shows that it is not in fact possible for researchers to give accurate assessments of contemporary research. After twenty or thirty years, the assessments of a piece of research have normally reached a consensus which does not change much thereafter. However, this consensus after twenty or thirty years may be wildly different from the judgements which were made at the time the research was first produced. Research which was then thought to be really important may, after twenty or thirty years, be seen as the exploration of a blind alley, while research which was thought then to be of no value may after twenty or thirty years be seen to be a crucial breakthrough.

The phenomenon to which I wish to draw attention could be described as delayed recognition. Let us suppose that a scientist, Mr S say, publishes a paper in which he proposes a new theory based on his research, which, after thirty years, is recognised as a major advance in the field. It may well be that his fellow scientists working in that field do not immediately recognise that Mr S’s new theory is a good one. They may initially think that Mr S’s theory is completely wrong, and largely ignore his work. Mr S may then have to continue developing his theory through his research, and perhaps that of a few supporters, for many years before its value is recognised by the scientific community.

Delayed recognition is a very common phenomenon in the history of science, and, interestingly, it most often occurs for advances which, with hindsight, are seen to be among the most important breakthroughs. It is moreover fairly easy to explain why this happens. According to Kuhn, and I think he is correct here, scientists always work within a framework of assumptions or paradigm, which they accept for the time being as correct. Now a major advance in research is likely to go against some of the assumptions in the dominant paradigm. Working scientists are likely to reject, at least initially, a theory which contradicts any of the basic assumptions of their paradigm. Hence we would expect there to be initially a negative reaction to what later turns out to be a major advance.

The phenomenon of delayed recognition is extremely common in the history of science, and I give many examples of it in my book How Should Research be Organised? (published in 2008 to coincide with the results of the last RAE). For this short article, however, I will confine myself to one recent example.

In 2008, Harald zur Hausen was awarded the Nobel prize for the discovery that a form of cervical cancer is caused by a preceding infection by the papilloma virus. In the research which led to the discovery, however, the majority of researchers favoured the view that the causal agent for cervical cancer was a herpes virus and not a papilloma virus. This was the dominant paradigm at the time, and zur Hausen and his group were the only ones who favoured the papilloma virus.

One of the reasons why the research community favoured the idea that a herpes virus was the cause of cervical cancer was that it had been shown that a herpes virus, the Epstein-Barr virus, was the cause of another cancer: Burkitt’s lymphoma. The dominance of the herpes virus approach is shown by the fact that, in December 1972, there was an international conference of researchers in the field at Key Biscayne in Florida, which had the title: Herpesvirus and Cervical Cancer. Zur Hausen attended this conference and made some criticisms of the herpes virus approach. He said that he believed that the results indicate at least a basic difference in the reaction of herpes simplex virus type 2 with cervical cancer cells, as compared to another herpes virus, Epstein-Barr virus. In Burkitt’s lymphomas and nasopharyngeal carcinomas, the tumor cells seem to be loaded with viral genomes, and obviously the complete viral genomes are present in those cells. Thus a basic difference seems to exist between these 2 systems (cf. Goodheart, 1973, p. 1417). It is reported that the audience listened to zur Hausen in stony silence (McIntyre, 2005, p. 35). The summary of the conference written by George Klein (Klein, 1973) does not mention zur Hausen. Clearly at that time, contemporary assessments of zur Hausen’s research by the scientific community would have given him a low rating. He was regarded as a fringe crank, and his work was not referred to or taken seriously by the mainstream. In the long run, however, zur Hausen proved to be correct.[1]

At the time when zur Hausen was working, there was, fortunately for European science, no research assessment system operating in Germany. Let us now consider what would have happened to him and his group had such a system been in place. From the account I have just given, it is obvious that if a research assessment had been conducted in 1973, then zur Hausen and his group would have got a very low rating. Their research funding would have been cut off, and the discovery of the cause of cervical cancer would have been long delayed. Millions of dollars would still have been spent on searching for a herpes virus causing cervical cancer, but no result would have been produced. Moreover, it would have been very difficult for zur Hausen or anyone else to challenge the dominant paradigm (that a herpes virus caused cervical cancer), because anyone who initially advocated such a view would have received a low rating in research assessment and consequently been denied funding. As a result the development of a vaccine which protects against this unpleasant and often fatal disease would have been delayed for several decades, while huge sums of money would have continued to be spent on research. It is worth noting that sales of the vaccine have generated large profits for pharmaceutical companies. So these profits would not have occurred either.

Let us now turn to the general effect of research assessment systems. I remarked earlier that the phenomenon of delayed recognition occurs most frequently in the case of big innovations, significant advances and major breakthroughs. This is explained by the fact that advances of this kind usually contradict some features of the dominant paradigm accepted by most scientists working in the field. Hence we can conclude that the effect of the use of research assessment systems will be to stifle big innovations, significant advances and major breakthroughs in research.

Research assessment systems are very expensive to implement. A great deal of administration is needed to mount such a system, and the administrators need to be paid. In addition, researchers have to devote much time to preparing their submissions for the research assessment system, and to helping in the assessment of the work of their fellow researchers. The time spent on such activities has to be deducted from the time they can spend on the more productive work of getting on with their own research. This causes an indirect increase in the costs of research. Thus research assessment systems are an expensive way of reducing the quality of research output.

It is also worth noting that it is precisely the big innovations which generate the largest profits for the private sector. So a subsidiary consequence would be to reduce profits in the private sector. As most governments are dedicated above all to generating large profits in the private sector, their advocacy of research assessment is a clear instance of shooting themselves in the foot!

It could still be asked whether there are realistic ways of organising research which do not use research assessment systems. The answer is that there are indeed such ways. One suggestion is made in my (2008), Part 3, pp. 63-129.

References

Clarke, B. (2011) Causality in Medicine with particular reference to the viral causation of cancers. PhD thesis. University College London.

Gillies, D. (2008) How Should Research be Organised? College Publications.

Gillies, D. (2011) Economics and Research Assessment Systems, submitted to Economic Thought[2]

Goodheart, C.R. (1973) Summary of informal discussion on general aspects of herpesviruses, Cancer Research, 33(6), p. 1417.

Klein, G. (1973) Summary of Papers Delivered at the Conference on Herpesvirus and Cervical Cancer (Key Biscayne, Florida), Cancer Research, 33(June 1973), pp. 1557-1563.

McIntyre, P. (2005) Finding the viral link: the story of Harald zur Hausen, Cancer World, July-August, pp. 32-37.

[1] This account of zur Hausen’s work is based on discussions with Brendan Clarke and on his (2011).

[2] I can send anyone interested a copy of this paper, if they contact me on donald.gillies@ucl.ac.uk.

10 Comments

Alberto Baccini 2 November 2011 at 08:07

Gillies illustrates the distorting effects of evaluation procedures. Evaluation is not a panacea for every problem, neither when it is carried out with bibliometric indicators nor when peer review is used. This does not mean that it is not useful, especially for the Italian research system. The problem is to set up scientifically sound procedures, together with mechanisms able to correct possible distortions. Some ideas are here: http://www.nature.com/nature/journal/v465/n7300/full/465845a.html

Marco Antoniotti 3 November 2011 at 08:49

The problem raised (and well known even without thinking too hard about it) is that the “evaluation” being discussed (RAE, ANVUR, etc.) is entirely ex post, and that the main central funding is almost automatically determined by these evaluations. That is the crux of the matter. Is there an ANVUR or anything like it in the USA? No. “Infrastructural” funding (teaching included) comes from various sources and is, I would say, spread across the board, while the far more substantial research funding is “risk-based”. The “track record” matters in the evaluation of “projects”. We can therefore say, in the space and with all the limits of an FB comment, that in the USA funding is “directed at the future”. With ANVUR, funding is directed at the past.

fabiosabatini 3 November 2011 at 23:21

Interesting, but I do not find it very convincing. Here Gillies argues, with moderation and without going into depth, that in the long run evaluation can lower the quality of research. The piece relies heavily on one emblematic example of delayed recognition (the insight that the papilloma virus causes cervical cancer, which at the time was totally ignored by the literature), which nevertheless lends itself to different interpretations. In a way that is only apparently paradoxical, one can argue that a solution to the problem of delayed recognition is to evaluate the “quantity” of research, giving up within certain limits on evaluating its “quality”, rather than not evaluating at all. I remain convinced that evaluation is necessary, above all in the social sciences, where there is a very real risk that a significant share of the academic community has zero scientific output. Let me try to explain better. If one tried to evaluate the quality of research, one would necessarily have to construct indicators, which would very probably be based on bibliometric values (impact factor, reputation of the journals in which one publishes, and so on). The Italian experience of recent days shows how any system of this kind is bound to be fallible and in any case very hard to build (just look at some of the absurdities in the draft ministerial decree regulating the national scientific qualification). In this case evaluation does risk having perverse effects. But giving up evaluation would, in my view, have even worse consequences. In my discipline, economics, a great many researchers and professors have zero scientific output. Zero journal articles, zero working papers, zero conference presentations, zero works in progress. Only the occasional monograph published by obliging publishers (mostly linked to the university of affiliation).
The situation is even worse in political science, sociology and psychology. Very often these zero-output researchers invoke in their defence a very strong argument (strong because it is popular, reasonable and easy to agree with): they do not publish me because I am outside the mainstream, so I do not write, or I write little. Giving up on evaluating the research activity of these people is suicide for the university system. I am certain that one should at least try to evaluate the “quantity” of their research or, in other words, establish whether or not these researchers have carried out any research activity. The only way is to count their working papers and their submissions. They did not publish you? OK, never mind, we know that the problem of delayed recognition exists (or even of no recognition at all), we know the flaws of your discipline and we will certainly not penalise you for this. But show us that you have thought, studied, worked things out, and finally written and perhaps even tried to submit. Let us see the working papers and the submissions. Show us that besides teaching, and selling your textbook, you have also done research. In short, in economics, and more generally in the social sciences, we face a real emergency that is perhaps unknown to you physicists: our disciplines are full of people who do absolutely n o t h i n g. Nothing, nil, zero. In economics the problem is aggravated by the fact that many idlers justify themselves by invoking their membership of some heterodoxy, and in this way damage the pluralism of scientific approaches. I have long been reflecting on how one could design a system for evaluating the quantity (in the sense of existence) of research, and I realise that the problems of development and implementation are enormous, and I still have no idea how abuses of various kinds could be avoided.
In the meantime, in my view it will in any case always be preferable to set up a system that tries “at least” to evaluate quality, like the one that, amid a thousand defects and errors (some of them scandalous), our country is beginning to design.

- Mario Ricciardi 4 November 2011 at 14:44
  
  Dear Fabio,
  
  on the usefulness of evaluation I would say we all agree. But some of us are worried about the distortions that certain ways of applying some evaluation methods can generate. Take the example you give. I am well aware that what you point out is a serious problem in some branches of economics and perhaps in other social sciences too. Nevertheless, it seems to me that there are counterexamples to be taken into account. For instance, that of historians, including historians of economics. Someone engaged in a long-term historical research project does so with a book in mind as the outcome of the research, and will not necessarily find it useful to anticipate parts of it in the form of an article submitted to a journal. An evaluation system that places excessive emphasis on producing articles within a short time span can discourage people from committing to long-term research. Are we sure we want this result? The same can be said for other fields of inquiry such as philosophy. Generalising from one case can lead us to overlook some aspect of the problem that it would be wrong to neglect.
  
  Incidentally, there is something that has always made me curious, ever since I read Perotti’s book. Taking it as established that among Italian economists there are some who write nothing, has anyone ever tried to disaggregate this figure? For example, how are they distributed across the different economic disciplines? Or across different geographical areas?
- Alberto Baccini 5 November 2011 at 16:52
  
  I am ready to bet: a homogeneous distribution across disciplinary sectors, and a similar share in all Italian universities. Some evaluation units (Nuclei di valutazione) have tried to count them. For example here: http://www.unisi.it/dl2/20100426141302700/sintesiricerca.pdf
- Mario Ricciardi 5 November 2011 at 21:30
  
  I would have bet on a greater prevalence in the less internationalised areas of the discipline.
- Renzino l'Europeo 6 November 2011 at 19:31
  
  I agree, substantially, with Baccini.
Francesco Guala 4 November 2011 at 09:15

Dear Donald (first of all, greetings: how are you?), my experience of the British Research Assessment (I took part in two rounds, 2001 and 2008) is very different from yours. In my opinion the RAE had very positive effects, especially at the beginning, and would have positive effects in Italy too. Your argument rests on a mistaken assumption, namely that all scientific research is hard to evaluate. This is true for the 5 or perhaps 1% of cutting-edge research that you take as an example of “revolutionary science”. By contrast, a large part of research is rather easy to evaluate, because it fails to meet basic methodological or quality requirements (such as coherence, rigour, originality, or even just clarity of exposition). These are minimum requirements that all of us researchers apply when selecting the articles published in refereed journals, such as Philosophy of Science or the British Journal for the Philosophy of Science, which you edited for many years. The revolutionary research that you take as your paradigmatic example already satisfies all these criteria, and has therefore already passed a rigorous process of collective selection and evaluation. A RAE system initially serves only this purpose: to make the less active or less enterprising researchers strive to meet the minimum criteria for getting their work published in international journals (and therefore for getting it read). I believe your criticisms stem in part from the fact that in Great Britain this objective was achieved many years ago, and for this reason the RAE (or REF) has become an attempt to distinguish between “super-super-research” and research that is merely “super” (go and read the “four stars” criteria in RAE 2008 to convince yourselves). This, I agree with you, is a mistaken objective, because not even scientists are able to make accurate predictions.
In Italy and elsewhere we are very far from this kind of objective: the first rounds of a RAE serve only to distinguish those who are not productive (zero publications) from those who are, and those who publish in journals meeting the minimum standards mentioned above (and are therefore read by their colleagues) from those who do not. This can be done, and must be done, if in twenty years’ time we want to be asking ourselves the same questions that you raise in your piece.

Alberto Baccini 5 November 2011 at 16:47

I share Guala’s opinion. An evaluation exercise in Italian academia is useful if it manages to identify those who do not have a decent minimum of publications. The old CIVR did not do this. The VQR will be able to, if ANVUR manages to avoid some traps. There is a problem, and not a small one: we will not know who they are! And those who will presumably know (the heads of the institutions) will not be able to do much.

Valutazione, asticelle e scelte 6 October 2012 at 22:48

[…] the philosopher of science Donald Gillies observes, examining the role of the English evaluation agency (RAE) over the last twenty years, that the […]
