Document Digitisation and Machine Learning
Data acquisition is the first step in all empirical research. The availability of data directly affects the quality and extent of the conclusions and insights that can be drawn. In particular, larger and more detailed datasets can provide convincing answers even to complex research questions. The main problem is that "large and detailed" usually implies "costly and difficult", especially when the data medium is paper and books. Human operators and manual transcription have been the traditional approach to collecting historical data. Researchers spend countless hours sorting, organizing and manually transcribing paper documents. This quickly becomes infeasible when the data requirements grow and the corresponding number of documents reaches an insurmountable level. Instead of manual transcription, we advocate the use of modern machine learning techniques to automate the digitization process. We give an overview of the potential for applying machine digitization to data collection, we show that it performs on par with or better than existing methods for tabular sequence transcription at a fraction of the cost, and finally we discuss the steps involved in applying machine learning methods to a few cases of actual documents: US and UK mortality data and Danish death certificates. We also briefly comment on the prospects of active learning and present an example of a recently developed digitization application.
Table detection and segmentation
Tables are paramount in quantitative social science, economics, and demography, as they provide structured information that can easily be operationalized for statistical analysis. We provide an overview of the challenges of transcribing such tables and suggest novel applications of coherent point drift, auto-encoders and geometric map learning for this purpose. We show that these methods can be applied effectively to the automated segmentation of tables from historical documents and can serve as a preprocessing step before conventional OCR and transcription systems (e.g. Tesseract or micro-task platforms).
Joint work of Emil N. Sørensen, Christian E. Westermann, Christian Møller Dahl.
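As a rough illustration of what a table segmentation step has to produce, the sketch below splits a binarized page into row bands by locating horizontal ruling lines with a simple projection profile. This is a deliberately basic baseline and not one of the methods named in the abstract (coherent point drift, auto-encoders, geometric map learning). The function names, the `min_fill` threshold and the toy `page` are assumptions made for illustration only.

```python
def ruling_line_runs(image, min_fill=0.8):
    """Find runs of consecutive pixel rows that look like horizontal rules.

    image: list of equal-length rows, where 1 = ink and 0 = background.
    min_fill: assumed fraction of ink pixels needed to call a row a rule.
    Returns half-open (start, end) row index pairs.
    """
    runs = []
    start = None
    for y, row in enumerate(image):
        is_line = sum(row) >= min_fill * len(row)
        if is_line and start is None:
            start = y                      # a new rule begins here
        elif not is_line and start is not None:
            runs.append((start, y))        # the rule ended on the previous row
            start = None
    if start is not None:                  # rule touching the bottom edge
        runs.append((start, len(image)))
    return runs


def row_bands(image, min_fill=0.8):
    """Return half-open (top, bottom) bands lying between adjacent rules."""
    runs = ruling_line_runs(image, min_fill)
    return [
        (upper_end, lower_start)
        for (_, upper_end), (lower_start, _) in zip(runs, runs[1:])
        if lower_start > upper_end         # skip rules with nothing between
    ]


# Toy 7 x 5 "page": three rules enclosing two table rows.
page = [
    [1, 1, 1, 1, 1],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],
]
bands = row_bands(page)  # [(1, 3), (4, 6)]
```

Each band could then be cropped and passed to an OCR or transcription system. A baseline of this kind assumes straight, unbroken rules, which is rarely true of scanned historical documents; that limitation is part of the motivation for more robust, learned approaches.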
Combining machine learning, Bayesian inference and historical botanical garden data to unravel plant ageing
How plant ageing manifests itself demographically is still an open question, which is remarkable since it has been a focus of population biologists ever since the first careful plant demographic studies were conducted a century ago. Current evidence suggests that the demographic consequences of growing older (i.e. consequences for survival and reproduction) vary among plant species and even among populations of the same species. This is particularly the case for survival. Understanding plant ageing would not only be valuable for population viability analyses of threatened species and population projections for economically important plants, but would also offer insight into how actuarial senescence (a gradual increase in mortality with advancing age) has evolved in other groups of organisms, such as mammals.
Limited data availability currently constrains our knowledge of plant ageing, because detailed analyses require long-term individual-based monitoring data. However, large quantities of such data exist that have never been used for this kind of analysis, in the form of hand-written records from botanical gardens. The aim of the proposed project is to develop machine learning methods to extract these data and then use them to determine how plants age. From the records, we will extract information on time of birth (planting), time of death (or removal), cause of death or removal (actual death and its probable cause, or removal of plants due to e.g. restructuring of the garden or heavy disease burden), as well as the geographical location and climate of the area where the species occurs in the wild. We will subsequently use and further develop Bayesian survival trajectory analyses, developed by FC, to determine how age drives plant survival. Finally, we will investigate general patterns and variation in survival trajectories and assess how phylogenetic relatedness, similarity in life history strategy, and environmental conditions may affect differences in rates of ageing among species.
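To make the notion of actuarial senescence concrete, the sketch below uses a Gompertz mortality model, in which the hazard rises exponentially with age, and evaluates the log-likelihood of garden records in which some plants died and others were removed. The choice of the Gompertz form, the parameter names `a` and `b`, and the record layout are assumptions for illustration; the abstract does not specify the mortality model used in the project's survival trajectory analyses.

```python
import math


def gompertz_hazard(age, a, b):
    """Mortality hazard a * exp(b * age); b > 0 means mortality rises with age."""
    return a * math.exp(b * age)


def gompertz_cumulative_hazard(age, a, b):
    """Integral of the hazard from 0 to age."""
    if b == 0:
        return a * age                     # constant hazard, no senescence
    return (a / b) * (math.exp(b * age) - 1.0)


def gompertz_survival(age, a, b):
    """Probability of surviving from planting to the given age."""
    return math.exp(-gompertz_cumulative_hazard(age, a, b))


def log_likelihood(records, a, b):
    """Log-likelihood of (age_at_exit, died) records.

    died=True  : the plant died at that age (contributes hazard and survival).
    died=False : the plant was removed alive, i.e. right-censored
                 (contributes survival only).
    """
    total = 0.0
    for age, died in records:
        total -= gompertz_cumulative_hazard(age, a, b)
        if died:
            total += math.log(gompertz_hazard(age, a, b))
    return total


# Hypothetical records: (age in years at death or removal, died?)
records = [(12.0, True), (30.0, False), (45.0, True)]
ll = log_likelihood(records, a=0.01, b=0.05)
```

In a Bayesian analysis, this log-likelihood would be combined with log-priors on `a` and `b` and explored with a sampler. Treating removals as censored observations rather than deaths is what makes the recorded cause of death or removal important for unbiased estimates of ageing.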