Bringing data to a new form: Internship at the Institute for Research and History of Texts

One of the largest gaps that must be closed in order to truly move from the humanities to the digital humanities is the transformation of countless written sources into organized digital data. But the transition from simply having digital data on hand to possessing exploitable, well-structured and accessible digital data is far from easy. It is a task I came to understand much better during the internship I undertook at the Institut de recherche et d’histoire des textes, or IRHT (Institute for Research and History of Texts), and the Société des Historiens Médiévistes de l’Enseignement Supérieur Public, or SHMESP (Society of Medieval Historians of Public Higher Education), in Paris, under the supervision of Dominique Stutzmann.

My internship was divided into two parts, each with a specific task. The first was the transfer of the SHMESP bibliographic database from one open web archive, Biblio.shmesp.fr, to another, HAL. The transfer to HAL required a lot of preparatory work: I received semi-structured data in Excel, which needed extensive cleaning, from removing duplicates to completing the many incomplete entries with the help of sources such as the SUDOC catalogue. The data was then structured further with data manipulation software such as OpenRefine and Zotero, and converted from format to format depending on the software in use, from Unicode text to BibTeX. In the end, around fifteen thousand bibliographic records were uploaded to the open archive HAL. The second part, to which I could devote the least time, concerned the results produced by the analysis tool developed by the SHMESP, which is meant to recognize and catalogue thousands of elements inside Books of Hours, from initials to miniatures: I had to verify the results and correct them where necessary.
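
To give a concrete idea of the kind of cleanup described above, here is a minimal Python sketch. It is only an illustration and not the workflow actually used, which relied on OpenRefine and Zotero. The field names (author, title, year), the rule for spotting duplicates, the @book entry type, the citation key pattern and the sample records are all assumptions made up for the example.

```python
import re
import unicodedata

# Hypothetical minimum set of fields a record needs before export.
REQUIRED_FIELDS = ("author", "title", "year")


def normalize(text):
    """Lowercase, strip accents and punctuation so near-identical strings compare equal."""
    decomposed = unicodedata.normalize("NFKD", text)
    without_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return re.sub(r"[^a-z0-9]+", " ", without_accents.lower()).strip()


def split_records(records):
    """Return (complete, incomplete); duplicates among complete records are dropped."""
    seen = set()
    complete, incomplete = [], []
    for record in records:
        # A record with any empty required field is set aside for manual completion.
        if any(not str(record.get(field, "")).strip() for field in REQUIRED_FIELDS):
            incomplete.append(record)
            continue
        # Assumed duplicate rule: same normalized author, title and year.
        key = (
            normalize(record["author"]),
            normalize(record["title"]),
            str(record["year"]).strip(),
        )
        if key in seen:
            continue
        seen.add(key)
        complete.append(record)
    return complete, incomplete


def to_bibtex(record, index):
    """Render one record as a BibTeX entry (entry type @book is an assumption)."""
    surname = (normalize(record["author"]).split() or ["anon"])[0]
    year = str(record["year"]).strip()
    cite_key = f"{surname}{year}_{index}"
    fields = ",\n".join(f"  {name} = {{{record[name]}}}" for name in REQUIRED_FIELDS)
    return f"@book{{{cite_key},\n{fields}\n}}"


# Invented sample records: one entry, a near-duplicate of it, and an incomplete one.
sample = [
    {"author": "Dupont, Marie", "title": "Étude d'exemple", "year": 2001},
    {"author": "DUPONT, Marie", "title": "Etude d'exemple.", "year": "2001"},
    {"author": "", "title": "Titre sans auteur", "year": ""},
]

complete, incomplete = split_records(sample)
for i, record in enumerate(complete, start=1):
    print(to_bibtex(record, i))
```

In practice the incomplete records would not be discarded but checked by hand against a catalogue such as SUDOC, which is why the sketch keeps them in a separate list.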

In the end, this internship was a precious experience. Not only did I expand my knowledge of open archives and medieval manuscripts, but I also learned a lot about transferring data from one database to another and the challenges it poses. Furthermore, I was forced to take a broader perspective on work involving massive amounts of data and its time constraints, and on how to use the tools at hand to work not only faster, but better. That is one of the goals of digital support for the humanities.