The objective of this project is to provide widespread access to the only index to Soviet journals published during the years 1956-1975. These indexes contain invaluable information available from no other information source during this time period. The entries are poorly arranged, making them so difficult to use that they are virtually unused. Furthermore, they are printed on highly acidic paper and in danger of disintegration. By digitizing these resources and offering them on the World Wide Web, we will accomplish three goals: preserve the information; make it available to users worldwide; and provide keyword searching, which overcomes the lack of indexing. The project encompasses three authorized activities under Title VI, Section 606, numbers (1), (3), and (5):
- to facilitate access to or preserve foreign information resources in print or electronic forms;
- to develop new means of shared electronic access to international data; and
- to develop methods for the wide dissemination of resources written in non-Roman language alphabets.
Facilitate Access or Preserve Foreign Information Resources
This project will preserve and provide improved access to a major foreign information source. Our first consideration is preservation. The paper stock used for this journal is extremely poor. Although the volumes from 1926 through 1955 have been reprinted by Kraus on quality paper, the originals (and the section of the journal we propose digitizing) are, by contrast, generally deteriorating. These volumes cannot sustain normal use without suffering damage and potential loss of the information they contain. Volumes from 1957, for example, are so brittle that the pages break when used or copied. Some volumes as late as the early 1970s are in the same condition. Even today, Letopis' continues to be printed on paper of mixed quality, calling into question the long-term preservation of the collection.
Examination of the original volumes published from 1957 through 1974 shows that the paper in these volumes is extremely brittle. Pages are discolored from acid deterioration, with darker staining on all three edges. The pages in the more recent years are too brittle to sustain any treatment, and the pages in the earlier years are so fragile that they break at a single fold.
At Indiana University the issues were library-bound using oversewing to consolidate the text block. Oversewing, which was the standard method at the time, reduces the inner margin of the text and perforates the pages at 1/4" intervals. As oversewn volumes deteriorate, the fragile paper detaches along the perforation lines. Oversewn volumes are also very difficult to use without causing damage, since the tight sewing resists opening or photocopying. The disadvantages of oversewing are in this case compounded by the 3" and greater width of many of the volumes, which causes even more difficulty in handling and copying.
Our second consideration is improved access to the information. Students and scholars need access to the indexes and to the entries, and yet access is extremely limited. Few universities in the United States have backfiles or current subscriptions to this journal. It is a reference work, so students and scholars do not have access via interlibrary loan; the volumes themselves cannot be loaned and photocopying is not a viable option for an index.
Access to the information recorded in the journal is also limited due to its poor arrangement. Refined subject searches are almost impossible principally because the Letopis' lacks a cumulative index. Over the years, it has included periodic indexes of various quality appearing at different intervals (bi-monthly, quarterly, annual), although there are years for which no index appears to exist at all. For those years, users would have to search each weekly publication. For the period we will digitize for this project, there is a quarterly name and geographic index; it begins at a length of about 250 pages, increasing to over 500 pages per index by the mid-1970s. There are no title indices and the subject categories are prohibitive ("so broad as to prove almost useless," notes the bibliographer for Slavic and East European Studies at the University of Chicago). Digital access to the journal will overcome the barriers to use created by the lack of indices; users will have access by any keyword in
New Means of Shared Electronic Access to International Data
The World Wide Web has proven to be an effective and reliable distribution medium for electronic publications. For this project, the Indiana University Digital Library Program will create a digital version of the most significant Russian periodical index published during the years 1956-1975, Letopis' Zhurnal'nykh Statei. Users from all parts of the world will have access, free of charge, to this unique resource. Digital technology and the World Wide Web allow us to provide vastly improved access to a relatively inaccessible information resource, including access to the index itself and, through keyword searching, access to the bibliographic data contained in the index.
Wide Dissemination of Resources Written in non-Roman Alphabets
One of the most important aspects of the project will be to offer keyword searching of Cyrillic text. This capability is not widely available. Few scholarly texts have been OCR-processed and made available to students and scholars in the United States. We have been experimenting with optical character recognition software developed by a Russian company, ABBYY/BitSoft http://www.abbyy.ru/. The software is called Fine Reader 4.0. We have contacted the company and have explored a number of ways in which we might collaborate on software development. The project will have a formal evaluation component, which will provide useful information to ABBYY. Working together, we should be able to improve Cyrillic OCR software for searching bibliographic data. Projects to date have focused on narrative text.
Need for the Project
In 1920 Lenin laid the foundation for the creation of the Russian national bibliography. This was to be a registration and indexing of all printed matter produced in the country. In time it proved to be the most thorough national bibliography ever produced. Part of this plan was the registration and indexing of serial articles, called Letopis' zhurnal'nykh statei.
Started in 1926, this subject arrangement of journal articles included the analyses of 200 serial titles. By 1966 this had grown to the classified listing of all articles in almost 2,000 journals. Preference had been given to academic journals, including the almost endless university and Academy publications known variously as Uchenye zapiski, sborniki, trudy and izvestiia. Also included were popular journals, general mass media publications, women's serials, and many regional studies.
The massiveness of this endeavor resulted in inadequate indexing. To date this is the only real index to Soviet journal publications, and as such has tremendous research value. The difficulty of using this work has daunted many students and scholars. Even today a common reference question of graduate students is that they wish to know what journals they should use to find articles on their particular interest, whether it be gold mining in the Urals, Buriat studies, or early manufacturing in Siberia. The answer, of course, is that such information could be in any of hundreds of publications.
The tool for this research is the Letopis' zhurnal'nykh statei. At best there are four indexes per year. Some years have no subject indexing. Researchers become lost in this mass of almost uncontrolled information.
Slavic librarians who work in large research libraries will welcome access to this information. Suggestions have been made to microfilm this register of journal articles; in reality, microform copies might help preserve this crumbling resource but would make it impossible to use. The articles are listed in subject order in this weekly publication. They are numbered consecutively from 1 on for each year. Consequently a subject- or author-indexed word may have several numbers attached to it, such as 87, 341, 4,687 and 11,212. Each of these item numbers must be located to find the actual citation being referred to.
This is a valuable reference source, physically crumbling before us, and very difficult to use in its present form. An electronic, searchable database of this information would be used by every scholar, in any field, needing Russian source materials.
We selected Letopis' Zhurnal'nykh Statei for digitization based upon the unanimous advice of Slavic librarians who were queried via a listserv for Slavic specialists. One librarian wrote that this resource will be particularly valuable during the years we propose digitizing because "there [are] no remotely comprehensive bibliographies that can compare to it." Our letters of support further attest to the significance of providing access to Letopis' Zhurnal'nykh Statei. One supporter notes, "Senior Slavic scholars as well as students tend to ignore this resource... Making the resource available over the Internet, with the enhancements that electronic formats offer, will ensure that this resource will be heavily used." We believe that the potential audience is enormous. Our goal and desire is to provide Letopis' Zhurnal'nykh Statei to students and scholars at home and abroad. As another supporter notes, "The Web site would provide access to a vast hidden treasure of information, not only for those at institutions with the largest Russian and Soviet collections, but for the many hundreds of researchers with medium-to-small collections and no holdings at all of Letopis' Zhurnal'nykh Statei. In fact, I would predict that this database would also be heavily used by hundreds of libraries throughout Eastern Europe, Russia, and the countries of the former Soviet Union."
Project Design and Personnel
The project will have four major phases:
- Image Scanning
- Optical Character Recognition (OCR) Processing
- Electronic Text Encoding and Editing
- Evaluation of the Interface and the OCR Software
Indiana University's Digital Library Program has experience in all aspects of the project. Below, we have described each phase of the project and the personnel who will be responsible for overseeing and accomplishing this work.
The overall project will be co-directed by existing staff, Kristine Brancolini and Perry Willett. We will hire a project manager for the duration of the grant. The project manager will be a librarian who is fluent in Russian. Other major participants include: Murlin Croucher, Slavic Bibliographer; John Walsh, Electronic Text Specialist; and Lorraine Olley, Head of Preservation.
• Image Scanning
The image scanning will be handled by our Preservation Department brittle books operation. We will need to purchase additional equipment and hire hourly staff, but this phase of the project will follow the procedures already in place for other brittle books. The books will be dismantled for scanning, then the originals discarded. This will enable us to scan each page more accurately and quickly. Each page will be captured as a TIFF file. As a byproduct of scanning we will print from the digital images a facsimile copy of the work on permanent paper with a wide binding margin. The hardcover-bound paper facsimiles will replace the deteriorated originals in the stacks. The image files will also be retained, so that other libraries and individuals may order paper copies of each volume to add to their collections or to fill in gaps in their paper holdings.
Lorraine Olley, Head of Preservation, will oversee this phase of the project. Since 1997, the Preservation Department has created digital copies of a wide variety of materials, including brittle books, photo albums from Indiana University's Kinsey Institute, and manuscript pages from the Hoagy Carmichael collection at the Indiana University Archives of Traditional Music. The digital copies of the brittle books are used to create facsimiles of the original works, which are printed on archival paper and bound for the circulating collections. The digital files, which are in TIFF format, are stored on magneto-optical (MO) disks.
For the Letopis' Project, the Preservation Department will be using three Fujitsu Scanpartner 600C bi-tonal scanners. This scanner is capable of capturing images at 600 dpi, which is the archival standard. It is a high-speed scanner that comes with a sheet feeder, but can also be used as a flatbed scanner. We do not believe that we will be able to use the sheet feeder, due to the condition of the paper, but we will experiment with the sheet feeder to test this hypothesis. As with other brittle books, the resulting TIFF images will be stored on magneto-optical disks, then transferred to our Library Electronic Text Resource Service (LETRS) for optical character recognition processing, SGML-encoding, and editing.
• Optical Character Recognition (OCR) Processing
After the pages have been scanned and converted to TIFF images, the files will be batch-processed using OCR software from ABBYY/BitSoft. The OCR software that will be used for this project, Fine Reader, has been tested by the staff of Murlin Croucher, Slavic Bibliographer, and found to be approximately 99.8 percent accurate. This means that each page will have approximately 6-10 errors. These errors will be corrected during the text-editing phase of the project. The most common error found in our tests was mistaking imperfections in the paper for punctuation such as commas, periods, and asterisks. This type of error is much easier to find and correct than actual mistaken characters, which we found to occur very infrequently, at a rate of about 1 per page. Also, given the highly structured nature of the indexes, with each entry containing generally the same kind of information, automated error correction and encoding will be performed.
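The relationship between the quoted accuracy rate and the per-page error count can be checked with simple arithmetic. The sketch below is illustrative only: the function names are our own, and the page sizes it prints are back-calculated from the figures above rather than measured from the volumes.

```python
def expected_errors(chars_per_page: int, accuracy: float) -> float:
    """Expected number of OCR errors on a page of the given size."""
    return chars_per_page * (1.0 - accuracy)


def implied_chars_per_page(errors_per_page: float, accuracy: float) -> float:
    """Page size (in characters) implied by an error count and an accuracy rate."""
    return errors_per_page / (1.0 - accuracy)


if __name__ == "__main__":
    accuracy = 0.998  # 99.8 percent, the rate reported in our tests
    for errors in (6, 10):
        size = implied_chars_per_page(errors, accuracy)
        print(f"{errors} errors/page implies about {size:.0f} characters/page")
```

At 99.8 percent accuracy, 6 to 10 errors per page corresponds to pages of roughly 3,000 to 5,000 characters.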
The Digital Library Program team has a great deal of experience with OCR software. With the Victorian Women Writers Project, the library has processed over 25,000 pages and worked with a number of different OCR software packages. The Digital Library Program is extremely familiar with the problems encountered during scanning and optical character recognition, and is prepared to work with ABBYY/BitSoft to find solutions that will provide the highest accuracy.
John Walsh, Electronic Text Specialist, will oversee this phase of the project.
• Electronic Text Encoding and Editing
The resulting text file, with some encoding automatically provided, will be edited for errors, both in the bibliographic entries themselves and in the encoding. The files will be encoded following the Text Encoding Initiative (TEI) Guidelines, which use the rules for encoding languages established by the Standard Generalized Markup Language (SGML) guidelines. Student employees, knowledgeable in Russian, will be trained in the TEI Guidelines. Five students, working 20 hours per week, proofreading at a rate of 30 pages an hour, will be able to complete editing the entire range of 1956-1975. The Project Manager will oversee this portion of the project, with assistance from the Electronic Text Specialist.
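The staffing figures above translate into a weekly proofreading capacity that can be worked out directly. The following is a minimal sketch under stated assumptions: the helper names are hypothetical, and `weeks_needed` takes a total page count as an argument because this section does not state one.

```python
import math

STUDENTS = 5          # student editors, as stated above
HOURS_PER_WEEK = 20   # hours per student per week, as stated above
PAGES_PER_HOUR = 30   # proofreading rate per student, as stated above


def pages_per_week(students: int, hours: int, rate: int) -> int:
    """Combined number of pages the team can proofread in one week."""
    return students * hours * rate


def weeks_needed(total_pages: int, weekly_pages: int) -> int:
    """Whole weeks required to proofread total_pages at the given weekly capacity."""
    return math.ceil(total_pages / weekly_pages)


if __name__ == "__main__":
    weekly = pages_per_week(STUDENTS, HOURS_PER_WEEK, PAGES_PER_HOUR)
    print(f"Combined proofreading capacity: {weekly} pages per week")
```

With five students at 20 hours per week and 30 pages an hour, the team's combined capacity is 3,000 pages per week.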
Encoded text files will be stored, indexed, and made accessible on an existing IU Web server using OpenText software. Results will be displayed in HTML, translated using a series of SGML-to-HTML Perl scripts already in use at Indiana University. We have a great deal of experience in indexing SGML-encoded text using OpenText software, as we have a large number of full-text collections available for searching.
We have worked extensively with highly structured documents, such as the Letopis', in our work with finding aids encoded according to the Encoded Archival Description (EAD) format. We have indexed several of our EAD-encoded finding aids using OpenText software, and these finding aids are searchable over the WWW much as we propose for Letopis'.
The Project Manager, working with John Walsh, will oversee this phase of the project, as it requires staff who are fluent in Russian.
• Evaluation of the Interface and the OCR Software
The project will include two evaluation components. The first evaluation will test the usability of the interface and the search engine. We will test the database with actual users from several research institutions in the United States.
Murlin Croucher, Slavic Bibliographer, will oversee the interface evaluation.
The second evaluation will test the accuracy of the OCR software. We will work with the R&D unit of ABBYY to develop a plan for testing Fine Reader during our project.
John Walsh will oversee the evaluation of the OCR software.
In order to estimate the length of time it will take to edit the OCR-processed text, Perry Willett and a staff member from Murlin Croucher's office conducted an evaluation of Fine Reader 4.0. Using a sample Russian text with Cyrillic characters, Fine Reader 4.0 achieved a recognition accuracy of 99.8%. As noted previously in this proposal, almost all of the errors involved misreading paper imperfections or discolorations as punctuation, inserting apostrophes, commas, periods or asterisks where none should be. This type of error will not affect full-text searching, since punctuation would be ignored. In looking only at incorrectly recognized characters, the accuracy rate approached an astonishing 99.95%. This rate is equivalent to that of a typing agency using double keying, and will allow for high quality output even before editing.
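To illustrate why spurious punctuation does not interfere with keyword searching, the sketch below normalizes text by discarding punctuation before comparing search terms. It is a simplified illustration with hypothetical function names and an invented sample phrase, not a description of how the OpenText software tokenizes text.

```python
import re


def normalize(text: str) -> list:
    """Lowercase the text, drop punctuation, and split it into search tokens."""
    # In Python 3, \w matches Unicode letters, so Cyrillic characters are kept.
    cleaned = re.sub(r"[^\w\s]", "", text.lower())
    return cleaned.split()


def matches(query: str, text: str) -> bool:
    """True if every token of the query occurs as a token in the text."""
    tokens = set(normalize(text))
    return all(term in tokens for term in normalize(query))


if __name__ == "__main__":
    clean = "Добыча золота на Урале"       # invented sample phrase
    noisy = "Добыча, зо'лота* на. Урале"   # same phrase with stray OCR marks
    print(normalize(clean) == normalize(noisy))
    print(matches("золота Урале", noisy))
```

Because the stray marks are removed before tokens are compared, the clean and the noisy versions of the phrase produce the same search tokens.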
As we began planning for this project, questions arose about the copyright status of Letopis': Is the work covered by copyright? Is it covered by copyright for all years of its publication or are some volumes in the public domain? In order to answer these questions, we spoke with a local copyright expert, Fred Cate, Professor in the Indiana University School of Law, and an expert on Soviet/Russian copyright law, Michael Newcity, Professor in the Duke University School of Law. Independently, both concluded that this work is in the public domain for all years of its publication. This conclusion was based upon copyright law in both the Russian Federation and in the United States, as there are questions concerning which copyright law will apply in this case. However, in either case the conclusion is the same: Legally, we may digitize any and all volumes of Letopis' and offer the work to users via the World Wide Web.
Indiana University Expertise
For the Letopis' Project, we will draw upon Indiana University's wealth of resources in the area of Russian studies and in the creation of electronic text. Our academic programs offer both language and area studies. Consequently, Indiana University attracts undergraduate students, graduate students, and faculty with exceptional Russian language skills. Our department of history has a particularly strong concentration of Russian specialists, the largest group outside of Russia itself. The School of Library and Information Science offers a dual master's degree with the Russian and East European Institute, which contributes to IU's prominence in the training of Slavic librarians. In 1998/99, a Slavic Studies Fellow is training for a career in area librarianship at the Indiana University Libraries, with funding from the Andrew W. Mellon Foundation.
• Library Collections and Expertise
Indiana University's Slavic collection, which is composed of more than 550,000 volumes and 1,600 serial subscriptions, is one of the finest of its kind. Soviet and Russian Studies account for about half of the collection; the area traditionally called Eastern Europe accounts for the other half. Of particular importance is the Estonian collection, considered by many to be unsurpassed in the country.
While its emphasis is in the social sciences and humanities, the Slavic collection is known for its breadth. The Lilly Library (for rare books and manuscripts) boasts a strong collection in first editions of Czech writers from between the two wars, the rare Allen collection of the Caucasus, many early printings of Russian Bibles, and an impressive Amfiteatrov manuscript collection. The science libraries on the Bloomington campus all have Russian holdings. The Geography and Map Library, and libraries that serve the schools of Education, Music, Fine Arts, and Business have maintained fine research collections in the Slavic area as well.
Among our strengths: The Russian/Soviet collection has exceptional holdings for the period of the 1917 Revolution, with rare microfilms and books concerning all aspects of this period. Collections are exceptionally strong for the Bohemian lands, the Hapsburg Empire, Poland, Slovenia and Yugoslavia. For years we have maintained Romanian, Bulgarian and Albanian collections mainly through large exchange programs. New emphasis is being placed on Romanian studies, and this East European collection is rated as being one of the best three or four such collections in the country. The retrospective collections of journals are often exceptional, with many of the Academy publications being complete back to the 1870s, reflecting the beginnings of Slavic self-recognition.
Slavic Bibliographer Murlin Croucher is a nationally recognized expert in the field of Slavic bibliography. He compiled and edited Slavic Studies: a Guide to Bibliographies, Encyclopedias, and Handbooks (Wilmington, DE: Scholarly Resources, 1993), which is currently in revision. He has published translations, co-authored a book on Czech book history, presented 18 papers at national conferences, and continues to make numerous book trips to Eastern Europe and Russia. Mr. Croucher is currently working with the Slavic Studies Fellow, the first of two area studies fellows funded by the Andrew W. Mellon Foundation.
• Russian and East European Institute
The Indiana University Russian and East European Institute (REEI) administers one of the country's largest programs in Russian and East European Studies. East European language training began at IU during World War II, and REEI was founded in 1958. Since its earliest years, REEI's primary objective has been to offer a broad interdisciplinary curriculum in advanced language and area training and to assist faculty and students in research and the publication of scholarly works. Today the institute supports over 50 faculty and 400 students on the Bloomington campus. REEI has long been a federally funded National Resource Center, first under the National Defense Education Act and later under Title VI of the Higher Education Act.
• Indiana University Digital Library Program
The Indiana University Digital Library Program has experience in mounting large digital collections on the WWW and providing sustained support for network access to these collections. Perry Willett is General Editor of the Victorian Women Writers Project, a collection of SGML-encoded texts created and mounted on the WWW at Indiana University http://www.indiana.edu/~letrs/vwwp/index.html. The collection totals 150 volumes, with new volumes added continuously. In October 1997 the National Endowment for the Humanities named IU's Victorian Women Writers Project one of the top 20 humanities sites on the World Wide Web. John Walsh, Electronic Text Specialist, has many years of experience working with OCR software, preparing large SGML collections for delivery over the Internet, converting electronic texts from other formats to SGML, and creating Web-based interfaces to these collections, using a combination of HTML/SGML/XML, Perl, and Java programming. Kristine Brancolini, who will co-direct this project, currently is co-directing Digitizing and Preserving the Hoagy Carmichael Collections at Indiana University, a project funded by a National Leadership Grant from the Institute of Museum and Library Services (IMLS).
The plan for digital conversion has two components: the image scanning and the OCR processing. To prepare for digitization, staff from the Preservation Department will dismantle the volumes and discard the bindings. Then, trained hourly staff will image-scan the pages, using recommended practices for scanning brittle books. The images will be stored as TIFF files. Once they have been digitized, the volumes will be reprinted on acid-free paper, bound, recataloged, and returned to the stacks.
Following image scanning, staff from our Library Electronic Text Resource Service (LETRS) will convert the TIFF files to text files using Fine Reader 4.0 OCR software. The resulting text files will then be edited, SGML-encoded and proofread by graduate student consultants trained in SGML and the TEI Guidelines.
Letopis' Zhurnal'nykh Statei will be a keyword searchable database delivered via the World Wide Web. Users will be able to locate information from the bibliographic entries and the indexes: they will be able to search for any word that appears in the entries, or search by author, title, journal title, subject and/or geographic region. All of this information will be encoded to allow for differentiation among the elements of the bibliographic entry.
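The kind of fielded keyword searching described above can be sketched with a small in-memory inverted index. This is an illustrative example only: the class, the field names, and the sample entry are hypothetical, and the project itself will rely on OpenText software and SGML encoding rather than on code like this.

```python
import re
from collections import defaultdict

# Hypothetical field names mirroring the search options listed above.
FIELDS = ("author", "title", "journal", "subject", "region")


def tokenize(text: str) -> list:
    """Split text into lowercase word tokens (Cyrillic included)."""
    return re.findall(r"\w+", text.lower())


class EntryIndex:
    """Inverted index mapping (field, token) pairs to bibliographic entries."""

    def __init__(self):
        self._entries = []
        self._postings = defaultdict(set)  # (field, token) -> set of entry ids

    def add(self, entry: dict) -> int:
        """Store an entry and index every token under its field and under 'any'."""
        entry_id = len(self._entries)
        self._entries.append(entry)
        for field in FIELDS:
            for token in tokenize(entry.get(field, "")):
                self._postings[(field, token)].add(entry_id)
                self._postings[("any", token)].add(entry_id)
        return entry_id

    def search(self, query: str, field: str = "any") -> list:
        """Return entries containing every query token in the chosen field."""
        tokens = tokenize(query)
        if not tokens:
            return []
        hits = set.intersection(
            *(self._postings.get((field, token), set()) for token in tokens)
        )
        return [self._entries[i] for i in sorted(hits)]


if __name__ == "__main__":
    index = EntryIndex()
    # Invented sample entry, for illustration only.
    index.add({
        "author": "Иванов И. И.",
        "title": "Добыча золота на Урале",
        "journal": "Ученые записки",
        "subject": "Горное дело",
        "region": "Урал",
    })
    print(len(index.search("золота")))               # search any field
    print(len(index.search("Урал", field="region")))  # search one field
```

Indexing each token both under its own field and under a catch-all "any" key is one simple way to support keyword searches across the whole entry as well as searches restricted to a single element.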
Access to networked information resources is supported at Indiana University by the data center services of University Information Technology Services (UITS). The UITS data center operates 24 hours a day, seven days a week to provide students, faculty, staff, and the international scholarly community with continuous access to university information resources. Through a regular program of upgrade and replacement, UITS maintains currency in hardware, software, and storage media. Information resources of long-term or permanent value are kept current as part of this upgrade and replacement cycle. UITS and the Indiana University Archives have participated in the research program of the National Historical Publications and Records Commission, studying means and methods of preserving digital content. To date, and in practice, our most reliable means of preservation has been a combination of routine content copying (to guard against media decay) and periodic content conversion as part of the equipment or application upgrade cycle (to guard against hardware or software obsolescence). These practices, and the institution's commitment to maintaining the scholarly record, help assure preservation of and continued access to networked information resources.
Scholars and researchers have long realized that periodicals, whether published by academies and research institutions or by chroniclers of the popular culture, provide a source of information unavailable elsewhere. For scholars of Russian culture and history, there exists only one meaningful index to this vast trove of information: Letopis' Zhurnal'nykh Statei. This work is both grand and ungainly. Covering more than 1,700 journals, series and continuing publications, it provides in one serial publication information in the humanities, sciences, and social sciences. However, it is also difficult, if not impossible, to use. Insufficient availability, lack of subject indexes, and brittle pages conspire to prevent researchers from using this definitive reference tool. The best method to overcome these hindrances, and to serve students and scholars around the world, is to digitize Letopis', increase its value by allowing researchers to search it using keywords, and make it available via the World Wide Web.
No institution is better suited to perform this undertaking than Indiana University. Our exceptional programs in international studies and languages, our proven ability to create and administer digital initiatives, and our expertise in supporting Slavic research collections enable us to provide a highly desirable new research tool to scholars throughout the world.