Charles W. Cushman Photograph Collection
Indiana University Archives / Digital Library Program
The Charles W. Cushman Collection web site is hosted on IBM eServer pSeries server hardware running version 5.1 of the IBM AIX operating system. The web site was developed in Java, using Java Servlet technology, JavaServer Pages, and the Struts Java web application framework.
Metadata for the collection is stored in tables in a relational database under Oracle9i Release 2, and web-deliverable images are stored as JPEG files on the web server's filesystem. We use the Oracle Text search engine for indexing and searching the metadata, and we take advantage of Oracle Text's built-in thesaurus capabilities to provide enhanced searching and browsing that makes use of the subject relationships inherent in the Library of Congress Thesaurus for Graphic Materials - Subject Terms (TGM I). More detail on this aspect of the implementation is provided below.
Metadata entry, including transcription of the original notebook and slide information and entry of subject headings, was accomplished through the use of a custom-developed Microsoft Access database.
In the near future, we plan to implement a mapping from the Cushman Collection database schema to Dublin Core, and provide access to the collection's metadata through an OAI-PMH (Open Archives Initiative Protocol for Metadata Harvesting) data provider.
We also hope eventually to take the browse and search interface developed for the Cushman Collection and make it more generally applicable to other image collections that use standard metadata schemas, such as MODS (the Metadata Object Description Schema).
Thesaurus-Enhanced Searching and Browsing: Basic Explanation
Browse and search suggestions are now provided to facilitate further exploration of the Cushman Collection. For every search term in a user's query that maps to the TGM I, suggestions for broadening or narrowing the search are presented. The user can choose to replace the original subject term(s) with any of the suggestions provided.
The Cushman Collection also provides faceted browsing capabilities. Users can browse by any of four facets (subject, date, location, or genre), and then add more facets to refine results as they explore. When a subject is browsed, narrower subject terms are available, just as in a subject search. When a location is browsed, the user can navigate hierarchically down from the country to the city level. Conversely, facets can be removed, thereby broadening the result set.
As in the original release of the web site, searching the Cushman Collection is an integrated process that references both the descriptive fields in our database and the TGM I thesaurus, so a user's query can be matched in either or both places. If a search term matches in the thesaurus, we automatically search all the synonyms of that term. For example, if a user types "cars," we also retrieve images described with the subject term "automobiles," because that is the term we used in our cataloging. We also retrieve images described with all narrower terms related to the term entered. For example, if a user queries "sports," we return images that have been described with the term sports as well as images of specific sports, including bowling, football, rowing, and soccer.
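The expansion described above can be illustrated with a small in-memory sketch. This is not the site's implementation, which delegates the work to Oracle Text; the class `SubjectThesaurus` and its methods are hypothetical names used only to show how a typed term is mapped to its preferred term and then widened to all narrower terms.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory stand-in for a thesaurus such as TGM I.
final class SubjectThesaurus {
    // Lead-in term (what a user might type) -> preferred term used in cataloging.
    private final Map<String, String> preferred = new HashMap<>();
    // Term -> its directly narrower terms.
    private final Map<String, List<String>> narrower = new HashMap<>();

    void addLeadIn(String leadIn, String preferredTerm) {
        preferred.put(norm(leadIn), norm(preferredTerm));
    }

    void addNarrower(String broad, String narrow) {
        narrower.computeIfAbsent(norm(broad), k -> new ArrayList<>()).add(norm(narrow));
    }

    // Returns the typed term, its preferred term, and every narrower term below them.
    Set<String> expand(String userTerm) {
        String typed = norm(userTerm);
        Set<String> result = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(typed);
        queue.add(preferred.getOrDefault(typed, typed));
        while (!queue.isEmpty()) {
            String term = queue.remove();
            if (result.add(term)) { // skip terms already visited
                queue.addAll(narrower.getOrDefault(term, Collections.<String>emptyList()));
            }
        }
        return result;
    }

    private static String norm(String s) {
        return s.trim().toLowerCase(Locale.ROOT);
    }
}
```

With "cars" registered as a lead-in for "automobiles", and "bowling" and "football" registered as narrower than "sports", expanding "cars" yields both "cars" and "automobiles", and expanding "sports" yields "sports" plus the two narrower terms.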
Thesaurus-Enhanced Searching and Browsing: Technical Details
The basic text searching capabilities of Oracle Text are helpful on their own, but it is the thesaurus support in Oracle Text that makes possible the search and browse features described earlier. This support allows for the definition of a custom thesaurus and provides special syntax in SQL queries to extend the power of a search. For example, query results can be extended to include both matches for the term entered by the user and matches for the narrower terms related to the user's term.
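As a rough illustration of that query syntax, the sketch below builds an Oracle Text query expression that uses the `SYN` and `NT` thesaurus operators and binds it to a `CONTAINS` clause through JDBC. The table name `cushman_images`, the column `subject_terms`, and the thesaurus name `tgm` are assumptions made for the example; the actual Cushman schema is not described here, and the operator arguments should be checked against the Oracle Text reference for the release in use.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

final class ThesaurusSearch {
    // Hypothetical table and column names, used for illustration only.
    static final String SQL =
        "SELECT image_id FROM cushman_images WHERE CONTAINS(subject_terms, ?) > 0";

    // Builds an expression matching synonyms of the term and its narrower terms
    // down to the given number of levels in the named thesaurus.
    static String expandedQuery(String term, String thesaurus, int levels) {
        // Keep only letters, digits, and spaces so user input cannot inject operators.
        String safe = term.replaceAll("[^\\p{L}\\p{N} ]", " ").trim();
        return "SYN(" + safe + ", " + thesaurus + ") OR NT(" + safe + ", "
            + levels + ", " + thesaurus + ")";
    }

    // Runs the expanded query; "tgm" is an assumed thesaurus name.
    static List<String> search(Connection conn, String term) throws SQLException {
        List<String> ids = new ArrayList<>();
        try (PreparedStatement ps = conn.prepareStatement(SQL)) {
            ps.setString(1, expandedQuery(term, "tgm", 2));
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    ids.add(rs.getString(1));
                }
            }
        }
        return ids;
    }
}
```

Passing the expression as a bind variable keeps the SQL statement itself constant, so only the Oracle Text query string varies from one search to the next.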
The thesaurus can also be consulted in a stand-alone fashion using Oracle Text stored procedures, which can be invoked at any time from within an application. This capability provides the support for mapping terms to our controlled vocabulary, as Oracle's rules for structuring a thesaurus allowed us to create mappings between potential lead-in terms and their preferred terms. These relationships, and the stored functions provided to look them up, allow the search and browse interface to take the terms that were entered and immediately decide whether the search should also be performed using a different word (i.e., a controlled term that was used when cataloging the collection). Similar stored functions make it possible, in the March release, to take a user's search term(s), consult the thesaurus, and create lists of narrower and broader terms from which to navigate.
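A stand-alone lookup of this kind might look like the following sketch, which calls the `CTX_THES.PT` (preferred term) and `CTX_THES.NT` (narrower terms) functions through JDBC. The class and method names are hypothetical, and the argument order shown (phrase, level, thesaurus name) reflects our reading of the Oracle Text `CTX_THES` package and should be verified for the release in use. The parser additionally assumes that the expansion comes back as brace-wrapped terms separated by `|`.

```java
import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Types;
import java.util.ArrayList;
import java.util.List;

final class ThesaurusLookup {
    static final String PREFERRED_CALL = "{? = call CTX_THES.PT(?, ?)}";
    static final String NARROWER_CALL = "{? = call CTX_THES.NT(?, ?, ?)}";

    // Looks up the preferred (controlled) term for a lead-in term.
    static String preferredTerm(Connection conn, String term, String thesaurus)
            throws SQLException {
        try (CallableStatement cs = conn.prepareCall(PREFERRED_CALL)) {
            cs.registerOutParameter(1, Types.VARCHAR);
            cs.setString(2, term);
            cs.setString(3, thesaurus);
            cs.execute();
            return cs.getString(1);
        }
    }

    // Looks up narrower terms, the given number of levels down.
    static List<String> narrowerTerms(Connection conn, String term, int levels,
            String thesaurus) throws SQLException {
        try (CallableStatement cs = conn.prepareCall(NARROWER_CALL)) {
            cs.registerOutParameter(1, Types.VARCHAR);
            cs.setString(2, term);
            cs.setInt(3, levels);
            cs.setString(4, thesaurus);
            cs.execute();
            return parseTerms(cs.getString(1));
        }
    }

    // Assumed result format: "{TERM ONE}|{TERM TWO}|...".
    static List<String> parseTerms(String expansion) {
        List<String> terms = new ArrayList<>();
        if (expansion == null) {
            return terms;
        }
        for (String part : expansion.split("\\|")) {
            String t = part.trim();
            if (t.startsWith("{") && t.endsWith("}")) {
                t = t.substring(1, t.length() - 1);
            }
            if (!t.isEmpty()) {
                terms.add(t);
            }
        }
        return terms;
    }
}
```

The list returned by `narrowerTerms` is the kind of data from which a "narrow your search" menu could be built; an equivalent call for broader terms would follow the same pattern.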
Web Application Development: Technical Details
The Struts project, sponsored by the Apache Software Foundation, is a server-side Java implementation of the Model-View-Controller (MVC) design pattern. It provides an open-source framework for creating Java-based Web applications that easily separate the presentation layer and allow it to be abstracted from the transaction/data layers. By utilizing Struts in our project, developers and interface designers were able to work effectively in parallel. Struts also helped simplify the development of the Servlets in the application by providing a set of useful classes and interfaces in an abstract programming structure.
To offer a user-friendly search scheme while still using Oracle's advanced text indexing functionality, we developed a query parser that translates users' search input into Oracle-specific SQL syntax, which is then used to query the database through JDBC. The parser accepts basic search logic: Boolean operators (AND, OR, NOT), exact phrases enclosed in quotation marks, grouping with parentheses, and stemming operators (wildcards). It also supports advanced field operators, so users can restrict a search to certain predefined fields to control their results.
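The general idea of such a parser can be sketched as follows. This is a simplified illustration rather than the parser used on the site: the class `QueryTranslator` is hypothetical, field operators are omitted, unbalanced parentheses are not repaired, and the output conventions (`&`, `|`, `~` for the Boolean operators, braces to escape terms and phrases, `%` for the `*` wildcard) are our assumptions about a reasonable mapping onto Oracle Text query syntax.

```java
import java.util.ArrayList;
import java.util.List;

final class QueryTranslator {
    // Splits input into parentheses, quoted phrases (quotes kept), and bare words.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
            } else if (c == '(' || c == ')') {
                tokens.add(String.valueOf(c));
                i++;
            } else if (c == '"') {
                int end = input.indexOf('"', i + 1);
                if (end < 0) {
                    end = input.length(); // unterminated quote: take the rest
                }
                tokens.add("\"" + input.substring(i + 1, end) + "\"");
                i = Math.min(end + 1, input.length());
            } else {
                int start = i;
                while (i < input.length() && !Character.isWhitespace(input.charAt(i))
                        && "()\"".indexOf(input.charAt(i)) < 0) {
                    i++;
                }
                tokens.add(input.substring(start, i));
            }
        }
        return tokens;
    }

    // Maps a Boolean keyword to its operator symbol, or null for any other token.
    static String operatorSymbol(String token) {
        switch (token.toUpperCase()) {
            case "AND": return "&";
            case "OR":  return "|";
            case "NOT": return "~";
            default:    return null;
        }
    }

    // Escapes a word or phrase in braces; wildcard words map '*' to '%' instead.
    static String operand(String token) {
        String text = token.replace("\"", "").replace("{", "").replace("}", "");
        if (!token.startsWith("\"") && text.contains("*")) {
            return text.replaceAll("[^\\p{L}\\p{N}*]", "").replace('*', '%');
        }
        return "{" + text + "}";
    }

    static String translate(String input) {
        StringBuilder out = new StringBuilder();
        boolean afterOperand = false; // last emitted item was a term or ')'
        String pendingOp = null;      // operator waiting for its right-hand side
        for (String t : tokenize(input)) {
            String op = operatorSymbol(t);
            if (op != null) {
                if (afterOperand) {
                    pendingOp = op;   // leading or doubled operators are dropped
                }
                continue;
            }
            if (t.equals(")")) {
                out.append(')');
                afterOperand = true;
                pendingOp = null;
                continue;
            }
            if (afterOperand) {       // adjacent terms default to AND
                out.append(' ').append(pendingOp != null ? pendingOp : "&").append(' ');
            }
            pendingOp = null;
            if (t.equals("(")) {
                out.append('(');
                afterOperand = false;
            } else {
                out.append(operand(t));
                afterOperand = true;
            }
        }
        return out.toString();
    }
}
```

Holding each operator until its right-hand operand arrives means a trailing or leading keyword in the user's input is silently ignored rather than producing a malformed expression.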
If you have any questions about the technical implementation of the Cushman site, please don't hesitate to contact us.
Last updated: Tuesday, June 19, 2007 04:30:18