Google Search Appliance

The University of Chicago uses a Google Search Appliance (GSA) as the main search tool for public web-based content. The GSA continually indexes new documents as they are posted to the University of Chicago websites, and guides users to relevant content using customized search results. The GSA uses the same technology as google.com: it’s a locally run instance of Google focused exclusively on the University of Chicago.

The GSA is managed by the Web Services and Web Administration groups within IT Services. If you have a question about implementing a GSA-powered search form, or run into a problem with one, please contact us at search@lists.uchicago.edu.

How does the GSA work?

GSA-powered searches differ from a public Google search in several important ways. The GSA is exclusively focused on University of Chicago content, and we are able to control the focus and timing of the content crawl (also called indexing). We can customize search results through “keymatches” (pinning a chosen page to the top of the results page for a given word or phrase) and define “collections” of websites to better focus searches.

Keymatches: If you have a suggestion for a word or phrase that should return a certain site at the top of the results page, contact us to request a keymatch.

Collections: If you would like to search several specific sites from a single search form, contact us to request a collection.
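
As an illustration only, the two features can be modeled in a few lines of Python. This is a sketch of the behavior described above, not the appliance's implementation; the table contents, host names, and function names are all hypothetical.

```python
from urllib.parse import urlparse

# Hypothetical keymatch table: query phrase -> URL pinned to the top of results.
KEYMATCHES = {"parking": "https://transport.example.edu/parking/"}

# Hypothetical collection: a named set of hosts searched together.
COLLECTIONS = {"dept-sites": {"dept-a.example.edu", "dept-b.example.edu"}}


def apply_keymatches(query, results):
    """Return results with any keymatch for the query placed first."""
    pinned = KEYMATCHES.get(query.strip().lower())
    if pinned is None:
        return list(results)
    # Put the pinned URL first and drop its duplicate from the organic results.
    return [pinned] + [url for url in results if url != pinned]


def restrict_to_collection(name, results):
    """Keep only results whose host belongs to the named collection."""
    hosts = COLLECTIONS[name]
    return [url for url in results if urlparse(url).hostname in hosts]
```

In this model a keymatch changes only the ordering of results for one phrase, while a collection changes which sites are searched at all.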

Indexing and crawling

The UChicago GSA is set on a “continuous crawl” of UChicago web content: once it completes one round of indexing, it immediately starts another. The GSA crawls and indexes content on the following domains:

uchicago.edu
chicagogsb.edu
chicagobooth.edu
uchospitals.edu
uchicagokidshospital.org

The GSA uses the following as a starting point:

www.uchicago.edu

The appliance crawls by following links, and indexes content only if it is linked from another indexed page. It follows HTML links in PDF files, Word documents, and Flash content. The search appliance crawler does not follow HTML links embedded in JavaScript code, and it cannot submit HTML forms.
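
The link-following behavior can be sketched as a breadth-first walk over a link graph. This is a simplified model under stated assumptions, not the appliance's crawler: the `links` dictionary stands in for fetching and parsing real pages, subdomains of the listed domains are assumed to be in scope, and the function names are hypothetical.

```python
from collections import deque
from urllib.parse import urlparse

# Domains listed above as in scope for the crawl.
ALLOWED_DOMAINS = (
    "uchicago.edu",
    "chicagogsb.edu",
    "chicagobooth.edu",
    "uchospitals.edu",
    "uchicagokidshospital.org",
)


def in_scope(url):
    """True if the URL's host is an allowed domain or one of its subdomains."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in ALLOWED_DOMAINS)


def crawl(seed, links):
    """Walk the link graph from the seed, returning URLs in discovery order.

    `links` maps each page URL to the URLs it links to.
    """
    seen = {seed}
    order = [seed]
    queue = deque([seed])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            # Only unseen, in-scope pages are added to the index.
            if target not in seen and in_scope(target):
                seen.add(target)
                order.append(target)
                queue.append(target)
    return order
```

A page that exists on an allowed domain but is not linked from any page reachable from the starting point never enters the queue, which is why unlinked content does not appear in search results.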

We have defined a list of exclusion rules that prevent the GSA from crawling certain sites and types of content, both to prevent high server traffic and to stay within the document-indexing limit defined in our license agreement. The following types of content are not included in the UChicago search collection:

images and media
database files
archive files
binaries and executables
Apache directory listings
sites requiring any type of authentication
dynamic calendars that can result in a high number of document counts (unique URLs)
directory database listings
resource reservation systems
other dynamic sites that may provide a high number of document counts

If we find that your site is contributing to a high document count, we will work with you to resolve the issue.
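
To give a rough idea of how exclusion rules of this kind behave, here is a minimal sketch that tests a URL against a few patterns. The patterns are hypothetical examples for some of the content types listed above; they are not the rules actually configured on the appliance, and the function name is our own.

```python
import re
from urllib.parse import urlparse

# Hypothetical exclusion patterns, matched against the URL path only.
EXCLUDE_PATTERNS = [
    re.compile(r"\.(jpe?g|png|gif|mp3|mp4|mov)$", re.IGNORECASE),  # images and media
    re.compile(r"\.(zip|tar|gz|tgz)$", re.IGNORECASE),  # archive files
    re.compile(r"\.(exe|dll|bin|dmg)$", re.IGNORECASE),  # binaries and executables
    re.compile(r"/calendar/", re.IGNORECASE),  # dynamic calendars
]


def is_excluded(url):
    """True if the URL's path matches any exclusion pattern."""
    path = urlparse(url).path  # ignore the query string and fragment
    return any(pattern.search(path) for pattern in EXCLUDE_PATTERNS)
```

A dynamic calendar shows why such rules matter: every day, week, and month view has its own unique URL, so a single calendar can add a very large number of documents to the index.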

Google Search Resources

Google Search Appliance email list

If you are using the UChicago GSA for the search form on your site, please subscribe to the Google Appliance email list. This will allow us to contact you with updates and announcements related to the appliance.