From Wikipedia, the free encyclopedia

SIMILE is a research project focused on developing tools to increase the interoperability of disparate digital collections. As digital library collections proliferate and their contents expand, they come under increasing pressure to provide for interoperability across collections. The benefits for scholars of being able to seamlessly search across collections maintained by the local library, other digital libraries and licensed digital collections like JSTOR and ARTstor are compelling; networked libraries need to be interoperable. Digital libraries are unable to afford to maintain their own, collection-specific content descriptions. In any case, those descriptions do not integrate well with traditional library catalogues. The difficulties of retrieving information from thousands of digital libraries mean that much available information is effectively rendered invisible to individuals searching with mainstream library-based search systems. Researchers urgently need tools which can process a wide variety of types and sources of metadata and expose them to search. Such tools must operate across different communities with different schemes, vocabularies, ontologies and metadata to provide research services to their users. Project SIMILE was started to meet the challenge of developing such tools.


History

SIMILE stands for Semantic Interoperability of Metadata and Information in unLike Environments. SIMILE is a project run jointly by the World Wide Web Consortium (W3C), the Massachusetts Institute of Technology Libraries and CSAIL. It was born out of DSpace, the open source digital repository system for scholarly materials developed at MIT. DSpace, which is now used at a number of research institutions, archives scholarly publications and makes them accessible. The aim of DSpace is to make it possible to federate the collections of the various holding libraries, so that the contents of each digital library are not entombed within its individual research community. In order to grow and enable its users to find research material which has been described in various domain-specific ways, DSpace needs the ability to support metadata schemas beyond just Dublin Core. The challenge for DSpace and other digital libraries is that they must operate across different communities, with different schemes, vocabularies, ontologies and metadata, to provide research services to their users. Those schemes and vocabularies are often mutually unintelligible. Human parsing of the contents of these digital libraries is extremely expensive, in person-hours and cost, particularly as collections grow. The tools developed within SIMILE use Semantic Web technologies to improve the automated sharing and processing of web resources, thus helping to unlock the contents of digital libraries to the world.

RDF-based tools

The Semantic Web is based on the Resource Description Framework (RDF). RDF is used to represent metadata about resources on the web, and is intended for situations where information is processed by applications, rather than human beings. Specifically, the SIMILE tools assist in the storage, querying, transformation and mapping of very large collections of RDF data. The tools developed within SIMILE are meant to allow people who are not Semantic Web developers to create ontologies which describe their specialized metadata, create RDF and convert other types of metadata into RDF. These open source tools are designed to be scalable and to provide for cross-community sharing of metadata at low cost.
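
As a rough illustration of the RDF data model, the following Python sketch stores metadata as subject-predicate-object triples and answers a simple pattern query. It is not SIMILE code: the example.org identifiers, the sample values and the match helper are hypothetical, and the Dublin Core property URIs are used only as familiar examples.

```python
# Minimal sketch of RDF-style triples (subject, predicate, object).
# All identifiers under example.org are hypothetical.
DC = "http://purl.org/dc/elements/1.1/"

triples = {
    ("http://example.org/doc/1", DC + "title", "Thesis on metadata"),
    ("http://example.org/doc/1", DC + "creator", "A. Author"),
    ("http://example.org/doc/2", DC + "title", "Survey of digital libraries"),
    ("http://example.org/doc/2", DC + "creator", "A. Author"),
}


def match(store, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return sorted(
        t for t in store
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    )


# Which resources were created by "A. Author"?
by_author = match(triples, predicate=DC + "creator", obj="A. Author")
print([s for s, _, _ in by_author])
```

Because every statement has the same three-part shape, metadata from different schemes can be stored side by side and queried with the same pattern-matching approach.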

Longwell

Longwell is a faceted browser which enables the user to visualize and browse any RDF data set, allowing the user to quickly build a user-friendly web site out of the RDF data without writing any RDF code. Facets are metadata fields considered important for a given data set. In its default configuration, the collection of facets is returned along the right-hand side of the page. Clicking on any facet refines the facets in relation to the data retrieved. Longwell then displays only the subset of the data which meets those restrictions; this subset appears on the left-hand side of the page. Previously selected restrictions can be removed, which broadens the subset of items displayed.
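
The refinement behaviour described above can be sketched in a few lines of Python. This is an illustration of faceted browsing in general, not Longwell's implementation; the records, facet names and the refine and facet_counts helpers are all hypothetical.

```python
from collections import Counter

# Hypothetical records; each key is treated as a facet.
items = [
    {"type": "photo", "creator": "Ann", "year": "2004"},
    {"type": "photo", "creator": "Bob", "year": "2005"},
    {"type": "paper", "creator": "Ann", "year": "2005"},
]


def refine(records, restrictions):
    """Keep only the records that satisfy every selected facet value."""
    return [
        r for r in records
        if all(r.get(facet) == value for facet, value in restrictions.items())
    ]


def facet_counts(records, facets):
    """Count how many records carry each value of each facet."""
    return {f: Counter(r[f] for r in records if f in r) for f in facets}


selected = {"creator": "Ann"}           # the user clicks creator = Ann
subset = refine(items, selected)        # items shown to the user
counts = facet_counts(subset, ["type", "year"])  # remaining facet values

# Removing the restriction broadens the subset again.
broadened = refine(items, {})
```

Recomputing the facet counts on the refined subset is what keeps the facet list consistent with the items currently displayed.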

Piggy Bank

Piggy Bank is a Firefox extension which enables the user to collect information from the Web, save it for future use, tag it with keywords, search and browse the information collected, retrieve saved information, share collected information and install screen scrapers. Piggy Bank gathers RDF data where it is available; where it is not, it generates RDF from HTML by using screen scrapers. This incremental approach to realizing the Semantic Web vision allows users to save and tag information gathered from web pages without having to cut, paste and label the various products of their browsing. By clicking on a keyword they have used to tag particular types of item, users can view all of those items together within the browser, without having to open other applications. Users can also deposit saved data in the Semantic Bank, where other users can browse it and add their own contributions. This pooling of keywords underlies services such as Flickr, where communities can collaborate to build a taxonomy for shared data. These taxonomies, which emerge as information is accumulated, are known as folksonomies.

Solvent

Solvent is a Firefox extension that enables the user to write screen scrapers for Piggy Bank.

Gadget

Gadget is an XML inspector which enables the user to condense large amounts of well-formed XML data.

Welkin

Welkin is a graph-based RDF visualizer. It graphs RDF data sets, allowing the user to visualize the global shape and clustering characteristics of the data, which can aid them in mentally modeling it, seeing how it connects and identifying mappings between the set and possible ontologies. A particular data cluster which stands out when graphed might well be missed when browsed at closer range.

Fresnel

Fresnel is a vocabulary for specifying how RDF graphs are presented. Fresnel addresses the problem that currently, each RDF browser and visualization tool decides, on an ad hoc basis, what information in an RDF graph is presented and how to present it. Fresnel uses the concepts of lenses and formats. Lenses determine which properties are displayed and how they are ordered. Formats control how resources and properties are presented.
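
The division of labour between lenses and formats can be sketched as follows. Fresnel itself is expressed as an RDF vocabulary; this Python analogy uses plain dictionaries and lists instead, and the sample resource, the property names and the render helper are hypothetical.

```python
# Hypothetical resource with some properties that should not be displayed.
resource = {
    "name": "Ada Lovelace",
    "mbox": "ada@example.org",
    "homepage": "http://example.org/ada",
    "internal_id": "42",
}

# A "lens": which properties are displayed, and in what order.
lens = ["name", "homepage", "mbox"]

# A "format": how each displayed property is presented (here, its label).
fmt = {"name": "Name", "homepage": "Home page", "mbox": "E-mail"}


def render(res, lens, fmt):
    """Apply the lens to select and order properties, then the format."""
    lines = []
    for prop in lens:
        if prop in res:
            label = fmt.get(prop, prop)  # fall back to the raw property name
            lines.append(f"{label}: {res[prop]}")
    return lines


for line in render(resource, lens, fmt):
    print(line)
```

Keeping selection (the lens) separate from presentation (the format) means the same display rules could, in principle, be reused by different browsers instead of being decided ad hoc in each one.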

Timeline

Timeline is a tool for visualizing events over time, similar to Google Maps. It can be populated by pointing it at an XML file.

Referee

Referee is a program that crawls the links that point to its user's pages. It extracts metadata from the linking pages and from the text around the links, converting it, if need be, into RDF. Referee distinguishes between the pages that refer to the user's pages and the comments, meaning the text immediately surrounding each link. It generates a data graph, which allows it to show, for example, that exactly the same comment about its user's pages appears on more than one page, each page being a container of the comment. A page can have more than one comment, and a comment can appear on more than one page. This many-to-many relationship can be represented in a data graph, but not in a data tree such as the one generated by the XML data model.
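
The many-to-many relationship between pages and comments can be illustrated with a small Python sketch. It is not Referee's code; the URLs, the comment text and the variable names are hypothetical.

```python
from collections import defaultdict

# Hypothetical observations: (referring page, comment found around the link).
observations = [
    ("http://example.org/blog/a", "Great overview of RDF tools"),
    ("http://example.org/blog/b", "Great overview of RDF tools"),
    ("http://example.org/blog/a", "See also the follow-up post"),
]

# Index the edges of the graph in both directions.
pages_for_comment = defaultdict(set)
comments_on_page = defaultdict(set)
for page, comment in observations:
    pages_for_comment[comment].add(page)
    comments_on_page[page].add(comment)

# Comments that appear on more than one page.
shared = {
    comment: sorted(pages)
    for comment, pages in pages_for_comment.items()
    if len(pages) > 1
}
print(shared)
```

In a tree, each comment would have to sit under exactly one page, so a comment shared by two pages would need to be duplicated; in a graph it is a single node with two incoming edges.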

RDFizer

The RDFizer project is a directory of tools for converting various data formats into RDF. MIT Libraries provides a home for some of these tools. RDFizers are tools that transform existing data into an RDF representation. Given a database of interest, these tools can often, when the data formats are highly structured, convert the data into an RDF representation without human intervention, first determining what ontology to use to express the information. Where semantic relationships are implicit, the RDFizers will not be as successful without human input. The SIMILE project has built RDFizers that convert from the following formats:

  • JPEG: Joint Photographic Experts Group (digital photo metadata).
  • MARC: United States Library of Congress MAchine-Readable Cataloging of bibliographic data.
  • MODS: Metadata Object Description Schema for bibliographic element sets.
  • OAI-PMH: Open Archives Initiative Protocol for Metadata Harvesting.
  • OCW: OpenCourseWare.
  • Email.
  • BibTeX: a tool for formatting lists of references, usually associated with LaTeX documents.
  • Flat.
  • Weather.
  • Java: an object-oriented applications programming language.
  • Javadoc: a tool for generating API documentation in HTML format from Java source code.
  • Subversion (SVN): a software revision control system.
  • Random.
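
The general idea of converting a highly structured record into RDF can be sketched as below. This is not one of the SIMILE RDFizers: the field-to-property mapping, the sample record, the subject URI and the to_ntriples helper are assumptions made for illustration, with output written as simple N-Triples statements.

```python
DC = "http://purl.org/dc/elements/1.1/"

# Hypothetical mapping from source field names to property URIs.
FIELD_TO_PROPERTY = {
    "title": DC + "title",
    "author": DC + "creator",
    "year": DC + "date",
}


def escape(value):
    """Escape backslashes and double quotes for an N-Triples literal."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def to_ntriples(subject_uri, record):
    """Turn a flat field/value record into N-Triples lines."""
    lines = []
    for field, value in record.items():
        prop = FIELD_TO_PROPERTY.get(field)
        if prop is None:
            continue  # no known mapping: this field would need human input
        lines.append(f'<{subject_uri}> <{prop}> "{escape(value)}" .')
    return lines


# Hypothetical bibliographic record, e.g. parsed from a BibTeX entry.
record = {
    "title": "An Example Report",
    "author": "A. Author",
    "year": "2006",
    "note": "draft",
}
lines = to_ntriples("http://example.org/bib/example2006", record)
print("\n".join(lines))
```

Fields with an explicit mapping convert automatically, while the unmapped note field is skipped, which mirrors the point that implicit semantics require human input.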

Crowbar

Crowbar is a web scraping environment based on a server-side, headless, Mozilla-based browser. It is used as a research prototype to investigate how to run Piggy Bank JavaScript scrapers from the command line and thus automate web site scraping.
