Collaborative filtering

From Wikipedia, the free encyclopedia

Collaborative filtering (CF) is the process of filtering for information or patterns using techniques involving collaboration among multiple agents, viewpoints, or data sources. Applications of collaborative filtering typically involve very large data sets. Collaborative filtering methods have been applied to many kinds of data, including sensing and monitoring data (for example in mineral exploration, or in environmental sensing over large areas or with multiple sensors), financial data (for example at financial service institutions that integrate many financial sources), and user data in electronic commerce and Web 2.0 applications. The remainder of this discussion focuses on collaborative filtering for user data, although some of the methods and approaches may apply to the other major applications as well.

In this setting, collaborative filtering is the method of making automatic predictions (filtering) about the interests of a user by collecting taste information from many users (collaborating). The underlying assumption of the CF approach is that those who agreed in the past tend to agree again in the future. For example, a collaborative filtering or recommendation system for music tastes could predict which music a user is likely to enjoy, given a partial list of that user's tastes (likes or dislikes). Note that these predictions are specific to the user, but use information gleaned from many users. This differs from the simpler approach of giving an average (non-specific) score for each item of interest, for example based on its number of votes.

Methodology

Collaborative filtering systems usually take two steps:

  1. Look for users who share the same rating patterns as the active user (the user for whom the prediction is made).
  2. Use the ratings from the like-minded users found in step 1 to calculate a prediction for the active user.
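
The two steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy ratings, the choice of cosine similarity, the neighbourhood size `k`, and the function names are all hypothetical and are not prescribed by the text.

```python
from math import sqrt

# Hypothetical toy data: user -> {item: rating on a 1-5 scale}.
ratings = {
    "alice": {"a": 5, "b": 3, "c": 4},
    "bob":   {"a": 5, "b": 3, "c": 4, "d": 5},
    "carol": {"a": 1, "b": 5, "c": 2, "d": 1},
    "dave":  {"a": 4, "b": 2, "d": 4},
}

def cosine_similarity(u, v):
    """Cosine similarity computed over the items both users have rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = sqrt(sum(u[i] ** 2 for i in common))
    norm_v = sqrt(sum(v[i] ** 2 for i in common))
    return dot / (norm_u * norm_v)

def predict(ratings, active, item, k=2):
    """Predict the active user's rating for an item.

    Step 1: find the k users most similar to the active user
            among those who have rated the item.
    Step 2: return the similarity-weighted mean of their ratings,
            or None if no usable neighbour exists.
    """
    neighbours = [
        (cosine_similarity(ratings[active], other), other[item])
        for user, other in ratings.items()
        if user != active and item in other
    ]
    neighbours.sort(key=lambda pair: pair[0], reverse=True)
    top = neighbours[:k]
    total = sum(sim for sim, _ in top)
    if total == 0:
        return None
    return sum(sim * rating for sim, rating in top) / total

prediction = predict(ratings, "alice", "d")
```

With this toy data the two nearest neighbours of "alice" are "bob" and "dave", so the prediction for item "d" falls between their ratings of 5 and 4. Practical systems often use mean-centred ratings or Pearson correlation instead of raw cosine similarity; the choice here is made only for brevity.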

Alternatively, item-based collaborative filtering, popularized by Amazon.com ("users who bought x also bought y") and first proposed in the context of rating-based collaborative filtering by Vucetic and Obradovic in 2000, proceeds in an item-centric manner:

  1. Build an item-item matrix determining relationships between pairs of items.
  2. Using the matrix and the data on the current user, infer the user's tastes.

See, for example, the Slope One item-based collaborative filtering family.
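
As a rough sketch of the item-centric procedure, the following Python code implements the weighted variant of Slope One. The item-item matrix stores, for each ordered pair of items, the average rating difference and the number of users who rated both. The toy data and all function names are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical toy data: user -> {item: rating}.
ratings = {
    "john": {"A": 5, "B": 3, "C": 2},
    "mark": {"A": 3, "B": 4},
    "lucy": {"B": 2, "C": 5},
}

def build_deviations(ratings):
    """Step 1: build the item-item matrix.

    Returns two dicts keyed by (i, j): the average of (rating_i - rating_j)
    over users who rated both items, and the number of such users.
    """
    diff_sum = defaultdict(float)
    count = defaultdict(int)
    for user_ratings in ratings.values():
        for i, r_i in user_ratings.items():
            for j, r_j in user_ratings.items():
                if i != j:
                    diff_sum[(i, j)] += r_i - r_j
                    count[(i, j)] += 1
    dev = {pair: diff_sum[pair] / count[pair] for pair in count}
    return dev, dict(count)

def predict(dev, count, user_ratings, target):
    """Step 2: infer the user's rating for `target` from the matrix.

    Each item j the user has rated contributes (dev[target, j] + r_j),
    weighted by how many users co-rated target and j.
    """
    numerator = 0.0
    denominator = 0
    for j, r_j in user_ratings.items():
        if (target, j) in count:
            numerator += (dev[(target, j)] + r_j) * count[(target, j)]
            denominator += count[(target, j)]
    return numerator / denominator if denominator else None

dev, count = build_deviations(ratings)
lucy_a = predict(dev, count, ratings["lucy"], "A")  # 13/3, about 4.33
```

Here item A is rated on average 0.5 higher than B (two co-raters) and 3 higher than C (one co-rater), so the prediction for "lucy" is ((0.5 + 2) * 2 + (3 + 5) * 1) / 3 = 13/3.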

Another form of collaborative filtering can be based on implicit observations of normal user behavior (as opposed to the artificial behavior imposed by a rating task). These systems observe what a user has done together with what all users have done (what music they have listened to, what items they have bought) and use that data to predict the user's future behavior, or to predict how the user might like to behave if given the chance. These predictions then have to be filtered through business logic to determine how they should affect what a business system does. It is, for instance, not useful to offer to sell somebody some music if they have already demonstrated that they own it.
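
A minimal sketch of this idea, assuming purchase histories as the only observed behavior: items are scored by how often they were bought together with items the user already owns, and a single business rule removes anything the user already has. The data, the co-occurrence scoring, and the function names are hypothetical choices for illustration.

```python
from collections import Counter
from itertools import combinations

# Hypothetical purchase histories: user -> set of items bought.
purchases = {
    "u1": {"album_a", "album_b", "album_c"},
    "u2": {"album_a", "album_b"},
    "u3": {"album_b", "album_c"},
    "u4": {"album_a", "album_d"},
}

def co_occurrence(histories):
    """Count, for each ordered pair of items, how many users bought both."""
    counts = Counter()
    for items in histories.values():
        for x, y in combinations(sorted(items), 2):
            counts[(x, y)] += 1
            counts[(y, x)] += 1
    return counts

def recommend(histories, user, n=3):
    """Score unowned items by co-occurrence with the user's owned items."""
    counts = co_occurrence(histories)
    owned = histories[user]
    scores = Counter()
    for (x, y), c in counts.items():
        # Business rule: never offer an item the user already owns.
        if x in owned and y not in owned:
            scores[y] += c
    return [item for item, _ in scores.most_common(n)]

suggestions = recommend(purchases, "u2")  # ["album_c", "album_d"]
```

For "u2", who owns album_a and album_b, album_c scores 3 (once with album_a, twice with album_b) and album_d scores 1, while the owned albums are excluded outright.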

In the age of information explosion such techniques can prove very useful, as the number of items in even a single category (such as music, movies, books, news, or web pages) has become so large that a single person cannot possibly view them all in order to select the relevant ones. Relying on a scoring or rating system that is averaged across all users ignores the specific demands of an individual user, and performs particularly poorly in tasks where interests vary widely, for example in the recommendation of music. Other methods to combat information explosion also exist, such as web search and data clustering.

History

Collaborative filtering stems from the earlier technique of information filtering, in which relevant information is brought to the attention of the user by observing patterns in previous behaviour and building a user profile. Such a system was of little help in exploring the web and suffered from the cold-start problem: new users had to build up a history of behaviour before the filtering became effective.

The first system to use collaborative filtering was the Information Tapestry project at Xerox PARC [1]. It allowed users to find documents based on previous comments by other users. The system had significant limitations: it worked only for small groups of people, and it had to be accessed through word-specific queries, which largely defeated the purpose of collaborative filtering.

USENET Net news extended collaborative filtering to a mass audience while offering a simpler method for accessing articles. The system allowed users to rate material based on popularity, and other users could then search for articles based on these ratings.

One of the largest early collaborative filtering services for music recommendations widely available on the World Wide Web was Firefly, which evolved from early MIT Media Lab research projects. Firefly was bought by Microsoft in 1998. The service itself was closed down in 1999 with much of its technology and staff helping to create Microsoft Passport.

Types

Active filtering

Active filtering has become increasingly popular in recent years because of the ever-growing base of information available to users of the World Wide Web. As the amount of information on the Internet grows exponentially, finding useful information efficiently becomes more difficult. A basic web search now returns thousands of results, many of which are irrelevant. A large number of databases and search engines are available, but most people are not familiar with all of the options. This is where active filtering comes into effect.

Active filtering differs from other methods of collaborative filtering in that it uses a peer-to-peer approach: peers, coworkers, and people with similar interests rate products, reports, and other material objects, and share these ratings over the web for other people to see. It rests on the premise that people want to share consumer information with their peers. Users of active filtering send this information over the web using lists of commonly used links, and others can view the ratings and use them to make their own decisions.

Active collaborative filtering can be useful in many situations. It is particularly effective when an unguided web search produces thousands of results that are of no use to the person seeking the information, and when people are not comfortable with or knowledgeable about the array of databases available to them.

Advantages

Active collaborative filtering has several advantages. One is that the rating comes from a person who has actually viewed the topic or product of interest. This yields an explanation and ranking from a reliable source, namely someone who has come into contact with the product. Another is that people want to provide information on the matter at hand, and ultimately do so.

Disadvantages

Active filtering has a few disadvantages. One is that the opinion may be biased. Also, because providing feedback requires action by the user, less data may be available than with a passive approach, and user expectations may not be met. In addition, a small number of highly favorable opinions might result in an inflated evaluation for an item, placing it unrealistically high among other items with long-standing evaluations.

One of the main characteristics of collaborative filtering, compared to, for example, content-based filtering, is that the method knows nothing about the items' true content or what they are about. It relies only on preference values, such as ratings, to generate recommendations. As a result, how widely each item is recommended depends heavily on other users' ratings, which can introduce averaging effects. Averaging effects cause the overall most popular items to be recommended more often, which means that they will be consumed and rated more frequently as a result.

Two other issues are well known and often associated with collaborative filtering: the first-rater problem and the cold-start problem. The first-rater problem is caused by new items that have not yet received any ratings from any users. The system cannot relate these items to others, so they are never recommended. Similarly, the cold-start problem is caused by new users who have not submitted any ratings. Without any information about the user, the system cannot guess the user's preferences or generate recommendations until a few items have been rated.

Passive filtering

Passive filtering, which collects information implicitly, is a method of collaborative filtering that is thought to have great potential. A web browser is used to record a user's preferences by following and measuring their actions. These implicit observations are then used to determine what else the user will like and to recommend potential items of interest. Implicit filtering relies on the actions of users to determine a value rating for specific content, such as:

  • Purchasing an item
  • Repeatedly using, saving, printing an item
  • Referring or linking to a site
  • Number of times queried

An important feature of passive collaborative filtering is its use of time to determine whether a user is scanning a document or fully reading the material. The greatest strength of the system is that it removes certain variables from the analysis that would normally be present in active filtering. For example, only certain types of people will take the time to rate a site, whereas in passive collaborative filtering anyone accessing the site automatically provides data.
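
One way to turn such observed actions into a value rating is a simple weighted sum, sketched below. The text lists the signals but assigns them no values, so the weights, the 30-second reading threshold, the event format, and the function name are all assumptions made for illustration.

```python
# Hypothetical weights for the implicit signals listed above.
ACTION_WEIGHTS = {
    "purchase": 5.0,
    "save": 3.0,
    "print": 3.0,
    "link": 2.0,
    "query": 1.0,
}

# Assumed boundary between scanning a document and reading it fully.
READ_THRESHOLD_SECONDS = 30

def implicit_score(events):
    """Derive a value rating for one user and one item.

    `events` is a list of (action, seconds) tuples. A "view" counts fully
    only if the time spent suggests the user read the material rather
    than scanned it; other actions add their fixed weight.
    """
    score = 0.0
    for action, seconds in events:
        if action == "view":
            score += 1.0 if seconds >= READ_THRESHOLD_SECONDS else 0.1
        else:
            score += ACTION_WEIGHTS.get(action, 0.0)
    return score

scanned = implicit_score([("view", 5)])                    # 0.1
read_and_saved = implicit_score([("view", 120), ("save", 0)])  # 4.0
```

The resulting scores can then stand in for explicit ratings in a collaborative filtering method. In practice the weights would need to be tuned against observed outcomes rather than fixed by hand.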

Item-based filtering

Item-based filtering is another method of collaborative filtering in which items, rather than users, are rated and used as parameters. This type of filtering uses the ratings to group items together so that consumers can compare them, and it provides a consumer-based rating scale on which manufacturers can see where their product stands in the market.

Through this method of filtering, users or user groups use and test a product and give it a rating that is relevant to the product and to the product class in which it falls. These users test many products, and the products are then classified based on the information that the ratings hold. Having the same user or group test the products helps to produce accurate ratings and eliminates some of the error that is possible in such tests.

Explicit versus implicit filtering

Within active and passive filtering there are explicit and implicit methods for determining user preferences. Explicit collection of user preferences requires the evaluator to indicate a value for the content on a rating scale. This places a cognitive demand on the evaluator, but can mean that the feedback received is more accurate. Implicit collection does not involve the direct input of opinion by the user; instead, it is assumed that their opinion is implied by their actions. This reduces variability amongst users and reduces the demand on the user, which can mean that much more data is available. However, this behaviour data does not necessarily represent the user's true opinion of an item accurately.

Applications

In commercial systems

Commercial sites that implement collaborative filtering systems include:

In non-commercial systems

Non-commercial sites that implement collaborative filtering systems include:

Service              Type
AmphetaRate          RSS articles
Everyone's a Critic  movies
GiveALink.org        websites
Gnomoradio           music (free)
Musicmobs            music
Rate Your Music      music

Innovations in Collaborative Filtering

  • New algorithms have been developed for CF as a result of the Netflix Prize.
  • Cross-system collaborative filtering, where user profiles across multiple recommender systems are combined in a privacy-preserving manner.
  • Robust collaborative filtering, where recommendations remain stable in the face of manipulation attempts. This research area is still active, and the problem is not completely solved.[2]

References

  1. ^ Goldberg, David; David Nichols, Brian M. Oki, Douglas Terry (1992). "Using collaborative filtering to weave an information tapestry". Communications of the ACM 35 (12): 61–70. doi:10.1145/138859.138867. ISSN 0001-0782.
  2. ^ http://portal.acm.org/citation.cfm?id=1297240
