The Spock Challenge

One Challenging Problem.

One Compelling Prize ($50,000)

At Spock, we love finding and meeting new people. It’s our business.

What is Spock?
Spock, an industry-leading people search application, helps users find and discover people on the web. With over one hundred million individuals indexed and millions added every day, Spock is the largest and most comprehensive people-specific search application.


Our Technology
At the core, we organize relevant information around people and have developed unique technologies to do so. Not only is this a very fun product for us and our users, but we are also fortunate enough to be working on some of the most interesting problems in computer science!


The Challenge
To improve our technology and to create a better user experience, we decided to share the fun! We have selected one of our most interesting problems, namely Entity Resolution, to share with the community, allowing other leading computer scientists and engineers to compete in an open contest. The winners of this global competition will reap a handsome reward, and perhaps even employment at Spock.


You can work individually or in teams. The competition will last 4 months and the winning team will win a Grand Prize of $50,000! Most importantly, you’ll be working on a very important and widely applicable problem. We will also be issuing prizes for 2nd and 3rd place.

The SPOCK Entity Resolution Problem:


A common problem that we face is that there are many people with the same name. Given that, how do we distinguish a document about Michael Jackson the singer from Michael Jackson the football player?
With billions of documents and people on the web, we need to accurately cluster web documents by the people they relate to. Mapping these named entities from documents to the correct person is the essence of the Spock Challenge.
The complete data-set is divided into training and test sets containing roughly 25,000 and 75,000 documents, respectively. Along with a set of documents we've included a set of target names. You can assume that each document contains only one of the target names (even though most documents contain many names). The challenge is to partition all the documents relevant to a target name by their referent. Consider the following two documents with the target name "Michael Jackson":
Michael Jackson - The King of Pop or Wacko Jacko?
Michael Jackson statistics - pro-football-reference.com
The referents of these articles are the pop star and football player, respectively. We've included the ground truth for the training set so you have something to compare against.
Once you're done training, you can run your algorithm on the test set and submit your results on this site. We will provide instant feedback in the form of a percentage rank score (using the F-measure, described below). This way you can see how you stack up against the other teams. What good is a problem without a little competition?

The Process

  • Register for the contest
  • Download the dataset
  • Submit your proposal
  • Get qualified to enter contest (based upon your proposal)
  • Develop your algorithm and software and submit/resubmit results to be scored
  • Check leaderboard
  • Spock selects finalists
  • Prepare your software and data for the final round of testing.
  • Defense and award at Spock HQ

      The key judges for this contest are:

      Cash and glamor aside, you will have the unique opportunity to meet leading industry professionals, academics, and venture capitalists who are at the forefront of leading-edge technologies and companies. The SPOCK Challenge is an opportunity that can instantly distinguish you among academics and industry professionals and jump-start your career in Silicon Valley.
      There is no cost to enter, no purchase of anything is required, and you need not be a registered SPOCK user (although we would love for you to register). So if you know (or want to learn) something about entity extraction, give it a shot. We’ll make it worth your while.

      README

      This README provides some information about the data we've provided and how to submit your results to our evaluation site.
      Be sure to check the website and the discussion board for more up-to-the-minute information.

      Data Files

      The training and test data sets are available in the training_set and test_set subdirectories, respectively.

      Within each of these directories, the data-set is further grouped (randomly) into 20 subdirectories. The reason for this is that some filesystems may not appreciate you putting 100k+ files in one directory.
      The files themselves are named according to the following format:
      SCI.(group_id).(id).html
      where group_id is the name of the subdirectory and id is a random integer. All ids are unique within each of the subdirectories, but not across the data-set.
      Each file is an HTML document crawled from the web.
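
As a minimal sketch of the naming scheme above, the following Python snippet splits a file name into its group_id and id. The function name parse_filename, the sample file names, and the assumptions that group_id contains no dots and that id consists only of digits are illustrative and not part of the official data description.

```python
import re
from typing import Optional, Tuple

# Pattern for "SCI.(group_id).(id).html".
# Assumption: group_id contains no dots and id is made of digits only.
FILENAME_RE = re.compile(r"^SCI\.(?P<group_id>[^.]+)\.(?P<id>\d+)\.html$")


def parse_filename(name: str) -> Optional[Tuple[str, int]]:
    """Return (group_id, id) for a data file name, or None if it does not match."""
    match = FILENAME_RE.match(name)
    if match is None:
        return None
    return match.group("group_id"), int(match.group("id"))


# Hypothetical file names, used only to illustrate the format.
first = parse_filename("SCI.7.12345.html")
second = parse_filename("SCI.8.12345.html")
other = parse_filename("index.html")
```

Because ids are unique only within a subdirectory, the pair (group_id, id) is a safer key for a document than the id alone: the two hypothetical files above share an id but still map to different keys.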

      Submission file format

      In order for us to evaluate your algorithm you must provide a file formatted according to the following specification. Every line contains a list (space separated) of the files belonging to a cluster. Any text occurring after a # is considered a comment.
      Consider the following example:
      # generated using algorithm X
      file1 file5 file8 file9 # cluster name: James Kirk
      file6 file7 file1 file2 # cluster name: James Kirk
      file11 file17 # cluster name: William Riker

      This file says that files 1, 5, 8, and 9 all refer to the same person. Files 6, 7, 1, and 2 refer to someone else (who happens to have the same name). The comments (i.e. "cluster name:...") are only for human consumption. They are ignored by our evaluation measure, which will be described shortly.
      Note that this format does not require your results to be consistent. That is, a document can belong to multiple clusters. In our example, we specified that file1 refers to 2 different people. In general, it's possible for a document to refer to multiple people, but in our data-set this is never the case.
      Included in the archive is the file train_groundTruth which gives the ground truth for the training set in the format required by the submission site. The comments in this file specify the name associated with each cluster.
      We've also included the file test_set_names, which lists the names used in the test set.
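
The format described above can be read with a few lines of Python. This is only a sketch: the name parse_clusters is hypothetical, and skipping blank or comment-only lines is an assumption about how such lines should be treated, not documented behavior of the evaluation site.

```python
from typing import Iterable, List


def parse_clusters(lines: Iterable[str]) -> List[List[str]]:
    """Parse cluster lines: space-separated file names, '#' starts a comment."""
    clusters = []
    for line in lines:
        content = line.split("#", 1)[0]  # drop the comment, if any
        files = content.split()          # split on whitespace
        if files:                        # assumption: skip empty and comment-only lines
            clusters.append(files)
    return clusters


# The example from the README, one string per line of the file.
example = [
    "# generated using algorithm X",
    "file1 file5 file8 file9 # cluster name: James Kirk",
    "file6 file7 file1 file2 # cluster name: James Kirk",
    "file11 file17 # cluster name: William Riker",
]
clusters = parse_clusters(example)
```

Passing an open file object instead of a list works the same way, since the function only iterates over lines.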

      Evaluation Measure

      We will use the F-measure as our evaluation measure. The F-measure is a weighted combination of precision and recall. If these terms are unfamiliar to you then take a look at this
      The F-measure includes a parameter alpha which controls the tradeoff between precision and recall. For our purposes, we set alpha=1/3, which favors precision 3 times as much as recall.
      Precision and recall are traditionally used to evaluate classification results. To use them to evaluate clustering results, we simply treat every pair of documents as an instance. The 'classification' task is to determine whether two documents refer to the same person, i.e. belong in the same cluster.
      To make this clear, consider the following example:

      Truth:
      1,2,3,4
      5,6
      7,8,9
      Hypothesis:
      1,2,3
      4,5,6,7
      8,9

      In this case, the numbers 1 and 2 are correctly placed in the same group ("True Positive"), 4 and 5 are incorrectly placed in the same group ("False Positive"), 1 and 4 are incorrectly put into separate groups ("False Negative"), and 1 and 5 are correctly put into separate groups ("True Negative"). There are 36 pairs of numbers in total. Counting each case we have:
      True Positives (TP): 5
      False Positives (FP): 5
      False Negatives (FN): 5
      True Negatives (TN): 21
      The formulas for precision (P) and recall (R) are:
      P = TP/(TP + FP)
      R = TP/(TP + FN)
      So in our example precision and recall both equal 1/2.
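
The pair counts in this example can be reproduced with a short Python sketch. It is not the provided evaluate_clusters.py script; the function names same_cluster_pairs and pair_counts are hypothetical, and the code assumes every item appears in exactly one cluster per grouping, as in the example.

```python
from itertools import combinations
from typing import List, Set, Tuple


def same_cluster_pairs(clusters: List[List[int]]) -> Set[Tuple[int, int]]:
    """Return every unordered pair of items that share a cluster."""
    pairs = set()
    for cluster in clusters:
        for a, b in combinations(sorted(set(cluster)), 2):
            pairs.add((a, b))
    return pairs


def pair_counts(truth: List[List[int]], hypothesis: List[List[int]]):
    """Count TP, FP, FN and TN over all pairs of items."""
    truth_pairs = same_cluster_pairs(truth)
    hyp_pairs = same_cluster_pairs(hypothesis)
    items = {x for c in truth for x in c} | {x for c in hypothesis for x in c}
    total = len(items) * (len(items) - 1) // 2  # number of unordered pairs
    tp = len(truth_pairs & hyp_pairs)   # together in both groupings
    fp = len(hyp_pairs - truth_pairs)   # together only in the hypothesis
    fn = len(truth_pairs - hyp_pairs)   # together only in the truth
    tn = total - tp - fp - fn           # apart in both groupings
    return tp, fp, fn, tn


truth = [[1, 2, 3, 4], [5, 6], [7, 8, 9]]
hypothesis = [[1, 2, 3], [4, 5, 6, 7], [8, 9]]
tp, fp, fn, tn = pair_counts(truth, hypothesis)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
```

Note that precision is undefined when the hypothesis places no two items together, and recall is undefined when the truth does; this sketch does not handle those cases.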
      For your convenience, we've provided a script for calculating the F-measure. It takes a list of file names as arguments. The first file specifies the ground truth and the remaining files specify hypothesized groupings. All files are in the format specified above. You can invoke the script as follows (assuming you have Python installed):
      ./evaluate_clusters.py train_groundTruth train_hyp
      where train_hyp are your results for the training set. To change the value of alpha to something other than 1/3, use the --alpha option:
      ./evaluate_clusters.py --alpha=1 train_groundTruth train_hyp


      Latest page update: made by websig, Apr 19 2007, 11:03 PM EDT