As part of the presentation, I discussed some problems that I had in the past with obtaining well-organized news corpora that are both comprehensive and also easily accessible using standard tools. Factiva has an excellent database of articles, exported in a richly annotated XML format but unfortunately Factiva prohibits data mining of the content of its archives.
The librarians in the conference were very helpful in offerring suggestions and acknowledging that providing content for data mining purposes should be one of the goals of any preservation effort.
So, yesterday I received an email from Dorothy Carner informing me about the availability of The New York Times Corpus, a corpus of 1.8 million articles from The New York Times, dating from 1987 until 2007. The details are available from http://corpus.nytimes.com but let me repeat some of the interesting facts here (the emphasis below is mine):
Yes, 1.8 million articles, in richly annotated XML, with summaries, with hierarchically categorized articles, and with verified annotations of people, locations, and organizations! Expect the corpus to be a de facto standard for many text-centric research efforts! Hopefully more organizations are going to follow the example of New York Times and we are going to see such publicly available corpora from other high-quality sources. (I know that Associated Press has an archive of almost 1Tb of text, in computerized form, and hopefully we will see something similar from them as well.)
The New York Times Annotated Corpus is a collection of over 1.8 million articles annotated with rich metadata published by The New York Times between January 1, 1987 and July 19, 2007.
With over 650,000 individually written summaries and 1.5 million manually tagged articles, The New York Times Annotated Corpus has the potential to be a valuable resource for a number of natural language processing research areas, including document summarization, document categorization and automatic content extraction.
The corpus is provided as a collection of XML documents in the News Industry Text Format (NITF). Developed by a consortium of the world’s major news agencies, NITF is an internationally recognized standard for representing the content and structure of news documents. To learn more about NITF please visit the NITF website.
Highlights of The New York Times Annotated Corpus include:
- Over 1.8 million articles written and published between January 1, 1987 and June 19, 2007.
- Over 650,000 article summaries written by the staff of The New York Times Index Department.
- Over 1.5 million articles manually tagged by The New York Times Index Department with a normalized indexing vocabulary of people, organizations, locations and topic descriptors.
- Over 275,000 algorithmically-tagged articles that have been hand verified by the online production staff at NYTimes.com.
- Java tools for parsing corpus documents from xml into a memory resident object.
To learn more about The New York Times Annotated Corpus please read the PDF Overview.
How can you get the corpus? It is available from LDC, for 300 USD for non-members; members should get this for free.
I am looking forward to receiving the corpus and start playing!