DeepBank
This page describes the DeepBank project. For more details about the beta-release of DeepBank (v0.9), please read here.
The DeepBank project aims to annotate the one million words of 1989 Wall Street Journal text (the same set of sentences annotated in the original Penn Treebank project) with the English Resource Grammar (ERG), augmented with a robust approximating PCFG for complete coverage. DeepBank contains rich linguistic annotation of both the syntactic and semantic structures of the sentences and is available in a variety of representation formats (see the description of formats below).
The project is hosted at the Department of Computational Linguistics of Saarland University and the Language Technology Lab of the German Research Center for Artificial Intelligence in Saarbrücken, Germany, in close collaboration with CSLI Stanford. Other institutes, including (but not limited to) Humboldt University of Berlin and the University of Oslo, have also contributed to the development and release of the resource. In the long term, DeepBank will be further supported by the DELPH-IN community with updates and maintenance.
The project is technically built on top of resources developed in the long-term grammar and software engineering effort maintained under the collaborative umbrella of DELPH-IN. Following earlier practice in the development of the Redwoods treebanks, manual annotation is done in the discriminant-based treebanking environment provided by [incr tsdb()], which is used to identify the correct full analysis among the candidate analyses proposed by the English Resource Grammar.
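The idea behind discriminant-based selection can be illustrated with a small sketch. This is not the [incr tsdb()] implementation: the function names, the property strings, and the toy candidate analyses are all hypothetical. Each candidate analysis is reduced to a set of properties, a discriminant is any property that holds for some candidates but not all, and each yes/no decision by the annotator removes the candidates that disagree with it.

```python
def discriminants(candidates):
    """Return the properties that hold for some candidates but not all."""
    property_sets = list(candidates.values())
    if not property_sets:
        return set()
    return set.union(*property_sets) - set.intersection(*property_sets)


def apply_decisions(candidates, decisions):
    """Keep the candidates that agree with every yes/no decision so far."""
    return {
        name: props
        for name, props in candidates.items()
        if all((d in props) == accepted for d, accepted in decisions.items())
    }


# Hypothetical candidate analyses for "I saw the man with the telescope".
candidates = {
    "analysis-1": {"subj:I", "pp-attaches-to:verb"},
    "analysis-2": {"subj:I", "pp-attaches-to:noun"},
    "analysis-3": {"subj:I", "pp-attaches-to:noun", "saw:tool-reading"},
}

# Properties shared by all candidates ("subj:I") are not worth asking about.
open_questions = discriminants(candidates)

# Two annotator decisions are enough to isolate a single analysis here.
remaining = apply_decisions(
    candidates, {"pp-attaches-to:noun": True, "saw:tool-reading": False}
)
print(sorted(open_questions))
print(sorted(remaining))
```

The point of the design is that the annotator answers a handful of local questions instead of inspecting every full analysis, which matters when the grammar proposes many candidates per sentence.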
For the first public release of DeepBank, most of the data has gone through at least two rounds of human annotation with independent annotators. Also, the linguistic analyses in DeepBank were made independently from the previous treebank annotations of the same data (i.e. PTB), distinguishing it from PTB-derived treebanks including the Enju HPSG treebank, CCGBank, and the CoNLL syntactic dependency bank, to name a few.
For completeness of the annotations over the full corpus, the public release of DeepBank also includes analyses (trees) licensed by an approximating PCFG for the sentences of the WSJ corpus not correctly analysed by the current version of the ERG. Semantic structures are also composed robustly for these sentences, which comprise some 15% of the 50,000-sentence total.
The development of DeepBank started in the fall of 2008 as an internally funded project at the Department of Computational Linguistics of Saarland University and the LT-Lab of DFKI, under the supervision of Valia Kordoni and Yi Zhang. Thanks to the partial financial support of the Erasmus Mundus European Masters Program in Language and Communication Technologies (LCT), part-time student annotators were employed and trained for the first round of annotation. Dan Flickinger, the main ERG developer, has provided grammar updates throughout the project. He also carried out a thorough second round of annotation updates in preparation for the first public release of DeepBank. Both the ERG and DeepBank have evolved significantly over the years of the project, but the dynamic nature of the annotation method has kept them synchronized through the update cycles.
By the summer of 2012, the development of DeepBank had reached a mature stage in which a significant amount of the data had gone through two rounds of careful annotation. The resource was made available for internal DELPH-IN review (alpha release) by several sites, including the University of Oslo, the University of Washington, Melbourne University, the University of Barcelona, the Bulgarian Academy of Sciences, and the University of Lisbon. Many suggestions and detailed feedback helped us prepare for the first full public release of DeepBank.
At the end of November 2012, a substantial portion of DeepBank (WSJ sections 00-15) was opened for public preview through a beta release announced at TLT in Lisbon. The beta version (v0.9) is now available for download. It includes annotation for WSJ sections 00-15 only, in the original [incr tsdb()] format. Further sections and other formats will follow in the final release (v1.0), which is expected to arrive in January 2013.
The public release (v1.0) of DeepBank will include annotation in multiple formats. The combination of the raw [incr tsdb()] profiles with a corresponding version of the ERG enables automatic reconstruction of all detailed analyses. The HPSG derivations and the MRSes are recorded in these profiles and can be extracted directly.
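As a rough illustration of direct extraction, the sketch below reads a profile using only the Python standard library. It assumes the usual [incr tsdb()] layout: a `relations` file that lists the field names of each relation, and one text file per relation with `@`-separated fields. The helper names and the four-field `result` schema in the demo are simplified, hypothetical stand-ins; a real profile defines many more fields, may store its files gzip-compressed, and escapes special characters inside field values, none of which is handled here.

```python
from pathlib import Path
import tempfile


def read_relations(text):
    """Map each relation name to its ordered list of field names."""
    schema, current = {}, None
    for line in text.splitlines():
        if not line.strip() or line.lstrip().startswith("#"):
            continue  # skip blank lines and comments
        if not line[0].isspace() and line.rstrip().endswith(":"):
            current = line.strip()[:-1]  # unindented "name:" starts a relation
            schema[current] = []
        elif current is not None:
            schema[current].append(line.split()[0])  # first token is the field
    return schema


def read_table(profile_dir, name, schema):
    """Yield one dict per row of the named relation file."""
    with (Path(profile_dir) / name).open(encoding="utf-8") as handle:
        for line in handle:
            values = line.rstrip("\n").split("@")
            yield dict(zip(schema[name], values))


# Toy profile with a reduced, hypothetical schema (not the real one).
TOY_RELATIONS = """\
result:
  parse-id :integer :key
  result-id :integer
  derivation :string
  mrs :string
"""

with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "relations").write_text(TOY_RELATIONS, encoding="utf-8")
    Path(tmp, "result").write_text(
        "1@0@(root (np dogs) (vp bark))@[ toy MRS ]\n", encoding="utf-8"
    )
    schema = read_relations(Path(tmp, "relations").read_text(encoding="utf-8"))
    rows = list(read_table(tmp, "result", schema))

for row in rows:
    print(row["parse-id"], row["derivation"], row["mrs"])
```

For real work, an existing DELPH-IN toolkit that already handles compression and escaping is likely a better choice than hand-rolled parsing.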
For convenience, DeepBank is also available in other representation formats (though not all details are preserved in the converted representations), including a (modified) Penn-style constituent tree representation with labeled brackets and a CoNLL-style syntactic and semantic dependency representation in tab-separated format. The conversion software will be available to the public and maintained collaboratively between Oslo and Saarbrücken.
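A labeled-bracket tree can be read with a few lines of code. The sketch below is a minimal reader for generic Penn-style bracketing; the function names are hypothetical, and the example tree and its labels are illustrative only, since the label inventory of the modified DeepBank format is not described here.

```python
def parse_brackets(text):
    """Parse one labeled-bracket tree into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError(f"expected '(' but found {tokens[pos]!r}")
        pos += 1
        label = ""
        if tokens[pos] not in ("(", ")"):  # the label may be empty
            label = tokens[pos]
            pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())  # nested constituent
            else:
                children.append(tokens[pos])  # terminal word
                pos += 1
        pos += 1  # consume the closing bracket
        return (label, children)

    tree = node()
    if pos != len(tokens):
        raise ValueError("trailing material after the tree")
    return tree


def leaves(tree):
    """Return the terminal words of a tree in left-to-right order."""
    _, children = tree
    words = []
    for child in children:
        words.extend(leaves(child) if isinstance(child, tuple) else [child])
    return words


tree = parse_brackets("(S (NP (NNS Stocks)) (VP (VBD fell)))")
print(tree)
print(leaves(tree))
```

Unbalanced input raises an exception rather than returning a partial tree, which is usually the safer behaviour when converting a whole corpus.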
For further information about the treebank, please feel free to contact Yi Zhang.
We are grateful to the Erasmus Mundus European Masters Program in Language and Communication Technologies (LCT, EM Grant Number: 2007-0060) for financial support of the project.
We are equally grateful to the following student annotators for their diligent and patient work. All remaining errors in the treebank are of course ours.
- Ming Wen
- Maria Sukhareva
- Lea Frermann
- Iliana Simova
The involvement of Yi Zhang in the project is also partially sponsored by the German Cluster of Excellence on "Multimodal Computing and Interaction" (MMCI) funded by the DFG, and the Deependance project funded by BMBF (01IW11003).
- Dan Flickinger, Valia Kordoni, and Yi Zhang. DeepBank: A Dynamically Annotated Treebank of the Wall Street Journal. In Proceedings of TLT-11, Lisbon, Portugal, 2012.
- Angelina Ivanova, Stephan Oepen, Lilja Øvrelid, and Dan Flickinger. Who did what to whom? A contrastive study of syntacto-semantic dependencies. In Proceedings of the Sixth Linguistic Annotation Workshop, pages 2–11, Jeju, Republic of Korea, 2012.
- Yi Zhang and Hans-Ulrich Krieger. Large-scale corpus-driven PCFG approximation of an HPSG. In Proceedings of the 12th International Conference on Parsing Technologies, pages 198–208, Dublin, Ireland, 2011.
- Yi Zhang and Valia Kordoni. Discriminant Ranking for Efficient Treebanking. In Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), Beijing, China, 2010.
- Valia Kordoni and Yi Zhang. Disambiguating Compound Nouns for a Dynamic HPSG Treebank of Wall Street Journal Texts. In Proceedings of the Seventh conference on International Language Resources and Evaluation (LREC'10), Malta, 2010.
- Valia Kordoni and Yi Zhang. Annotating Wall Street Journal Texts Using a Hand-Crafted Deep Linguistic Grammar. In Proceedings of the Third Linguistic Annotation Workshop, Singapore, 2009.