Research scientist, Berlin-Brandenburg Academy of Sciences
Center for Lexicography of German
→ Notably in charge of contemporary and web text collections
- (Web) corpus construction and exploitation, from crawling/OCR to visualization
- Corpus and computational linguistics with emphasis on non-standard data
For more information, see my research blog and the software I have released under open-source licenses.
For more information see the archives or my presentations on SlideShare.
- Co-organizer of conferences (KONVENS 2018) and workshops, recently: Challenges in the Management of Large Corpora (CMLC) & 12th Web as Corpus Workshop (WAC-XII)
- Reviewer for conferences (notably ACL, CMC-Corpora, Computational Humanities Research, Digital Humanities, EACL, EMNLP, KONVENS, SwissText); volume chapters (e.g. proofreader profile for Language Science Press); journals (Journal of Open Humanities Data, Language Resources and Evaluation); and workshops (CPSS, SOCAI).
- Editor (2017-2021) of the Journal for Language Technology and Computational Linguistics (JLCL) and member of the executive board of the German Society for Computational Linguistics & Language Technology (GSCL)
- Director (2011-2013) of ENthèSe (association of doctoral candidates)
See also the comprehensive publication list on the HAL archive.
Trafilatura: A Web Scraping Library and Command-Line Tool for Text Discovery and Extraction
Proceedings of ACL/IJCNLP 2021: System Demonstrations, pp. 122-131, 2021.
[PDF] [Code] [Project]
Out-of-the-Box and Into the Ditch? Multilingual Evaluation of Generic Text Extraction Tools
Adrien Barbaresi, Gaël Lejeune
Language Resources and Evaluation Conference (LREC 2020), Proceedings of the 12th Web as Corpus Workshop (WAC-XII), pp. 5-13, 2020.
[PDF] [Code] [Project]
A corpus of German political speeches from the 21st century
11th Language Resources and Evaluation Conference (LREC 2018), pp. 792-797, 2018.
A Constellation and a Rhizome: Two Studies on Toponyms in Literary Texts
Visualisierung sprachlicher Daten: Visual Linguistics – Praxis – Tools, N. Bubenhofer & M. Kupietz (eds.), Heidelberg University Publishing, pp. 167-184, 2018.
- CLARIN-D and German Text Archive (DTA) projects at the BBAW
- Research associate at the Austrian Academy of Sciences (Academy Corpora group)
- COW at the FU Berlin
- Corpus Linguistics and Instrumented Text Databases team at ICAR lab