Dr. Christian M. Meyer


Robust Tokenization and POS-Tagging for Different Genres

Abstract. We present the system submitted by the AIPHES team to the EmpiriST shared task on “Automatic Linguistic Annotation of Computer-Mediated Communication / Social Media”. The system combines a rule-based tokenizer with a machine-learning sequence-labelling POS tagger that uses a variety of features. We show that it is robust across the two tested genres: German computer-mediated communication (CMC) and general German web data (WEB). It ranks second in three of the four scenarios. The presented systems are also freely available as open-source components.

Submitted: 15.05.2016 | Published: 12.08.2016