Indonesian News Corpus
Description
This corpus contains 150,466 news articles derived from several freely accessible Indonesian news websites. The corpus is intended for research purposes only.

The news websites are:
• kompas.com is a registered trademark of PT. Kompas Cyber Media. https://inside.kompas.com/about-us
• tempo.co is a registered trademark of PT INFO MEDIA DIGITAL. https://www.tempo.co/about
• merdeka.com is a registered trademark of PT KAPAN LAGI DOT COM NETWORKS. https://www.merdeka.com/company/tentang-kami.html
• republika.co.id is a registered trademark of PT Republika Media Mandiri. https://www.republika.co.id/page/about
• viva.co.id is a registered trademark of PT. Viva Media Baru. https://www.viva.co.id/tentang-kami
• tribunnews.com is a registered trademark of PT Tribun Digital Online. http://www.tribunnews.com/about-us

The corpus is part of the bachelor thesis work of Aad Miqdad Muadz Muzad, under the supervision of Faisal Rahutomo. We crawled several categories of these websites for six months, from July 2015 through December 2015.
Files
Steps to reproduce
Please read the "README FIRST" file before anything else, then access the JSON or XML files.
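As a first look at one of the JSON files, a minimal Python sketch such as the following may help. It makes no claim about the corpus schema, which is documented in the "README FIRST" file: it only reports whether the top level of a file is a list or an object and which keys appear. The function name `summarize_json` is hypothetical and introduced here for illustration only.

```python
import json
from pathlib import Path


def summarize_json(path):
    """Load one JSON file and describe its top-level structure.

    Hypothetical helper: it assumes nothing about the corpus field names
    and only reports what it finds in the file.
    """
    with Path(path).open(encoding="utf-8") as handle:
        data = json.load(handle)

    summary = {"type": type(data).__name__}
    if isinstance(data, list):
        summary["records"] = len(data)
        # Collect every key used by any dict-shaped record in the list.
        keys = set()
        for record in data:
            if isinstance(record, dict):
                keys.update(record.keys())
        summary["keys"] = sorted(keys)
    elif isinstance(data, dict):
        summary["keys"] = sorted(data.keys())
    return summary
```

Calling `summarize_json("some_file.json")`, where the file name is a placeholder for a path to one of the downloaded JSON files, returns a small dictionary describing that file. Comparing its output against the README is a quick way to confirm the download is intact before any heavier processing.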