HealthAid: Extracting domain targeted high precision procedural knowledge from online communities.
HealthAidKB is a knowledge base produced by an automatic pipeline that extracts and clusters common procedural knowledge in the health domain. Our goal is to construct a domain-targeted, high-precision procedural knowledge base containing task frames. We developed a pipeline of methods that leverages Open IE to extract procedural knowledge from online communities. In addition, we devised a mechanism to canonicalize the task frames into clusters based on the similarity of the problems they are intended to solve. An evaluation by human domain experts shows that the resulting knowledge base has high precision. The procedural knowledge was extracted from the Health category of wikiHow (https://www.wikihow.com/Category:Health) and from How To Cure (https://howtocure.com/).
Steps to reproduce
Regenerating the knowledge base requires going through each step of the pipeline (see the paper). Start by crawling the wikiHow and How To Cure articles listed in the URL file (ID-URL.csv), using the attached Python script (crawler.py). The crawler takes two inputs: the URL file (ID-URL.csv) and the category file (ID-SCategory). Crawling may take several hours; it runs faster with parallel processing or in the Google Colab environment. We used the latter option, and the link to the script for running on Google Colab is shared with the script. The crawled data is then passed to Open IE, which extracts Location and Participating Objects information from the title of each task. The input to Open IE is the title of each task frame. Instructions for running Open IE can be found at the shared link in the references section.
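The parallel crawling option mentioned above can be illustrated with a minimal sketch. This is not the attached crawler.py; it only shows one way to fetch many URLs concurrently with the Python standard library. It assumes that ID-URL.csv holds one article per row with the ID in the first column and the URL in the second; the function names read_id_url_rows, fetch_html, and crawl are hypothetical and chosen for illustration only.

```python
import csv
import io
from concurrent.futures import ThreadPoolExecutor
from urllib.request import Request, urlopen


def read_id_url_rows(csv_text):
    """Parse (id, url) pairs from CSV text.

    Assumption: column 0 is the article ID and column 1 is the URL.
    Rows whose second column is not an http(s) URL (for example a
    header row) are skipped.
    """
    rows = []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) >= 2 and row[1].strip().startswith("http"):
            rows.append((row[0].strip(), row[1].strip()))
    return rows


def fetch_html(url, timeout=30):
    """Download one page and return its HTML as text."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(rows, fetch=fetch_html, max_workers=8):
    """Fetch all pages concurrently; return {article_id: html or None}.

    A failed download is recorded as None so that one broken URL
    does not abort a crawl that may run for hours.
    """
    def fetch_one(pair):
        article_id, url = pair
        try:
            return article_id, fetch(url)
        except Exception:
            return article_id, None

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(fetch_one, rows))
```

To use the sketch, read the contents of ID-URL.csv into a string, pass it to read_id_url_rows, and hand the resulting list to crawl. Threads suit this workload because the time is spent waiting on the network; keep max_workers modest to avoid overloading the source sites.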