This dataset contains 4 files: 1. A .csv file ("pygothamCleanDataset.csv") containing 29,105 citation-bearing sentences from CC-BY papers. 2. A Databricks Community Edition notebook, in .dbc format, for processing and exploring the data. 3. An HTML export of the same Databricks Community Edition notebook, for viewing. 4. The PyGotham slides in PDF format.
Steps to reproduce
Before running the notebook, update all file paths in it to match your own environment. An archived copy of the notebook with all output is available at: https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/2644196477475309/2247597868200546/3108286398802724/latest.html
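A quick way to check that an updated path points at the right file is to count its records before running the notebook. The following is a minimal sketch, not part of the published notebook: the path in CSV_PATH, the helper name count_data_rows, the UTF-8 encoding and the presence of a single header row are all assumptions to adjust for your copy of the file.

```python
import csv

# Hypothetical location: replace with wherever you saved or uploaded the file.
CSV_PATH = "pygothamCleanDataset.csv"


def count_data_rows(path, has_header=True):
    """Count the records in a CSV file, optionally skipping one header row.

    csv.reader is used instead of counting lines so that quoted fields
    containing commas or line breaks are still counted as one record.
    """
    # UTF-8 is an assumption; change the encoding if your copy differs.
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        if has_header:
            next(reader, None)  # discard the header row, if the file has one
        return sum(1 for _ in reader)
```

Calling count_data_rows(CSV_PATH) should return a count close to the 29,105 sentences stated above if the file stores one sentence per record; a very different number suggests the path or the header assumption is wrong.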