Annotated Privacy Policies of 100 Online Platforms

Published: 25 September 2023 | Version 1 | DOI: 10.17632/pcgvm6zh43.1


The dataset contains information derived from 98 annotated privacy policies covering 100 online platforms.* The hypothesis behind the study was that privacy policies do not contain enough information for consumers to fully understand exactly what personal data the platforms collect and exactly how it is used. To test this hypothesis, two annotators, working independently, read the privacy policies in search of three types of occurrences: (1) general terms describing the categories of data collected ("GenData"); (2) general terms describing the purposes for which personal data is used ("GenUse"); and (3) the no-distinction structure of a privacy policy, in which the document first lists the categories of data collected and then enumerates the purposes of use, without explaining which personal data is used for which purpose. The hypothesis was confirmed: in the analyzed sample, all 98 privacy policies featured at least one instance of GenData, 97 out of 98 featured at least one instance of GenUse, and 89 out of 98 had a no-distinction structure.
The sample contains 98 privacy policies of 100* digital platforms operating in sixteen market sectors: Cloud storage, Communication, Dating, Finance, Food, Gaming, Health, Music, Shopping, Social, Sports, Transportation, Travel, Video, Work and Various. The selected companies are headquartered in four legal environments: the US, the EU, Poland specifically, and Other jurisdictions. The chosen platforms include both privately held and publicly listed companies, and offer both fee-based and free services.
The dataset consists of: (a) two spreadsheets, "PP_table Tagger1.xlsx" and "PP_table Tagger2.xlsx," each containing the evaluative variables assigned and examples of the clauses on which the judgments were based; (b) two folders, "Tagger 1" and "Tagger 2," each containing 98 PDF files with the privacy policies analyzed, together with annotations made in the form of comments; (c) one text file, "Instruction," explaining the logic behind the tagging. The reuse potential of the data is significant. It can be useful for empirical researchers interested in the dynamics of the data collection processes of online platforms, and for normative scholars (such as lawyers or political philosophers) interested in critiquing the status quo and proposing ideas for reform. It can also be useful for non-academics, such as governments interested in assessing the efficacy of their regulations, or businesses interested in avoiding the common pitfalls of privacy policy drafting.
* Apple and iCloud, as well as Google and YouTube, had the same privacy policy on the day of raw data collection, i.e. March 13, 2022.
ACKNOWLEDGEMENT: The research leading to these results has received funding from the Norwegian Financial Mechanism 2014-2021, project no. 2020/37/K/HS5/02769, titled “Private Law of Data: Concepts, Practices, Principles & Politics.”


Steps to reproduce

The documents were retrieved from the publicly accessible websites of the respective online platforms on March 13, 2022, from within Poland (European Union). The URLs are listed in the "PP_table Tagger 1" and "PP_table Tagger 2" spreadsheets (the URLs are identical), and all the raw data is enclosed in the "folders" folder (the PDFs have not been altered other than by adding the annotations). Each document was subsequently annotated independently by two researchers, based on the enclosed instruction. The instruction was prepared by the PI, with the help of the team. The annotators subsequently ran consistency checks. The evaluative variables were brought to consistency, whereas the sample clauses may differ between the two taggers. All the values in the spreadsheets can be verified against the annotations in the corresponding PDF files.
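The consistency check described above can be sketched in a few lines of Python. This is only an illustration, not the procedure the annotators actually used: the variable names ("GenData", "GenUse", "NoDistinction"), the boolean encoding, and the platform names are assumptions, and the two small tables are toy placeholders rather than values from the dataset.

```python
from typing import Dict, List, Tuple

# Hypothetical variable names; the real spreadsheet columns may be named
# and encoded differently.
VARIABLES = ("GenData", "GenUse", "NoDistinction")

Row = Dict[str, bool]
Table = Dict[str, Row]  # platform name -> evaluative variables


def find_disagreements(t1: Table, t2: Table) -> List[Tuple[str, str, object, object]]:
    """Return (platform, variable, value1, value2) for every mismatch."""
    mismatches = []
    for platform in sorted(set(t1) | set(t2)):
        row1 = t1.get(platform, {})
        row2 = t2.get(platform, {})
        for var in VARIABLES:
            v1, v2 = row1.get(var), row2.get(var)
            if v1 != v2:
                mismatches.append((platform, var, v1, v2))
    return mismatches


def count_positive(table: Table, variable: str) -> int:
    """Count the policies for which the given variable is marked True."""
    return sum(1 for row in table.values() if row.get(variable) is True)


# Toy placeholder data (NOT taken from the dataset).
tagger1: Table = {
    "Platform A": {"GenData": True, "GenUse": True, "NoDistinction": True},
    "Platform B": {"GenData": True, "GenUse": False, "NoDistinction": True},
}
tagger2: Table = {
    "Platform A": {"GenData": True, "GenUse": True, "NoDistinction": True},
    "Platform B": {"GenData": True, "GenUse": True, "NoDistinction": True},
}

for platform, var, v1, v2 in find_disagreements(tagger1, tagger2):
    print(f"{platform}: {var} differs (Tagger 1: {v1}, Tagger 2: {v2})")
```

To run the same comparison on the real spreadsheets, the two tables would first have to be loaded (for example with `pandas.read_excel`) and mapped to this structure using the actual column names. Because the published evaluative variables were brought to consistency, such a check would be expected to report no disagreements for them.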


Uniwersytet Jagiellonski w Krakowie


Law, Computer Security and Privacy, Information Privacy


Norway Grants