Privacy Label Webcrawler | Yuvanshu_Agarwal


Web Crawler For Collecting Metadata Attributes Of App Privacy Labels

Location: Carnegie Mellon Chimps Lab

Role: Research Assistant

Date: January 2023 - May 2023

A Python web crawler, deployed on Google Cloud, that systematically visits millions of app listings on the Google Play Store and collects metadata on each app's privacy label.

Why are app privacy labels important?

App privacy labels tell consumers what data their apps collect and how that data is used. As data security and privacy concerns grow, such labels support transparency in the app development community and discourage malicious or invasive apps that use data without the consumer's knowledge.

Using privacy label metadata such as the app name, category, data collection labels (location, app activity, personal info...), and more, Carnegie Mellon researchers are looking for trends in how privacy labels have evolved and how they are implemented.

How do I collect app privacy label metadata?

Information for every Android app is displayed on the Google Play Store website. To scrape only the relevant metadata attributes, my web crawling algorithm must parse complex static and dynamic HTML objects on each app's page.

Sometimes, parsing can be made simpler through the use of libraries like Beautiful Soup. Other times, attributes stored in more convoluted containers require the algorithm to iteratively search for patterns within HTML strings.
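As a rough illustration of the second case, the sketch below searches a raw HTML string for patterns using only Python's standard library. It is not the lab's code (which is proprietary): the sample markup, the `itemprop="name"` attribute, the `var data = {...};` script format, and the function name `extract_metadata` are all assumptions made for the example, and the real Google Play markup differs.

```python
import json
import re

# Hypothetical page fragment for illustration only; real Play Store markup differs.
SAMPLE_HTML = """
<h1 itemprop="name">Example App</h1>
<script>var data = {"category": "Tools", "collected": ["Location", "App activity"]};</script>
"""

# Non-greedy match for the text inside an <h1 itemprop="name"> element.
NAME_RE = re.compile(r'<h1[^>]*itemprop="name"[^>]*>(.*?)</h1>', re.DOTALL)
# Match the JSON object assigned to `data` inside a <script> tag (assumed format).
# The non-greedy match stops at the first "};", so it only suits flat objects
# like the sample above.
DATA_RE = re.compile(r'var data = (\{.*?\});', re.DOTALL)


def extract_metadata(html: str) -> dict:
    """Pull the app name and embedded label data out of a raw HTML string."""
    name_match = NAME_RE.search(html)
    data_match = DATA_RE.search(html)
    # Fall back to an empty dict when the embedded object is missing.
    embedded = json.loads(data_match.group(1)) if data_match else {}
    return {
        "name": name_match.group(1).strip() if name_match else None,
        "category": embedded.get("category"),
        "data_collected": embedded.get("collected", []),
    }


if __name__ == "__main__":
    print(extract_metadata(SAMPLE_HTML))
```

When the attribute sits in a well-formed tag, a parser such as Beautiful Soup is usually the more robust choice; pattern matching like this is a fallback for data buried inside script blocks or other containers that a parser does not expose directly.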

How do I collect data for millions of apps?

I deployed and ran my algorithm using Google Cloud's cloud computing services. Google Cloud provided the compute to process metadata quickly and the storage to hold large volumes of data. Because the platform scales flexibly, I could repeatedly test and improve my web crawler on a few apps before deploying it on millions.
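The "few apps first, then millions" workflow can be sketched in plain Python, independent of any cloud API. The names `batched`, `crawl`, `fetch`, and `limit` are hypothetical, and the batching scheme is an assumption about how such a crawler might be organized, not a description of the actual deployment.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, Optional


def batched(app_ids: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield successive lists of at most batch_size app IDs."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(app_ids)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def crawl(
    app_ids: Iterable[str],
    fetch: Callable[[str], dict],
    batch_size: int = 100,
    limit: Optional[int] = None,
) -> List[dict]:
    """Run fetch over app IDs in batches; limit restricts the run to a small sample."""
    ids = islice(app_ids, limit) if limit is not None else app_ids
    results: List[dict] = []
    for batch in batched(ids, batch_size):
        # In a real deployment each batch could be handed to a separate worker.
        results.extend(fetch(app_id) for app_id in batch)
    return results
```

Calling `crawl(ids, fetch, limit=10)` exercises the parser on a handful of apps; dropping `limit` runs the same code path over the full list.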

Takeaways from my experience at Carnegie Mellon Chimps Lab?

At Carnegie Mellon Chimps Lab, I gained first-hand experience deploying and running experiments on cloud computing platforms like Google Cloud. I also learned more about how websites are built and how to scrape information stored in dynamic and static HTML elements. In designing my own experiments, I thought critically about which types of metadata could yield useful insights about Google Play Store app privacy labels.

Please note that all code is proprietary to Carnegie Mellon Chimps Lab and is therefore not publicly accessible.
