
You can look up my articles in which I tried to understand what makes a particular song more likable than others. Now, I want to focus on the process of getting quantitative data from the Web.

Scraping is the process of extracting data from websites. Although the recommended method is to use an API, not all sites offer one. Spotify has an extensive API, but it doesn't provide a way to get the most popular songs. Fortunately, there is a website called Spotify Charts which does give this data: for each country and each day, it lists the 200 most streamed tracks. I don't really know how it gets that data, but nevertheless, it is there. The problem is that this site doesn't have an API, so we need to do some scraping to get the data. Moreover, the most recent data in Spotify Charts is from two days ago.

We will use Airflow for scheduling and monitoring, Google Cloud Storage as a data lake, and BigQuery as a data warehouse. The pipeline is: check whether the data for two days ago is already in BigQuery > get the data from Spotify Charts > write it as a JSON file in Google Cloud Storage > transform the JSON into a CSV > load the data from the CSV into BigQuery. The scraping itself is done with the library Beautiful Soup.
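To make the scraping step concrete, here is a minimal sketch of how one country/day chart could be fetched with requests and Beautiful Soup. The URL pattern and the CSS selectors are assumptions about how Spotify Charts renders its table, used only for illustration; the real markup may differ and can change over time.

```python
# Illustrative scraping sketch: fetch one country/day chart page and parse the
# 200-row table. The URL pattern and CSS class names below are assumptions
# about the Spotify Charts markup, not guaranteed to match the live site.
import requests
from bs4 import BeautifulSoup


def scrape_daily_chart(country: str, date: str) -> list:
    """Return a list of dicts for the most streamed tracks in `country` on `date` (YYYY-MM-DD)."""
    url = f"https://spotifycharts.com/regional/{country}/daily/{date}"  # assumed URL pattern
    response = requests.get(url, timeout=30)
    response.raise_for_status()

    soup = BeautifulSoup(response.text, "html.parser")
    rows = soup.select("table.chart-table tbody tr")  # assumed table structure

    tracks = []
    for row in rows:
        tracks.append({
            "position": row.select_one("td.chart-table-position").get_text(strip=True),
            "track": row.select_one("td.chart-table-track strong").get_text(strip=True),
            # drop the leading "by " from the artist cell (removeprefix needs Python 3.9+)
            "artist": row.select_one("td.chart-table-track span").get_text(strip=True).removeprefix("by "),
            "streams": row.select_one("td.chart-table-streams").get_text(strip=True).replace(",", ""),
            "country": country,
            "date": date,
        })
    return tracks
```

A run of the pipeline would call something like this once per country and date, then persist the result as the JSON file mentioned above before transforming it into a CSV.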

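And here is a rough sketch, under the same caveats, of how the five steps could be wired together in an Airflow DAG. The step functions are hypothetical placeholders standing in for the logic described above, not code from the original project.

```python
# Illustrative Airflow DAG skeleton for the pipeline: skip the run if the target
# date is already in BigQuery, otherwise scrape -> GCS JSON -> CSV -> BigQuery.
# The step functions are hypothetical placeholders, not the project's real code.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator, ShortCircuitOperator


# Placeholder step implementations; the real versions would hold the scraping,
# Google Cloud Storage and BigQuery logic described in the text.
def date_not_in_bigquery(date):        # returns True when the date is missing, so the run continues
    ...

def scrape_charts_to_gcs_json(date):   # Spotify Charts -> JSON file in GCS
    ...

def transform_json_to_csv(date):       # JSON in GCS -> CSV in GCS
    ...

def load_csv_into_bigquery(date):      # CSV in GCS -> BigQuery table
    ...


with DAG(
    dag_id="spotify_charts_pipeline",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args={"retries": 1, "retry_delay": timedelta(minutes=5)},
) as dag:
    # Spotify Charts lags by two days, so each run targets execution_date - 2 days.
    target_date = "{{ macros.ds_add(ds, -2) }}"

    check = ShortCircuitOperator(
        task_id="check_if_already_in_bigquery",
        python_callable=date_not_in_bigquery,
        op_kwargs={"date": target_date},
    )
    scrape = PythonOperator(
        task_id="scrape_charts_to_gcs_json",
        python_callable=scrape_charts_to_gcs_json,
        op_kwargs={"date": target_date},
    )
    to_csv = PythonOperator(
        task_id="transform_json_to_csv",
        python_callable=transform_json_to_csv,
        op_kwargs={"date": target_date},
    )
    load = PythonOperator(
        task_id="load_csv_into_bigquery",
        python_callable=load_csv_into_bigquery,
        op_kwargs={"date": target_date},
    )

    check >> scrape >> to_csv >> load
```

ShortCircuitOperator skips everything downstream when its callable returns a falsy value, which matches the "check if the data is already in BigQuery" step at the front of the pipeline.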