This repo contains a simple Scrapy project for scraping exhibit- and gallery-level data from exploratorium.edu, the official website of the Exploratorium. This project is Part 1 of Explorer AI.
The spiders exhibits.py and galleries.py, defined in exploratorium/spiders/, scrape the following raw text data from the landing pages for the museum's exhibits and galleries, respectively.
- Each item in the exhibit-level data (i.e. each exhibit) has the following fields:
  - id: the id of the exhibit (taken from the URL slug)
  - title: the title of the exhibit
  - tagline: a (catchy) short description
  - description: a brief description of the exhibit
  - location: the (current) location of the exhibit (e.g. gallery) inside the museum, or a note that it is not currently on view
  - byline: the information contained in the line beginning "Exhibit developer(s):"
  - whats_going_on: one of the more common headings in the exhibit's "about" section
  - going_further: the other common heading
  - details: the text contents of the "about" section whenever it does not contain the previous two features
  - phenomena: a list of phenomena which are illustrated by the exhibit
  - keywords: a list of keywords for the exhibit
  - collections: a list of collections (groupings of exhibits, based on some theme) that the exhibit belongs to
  - aliases: other names that the exhibit might go by
  - collection_id: a list of ids for collections to which the exhibit belongs
  - related_exhibit_id: a list of ids for related exhibits
- Each item in the gallery-level data (i.e. each gallery) has the following fields:
  - id: the id of the gallery (taken from the URL slug)
  - title: the title of the gallery
  - tagline: a catchy short description
  - description: a brief description of the gallery
  - curator_url: a link to the curators' statement
  - curator_statement: the curators' names and their statement about the gallery
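
As a rough illustration of the field lists above, the sketch below shows one way an id could be derived from a URL slug and how a scraped item could be checked against the expected field names. This is not the code used by the spiders: the helper names `slug_from_url` and `missing_fields` are hypothetical, and treating the slug as the last path segment of the URL is an assumption.

```python
from typing import Dict, List
from urllib.parse import urlparse

# Field names copied from the lists above.
EXHIBIT_FIELDS = [
    "id", "title", "tagline", "description", "location", "byline",
    "whats_going_on", "going_further", "details", "phenomena", "keywords",
    "collections", "aliases", "collection_id", "related_exhibit_id",
]
GALLERY_FIELDS = [
    "id", "title", "tagline", "description", "curator_url", "curator_statement",
]


def slug_from_url(url: str) -> str:
    """Return the last non-empty path segment of a URL.

    Assumption: the slug is the final path segment; the real spiders may
    extract it differently.
    """
    segments = [s for s in urlparse(url).path.split("/") if s]
    return segments[-1] if segments else ""


def missing_fields(item: Dict[str, object], fields: List[str]) -> List[str]:
    """List the expected field names that are absent from a scraped item."""
    return [name for name in fields if name not in item]
```

For example, calling `slug_from_url` on a (made-up) URL ending in `/exhibits/example-exhibit/` returns `example-exhibit`, and `missing_fields(item, GALLERY_FIELDS)` reports which gallery fields an item lacks.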
The JSON files containing this data are located in data/text.
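
A minimal sketch of reading one of these files back into Python is given below. It assumes each file holds a single top-level JSON array of items (the layout Scrapy's built-in JSON feed export produces); if the files are stored differently, the loader would need adjusting. The helper names `load_items` and `index_by_id` are hypothetical.

```python
import json
from pathlib import Path
from typing import Dict, List


def load_items(path) -> List[dict]:
    """Load scraped items from a JSON file.

    Assumption: the file contains one top-level JSON array of objects.
    """
    with Path(path).open(encoding="utf-8") as f:
        items = json.load(f)
    if not isinstance(items, list):
        raise ValueError(f"expected a JSON array in {path}")
    return items


def index_by_id(items: List[dict]) -> Dict[str, dict]:
    """Build a lookup table from each item's id field to the item itself."""
    return {item["id"]: item for item in items}
```

With a file such as data/text/exhibits.json (the exact file names are not stated here, so this one is a guess), `index_by_id(load_items("data/text/exhibits.json"))` would give a dictionary that makes it easy to follow the related_exhibit_id and collection_id references.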
In a future update, this project will also include an image-scraping framework for the same website.