Management commands¶
The scraped-data app includes the following commands for scraping campaign finance data from the CAL-ACCESS website.
As with any Django app management command, these can be invoked on the command line or called within your Python code.
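Calling a command from Python goes through Django's call_command function, which accepts command-line style flags as extra positional strings. The following is a minimal sketch, assuming Django is installed and your project's settings are already loaded (for example, inside manage.py shell). The wrapper name run_scraper is hypothetical and not part of the app:

```python
def run_scraper(*flags):
    """Run the scrapecalaccess management command from Python code.

    `flags` are command-line style strings such as "--flush", passed
    through unchanged to Django's argument parser.
    """
    # Imported inside the function so that importing this module does not
    # require Django settings to be configured yet.
    from django.core.management import call_command

    # Equivalent to: python manage.py scrapecalaccess <flags>
    call_command("scrapecalaccess", *flags)
```

For instance, run_scraper() runs the command with its defaults, and run_scraper("--cache-only") passes along one of the options documented below.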
Raw content downloaded from CAL-ACCESS is stored in .scraper_cache/, found in the directory specified by BASE_DIR in your Django project’s settings.
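To locate or inspect the cache from Python, the path can be derived from BASE_DIR. This is a small sketch with hypothetical helper names (scraper_cache_dir, cached_bytes). It makes no assumption about how files are laid out inside the cache and simply walks the directory recursively:

```python
from pathlib import Path


def scraper_cache_dir(base_dir):
    """Return the path of the .scraper_cache/ directory under base_dir."""
    return Path(base_dir) / ".scraper_cache"


def cached_bytes(base_dir):
    """Total size in bytes of all cached files (0 if the cache is absent)."""
    cache_dir = scraper_cache_dir(base_dir)
    if not cache_dir.is_dir():
        return 0
    # Sum the size of every regular file at any depth in the cache.
    return sum(p.stat().st_size for p in cache_dir.rglob("*") if p.is_file())
```

In a Django project you would pass settings.BASE_DIR (from django.conf) as base_dir.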
scrapecalaccess¶
This command runs the following management commands, in order:
scrapecalaccesspropositions
scrapecalaccesscandidates
scrapecalaccessincumbents
These commands are defined in more detail below.
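The ordering above can be pictured as a simple loop. This is an illustrative sketch only, not the app's actual implementation: the names SCRAPER_COMMANDS and run_in_order are hypothetical, and forwarding the same options to every sub-command is an assumption.

```python
# The three component commands, in the order listed above.
SCRAPER_COMMANDS = (
    "scrapecalaccesspropositions",
    "scrapecalaccesscandidates",
    "scrapecalaccessincumbents",
)


def run_in_order(runner, *flags):
    """Call runner(name, *flags) once per scraper command, in order.

    `runner` is any callable with the same calling convention as
    django.core.management.call_command.
    """
    for name in SCRAPER_COMMANDS:
        # Assumption: each sub-command receives the same flags.
        runner(name, *flags)
```

In a configured project, passing Django's call_command as runner would execute the three commands one after another.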
Examples¶
The default behavior of the scraper commands is to avoid excessive downloads. As such, a CAL-ACCESS web page’s content will only be downloaded if:
- The page’s content isn’t cached; or
- The byte size of the cached content differs from the size of the content on the server (as specified in the Content-Length header).
You can override this default behavior by invoking the --force-download option:
$ python manage.py scrapecalaccess --force-download
Alternatively, you can avoid making any network requests by invoking the --cache-only option so as to parse and store data only from previously cached content:
$ python manage.py scrapecalaccess --cache-only
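The download rules described above can be summarized as a single decision function. This is a hedged sketch rather than the scraper's real code: the function name should_download and its parameters are hypothetical, content_length stands for the integer value the caller has already read from the server's Content-Length header, and letting cache_only take precedence when both flags are set is an assumption.

```python
import os


def should_download(cache_path, content_length,
                    force_download=False, cache_only=False):
    """Decide whether a page should be fetched from the server."""
    if cache_only:
        # --cache-only: never touch the network (assumed to win over
        # --force-download if both are given).
        return False
    if force_download:
        # --force-download: fetch even if a cached copy exists.
        return True
    if not os.path.exists(cache_path):
        # Default rule 1: the page's content isn't cached.
        return True
    # Default rule 2: cached byte size differs from Content-Length.
    return os.path.getsize(cache_path) != content_length
```

Comparing byte sizes is a cheap heuristic: it avoids re-downloading unchanged pages, but it cannot detect a change that leaves the size identical, which is one reason to force a download.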
By default, data saved to your database from previous scrapes is preserved. To start over with empty data tables, invoke the --flush option:
$ python manage.py scrapecalaccess --flush
Options¶
usage: manage.py scrapecalaccess [-h] [--version] [-v {0,1,2,3}]
[--settings SETTINGS]
[--pythonpath PYTHONPATH] [--traceback]
[--no-color] [--flush] [--force-download]
[--cache-only]
Run all scraper commands
optional arguments:
-h, --help show this help message and exit
--version show program's version number and exit
-v {0,1,2,3}, --verbosity {0,1,2,3}
Verbosity level; 0=minimal output, 1=normal output,
2=verbose output, 3=very verbose output
--settings SETTINGS The Python path to a settings module, e.g.
"myproject.settings.main". If this isn't provided, the
DJANGO_SETTINGS_MODULE environment variable will be
used.
--pythonpath PYTHONPATH
A directory to add to the Python path, e.g.
"/home/djangoprojects/myproject".
--traceback Raise on CommandError exceptions
--no-color Don't colorize the command output.
--flush Flush database tables
--force-download Force the scraper to download URLs even if they are cached
--cache-only Skip the scraper's update checks. Use only cached
files.
scrapecalaccesscandidates¶
Scrape certified candidates for each election on the CAL-ACCESS site. A component of the scrapecalaccess command.
This command requests and parses content from the “certified” view of the Campaign/Candidates/list.aspx page (e.g., the 2016 General certified candidates). Data parsed from these pages are saved in the CandidateElection and Candidate models.
Options¶
usage: manage.py scrapecalaccesscandidates [-h] [--version] [-v {0,1,2,3}]
[--settings SETTINGS]
[--pythonpath PYTHONPATH]
[--traceback] [--no-color]
[--flush] [--force-download]
[--cache-only]
Scrape certified candidates for each election on the CAL-ACCESS site.
optional arguments:
-h, --help show this help message and exit
--version show program's version number and exit
-v {0,1,2,3}, --verbosity {0,1,2,3}
Verbosity level; 0=minimal output, 1=normal output,
2=verbose output, 3=very verbose output
--settings SETTINGS The Python path to a settings module, e.g.
"myproject.settings.main". If this isn't provided, the
DJANGO_SETTINGS_MODULE environment variable will be
used.
--pythonpath PYTHONPATH
A directory to add to the Python path, e.g.
"/home/djangoprojects/myproject".
--traceback Raise on CommandError exceptions
--no-color Don't colorize the command output.
--flush Flush database tables
--force-download Force the scraper to download URLs even if they are
cached
--cache-only Skip the scraper's update checks. Use only cached
files.
scrapecalaccesscandidatecommittees¶
Scrape each candidate’s committees from the CAL-ACCESS site.
This command requests and parses content from the “general” view of the Campaign/Candidates/Detail.aspx page for each candidate’s most recent “session” (e.g., Edward T. Gaines general information leading up to the 2016 General election). Data parsed from these pages are saved in the CandidateCommittee model.
Note
The scrapecalaccesscandidatecommittees command is not currently included in scrapecalaccess because of the number of CAL-ACCESS web pages it scrapes. This may change in the future.
Options¶
usage: manage.py scrapecalaccesscandidatecommittees [-h] [--version]
[-v {0,1,2,3}]
[--settings SETTINGS]
[--pythonpath PYTHONPATH]
[--traceback] [--no-color]
[--flush]
[--force-download]
[--cache-only]
Scrape each candidate's committees from the CAL-ACCESS site.
optional arguments:
-h, --help show this help message and exit
--version show program's version number and exit
-v {0,1,2,3}, --verbosity {0,1,2,3}
Verbosity level; 0=minimal output, 1=normal output,
2=verbose output, 3=very verbose output
--settings SETTINGS The Python path to a settings module, e.g.
"myproject.settings.main". If this isn't provided, the
DJANGO_SETTINGS_MODULE environment variable will be
used.
--pythonpath PYTHONPATH
A directory to add to the Python path, e.g.
"/home/djangoprojects/myproject".
--traceback Raise on CommandError exceptions
--no-color Don't colorize the command output.
--flush Flush database tables
--force-download Force the scraper to download URLs even if they are
cached
--cache-only Skip the scraper's update checks. Use only cached
files.
scrapecalaccessincumbents¶
Scrape the list of incumbent state officials for each election on the CAL-ACCESS site. A component of the scrapecalaccess command.
This command requests and parses content from the “incumbent” view of the Campaign/Candidates/list.aspx page (e.g., the 2017-2018 General incumbents). Data parsed from these pages are saved in the IncumbentElection and Incumbent models.
Options¶
usage: manage.py scrapecalaccessincumbents [-h] [--version] [-v {0,1,2,3}]
[--settings SETTINGS]
[--pythonpath PYTHONPATH]
[--traceback] [--no-color]
[--flush] [--force-download]
[--cache-only]
Scrape list of incumbent state officials for each election on CAL-ACCESS site.
optional arguments:
-h, --help show this help message and exit
--version show program's version number and exit
-v {0,1,2,3}, --verbosity {0,1,2,3}
Verbosity level; 0=minimal output, 1=normal output,
2=verbose output, 3=very verbose output
--settings SETTINGS The Python path to a settings module, e.g.
"myproject.settings.main". If this isn't provided, the
DJANGO_SETTINGS_MODULE environment variable will be
used.
--pythonpath PYTHONPATH
A directory to add to the Python path, e.g.
"/home/djangoprojects/myproject".
--traceback Raise on CommandError exceptions
--no-color Don't colorize the command output.
--flush Flush database tables
--force-download Force the scraper to download URLs even if they are
cached
--cache-only Skip the scraper's update checks. Use only cached
files.
scrapecalaccesspropositions¶
Scrape links between filers and propositions from the official CAL-ACCESS site. A component of the scrapecalaccess command.
This command requests and parses content from the Campaign/Measures/list.aspx page (e.g., the 2015-2016 propositions and ballot measures) and the “general” view of each proposition’s Campaign/Measures/Detail.aspx page (e.g., Prop 60’s general information). Data parsed from these pages are saved in the PropositionElection, Proposition and PropositionCommittee models.
Examples¶
$ python manage.py scrapecalaccesspropositions
Options¶
usage: manage.py scrapecalaccesspropositions [-h] [--version] [-v {0,1,2,3}]
[--settings SETTINGS]
[--pythonpath PYTHONPATH]
[--traceback] [--no-color]
[--flush] [--force-download]
[--cache-only]
Scrape links between filers and propositions from the official CAL-ACCESS
site.
optional arguments:
-h, --help show this help message and exit
--version show program's version number and exit
-v {0,1,2,3}, --verbosity {0,1,2,3}
Verbosity level; 0=minimal output, 1=normal output,
2=verbose output, 3=very verbose output
--settings SETTINGS The Python path to a settings module, e.g.
"myproject.settings.main". If this isn't provided, the
DJANGO_SETTINGS_MODULE environment variable will be
used.
--pythonpath PYTHONPATH
A directory to add to the Python path, e.g.
"/home/djangoprojects/myproject".
--traceback Raise on CommandError exceptions
--no-color Don't colorize the command output.
--flush Flush database tables
--force-download Force the scraper to download URLs even if they are
cached
--cache-only Skip the scraper's update checks. Use only cached
files.