Models for tracking updates

The raw-data app also keeps track of each snapshot of the CAL-ACCESS database released by the California Secretary of State, including its release date and byte size, as well as the activity of the management commands that process this data.

This tracking information is stored in the data tables outlined below.

Note

By default, the raw-data app does not archive previous versions of the CAL-ACCESS database. Rather, with each call to the management commands, the data files they process are overwritten.

You can configure the raw-data app to keep each copy of the zip file downloaded from the California Secretary of State, as well as the individual raw .tsv files and cleaned .csv files, by setting CALACCESS_STORE_ARCHIVE to True in settings.py:

# in settings.py
CALACCESS_STORE_ARCHIVE = True

By default, the older copies of these files will be saved to the path specified by your Django project’s MEDIA_ROOT setting (described in the Django settings documentation). However, if you’ve implemented a custom storage system or installed a third-party app (such as django-storages), that should work too.
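As a rough illustration, a settings.py fragment that combines the archive flag with a local media directory might look like the following. The MEDIA_ROOT path shown here is a hypothetical placeholder chosen for this sketch, not a value the raw-data app requires.

```python
# in settings.py (a sketch; the MEDIA_ROOT path below is a hypothetical placeholder)

# Keep each downloaded zip plus the individual raw and cleaned data files
CALACCESS_STORE_ARCHIVE = True

# With the default file storage, archived copies are saved under this directory
MEDIA_ROOT = "/var/www/example/media"
```

If your project already defines MEDIA_ROOT, only the CALACCESS_STORE_ARCHIVE line should be needed.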


RawDataVersion

Versions of CAL-ACCESS raw source data, typically released every day.

Fields

Name Type Unique key Definition
id Integer Yes Auto-incrementing unique identifier of the version
release_datetime DateTime No (Unique) date and time the version of the CAL-ACCESS database was released (value of Last-Modified field in HTTP response header)
expected_size Integer No The expected size of the downloaded CAL-ACCESS zip, as specified in the content-length field in HTTP response header
update_start_datetime DateTime No Date and time when the update to the CAL-ACCESS version started
update_finish_datetime DateTime No Date and time when the update to the CAL-ACCESS version finished
download_start_datetime DateTime No Date and time when the download of the CAL-ACCESS database export started
download_finish_datetime DateTime No Date and time when the download of the CAL-ACCESS database export finished
extract_start_datetime DateTime No Date and time when extraction of the CAL-ACCESS data files started
extract_finish_datetime DateTime No Date and time when extraction of the CAL-ACCESS data files finished
download_zip_archive FileField No An archive of the original zipped file downloaded from CAL-ACCESS
clean_zip_archive FileField No An archive zip of cleaned (and error log) files
clean_zip_size Integer No The size of the zip containing all cleaned raw data files and error logs
download_zip_size Integer No The actual size of the downloaded CAL-ACCESS zip after the download completed

Instance methods and properties

.download_completed Check if the download of the version's zip file completed. Return True or False.
.download_stalled Check if the download of the version's zip file started but did not complete. Return True or False.
.download_file_count Returns the count of files included in the version's downloaded zip.
.download_record_count Returns the count of records in the version's downloaded files.
.clean_file_count Returns the count of files cleaned in the version.
.clean_record_count Returns the count of records in the version's cleaned files.
.error_file_count Returns the count of cleaned files with errors in the version.
.error_count Returns the count of cleaning errors in the version.
.extract_completed Check if the extract of files from the downloaded zip completed. Return True or False.
.extract_stalled Check if the extract of files from the downloaded zip started but did not complete. Return True or False.
.update_completed Check if the database update to the version completed. Return True or False.
.update_stalled Check if the database update to the version started but did not complete. Return True or False.
.pretty_clean_size() Returns a prettified version (e.g., "725M") of the size of the zip of clean data files and error logs.
.pretty_download_size() Returns a prettified version (e.g., "725M") of the actual size of the downloaded zip.
.pretty_expected_size() Returns a prettified version (e.g., "725M") of the expected size of the downloaded zip.
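
To make the completed/stalled distinction concrete, here is a minimal plain-Python sketch of how such flags could be derived from a pair of start and finish timestamps. The StageTimestamps class is a hypothetical stand-in written for this illustration. It is not the RawDataVersion model, and the app's actual implementation may differ.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class StageTimestamps:
    """Hypothetical stand-in for one start/finish pair, e.g. download_*_datetime."""

    start: Optional[datetime] = None
    finish: Optional[datetime] = None

    @property
    def completed(self) -> bool:
        # Assumption: a stage counts as complete once its finish timestamp is set
        return self.finish is not None

    @property
    def stalled(self) -> bool:
        # Started but never finished
        return self.start is not None and self.finish is None


# Three illustrative states: never started, started only, started and finished
not_started = StageTimestamps()
in_progress = StageTimestamps(
    start=datetime(2016, 8, 15, 11, 20, 29, tzinfo=timezone.utc),
)
done = StageTimestamps(
    start=datetime(2016, 8, 15, 11, 20, 29, tzinfo=timezone.utc),
    finish=datetime(2016, 8, 15, 11, 45, 0, tzinfo=timezone.utc),
)
```

Under this reading, a stage that never started is neither completed nor stalled.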

Query set methods

.complete()

Filters down the QuerySet to return only versions that have a complete update.

$ python manage.py shell
>>> from calaccess_raw.models.tracking import RawDataVersion
>>> RawDataVersion.objects.complete()
<QuerySet [<RawDataVersion: 2016-08-15 11:20:29+00:00>, <RawDataVersion: 2016-08-11 11:20:24+00:00>, <RawDataVersion: 2016-08-09 11:20:49+00:00>, <RawDataVersion: 2016-08-05 11:20:27+00:00>, <RawDataVersion: 2016-08-04 11:20:28+00:00>, <RawDataVersion: 2016-07-31 11:20:29+00:00>, <RawDataVersion: 2016-07-30 11:20:42+00:00>, <RawDataVersion: 2016-07-29 11:20:30+00:00>, <RawDataVersion: 2016-07-28 11:20:30+00:00>, <RawDataVersion: 2016-07-26 11:20:28+00:00>, <RawDataVersion: 2016-07-22 11:20:30+00:00>, <RawDataVersion: 2016-07-05 11:20:30+00:00>, <RawDataVersion: 2016-07-04 11:20:30+00:00>, <RawDataVersion: 2016-06-28 11:20:28+00:00>, <RawDataVersion: 2016-06-14 11:20:49+00:00>, <RawDataVersion: 2016-06-10 11:20:26+00:00>, <RawDataVersion: 2016-06-08 11:20:29+00:00>, <RawDataVersion: 2016-05-27 11:20:28+00:00>, <RawDataVersion: 2016-05-21 15:35:11+00:00>, <RawDataVersion: 2016-05-20 13:59:57+00:00>, '...(remaining elements truncated)...']>
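
Conceptually, the filter keeps only versions whose update has finished. The plain-Python sketch below mimics that idea on a list of dictionaries. The dictionary layout, the complete() helper, and the assumption that "complete" means a non-empty update_finish_datetime are illustrative guesses, not taken from the app's source.

```python
from datetime import datetime, timezone

# Hypothetical stand-ins for RawDataVersion rows
versions = [
    {
        "release_datetime": datetime(2016, 8, 11, 11, 20, 24, tzinfo=timezone.utc),
        "update_finish_datetime": datetime(2016, 8, 11, 13, 0, 0, tzinfo=timezone.utc),
    },
    {
        "release_datetime": datetime(2016, 8, 15, 11, 20, 29, tzinfo=timezone.utc),
        "update_finish_datetime": datetime(2016, 8, 15, 13, 0, 0, tzinfo=timezone.utc),
    },
    {
        "release_datetime": datetime(2016, 8, 16, 11, 20, 31, tzinfo=timezone.utc),
        "update_finish_datetime": None,  # update started but never finished
    },
]


def complete(rows):
    """Keep only rows whose update finished (assumed meaning of 'complete')."""
    return [row for row in rows if row["update_finish_datetime"] is not None]


# Most recently released version that has a finished update
latest_complete = max(complete(versions), key=lambda row: row["release_datetime"])
```

Here the newest release is skipped because its update never finished, so latest_complete is the version released on 2016-08-15.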

RawDataFile

Data files included in the given version of the CAL-ACCESS raw source data.

Fields

Name Type Unique key Definition
id Integer Yes Auto-incrementing unique identifier of the file
file_name String (up to 100) No Name of the raw source data file without extension
download_records_count Integer No Count of records in the original file downloaded from CAL-ACCESS
clean_records_count Integer No Count of records in the cleaned file generated by calaccess_raw
load_records_count Integer No Count of records loaded from the cleaned file into calaccess_raw's data model
download_columns_count Integer No Count of columns in the original file downloaded from CAL-ACCESS
clean_columns_count Integer No Count of columns in the cleaned file generated by calaccess_raw
load_columns_count Integer No Count of columns on the loaded calaccess_raw data model
download_file_archive FileField No An archive of the original raw data file downloaded from CAL-ACCESS.
clean_file_archive FileField No An archive of the raw data file after being cleaned.
clean_file_size Integer No Size of the cleaned .CSV file
download_file_size Integer No Size of the downloaded .TSV file
error_log_archive FileField No An archive of the error log containing lines from the original download file that could not be parsed and are excluded from the cleaned file.
error_count Integer No Count of records in the original download that could not be parsed and are excluded from the cleaned file.
version_id Integer No Foreign key referencing the version of the raw source data in which the file was included.
clean_start_datetime DateTime No Date and time when the cleaning of the file started
clean_finish_datetime DateTime No Date and time when the cleaning of the file finished
load_start_datetime DateTime No Date and time when the loading of the file started
load_finish_datetime DateTime No Date and time when the loading of the file finished

Instance methods and properties

.model() Returns the RawDataFile's corresponding CalAccess database model object.
.pretty_clean_file_size Returns a prettified version (e.g., "725M") of the cleaned file's size.
.pretty_download_file_size Returns a prettified version (e.g., "725M") of the downloaded file's size.
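
For a sense of what a prettified size such as "725M" involves, the following sketch converts a byte count into a short human-readable string. The pretty_size helper is hypothetical and written for this example. The app's own formatting (units, rounding) may differ.

```python
def pretty_size(num_bytes):
    """Format a byte count as a short string using 1024-based units."""
    units = ["B", "K", "M", "G", "T"]
    size = float(num_bytes)
    for unit in units:
        # Stop at the first unit where the value fits, or at the largest unit
        if size < 1024 or unit == units[-1]:
            return f"{size:.0f}{unit}"
        size /= 1024


print(pretty_size(760217600))  # 725 * 1024 * 1024 bytes, prints "725M"
```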