The repository contains quickly written scripts designed to interact with the AVP Aviary audio/video repository software. As of Nov 2024, these scripts are only proof-of-concept (i.e., they return valid results but lack tests and error handling).
- Python 3
- Ability to work with proof-of-concept level software (e.g., exception handling is basic and involves reading stack traces)
Code changed after Oct 2024
- Python v3.12
- git clone the repository
Aviary is a SaaS vendor audio/video repository solution.
Terminology:
Collection
: a container holding resources
Resources
: the main container of the audio/video object (links together the metadata, media files, etc.)
Media
: the container representing an audio/video file and associated metadata; linked to the resource
Index
: the container representing indexes into the media and associated metadata; linked to a media item
Transcript
: the container representing a transcript of the audio/video; linked to a media item
Supplemental Files
: the container representing a supplemental file (JPEG, PDF, etc.) attached to the resource; linked to a resource item
erDiagram
Collection ||--o{ Resource : contains
Collection {}
Resource ||--o{ Media : contains
Resource ||--o{ "Supplemental files" : contains
Media ||--o{ Index : contains
Media ||--o{ Transcripts : contains
The above entity relationship diagram models the Aviary API as of November 2024. A more detailed diagram is available in the Aviary Documentation: content model.
The scripts leverage the Aviary API. To use the API as of November 2024, an API Key needs to be generated and passed to the sign-in API endpoint along with the organization ID. The details for authentication:
- Create API Key and store
- Pass the value to the following scripts via environment variables (or another method)
- export AVIARY_API_KEY=string_from_step_above
- export AVIARY_API_ORGANIZATION_ID=128
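A minimal sketch of how a script might read these two environment variables and fail early when one is missing. The helper name `load_aviary_credentials` is hypothetical and not taken from the repository; the sign-in request itself is not shown because its exact shape is not described here.

```python
import os


def load_aviary_credentials(environ=os.environ):
    """Return (api_key, organization_id) read from the environment.

    Hypothetical helper: only the two variable names come from the README.
    """
    names = ("AVIARY_API_KEY", "AVIARY_API_ORGANIZATION_ID")
    missing = [name for name in names if not environ.get(name)]
    if missing:
        # Fail with a readable message instead of a KeyError stack trace
        raise RuntimeError("Missing environment variable(s): " + ", ".join(missing))
    return environ["AVIARY_API_KEY"], environ["AVIARY_API_ORGANIZATION_ID"]
```

Passing a plain dictionary as `environ` makes the helper easy to exercise without touching the real environment.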
The main scripts:
- Download metadata [all | by specified collection | by specified resource] (includes attached files except for media files due to their special handling)
- aviary_api_report_2024-11-08.py
- Usage:
python3 aviary_api_report_2024-11-08.py --server ${SERVER_URL} --output /tmp/aviary_test/ --help
- includes ability to optionally limit to a single collection or resource (see --help output for details)
- stores in a hierarchical directory by collection id
- Request media files from single Aviary media item by ID (note the script's special handling of media files with restricted access controls)
More about the Aviary API in this link.
To check style:
pycodestyle --show-source --show-pep8 --ignore=E402,W504 --max-line-length=200 .
To run tests:
python3 tests/unit_tests.py
Note: March 2024 - the ./json directory contains the more recently used scripts (for the SpokenWeb export).
Notes: 2024-10-06
- Pagination: to be added to documentation on Monday
- Media associated with a Resource: limited to 10: bug fix Monday
- Fields available in UI and CSV export not available via the API: bug fix Monday
- Downloading file marked as "not downloadable": recommendation - use the "make downloadable for a period of time" option and then download. Sean: about 9 items, SILR that should be excluded from this approach
- Rate limiting: no specific rate is currently imposed; the recommendation is exponential back-off on error (no present requirement to use a rate limiter on the client side, e.g., a leaky bucket algorithm).
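The exponential back-off recommendation could be sketched as follows. This is an illustration only: `call_with_backoff` and its parameters are hypothetical names, and `request_fn` stands in for whatever function performs a single Aviary API request.

```python
import random
import time


def call_with_backoff(request_fn, max_attempts=5, base_delay=1.0,
                      max_delay=60.0, sleep=time.sleep):
    """Call request_fn(); on an exception, wait and retry with doubling delays."""
    for attempt in range(max_attempts):
        try:
            return request_fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            # Delay doubles on each failure: base, 2*base, 4*base, ... capped at max_delay
            delay = min(max_delay, base_delay * (2 ** attempt))
            # Small random jitter (up to 10%) so parallel scripts do not retry in lockstep
            sleep(delay + random.uniform(0, delay / 10))
```

Catching every exception is deliberately broad for a sketch; a real script would likely retry only on network errors or specific HTTP status codes.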
- Python 3
- Ability to work with proof-of-concept level software (e.g., exception handling is basic and involves reading stack traces)
- Elevated user privileges to access the Aviary API (i.e., default privileges on your campus computing id are insufficient)
Code before Oct. 2024
- Python < v3.12
- git clone the repository
- Install dependencies:
pip install -r requirements.txt --user
- Or python3 setup.py install --user
- installs required modules in a local user account
- Or without --user to install into the OS's central Python environment (requires administrative privileges)
Note: the vendor rate limits API requests. Each of the following scripts uses a simple wait mechanism between API requests. A more advanced approach could be implemented where the wait time is dynamically computed based on the response latency plus a retry mechanism. As of 2023-05-29, running multiple scripts will cause one to fail with the default wait settings.
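One way the "more advanced approach" mentioned above might look: the pause after each request scales with how long that request took. This is a sketch under assumptions; `next_wait`, `throttled_calls`, and the default values are hypothetical and not part of the existing scripts, and the retry part is left out.

```python
import time


def next_wait(latency_seconds, minimum_wait=1.0, latency_factor=2.0):
    """Wait at least minimum_wait, and longer when the server responds slowly."""
    return max(minimum_wait, latency_seconds * latency_factor)


def throttled_calls(request_fns, sleep=time.sleep, clock=time.monotonic):
    """Run each request function in turn, pausing after each based on its latency."""
    results = []
    for request_fn in request_fns:
        started = clock()
        results.append(request_fn())
        # Measure how long the request took and derive the pause from it
        sleep(next_wait(clock() - started))
    return results
```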
Note: pagination is not documented (as of 2023-05-29) so workarounds are needed for collections with more than 100 resources (e.g., use Web UI export to gain a list of IDs)
Note: as of 2023-06-29, a resource returned by the Aviary API lists the attached media item(s) in the media_file_id field. However, media_file_id only contains a maximum of 10 IDs (a significant number of resources have over 10 media attached). I also tried via the Web UI, "export Media Files(s) to CSV" but the resulting file doesn't contain media IDs. How to get the entire list of media items is unknown.
The main types of scripts (the details are in the following subsections):
- Request metadata about a single Aviary item by ID
- Upload a list of media items
- Request a CSV report of all items of a specified model. The file naming convention is aviary_api_report_[model]_[output (CSV)]_[how the IDs are discovered]. The offering includes:
- aviary_api_report_index_csv_by_media_list.py
- aviary_api_report_media_csv_by_media_list.py
- aviary_api_report_resources_csv_by_list.py
- aviary_api_report_supplemental_csv_by_list.py
- aviary_api_report_transcripts_csv_by_media_list.py
- JSON output (new 2023/2024)
- ./json/
- Experimental: working with the Aviary API in different ways including the use of the collection API to find resource lists without needing the Web UI export.
The details:
The following script authenticates against the Aviary API and prints out the metadata of a specified object
python3 aviary_api_get_by_id.py --server ${aviary_server_name} --id ${media_id} --type [c|r|m]
Where:
- 'c': collection resources (no pagination so max 100 returned -- as of 2023-05-27 no documentation on pagination nor obvious mechanism in API response header or content)
- 'r': resource (single)
- 'm': media
Note: for the reports using the CSV export file to gain a list of IDs, some fields are required (fields can be disabled via the UI and thus prevented from being added to the CSV export). The CSV export is required due to the aforementioned lack of pagination (as of May 2023)
Note: the updated_at field is unavailable via the web UI as of 2023-03-27.
For input, use the Web UI resource table option to export. This obtains a list of resource IDs (required due to lack of pagination in the Aviary API as of 2023-05-26).
- Navigate to /collection_resources
- Select Table Options --> Export Resources to CSV
- Use as the input in the following command
python3 aviary_api_report_resources_csv_by_list.py --server ${aviary_server_name} --output ${output_path} -input ${input_path}
Failed: an attempt to use the Aviary APIs /api/v1/collections and /api/v1/collections/{:collection_id}/resources to build a list of resources failed (2023-03-27) due to a limit of 100 resources returned per collection and no documentation on how to enable pagination.
python3 experimental/aviary_api_report_resources_csv.py --server ${aviary_server_name} --output ${output_path}
Or, for a JSON-like output (more for debugging):
python3 experimental/aviary_api_report_resources_json.py --server ${aviary_server_name} --output ${output_path}
For input, use the Web UI resource table option to export. This obtains a list of resource IDs (required due to lack of pagination in the Aviary API as of 2023-05-26).
- Navigate to /collection_resource_files (Media in the Web UI)
- Select Table Options --> Export Media file(s) to CSV
- Use as the input in the following command
python3 aviary_api_report_media_csv_by_media_list.py --server ${aviary_server_name} --output ${output_path} -input ${input_path}
Note: in the resource API response, when the media files count is >10, the media file IDs field displays a maximum of 10 IDs in the list (as of May 2023). An example is resource 58924.
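The truncation described in this note can be detected by comparing the length of the ID list with the reported count. A minimal sketch, assuming a resource is available as a dictionary with a `media_file_id` list and a `media_files_count` value; these key names are assumptions based on the field names mentioned in this README and may not match the API response exactly.

```python
def media_ids_truncated(resource):
    """Return True when fewer media IDs are listed than the resource reports.

    Assumed keys: "media_file_id" (list of IDs) and "media_files_count" (number).
    """
    listed = len(resource.get("media_file_id") or [])
    # If the count is absent, fall back to the listed length (nothing to compare)
    expected = int(resource.get("media_files_count", listed))
    return listed < expected
```

A report script could log the IDs of resources for which this returns True so the missing media can be looked up another way.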
Failed: an attempt to use the Aviary API /api/v1/collections and /api/v1/collections/{:collection_id}/resources to build a list of media failed (2023-03-27) due to a limit of 100 resources returned and no documentation on how to enable pagination (similar to the resource report).
python3 experimental/aviary_api_report_media_csv.py --server ${aviary_server_name} --output ${output_path}
Failed: an attempt to use the resource CSV export, instead of requiring the user to also export the media CSV, failed. The resource CSV export contains a media_file_ids field that contains a maximum of 10 items (no documentation nor easily identifiable means to increase). The experiment validates the list length versus the media_files_count.
- Navigate to /collection_resource
- Select Table Options --> Export resources to CSV
- Use as the input in the following command
python3 experimental/aviary_api_report_media_csv_by_list.py --server ${aviary_server_name} --output ${output_path} -input ${input_path}
Or JSON
For input, use the Web UI resource table option to export. This obtains a list of resource IDs (required due to lack of pagination in the Aviary API as of 2023-05-26).
- Navigate to /collection_resource_files (Media in the Web UI)
- Select Table Options --> Export Resources to CSV
- Use as the input in the following command
python3 aviary_api_report_transcripts_csv_by_media_list.py --server ${aviary_server_name} --output ${output_path} -input ${input_path}
Note: As of 2023-05-26, IDs for transcripts are discovered through the contents of the transcripts field in the media API response. There is no indication that the list is complete, and it may have a limit like the media file IDs field in the resources API response.
Todo: alter to remove the need for an input file of IDs once pagination is available and replace with /api/v1/collections and /api/v1/collections/{:collection_id}/resources to build a list of media.
Note: experimental/aviary_api_report_transcripts_csv_by_list.py uses the resource CSV export as input.
Note: As of 2023-05-26, the Aviary API does not support a direct HTTP GET API request to gather the metadata for the index type. The index report is built from the contents of the indexes field in the media API response. There is no indication that the list is complete, and it may have a limit like the media file IDs field in the resources API response.
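Gathering index IDs from the indexes field of media records could look like the sketch below. It assumes each media record is a dictionary shaped like the JSON reports used elsewhere in this README (a `data` object holding an `indexes` list whose items carry an `id`); the function name `collect_index_ids` is hypothetical.

```python
def collect_index_ids(media_records):
    """Return every index ID found under data.indexes[].id in the media records."""
    index_ids = []
    for record in media_records:
        # Tolerate records where "data" or "indexes" is missing or null
        indexes = (record.get("data") or {}).get("indexes") or []
        for index in indexes:
            index_ids.append(index["id"])
    return index_ids
```

Because the completeness of the indexes field is unverified, the result should be treated as a lower bound on the number of indexes.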
- Navigate to /collection_resource_files (Media in the Web UI)
- Select Table Options --> Export Resources to CSV
- Use as the input in the following command
python3 aviary_api_report_index_csv_by_media_list.py --server ${aviary_server_name} --output ${output_path} -input ${input_path}
Or JSON, either JSON metadata or an index download/export of the WebVTT.
Note: in the resource API response, when the media files count is >10, the media file IDs field displays a maximum of 10 IDs in the list (as of May 2023). An example is resource 58924.
Todo: alter to remove the need for an input file of IDs once pagination is available and replace with /api/v1/collections and /api/v1/collections/{:collection_id}/resources to build a list of media.
Note: experimental/aviary_api_report_index_csv_by_list.py uses the resource CSV export as input.
For input, use the Web UI resource table option to export. This obtains a list of resource IDs (required due to lack of pagination in the Aviary API as of 2023-05-26).
- Navigate to /supplemental_files
- Select Table Options --> Export Supplemental Files(s) to CSV
python3 aviary_api_report_supplemental_files_csv_by_list.py --server ${aviary_server_name} --output ${output_path} -input ${input_path}
Todo: alter to remove the need for an input file of IDs once pagination is available and replace with /api/v1/collections and /api/v1/collections/{:collection_id}/resources to build a list of media.
The following script authenticates against the Aviary API and via chunking, uploads a media file. This approach only works for media files below 1G (maybe up to 2G at times) due to security and configuration restrictions on the Aviary side (according to Feb. 2023 conversations with AVP)
- Sometime after Aug/Sept 2022 tests and before Feb 2023, the API upload seems to have broken
- the filename was made a required parameter (not in the release notes)
- the API upload occurs without error, but the media does not appear in the Web UI, and the resource media listing in the Web UI throws an error page after the API upload
python3 aviary_media_api_upload_chunked.py --server ${aviary_server_name} --input input.sample.csv
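For readers unfamiliar with chunking, the general idea is to read the media file in fixed-size pieces so that no single request carries the whole file. The sketch below shows only the reading side; `iter_chunks` and the 5 MiB default are hypothetical, and how each chunk is sent to Aviary is not shown because the upload endpoint details are not described here.

```python
def iter_chunks(path, chunk_size=5 * 1024 * 1024):
    """Yield (chunk_number, bytes) pairs read sequentially from the file at path."""
    with open(path, "rb") as handle:
        chunk_number = 0
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break  # end of file reached
            yield chunk_number, chunk
            chunk_number += 1
```

Reading lazily with a generator keeps memory use at roughly one chunk, regardless of the size of the media file.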
The ffmpeg tool can be used to generate test videos in cases where one requires a video of a certain size without copyright or permission encumbrances.
For example, the following creates a video of a testsrc with a 10sec duration at a 30 frames/second rate. By varying the duration, one can increase the storage size of the resulting video.
ffmpeg -f lavfi -i testsrc=duration=10:size=1280x720:rate=30 testsrc_10.mpg
Another option is to concatenate multiple videos together using ffmpeg and the concat feature that takes as input a file listing the video files to concatenate (one-per-line) and the output file.
ffmpeg -f concat -safe 0 -i ffmpeg_concat.txt -c copy 3g.mp4
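The list file passed to the concat feature contains one `file '<path>'` directive per line. A small sketch for generating it, assuming the file name ffmpeg_concat.txt from the command above; the helper `write_concat_list` is hypothetical and not part of the repository.

```python
def write_concat_list(video_paths, list_path="ffmpeg_concat.txt"):
    """Write an ffmpeg concat list file with one file directive per video."""
    with open(list_path, "w", encoding="utf-8") as handle:
        for path in video_paths:
            # A single quote inside a quoted path is written as '\''
            escaped = str(path).replace("'", "'\\''")
            handle.write(f"file '{escaped}'\n")
    return list_path
```

Repeating the same small test video several times in `video_paths` is a simple way to reach a target output size.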
Note: uses fragile workarounds as AVP Aviary API does not cover all the required features (e.g., pagination as of March 2024; index API added after Nov 2023).
- Get the list of resources in the SpokenWeb Collection (ID: 1783)
- Workaround as the API doesn't have pagination nor a filter by collection
- Export all resources as CSV from the UI resources table: https://ualberta.aviaryplatform.com/collection_resources
- Verify the "Collection Title" property is enabled in the UI resources table "manage table" otherwise there is no information to determine each resource's collection
- Filter the CSV by the "Collection Title" property (verify enabled via the "manage table" list of displayed properties)
grep 'SpokenWeb UAlberta' delete2/UniversityofAlbertaLibrary_collection_resources_2024-03-25_1711392929.csv
- add back the CSV header
- Get the resource metadata (JSON)
python3 avp_aviary_api_interactions/json/aviary_api_report_resources_json_by_resource_list.py --server 'https://ualberta.aviaryplatform.com/' --output delete2/aviary_collection_1783_resources_2024-03-25.json --input delete2/UniversityofAlbertaLibrary_collection_resources_2024-03-25_1711392929_collection_1783.csv
- Get the media metadata (JSON)
- Nov. 2023 approach fails as of March 2024 (unable to export the media CSV from the Aviary UI): aviary_api_report_index_json_by_media_list.py
- Use the 'media_file_id' in the resource JSON output (truncated to 10 media IDs when last tested in 2022)
- Check if the media_file_id field in the resource JSON has truncated the ID list at 10 items:
jq '.[].data.media_file_id | length' delete2/aviary_collection_1783_resources_2024-03-25.json
jq '.[1].data' delete2/aviary_collection_1783_resources_2024-03-25.json
python3 avp_aviary_api_interactions/json/aviary_api_report_media_by_resource_json.py --server 'https://ualberta.aviaryplatform.com/' --input delete2/aviary_collection_1783_resources_2024-03-25.json --output delete2/aviary_collection_1783_media_2024-03-25.json
- These two should match:
jq '. | length' delete2/aviary_collection_1783_media_2024-03-25.json
jq '.[].data.media_file_id[]' delete2/aviary_collection_1783_resources_2024-03-25.json | wc -l
- Get the index metadata
python3 avp_aviary_api_interactions/json/aviary_api_report_index_by_media_json.py --server 'https://ualberta.aviaryplatform.com/' --input delete2/aviary_collection_1783_media_2024-03-25.json --output delete2/aviary_collection_1783_index_2024-03-25.json
- These two should match
jq '.[].data.indexes[].id' delete2/aviary_collection_1783_media_2024-03-25.json | wc -l
jq '. | length' delete2/aviary_collection_1783_index_2024-03-25.json
- Download index files
- using the same script as of November 2023
python3 avp_aviary_api_interactions/experimental/experimental_test_batch_download.py --server 'https://ualberta.aviaryplatform.com/' --input_file delete2/index_id_list_2024-03-25.csv --output_path delete2/index_file_export/ --type i --wait 10
- using the new API index endpoint as of March 2024
python3 avp_aviary_api_interactions/json/aviary_api_download_index_by_index_json.py --server 'https://ualberta.aviaryplatform.com/' --input delete2/aviary_collection_1783_index_2024-03-25.json --output delete2/index_file_export__new_api_march_2024 --wait 5
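The grep step in the workflow above drops the CSV header and can match the collection name in any column. A possible alternative is sketched here with Python's csv module: it keeps the header and tests only the "Collection Title" column. The function name `filter_by_collection` is hypothetical, and the substring match is an assumption chosen to mirror the grep behaviour.

```python
import csv


def filter_by_collection(input_csv, output_csv, collection_title,
                         column="Collection Title"):
    """Copy the header plus the rows whose column contains collection_title.

    Returns the number of rows kept.
    """
    with open(input_csv, newline="", encoding="utf-8") as src, \
            open(output_csv, "w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src)
        if column not in (reader.fieldnames or []):
            # The column is missing when it is disabled in the UI "manage table" list
            raise ValueError(f"Column not found in CSV export: {column}")
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames,
                                extrasaction="ignore")
        writer.writeheader()
        kept = 0
        for row in reader:
            if collection_title in (row.get(column) or ""):
                writer.writerow(row)
                kept += 1
    return kept
```

For example, `filter_by_collection("export.csv", "collection_1783.csv", "SpokenWeb UAlberta")` would write only the SpokenWeb rows, where both file names are placeholders.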