Odea module¶
Odea: Open Digital Ethnography Archives toolkit
This Python toolkit is designed to operate with living collections of ethnographic documents, organized using the BagIt archival standard.
The goal is to provide tools for automating the management of archival documents – storage, indexing, validation, conversion to distribution formats, metadata cataloguing – in ways that allow everything to remain accessible from the computer file system and open to manipulation with standard tools.
-
odea.ITEM_METADATA_DIR= 'item_metadata'¶ The subdirectory of the bag containing metadata files for items.
-
odea.FILE_METADATA_DIR= 'file_metadata'¶ The subdirectory of the bag containing metadata files for files.
-
odea.THUMBS_DIR= 'thumbs'¶ The subdirectory of the bag containing thumbnail images for files.
-
odea.DATA_DIR= 'data'¶ The payload directory of the bag (should always be ‘data’ for BagIt standard compliance).
-
odea.DERIV_DIR= 'data/deriv'¶ The subdirectory of the bag in which derivative files will be stored on generation.
-
odea.HTML_DIR= 'html'¶ The subdirectory of the bag in which generated html metadata files will be stored.
-
odea.RE_UUID= re.compile('[0-F]{8}-[0-F]{4}-[0-F]{4}-[0-F]{4}-[0-F]{12}', re.IGNORECASE)¶ Regular expression for matching UUID identifiers in filenames.
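As a minimal sketch of how this pattern can be applied, the snippet below rebuilds the expression shown above with the standard `re` module and extracts a UUID from a filename. The helper name `find_uuid` is hypothetical and is not part of odea. Note that the character class `[0-F]` spans every ASCII character from `0` to `F`, so it also admits a few punctuation characters (such as `:` and `@`) in addition to hexadecimal digits.

```python
import re

# Same pattern as the documented odea.RE_UUID value.
RE_UUID = re.compile(
    '[0-F]{8}-[0-F]{4}-[0-F]{4}-[0-F]{4}-[0-F]{12}', re.IGNORECASE)


def find_uuid(filename):
    """Return the first UUID-like substring in filename, or None.

    Hypothetical helper for illustration only.
    """
    match = RE_UUID.search(filename)
    return match.group(0) if match else None


tagged = 'data/spam.48342ee3-9080-407e-9862-12ce05143499.txt'
print(find_uuid(tagged))
print(find_uuid('data/spam.txt'))
```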
-
odea.RE_HASHTAG= re.compile('(#[\\w\\d\\-_]+)')¶ Regular expression for matching hashtags in note fields.
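The following sketch shows one way the hashtag pattern could be used to collect tags from a note field. It copies the documented expression and uses only the standard library; `find_hashtags` is a hypothetical name, and how odea itself consumes the matches is not shown in this reference.

```python
import re

# Same pattern as the documented odea.RE_HASHTAG value.
RE_HASHTAG = re.compile('(#[\\w\\d\\-_]+)')


def find_hashtags(note):
    """Return every hashtag found in a note string, in order."""
    return RE_HASHTAG.findall(note)


note = 'Recorded at the market. #fieldwork #oral-history'
print(find_hashtags(note))
```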
-
odea.HASH_BLOCK_SIZE= 524288¶ Block size used when reading files for hashing.
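To illustrate why a block size matters, here is a minimal sketch of block-wise hashing with `hashlib`: the file is read in chunks of `HASH_BLOCK_SIZE` bytes so that large media files never need to fit in memory. The function name `hash_file` is an assumption made for this example; odea's own implementation may differ.

```python
import hashlib
import os
import tempfile

HASH_BLOCK_SIZE = 524288  # 512 KiB, the documented value


def hash_file(path, alg='sha256', block_size=HASH_BLOCK_SIZE):
    """Return the hex digest of a file, reading it one block at a time."""
    digest = hashlib.new(alg)
    with open(path, 'rb') as handle:
        # iter() with a sentinel stops when read() returns empty bytes.
        for block in iter(lambda: handle.read(block_size), b''):
            digest.update(block)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, 'spam.txt')
    with open(sample, 'wb') as out:
        out.write(b'Spam, eggs, bacon, and spam!')
    checksum = hash_file(sample)

print(checksum)
```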
-
odea.TERMS= ['dcmi_type', 'title', 'identifier', 'creator', 'subject', 'contributor', 'coverage', 'date', 'description', 'language', 'publisher', 'relation', 'rights', 'source', 'note']¶ List of metadata terms used in preparing html output for items. These will correspond to the item properties but are listed here in presentation order.
-
odea.DOCUTILS_CSS= '/home/docs/checkouts/readthedocs.org/user_builds/odea/envs/latest/lib/python3.7/site-packages/odea-1.0-py3.7.egg/static/docutils_odea.css'¶ Docutils css. This path is computed from the package location.
-
odea.DOCUTILS_TEMPLATE= '/home/docs/checkouts/readthedocs.org/user_builds/odea/envs/latest/lib/python3.7/site-packages/odea-1.0-py3.7.egg/static/docutils_template.txt'¶ Docutils page template. This path is computed from the package location. This template provides a “viewport” meta tag to enable responsive display.
-
odea.PANDOC_CSS= '/home/docs/checkouts/readthedocs.org/user_builds/odea/envs/latest/lib/python3.7/site-packages/odea-1.0-py3.7.egg/static/pandoc_odea.css'¶ Pandoc css. This path is computed from the package location.
-
odea.CSS= "q::before { content: none; } q::after { content: none; } q{font-style: italic}'"¶ Custom CSS to be added to html output (currently based on Bootstrap 5).
-
odea.HTML_TEMPLATE= '<!doctype html> <html lang="en"> <head> <meta charset="utf-8"> <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no"> <link rel="stylesheet" href="bootstrap.min.css"> <style>{css}</style> <title>{title} - {archive}</title> </head> <body> <nav class="navbar navbar-expand-lg navbar-dark bg-primary"> <div class="container"> <a class="navbar-brand" href="{archive_url}">{archive}</a> </div> </nav> <div class="container py-4"> {nav} <h1>{title}</h1> {body} </div> <footer class="footer mt-5 p-3"> <div class="container"> <p class="text-muted">{page_metadata}</p> <p class="text-muted">{license}</span> </div> </footer> </body> </html>\n'¶ Template for html page output. Variables passed to the string are {css}, {title}, {archive}, {archive_url}, {nav}, {body}, {page_metadata}, and {license}. Note that the default template expects a Bootstrap stylesheet (bootstrap.min.css) to be present within the html directory; this needs to be downloaded from <http://v5.getbootstrap.com/>.
-
odea.CMD_DF_IMG_THUMB= 'convert "{source}[{frame}]" -density 300 -thumbnail 360x360^ -gravity center -extent 360x360 -background white -alpha remove -auto-orient {target}'¶ Shell command for deriving a thumbnail image from a source file. This will crop the image if it does not fit the bounding box.
-
odea.CMD_DF_IMG_MED= 'convert "{source}[{frame}]" -density 300 -resize 800x600\\> -background white -alpha remove -auto-orient {target}'¶ Shell command for deriving a medium-size image from a source file.
-
odea.CMD_DF_IMG_LG= 'convert "{source}[{frame}]" -density 300 -resize 1920x1080\\> -background white -alpha remove -auto-orient {target}'¶ Shell command for deriving a large image from a source file.
-
odea.CMD_PF_WEBARC= 'wget --input-file="{source}" --convert-links --page-requisites --span-hosts --adjust-extension --restrict-file-names=windows --directory-prefix={target}'¶ Shell command for generating an offline, archival copy of a web document. The input (source file) should be a plain-text file containing a single URL or a list of URLs.
-
odea.CMD_PF_WAV= 'ffmpeg -i "{source}" "{target}"'¶ Shell command for deriving a WAV audio file from a source media file.
-
odea.CMD_DF_MP3= 'ffmpeg -i "{source}" "{target}"'¶ Shell command for deriving an MP3 audio file from a source media file.
-
odea.CMD_DF_PDF_DOC= 'libreoffice --headless --convert-to pdf "{source}"; filename=$(basename -- "{source}"); mv "${{filename%.*}}.pdf" "{target}"'¶ Shell command for deriving a pdf file from a word processor document. This uses LibreOffice, which recognizes OpenDocument and MS-Office documents, spreadsheets, and presentations. Libreoffice does not allow output filename customization, but just writes the target in the current working directory (bag root), so the resulting file must be moved.
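The doubled braces in this template (`${{filename%.*}}`) suggest that the command strings are filled in with Python's `str.format`, with `{source}` and `{target}` as the fields; that is an assumption drawn from the template text, not a documented guarantee. The sketch below only builds the command string and does not run it. `build_command` and the example paths are hypothetical.

```python
# Copy of the documented template, split across lines for readability.
CMD_DF_PDF_DOC = (
    'libreoffice --headless --convert-to pdf "{source}"; '
    'filename=$(basename -- "{source}"); '
    'mv "${{filename%.*}}.pdf" "{target}"'
)


def build_command(template, **fields):
    """Fill a command template; doubled braces collapse to literal braces."""
    return template.format(**fields)


command = build_command(
    CMD_DF_PDF_DOC,
    source='data/report.odt',
    target='data/deriv/report.pdf',
)
print(command)
```

After formatting, the shell sees the ordinary parameter expansion `${filename%.*}`, which strips the extension from the source basename before the move.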
-
odea.CMD_DF_PDF_HTML= 'read -r URL < "{source}"; wkhtmltopdf "$URL" "{target}"'¶ Shell command for deriving a pdf file from a source html document. The input (source file) should be a plain-text file containing a URL on its first line; the command reads only that line.
-
odea.CMD_PF_SCREENSHOT= 'read -r URL < "{source}"; wkhtmltoimage "$URL" "{target}"'¶ Shell command for deriving a full-page screenshot from a source html document. The input (source file) should be a plain-text file containing a URL on its first line; the command reads only that line.
-
odea.CMD_DF_SCREENSHOT_CROPPED= 'read -r URL < "{source}"; wkhtmltoimage "$URL" --crop-h 800 --quality 60 "{target}"'¶ Shell command for deriving a cropped screenshot from a source html document. The input (source file) should be a plain-text file containing a URL on its first line; the command reads only that line.
-
odea.CMD_PF_TIFF= 'convert -compress none "{source}[{frame}]" "{target}"'¶ Shell command for deriving a preservation-format uncompressed TIFF file from a source image.
-
odea.CMD_DF_PDF_VECTOR= 'inkscape "{source}" --export-pdf="{target}"'¶ Shell command for deriving a pdf version of a vector image (svg).
-
odea.CMD_PF_VECTOR= 'inkscape "{source}" --export-plain-svg="{target}"'¶ Shell command for deriving a “clean” preservation-ready version of a source svg image.
-
odea.CMD_DF_H264= 'ffmpeg -loglevel panic -nostdin -i "{source}" -vcodec libx264 -acodec aac -ab 384K -crf 21 -bf 2 -flags +cgop -pix_fmt yuv420p -movflags faststart "{target}"'¶ Shell command for deriving an mp4 video with h.264 codec from a source video, at the input resolution.
-
odea.CMD_DF_H264_CONCAT= 'ffmpeg -loglevel panic -nostdin -f concat -segment_time_metadata 1 -i "{source}" -vcodec libx264 -acodec aac -ab 384K -crf 21 -bf 2 -flags +cgop -pix_fmt yuv420p -movflags faststart "{target}"'¶ Shell command for deriving an mp4 video with h.264 codec from a list of source video clips, provided in a plain-text file readable by the ffmpeg concat demuxer. See <https://ffmpeg.org/ffmpeg-formats.html#concat>. This command is primarily useful for assembling raw video footage from a project, stored archivally as a collection of source clips, into a single file (or virtual “reel”) for redistribution.
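The plain-text source file for this command lists one clip per line using the `file '<path>'` directive described in the ffmpeg documentation linked above. A minimal sketch of generating such a list follows; `write_concat_list` and the clip names are hypothetical, and odea does not necessarily provide an equivalent helper.

```python
import os
import tempfile


def write_concat_list(clips, list_path):
    """Write a concat list naming each clip, in playback order."""
    with open(list_path, 'w', encoding='utf-8') as out:
        for clip in clips:
            # A single quote inside a path is written as '\'' so that
            # the quoted string stays well-formed.
            escaped = clip.replace("'", "'\\''")
            out.write("file '{}'\n".format(escaped))


with tempfile.TemporaryDirectory() as tmp:
    reel = os.path.join(tmp, 'reel.txt')
    write_concat_list(['clip_001.mov', 'clip_002.mov'], reel)
    with open(reel, encoding='utf-8') as handle:
        listing = handle.read()

print(listing)
```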
-
odea.CMD_DF_360P_VP9_400K= 'ffmpeg -loglevel panic -nostdin -i "{source}" -codec:v libvpx-vp9 -b:v 400K -crf 31 -speed 4 -tile-columns 6 -frame-parallel 1 -vf scale=-1:360 -f webm "{target}"'¶ Shell command for deriving a 360p webm video from a source video file, for redistribution online or in limited space/bandwidth contexts.
-
odea.CMD_PF_FFV1= 'ffmpeg -loglevel panic -nostdin -i "{source}" -vcodec ffv1 -acodec pcm_s16le "{target}"'¶ Shell command for deriving a preservation-format video, using the ffv1 codec, from a source video file. Warning: the resulting files will be extremely large!
-
odea.CMD_DF_IMG_STILL= 'ffmpeg -loglevel panic -nostdin -ss {frame}.0 -i "{source}" -frames:v 1 "{target}"'¶ Shell command for generating a still image from a source video, given the input video and a time point (“frame”). The time can be expressed either in HH:MM:SS format (e.g., “54:20”) or as a number of seconds with optional decimal fraction (e.g., “3260.2”).
-
odea.CMD_DF_IMG_STILLS= 'mkdir {target}; ffmpeg -i "{source}" -vf fps=1/6,scale=-1:360 "{target}/%%05d.jpg"'¶ Shell command for generating a series of still images from a video, one per six seconds.
-
odea.CMD_DF_DOCUTILS_HTML= 'rst2html5 --date --smart-quotes=yes --template="/home/docs/checkouts/readthedocs.org/user_builds/odea/envs/latest/lib/python3.7/site-packages/odea-1.0-py3.7.egg/static/docutils_template.txt" --stylesheet-path="/home/docs/checkouts/readthedocs.org/user_builds/odea/envs/latest/lib/python3.7/site-packages/odea-1.0-py3.7.egg/static/docutils_odea.css" "{source}" "{target}"'¶ Shell command to convert ReStructured Text to html via Docutils. The
--template value is obtained from the variable DOCUTILS_TEMPLATE, which defaults to the file /static/docutils_template.txt in the odea package. The --stylesheet-path value is obtained from the variable DOCUTILS_CSS, which defaults to the file /static/docutils_odea.css in the odea package.
-
odea.CMD_DF_PANDOC_HTML= 'pandoc -o "{target}" -t html5 -c "/home/docs/checkouts/readthedocs.org/user_builds/odea/envs/latest/lib/python3.7/site-packages/odea-1.0-py3.7.egg/static/pandoc_odea.css" --standalone "{source}"'¶ Shell command to convert Markdown, ReStructured Text, or any other plain-text format to html via Pandoc. The -c (css) value is obtained from the variable PANDOC_CSS, which defaults to the file /static/pandoc_odea.css in the odea package.
-
odea.new(path=None, archive=None, title=None)¶ Create a new Bag structure on disk in the current working directory.
- Parameters
path – Path to a directory in which to create the new Bag.
archive – The name of the archive to which the collection belongs. This will be added to bag-info.txt.
title – The title of the collection. This will be added to bag-info.txt.
Odea will abort when creating a Bag, Item, or File object if there is not a corresponding BagIt bag structure on disk, either in the current working directory or in the path above it. Since all file paths listed in Bag, Item, and File metadata are relative to the bag root, the root must be identifiable at the time any object is initialized.
A valid BagIt structure is currently assumed to exist in a directory containing:
A file named bagit.txt (the contents are not currently verified);
A data subdirectory for payload files.
This function will create either of these elements, as well as a template bag-info.txt file, if they are missing in the supplied directory path. The directory does not need to be empty.
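The behaviour described above can be sketched with the standard library alone. This is an illustrative approximation, not odea's code: `ensure_bag_skeleton` is a hypothetical name, the `bag-info.txt` template is omitted, and the bag declaration text follows the BagIt specification rather than whatever odea actually writes.

```python
import os
import tempfile


def ensure_bag_skeleton(path):
    """Create bagit.txt and data/ under path if they are missing."""
    # exist_ok=True leaves an existing payload directory untouched.
    os.makedirs(os.path.join(path, 'data'), exist_ok=True)
    declaration = os.path.join(path, 'bagit.txt')
    if not os.path.isfile(declaration):
        with open(declaration, 'w', encoding='utf-8') as out:
            # Assumed contents: the two-line bag declaration from the
            # BagIt specification. The version odea uses is not stated.
            out.write('BagIt-Version: 1.0\n'
                      'Tag-File-Character-Encoding: UTF-8\n')


with tempfile.TemporaryDirectory() as tmp:
    ensure_bag_skeleton(tmp)
    ensure_bag_skeleton(tmp)  # a second call changes nothing
    created = sorted(os.listdir(tmp))

print(created)
```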
-
odea.load_sample_file(filename)¶ Load and return a file from the test directory as a File object.
The testing directory within the odea package contains input documents in the various formats supported by odea. These are used in the docstring examples in this module as inputs for sample commands and testing.
This function loads a file and sets the following File properties:
File.filename, File.basename, File.ext, File.identifier, File.format, File.size, File.sha256.
-
odea.test_bag()¶ Create a sample bag for testing purposes.
This bag is used in the docstring examples throughout this module. The bag is created in a temporary directory, so there should be no risk of side-effects in testing.
>>> import odea
>>> b = odea.test_bag()
>>> b.title
'My test bag'
>>> b.subject
['spam', 'eggs']
>>> b.identifier
'893cddb6-6d94-4af6-be16-5cbfdb5d70e3'
>>> print(b.tree())
./
    bagit.txt
    data/
        deriv/
    file_metadata/
    html/
    item_metadata/
-
odea.is_root(path)¶ Identify whether <path> is a bag root, returning True or False.
-
odea.get_root(path)¶ Get the bag root, relative to the string <path>. Return the root or None.
- Parameters
path – A relative or absolute filesystem path; the input path will be resolved against the current directory if it is relative. The path does not need to exist on disk.
The root is resolved as path itself, or the nearest parent directory of path, that contains both a data subdirectory and a file bagit.txt.
>>> import odea, os
>>> b = odea.test_bag()
>>> root = os.getcwd()
>>> odea.get_root(root) == root
True
>>> d2 = os.path.join(root, 'data', 'foo', 'bar', 'baz')
>>> os.makedirs(d2)
>>> os.chdir(d2)
>>> odea.get_root('.') == root
True
>>> odea.get_root('spam/eggs.txt') == root
True
If there is no bag in the path, None will be returned:
>>> odea.get_root('/random/dir/spam.txt') is None
True
If there are multiple nested collections, only the innermost (lowest-level) bag root will be returned:
>>> os.chdir(d2)
>>> odea.new()
>>> odea.get_root('.') == os.getcwd()
True
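The lookup described here amounts to walking up the directory tree until a directory containing both `data/` and `bagit.txt` is found. The following is a minimal sketch of that idea under stated assumptions: `is_bag_root` and `find_bag_root` are hypothetical stand-ins, not the odea functions themselves, and edge cases such as symbolic links are ignored.

```python
import os
import tempfile


def is_bag_root(path):
    """True if path holds a data/ directory and a bagit.txt file."""
    return (os.path.isdir(os.path.join(path, 'data'))
            and os.path.isfile(os.path.join(path, 'bagit.txt')))


def find_bag_root(path):
    """Return the nearest enclosing bag root for path, or None."""
    current = os.path.abspath(path)  # the path need not exist on disk
    while True:
        if is_bag_root(current):
            return current
        parent = os.path.dirname(current)
        if parent == current:  # reached the top of the filesystem
            return None
        current = parent


with tempfile.TemporaryDirectory() as tmp:
    root = os.path.realpath(tmp)
    os.makedirs(os.path.join(root, 'data', 'foo'))
    open(os.path.join(root, 'bagit.txt'), 'w').close()
    found = find_bag_root(os.path.join(root, 'data', 'foo', 'eggs.txt'))
    matches = (found == root)

print(matches)
```

Because the search starts at the given path and moves upward, a nested bag is found before any bag that encloses it.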
-
odea.load_bag()¶ Look for an existing ‘bag-info.txt’ metadata file and load as a Bag object. If no metadata file exists, create a new Bag object.
-
odea.load_item(item_uuid)¶ Look for an existing metadata file matching the item uuid and load as an Item object. If no metadata file exists, create a new Item object.
- Parameters
item_uuid – The UUID for an item in the archive.
-
odea.load_file(filename)¶ Look for an existing metadata file matching the filepath and load as a File object.
- Parameters
filename – The path to a file in the current Bag.
The metadata document is matched against the File.identifier and File.format properties; this function will ignore the filepath if either of these properties is not present in the filename.
If no metadata file exists, a new File object will be created and returned.
>>> import odea, os
>>> b = odea.test_bag()
>>> spam = os.path.join('data', 'spam.txt')
>>> open(spam, 'w').close()
>>> f = odea.load_file(spam)
>>> f
File(filename='data/spam.txt', ...)
With a filename that doesn’t exist:
>>> f2 = odea.load_file('nonexistent-file.txt')
With a file that already has metadata:
>>> id = '2716fe6a-1fba-4dba-b34e-593450f9b975'
>>> fn = 'data/test.txt'
>>> tag_file = os.path.join(odea.ITEM_METADATA_DIR, '{}.txt'.format(id))
>>> with open(tag_file, 'w') as t:
...     o = t.write('{{"identifier": "{}", "filename": "{}"}}'.format(id, fn))
>>> odea.load_file(fn)
File(filename='data/test.txt', ...)
-
exception
odea.BagError¶
-
exception
odea.BagValidationError(message, details=None)¶
Bag¶
-
class
odea.Bag(archive='odeum', archive_url=None, title=None, identifier=None, creator=None, subject=None, contributor=None, coverage=None, date=None, description=None, language=None, publisher=None, relation=None, rights=None, source=None, preview=None, dcmi_type='Collection', note=None)¶ An abstract instance of a Bag.
-
archive¶ The name of the archive to which this collection belongs.
-
archive_url¶ Web address (URL) of the archive responsible for this collection.
-
title¶ Dublin Core title metadata element for the Bag. Represents a name given to the resource.
-
identifier¶ Identifier for the Bag, represented by default as a version 4 UUID hexadecimal string. This property should normally be set automatically.
-
creator¶ Dublin Core creator metadata element for the Bag. Represents an entity primarily responsible for making the resource (i.e., the collection curator).
-
subject¶ Dublin Core subject metadata element for the Bag. Represents the topic of the resource (keyword).
-
contributor¶ Dublin Core contributor metadata element for the Bag. Represents an entity responsible for making contributions to the resource.
-
coverage¶ Dublin Core coverage metadata element for the Bag. Represents the spatial or temporal topic of the resource, the spatial applicability of the resource, or the jurisdiction under which the resource is relevant.
-
date¶ Dublin Core date metadata element for the Bag. Represents a point or period of time associated with an event in the lifecycle of the resource. The date should be presented in an ISO 8601 string, but type checking is not enforced.
-
description¶ Dublin Core description metadata element for the Bag. Represents an account of the resource that may include an abstract, a table of contents, a graphical representation, or a free-text account of the resource.
-
language¶ Dublin Core language metadata element for the Bag. Represents the language of the resource, ideally using RFC 4646 language codes (e.g., “en” for English, “mn” for Mongolian).
-
publisher¶ Dublin Core publisher metadata element for the Bag. Represents an entity responsible for making the resource available.
-
relation¶ Dublin Core relation metadata element for the Bag. Represents a related resource.
-
rights¶ Dublin Core rights metadata element for the Bag. Represents information about rights held in and over the resource. Typically this will be a copyright statement, license name, or link to a document providing terms of use.
-
source¶ Dublin Core source metadata element for the Bag. Represents a related resource from which the described resource is derived.
-
preview¶ Path to a file within the Bag that provides a preview image representing the bag contents.
-
dcmi_type¶ Type of the Bag, represented using the DCMI Type Vocabulary. The only valid type for a Bag is Collection.
-
note¶ Annotation.
-
json()¶ Return a json string representing the Bag.
>>> import json
>>> import odea
>>> b = odea.test_bag()
>>> print(json.dumps(json.loads(b.json()), indent=4))
{
    "archive": "odeum",
    "archive_url": "",
    "contributor": null,
    "coverage": null,
    "creator": [],
    "date": null,
    "dcmi_type": "Collection",
    "description": null,
    "identifier": "893cddb6-6d94-4af6-be16-5cbfdb5d70e3",
    "language": null,
    "preview": "path/to/image",
    "publisher": null,
    "relation": null,
    "rights": null,
    "source": null,
    "subject": [
        "spam",
        "eggs"
    ],
    "title": "My test bag"
}
-
tree(path='.')¶ Print a directory tree representing the bag contents.
See the documentation for test_bag() for an example.
-
html()¶ Return an html Bag description string.
- Example
>>> import odea
>>> b = odea.test_bag()
>>> print(b.html())
<!DOCTYPE doctype html>
<html lang="en">
<head>
<!-- Required meta tags -->
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1, shrink-to-fit=no" name="viewport"/>
<!-- Bootstrap CSS -->
<link href="https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css"/>
<style>
.card-columns {column-count: 1;}
@media (min-width: 768px) {.card-columns {column-count: 2;}}
@media (min-width: 992px) {.card-columns {column-count: 3;}}
.card{max-width:360px}
</style>
<title> My test bag - Digital Archive </title>
</head>
<body>
<nav class="navbar navbar-expand-lg navbar-dark bg-primary">
<div class="container">
<a class="navbar-brand" href=""> Digital Archive </a>
</div>
</nav>
<div class="container py-4">
<h1> <small class="text-muted text-uppercase"> collection / </small> My test bag </h1>
<p> <img class="img-thumbnail" src="path/to/image"/> </p>
<table class="table">
<tr> <th> title </th> <td> My test bag </td> </tr>
<tr> <th> identifier </th> <td> 893cddb6-6d94-4af6-be16-5cbfdb5d70e3 </td> </tr>
<tr> <th> subject </th> <td> <ul> <li> spam </li> <li> eggs </li> </ul> </td> </tr>
<tr> <th> dcmi_type </th> <td> Collection </td> </tr>
</table>
<div class="card-columns"> </div>
</div>
<footer class="footer mt-5 p-3">
<div class="container">
<p class="text-muted"> rev. ... </p>
<p class="text-muted"> Except where otherwise noted, content on this site is licensed under a <a href="http://creativecommons.org/licenses/by/4.0/" rel="license"> Creative Commons Attribution 4.0 International License </a> . </p>
</div>
</footer>
</body>
</html>
-
update_manifest(alg='sha512')¶ Update the Bag manifest.
- Parameters
alg – The algorithm to be used. Defaults to sha512; sha256 can also be used.
- Example
>>> import odea
>>> b = odea.test_bag()
Any new file should be added to the manifest.
>>> import os
>>> spam = os.path.join('data', 'spam.txt')
>>> with open(spam, 'w') as out:
...     out.write('Spam, eggs, bacon, and spam!')
28
>>> b.update_manifest()
>>> with open('manifest-sha512.txt', 'r') as manifest:
...     manifest.readlines()
['6aec3c2caf8a5f9984fd1... data/spam.txt']
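A BagIt payload manifest is a plain-text file with one line per payload file: the checksum, whitespace, then the path relative to the bag root. The sketch below builds such lines for everything under `data/`. It is an illustration under stated assumptions: `manifest_lines` is a hypothetical helper, files are read whole for brevity, and odea's handling of ordering or tag files may differ.

```python
import hashlib
import os
import tempfile


def manifest_lines(bag_root, alg='sha512'):
    """Return '<checksum> <relative path>' lines for the payload files."""
    lines = []
    payload = os.path.join(bag_root, 'data')
    for dirpath, _dirs, files in os.walk(payload):
        for name in files:
            full = os.path.join(dirpath, name)
            with open(full, 'rb') as handle:
                checksum = hashlib.new(alg, handle.read()).hexdigest()
            # Manifest paths use forward slashes on every platform.
            rel = os.path.relpath(full, bag_root).replace(os.sep, '/')
            lines.append('{} {}'.format(checksum, rel))
    # Sort by path so the output is stable between runs.
    return sorted(lines, key=lambda line: line.split(' ', 1)[1])


with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, 'data'))
    with open(os.path.join(tmp, 'data', 'spam.txt'), 'wb') as out:
        out.write(b'Spam, eggs, bacon, and spam!')
    lines = manifest_lines(tmp)

print(lines)
```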
-
save()¶ Save the Bag data structure to disk in plain text format.
This will be in the file bag-info.txt in the root of the Bag.
>>> import odea
>>> b = odea.test_bag()
>>> b.title = "Modified title"
>>> b.creator = ['Author 1', 'Author 2']
>>> b.save()
>>> with open('bag-info.txt', 'r') as t:
...     print(t.read())
-
items()¶ Return a list of Item objects for items in the bag.
Items are retrieved from the tag files stored in the ITEM_METADATA_DIR directory.
>>> import odea
>>> b = odea.test_bag()
>>> i = odea.Item(identifier=odea.NIL_UUID, title='test_item')
>>> i.save()
>>> for i in b.items():
...     print(i.title)
test_item
-
pub_items()¶ Return a list of Item objects for published items in the bag.
This allows the html index to be updated only with items that are already included in the HTML_DIR directory.
>>> import odea, os
>>> b = odea.test_bag()
>>> i = odea.Item(identifier=odea.NIL_UUID, title='test item')
>>> i.save()
>>> b.pub_items()
[]
>>> h = os.path.join(odea.HTML_DIR, '{}.html'.format(odea.NIL_UUID))
>>> open(h, 'w').close()
>>> print([x.title for x in b.pub_items()])
['test item']
-
Item¶
-
class
odea.Item(title=None, identifier=None, creator=None, subject=None, contributor=None, coverage=None, date=None, description=None, language=None, publisher=None, relation=None, rights=None, source=None, dcmi_type=None, embed_url=None, note=None)¶ -
identifier¶ Identifier for the Item, represented by default as a version 4 UUID hexadecimal string. This property should normally be set automatically.
-
title¶ Dublin Core title metadata element for the Item. Represents a name given to the resource.
-
creator¶ Dublin Core creator metadata element for the Item. Represents an entity primarily responsible for making the resource. This property is a list of strings.
-
contributor¶ Dublin Core contributor metadata element for the Item. Represents an entity responsible for making contributions to the resource. This property is a list of strings.
-
coverage¶ Dublin Core coverage metadata element for the Item. Represents the spatial or temporal topic of the resource, the spatial applicability of the resource, or the jurisdiction under which the resource is relevant.
-
date¶ Dublin Core date metadata element for the Item. Represents a point or period of time associated with an event in the lifecycle of the resource. The date should be presented in an ISO 8601 string, but type checking is not enforced.
-
description¶ Dublin Core description metadata element for the Item. Represents an account of the resource that may include an abstract, a table of contents, a graphical representation, or a free-text account of the resource.
-
language¶ Dublin Core language metadata element for the Item. Represents the language of the resource, ideally using RFC 4646 language codes (e.g., “en” for English, “mn” for Mongolian).
-
publisher¶ Dublin Core publisher metadata element for the Item. Represents an entity responsible for making the resource available.
-
relation¶ Dublin Core relation metadata element for the Item. Represents a related resource.
-
rights¶ Dublin Core rights metadata element for the Item. Represents information about rights held in and over the resource. Typically this will be a copyright statement, license name, or link to a document providing terms of use.
-
source¶ Dublin Core source metadata element for the Item. Represents a related resource from which the described resource is derived.
-
subject¶ Dublin Core subject metadata element for the Item. Represents the topic of the resource (keyword). This property is a list of strings.
-
dcmi_type¶ Type of the Item, represented using the DCMI Type Vocabulary. Valid types include Event, Image, MovingImage, PhysicalObject, Software, Sound, StillImage, and Text.
-
note¶ Annotation.
-
json()¶ Return a json string representing the Item.
-
files()¶ Return a list of file objects, corresponding to the files on disk associated with the Item. The list is generated by matching the identifier to tagged files.
-
save()¶ Save the Item data structure to disk.
>>> import odea, os
>>> b = odea.test_bag()
>>> i = odea.Item('data/example.txt')
>>> i.title = 'Example item'
>>> i.identifier = odea.NIL_UUID
>>> i.save()
>>> t = os.path.join(odea.ITEM_METADATA_DIR, '{}.txt'.format(i.identifier))
>>> with open(t, 'r') as tag_file:
...     print(tag_file.read())
-
html()¶ Return an html Item description string.
- Example
>>> import odea
>>> b = odea.test_bag()
>>> i = odea.Item(identifier=odea.NIL_UUID, title='Test item')
>>> print(i.html())
<!DOCTYPE doctype html>
<html lang="en">
<head>
<!-- Required meta tags -->
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1, shrink-to-fit=no" name="viewport"/>
<!-- Bootstrap CSS -->
<link href="https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css" rel="stylesheet"/>
<style>
.card-columns {column-count: 1;}
@media (min-width: 768px) {.card-columns {column-count: 2;}}
@media (min-width: 992px) {.card-columns {column-count: 3;}}
.card{max-width:360px}
</style>
<title> Test item - Digital Archive </title>
</head>
<body>
<nav class="navbar navbar-expand-lg navbar-dark bg-primary">
<div class="container">
<a class="navbar-brand" href=""> Digital Archive </a>
</div>
</nav>
<div class="container py-4">
<h1> <small class="text-muted text-uppercase"> item / </small> Test item </h1>
<table class="table">
<tr> <th> title </th> <td> Test item </td> </tr>
<tr> <th> identifier </th> <td> 0000000-0000-0000-0000-000000000000 </td> </tr>
</table>
<h2> Files </h2>
<table class="table">
<tr> <th> file </th> <th> size </th> <th> date modified </th> </tr>
</table>
</div>
<footer class="footer mt-5 p-3">
<div class="container">
<p class="text-muted"> rev. ... </p>
<p class="text-muted"> Except where otherwise noted, content on this site is licensed under a <a href="http://creativecommons.org/licenses/by/4.0/" rel="license"> Creative Commons Attribution 4.0 International License </a> . </p>
</div>
</footer>
</body>
</html>
-
src()¶ Return the path to the “SRC” file for the item.
-
tag_file()¶
-
File¶
-
class
odea.File(filename=None, sha512=None, sha256=None, size=None, mtime=None, identifier=None, basename=None, format=None, ext=None, preview=None, dimensions=None, duration=None, thumb=None)¶ This is a file on disk.
-
filename¶ The filename, including relative directory path from the bag root (e.g., data/subdir/file.ext)
-
sha512¶ The sha512 hash of the file (hex string)
-
sha256¶ The sha256 hash of the file (hex string)
-
size¶ The size of the file, in bytes (integer)
-
mtime¶ The modification time of the file (datetime object)
-
identifier¶ The unique identifier of the item to which the file belongs, represented as a UUID string. The uuid identifier string is included in the filename as a dot-separated element in the pattern <basename>[.<format>][.<uuid>].<ext>.
-
basename¶ The basename or “stem” of the filename. The basename identifier string is included in the filename as a dot-separated element in the pattern <basename>[.<format>][.<uuid>].<ext>.
-
format¶ The format of the file. The format identifier string is included in the filename as a dot-separated element in the pattern <basename>[.<format>][.<uuid>].<ext>. When generated by odea, this may be src for a source file, df-<type> for a distribution copy, or pf-<type> for an archival preservation copy. Available formats for derivative files correspond to the shell scripts listed in odea.templates.
-
ext¶ The filename extension
-
thumb¶ Path to a thumbnail image representing the file.
-
preview¶ Path to a medium-sized image representing the file.
-
dimensions¶ Dimensions of an image or video
-
duration¶ Duration of a video or audio file
-
get_checksum(alg='sha512')¶ Calculate the hash for a file.
- Parameters
alg – Supported algorithms are “sha256” and “sha512”.
-
get_sha256()¶ Calculate the sha256 hash of the file.
An empty file returns None:
>>> import odea
>>> b = odea.test_bag()
>>> f = odea.load_sample_file('test_plain-text.txt')
If the file exists, the hash should be returned and written to the sha256 property:
>>> f.get_sha256()
'92b772380a3f8e27a93e57e6deeca6c01da07f5aadce78bb2fbb20de10a66925'
>>> f.sha256
'92b772380a3f8e27a93e57e6deeca6c01da07f5aadce78bb2fbb20de10a66925'
-
get_sha512()¶ Calculate the sha512 hash of the file and save to the sha512 property. This value should be persisted to the file manifest-sha512.txt.
>>> import odea
>>> b = odea.test_bag()
>>> f = odea.load_sample_file('test_plain-text.txt')
If the file exists, the hash should be returned and written to the sha512 property:
>>> f.get_sha512()
'9751ea443fd632e147831566ccb822482220188993cd1269edbe98d2e2d69...'
-
json()¶ Return a json string representing the File.
Null values are not included in the output.
>>> import json
>>> import odea
>>> b = odea.test_bag()
>>> f = odea.load_sample_file('test_plain-text.txt')
>>> print(json.dumps(json.loads(f.json()), indent=4))
{
    "basename": "data/test_plain-text",
    "ext": "txt",
    "filename": "data/test_plain-text.SRC.0000000-0000-0000-0000-000000000000.txt",
    "format": "SRC",
    "sha256": "92b772380a3f8e27a93e57e6deeca6c01da07f5aadce78bb2fbb20de10a66925",
    "size": 15,
    "uuid": "0000000-0000-0000-0000-000000000000"
}
-
get_mtime()¶ Return the mtime of a file on disk and set the mtime property.
If the file does not exist, nothing is returned:
>>> import odea, os
>>> b = odea.test_bag()
>>> spam = os.path.join('data', 'spam.txt')
>>> f = odea.File(spam)
>>> f.get_mtime() == None
True
If the file exists, its modification time is returned and stored:
>>> open(spam, 'a').close()
>>> os.utime(spam, (1330712280, 1330712292))
>>> f.get_mtime() == f.mtime
True
>>> f.mtime
'2012-03-02T12:18:12Z'
-
get_img_dimensions()¶ Set and return the dimensions of an image file.
>>> import odea
>>> b = odea.test_bag()
>>> f = odea.load_sample_file('test_img_jpeg.jpg')
>>> f.get_img_dimensions()
'2835x4289'
Nothing will happen if the image dimensions cannot be determined:
>>> f = odea.load_sample_file('test_plain-text.txt')
>>> f.get_img_dimensions() == None
True
-
get_audio_duration()¶ Set and return the duration of an audio file.
>>> import odea
>>> b = odea.test_bag()
>>> f = odea.load_sample_file('test_wav_sound.wav')
>>> f.get_audio_duration()
3.0
Nothing will happen if the sound file cannot be read:
>>> f = odea.load_sample_file('test_plain-text.txt')
>>> f.get_audio_duration() == None
True
-
get_video_duration()¶ Set and return the duration of a video file.
>>> import odea
>>> b = odea.test_bag()
>>> f = odea.load_sample_file('test_video.mp4')
>>> f.get_video_duration()
3.0
Nothing will happen if the video file cannot be read:
>>> f = odea.load_sample_file('test_plain-text.txt')
>>> f.get_video_duration() == None
True
-
get_size()¶ Return the size of a file on disk and set the size property.
>>> import odea
>>> b = odea.test_bag()
>>> f = odea.load_sample_file('test_img_jpeg.jpg')
>>> f.get_size()
3506068
-
get_uuid()¶ Retrieve and set the uuid property for a file.
If the uuid property is set, it will be returned. Otherwise, the filename will be scanned for a matching uuid, using the regular expression RE_UUID.
>>> import odea, os
>>> b = odea.test_bag()
>>> id = '48342ee3-9080-407e-9862-12ce05143499'
>>> spam = os.path.join('data', 'spam.{}.txt'.format(id))
>>> open(spam, 'w').close()
>>> f = odea.load_file(spam)
>>> f.identifier == id
True
>>> del f.identifier
>>> f.identifier == id
False
>>> f.get_uuid() == id == f.identifier
True
-
tag(_uuid=None)¶ Tag a filename with a UUID and update the File properties.
This operation does not actually rename files on disk.
>>> import odea
>>> b = odea.test_bag()
>>> f = odea.load_sample_file('test_plain-text.txt')
>>> f.tag(_uuid=odea.NIL_UUID)
'data/test_plain-text.SRC.0000000-0000-0000-0000-000000000000.txt'
The tagging operation should correctly set the following attributes: File.filename, File.basename, File.ext, File.identifier, and File.format.

>>> a = 'data/test.file.many.parts.txt'
>>> b = 'data/a.b.0000000-0000-0000-0000-000000000000.c.d.txt'
>>> c = 'data/test-file-no-extension'
>>> d = 'data/test-file.SRC.txt'
>>> f = {}
>>> for fn in [a, b, c, d]:
...     open(fn, 'w').close()
...     f[fn] = odea.File(filename=fn)
...     o = f[fn].tag(_uuid=odea.NIL_UUID)

>>> f[a].filename
'data/test.file.many.parts.txt'
>>> f[a].basename
'data/test'
>>> f[a].ext
'txt'
>>> f[a].identifier
'0000000-0000-0000-0000-000000000000'
>>> f[a].format
'file.many.parts'

>>> f[b].filename
'data/a.b.0000000-0000-0000-0000-000000000000.c.d.txt'
>>> f[b].basename
'data/a'
>>> f[b].ext
'c.d.txt'
>>> f[b].identifier
'0000000-0000-0000-0000-000000000000'
>>> f[b].format
'b'

>>> f[c].filename
'data/test-file-no-extension'
>>> f[c].basename
'data/test-file-no-extension'
>>> f[c].ext
''
>>> f[c].identifier
'0000000-0000-0000-0000-000000000000'
>>> f[c].format
'SRC'

>>> f[d].filename
'data/test-file.SRC.txt'
>>> f[d].basename
'data/test-file'
>>> f[d].ext
'txt'
>>> f[d].identifier
'0000000-0000-0000-0000-000000000000'
>>> f[d].format
'SRC'
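The splitting rules implied by these examples can be approximated as follows. This is a sketch inferred from the documented outputs, not odea's implementation: `split_filename` and `UUID_PATTERN` are hypothetical names, the default format 'SRC' is taken from the examples, and the sketch only parses an existing name (it does not assign a new identifier the way tag() does).

```python
import posixpath
import re

# Hypothetical stand-in for odea.RE_UUID, restricted to hexadecimal digits.
UUID_PATTERN = re.compile(
    r'[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}',
    re.IGNORECASE,
)


def split_filename(filename, default_format='SRC'):
    """Split a bag-relative filename into basename, format, identifier, ext."""
    directory, name = posixpath.split(filename)
    match = UUID_PATTERN.search(name)
    if match:
        # Everything after the UUID is the extension; before it, the first
        # dotted component is the basename and the remainder is the format.
        identifier = match.group(0)
        left = name[:match.start()].strip('.')
        ext = name[match.end():].strip('.')
        stem, _, fmt = left.partition('.')
    else:
        # Without a UUID: first component is the basename, last is the
        # extension, and anything in between is the format.
        identifier = None
        parts = name.split('.')
        stem = parts[0]
        ext = parts[-1] if len(parts) > 1 else ''
        fmt = '.'.join(parts[1:-1])
    return {
        'basename': posixpath.join(directory, stem),
        'format': fmt or default_format,
        'identifier': identifier,
        'ext': ext,
    }


print(split_filename('data/a.b.48342ee3-9080-407e-9862-12ce05143499.c.d.txt'))
```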
-
slug()¶ Return a shortened and sanitized form of the file basename.
The slug removes spaces and special characters from the filename, and truncates the basename to 60 characters (this includes the full path from the bag root, “data/path/to/file”, but NOT the filetype, uuid, and extension).
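A minimal sketch of this kind of sanitizing, assuming that "special characters" means anything other than ASCII letters, digits, underscores, hyphens, and the path separator. The exact character set odea keeps is not stated here, and `slugify` is a hypothetical name.

```python
import re

MAX_SLUG_LENGTH = 60  # truncation length given in the description above


def slugify(basename, max_length=MAX_SLUG_LENGTH):
    """Drop spaces and special characters, then truncate the basename."""
    cleaned = re.sub(r'[^A-Za-z0-9_/-]', '', basename)
    return cleaned[:max_length]


print(slugify('data/My Field Notes (2019)!'))
print(len(slugify('data/' + 'a' * 100)))
```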
-
rename()¶ Rename a file on disk, based on its metadata properties.
The output format is <basename>[.<format>][.<uuid>].<ext>.
Note that the filename components are not set automatically by the File class on initialization. They can be set by tag() or load_file().

>>> import odea, os
>>> b = odea.test_bag()
>>> fn = os.path.join(odea.DATA_DIR, 'test_plain-text.txt')
>>> f = odea.File(fn)
>>> open(fn, 'w').close()
>>> f.identifier = 'b3050922-520f-426e-9a9c-cfe728bd178d'
>>> f.format = 'sample'
>>> f.basename = 'data/test_plain-text'
>>> f.ext = 'rst'
>>> f.filename
'data/test_plain-text.txt'
>>> f.rename() == f.filename
True
>>> f.filename
'data/test_plain-text.sample.b3050922-520f-426e-9a9c-cfe728bd178d.rst'
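The output format <basename>[.<format>][.<uuid>].<ext> amounts to joining the non-empty components with dots. The sketch below shows only the name composition, not the rename on disk; `build_filename` is a hypothetical helper, and omitting an empty extension is an assumption.

```python
def build_filename(basename, ext, fmt=None, identifier=None):
    """Compose <basename>[.<format>][.<uuid>].<ext>, skipping empty parts."""
    parts = [basename]
    if fmt:
        parts.append(fmt)
    if identifier:
        parts.append(identifier)
    if ext:
        parts.append(ext)
    return '.'.join(parts)


print(build_filename(
    'data/test_plain-text', 'rst',
    fmt='sample',
    identifier='b3050922-520f-426e-9a9c-cfe728bd178d',
))
```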
-
get_filename_parts()¶ Populate filename part properties from the filename itself.
This should be used when importing a new file, as it will override any properties that have been set manually. This method is called by tag(), which additionally creates or applies a uuid tag.
-
save()¶ Save the File data structure to disk.
-
derive(target, ext, frame=None, overwrite=False, target_dir=None)¶ Generate a derivative version of a file. Return the full filename of the derived file.
- Parameters
target – Conversion target. Available targets are produced through shell scripts defined in the variables odea.CMD_<RULE>, which can be overwritten or extended. The built-in targets are listed below.
frame – The page, image, or frame number to use from the input resource. For a multi-image or multi-page document input, frame is an integer corresponding to the image or page number (starting with ‘0’ for the first image or page). This works with multi-image TIFF files as well as for PDF documents. For video files, the default stills generation command will take an image from the middle of the file if frame is “0” or not specified; otherwise an image frame will be extracted using the ffmpeg time duration syntax. The following examples are all valid input values for frame in this context: “55” (55 seconds), “12:03:45” (12 hours, 3 minutes, and 45 seconds), or “23.189” (23.189 seconds).
ext – The extension of the target filename. For some commands this will change the output format (e.g., images processed by Imagemagick can be generated as PNG, JPEG, or another recognized format).
overwrite – If True, overwrite a target file if it already exists. Otherwise simply return the filename.
target_dir – The directory for derivative output. Defaults to DERIV_DIR, but this can be overridden (e.g., thumbnail images are located in THUMBS_DIR).
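To make the three example values for frame concrete, the sketch below converts them to a number of seconds. It covers only the colon-separated and plain-seconds forms quoted above, not the full ffmpeg time duration syntax, and `frame_to_seconds` is a hypothetical helper that odea does not necessarily provide (the value may simply be passed through to ffmpeg).

```python
def frame_to_seconds(frame):
    """Convert 'SS', 'SS.mmm', 'MM:SS' or 'HH:MM:SS' to seconds."""
    seconds = 0.0
    for part in str(frame).split(':'):
        # Each colon shifts the running total up by a factor of 60.
        seconds = seconds * 60 + float(part)
    return seconds


for value in ('55', '12:03:45', '23.189'):
    print(value, frame_to_seconds(value))
```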
Default targets
- PF_WAV
A lossless audio file in WAV format.
Target generated by CMD_PF_WAV.
- DF_MP3
A distribution copy of an audio file, in lossy MP3 format.
Target generated by CMD_DF_MP3.
- DF_PDF_DOC
A PDF version of word processor documents (text, spreadsheet, etc.).
Target generated by CMD_DF_PDF_DOC.
- DF_PDF_HTML
For web documents given using a “url” file, this will be a PDF corresponding to the print version of the resource.
Target generated by CMD_DF_PDF_HTML.
- DF_IMG_THUMB
A thumbnail image (currently 360x360).
Target generated by CMD_DF_THUMB_IMG.
- DF_IMG_MED
A medium-sized image (currently 800x600).
Target generated by CMD_DF_IMG_MED.
- DF_IMG_LG
A large-scale image (1920x1080).
Target generated by CMD_DF_IMG_LG.
- PF_FFV1
A lossless video, using the FFV1 codec. (Note that the file sizes are extremely large!)
Target generated by CMD_PF_FFV1.
- DF_360P_VP9_400K
A distribution copy of a video, reduced to 360p resolution in webm format using the VP9 codec.
Target generated by CMD_DF_360P_VP9_400K.
- DF_H264
A distribution copy of the video optimized for upload to video hosting services such as YouTube, Vimeo, or Internet Archive. With the default command the video is not scaled, but the bitrate may be reduced from the source file and the moov atom will be placed at the beginning of the file in order to enable streaming.
Target generated by CMD_DF_H264.
- DF_H264_CONCAT
The same as “DF_H264”, but generated from an ffconcat list. All the clips in the source directory, as referenced by the “df-concat-list” file, will be assembled to create unified derivatives; thus the file “<project-name>.df-h264.<uuid>.mp4” will be a single video that could include all the footage from the “<project-name>.dir” directory, and subsequent derivatives can be created from concatenated video.
Target generated by CMD_DF_H264_CONCAT.
- PF_WEBARC
The “web archive” is a directory containing a downloaded copy of the resource specified in a URL file, along with any other resources needed in order to display that resource (e.g., embedded images, stylesheets, or fonts). Links to the associated resources are converted to relative hyperlinks in order to make the resource viewable locally, but are otherwise unchanged.
Target generated by CMD_PF_WEBARC.
- PF_SCREENSHOT
A full screenshot that captures the entire page of a web resource, as viewed in a web browser. This is a bitmap image, so it can be very large.
Target generated by CMD_PF_SCREENSHOT.
- DF_SCREENSHOT_CROPPED
A cropped screenshot of a web resource, showing the visible part of the page without scrolling.
Target generated by CMD_DF_SCREENSHOT_CROPPED.
- DF_IMG_STILL
A still frame from a video.
Target generated by CMD_DF_IMG_STILL.
- DF_IMG_STILLS
A sequence of still images (thumbnails) from a video.
Target generated by CMD_DF_IMG_STILLS.
Examples
>>> import odea
>>> b = odea.test_bag()
>>> c = odea.load_sample_file('test_corrupt-file.jpg')
>>> c.derive('DF_IMG_MED', 'png')

>>> f1 = odea.load_sample_file('test_wav_sound.wav')
>>> d1 = f1.derive('DF_MP3', 'mp3')
>>> fd1 = odea.File(d1)
>>> fd1.filename
'data/deriv/test_wav_sound.df-mp3.0000000-0000-0000-0000-000000000000.mp3'
>>> f2 = odea.load_sample_file('test_url.urls')
>>> d2a = f2.derive('DF_PDF_HTML', 'pdf')
>>> d2b = f2.derive('PF_WEBARC', 'dir')
>>> d2c = f2.derive('PF_SCREENSHOT', 'png')
>>> print(b.tree())
./
    bagit.txt
    data/
        test_corrupt-file.SRC.0000000-0000-0000-0000-000000000000.jpg
        test_url.SRC.0000000-0000-0000-0000-000000000000.urls
        test_wav_sound.SRC.0000000-0000-0000-0000-000000000000.wav
        deriv/
            test_url.df-pdf-html.0000000-0000-0000-0000-000000000000.pdf
            test_url.pf-screenshot.0000000-0000-0000-0000-000000000000.png
            test_wav_sound.df-mp3.0000000-0000-0000-0000-000000000000.mp3
            test_url.pf-webarc.0000000-0000-0000-0000-000000000000.dir/
                commons.wikimedia.org/
                    w/
                        index.php@title=File%3A1913_Gandan_Monastery_in_Khuree.jpg&oldid=359804693.html
                example.net/
                    index.html
    file_metadata/
    html/
    item_metadata/
>>> f3 = odea.load_sample_file('test_img_jpeg.jpg')
>>> d3 = f3.derive('DF_IMG_MED', 'png')
>>> fd3 = odea.File(d3)
>>> f3.get_img_dimensions()
'2835x4289'
>>> fd3.get_img_dimensions()
'397x600'
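The derivative filenames in these examples follow a visible pattern: the target name is lowercased with underscores turned into hyphens, and it replaces the format component of the source name. The sketch below reproduces that pattern as inferred from the outputs shown; `derived_filename` is a hypothetical helper, and the default directory mirrors the documented value of DERIV_DIR.

```python
import posixpath


def derived_filename(basename, target, identifier, ext, target_dir='data/deriv'):
    """Build '<target_dir>/<stem>.<target-label>.<uuid>.<ext>'."""
    stem = posixpath.basename(basename)
    # 'DF_MP3' becomes 'df-mp3', 'PF_WEBARC' becomes 'pf-webarc', and so on.
    label = target.lower().replace('_', '-')
    return posixpath.join(target_dir, '.'.join([stem, label, identifier, ext]))


print(derived_filename(
    'data/test_wav_sound', 'DF_MP3',
    '0000000-0000-0000-0000-000000000000', 'mp3',
))
```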
-
thumbs()¶ Generate thumbnail images for the input filename.
Two thumbnail images are generated and saved to the THUMBS_DIR folder in the Bag root, with 360px and 800px widths.

The thumbnail files are named using the hash of the input filename, so they remain available even if the source file is moved or renamed. The paths to the generated images are stored in the thumb and preview properties.

>>> import odea
>>> b = odea.test_bag()
>>> f = odea.load_sample_file('test_img_jpeg.jpg')
>>> f.thumbs()
('thumbs/46568651e13bf1416a802075827b67ed-360x256.jpeg', 'thumbs/46568651e13bf1416a802075827b67ed-800x256.jpeg')
If the file contents change, it is necessary to remove the existing thumbnail images before generating new ones. The filenames are accessible from the properties thumb and preview.

>>> import os
>>> os.path.isfile(f.thumb)
True
>>> os.remove(f.thumb)
>>> os.path.isfile(f.thumb)
False
>>> del f.thumb
>>> f.thumb is None
True
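The naming scheme in the example output (a 32-character hexadecimal digest followed by a size label) can be sketched as below. This takes the description literally and hashes the filename string; the use of MD5 is an assumption based only on the digest length, the size label is passed in rather than computed, and `thumb_path` is a hypothetical helper.

```python
import hashlib
import posixpath


def thumb_path(filename, label, thumbs_dir='thumbs'):
    """Build '<thumbs_dir>/<hash of filename>-<label>.jpeg'."""
    # Assumption: MD5 of the filename string; odea's actual hash input and
    # algorithm are not stated in this section.
    digest = hashlib.md5(filename.encode('utf-8')).hexdigest()
    return posixpath.join(thumbs_dir, '{}-{}.jpeg'.format(digest, label))


print(thumb_path('data/test_img_jpeg.jpg', '360x256'))
```

Because the name is derived deterministically, calling the helper again with the same input yields the same path, which is why stale thumbnails have to be removed by hand before regenerating.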
-