Odea module

Odea: Open Digital Ethnography Archives toolkit

This Python toolkit is designed to operate with living collections of ethnographic documents, organized using the BagIt archival standard.

The goal is to provide tools for automating the management of archival documents – storage, indexing, validation, conversion to distribution formats, metadata cataloguing – in ways that allow everything to remain accessible from the computer file system and open to manipulation with standard tools.

odea.ITEM_METADATA_DIR = 'item_metadata'

The subdirectory of the bag containing metadata files for items.

odea.FILE_METADATA_DIR = 'file_metadata'

The subdirectory of the bag containing metadata files for files.

odea.THUMBS_DIR = 'thumbs'

The subdirectory of the bag containing thumbnail images for files.

odea.DATA_DIR = 'data'

The payload directory of the bag (should always be ‘data’ for BagIt standard compliance).

odea.DERIV_DIR = 'data/deriv'

The subdirectory of the bag in which derivative files will be stored on generation.

odea.HTML_DIR = 'html'

The subdirectory of the bag in which generated html metadata files will be stored.

odea.RE_UUID = re.compile('[0-F]{8}-[0-F]{4}-[0-F]{4}-[0-F]{4}-[0-F]{12}', re.IGNORECASE)

Regular expression for matching UUID identifiers in filenames.
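
Note that the character class `[0-F]` spans the ASCII range from `0` to `F`, which also admits the punctuation characters `:;<=>?@`. The following is a minimal sketch of a stricter pattern, assuming only hexadecimal digits should match. The names `STRICT_RE_UUID` and `find_uuid` are illustrative and are not part of odea.

```python
import re

# Hypothetical stricter variant of RE_UUID: hexadecimal digits only.
STRICT_RE_UUID = re.compile(
    r'[0-9A-F]{8}-[0-9A-F]{4}-[0-9A-F]{4}-[0-9A-F]{4}-[0-9A-F]{12}',
    re.IGNORECASE,
)

def find_uuid(filename):
    """Return the first UUID-like token in a filename, or None."""
    match = STRICT_RE_UUID.search(filename)
    return match.group(0) if match else None

print(find_uuid('data/spam.48342ee3-9080-407e-9862-12ce05143499.txt'))
```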

odea.RE_HASHTAG = re.compile('(#[\\w\\d\\-_]+)')

Regular expression for matching hashtags in note fields.
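
As a rough illustration of how this pattern behaves, the sketch below applies it to a sample note. It assumes the doubled backslashes shown above are repr escaping, so the pattern is rewritten as a raw string; `extract_hashtags` is a hypothetical helper, not an odea function.

```python
import re

# Same pattern as odea.RE_HASHTAG, written here as a raw string.
RE_HASHTAG = re.compile(r'(#[\w\d\-_]+)')

def extract_hashtags(note):
    """Return all hashtags found in a note, in order of appearance."""
    return RE_HASHTAG.findall(note)

tags = extract_hashtags('Interview with A. #field-notes #2019_trip.')
print(tags)
```

Hyphens and underscores are treated as part of a tag, while trailing punctuation such as a full stop is not.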

odea.HASH_BLOCK_SIZE = 524288

Block size used when reading files for hashing.
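
Reading in blocks keeps memory use constant regardless of file size. The sketch below shows one common way to do this with the standard library; `hash_file` is a hypothetical helper and is not necessarily how odea implements its own checksum methods.

```python
import hashlib
import os
import tempfile

HASH_BLOCK_SIZE = 524288  # 512 KiB, the value documented above

def hash_file(path, alg='sha512', block_size=HASH_BLOCK_SIZE):
    """Hash a file in fixed-size blocks instead of loading it whole."""
    digest = hashlib.new(alg)
    with open(path, 'rb') as handle:
        # iter() with a sentinel stops once read() returns b'' at end of file.
        for block in iter(lambda: handle.read(block_size), b''):
            digest.update(block)
    return digest.hexdigest()

# Demonstration on a temporary file larger than one block.
payload = b'spam' * 200000  # 800,000 bytes
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
checksum = hash_file(tmp.name, 'sha256')
os.remove(tmp.name)
print(checksum)
```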

odea.TERMS = ['dcmi_type', 'title', 'identifier', 'creator', 'subject', 'contributor', 'coverage', 'date', 'description', 'language', 'publisher', 'relation', 'rights', 'source', 'note']

List of metadata terms used in preparing html output for items. These will correspond to the item properties but are listed here in presentation order.

odea.DOCUTILS_CSS = '/home/docs/checkouts/readthedocs.org/user_builds/odea/envs/latest/lib/python3.7/site-packages/odea-1.0-py3.7.egg/static/docutils_odea.css'

Docutils css. This path is computed from the package location.

odea.DOCUTILS_TEMPLATE = '/home/docs/checkouts/readthedocs.org/user_builds/odea/envs/latest/lib/python3.7/site-packages/odea-1.0-py3.7.egg/static/docutils_template.txt'

Docutils page template. This path is computed from the package location. This template provides a “viewport” meta tag to enable responsive display.

odea.PANDOC_CSS = '/home/docs/checkouts/readthedocs.org/user_builds/odea/envs/latest/lib/python3.7/site-packages/odea-1.0-py3.7.egg/static/pandoc_odea.css'

Pandoc css. This path is computed from the package location.

odea.CSS = "q::before { content: none; } q::after { content: none; } q{font-style: italic}'"

Custom CSS to be added to html output (currently based on Bootstrap 5).

odea.HTML_TEMPLATE = '<!doctype html> <html lang="en"> <head> <meta charset="utf-8"> <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no"> <link rel="stylesheet" href="bootstrap.min.css"> <style>{css}</style> <title>{title} - {archive}</title> </head> <body> <nav class="navbar navbar-expand-lg navbar-dark bg-primary"> <div class="container"> <a class="navbar-brand" href="{archive_url}">{archive}</a> </div> </nav> <div class="container py-4"> {nav} <h1>{title}</h1> {body} </div> <footer class="footer mt-5 p-3"> <div class="container"> <p class="text-muted">{page_metadata}</p> <p class="text-muted">{license}</span> </div> </footer> </body> </html>\n'

Template for html page output. Variables passed to the string are {css}, {archive}, {archive_url}, {title}, {nav}, {body}, {page_metadata}, and {license}. Note that the default template expects a Bootstrap stylesheet to be present within the html directory; this needs to be downloaded from <http://v5.getbootstrap.com/>.
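
The template is filled with `str.format`, so every placeholder must be supplied or a `KeyError` is raised. The sketch below uses an abbreviated stand-in template (not the full `HTML_TEMPLATE`) and made-up values to show the mechanism.

```python
# Abbreviated stand-in for odea.HTML_TEMPLATE, not the full template.
TEMPLATE = (
    '<html><head><style>{css}</style>'
    '<title>{title} - {archive}</title></head>'
    '<body>{body}<footer>{license}</footer></body></html>'
)

page = TEMPLATE.format(
    css='q::before { content: none; }',  # braces inside values are not re-parsed
    title='My test bag',
    archive='Digital Archive',
    body='<p>Hello</p>',
    license='CC BY 4.0',
)
print(page)
```

Literal braces are only a problem inside the template string itself, where they would need to be doubled; braces in substituted values pass through unchanged.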

odea.CMD_DF_IMG_THUMB = 'convert "{source}[{frame}]" -density 300 -thumbnail 360x360^ -gravity center -extent 360x360 -background white -alpha remove -auto-orient {target}'

Shell command for deriving a thumbnail image from a source file. This will crop the image if it does not fit the bounding box.
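
The `CMD_*` constants are format strings with `{source}`, `{frame}`, and `{target}` placeholders. The following sketch shows how such a template might be filled before being handed to a shell; the helper `build_command` and the file paths are assumptions for illustration, and the command is only printed, not executed.

```python
import shlex

CMD_DF_IMG_THUMB = (
    'convert "{source}[{frame}]" -density 300 -thumbnail 360x360^ '
    '-gravity center -extent 360x360 -background white -alpha remove '
    '-auto-orient {target}'
)

def build_command(template, **fields):
    """Fill a command template; the caller decides how to run it."""
    return template.format(**fields)

command = build_command(
    CMD_DF_IMG_THUMB,
    source='data/photo.SRC.jpg',             # hypothetical source file
    frame=0,                                 # first frame or page
    target=shlex.quote('thumbs/photo.jpg'),  # {target} is unquoted in the template
)
print(command)
# To execute (requires ImageMagick):
#   subprocess.run(command, shell=True, check=True)
```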

odea.CMD_DF_IMG_MED = 'convert "{source}[{frame}]" -density 300 -resize 800x600\\> -background white -alpha remove -auto-orient {target}'

Shell command for deriving a medium-size image from a source file.

odea.CMD_DF_IMG_LG = 'convert "{source}[{frame}]" -density 300 -resize 1920x1080\\> -background white -alpha remove -auto-orient {target}'

Shell command for deriving a large image from a source file.

odea.CMD_PF_WEBARC = 'wget --input-file="{source}" --convert-links --page-requisites --span-hosts --adjust-extension --restrict-file-names=windows --directory-prefix={target}'

Shell command for generating an offline, archival copy of a web document. The input (source file) should be a plain-text file containing a single URL or a list of URLs.

odea.CMD_PF_WAV = 'ffmpeg -i "{source}" "{target}"'

Shell command for deriving a WAV audio file from a source media file.

odea.CMD_DF_MP3 = 'ffmpeg -i "{source}" "{target}"'

Shell command for deriving an MP3 audio file from a source media file.

odea.CMD_DF_PDF_DOC = 'libreoffice --headless --convert-to pdf "{source}"; filename=$(basename -- "{source}"); mv "${{filename%.*}}.pdf" "{target}"'

Shell command for deriving a pdf file from a word processor document. This uses LibreOffice, which recognizes OpenDocument and MS-Office documents, spreadsheets, and presentations. LibreOffice does not allow output filename customization, but just writes the target in the current working directory (bag root), so the resulting file must be moved.
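
The doubled braces in `${{filename%.*}}` are `str.format` escapes: after formatting they become the shell parameter expansion `${filename%.*}`, which strips the final extension from the source basename. A small sketch with hypothetical paths makes the result visible:

```python
CMD_DF_PDF_DOC = (
    'libreoffice --headless --convert-to pdf "{source}"; '
    'filename=$(basename -- "{source}"); '
    'mv "${{filename%.*}}.pdf" "{target}"'
)

command = CMD_DF_PDF_DOC.format(
    source='data/report.SRC.odt',           # hypothetical source path
    target='data/deriv/report.df-pdf.pdf',  # hypothetical target path
)
print(command)
```

For the source above, the shell would move `report.SRC.pdf` from the working directory to the target path.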

odea.CMD_DF_PDF_HTML = 'read -r URL < "{source}"; wkhtmltopdf "$URL" "{target}"'

Shell command for deriving a pdf file from a source html document. The input (source file) should be a plain-text file containing a single URL or a list of URLs.

odea.CMD_PF_SCREENSHOT = 'read -r URL < "{source}"; wkhtmltoimage "$URL" "{target}"'

Shell command for deriving a full-page screenshot from a source html document. The input (source file) should be a plain-text file containing a single URL or a list of URLs.

odea.CMD_DF_SCREENSHOT_CROPPED = 'read -r URL < "{source}"; wkhtmltoimage "$URL" --crop-h 800 --quality 60 "{target}"'

Shell command for deriving a cropped screenshot from a source html document. The input (source file) should be a plain-text file containing a single URL or a list of URLs.

odea.CMD_PF_TIFF = 'convert -compress none "{source}[{frame}]" "{target}"'

Shell command for deriving a preservation-format uncompressed TIFF file from a source image.

odea.CMD_DF_PDF_VECTOR = 'inkscape "{source}" --export-pdf="{target}"'

Shell command for deriving a pdf version of a vector image (svg).

odea.CMD_PF_VECTOR = 'inkscape "{source}" --export-plain-svg="{target}"'

Shell command for deriving a “clean” preservation-ready version of a source svg image.

odea.CMD_DF_H264 = 'ffmpeg -loglevel panic -nostdin -i "{source}" -vcodec libx264 -acodec aac -ab 384K -crf 21 -bf 2 -flags +cgop -pix_fmt yuv420p -movflags faststart "{target}"'

Shell command for deriving an mp4 video with h.264 codec from a source video, at the input resolution.

odea.CMD_DF_H264_CONCAT = 'ffmpeg -loglevel panic -nostdin -f concat -segment_time_metadata 1 -i "{source}" -vcodec libx264 -acodec aac -ab 384K -crf 21 -bf 2 -flags +cgop -pix_fmt yuv420p -movflags faststart "{target}"'

Shell command for deriving an mp4 video with h.264 codec from a list of source video clips, provided in a plain-text file readable by the ffmpeg concat demuxer. See <https://ffmpeg.org/ffmpeg-formats.html#concat>. This command is primarily useful for assembling raw video footage from a project, stored archivally as a collection of source clips, into a single file (or virtual “reel”) for redistribution.
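
The list file consumed by `-f concat` contains one `file '<path>'` directive per clip. The sketch below writes such a file; `write_concat_list` and the clip names are hypothetical, and it assumes the clip paths contain no single quotes (which would need escaping).

```python
import os
import tempfile

def write_concat_list(clips, list_path):
    """Write one "file '<path>'" directive per clip, in playback order."""
    with open(list_path, 'w', encoding='utf-8') as handle:
        for clip in clips:
            handle.write("file '{}'\n".format(clip))

clips = ['clip_001.mp4', 'clip_002.mp4', 'clip_003.mp4']  # hypothetical names
list_path = os.path.join(tempfile.mkdtemp(), 'reel.txt')
write_concat_list(clips, list_path)

with open(list_path, encoding='utf-8') as handle:
    listing = handle.read()
print(listing)
```

The resulting text file would then be passed as `{source}` to the command above.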

odea.CMD_DF_360P_VP9_400K = 'ffmpeg -loglevel panic -nostdin -i "{source}" -codec:v libvpx-vp9 -b:v 400K -crf 31 -speed 4 -tile-columns 6 -frame-parallel 1 -vf scale=-1:360 -f webm "{target}"'

Shell command for deriving a 360p webm video from a source video file, for redistribution online or in limited space/bandwidth contexts.

odea.CMD_PF_FFV1 = 'ffmpeg -loglevel panic -nostdin -i "{source}" -vcodec ffv1 -acodec pcm_s16le "{target}"'

Shell command for deriving a preservation-format video, using the ffv1 codec, from a source video file. Warning: the resulting files will be extremely large!

odea.CMD_DF_IMG_STILL = 'ffmpeg -loglevel panic -nostdin -ss {frame}.0 -i "{source}" -frames:v 1 "{target}"'

Shell command for generating a still image from a source video, given the input video and a time point (“frame”). The time can be expressed either in HH:MM:SS format (e.g., “54:20”) or as a number of seconds with optional decimal fraction (e.g., “3260.2”).

odea.CMD_DF_IMG_STILLS = 'mkdir {target}; ffmpeg -i "{source}" -vf fps=1/6,scale=-1:360 "{target}/%%05d.jpg"'

Shell command for generating a series of still images from a video, one per six seconds.

odea.CMD_DF_DOCUTILS_HTML = 'rst2html5 --date --smart-quotes=yes --template="/home/docs/checkouts/readthedocs.org/user_builds/odea/envs/latest/lib/python3.7/site-packages/odea-1.0-py3.7.egg/static/docutils_template.txt" --stylesheet-path="/home/docs/checkouts/readthedocs.org/user_builds/odea/envs/latest/lib/python3.7/site-packages/odea-1.0-py3.7.egg/static/docutils_odea.css" "{source}" "{target}"'

Shell command to convert ReStructured Text to html via Docutils. The --template value is obtained from the variable DOCUTILS_TEMPLATE, which defaults to the file /static/docutils_template.txt in the odea package. The --stylesheet-path value is obtained from the variable DOCUTILS_CSS, which defaults to the file /static/docutils_odea.css in the odea package.

odea.CMD_DF_PANDOC_HTML = 'pandoc -o "{target}" -t html5 -c "/home/docs/checkouts/readthedocs.org/user_builds/odea/envs/latest/lib/python3.7/site-packages/odea-1.0-py3.7.egg/static/pandoc_odea.css" --standalone "{source}"'

Shell command to convert Markdown, ReStructured Text, or any other plain-text format to html via Pandoc. The -c (css) value is obtained from the variable PANDOC_CSS, which defaults to the file /static/pandoc_odea.css in the odea package.

odea.new(path=None, archive=None, title=None)

Create a new Bag structure on disk in the current working directory.

Parameters
  • path – Path to a directory in which to create the new Bag.

  • archive – The name of the archive to which the collection belongs. This will be added to bag-info.txt.

  • title – The title of the collection. This will be added to bag-info.txt.

Odea will abort when creating a Bag, Item, or File object if there is not a corresponding BagIt bag structure on disk, either in the current working directory or in the path above it. Since all file paths listed in Bag, Item, and File metadata are relative to the bag root, the root must be identifiable at the time any object is initialized.

A valid BagIt structure is currently assumed to exist in a directory containing:

  1. A file named bagit.txt (the contents are not currently verified);

  2. A data subdirectory for payload files.

This function will create either of these elements, as well as a template bag-info.txt file, if they are missing in the supplied directory path. The directory does not need to be empty.
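
The two required elements can be sketched with the standard library as follows. This is an illustration of the minimal structure described above, not odea's own implementation; `make_minimal_bag` is a hypothetical name, and the bag declaration text follows the BagIt specification (RFC 8493).

```python
import os
import tempfile

def make_minimal_bag(path):
    """Create bagit.txt and data/ under path if they are missing."""
    os.makedirs(os.path.join(path, 'data'), exist_ok=True)
    declaration = os.path.join(path, 'bagit.txt')
    if not os.path.exists(declaration):
        with open(declaration, 'w', encoding='utf-8') as handle:
            # Bag declaration as defined by the BagIt specification.
            handle.write('BagIt-Version: 1.0\n')
            handle.write('Tag-File-Character-Encoding: UTF-8\n')
    return path

bag_root = make_minimal_bag(tempfile.mkdtemp())
created = sorted(os.listdir(bag_root))
print(created)
```

Because existing files are left alone, calling the helper on a directory that already holds a bag is harmless.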

odea.load_sample_file(filename)

Load and return a file from the test directory as a File object.

The testing directory within the odea package contains input documents in the various formats supported by odea. These are used in the docstring examples in this module as inputs for sample commands and testing.

This function loads a file and sets the following File properties: File.filename, File.basename, File.ext, File.identifier, File.format, File.size, File.sha256.

odea.test_bag()

Create a sample bag for testing purposes.

This bag is used in the docstring examples throughout this module. The bag is created in a temporary directory, so there should be no risk of side-effects in testing.

>>> import odea
>>> b = odea.test_bag()
>>> b.title
'My test bag'
>>> b.subject
['spam', 'eggs']
>>> b.identifier
'893cddb6-6d94-4af6-be16-5cbfdb5d70e3'
>>> print(b.tree())
./
    bagit.txt
    data/
        deriv/
    file_metadata/
    html/
    item_metadata/

odea.is_root(path)

Identify whether <path> is a bag root, returning True or False.

See also

get_root()

odea.get_root(path)

Get the bag root, relative to the string <path>. Return the root or None.

Parameters

path – A relative or absolute filesystem path; the input path will be resolved against the current directory if it is relative. The path does not need to exist on disk.

The root is resolved equivalent to path or any parent directory thereof that contains both a data subdirectory and a file bagit.txt.
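
The lookup rule can be sketched as a walk from the given path up through its parents. The function below is a hedged approximation of that rule using `pathlib`; `find_bag_root` is a hypothetical name and odea's actual implementation may differ.

```python
import os
import tempfile
from pathlib import Path

def find_bag_root(path):
    """Return the nearest directory at or above path that looks like a bag root."""
    start = Path(os.path.abspath(path))  # resolves relative paths; need not exist
    for candidate in [start, *start.parents]:
        if (candidate / 'bagit.txt').is_file() and (candidate / 'data').is_dir():
            return str(candidate)
    return None

# Demonstration in a temporary directory.
root = os.path.realpath(tempfile.mkdtemp())
os.makedirs(os.path.join(root, 'data'))
open(os.path.join(root, 'bagit.txt'), 'w').close()
found = find_bag_root(os.path.join(root, 'data', 'foo', 'eggs.txt'))
print(found == root)
```

Checking the starting path first and then each parent in turn means the nearest enclosing bag wins when bags are nested.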

>>> import odea, os
>>> b = odea.test_bag()
>>> root = os.getcwd()
>>> odea.get_root(root) == root
True
>>> d2 = os.path.join(root, 'data', 'foo', 'bar', 'baz')
>>> os.makedirs(d2)
>>> os.chdir(d2)
>>> odea.get_root('.') == root
True
>>> odea.get_root('spam/eggs.txt') == root
True

If there is no bag in the path, None will be returned:

>>> odea.get_root('/random/dir/spam.txt') == None
True

If there are multiple nested collections, only the lowest-level directory will be returned:

>>> os.chdir(d2)
>>> odea.new()
>>> odea.get_root('.') == os.getcwd()
True

odea.load_bag()

Look for an existing ‘bag-info.txt’ metadata file and load as a Bag object. If no metadata file exists, create a new Bag object.

odea.load_item(item_uuid)

Look for an existing metadata file matching the item uuid and load as an Item object. If no metadata file exists, create a new Item object.

Parameters

item_uuid – The UUID for an item in the archive.

odea.load_file(filename)

Look for an existing metadata file matching the filepath and load as a File object.

Parameters

filename – The path to a file in the current Bag.

The metadata document is matched against the File.identifier and File.format properties; this function will ignore the filepath if either of these properties is not present in the filename.

If no metadata file exists, a new File object will be created and returned.

>>> import odea, os
>>> b = odea.test_bag()
>>> spam = os.path.join('data', 'spam.txt')
>>> open(spam, 'w').close()
>>> f = odea.load_file(spam)
>>> f 
File(filename='data/spam.txt', ...)

With a filename that doesn’t exist:

>>> f2 = odea.load_file('nonexistent-file.txt')

With a file that already has metadata:

>>> id = '2716fe6a-1fba-4dba-b34e-593450f9b975'
>>> fn = 'data/test.txt'
>>> tag_file = os.path.join(odea.ITEM_METADATA_DIR, '{}.txt'.format(id))
>>> with open(tag_file, 'w') as t:
...     o = t.write('{{"identifier": "{}", "filename": "{}"}}'.format(id, fn))
>>> odea.load_file(fn)
File(filename='data/test.txt', ...)

exception odea.BagError

exception odea.BagValidationError(message, details=None)

Bag

class odea.Bag(archive='odeum', archive_url=None, title=None, identifier=None, creator=None, subject=None, contributor=None, coverage=None, date=None, description=None, language=None, publisher=None, relation=None, rights=None, source=None, preview=None, dcmi_type='Collection', note=None)

An abstract instance of a Bag.

archive

The name of the archive to which this collection belongs.

archive_url

Web address (URL) of the archive responsible for this collection.

title

Dublin Core title metadata element for the Bag. Represents a name given to the resource.

identifier

Identifier for the Bag, represented by default as a version 4 UUID hexadecimal string. This property should normally be set automatically.

creator

Dublin Core creator metadata element for the Bag. Represents an entity primarily responsible for making the resource (i.e., the collection curator).

subject

Dublin Core subject metadata element for the Bag. Represents the topic of the resource (keyword).

contributor

Dublin Core contributor metadata element for the Bag. Represents an entity responsible for making contributions to the resource.

coverage

Dublin Core coverage metadata element for the Bag. Represents the spatial or temporal topic of the resource, the spatial applicability of the resource, or the jurisdiction under which the resource is relevant.

date

Dublin Core date metadata element for the Bag. Represents a point or period of time associated with an event in the lifecycle of the resource. The date should be presented in an ISO 8601 string, but type checking is not enforced.

description

Dublin Core description metadata element for the Bag. Represents an account of the resource that may include an abstract, a table of contents, a graphical representation, or a free-text account of the resource.

language

Dublin Core language metadata element for the Bag. Represents the language of the resource, ideally using RFC 4646 language codes (e.g., “en” for English, “mn” for Mongolian).

publisher

Dublin Core publisher metadata element for the Bag. Represents an entity responsible for making the resource available.

relation

Dublin Core relation metadata element for the Bag. Represents a related resource.

rights

Dublin Core rights metadata element for the Bag. Represents information about rights held in and over the resource. Typically this will be a copyright statement, license name, or link to a document providing terms of use.

source

Dublin Core source metadata element for the Bag. Represents a related resource from which the described resource is derived.

preview

Path to a file within the Bag that provides a preview image representing the bag contents.

dcmi_type

Type of the Bag, represented using the DCMI Type Vocabulary. The only valid type for a Bag is Collection.

note

Annotation

json()

Return a json string representing the Bag.

>>> import odea, json
>>> b = odea.test_bag()
>>> print(json.dumps(json.loads(b.json()), indent=4))
{
    "archive": "odeum",
    "archive_url": "",
    "contributor": null,
    "coverage": null,
    "creator": [],
    "date": null,
    "dcmi_type": "Collection",
    "description": null,
    "identifier": "893cddb6-6d94-4af6-be16-5cbfdb5d70e3",
    "language": null,
    "preview": "path/to/image",
    "publisher": null,
    "relation": null,
    "rights": null,
    "source": null,
    "subject": [
        "spam",
        "eggs"
    ],
    "title": "My test bag"
}

tree(path='.')

Print a directory tree representing the bag contents.

See the documentation for test_bag() for an example.

html()

Return an html Bag description string.

Example
>>> import odea
>>> b = odea.test_bag()
>>> print(b.html()) 
<!DOCTYPE doctype html>
<html lang="en">
 <head>
  <!-- Required meta tags -->
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1, shrink-to-fit=no" name="viewport"/>
  <!-- Bootstrap CSS -->
  <link href="https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css"/>
  <style>
   .card-columns {column-count: 1;}  @media (min-width: 768px) {.card-columns {column-count: 2;}} @media (min-width: 992px) {.card-columns {column-count: 3;}} .card{max-width:360px}
  </style>
  <title>
   My test bag - Digital Archive
  </title>
 </head>
 <body>
  <nav class="navbar navbar-expand-lg navbar-dark bg-primary">
   <div class="container">
    <a class="navbar-brand" href="">
     Digital Archive
    </a>
   </div>
  </nav>
  <div class="container py-4">
   <h1>
    <small class="text-muted text-uppercase">
     collection /
    </small>
    My test bag
   </h1>
   <p>
    <img class="img-thumbnail" src="path/to/image"/>
   </p>
   <table class="table">
    <tr>
     <th>
      title
     </th>
     <td>
      My test bag
     </td>
    </tr>
    <tr>
     <th>
      identifier
     </th>
     <td>
      893cddb6-6d94-4af6-be16-5cbfdb5d70e3
     </td>
    </tr>
    <tr>
     <th>
      subject
     </th>
     <td>
      <ul>
       <li>
        spam
       </li>
       <li>
        eggs
       </li>
      </ul>
     </td>
    </tr>
    <tr>
     <th>
      dcmi_type
     </th>
     <td>
      Collection
     </td>
    </tr>
   </table>
   <div class="card-columns">
   </div>
  </div>
  <footer class="footer mt-5 p-3">
   <div class="container">
    <p class="text-muted">
     rev. ...
    </p>
    <p class="text-muted">
     Except where otherwise noted, content on this site is licensed under a
     <a href="http://creativecommons.org/licenses/by/4.0/" rel="license">
      Creative Commons Attribution 4.0 International License
     </a>
     .
    </p>
   </div>
  </footer>
 </body>
</html>

update_manifest(alg='sha512')

Update the Bag manifest.

Parameters

alg – The algorithm to be used. Defaults to sha512; sha256 can also be used.
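
A BagIt payload manifest lists one checksum and one bag-relative path per line. The sketch below builds such lines for everything under data/; it is a simplified illustration under that assumption, not odea's implementation, and `manifest_lines` is a hypothetical helper that reads each file whole for brevity.

```python
import hashlib
import os
import tempfile

def manifest_lines(bag_root, alg='sha512'):
    """Return '<checksum> <relative path>' lines for files under data/, sorted by path."""
    entries = []
    for folder, _subdirs, names in os.walk(os.path.join(bag_root, 'data')):
        for name in names:
            full = os.path.join(folder, name)
            with open(full, 'rb') as handle:
                checksum = hashlib.new(alg, handle.read()).hexdigest()
            # Manifest paths are relative to the bag root and use forward slashes.
            relative = os.path.relpath(full, bag_root).replace(os.sep, '/')
            entries.append((relative, checksum))
    return ['{} {}'.format(checksum, relative) for relative, checksum in sorted(entries)]

bag_root = tempfile.mkdtemp()
os.makedirs(os.path.join(bag_root, 'data'))
with open(os.path.join(bag_root, 'data', 'spam.txt'), 'w') as out:
    out.write('Spam, eggs, bacon, and spam!')
lines = manifest_lines(bag_root)
print(lines)
```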

Example
>>> import odea, os
>>> b = odea.test_bag()

Any new file should be added to the manifest.

>>> spam = os.path.join('data', 'spam.txt')
>>> with open(spam, 'w') as out:
...     out.write('Spam, eggs, bacon, and spam!')
28
>>> b.update_manifest()
>>> with open('manifest-sha512.txt', 'r') as manifest:
...     manifest.readlines() 
['6aec3c2caf8a5f9984fd1... data/spam.txt']

save()

Save the Bag data structure to disk in plain text format.

This will be in the file bag-info.txt in the root of the Bag.

>>> import odea
>>> b = odea.test_bag()
>>> b.title = "Modified title"
>>> b.creator = ['Author 1', 'Author 2']
>>> b.save()
>>> with open('bag-info.txt', 'r') as t:
...     print(t.read())

items()

Return a list of Item objects for items in the bag.

Items are retrieved from the tag files stored in the ITEM_METADATA_DIR directory.

>>> import odea
>>> b = odea.test_bag()
>>> i = odea.Item(identifier=odea.NIL_UUID, title='test_item')
>>> i.save()
>>> for i in b.items():
...     print(i.title)
test_item

pub_items()

Return a list of Item objects for published items in the bag.

This allows the html index to be updated only with items that are already included in the HTML_DIR directory.

>>> import odea, os
>>> b = odea.test_bag()
>>> i = odea.Item(identifier=odea.NIL_UUID, title='test item')
>>> i.save()
>>> b.pub_items()
[]
>>> h = os.path.join(odea.HTML_DIR, '{}.html'.format(odea.NIL_UUID))
>>> open(h, 'w').close()
>>> print([x.title for x in b.pub_items()])
['test item']

Item

class odea.Item(title=None, identifier=None, creator=None, subject=None, contributor=None, coverage=None, date=None, description=None, language=None, publisher=None, relation=None, rights=None, source=None, dcmi_type=None, embed_url=None, note=None)

identifier

Identifier for the Item, represented by default as a version 4 UUID hexadecimal string. This property should normally be set automatically.

title

Dublin Core title metadata element for the Item. Represents a name given to the resource.

creator

Dublin Core creator metadata element for the Item. Represents an entity primarily responsible for making the resource. This property is a list of strings.

contributor

Dublin Core contributor metadata element for the Item. Represents an entity responsible for making contributions to the resource. This property is a list of strings.

coverage

Dublin Core coverage metadata element for the Item. Represents the spatial or temporal topic of the resource, the spatial applicability of the resource, or the jurisdiction under which the resource is relevant.

date

Dublin Core date metadata element for the Item. Represents a point or period of time associated with an event in the lifecycle of the resource. The date should be presented in an ISO 8601 string, but type checking is not enforced.

description

Dublin Core description metadata element for the Item. Represents an account of the resource that may include an abstract, a table of contents, a graphical representation, or a free-text account of the resource.

language

Dublin Core language metadata element for the Item. Represents the language of the resource, ideally using RFC 4646 language codes (e.g., “en” for English, “mn” for Mongolian).

publisher

Dublin Core publisher metadata element for the Item. Represents an entity responsible for making the resource available.

relation

Dublin Core relation metadata element for the Item. Represents a related resource.

rights

Dublin Core rights metadata element for the Item. Represents information about rights held in and over the resource. Typically this will be a copyright statement, license name, or link to a document providing terms of use.

source

Dublin Core source metadata element for the Item. Represents a related resource from which the described resource is derived.

subject

Dublin Core subject metadata element for the Item. Represents the topic of the resource (keyword). This property is a list of strings.

dcmi_type

Type of the Item, represented using the DCMI Type Vocabulary. Valid types include Event, Image, MovingImage, PhysicalObject, Software, Sound, StillImage, and Text.

note

Annotation.

json()

Return a json string representing the Item.

files()

Return a list of File objects, corresponding to the files on disk associated with the Item. The list is generated by matching the identifier to tagged files.

save()

Save the Item data structure to disk.

>>> import odea, os
>>> b = odea.test_bag()
>>> i = odea.Item('data/example.txt')
>>> i.title = 'Example item'
>>> i.identifier = odea.NIL_UUID
>>> i.save()
>>> t = os.path.join(odea.ITEM_METADATA_DIR, '{}.txt'.format(i.identifier))
>>> with open(t, 'r') as tag_file:
...     print(tag_file.read())

html()

Return an html Item description string.

Example
>>> import odea
>>> b = odea.test_bag()
>>> i = odea.Item(identifier=odea.NIL_UUID, title='Test item')
>>> print(i.html()) 
<!DOCTYPE doctype html>
<html lang="en">
 <head>
  <!-- Required meta tags -->
  <meta charset="utf-8"/>
  <meta content="width=device-width, initial-scale=1, shrink-to-fit=no" name="viewport"/>
  <!-- Bootstrap CSS -->
  <link href="https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css" rel="stylesheet"/>
  <style>
   .card-columns {column-count: 1;}  @media (min-width: 768px) {.card-columns {column-count: 2;}} @media (min-width: 992px) {.card-columns {column-count: 3;}} .card{max-width:360px}
  </style>
  <title>
   Test item - Digital Archive
  </title>
 </head>
 <body>
  <nav class="navbar navbar-expand-lg navbar-dark bg-primary">
   <div class="container">
    <a class="navbar-brand" href="">
     Digital Archive
    </a>
   </div>
  </nav>
  <div class="container py-4">
   <h1>
    <small class="text-muted text-uppercase">
     item /
    </small>
    Test item
   </h1>
   <table class="table">
    <tr>
     <th>
      title
     </th>
     <td>
      Test item
     </td>
    </tr>
    <tr>
     <th>
      identifier
     </th>
     <td>
      0000000-0000-0000-0000-000000000000
     </td>
    </tr>
   </table>
   <h2>
    Files
   </h2>
   <table class="table">
    <tr>
     <th>
      file
     </th>
     <th>
      size
     </th>
     <th>
      date modified
     </th>
    </tr>
   </table>
  </div>
  <footer class="footer mt-5 p-3">
   <div class="container">
    <p class="text-muted">
     rev. ...
    </p>
    <p class="text-muted">
     Except where otherwise noted, content on this site is licensed under a
     <a href="http://creativecommons.org/licenses/by/4.0/" rel="license">
      Creative Commons Attribution 4.0 International License
     </a>
     .
    </p>
   </div>
  </footer>
 </body>
</html>

src()

Return the path to the “SRC” file for the item.

tag_file()

File

class odea.File(filename=None, sha512=None, sha256=None, size=None, mtime=None, identifier=None, basename=None, format=None, ext=None, preview=None, dimensions=None, duration=None, thumb=None)

This is a file on disk.

filename

The filename, including relative directory path from the bag root (e.g., data/subdir/file.ext)

sha512

The sha512 hash of the file (hex string)

sha256

The sha256 hash of the file (hex string)

size

The size of the file, in bytes (integer)

mtime

The modification time of the file (datetime object)

identifier

The unique identifier of the item to which the file belongs, represented as a UUID string. The uuid identifier string is included in the filename as a dot-separated element in the pattern <basename>[.<format>][.<uuid>].<ext>.

basename

The basename or “stem” of the filename. The basename identifier string is included in the filename as a dot-separated element in the pattern <basename>[.<format>][.<uuid>].<ext>.

format

The format of the file. The format identifier string is included in the filename as a dot-separated element in the pattern <basename>[.<format>][.<uuid>].<ext>. When generated by odea, this may be src for a source file, df-<type> for a distribution copy, or pf-<type> for an archival preservation copy. Available formats for derivative files correspond to the shell scripts listed in odea.templates.
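
To make the naming pattern concrete, the following sketch splits a filename into the four parts. It is a simplified approximation that assumes directory names contain no dots and uses a hexadecimal-only UUID pattern; `parse_filename` is a hypothetical helper and may not match odea's own parsing in every edge case.

```python
import re

# Hexadecimal-only UUID pattern (a stricter stand-in for odea.RE_UUID).
UUID_RE = re.compile(
    r'[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}',
    re.IGNORECASE,
)

def parse_filename(filename):
    """Split '<basename>[.<format>][.<uuid>].<ext>' into its parts."""
    match = UUID_RE.search(filename)
    if match:
        # Everything before the UUID is basename plus optional format.
        head = filename[:match.start()].rstrip('.')
        ext = filename[match.end():].lstrip('.')
        identifier = match.group(0)
        basename, _, fmt = head.partition('.')
    else:
        # Without a UUID: first element is the basename, last is the extension.
        identifier = None
        parts = filename.split('.')
        basename = parts[0]
        ext = parts[-1] if len(parts) > 1 else ''
        fmt = '.'.join(parts[1:-1])
    return {'basename': basename, 'format': fmt or None,
            'identifier': identifier, 'ext': ext}

parsed = parse_filename(
    'data/spam.SRC.48342ee3-9080-407e-9862-12ce05143499.txt')
print(parsed)
```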

ext

The filename extension

thumb

Path to a thumbnail image representing the file.

preview

Path to a medium-sized image representing the file.

dimensions

Dimensions of an image or video

duration

Duration of a video or audio file

get_checksum(alg='sha512')

Calculate the hash for a file.

Parameters

alg – Supported algorithms are “sha256” and “sha512”.

get_sha256()

Calculate the sha256 hash of the file.

An empty file returns None:

>>> import odea
>>> b = odea.test_bag()
>>> f = odea.load_sample_file('test_plain-text.txt')

If the file exists, the hash should be returned and written to the sha256 property:

>>> f.get_sha256()
'92b772380a3f8e27a93e57e6deeca6c01da07f5aadce78bb2fbb20de10a66925'
>>> f.sha256
'92b772380a3f8e27a93e57e6deeca6c01da07f5aadce78bb2fbb20de10a66925'

get_sha512()

Calculate the sha512 hash of the file and save to the sha512 property. This value should be persisted to the file manifest-sha512.txt.

>>> import odea
>>> b = odea.test_bag()
>>> f = odea.load_sample_file('test_plain-text.txt')

If the file exists, the hash should be returned and written to the sha512 property:

>>> f.get_sha512() 
'9751ea443fd632e147831566ccb822482220188993cd1269edbe98d2e2d69...'

json()

Return a json string representing the File.

Null values are not included in the output.

>>> import odea, json
>>> b = odea.test_bag()
>>> f = odea.load_sample_file('test_plain-text.txt')
>>> print(json.dumps(json.loads(f.json()), indent=4))
{
    "basename": "data/test_plain-text",
    "ext": "txt",
    "filename": "data/test_plain-text.SRC.0000000-0000-0000-0000-000000000000.txt",
    "format": "SRC",
    "sha256": "92b772380a3f8e27a93e57e6deeca6c01da07f5aadce78bb2fbb20de10a66925",
    "size": 15,
    "uuid": "0000000-0000-0000-0000-000000000000"
}

get_mtime()

Return the mtime of a file on disk and set the mtime property.

If the file does not exist, nothing is returned:

>>> import odea, os
>>> b = odea.test_bag()
>>> spam = os.path.join('data', 'spam.txt')
>>> f = odea.File(spam)
>>> f.get_mtime() == None
True
>>> open(spam, 'a').close()
>>> os.utime(spam,(1330712280, 1330712292))
>>> f.get_mtime() == f.mtime
True
>>> f.mtime
'2012-03-02T12:18:12Z'

get_img_dimensions()

Set and return the dimensions of an image file.

>>> import odea
>>> b = odea.test_bag()
>>> f = odea.load_sample_file('test_img_jpeg.jpg')
>>> f.get_img_dimensions()
'2835x4289'

Nothing will happen if the image dimensions cannot be determined:

>>> f = odea.load_sample_file('test_plain-text.txt')
>>> f.get_img_dimensions() == None
True

get_audio_duration()

Set and return the duration of an audio file.

>>> import odea
>>> b = odea.test_bag()
>>> f = odea.load_sample_file('test_wav_sound.wav')
>>> f.get_audio_duration()
3.0

Nothing will happen if the sound file cannot be read:

>>> f = odea.load_sample_file('test_plain-text.txt')
>>> f.get_audio_duration() == None
True

get_video_duration()

Set and return the duration of a video file.

>>> import odea
>>> b = odea.test_bag()
>>> f = odea.load_sample_file('test_video.mp4')
>>> f.get_video_duration()
3.0

Nothing will happen if the video file cannot be read:

>>> f = odea.load_sample_file('test_plain-text.txt')
>>> f.get_video_duration() == None
True

get_size()

Return the size of a file on disk and set the size property.

>>> import odea
>>> b = odea.test_bag()
>>> f = odea.load_sample_file('test_img_jpeg.jpg')
>>> f.get_size()
3506068
get_uuid()

Retrieve and set the uuid property for a file.

If the uuid property is set, it will be returned. Otherwise, the filename will be scanned for a matching uuid, using the regular expression RE_UUID.

>>> import odea, os
>>> b = odea.test_bag()
>>> id = '48342ee3-9080-407e-9862-12ce05143499'
>>> spam = os.path.join('data', 'spam.{}.txt'.format(id))
>>> open(spam, 'w').close()
>>> f = odea.load_file(spam)
>>> f.identifier == id
True
>>> del f.identifier
>>> f.identifier == id
False
>>> f.get_uuid() == id == f.identifier
True
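
The lookup order described above (stored property first, then a scan of the filename) can be sketched as follows. The names UUID_PATTERN and find_uuid are hypothetical, and the pattern is an explicit hexadecimal form written for this example rather than odea.RE_UUID itself.

```python
import re

# Assumed pattern for illustration: canonical 8-4-4-4-12 hexadecimal groups.
UUID_PATTERN = re.compile(
    r'[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}',
    re.IGNORECASE)


def find_uuid(filename, known=None):
    """Return `known` if it is set, else the first UUID found in the filename."""
    if known:
        return known
    match = UUID_PATTERN.search(filename)
    return match.group(0) if match else None
```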
tag(_uuid=None)

Tag a filename with a UUID and update the File properties.

This operation does not actually rename files on disk.

>>> import odea
>>> b = odea.test_bag()
>>> f = odea.load_sample_file('test_plain-text.txt')
>>> f.tag(_uuid=odea.NIL_UUID)
'data/test_plain-text.SRC.0000000-0000-0000-0000-000000000000.txt'

The tagging operation should correctly set the following attributes: File.filename, File.basename, File.ext, File.identifier, and File.format.

>>> a = 'data/test.file.many.parts.txt'
>>> b = 'data/a.b.0000000-0000-0000-0000-000000000000.c.d.txt'
>>> c = 'data/test-file-no-extension'
>>> d = 'data/test-file.SRC.txt'
>>> f = {}
>>> for fn in [a, b, c, d]:
...     open(fn, 'w').close()
...     f[fn] = odea.File(filename=fn)
...     o = f[fn].tag(_uuid=odea.NIL_UUID)
>>> f[a].filename
'data/test.file.many.parts.txt'
>>> f[a].basename
'data/test'
>>> f[a].ext
'txt'
>>> f[a].identifier
'0000000-0000-0000-0000-000000000000'
>>> f[a].format
'file.many.parts'
>>> f[b].filename
'data/a.b.0000000-0000-0000-0000-000000000000.c.d.txt'
>>> f[b].basename
'data/a'
>>> f[b].ext
'c.d.txt'
>>> f[b].identifier
'0000000-0000-0000-0000-000000000000'
>>> f[b].format
'b'
>>> f[c].filename
'data/test-file-no-extension'
>>> f[c].basename
'data/test-file-no-extension'
>>> f[c].ext
''
>>> f[c].identifier
'0000000-0000-0000-0000-000000000000'
>>> f[c].format
'SRC'
>>> f[d].filename
'data/test-file.SRC.txt'
>>> f[d].basename
'data/test-file'
>>> f[d].ext
'txt'
>>> f[d].identifier
'0000000-0000-0000-0000-000000000000'
>>> f[d].format
'SRC'
slug()

Return a shortened and sanitized form of the file basename.

The slug removes spaces and special characters from the filename, and truncates the basename to 60 characters (this includes the full path from the bag root, “data/path/to/file”, but NOT the filetype, uuid, and extension).
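
A minimal sketch of this kind of sanitizing is shown below. The helper name slugify and the exact set of characters that are kept (letters, digits, "/", "_" and "-") are assumptions made for the example; the toolkit's own rules may differ.

```python
import re


def slugify(basename, max_length=60):
    """Return a sanitized, truncated copy of a bag-relative basename."""
    # Assumption: keep only letters, digits, and the separators / _ -
    cleaned = re.sub(r'[^A-Za-z0-9/_-]', '', basename)
    # Truncate the whole path (including "data/...") to max_length characters.
    return cleaned[:max_length]
```

For example, slugify('data/My File (final)!') returns 'data/MyFilefinal'.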

rename()

Rename a file on disk, based on its metadata properties.

The output format is <basename>[.<format>][.<uuid>].<ext>.

Note that the filename components are not set automatically by the File class on initialization. They can be set by tag() or load_file().

>>> import odea, os
>>> b = odea.test_bag()
>>> fn = os.path.join(odea.DATA_DIR, 'test_plain-text.txt')
>>> f = odea.File(fn)
>>> open(fn, 'w').close()
>>> f.identifier = 'b3050922-520f-426e-9a9c-cfe728bd178d'
>>> f.format = 'sample'
>>> f.basename = 'data/test_plain-text'
>>> f.ext = 'rst'
>>> f.filename
'data/test_plain-text.txt'
>>> f.rename() == f.filename
True
>>> f.filename
'data/test_plain-text.sample.b3050922-520f-426e-9a9c-cfe728bd178d.rst'
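
The output format <basename>[.<format>][.<uuid>].<ext> can be sketched as a simple join in which the optional components are skipped when empty. The function name compose_filename is hypothetical, and omitting an empty extension is an assumption of this example.

```python
def compose_filename(basename, ext, fmt=None, uuid=None):
    """Join filename components as <basename>[.<format>][.<uuid>].<ext>."""
    parts = [basename]
    if fmt:
        parts.append(fmt)   # optional format tag, e.g. 'SRC'
    if uuid:
        parts.append(uuid)  # optional identifier
    if ext:
        parts.append(ext)   # assumption: skip the extension if it is empty
    return '.'.join(parts)
```

With the values from the example above, compose_filename('data/test_plain-text', 'rst', 'sample', 'b3050922-520f-426e-9a9c-cfe728bd178d') returns 'data/test_plain-text.sample.b3050922-520f-426e-9a9c-cfe728bd178d.rst'.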
get_filename_parts()

Populate filename part properties from the filename itself.

This should be used when importing a new file, as it will override any properties that have been set manually. This method is called by tag(), which additionally creates or applies a uuid tag.

See also

tag()
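
The splitting behaviour can be approximated as follows. This is a sketch inferred from the tag() examples in this section, not the toolkit's own code: the names UUID_PATTERN and split_filename are hypothetical, the UUID pattern is an explicit hexadecimal form rather than odea.RE_UUID, and the default format 'SRC' is taken from those examples.

```python
import posixpath
import re

# Assumed pattern for illustration: canonical 8-4-4-4-12 hexadecimal groups.
UUID_PATTERN = re.compile(
    r'[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}',
    re.IGNORECASE)


def split_filename(filename, default_format='SRC'):
    """Split a bag-relative path into (basename, format, uuid, ext)."""
    directory, name = posixpath.split(filename)
    match = UUID_PATTERN.search(name)
    if match:
        # Everything after the UUID is treated as the extension.
        head = name[:match.start()].rstrip('.')
        ext = name[match.end():].lstrip('.')
        uuid = match.group(0)
    else:
        # Without a UUID, the extension is the text after the last dot.
        head, dot, ext = name.rpartition('.')
        if not dot:
            head, ext = name, ''
        uuid = None
    # The basename ends at the first dot; the remainder is the format tag.
    stem, _, fmt = head.partition('.')
    basename = posixpath.join(directory, stem) if directory else stem
    return basename, fmt or default_format, uuid, ext
```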

save()

Save the File data structure to disk.

derive(target, ext, frame=None, overwrite=False, target_dir=None)

Generate a derivative version of a file. Return the full filename of the derived file.

Parameters
  • target – Conversion target. Available targets are produced through shell scripts defined in the variables odea.CMD_<RULE>, which can be overwritten or extended. The built-in targets are listed below.

  • frame – The page, image, or frame number to use from the input resource. For a multi-image or multi-page document input, frame is an integer corresponding to the image or page number (starting with ‘0’ for the first image or page). This works with multi-image TIFF files as well as for PDF documents. For video files, the default stills generation command will take an image from the middle of the file if frame is “0” or not specified, otherwise an image frame will be extracted using the ffmpeg time duration syntax. The following examples are all valid input values for frame in this context: “55” (55 seconds), “12:03:45” (12 hours, 3 minutes, and 45 seconds), or “23.189” (23.189 seconds).

  • ext – The extension of the target filename. For some commands this will change the output format (e.g., images processed by Imagemagick can be generated as PNG, JPEG, or another recognized format).

  • overwrite – If True, overwrite a target file if it already exists. Otherwise simply return the filename.

  • target_dir – The directory for derivative output. Defaults to DERIV_DIR, but this can be overridden (e.g., thumbnail images are located in THUMBS_DIR).
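
To make the frame values for video input concrete, the three example strings quoted above can be converted to seconds as sketched below. The helper name frame_to_seconds is hypothetical, and the sketch covers only plain seconds and colon-separated forms; it does not handle negative values or unit suffixes.

```python
def frame_to_seconds(frame):
    """Convert a time string such as '12:03:45' or '23.189' to seconds."""
    seconds = 0.0
    # Each colon-separated field is worth 60 times the field to its right.
    for field in str(frame).split(':'):
        seconds = seconds * 60 + float(field)
    return seconds
```

Here frame_to_seconds('55') gives 55.0 and frame_to_seconds('12:03:45') gives 43425.0.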

Default targets

PF_WAV

A lossless audio file in WAV format.

Target generated by CMD_PF_WAV.

DF_MP3

A distribution copy of an audio file, in lossy MP3 format.

Target generated by CMD_DF_MP3.

DF_PDF_DOC

A PDF version of a word processor document (text, spreadsheet, etc.).

Target generated by CMD_DF_PDF_DOC.

DF_PDF_HTML

For web documents given using a “url” file, this will be a PDF corresponding to the print version of the resource.

Target generated by CMD_DF_PDF_HTML.

DF_IMG_THUMB

A thumbnail image (currently 360x360).

Target generated by CMD_DF_THUMB_IMG.

DF_IMG_MED

A medium-sized image (currently 800x600).

Target generated by CMD_DF_IMG_MED.

DF_IMG_LG

A large-scale image (1920x1080).

Target generated by CMD_DF_IMG_LG.

PF_FFV1

A lossless video, using the FFV1 codec. (Note that the file sizes are extremely large!)

Target generated by CMD_PF_FFV1.

DF_360P_VP9_400K

A distribution copy of a video, reduced to 360p resolution in webm format using the VP9 codec.

Target generated by CMD_DF_360P_VP9_400K.

DF_H264

A distribution copy of the video optimized for upload to video hosting services such as YouTube, Vimeo, or Internet Archive. With the default command the video is not scaled, but the bitrate may be reduced from the source file and the moov atom will be placed at the beginning of the file in order to enable streaming.

Target generated by CMD_DF_H264.

DF_H264_CONCAT

The same as “DF_H264”, but generated from an ffconcat list. All the clips in the source directory that are referenced by the “df-concat-list” file are assembled into a single derivative. The file “<project-name>.df-h264.<uuid>.mp4” is therefore one video that can include all the footage from the “<project-name>.dir” directory, and subsequent derivatives can be created from the concatenated video.

Target generated by CMD_DF_H264_CONCAT.

PF_WEBARC

The “web archive” is a directory containing a downloaded copy of the resource specified in a URL file, along with any other resources needed in order to display that resource (e.g., embedded images, stylesheets, or fonts). Links to the associated resources are converted to relative hyperlinks in order to make the resource viewable locally, but are otherwise unchanged.

Target generated by CMD_PF_WEBARC.

PF_SCREENSHOT

A full screenshot that captures the entire page of a web resource, as viewed in a web browser. This is a bitmap image, so it can be very large.

Target generated by CMD_PF_SCREENSHOT.

DF_SCREENSHOT_CROPPED

A cropped screenshot of a web resource, showing the visible part of the page without scrolling.

Target generated by CMD_DF_SCREENSHOT_CROPPED.

DF_IMG_STILL

A still frame from a video.

Target generated by CMD_DF_IMG_STILL.

DF_IMG_STILLS

A sequence of still images (thumbnails) from a video.

Target generated by CMD_DF_IMG_STILLS.

Examples

>>> import odea
>>> b = odea.test_bag()
>>> c = odea.load_sample_file('test_corrupt-file.jpg')
>>> c.derive('DF_IMG_MED', 'png')
>>> f1 = odea.load_sample_file('test_wav_sound.wav')
>>> d1 = f1.derive('DF_MP3', 'mp3')
>>> fd1 = odea.File(d1)
>>> fd1.filename
'data/deriv/test_wav_sound.df-mp3.0000000-0000-0000-0000-000000000000.mp3'
>>> f2 = odea.load_sample_file('test_url.urls')
>>> d2a = f2.derive('DF_PDF_HTML', 'pdf')
>>> d2b = f2.derive('PF_WEBARC', 'dir')
>>> d2c = f2.derive('PF_SCREENSHOT', 'png')
>>> print(b.tree())
./
    bagit.txt
    data/
        test_corrupt-file.SRC.0000000-0000-0000-0000-000000000000.jpg
        test_url.SRC.0000000-0000-0000-0000-000000000000.urls
        test_wav_sound.SRC.0000000-0000-0000-0000-000000000000.wav
        deriv/
            test_url.df-pdf-html.0000000-0000-0000-0000-000000000000.pdf
            test_url.pf-screenshot.0000000-0000-0000-0000-000000000000.png
            test_wav_sound.df-mp3.0000000-0000-0000-0000-000000000000.mp3
            test_url.pf-webarc.0000000-0000-0000-0000-000000000000.dir/
                commons.wikimedia.org/
                    w/
                        index.php@title=File%3A1913_Gandan_Monastery_in_Khuree.jpg&oldid=359804693.html
                example.net/
                    index.html
    file_metadata/
    html/
    item_metadata/
>>> f3 = odea.load_sample_file('test_img_jpeg.jpg')
>>> d3 = f3.derive('DF_IMG_MED', 'png')
>>> fd3 = odea.File(d3)
>>> f3.get_img_dimensions()
'2835x4289'
>>> fd3.get_img_dimensions()
'397x600'
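
The derivative filenames in the examples above appear to follow the pattern <name>.<target>.<uuid>.<ext>, with the target name lowercased and its underscores replaced by hyphens. A sketch of that naming, inferred from the example output only, is given below. The function name derived_filename is hypothetical, and dropping the source directory from the name is an assumption.

```python
import posixpath

# Mirrors the documented value of odea.DERIV_DIR.
DERIV_DIR = 'data/deriv'


def derived_filename(basename, target, uuid, ext, target_dir=DERIV_DIR):
    """Build an output path such as data/deriv/<name>.df-mp3.<uuid>.mp3."""
    # Assumption: only the final path component of the basename is kept.
    stem = posixpath.basename(basename)
    # 'DF_MP3' becomes 'df-mp3', matching the tree listing above.
    tag = target.lower().replace('_', '-')
    return posixpath.join(target_dir, '.'.join([stem, tag, uuid, ext]))
```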
thumbs()

Generate thumbnail images for the input filename.

Two thumbnail images are generated and saved to the THUMBS_DIR folder in the Bag root, with 360px and 800px widths.

The thumbnail files are named using the hash of the input filename, so they will remain available even if the source file is moved or renamed. The paths to the generated images are stored in the thumb and preview properties.

>>> import odea
>>> b = odea.test_bag()
>>> f = odea.load_sample_file('test_img_jpeg.jpg')
>>> f.thumbs()
('thumbs/46568651e13bf1416a802075827b67ed-360x256.jpeg', 'thumbs/46568651e13bf1416a802075827b67ed-800x256.jpeg')
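
The naming scheme described above can be sketched as a hash of the filename string followed by a size label. Everything specific here is an assumption made for illustration: the helper name thumb_path is hypothetical, and MD5 is chosen only because it yields a 32-character hexadecimal name like the one in the example output. The hash function odea actually uses is not stated here.

```python
import hashlib
import posixpath

# Mirrors the documented value of odea.THUMBS_DIR.
THUMBS_DIR = 'thumbs'


def thumb_path(filename, label, thumbs_dir=THUMBS_DIR):
    """Return a thumbnail path named after a hash of the source filename."""
    # Assumption: hash the UTF-8 filename string, not the file contents.
    digest = hashlib.md5(filename.encode('utf-8')).hexdigest()
    return posixpath.join(thumbs_dir, '{}-{}.jpeg'.format(digest, label))
```

Because the name depends only on the filename string, regenerating a thumbnail for the same filename yields the same path.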

If the file contents change, it is necessary to remove the existing thumbnail images before generating new ones. The filenames are accessible from the properties thumb and preview.

>>> import os
>>> os.path.isfile(f.thumb)
True
>>> os.remove(f.thumb)
>>> os.path.isfile(f.thumb)
False
>>> del f.thumb
>>> f.thumb == None
True