Skip to content

Python

A guide to using iscc-lib from Python. Covers installation, code generation, structured results, streaming, text utilities, codec operations, and constants.


Installation

pip install iscc-lib

iscc-lib ships as a pre-built wheel with a compiled Rust extension. No Rust toolchain is required for installation.

Code generation

All 10 gen_*_v0 functions follow the same pattern: pass content-specific input and an optional bits parameter (default 64), and receive a structured result with an .iscc field containing the ISCC code string.

Meta-Code

Generate a Meta-Code from content metadata (title, description, structured metadata):

from iscc_lib import gen_meta_code_v0

result = gen_meta_code_v0(
    name="Die Unendliche Geschichte",
    description="Von Michael Ende",
)
print(result.iscc)  # "ISCC:AAA..."
print(result.name)  # Normalized name
print(result.metahash)  # BLAKE3 hash of metadata

The meta parameter accepts a JSON string or a data: URL with base64-encoded payload. When provided, meta takes precedence over description for the similarity digest:

import json

metadata = json.dumps({"title": "Example", "author": "Author"})
result = gen_meta_code_v0("Example Title", meta=metadata)
print(result.meta)  # data: URL with base64-encoded JSON

Text-Code

Generate a Text-Code from plain text content:

from iscc_lib import gen_text_code_v0

result = gen_text_code_v0("Hello World")
print(result.iscc)  # "ISCC:EAA..."
print(result.characters)  # Number of characters processed

Image-Code

Generate an Image-Code from a 32x32 grayscale thumbnail (1024 bytes):

from iscc_lib import gen_image_code_v0

# Pre-process your image to 32x32 grayscale (e.g., with Pillow)
pixels = bytes([128] * 1024)  # Placeholder: uniform gray
result = gen_image_code_v0(pixels)
print(result.iscc)  # "ISCC:EEA..."

Audio-Code

Generate an Audio-Code from a Chromaprint fingerprint vector (signed integers):

from iscc_lib import gen_audio_code_v0

# Obtain Chromaprint features externally (e.g., with pyacoustid)
fingerprint = [123456, -789012, 345678, 901234]
result = gen_audio_code_v0(fingerprint)
print(result.iscc)  # "ISCC:EIA..."

Video-Code

Generate a Video-Code from MPEG-7 frame signature vectors:

from iscc_lib import gen_video_code_v0

# Each frame signature is a list of 380 integers
frame_sigs = [[0] * 380, [1] * 380]
result = gen_video_code_v0(frame_sigs)
print(result.iscc)  # "ISCC:EMA..."

Mixed-Code

Combine multiple Content-Codes of different types into a Mixed-Code:

from iscc_lib import gen_text_code_v0, gen_image_code_v0, gen_mixed_code_v0

text_result = gen_text_code_v0("Hello World")
image_result = gen_image_code_v0(bytes([128] * 1024))

result = gen_mixed_code_v0([text_result.iscc, image_result.iscc])
print(result.iscc)  # "ISCC:EQA..."
print(result.parts)  # List of input code strings

Data-Code

Generate a Data-Code from raw bytes using content-defined chunking and MinHash:

from iscc_lib import gen_data_code_v0

data = b"Hello World" * 1000
result = gen_data_code_v0(data)
print(result.iscc)  # "ISCC:GAA..."

Data-Code also accepts file-like objects:

with open("document.pdf", "rb") as f:
    result = gen_data_code_v0(f)
print(result.iscc)

Instance-Code

Generate an Instance-Code from raw bytes using BLAKE3 hashing:

from iscc_lib import gen_instance_code_v0

data = b"Hello World"
result = gen_instance_code_v0(data)
print(result.iscc)  # "ISCC:IAA..."
print(result.datahash)  # Multihash of the data
print(result.filesize)  # Size in bytes

Instance-Code also accepts file-like objects:

with open("document.pdf", "rb") as f:
    result = gen_instance_code_v0(f)
print(result.datahash)

ISCC-CODE

Combine individual ISCC unit codes into a composite ISCC-CODE:

from iscc_lib import gen_data_code_v0, gen_instance_code_v0, gen_iscc_code_v0

data = b"Hello World" * 1000
data_result = gen_data_code_v0(data)
instance_result = gen_instance_code_v0(data)

result = gen_iscc_code_v0([data_result.iscc, instance_result.iscc])
print(result.iscc)  # "ISCC:KAA..."

Sum-Code

Generate a composite ISCC-CODE from a file in a single pass:

from pathlib import Path
from iscc_lib import gen_sum_code_v0

Path("example.bin").write_bytes(b"Hello World" * 1000)
result = gen_sum_code_v0("example.bin")
print(result.iscc)  # "ISCC:KAA..."
print(result.datahash)  # Multihash of the data
print(result.filesize)  # Size in bytes

Structured results

Every gen_*_v0 function returns a typed result object that supports both dict-style and attribute-style access:

from iscc_lib import gen_meta_code_v0

result = gen_meta_code_v0("Example Title", description="Example description")

# Attribute access
print(result.iscc)
print(result.name)
print(result.metahash)

# Dict access
print(result["iscc"])
print(result["name"])

# Iterate over keys
for key, value in result.items():
    print(f"{key}: {value}")

# JSON serialization
import json

print(json.dumps(result, indent=2))

Result types and their fields:

Result type Fields
MetaCodeResult iscc, name, metahash, description?, meta?
TextCodeResult iscc, characters
ImageCodeResult iscc
AudioCodeResult iscc
VideoCodeResult iscc
MixedCodeResult iscc, parts
DataCodeResult iscc
InstanceCodeResult iscc, datahash, filesize
IsccCodeResult iscc
SumCodeResult iscc, datahash, filesize

Fields marked with ? are optional and only present when the corresponding input was provided.

Streaming

For large files, use DataHasher and InstanceHasher to process data incrementally without loading everything into memory. Both follow the new() -> update() -> finalize() pattern.

DataHasher

from iscc_lib import DataHasher

hasher = DataHasher()

with open("large_file.bin", "rb") as f:
    while chunk := f.read(65536):
        hasher.update(chunk)

result = hasher.finalize()
print(result.iscc)  # Identical to gen_data_code_v0(entire_file)

You can also pass initial data or a file-like object to the constructor:

from iscc_lib import DataHasher

# From bytes
hasher = DataHasher(b"initial data")
hasher.update(b"more data")
result = hasher.finalize()

# From file
with open("file.bin", "rb") as f:
    hasher = DataHasher(f)
    result = hasher.finalize()

InstanceHasher

from iscc_lib import InstanceHasher

hasher = InstanceHasher()

with open("large_file.bin", "rb") as f:
    while chunk := f.read(65536):
        hasher.update(chunk)

result = hasher.finalize()
print(result.iscc)  # Identical to gen_instance_code_v0(entire_file)
print(result.datahash)  # Multihash of the complete data
print(result.filesize)  # Total bytes processed

Both hashers accept bytes or file-like objects in update(). After calling finalize(), the hasher is consumed and further calls to update() or finalize() raise an error.

Text utilities

iscc-lib provides text normalization functions used internally by the code generation pipeline. These are available for preprocessing your own text inputs.

text_clean

Normalize text for display: applies NFKC normalization, removes control characters (except newlines), normalizes line endings, collapses consecutive empty lines, and strips leading/trailing whitespace.

from iscc_lib import text_clean

cleaned = text_clean("  Hello\r\n\r\n\r\nWorld  ")
print(repr(cleaned))  # 'Hello\n\nWorld'

text_collapse

Simplify text for similarity hashing: lowercases, strips whitespace, punctuation, and diacritics. Used internally by gen_text_code_v0.

from iscc_lib import text_collapse

collapsed = text_collapse("Hello, World!")
print(collapsed)  # 'helloworld'

text_remove_newlines

Remove newlines and collapse whitespace to single spaces:

from iscc_lib import text_remove_newlines

single_line = text_remove_newlines("Hello\nWorld\nFoo")
print(single_line)  # 'Hello World Foo'

text_trim

Trim text so its UTF-8 byte size does not exceed a limit. Multi-byte characters that would be split are dropped entirely:

from iscc_lib import text_trim

trimmed = text_trim("Hello World", 5)
print(trimmed)  # 'Hello'

Codec operations

Functions for encoding, decoding, and decomposing ISCC codes. These operate on the ISCC binary format defined in ISO 24138.

Encode and decode

Construct an ISCC unit from raw header fields and digest, then decode it back:

from iscc_lib import encode_component, iscc_decode, MT, ST, VS

# Encode: maintype=0 (Meta), subtype=0, version=0, 64 bits, 8-byte digest
digest = bytes([0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08])
code = encode_component(0, 0, 0, 64, digest)
print(code)  # ISCC unit string (without "ISCC:" prefix)

# Decode: parse an ISCC unit string back into header components and digest
mt, st, vs, length, raw_digest = iscc_decode(code)
print(f"MainType: {mt}, SubType: {st}, Version: {vs}, Length: {length}")
print(f"Digest: {raw_digest.hex()}")

# Returned enum fields are IntEnum instances
assert isinstance(mt, MT)
assert isinstance(st, ST)
assert isinstance(vs, VS)

iscc_decode returns a tuple[MT, ST, VS, int, bytes] with IntEnum-typed values for the header fields.

Decompose

Split a composite ISCC-CODE into its individual unit codes:

from iscc_lib import (
    gen_data_code_v0,
    gen_instance_code_v0,
    gen_iscc_code_v0,
    iscc_decompose,
)

data = b"Hello World" * 1000
data_result = gen_data_code_v0(data)
instance_result = gen_instance_code_v0(data)
iscc_code = gen_iscc_code_v0([data_result.iscc, instance_result.iscc])

# Decompose into individual units
units = iscc_decompose(iscc_code.iscc)
for unit in units:
    print(unit)  # Each unit code (without "ISCC:" prefix)

Other codec functions

  • encode_base64(data: bytes) -> str — encode bytes to base64
  • json_to_data_url(json: str) -> str — convert a JSON string to a data:application/json;base64,... URL
  • soft_hash_video_v0(frame_sigs, bits=64) -> bytes — compute a video similarity hash from MPEG-7 frame signatures

Constants

Module-level constants used by the ISCC algorithms. These are available as direct imports and also through the core_opts namespace for iscc-core API parity:

from iscc_lib import (
    META_TRIM_NAME,
    META_TRIM_DESCRIPTION,
    IO_READ_SIZE,
    TEXT_NGRAM_SIZE,
    core_opts,
)

META_TRIM_NAME  # 128 — max byte length for name normalization
META_TRIM_DESCRIPTION  # 4096 — max byte length for description normalization
IO_READ_SIZE  # 4_194_304 — default read buffer size (4 MB)
TEXT_NGRAM_SIZE  # 13 — n-gram size for text similarity hashing

# core_opts namespace (iscc-core compatibility)
core_opts.meta_trim_name  # 128
core_opts.meta_trim_description  # 4096
core_opts.io_read_size  # 4_194_304
core_opts.text_ngram_size  # 13

Conformance testing

Verify that the library produces correct results for all official test vectors:

from iscc_lib import conformance_selftest

assert conformance_selftest() is True

Error handling

All gen_*_v0 functions raise ValueError on invalid input:

from iscc_lib import gen_text_code_v0

try:
    gen_text_code_v0("Hello", bits=13)  # bits must be a multiple of 32
except ValueError as e:
    print(f"Invalid input: {e}")