API Reference

Python bindings for the ISCC (ISO 24138:2024) code generation functions. All functions accept typed parameters and return JSON strings; use json.loads() to parse the result.

import json
from iscc_lib import gen_meta_code_v0

result = json.loads(gen_meta_code_v0("Example Title"))
print(result["iscc"])  # "ISCC:..."

Functions

High-performance ISCC (ISO 24138:2024) implementation backed by Rust.

IO_READ_SIZE module-attribute

IO_READ_SIZE: int

Buffer size in bytes for streaming file reads (4,194,304 = 4 MB).

META_TRIM_DESCRIPTION module-attribute

META_TRIM_DESCRIPTION: int

Max UTF-8 byte length for description metadata trimming (4096).

META_TRIM_META module-attribute

META_TRIM_META: int

Max byte length for decoded meta parameter payload (128,000).

META_TRIM_NAME module-attribute

META_TRIM_NAME: int

Max UTF-8 byte length for name metadata trimming (128).

TEXT_NGRAM_SIZE module-attribute

TEXT_NGRAM_SIZE: int

Character n-gram width for text content features (13).

MT

Bases: IntEnum

ISCC MainType identifiers.

ST

Bases: IntEnum

ISCC SubType identifiers.

VS

Bases: IntEnum

ISCC Version identifiers.

IsccResult

Bases: dict

ISCC result with both dict-style and attribute-style access.
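
The combination of dict-style and attribute-style access can be pictured with a small dict subclass. This is only a sketch of the general pattern: the class name AttrDict and the placeholder value are hypothetical, and nothing here is taken from the actual IsccResult implementation.

```python
class AttrDict(dict):
    """Hypothetical stand-in for a dict that also allows attribute access."""

    def __getattr__(self, name):
        # Only called when normal attribute lookup fails,
        # so regular dict methods such as .keys() keep working.
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None


result = AttrDict(iscc="ISCC:EXAMPLE")  # placeholder value, not a real code
print(result["iscc"])  # dict-style access
print(result.iscc)     # attribute-style access
```

Because the object is still a dict, it can be passed to anything that expects a plain mapping, for example json.dumps().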

MetaCodeResult

Bases: IsccResult

Result of gen_meta_code_v0.

TextCodeResult

Bases: IsccResult

Result of gen_text_code_v0.

ImageCodeResult

Bases: IsccResult

Result of gen_image_code_v0.

AudioCodeResult

Bases: IsccResult

Result of gen_audio_code_v0.

VideoCodeResult

Bases: IsccResult

Result of gen_video_code_v0.

MixedCodeResult

Bases: IsccResult

Result of gen_mixed_code_v0.

DataCodeResult

Bases: IsccResult

Result of gen_data_code_v0.

InstanceCodeResult

Bases: IsccResult

Result of gen_instance_code_v0.

IsccCodeResult

Bases: IsccResult

Result of gen_iscc_code_v0.

SumCodeResult

Bases: IsccResult

Result of gen_sum_code_v0.

DataHasher

DataHasher(
    data: bytes
    | bytearray
    | memoryview
    | BinaryIO
    | None = None,
)

Streaming Data-Code generator.

Incrementally processes data with content-defined chunking and MinHash to produce results identical to gen_data_code_v0.

Create a new DataHasher with optional initial data.

update

update(
    data: bytes | bytearray | memoryview | BinaryIO,
) -> None

Push data into the hasher.

finalize

finalize(bits: int = 64) -> DataCodeResult

Consume the hasher and return a Data-Code result.

InstanceHasher

InstanceHasher(
    data: bytes
    | bytearray
    | memoryview
    | BinaryIO
    | None = None,
)

Streaming Instance-Code generator.

Incrementally hashes data with BLAKE3 to produce results identical to gen_instance_code_v0.

Create a new InstanceHasher with optional initial data.

update

update(
    data: bytes | bytearray | memoryview | BinaryIO,
) -> None

Push data into the hasher.

finalize

finalize(bits: int = 64) -> InstanceCodeResult

Consume the hasher and return an Instance-Code result.
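
The update/finalize pattern shared by DataHasher and InstanceHasher can be sketched with the standard library. In this sketch SketchHasher is a hypothetical class, SHA-256 stands in for the real algorithms (BLAKE3 and MinHash are not in the standard library), and READ_SIZE simply mirrors the documented IO_READ_SIZE value. It illustrates why feeding data in pieces gives the same result as a one-shot call.

```python
import hashlib
import io

READ_SIZE = 4_194_304  # mirrors the documented IO_READ_SIZE value (4 MB)


class SketchHasher:
    """Hypothetical streaming hasher; SHA-256 stands in for the real algorithm."""

    def __init__(self, data=None):
        self._hasher = hashlib.sha256()
        if data is not None:
            self.update(data)

    def update(self, data):
        if hasattr(data, "read"):
            # File-like object: read fixed-size chunks until exhausted.
            while chunk := data.read(READ_SIZE):
                self._hasher.update(chunk)
        else:
            # bytes, bytearray and memoryview all support the buffer protocol.
            self._hasher.update(data)

    def finalize(self):
        return self._hasher.hexdigest()


payload = b"example payload " * 1000

one_shot = SketchHasher(payload).finalize()

streamed = SketchHasher()
streamed.update(payload[:5000])              # first part as bytes
streamed.update(io.BytesIO(payload[5000:]))  # remainder as a stream
streamed_digest = streamed.finalize()
```

Here streamed_digest equals one_shot, which is the property the streaming classes document relative to gen_data_code_v0 and gen_instance_code_v0.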

alg_cdc_chunks

alg_cdc_chunks(
    data: bytes, utf32: bool, avg_chunk_size: int = 1024
) -> list[bytes]

Split data into content-defined chunks using gear rolling hash.

Uses a FastCDC-inspired algorithm to find content-dependent boundaries. Returns at least one chunk (empty bytes for empty input). When utf32 is true, aligns cut points to 4-byte boundaries.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| data | bytes | Raw binary data to chunk. | required |
| utf32 | bool | Whether to align cut points to 4-byte boundaries. | required |
| avg_chunk_size | int | Target average chunk size in bytes (default 1024). | 1024 |

Returns:

| Type | Description |
|---|---|
| list[bytes] | List of byte chunks that concatenate to the original data. |
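
A heavily simplified sketch of content-defined chunking with a gear rolling hash is shown below. It is not the library's algorithm: the gear table is generated from a fixed seed, there are no minimum or maximum chunk sizes, and the name cdc_chunks_sketch is hypothetical. It only illustrates how cut points depend on content, how utf32 alignment can be enforced, and why the chunks concatenate back to the input.

```python
import random

_rng = random.Random(0)  # fixed seed so the table is reproducible
GEAR = [_rng.getrandbits(32) for _ in range(256)]  # hypothetical gear table


def cdc_chunks_sketch(data: bytes, utf32: bool, avg_chunk_size: int = 1024) -> list[bytes]:
    """Simplified gear-hash chunking without minimum or maximum chunk sizes."""
    if not data:
        return [b""]  # at least one chunk, even for empty input
    # With a mask of k low bits, a cut fires roughly once every 2**k bytes.
    mask = (1 << max(avg_chunk_size.bit_length() - 1, 1)) - 1
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF  # rolling gear hash
        cut = i + 1
        if h & mask == 0 and (not utf32 or cut % 4 == 0):
            chunks.append(data[start:cut])
            start, h = cut, 0
    if start < len(data):
        chunks.append(data[start:])  # trailing remainder
    return chunks


sample = bytes(random.Random(1).getrandbits(8) for _ in range(20_000))
plain = cdc_chunks_sketch(sample, utf32=False)
aligned = cdc_chunks_sketch(sample, utf32=True)
```

Both plain and aligned join back to sample, and every chunk in aligned except possibly the last has a length divisible by 4.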

alg_minhash_256

alg_minhash_256(features: list[int]) -> bytes

Compute a 256-bit MinHash digest from 32-bit integer features.

Uses 64 universal hash functions with bit-interleaved compression to produce a 32-byte similarity-preserving digest.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| features | list[int] | List of 32-bit unsigned integer features. | required |

Returns:

| Type | Description |
|---|---|
| bytes | 32-byte MinHash digest. |
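
The idea behind a 64-function MinHash with bit-interleaved compression can be sketched as follows. The coefficients, the modulus, and the choice of keeping the 4 lowest bits of each minimum are assumptions made for illustration; this reference does not list the constants the library uses, so the output will not match alg_minhash_256. The name minhash_256_sketch is hypothetical.

```python
import random

MPRIME = (1 << 61) - 1  # Mersenne prime used as the hash modulus in this sketch
MAXH = (1 << 32) - 1    # keep hash values within 32 bits

_rng = random.Random(42)
# Hypothetical coefficients (a, b) for 64 universal hash functions.
COEFFS = [(_rng.randrange(1, MPRIME), _rng.randrange(0, MPRIME)) for _ in range(64)]


def minhash_256_sketch(features: list[int]) -> bytes:
    # One minimum per hash function; raises ValueError for an empty feature list.
    mins = [min(((a * f + b) % MPRIME) & MAXH for f in features) for a, b in COEFFS]
    # Interleave: bit 0 of every minimum, then bit 1, up to bit 3 (4 * 64 = 256 bits).
    bits = "".join(str((m >> pos) & 1) for pos in range(4) for m in mins)
    return int(bits, 2).to_bytes(32, "big")


digest = minhash_256_sketch([1, 2, 3])
```

Because each hash function keeps only its minimum, the digest does not depend on the order of the features or on duplicates among them.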

alg_simhash

alg_simhash(hash_digests: list[bytes]) -> bytes

Compute a SimHash from a sequence of equal-length hash digests.

Produces a similarity-preserving hash by counting bit frequencies across all input digests. Each output bit is set when its frequency meets or exceeds half the input count. Returns 32 zero bytes for empty input.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| hash_digests | list[bytes] | List of equal-length byte hash digests. | required |

Returns:

| Type | Description |
|---|---|
| bytes | Similarity-preserving hash as bytes (same length as input digests). |
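
The bit-frequency rule described above can be written out in pure Python. This is a sketch of the documented behavior, not the library's code; simhash_sketch is a hypothetical name, and big-endian bit order is an assumption that does not affect the result as long as it is used consistently.

```python
def simhash_sketch(hash_digests: list[bytes]) -> bytes:
    if not hash_digests:
        return bytes(32)  # documented result for empty input
    n_bytes = len(hash_digests[0])
    n_bits = n_bytes * 8
    counts = [0] * n_bits
    for digest in hash_digests:
        value = int.from_bytes(digest, "big")
        for i in range(n_bits):
            # Index 0 is the most significant bit.
            if (value >> (n_bits - 1 - i)) & 1:
                counts[i] += 1
    out = 0
    for i, count in enumerate(counts):
        # Set the bit when it occurs in at least half of the digests.
        if count * 2 >= len(hash_digests):
            out |= 1 << (n_bits - 1 - i)
    return out.to_bytes(n_bytes, "big")


majority = simhash_sketch([b"\xff", b"\x0f", b"\x00"])  # low nibble set in 2 of 3
tie = simhash_sketch([b"\xf0", b"\x0f"])                # every bit set in 1 of 2
```

In this example majority is b"\x0f", and tie is b"\xff" because a frequency of exactly half still sets the bit.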

conformance_selftest

conformance_selftest() -> bool

Run all conformance tests against vendored test vectors.

Iterates through all 9 gen_*_v0 function sections in the conformance data, calls each function with the specified inputs, and compares results against expected output. Returns True if all tests pass.

Returns:

| Type | Description |
|---|---|
| bool | True if all conformance tests pass, False otherwise. |

encode_base64

encode_base64(data: bytes) -> str

Encode bytes as base64url (RFC 4648 §5, no padding).

Returns a URL-safe base64 encoded string without padding characters.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| data | bytes | Raw bytes to encode. | required |

Returns:

| Type | Description |
|---|---|
| str | Base64url encoded string without padding. |
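
The same encoding can be reproduced with the standard library, which helps when checking values by hand. The function name encode_base64_sketch is hypothetical; it is assumed, not verified, to match encode_base64 for all inputs.

```python
import base64


def encode_base64_sketch(data: bytes) -> str:
    # URL-safe alphabet ("-" and "_" instead of "+" and "/"), padding removed.
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


print(encode_base64_sketch(b"hello"))     # aGVsbG8
print(encode_base64_sketch(b"\xfb\xff"))  # -_8
```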

encode_component

encode_component(
    mtype: int,
    stype: int,
    version: int,
    bit_length: int,
    digest: bytes,
) -> str

Encode raw digest components into a base32 ISCC unit string.

Takes integer type identifiers (matching MainType, SubType, Version enum values) and a raw digest, returns a base32-encoded ISCC unit string (without "ISCC:" prefix).

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| mtype | int | Main type identifier (0–6). | required |
| stype | int | Sub type identifier (0–7). | required |
| version | int | Version identifier (0). | required |
| bit_length | int | Bit length of the code body (multiple of 32). | required |
| digest | bytes | Raw digest bytes (at least bit_length // 8 bytes). | required |

Returns:

| Type | Description |
|---|---|
| str | Base32-encoded ISCC unit string. |

Raises:

| Type | Description |
|---|---|
| ValueError | If enum values are out of range or digest is too short. |

iscc_decompose

iscc_decompose(iscc_code: str) -> list[str]

Decompose a composite ISCC-CODE into individual ISCC-UNITs.

Accepts a normalized ISCC-CODE or concatenated ISCC-UNIT sequence. The optional "ISCC:" prefix is stripped before decoding. Returns a list of base32-encoded ISCC-UNIT strings (without prefix).

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| iscc_code | str | ISCC-CODE or ISCC-UNIT sequence string. | required |

Returns:

| Type | Description |
|---|---|
| list[str] | List of base32-encoded ISCC-UNIT strings. |

Raises:

| Type | Description |
|---|---|
| ValueError | If the input is not a valid ISCC string. |

json_to_data_url

json_to_data_url(json: str) -> str

Convert a JSON string into a data: URL with JCS canonicalization.

Parses the JSON, re-serializes to RFC 8785 (JCS) canonical form, base64-encodes the result, and wraps it in a data: URL. Uses application/ld+json media type when the JSON contains an @context key, otherwise application/json.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| json | str | JSON string to convert. | required |

Returns:

| Type | Description |
|---|---|
| str | Data URL string. |

Raises:

| Type | Description |
|---|---|
| ValueError | If json is not valid JSON. |
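
The parse, canonicalize, encode, and wrap steps can be approximated with the standard library. This sketch makes several assumptions: json.dumps with sorted keys and compact separators only approximates RFC 8785 (number formatting and key ordering for non-ASCII keys can differ), and the exact URL layout (standard base64 alphabet with padding, ";base64," marker) is a guess. The name json_to_data_url_sketch is hypothetical.

```python
import base64
import json


def json_to_data_url_sketch(json_str: str) -> str:
    # json.JSONDecodeError is a subclass of ValueError.
    obj = json.loads(json_str)
    # Approximation of JCS: sorted keys, no insignificant whitespace, raw UTF-8.
    canonical = json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    is_ld = isinstance(obj, dict) and "@context" in obj
    media_type = "application/ld+json" if is_ld else "application/json"
    payload = base64.b64encode(canonical.encode("utf-8")).decode("ascii")
    return f"data:{media_type};base64,{payload}"


plain_url = json_to_data_url_sketch('{"b": 1, "a": 2}')
ld_url = json_to_data_url_sketch('{"@context": "https://schema.org", "name": "x"}')
```

Decoding the payload of plain_url yields {"a":2,"b":1}, so two inputs that differ only in key order or whitespace map to the same URL.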

sliding_window

sliding_window(seq: str, width: int) -> list[str]

Generate sliding window n-grams from a string.

Returns overlapping substrings of width Unicode characters, advancing by one character at a time. If the input is shorter than width, returns a single element containing the full input.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| seq | str | Input string to slide over. | required |
| width | int | Width of sliding window (must be >= 2). | required |

Returns:

| Type | Description |
|---|---|
| list[str] | List of window-sized substrings. |

Raises:

| Type | Description |
|---|---|
| ValueError | If width is less than 2. |
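
The documented behavior is small enough to restate in pure Python, which makes the edge cases concrete. The name sliding_window_sketch is hypothetical, and the sketch follows the description above rather than the library's source.

```python
def sliding_window_sketch(seq: str, width: int) -> list[str]:
    if width < 2:
        raise ValueError("width must be >= 2")
    # max(..., 1) guarantees one window even when the input is shorter than width.
    count = max(len(seq) - width + 1, 1)
    return [seq[i : i + width] for i in range(count)]


print(sliding_window_sketch("hello", 3))  # ['hel', 'ell', 'llo']
print(sliding_window_sketch("ab", 3))     # ['ab']
```

Python strings are indexed by Unicode code point, so each window here counts characters rather than bytes.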

soft_hash_video_v0

soft_hash_video_v0(
    frame_sigs: Sequence[Sequence[int]], bits: int = 64
) -> bytes

Compute a similarity-preserving hash from video frame signatures.

Deduplicates frame signatures, computes column-wise sums across all unique frames, then applies WTA-Hash to produce a digest.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| frame_sigs | Sequence[Sequence[int]] | List of frame signatures, each a list of signed integers. | required |
| bits | int | Bit length of the output hash (default 64). | 64 |

Returns:

| Type | Description |
|---|---|
| bytes | Raw hash bytes of length bits / 8. |

Raises:

| Type | Description |
|---|---|
| ValueError | If frame_sigs is empty. |
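
The first two steps, deduplication and column-wise summation, can be sketched on their own. The final WTA-Hash step is deliberately left out because its permutation parameters are not given in this reference. The name unique_column_sums is hypothetical.

```python
from collections.abc import Sequence


def unique_column_sums(frame_sigs: Sequence[Sequence[int]]) -> list[int]:
    if not frame_sigs:
        raise ValueError("frame_sigs must not be empty")
    # Tuples are hashable, so a set drops duplicate frames.
    unique = {tuple(sig) for sig in frame_sigs}
    # zip(*unique) groups values by position; sum each position across frames.
    return [sum(column) for column in zip(*unique)]


print(unique_column_sums([[1, 2, 3], [1, 2, 3], [4, -5, 6]]))  # [5, -3, 9]
```

The repeated frame is counted once, so long static shots do not dominate the sums.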

text_clean

text_clean(text: str) -> str

Clean and normalize text for display.

Applies NFKC normalization, removes control characters (except newlines), normalizes \r\n to \n, collapses consecutive empty lines to at most one, and strips leading/trailing whitespace.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| text | str | Input text to clean. | required |

Returns:

| Type | Description |
|---|---|
| str | Cleaned text. |
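
The cleaning steps can be approximated with unicodedata. This sketch follows the description above but makes assumptions about details the description leaves open: it treats every Unicode category starting with "C" as a control character and treats only "\n" as a newline. The name text_clean_sketch is hypothetical, and its output may differ from text_clean on unusual input.

```python
import unicodedata


def text_clean_sketch(text: str) -> str:
    text = unicodedata.normalize("NFKC", text)
    text = text.replace("\r\n", "\n")
    # Drop control characters (assumed: any "C*" category) but keep line feeds.
    text = "".join(
        ch for ch in text
        if ch == "\n" or not unicodedata.category(ch).startswith("C")
    )
    lines, previous_blank = [], False
    for line in text.split("\n"):
        blank = not line.strip()
        if blank and previous_blank:
            continue  # collapse runs of empty lines to a single one
        lines.append(line)
        previous_blank = blank
    return "\n".join(lines).strip()


cleaned = text_clean_sketch("  Hello\r\n\r\n\r\n\r\nWorld\x00  ")
```

Here cleaned is "Hello\n\nWorld": the NUL byte is removed, the three empty lines collapse to one, and the outer whitespace is stripped.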

text_collapse

text_collapse(text: str) -> str

Normalize and simplify text for similarity hashing.

Applies NFD normalization, lowercasing, removes whitespace and characters in Unicode categories C (control), M (mark), and P (punctuation), then recombines with NFKC normalization.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| text | str | Input text to collapse. | required |

Returns:

| Type | Description |
|---|---|
| str | Collapsed text suitable for similarity hashing. |
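
A sketch of the collapse steps using unicodedata shows why accents, case, spacing, and punctuation stop mattering for similarity hashing. The name text_collapse_sketch is hypothetical, and the exact ordering of the filtering steps is an assumption based on the description above.

```python
import unicodedata


def text_collapse_sketch(text: str) -> str:
    # NFD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text).lower()
    kept = "".join(
        ch for ch in decomposed
        if not ch.isspace() and unicodedata.category(ch)[0] not in "CMP"
    )
    return unicodedata.normalize("NFKC", kept)


print(text_collapse_sketch("Héllo, Wörld!\n"))  # helloworld
```

Decomposing first is what lets the mark filter strip the accents: "é" becomes "e" plus a combining acute accent (category Mn), and only the mark is removed.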

text_remove_newlines

text_remove_newlines(text: str) -> str

Remove newlines and collapse whitespace to single spaces.

Converts multi-line text into a single normalized line by splitting on whitespace boundaries and joining with a single space.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| text | str | Input text with newlines. | required |

Returns:

| Type | Description |
|---|---|
| str | Single-line text with collapsed whitespace. |
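
The split-and-join approach described above corresponds to a common Python idiom. The name text_remove_newlines_sketch is hypothetical; the sketch is assumed, not verified, to agree with text_remove_newlines.

```python
def text_remove_newlines_sketch(text: str) -> str:
    # str.split() with no argument splits on any run of whitespace
    # and discards empty strings at both ends.
    return " ".join(text.split())


print(text_remove_newlines_sketch("a\n b\t\tc \r\n d"))  # a b c d
```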

text_trim

text_trim(text: str, nbytes: int) -> str

Trim text so its UTF-8 encoded size does not exceed nbytes.

Finds the largest valid UTF-8 prefix within nbytes, then strips leading/trailing whitespace from the result. Multi-byte characters that would be split are dropped entirely.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| text | str | Input text to trim. | required |
| nbytes | int | Maximum byte length of the result. | required |

Returns:

| Type | Description |
|---|---|
| str | Trimmed text. |
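
Byte-limited trimming is easy to get wrong when a multi-byte character straddles the limit, so a short sketch of the documented behavior may help. The name text_trim_sketch is hypothetical and the code follows the description above, not the library's source.

```python
def text_trim_sketch(text: str, nbytes: int) -> str:
    prefix = text.encode("utf-8")[:nbytes]
    # errors="ignore" drops a multi-byte character cut in half at the end.
    return prefix.decode("utf-8", errors="ignore").strip()


print(text_trim_sketch("héllo", 2))   # h   ("é" needs 2 bytes and does not fit)
print(text_trim_sketch("héllo", 3))   # hé
print(text_trim_sketch("ab  cd", 3))  # ab  (trailing space stripped)
```

Because whitespace is stripped after cutting, the result can be shorter than nbytes even for pure ASCII input.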

iscc_decode

iscc_decode(iscc: str) -> tuple[MT, ST, VS, int, bytes]

Decode an ISCC unit string into header components and raw digest.

gen_meta_code_v0

gen_meta_code_v0(
    name: str,
    description: str | None = None,
    meta: str | dict | None = None,
    bits: int = 64,
) -> MetaCodeResult

Generate an ISCC Meta-Code from content metadata.

gen_text_code_v0

gen_text_code_v0(
    text: str, bits: int = 64
) -> TextCodeResult

Generate an ISCC Text-Code from plain text content.

gen_image_code_v0

gen_image_code_v0(
    pixels: bytes | bytearray | memoryview | Sequence[int],
    bits: int = 64,
) -> ImageCodeResult

Generate an ISCC Image-Code from pixel data.

gen_audio_code_v0

gen_audio_code_v0(
    cv: list[int], bits: int = 64
) -> AudioCodeResult

Generate an ISCC Audio-Code from a Chromaprint feature vector.

gen_video_code_v0

gen_video_code_v0(
    frame_sigs: Sequence[Sequence[int]], bits: int = 64
) -> VideoCodeResult

Generate an ISCC Video-Code from frame signature data.

gen_mixed_code_v0

gen_mixed_code_v0(
    codes: list[str], bits: int = 64
) -> MixedCodeResult

Generate an ISCC Mixed-Code from multiple Content-Code strings.

gen_data_code_v0

gen_data_code_v0(
    data: bytes | bytearray | memoryview | BinaryIO,
    bits: int = 64,
) -> DataCodeResult

Generate an ISCC Data-Code from raw byte data or a file-like stream.

gen_instance_code_v0

gen_instance_code_v0(
    data: bytes | bytearray | memoryview | BinaryIO,
    bits: int = 64,
) -> InstanceCodeResult

Generate an ISCC Instance-Code from raw byte data or a file-like stream.

gen_iscc_code_v0

gen_iscc_code_v0(
    codes: list[str], wide: bool = False
) -> IsccCodeResult

Generate a composite ISCC-CODE from individual ISCC unit codes.

gen_sum_code_v0

gen_sum_code_v0(
    path: str | PathLike,
    bits: int = 64,
    wide: bool = False,
    add_units: bool = False,
) -> SumCodeResult

Generate Data-Code + Instance-Code + ISCC-CODE from a file path in a single pass.