API Reference¶
Python bindings for the ISCC (ISO 24138:2024) code generation functions. All functions accept typed
parameters and return JSON strings; use json.loads() to parse the result.
import json
from iscc_lib import gen_meta_code_v0
result = json.loads(gen_meta_code_v0("Example Title"))
print(result["iscc"]) # "ISCC:..."
Functions¶
High-performance ISCC (ISO 24138:2024) implementation backed by Rust.
IO_READ_SIZE module-attribute ¶
Buffer size in bytes for streaming file reads (4,194,304 = 4 MB).
META_TRIM_DESCRIPTION module-attribute ¶
Max UTF-8 byte length for description metadata trimming (4096).
META_TRIM_META module-attribute ¶
Max byte length for decoded meta parameter payload (128,000).
META_TRIM_NAME module-attribute ¶
Max UTF-8 byte length for name metadata trimming (128).
TEXT_NGRAM_SIZE module-attribute ¶
Character n-gram width for text content features (13).
MT ¶
Bases: IntEnum
ISCC MainType identifiers.
ST ¶
Bases: IntEnum
ISCC SubType identifiers.
VS ¶
Bases: IntEnum
ISCC Version identifiers.
IsccResult ¶
Bases: dict
ISCC result with both dict-style and attribute-style access.
MetaCodeResult ¶
TextCodeResult ¶
ImageCodeResult ¶
AudioCodeResult ¶
VideoCodeResult ¶
MixedCodeResult ¶
DataCodeResult ¶
InstanceCodeResult ¶
IsccCodeResult ¶
SumCodeResult ¶
DataHasher ¶
Streaming Data-Code generator.
Incrementally processes data with content-defined chunking and MinHash
to produce results identical to gen_data_code_v0.
Create a new DataHasher with optional initial data.
InstanceHasher ¶
Streaming Instance-Code generator.
Incrementally hashes data with BLAKE3 to produce results identical
to gen_instance_code_v0.
Create a new InstanceHasher with optional initial data.
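
The streaming classes are described as producing results identical to their one-shot counterparts. The sketch below illustrates only that property, under stated assumptions: `hashlib.blake2b` from the standard library stands in for BLAKE3, and the class name `StreamingDigestSketch` and its `update`/`finalize` methods are hypothetical, not the documented `InstanceHasher` interface.

```python
import hashlib


class StreamingDigestSketch:
    """Hypothetical incremental hasher; BLAKE2b stands in for BLAKE3."""

    def __init__(self, data: bytes = b"") -> None:
        self._hasher = hashlib.blake2b(digest_size=32)
        if data:
            self.update(data)

    def update(self, data: bytes) -> None:
        # Feed another block of input; internal state carries over between calls.
        self._hasher.update(data)

    def finalize(self) -> bytes:
        return self._hasher.digest()


def one_shot_digest(data: bytes) -> bytes:
    """Hash the whole input at once with the same stand-in algorithm."""
    return hashlib.blake2b(data, digest_size=32).digest()


payload = b"example payload " * 1000
streaming = StreamingDigestSketch(payload[:10])  # optional initial data
for offset in range(10, len(payload), 4096):
    streaming.update(payload[offset:offset + 4096])
streamed_digest = streaming.finalize()
```

However the input is split across calls, the streamed digest should equal the one-shot digest of the concatenated input.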
alg_cdc_chunks ¶
Split data into content-defined chunks using gear rolling hash.
Uses a FastCDC-inspired algorithm to find content-dependent boundaries.
Returns at least one chunk (empty bytes for empty input). When utf32
is true, aligns cut points to 4-byte boundaries.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | bytes | Raw binary data to chunk. | required |
| utf32 | bool | Whether to align cut points to 4-byte boundaries. | required |
| avg_chunk_size | int | Target average chunk size in bytes (default 1024). | 1024 |
Returns:
| Type | Description |
|---|---|
| list[bytes] | List of byte chunks that concatenate to the original data. |
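
To make the idea of content-defined chunking with a gear rolling hash concrete, here is a deliberately simplified sketch. It is not the library's FastCDC-inspired algorithm: the gear table is a hypothetical one derived from SHA-256, the cut rule is a plain bit mask, `avg_chunk_size` is assumed to be a power of two, and the `utf32` alignment is omitted.

```python
import hashlib

# Hypothetical gear table: 256 pseudo-random 32-bit values, one per byte value.
GEAR = [
    int.from_bytes(hashlib.sha256(bytes([i])).digest()[:4], "big")
    for i in range(256)
]


def cdc_chunks_sketch(data: bytes, avg_chunk_size: int = 1024) -> list[bytes]:
    """Split data at positions chosen by a rolling hash of the content."""
    if not data:
        return [b""]  # mirrors the documented "at least one chunk" behavior
    mask = avg_chunk_size - 1  # assumes avg_chunk_size is a power of two
    chunks = []
    start = 0
    rolling = 0
    for index, byte in enumerate(data):
        # Gear hash step: shift out old influence, mix in the current byte.
        rolling = ((rolling << 1) + GEAR[byte]) & 0xFFFFFFFF
        if rolling & mask == 0:
            chunks.append(data[start:index + 1])
            start = index + 1
            rolling = 0
    if start < len(data):
        chunks.append(data[start:])  # trailing remainder
    return chunks
```

Because cut points depend only on nearby bytes, an insertion early in the input tends to shift only the chunks around it rather than every chunk boundary that follows.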
alg_minhash_256 ¶
Compute a 256-bit MinHash digest from 32-bit integer features.
Uses 64 universal hash functions with bit-interleaved compression to produce a 32-byte similarity-preserving digest.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| features | list[int] | List of 32-bit unsigned integer features. | required |
Returns:
| Type | Description |
|---|---|
| bytes | 32-byte MinHash digest. |
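
The following sketch shows one way 64 universal hash functions and bit-interleaved compression can yield a 32-byte digest. The `(a, b)` parameters come from a seeded random generator and are hypothetical, not the standardized constants; the modular hashing form, the choice of the 4 lowest bits per minimum, and raising on empty input are all assumptions of this sketch.

```python
import random

MERSENNE_PRIME = (1 << 61) - 1
MAX_U64 = (1 << 64) - 1
MAX_U32 = (1 << 32) - 1

# Hypothetical parameters for 64 universal hash functions h(x) = (a*x + b) mod p.
_rng = random.Random(0)
PARAMS = [
    (_rng.randrange(1, MERSENNE_PRIME), _rng.randrange(0, MERSENNE_PRIME))
    for _ in range(64)
]


def minhash_256_sketch(features: list[int]) -> bytes:
    if not features:
        raise ValueError("features must not be empty")  # sketch-only choice
    # One minimum per hash function over all features.
    minima = [
        min((((a * f + b) & MAX_U64) % MERSENNE_PRIME) & MAX_U32 for f in features)
        for a, b in PARAMS
    ]
    # Bit-interleaved compression: bit 0 of all 64 minima, then bit 1,
    # and so on for 4 bit positions, giving 64 * 4 = 256 bits.
    value = 0
    for bit_pos in range(4):
        for minimum in minima:
            value = (value << 1) | ((minimum >> bit_pos) & 1)
    return value.to_bytes(32, "big")
```

Since each minimum is taken over the whole feature set, the digest does not depend on the order of the features.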
alg_simhash ¶
Compute a SimHash from a sequence of equal-length hash digests.
Produces a similarity-preserving hash by counting bit frequencies across all input digests. Each output bit is set when its frequency meets or exceeds half the input count. Returns 32 zero bytes for empty input.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| hash_digests | list[bytes] | List of equal-length byte hash digests. | required |
Returns:
| Type | Description |
|---|---|
| bytes | Similarity-preserving hash as bytes (same length as input digests). |
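
The bit-frequency rule described above can be sketched in pure Python as follows. This is an illustration of the stated rule (a bit is set when at least half of the inputs have it set), not the library's Rust implementation; the function name is hypothetical.

```python
def simhash_sketch(hash_digests: list[bytes]) -> bytes:
    if not hash_digests:
        return bytes(32)  # documented result for empty input
    n_bytes = len(hash_digests[0])
    n_bits = n_bytes * 8
    counts = [0] * n_bits
    for digest in hash_digests:
        value = int.from_bytes(digest, "big")
        for position in range(n_bits):
            if (value >> (n_bits - 1 - position)) & 1:
                counts[position] += 1
    result = 0
    for position, count in enumerate(counts):
        # Set the bit when its frequency meets or exceeds half the input count.
        if 2 * count >= len(hash_digests):
            result |= 1 << (n_bits - 1 - position)
    return result.to_bytes(n_bytes, "big")
```

With three one-byte inputs `0xff`, `0x0f`, and `0x00`, the low four bits are set in two of three inputs and the high four bits in only one, so the result is `0x0f`.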
conformance_selftest ¶
Run all conformance tests against vendored test vectors.
Iterates through all 9 gen_*_v0 function sections in the conformance
data, calls each function with the specified inputs, and compares results
against expected output. Returns True if all tests pass.
Returns:
| Type | Description |
|---|---|
| bool | True if all tests pass. |
encode_base64 ¶
Encode bytes as base64url (RFC 4648 §5, no padding).
Returns a URL-safe base64 encoded string without padding characters.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| data | bytes | Raw bytes to encode. | required |
Returns:
| Type | Description |
|---|---|
| str | Base64url encoded string without padding. |
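
For comparison, unpadded base64url encoding can be reproduced with the Python standard library. The function name below is hypothetical; the sketch only shows the encoding described above.

```python
import base64


def encode_base64url_sketch(data: bytes) -> str:
    # URL-safe alphabet ("-" and "_" instead of "+" and "/"), padding removed.
    return base64.urlsafe_b64encode(data).decode("ascii").rstrip("=")
```

For example, the two bytes `0xff 0xfe` encode to `__4`, where standard base64 would give `//4=`.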
encode_component ¶
Encode raw digest components into a base32 ISCC unit string.
Takes integer type identifiers (matching MainType, SubType,
Version enum values) and a raw digest, returns a base32-encoded
ISCC unit string (without "ISCC:" prefix).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| mtype | int | Main type identifier (0–6). | required |
| stype | int | Sub type identifier (0–7). | required |
| version | int | Version identifier (0). | required |
| bit_length | int | Bit length of the code body (multiple of 32). | required |
| digest | bytes | Raw digest bytes (at least | required |
Returns:
| Type | Description |
|---|---|
| str | Base32-encoded ISCC unit string. |
Raises:
| Type | Description |
|---|---|
| ValueError | If enum values are out of range or digest is too short. |
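
As a rough illustration of how header fields and a digest might combine into a base32 unit string, the sketch below packs four 4-bit header fields ahead of the truncated digest. The layout is an assumption of this sketch: it handles only field values 0 to 7, encodes the length field as `bit_length // 32 - 1`, and uses a hypothetical function name. It is not the library's encoder.

```python
import base64


def encode_component_sketch(
    mtype: int, stype: int, version: int, bit_length: int, digest: bytes
) -> str:
    if bit_length <= 0 or bit_length % 32 != 0:
        raise ValueError("bit_length must be a positive multiple of 32")
    length_field = bit_length // 32 - 1  # assumed length encoding
    fields = (mtype, stype, version, length_field)
    if any(not 0 <= field <= 7 for field in fields):
        raise ValueError("header field out of range for this sketch")
    n_bytes = bit_length // 8
    if len(digest) < n_bytes:
        raise ValueError("digest is too short")
    # Two header bytes: one 4-bit nibble per field.
    header = bytes([(mtype << 4) | stype, (version << 4) | length_field])
    body = digest[:n_bytes]
    return base64.b32encode(header + body).decode("ascii").rstrip("=")
```

Under these assumptions, a 64-bit all-zero digest with all type fields set to 0 yields a 16-character string beginning with `AAAQ`.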
iscc_decompose ¶
Decompose a composite ISCC-CODE into individual ISCC-UNITs.
Accepts a normalized ISCC-CODE or concatenated ISCC-UNIT sequence.
The optional "ISCC:" prefix is stripped before decoding. Returns
a list of base32-encoded ISCC-UNIT strings (without prefix).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| iscc_code | str | ISCC-CODE or ISCC-UNIT sequence string. | required |
Returns:
| Type | Description |
|---|---|
| list[str] | List of base32-encoded ISCC-UNIT strings. |
Raises:
| Type | Description |
|---|---|
| ValueError | If the input is not a valid ISCC string. |
json_to_data_url ¶
Convert a JSON string into a data: URL with JCS canonicalization.
Parses the JSON, re-serializes to RFC 8785 (JCS) canonical form,
base64-encodes the result, and wraps it in a data: URL. Uses
application/ld+json media type when the JSON contains an @context
key, otherwise application/json.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| json | str | JSON string to convert. | required |
Returns:
| Type | Description |
|---|---|
| str | Data URL string. |
Raises:
| Type | Description |
|---|---|
| ValueError | If |
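
The conversion steps can be approximated with the standard library as shown below. Two assumptions are worth flagging: `json.dumps` with sorted keys and compact separators only approximates RFC 8785 (it differs for some number formats and key orderings), and the use of standard padded base64 in the URL is a guess. The function name is hypothetical.

```python
import base64
import json


def json_to_data_url_sketch(json_str: str) -> str:
    # json.JSONDecodeError is a ValueError subclass, so invalid input raises ValueError.
    parsed = json.loads(json_str)
    # Approximation of JCS: sorted keys, no insignificant whitespace.
    canonical = json.dumps(
        parsed, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    has_context = isinstance(parsed, dict) and "@context" in parsed
    media_type = "application/ld+json" if has_context else "application/json"
    payload = base64.b64encode(canonical.encode("utf-8")).decode("ascii")
    return f"data:{media_type};base64,{payload}"
```

Canonicalizing first means that two JSON documents differing only in key order or whitespace map to the same data URL.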
sliding_window ¶
Generate sliding window n-grams from a string.
Returns overlapping substrings of width Unicode characters,
advancing by one character at a time. If the input is shorter than
width, returns a single element containing the full input.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| seq | str | Input string to slide over. | required |
| width | int | Width of sliding window (must be >= 2). | required |
Returns:
| Type | Description |
|---|---|
| list[str] | List of window-sized substrings. |
Raises:
| Type | Description |
|---|---|
| ValueError | If |
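
A pure-Python sketch of the windowing behavior described above, assuming that a width below 2 is what triggers the error (the table states the width must be >= 2). The function name is hypothetical.

```python
def sliding_window_sketch(seq: str, width: int) -> list[str]:
    if width < 2:
        raise ValueError("width must be >= 2")
    # At least one window, so short inputs come back as a single element.
    count = max(len(seq) - width + 1, 1)
    return [seq[start:start + width] for start in range(count)]
```

For example, `sliding_window_sketch("hello", 3)` gives `["hel", "ell", "llo"]`, and an input shorter than the width, such as `"ab"` with width 5, gives `["ab"]`.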
soft_hash_video_v0 ¶
Compute a similarity-preserving hash from video frame signatures.
Deduplicates frame signatures, computes column-wise sums across all unique frames, then applies WTA-Hash to produce a digest.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| frame_sigs | Sequence[Sequence[int]] | List of frame signatures, each a list of signed integers. | required |
| bits | int | Bit length of the output hash (default 64). | 64 |
Returns:
| Type | Description |
|---|---|
| bytes | Raw hash bytes of length |
Raises:
| Type | Description |
|---|---|
| ValueError | If |
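
The first two steps of the description, deduplication and column-wise summation, can be sketched as below. The final WTA-Hash step is intentionally left out because its parameters are not given here; the function name is hypothetical.

```python
from collections.abc import Sequence


def unique_frame_column_sums(frame_sigs: Sequence[Sequence[int]]) -> list[int]:
    # Deduplicate while keeping first-seen order (tuples are hashable).
    unique_frames = list(dict.fromkeys(tuple(sig) for sig in frame_sigs))
    # Sum each column across the unique frames.
    return [sum(column) for column in zip(*unique_frames)]
```

Given `[[1, 2], [1, 2], [3, -4]]`, the duplicate frame is counted once and the column sums are `[4, -2]`.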
text_clean ¶
Clean and normalize text for display.
Applies NFKC normalization, removes control characters (except newlines),
normalizes \r\n to \n, collapses consecutive empty lines to at
most one, and strips leading/trailing whitespace.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| text | str | Input text to clean. | required |
Returns:
| Type | Description |
|---|---|
| str | Cleaned text. |
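
One possible reading of the cleaning steps, written with the standard library. Details such as treating whitespace-only lines as empty and dropping lone carriage returns along with other control characters are assumptions of this sketch, and the function name is hypothetical.

```python
import unicodedata


def text_clean_sketch(text: str) -> str:
    text = unicodedata.normalize("NFKC", text).replace("\r\n", "\n")
    # Drop characters in Unicode category C (control and similar), keeping newlines.
    text = "".join(
        ch for ch in text if ch == "\n" or unicodedata.category(ch)[0] != "C"
    )
    lines = []
    previous_empty = False
    for line in text.split("\n"):
        is_empty = not line.strip()
        if is_empty and previous_empty:
            continue  # collapse runs of empty lines to a single one
        lines.append(line)
        previous_empty = is_empty
    return "\n".join(lines).strip()
```

An input like `"  Hello\r\n\r\n\r\n\r\nWorld\x00  "` comes back as `"Hello\n\nWorld"`.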
text_collapse ¶
Normalize and simplify text for similarity hashing.
Applies NFD normalization, lowercasing, removes whitespace and characters in Unicode categories C (control), M (mark), and P (punctuation), then recombines with NFKC normalization.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| text | str | Input text to collapse. | required |
Returns:
| Type | Description |
|---|---|
| str | Collapsed text suitable for similarity hashing. |
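
The normalization pipeline above maps naturally onto `unicodedata`. The sketch follows the steps in the order they are listed; the function name is hypothetical and edge cases may differ from the library.

```python
import unicodedata


def text_collapse_sketch(text: str) -> str:
    # Decompose so that accents become separate combining marks, then lowercase.
    decomposed = unicodedata.normalize("NFD", text).lower()
    kept = "".join(
        ch
        for ch in decomposed
        if not ch.isspace() and unicodedata.category(ch)[0] not in "CMP"
    )
    # Recombine whatever remains into a compatibility-composed form.
    return unicodedata.normalize("NFKC", kept)
```

For example, `"Héllo, World!"` collapses to `"helloworld"`: the accent, comma, space, and exclamation mark are all removed.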
text_remove_newlines ¶
Remove newlines and collapse whitespace to single spaces.
Converts multi-line text into a single normalized line by splitting on whitespace boundaries and joining with a single space.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| text | str | Input text with newlines. | required |
Returns:
| Type | Description |
|---|---|
| str | Single-line text with collapsed whitespace. |
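
The split-and-join approach described above has a direct Python equivalent; the function name is hypothetical.

```python
def text_remove_newlines_sketch(text: str) -> str:
    # str.split() with no arguments splits on any run of whitespace,
    # including newlines and tabs, and discards leading/trailing runs.
    return " ".join(text.split())
```

For example, `"a\n b\t\tc "` becomes `"a b c"`.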
text_trim ¶
Trim text so its UTF-8 encoded size does not exceed nbytes.
Finds the largest valid UTF-8 prefix within nbytes, then strips
leading/trailing whitespace from the result. Multi-byte characters that
would be split are dropped entirely.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| text | str | Input text to trim. | required |
| nbytes | int | Maximum byte length of the result. | required |
Returns:
| Type | Description |
|---|---|
| str | Trimmed text. |
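
Byte-limited trimming that never splits a multi-byte character can be sketched with a slice and a lenient decode; the function name is hypothetical.

```python
def text_trim_sketch(text: str, nbytes: int) -> str:
    truncated = text.encode("utf-8")[:nbytes]
    # errors="ignore" drops an incomplete multi-byte sequence left at the cut.
    return truncated.decode("utf-8", errors="ignore").strip()
```

For example, trimming `"héllo"` to 2 bytes yields `"h"`, because the two-byte `é` would otherwise be cut in half.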
iscc_decode ¶
Decode an ISCC unit string into header components and raw digest.
gen_meta_code_v0 ¶
gen_meta_code_v0(
name: str,
description: str | None = None,
meta: str | dict | None = None,
bits: int = 64,
) -> MetaCodeResult
Generate an ISCC Meta-Code from content metadata.
gen_text_code_v0 ¶
Generate an ISCC Text-Code from plain text content.
gen_image_code_v0 ¶
gen_image_code_v0(
pixels: bytes | bytearray | memoryview | Sequence[int],
bits: int = 64,
) -> ImageCodeResult
Generate an ISCC Image-Code from pixel data.
gen_audio_code_v0 ¶
Generate an ISCC Audio-Code from a Chromaprint feature vector.
gen_video_code_v0 ¶
Generate an ISCC Video-Code from frame signature data.
gen_mixed_code_v0 ¶
Generate an ISCC Mixed-Code from multiple Content-Code strings.
gen_data_code_v0 ¶
gen_data_code_v0(
data: bytes | bytearray | memoryview | BinaryIO,
bits: int = 64,
) -> DataCodeResult
Generate an ISCC Data-Code from raw byte data or a file-like stream.
gen_instance_code_v0 ¶
gen_instance_code_v0(
data: bytes | bytearray | memoryview | BinaryIO,
bits: int = 64,
) -> InstanceCodeResult
Generate an ISCC Instance-Code from raw byte data or a file-like stream.
gen_iscc_code_v0 ¶
Generate a composite ISCC-CODE from individual ISCC unit codes.
gen_sum_code_v0 ¶
gen_sum_code_v0(
path: str | PathLike,
bits: int = 64,
wide: bool = False,
add_units: bool = False,
) -> SumCodeResult
Generate Data-Code + Instance-Code + ISCC-CODE from a file path in a single pass.
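
The single-pass idea can be illustrated by reading a file once and feeding each block to two independent hashers. This is only a sketch of the I/O pattern: SHA-256 and BLAKE2b from the standard library are stand-ins for the Data-Code and Instance-Code computations, `READ_SIZE` mirrors the documented `IO_READ_SIZE` value, and the function name is hypothetical.

```python
import hashlib
import os
import tempfile

READ_SIZE = 4_194_304  # mirrors the documented IO_READ_SIZE (4 MB)


def single_pass_digests(path: str) -> tuple[bytes, bytes]:
    """Feed two independent hashers from one sequential read of the file."""
    hasher_a = hashlib.sha256()                 # stand-in for the Data-Code hasher
    hasher_b = hashlib.blake2b(digest_size=32)  # stand-in for the Instance-Code hasher
    with open(path, "rb") as stream:
        while chunk := stream.read(READ_SIZE):
            hasher_a.update(chunk)
            hasher_b.update(chunk)
    return hasher_a.digest(), hasher_b.digest()


# Usage: write a small temporary file and hash it in one pass.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"hello world")
    tmp_path = tmp.name
try:
    digest_a, digest_b = single_pass_digests(tmp_path)
finally:
    os.remove(tmp_path)
```

Reading each block once and updating both hashers avoids a second trip through the file, which matters most for large inputs.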