# `Kreuzberg.ExtractionConfig`
[🔗](https://github.com/kreuzberg-dev/kreuzberg/blob/main/lib/kreuzberg/config.ex#L1)

Configuration structure for document extraction operations.

Provides options for controlling extraction behavior including caching, quality processing,
OCR, chunking, language detection, and post-processing. This module defines the configuration
schema and provides validation utilities to ensure configurations are valid before passing
them to the Rust extraction engine.

## Configuration Fields

### Boolean Flags (Top-level)

  * `:use_cache` - Enable result caching (default: true)
  * `:enable_quality_processing` - Enable quality post-processing (default: true)
  * `:force_ocr` - Force OCR even for searchable PDFs (default: false)
  * `:disable_ocr` - Disable OCR entirely — image files return empty content (default: false)

### Output Format Flags

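  * `:output_format` - Content text format (default: "plain"); one of "plain", "markdown", "djot", "html". The validation error message shown in the examples also lists "text" and "md" as accepted values.
  * `:result_format` - Result structure format (default: "unified"); one of "unified", "element_based"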

### Nested Configuration Maps (Optional)

  * `:chunking` - Text chunking configuration with options like chunk_size, overlap, etc.
  * `:ocr` - OCR backend configuration with settings for language, PSM mode, etc.
    - Can include nested `:paddle_ocr_config` for PaddleOCR-specific settings
    - Can include nested `:element_config` for OCR element extraction settings
  * `:language_detection` - Language detection settings for multi-language content
  * `:postprocessor` - Post-processor configuration for cleaning/formatting extracted text
  * `:images` - Image extraction configuration (quality, format, preprocessing options)
  * `:pages` - Page-level extraction configuration (which pages to extract, etc.)
  * `:token_reduction` - Token reduction settings for optimizing output size
  * `:keywords` - Keyword extraction configuration
  * `:pdf_options` - PDF-specific options (requires pdf feature to be enabled)
  * `:html_options` - HTML to Markdown conversion options
  * `:layout` - Layout detection configuration (confidence_threshold, apply_heuristics, table_model)
  * `:acceleration` - GPU acceleration configuration (provider, device_id)
  * `:security_limits` - Security limits for archive extraction (max sizes, compression ratio, etc.)
  * `:email` - Email extraction configuration (msg_fallback_codepage)
  * `:content_filter` - Content filter configuration (include_headers, include_footers, strip_repeating_text, include_watermarks)

### Scalar Options (Optional)

  * `:max_concurrent_extractions` - Maximum concurrent extractions in batch operations (positive integer or nil)
  * `:cache_namespace` - Cache namespace for tenant isolation (string or nil)
  * `:cache_ttl_secs` - Per-request cache TTL in seconds (non-negative integer or nil)
  * `:extraction_timeout_secs` - Per-request extraction timeout in seconds; when exceeded, extraction is cancelled (non-negative integer or nil)

## Default Values

The top-level boolean flags default as follows:
- `use_cache`: true - Caching is enabled by default for better performance
- `enable_quality_processing`: true - Quality processing is enabled by default for better extraction results
- `force_ocr`: false - OCR is only used when necessary (searchable PDFs bypass OCR)
- `disable_ocr`: false - OCR is available when needed

Format defaults:
- `output_format`: "plain" - Raw extracted text (no formatting)
- `result_format`: "unified" - All content in unified content field

All nested configurations default to nil, allowing the Rust implementation to apply
its own defaults.

## Field Validation

The `validate/1` function ensures:
- Boolean fields are actually booleans
- Format fields are valid enum values
- Nested configurations are maps or nil
- No invalid field names are used

## Examples

    # Create config with chunking enabled
    iex> config = %Kreuzberg.ExtractionConfig{
    ...>   chunking: %{"enabled" => true, "chunk_size" => 1024},
    ...>   use_cache: true
    ...> }
    iex> Kreuzberg.ExtractionConfig.validate(config)
    {:ok, config}

    # Create config with markdown output format
    iex> config = %Kreuzberg.ExtractionConfig{
    ...>   output_format: "markdown",
    ...>   result_format: "unified"
    ...> }
    iex> Kreuzberg.ExtractionConfig.validate(config)
    {:ok, config}

    # Create config that forces OCR with element-based result format
    iex> config = %Kreuzberg.ExtractionConfig{
    ...>   force_ocr: true,
    ...>   result_format: "element_based",
    ...>   enable_quality_processing: true
    ...> }
    iex> Kreuzberg.ExtractionConfig.validate(config)
    {:ok, config}

    # Validate invalid configuration (non-boolean field)
    iex> config = %Kreuzberg.ExtractionConfig{use_cache: "yes"}
    iex> Kreuzberg.ExtractionConfig.validate(config)
    {:error, "Field 'use_cache' must be a boolean, got: string"}

    # Validate invalid format
    iex> config = %Kreuzberg.ExtractionConfig{output_format: "invalid"}
    iex> Kreuzberg.ExtractionConfig.validate(config)
    {:error, "Field 'output_format' must be one of: plain, text, markdown, md, djot, html, got: invalid"}

    # Convert to map for NIF
    iex> config = %Kreuzberg.ExtractionConfig{chunking: %{"size" => 512}}
    iex> Kreuzberg.ExtractionConfig.to_map(config)
    %{
      "acceleration" => nil,
      "cache_namespace" => nil,
      "cache_ttl_secs" => nil,
      "chunking" => %{"size" => 512},
      "concurrency" => nil,
      "content_filter" => nil,
      "email" => nil,
      "disable_ocr" => false,
      "enable_quality_processing" => true,
      "extraction_timeout_secs" => nil,
      "force_ocr" => false,
      "force_ocr_pages" => nil,
      "html_options" => nil,
      "images" => nil,
      "include_document_structure" => false,
      "keywords" => nil,
      "language_detection" => nil,
      "layout" => nil,
      "max_archive_depth" => 3,
      "max_concurrent_extractions" => nil,
      "ocr" => nil,
      "output_format" => "plain",
      "pages" => nil,
      "pdf_options" => nil,
      "postprocessor" => nil,
      "result_format" => "unified",
      "security_limits" => nil,
      "token_reduction" => nil,
      "tree_sitter" => nil,
      "use_cache" => true
    }

# `config_map`

```elixir
@type config_map() :: %{required(String.t()) => any()}
```

# `layout_config`

```elixir
@type layout_config() ::
  %{
    confidence_threshold: float() | nil,
    apply_heuristics: boolean(),
    table_model: String.t() | nil
  }
  | nil
```

# `nested_config`

```elixir
@type nested_config() :: config_map() | nil
```

# `output_format`

```elixir
@type output_format() :: String.t()
```

# `result_format`

```elixir
@type result_format() :: String.t()
```

# `t`

```elixir
@type t() :: %Kreuzberg.ExtractionConfig{
  acceleration: nested_config(),
  cache_namespace: String.t() | nil,
  cache_ttl_secs: non_neg_integer() | nil,
  chunking: nested_config(),
  concurrency: nested_config(),
  content_filter: nested_config(),
  disable_ocr: boolean(),
  email: nested_config(),
  enable_quality_processing: boolean(),
  extraction_timeout_secs: non_neg_integer() | nil,
  force_ocr: boolean(),
  force_ocr_pages: [non_neg_integer()] | nil,
  html_options: config_map() | nil,
  html_output: nested_config(),
  images: nested_config(),
  include_document_structure: boolean(),
  keywords: nested_config(),
  language_detection: nested_config(),
  layout: layout_config(),
  max_archive_depth: non_neg_integer(),
  max_concurrent_extractions: non_neg_integer() | nil,
  ocr: nested_config(),
  output_format: output_format(),
  pages: nested_config(),
  pdf_options: nested_config(),
  postprocessor: nested_config(),
  result_format: result_format(),
  security_limits: nested_config(),
  token_reduction: nested_config(),
  tree_sitter: Kreuzberg.TreeSitterConfig.t() | nested_config(),
  use_cache: boolean()
}
```

# `discover`

```elixir
@spec discover() :: {:ok, t()} | {:error, :not_found | String.t()}
```

Discover and load an ExtractionConfig by searching directories.

Searches the current working directory and all parent directories for
a configuration file in the following order:
1. `kreuzberg.toml`
2. `kreuzberg.yaml`
3. `kreuzberg.yml`
4. `kreuzberg.json`

Loads and returns the configuration from the first file found.

## Returns

  * `{:ok, config}` - Successfully discovered and loaded configuration
  * `{:error, :not_found}` - No configuration file found in directory tree
  * `{:error, reason}` - Error loading or parsing the configuration file

## Examples

    # When no config file exists
    iex> Kreuzberg.ExtractionConfig.discover()
    {:error, :not_found}
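
A common way to use the three return shapes above is to fall back to the built-in defaults when no file is found, while still surfacing parse errors. The following is a minimal sketch, assuming the `kreuzberg` package is available in the project; the `MyApp.Config` module and its `load/0` and `resolve/1` functions are hypothetical names used only for illustration.

```elixir
defmodule MyApp.Config do
  # Hypothetical helper module; not part of Kreuzberg.
  alias Kreuzberg.ExtractionConfig

  @doc "Returns the discovered config, or the defaults when no file exists."
  def load do
    ExtractionConfig.discover() |> resolve()
  end

  # Split out so that each outcome of discover/0 is handled in one place.
  def resolve({:ok, %ExtractionConfig{} = config}), do: {:ok, config}
  # No file anywhere in the directory tree: use the documented defaults.
  def resolve({:error, :not_found}), do: {:ok, ExtractionConfig.new()}
  # A file was found but could not be loaded or parsed: report it.
  def resolve({:error, reason}), do: {:error, reason}
end
```

Treating `:not_found` differently from other errors keeps a malformed configuration file from being silently replaced by defaults.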

# `from_file`

```elixir
@spec from_file(String.t() | Path.t()) :: {:ok, t()} | {:error, String.t()}
```

Load an ExtractionConfig from a file.

Supports TOML, YAML, and JSON configuration file formats.
The file format is automatically detected based on the file extension
or file contents.

## Parameters

  * `file_path` - Path to the configuration file (String or Path.t())

## Returns

  * `{:ok, config}` - Successfully loaded configuration as a struct
  * `{:error, reason}` - Failed to load or parse the configuration file

## Supported Formats

  * `.toml` - TOML format (e.g., `kreuzberg.toml`)
  * `.yaml`, `.yml` - YAML format (e.g., `kreuzberg.yaml`)
  * `.json` - JSON format (e.g., `kreuzberg.json`)

## Examples

Loading from a TOML file:

    Kreuzberg.ExtractionConfig.from_file("kreuzberg.toml")
    # => {:ok, %Kreuzberg.ExtractionConfig{...}}

Loading from a YAML file:

    Kreuzberg.ExtractionConfig.from_file("/etc/config/extraction.yaml")
    # => {:ok, %Kreuzberg.ExtractionConfig{...}}

Handling missing files:

    Kreuzberg.ExtractionConfig.from_file("/nonexistent/file.toml")
    # => {:error, "File not found: ..."}
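
Loading and validating can be chained so that both failure modes end up in a single `{:error, reason}` branch. This is a sketch under assumptions: the path is hypothetical, and whether `from_file/1` already validates internally is not stated here, so the explicit `validate/1` call may be redundant.

```elixir
alias Kreuzberg.ExtractionConfig

# Hypothetical path, used only for illustration.
path = "config/kreuzberg.toml"

result =
  with {:ok, config} <- ExtractionConfig.from_file(path),
       {:ok, config} <- ExtractionConfig.validate(config) do
    {:ok, config}
  end

# Both from_file/1 and validate/1 document their error reason as a string.
case result do
  {:ok, config} -> IO.puts("Loaded config, output_format=#{config.output_format}")
  {:error, reason} -> IO.puts("Could not load #{path}: #{reason}")
end
```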

# `new`

```elixir
@spec new() :: t()
```

Creates a new ExtractionConfig with all default values.

## Examples

    iex> config = Kreuzberg.ExtractionConfig.new()
    iex> config.use_cache
    true

# `new`

```elixir
@spec new(keyword() | map()) :: t()
```

Creates a new ExtractionConfig from keyword list or map.

## Parameters

  * `opts` - Keyword list or map (supports string keys from JSON)

## Examples

    iex> config = Kreuzberg.ExtractionConfig.new(use_cache: false)
    iex> config.use_cache
    false

    iex> config = Kreuzberg.ExtractionConfig.new(%{"output_format" => "markdown"})
    iex> config.output_format
    "markdown"
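
The two input shapes can be compared side by side. The sketch below assumes that `new/1` treats every struct field the same way as the `use_cache` and `output_format` fields shown above; that assumption is not confirmed by this page.

```elixir
alias Kreuzberg.ExtractionConfig

# Keyword form, as it might appear in application code.
from_keywords = ExtractionConfig.new(output_format: "markdown", force_ocr: true)

# String-keyed map form, as produced by decoding JSON.
from_json = ExtractionConfig.new(%{"output_format" => "markdown", "force_ocr" => true})

# Under the assumption above, both forms should describe the same settings.
IO.inspect(from_keywords == from_json, label: "equal?")
```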

# `to_map`

```elixir
@spec to_map(t() | map() | nil | list()) :: map() | nil
```

Converts an ExtractionConfig struct to a map for NIF serialization.

Returns a map containing every configuration field, both top-level flags and
nested configurations. Fields set to nil are serialized as nil rather than
omitted, so the map is a complete representation of the struct.

## Parameters

  * `config` - An `ExtractionConfig` struct, a plain map, nil, or a keyword list

## Returns

A map with string keys representing the configuration options, or `nil` when
given `nil`. For a struct, all fields are included; nil values leave the choice
of default to the Rust side.

## Field Descriptions

  * `"chunking"` - Text chunking configuration (map or nil)
  * `"ocr"` - OCR backend configuration (map or nil)
  * `"language_detection"` - Language detection settings (map or nil)
  * `"postprocessor"` - Post-processor configuration (map or nil)
  * `"images"` - Image extraction configuration (map or nil)
  * `"pages"` - Page-level extraction configuration (map or nil)
  * `"token_reduction"` - Token reduction settings (map or nil)
  * `"keywords"` - Keyword extraction configuration (map or nil)
  * `"pdf_options"` - PDF-specific options (map or nil)
  * `"max_concurrent_extractions"` - Maximum concurrent extractions (positive integer or nil)
  * `"html_options"` - HTML to Markdown conversion options (map or nil)
  * `"layout"` - Layout detection configuration (map or nil)
  * `"include_document_structure"` - Include document structure in extraction (boolean)
  * `"use_cache"` - Enable caching (boolean)
  * `"enable_quality_processing"` - Enable quality processing (boolean)
  * `"force_ocr"` - Force OCR usage (boolean)
  * `"output_format"` - Content text format (string: "plain", "markdown", "djot", "html")
  * `"result_format"` - Result structure format (string: "unified", "element_based")

## Examples

    iex> config = %Kreuzberg.ExtractionConfig{chunking: %{"size" => 512}, output_format: "markdown"}
    iex> Kreuzberg.ExtractionConfig.to_map(config)
    %{
      "acceleration" => nil,
      "cache_namespace" => nil,
      "cache_ttl_secs" => nil,
      "chunking" => %{"size" => 512},
      "concurrency" => nil,
      "content_filter" => nil,
      "email" => nil,
      "disable_ocr" => false,
      "enable_quality_processing" => true,
      "extraction_timeout_secs" => nil,
      "force_ocr" => false,
      "force_ocr_pages" => nil,
      "html_options" => nil,
      "images" => nil,
      "include_document_structure" => false,
      "keywords" => nil,
      "language_detection" => nil,
      "layout" => nil,
      "max_archive_depth" => 3,
      "max_concurrent_extractions" => nil,
      "ocr" => nil,
      "output_format" => "markdown",
      "pages" => nil,
      "pdf_options" => nil,
      "postprocessor" => nil,
      "result_format" => "unified",
      "security_limits" => nil,
      "token_reduction" => nil,
      "tree_sitter" => nil,
      "use_cache" => true
    }

    iex> config = %Kreuzberg.ExtractionConfig{}
    iex> Kreuzberg.ExtractionConfig.to_map(config)
    %{
      "acceleration" => nil,
      "cache_namespace" => nil,
      "cache_ttl_secs" => nil,
      "chunking" => nil,
      "concurrency" => nil,
      "content_filter" => nil,
      "email" => nil,
      "disable_ocr" => false,
      "enable_quality_processing" => true,
      "extraction_timeout_secs" => nil,
      "force_ocr" => false,
      "force_ocr_pages" => nil,
      "html_options" => nil,
      "images" => nil,
      "include_document_structure" => false,
      "keywords" => nil,
      "language_detection" => nil,
      "layout" => nil,
      "max_archive_depth" => 3,
      "max_concurrent_extractions" => nil,
      "ocr" => nil,
      "output_format" => "plain",
      "pages" => nil,
      "pdf_options" => nil,
      "postprocessor" => nil,
      "result_format" => "unified",
      "security_limits" => nil,
      "token_reduction" => nil,
      "tree_sitter" => nil,
      "use_cache" => true
    }

    iex> Kreuzberg.ExtractionConfig.to_map(nil)
    nil

    iex> Kreuzberg.ExtractionConfig.to_map(%{"use_cache" => false, "output_format" => "markdown"})
    %{"use_cache" => false, "output_format" => "markdown"}
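
Because the serialized map carries a `nil` entry for every unset field, it can be noisy in logs. A minimal sketch for inspecting only the non-nil settings follows; the `compact` variable is a hypothetical name, and the filtered map is intended for debugging output only, not as a replacement for the full map.

```elixir
alias Kreuzberg.ExtractionConfig

# Drop nil entries so that only explicitly set or defaulted values remain.
compact =
  ExtractionConfig.new()
  |> ExtractionConfig.to_map()
  |> Enum.reject(fn {_key, value} -> is_nil(value) end)
  |> Map.new()

IO.inspect(compact, label: "non-nil settings")
```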

# `validate`

```elixir
@spec validate(t()) :: {:ok, t()} | {:error, String.t()}
```

Validates an ExtractionConfig for correct field types and values.

Ensures that:
- Boolean fields (use_cache, enable_quality_processing, force_ocr) are actually booleans
- Format fields (output_format, result_format) are valid enum values
- Nested configuration fields are maps or nil
- All values are valid according to the configuration schema

This function is useful for early validation before passing configuration
to the extraction functions.

## Parameters

  * `config` - An `ExtractionConfig` struct to validate

## Returns

  * `{:ok, config}` - If the configuration is valid
  * `{:error, reason}` - If validation fails, with a descriptive error message

## Examples

    iex> config = %Kreuzberg.ExtractionConfig{use_cache: true}
    iex> Kreuzberg.ExtractionConfig.validate(config)
    {:ok, config}

    iex> config = %Kreuzberg.ExtractionConfig{output_format: "markdown", result_format: "unified"}
    iex> Kreuzberg.ExtractionConfig.validate(config)
    {:ok, config}

    iex> config = %Kreuzberg.ExtractionConfig{chunking: %{"size" => 1024}}
    iex> Kreuzberg.ExtractionConfig.validate(config)
    {:ok, config}

    iex> config = %Kreuzberg.ExtractionConfig{use_cache: "yes"}
    iex> Kreuzberg.ExtractionConfig.validate(config)
    {:error, "Field 'use_cache' must be a boolean, got: string"}

    iex> config = %Kreuzberg.ExtractionConfig{output_format: "invalid"}
    iex> Kreuzberg.ExtractionConfig.validate(config)
    {:error, "Field 'output_format' must be one of: plain, text, markdown, md, djot, html, got: invalid"}

    iex> config = %Kreuzberg.ExtractionConfig{chunking: "invalid"}
    iex> Kreuzberg.ExtractionConfig.validate(config)
    {:error, "Field 'chunking' must be a map or nil, got: string"}

    iex> config = %Kreuzberg.ExtractionConfig{force_ocr: true, enable_quality_processing: true}
    iex> Kreuzberg.ExtractionConfig.validate(config)
    {:ok, config}
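
Early validation fits naturally into a small constructor that builds and checks a config in one step. This is a sketch rather than a recommended API: `MyApp.ExtractionSetup` and `build/1` are hypothetical names, and it assumes `new/1` passes option values through unchanged so that `validate/1` is the step that rejects bad input.

```elixir
defmodule MyApp.ExtractionSetup do
  # Hypothetical helper module; not part of Kreuzberg.
  alias Kreuzberg.ExtractionConfig

  @doc "Builds a config from a keyword list or map and validates it."
  def build(opts) do
    opts
    |> ExtractionConfig.new()
    |> ExtractionConfig.validate()
  end
end

case MyApp.ExtractionSetup.build(output_format: "markdown") do
  {:ok, config} -> IO.puts("Valid config with output_format=#{config.output_format}")
  {:error, reason} -> IO.puts("Invalid config: #{reason}")
end
```

Returning the tagged tuple unchanged lets callers compose `build/1` in a `with` chain alongside their own checks.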

---

*Consult [api-reference.md](api-reference.md) for complete listing*
