SqlStorageClient

SQL implementation of the storage client.

This storage client provides access to datasets, key-value stores, and request queues that persist data to a SQL database using SQLAlchemy 2+. Each storage type uses two tables: one for metadata and one for records.

The client accepts either a database connection string or a pre-configured AsyncEngine. If neither is provided, it creates a default SQLite database named 'crawlee.db' in the storage directory.

The database schema is created automatically during initialization. SQLite databases receive performance optimizations, including WAL mode and an increased cache size.
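
To make the SQLite optimizations above concrete, here is a minimal sketch of enabling WAL mode and a larger page cache using only the standard-library sqlite3 module. It is not the client's actual code: the helper name open_tuned_sqlite and the cache size value are assumptions chosen for illustration.

```python
import sqlite3
import tempfile
from pathlib import Path


def open_tuned_sqlite(db_path: Path) -> sqlite3.Connection:
    """Open a SQLite file and apply WAL mode plus a larger cache (illustrative values)."""
    conn = sqlite3.connect(db_path)
    # Write-ahead logging lets readers keep working while a writer is active.
    conn.execute("PRAGMA journal_mode=WAL")
    # A negative cache_size is interpreted as KiB; -65536 is about 64 MiB (assumed value).
    conn.execute("PRAGMA cache_size=-65536")
    return conn


with tempfile.TemporaryDirectory() as tmp:
    conn = open_tuned_sqlite(Path(tmp) / "crawlee.db")
    # Read the settings back to confirm they took effect on this connection.
    journal_mode = conn.execute("PRAGMA journal_mode").fetchone()[0]
    cache_size = conn.execute("PRAGMA cache_size").fetchone()[0]
    conn.close()
```

WAL mode is stored in the database file, while cache_size applies per connection, so an engine-based setup would typically need to set the latter on every new connection.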

Warning

This is an experimental feature. The behavior and interface may change in future versions.

Methods

__aenter__

__aexit__

  • async __aexit__(exc_type, exc_value, exc_traceback): None
  • Async context manager exit.


    Parameters

    • exc_type: type[BaseException] | None
    • exc_value: BaseException | None
    • exc_traceback: TracebackType | None

    Returns None

__init__

  • __init__(*, connection_string, engine): None
  • Initialize the SQL storage client.


    Parameters

    • optional keyword-only connection_string: str | None = None

      Database connection string (e.g., "sqlite+aiosqlite:///crawlee.db"). If not provided, defaults to a SQLite database in the storage directory.

    • optional keyword-only engine: AsyncEngine | None = None

      Pre-configured AsyncEngine instance. If provided, connection_string is ignored.

    Returns None
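
The two constructor options can be sketched as follows. This is a hedged example, not an official recipe: sqlite_url is a hypothetical helper, the import path crawlee.storage_clients and the use of `async with ... as client` are assumptions based on how the class is commonly used, and running main() requires Crawlee with its SQL extras plus SQLAlchemy and aiosqlite.

```python
from pathlib import Path


def sqlite_url(db_path: Path) -> str:
    """Hypothetical helper: build an async SQLite URL in the form shown above."""
    return f"sqlite+aiosqlite:///{db_path.as_posix()}"


async def main() -> None:
    # Imported lazily so the helper above stays usable without the SQL extras.
    # The import path is an assumption; adjust it to your installed version.
    from crawlee.storage_clients import SqlStorageClient
    from sqlalchemy.ext.asyncio import create_async_engine

    url = sqlite_url(Path("crawlee.db"))

    # Option 1: let the client build its own engine from a connection string.
    async with SqlStorageClient(connection_string=url) as client:
        # The dialect name may be None until the client has been initialized.
        print(client.get_dialect_name())

    # Option 2: pass a pre-configured AsyncEngine; connection_string is then ignored.
    engine = create_async_engine(url)
    async with SqlStorageClient(engine=engine) as client:
        print(client.get_dialect_name())


# To try it: import asyncio, then call asyncio.run(main()).
```

Passing an engine is the more flexible route when pool size, echo logging, or other SQLAlchemy settings need to be controlled by the caller.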

close

  • async close(): None
  • Close the database connection pool.


    Returns None

create_dataset_client

  • async create_dataset_client(*, id, name, alias, configuration): DatasetClient

create_kvs_client

create_rq_client

create_session

  • create_session(): AsyncSession
  • Create a new database session.


    Returns AsyncSession

    A new AsyncSession instance.

get_accessed_modified_update_interval

  • get_accessed_modified_update_interval(): timedelta
  • Get the interval for accessed and modified updates.


    Returns timedelta

get_dialect_name

  • get_dialect_name(): str | None
  • Get the database dialect name.


    Returns str | None

get_rate_limit_errors

  • get_rate_limit_errors(): dict[int, int]

get_storage_client_cache_key

  • get_storage_client_cache_key(configuration): Hashable
  • Return a cache key that can differentiate between different storages of this and other clients.

    The key can be based on the configuration or on the client itself. By default, it returns the module and name of the client class.


    Parameters

    Returns Hashable
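
The idea behind the cache key can be sketched with plain classes. Both classes below are hypothetical stand-ins, not Crawlee types, and the exact format of the real default key is an assumption; the sketch only shows a default key built from the class's module and name, and an override that also distinguishes by database.

```python
from collections.abc import Hashable


class DefaultKeyClientSketch:
    """Hypothetical stand-in for a storage client with the default key."""

    def get_storage_client_cache_key(self, configuration: object) -> Hashable:
        # Default behavior described above: module plus class name.
        cls = type(self)
        return f"{cls.__module__}.{cls.__name__}"


class PerDatabaseClientSketch(DefaultKeyClientSketch):
    """Hypothetical override: clients pointing at different databases get different keys."""

    def __init__(self, connection_string: str) -> None:
        self._connection_string = connection_string

    def get_storage_client_cache_key(self, configuration: object) -> Hashable:
        # A tuple is hashable and keeps the class identity alongside the database.
        return (super().get_storage_client_cache_key(configuration), self._connection_string)


key_a = PerDatabaseClientSketch("sqlite+aiosqlite:///a.db").get_storage_client_cache_key(None)
key_b = PerDatabaseClientSketch("sqlite+aiosqlite:///b.db").get_storage_client_cache_key(None)
```

With a key like this, two clients of the same class that target different databases would not share cached storages.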

initialize

  • async initialize(configuration): None
  • Initialize the database schema.

    This method creates all necessary tables if they don't already exist. It should be called before using the storage client.


    Parameters

    Returns None
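
The "create tables if they don't exist" behavior, combined with the one-metadata-table and one-records-table layout per storage type, can be illustrated with the standard-library sqlite3 module. The table and column names below are hypothetical and are not the client's real schema; the point is only that such initialization is safe to run more than once.

```python
import sqlite3

# Hypothetical schema: one metadata table and one records table for datasets.
SCHEMA = (
    """
    CREATE TABLE IF NOT EXISTS dataset_metadata (
        id TEXT PRIMARY KEY,
        name TEXT,
        item_count INTEGER NOT NULL DEFAULT 0
    )
    """,
    """
    CREATE TABLE IF NOT EXISTS dataset_records (
        item_id INTEGER PRIMARY KEY,
        dataset_id TEXT NOT NULL REFERENCES dataset_metadata(id),
        data TEXT NOT NULL
    )
    """,
)


def initialize_schema(conn: sqlite3.Connection) -> None:
    """Create the tables if missing; calling this again is a no-op."""
    with conn:  # commits on success, rolls back on error
        for statement in SCHEMA:
            conn.execute(statement)


conn = sqlite3.connect(":memory:")
initialize_schema(conn)
initialize_schema(conn)  # second call must not fail or duplicate tables
tables = sorted(
    row[0]
    for row in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
)
conn.close()
```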

Properties

engine

engine: AsyncEngine

Get the SQLAlchemy AsyncEngine instance.