
Building Argus: How We Created a Regulatory Knowledge Base for EU Financial Law


EU financial regulation is one of the most complex legal domains in the world. It is also one of the worst-served by existing technology. We built Argus — the regulatory knowledge base behind financialregulations.eu — to change that.

The Problem: Regulation Resists Simple Search

Financial regulation appears to be a text-search problem. It is not. Three characteristics make it fundamentally different.

It is unstructured and heterogeneous. EU regulation comes in at least six distinct document types: Regulations (directly applicable, like MiCAR), Directives (requiring national transposition, like AIFMD II), Delegated Regulations, Implementing Technical Standards, Regulatory Technical Standards, and supervisory guidance (guidelines, Q&As, opinions). Each has different legal weight, different structural conventions, and different update cycles. A Regulation has Articles grouped into Titles and Chapters. A Q&A document has numbered questions with prose answers. An ESMA guideline has numbered guidelines with compliance tables. Treating these uniformly as "documents" loses the legal structure that makes them meaningful.

It is multilingual with legal precision. EU legislation is authentic in all 24 official languages. A Dutch AFM guidance document may quote the English-language ESMA guideline it implements, which references the French-drafted Commission Delegated Regulation. A system that cannot perform cross-lingual retrieval is inherently limited.

It is densely cross-referential. A single MiCAR article might reference MiFID II, the Prospectus Regulation, the Market Abuse Regulation, and three ESMA guidelines in one paragraph. Understanding any provision often requires reading 3-5 other instruments. Keyword search fails because the relevant context lives in different documents under different terminology.

Architecture: BGE-M3, Qdrant, SQLite

BGE-M3 Embeddings

We evaluated seven embedding models against a benchmark of 200 regulatory queries. BGE-M3 (BAAI General Embedding — Multi-Functionality, Multi-Linguality, Multi-Granularity) won for three reasons.

  • Multilingual capability across 100+ languages with a single model — essential when regulatory text mixes languages within and across documents.
  • Long-context support of 8,192 tokens per input, versus 512 for many earlier models — critical because a single DORA article can exceed 500 words.
  • Multi-granularity retrieval supporting dense, sparse, and multi-vector (ColBERT-style) modes. We use a hybrid: dense vectors for semantic similarity, learned sparse lexical weights for precise legal terminology matching. A query for "Article 30 DORA contractual requirements" gets exact matching on the citation and semantic matching on the concept simultaneously.
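To make the hybrid scoring concrete, here is a minimal stdlib sketch. The vectors, token weights, and the fusion weight `alpha` are all illustrative stand-ins, not the production ranking:

```python
import math

def dense_score(q_vec, d_vec):
    """Cosine similarity between dense embedding vectors."""
    dot = sum(a * b for a, b in zip(q_vec, d_vec))
    norm = math.sqrt(sum(a * a for a in q_vec)) * math.sqrt(sum(b * b for b in d_vec))
    return dot / norm if norm else 0.0

def sparse_score(q_weights, d_weights):
    """Dot product over shared tokens of learned lexical weights,
    so an exact citation like 'Article 30' matches precisely."""
    return sum(w * d_weights[t] for t, w in q_weights.items() if t in d_weights)

def hybrid_score(q, d, alpha=0.6):
    """Linear fusion: alpha on semantic similarity, the rest on lexical match."""
    return alpha * dense_score(q["dense"], d["dense"]) + \
           (1 - alpha) * sparse_score(q["sparse"], d["sparse"])

query = {"dense": [0.1, 0.9], "sparse": {"article": 0.8, "30": 0.9, "dora": 0.7}}
chunk = {"dense": [0.2, 0.8], "sparse": {"article": 0.6, "30": 0.7, "ict": 0.5}}
print(round(hybrid_score(query, chunk), 3))
```

The citation tokens contribute through the sparse term even when the dense similarity is middling, which is exactly the behaviour the hybrid mode buys us.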

We self-host the model to avoid API dependency.

Qdrant Vector Search

Qdrant won our evaluation over Pinecone, Weaviate, Milvus, and ChromaDB on five criteria: sub-50ms query latency at our scale (10,000+ vectors), rich payload filtering (filtering by regulation, article number, temporal applicability, and hierarchy level before vector similarity search), payload storage alongside vectors (storing chunk metadata without a separate lookup), operational simplicity (single-binary deployment with persistent storage), and open-source self-hosting economics.

Qdrant's filtering engine is particularly strong for our use case. A query about MiCAR should not retrieve MiFID II chunks unless cross-references are explicitly requested — and Qdrant's payload-based pre-filtering enforces this cleanly, dramatically improving precision over post-hoc filtering approaches.

SQLite for Metadata

Publication dates, applicability dates, amendment histories, source URLs, quality scores — all in SQLite. Zero-configuration operation, file-based portability, ACID transactions, and excellent read performance for a metadata store that is written infrequently and read constantly. We considered PostgreSQL but could not justify the operational overhead for thousands (not millions) of records. The metadata schema tracks source authority, document type, publication and application dates, regulatory hierarchy level, language, ingestion timestamps, quality scores, and processing status.
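A sketch of what such a metadata table might look like — the column names here are hypothetical (the post lists the fields, not the schema), and `:memory:` stands in for the production database file:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # a file path in production
conn.execute("""
    CREATE TABLE documents (
        id               INTEGER PRIMARY KEY,
        source_authority TEXT NOT NULL,   -- e.g. 'EUR-Lex', 'ESMA'
        doc_type         TEXT NOT NULL,   -- Regulation, Directive, RTS, Q&A ...
        hierarchy_level  INTEGER CHECK (hierarchy_level BETWEEN 1 AND 5),
        language         TEXT,
        published_on     TEXT,            -- ISO-8601 dates stored as TEXT
        applies_from     TEXT,
        ingested_at      TEXT DEFAULT (datetime('now')),
        quality_score    REAL,
        status           TEXT DEFAULT 'pending'
    )
""")
conn.execute(
    "INSERT INTO documents (source_authority, doc_type, hierarchy_level) VALUES (?, ?, ?)",
    ("EUR-Lex", "Regulation", 1),
)
conn.commit()
print(conn.execute("SELECT COUNT(*) FROM documents").fetchone()[0])
```

Write-rarely, read-constantly workloads like this are exactly where SQLite's single-file simplicity pays off.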

Legal-Structure-Aware Chunking

This is where most RAG systems fail on legal text.

Why Fixed-Size Chunking Fails

The standard RAG approach — 500-token chunks with overlap — is catastrophic for legislation. Article 30(2) of DORA lists mandatory ICT contract provisions across ~800 tokens. A fixed-size chunker splits this between items (d) and (e), producing two incomplete chunks. A user querying "What contractual provisions does DORA require?" retrieves a partial list. Worse, fixed-size chunking severs the dependency between a paragraph establishing scope and subsequent paragraphs applying that scope.

Article-Boundary Chunking

Argus chunks at legal-structural boundaries:

  • Article level as the primary boundary. An Article is the fundamental unit of EU legislation — it conveys a complete legal provision. If Article 16(2) is 800 tokens, the chunk is 800 tokens. If Article 3(1) (definitions) is 3,000 tokens, we sub-chunk at the individual definition level.
  • Paragraph level for articles exceeding 1,500 tokens, chunking at numbered paragraph boundaries while preserving parent Article context in metadata.
  • Recital level for preambles. Each recital provides interpretive context for specific Articles, and maintaining this granularity is essential for accurate retrieval.
  • Guideline and Q&A level for supervisory documents, respecting their distinct structural conventions.
  • Annex items at their logical boundaries: table rows, list items, or section headers depending on the Annex structure.

This produces chunks of varying size (100-2,000 tokens), which is intentional. Legal meaning does not come in uniform packages.

Each chunk carries full hierarchical metadata — Regulation (EU) 2022/2554, Title V, Chapter V, Article 30, paragraph 2, point (e) — enabling precise citation in generated analysis. The system can optionally expand to retrieve the full parent Article when a sub-chunk matches a query.
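The core decision rule is simple enough to sketch. This toy version uses a whitespace token count and a hypothetical article dict in place of the real tokenizer and parser:

```python
def chunk_article(article, max_tokens=1500):
    """Keep an Article whole unless it exceeds the threshold; then split at
    numbered-paragraph boundaries, keeping the parent Article in metadata.
    `article` is a hypothetical dict: {"ref": ..., "paragraphs": [(num, text), ...]}."""
    n_tokens = lambda text: len(text.split())  # crude stand-in for a real tokenizer
    total = sum(n_tokens(t) for _, t in article["paragraphs"])
    if total <= max_tokens:
        body = " ".join(t for _, t in article["paragraphs"])
        return [{"text": body, "meta": {"article": article["ref"]}}]
    return [{"text": t, "meta": {"article": article["ref"], "paragraph": n}}
            for n, t in article["paragraphs"]]

# A 1,700-token article splits at paragraph boundaries, never mid-list.
art = {"ref": "Article 30", "paragraphs": [(1, "word " * 900), (2, "word " * 800)]}
chunks = chunk_article(art)
print(len(chunks), chunks[0]["meta"])
```

Because every sub-chunk records its parent, parent-expansion at query time is a metadata lookup rather than a second retrieval pass.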

The 5-Layer Knowledge Model

Layer 1: Primary Legislation. Regulations and Directives — MiCAR, MiFID II, AIFMD, DORA, EMIR, SFDR, the Taxonomy Regulation. Binding, authoritative, cited directly.

Layer 2: Implementing Measures. Delegated Regulations, RTS, and ITS that flesh out Level 1 with specific requirements: reporting templates, calculation methodologies, classification criteria. Legally binding but subordinate.

Layer 3: Supervisory Guidance. ESMA/EBA/EIOPA guidelines, Q&As, opinions. "Comply or explain" — not strictly binding, but near-binding in practice. Heavily influences how regulations are interpreted and enforced.

Layer 4: Market Practice. Trade association guidance (AIMA, EFAMA, ISLA), market conventions, common interpretive positions. The "how it actually works" layer.

Layer 5: Internal Analysis. Argus-generated cross-reference maps, regulatory interaction analyses, and structured summaries. Clearly labelled as analytical, not authoritative.

Each chunk carries a hierarchy_level field (1-5) that weights authoritative sources higher in retrieval and enables appropriate citation caveats.
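One way to apply that weighting — the multipliers here are invented for illustration, not our tuned values:

```python
# Hypothetical weights: Level 1 (primary legislation) counts in full,
# Level 5 (internal analysis) is discounted most.
LEVEL_WEIGHTS = {1: 1.00, 2: 0.95, 3: 0.90, 4: 0.80, 5: 0.70}

def weighted_score(similarity, hierarchy_level):
    """Down-weight less authoritative layers so a near-tie resolves
    toward the binding text."""
    return similarity * LEVEL_WEIGHTS[hierarchy_level]

# A Level-3 guideline needs a clearly better raw match to outrank Level-1 text.
print(weighted_score(0.90, 1) > weighted_score(0.93, 3))
```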

Bitemporal Versioning

Regulatory text has two temporal dimensions.

Legal time: when a provision was enacted, when it applies, when it was amended or repealed. MiCAR's Title III and Title IV on asset-referenced tokens and e-money tokens applied from 30 June 2024; the rest of the Regulation from 30 December 2024. DORA applied from 17 January 2025, but the designation of critical ICT third-party providers follows a separate timeline. A system that cannot distinguish these applicability dates will retrieve provisions that are not yet applicable, or miss provisions that are already in force.

System time: when the provision was ingested into the knowledge base and when it was last verified against the source. If a corrigendum was published on EUR-Lex yesterday, the system must know whether its indexed version reflects that update or an earlier version that may have been superseded.

Four timestamps per chunk: valid_from (date the provision became legally effective), valid_to (date amended or repealed, null if current), indexed_at (date ingested into Argus), superseded_at (date replaced by an updated version, null if current). Default retrieval filters to current law only, where both valid_to and superseded_at are null.
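The default filter reduces to a three-way check. A sketch using the four field names above (the chunk structure is illustrative):

```python
from datetime import date

def is_current(chunk, today=None):
    """Default retrieval filter: the provision is already in force,
    not amended or repealed, and not superseded in the index."""
    today = today or date.today()
    return (chunk["valid_from"] <= today
            and chunk["valid_to"] is None
            and chunk["superseded_at"] is None)

chunk = {"valid_from": date(2024, 12, 30), "valid_to": None,
         "indexed_at": date(2025, 1, 2), "superseded_at": None}
print(is_current(chunk, today=date(2025, 6, 1)))   # in force
print(is_current(chunk, today=date(2024, 6, 1)))   # not yet applicable
```

Point-in-time queries ("what did the law say on 1 March 2024?") fall out of the same fields by comparing against an `as_of` date instead of today.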

28 Monitoring Sources

EU-level (8): EUR-Lex (CELLAR API, checked hourly); ESMA, EBA, EIOPA, ECB, DG FISMA, ESRB, SRB — checked daily.

National regulators (14): AFM and DNB (Netherlands), BaFin (Germany), CSSF (Luxembourg), AMF (France), FMA (Austria), CBI (Ireland), CONSOB (Italy), CNMV (Spain), FCA (UK, for equivalence analysis), FINMA (Switzerland, for cross-border distribution), plus the Dutch Staatscourant, German Bundesgesetzblatt, and Luxembourg Mémorial for legislative gazette updates.

Specialised (6): ESMA registers (MiFID firms, funds, benchmarks), EBA interactive single rulebook, AIMA and EFAMA industry publications.

Each source has a dedicated ingestion pipeline with source-specific parsing logic. EUR-Lex provides structured XML via its CELLAR API. ESMA publishes RSS feeds with reasonably consistent metadata. Some national regulators provide well-structured feeds; others publish PDFs requiring extraction or HTML pages requiring scraping. The monitoring pipeline checks each source on a schedule calibrated to its update frequency — EUR-Lex hourly, ESMA and EBA daily, most NCAs daily, industry sources weekly.
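The per-source cadence amounts to a small registry plus a due-check. The registry below is a hypothetical excerpt mirroring the intervals described, not our actual source table:

```python
from datetime import datetime, timedelta

# Hypothetical excerpt of the source registry; intervals follow the
# cadence described above (EUR-Lex hourly, regulators daily, industry weekly).
SOURCES = {
    "EUR-Lex (CELLAR)": timedelta(hours=1),
    "ESMA": timedelta(days=1),
    "BaFin": timedelta(days=1),
    "AIMA": timedelta(weeks=1),
}

def due_sources(last_checked, now):
    """Return the sources whose polling interval has elapsed;
    never-checked sources are always due."""
    return [name for name, interval in SOURCES.items()
            if now - last_checked.get(name, datetime.min) >= interval]

now = datetime(2025, 3, 1, 12, 0)
last = {"EUR-Lex (CELLAR)": now - timedelta(minutes=30),
        "ESMA": now - timedelta(days=2)}
print(due_sources(last, now))
```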

The Quality Pipeline: 8 Stages

  1. Source verification — confirming authoritative origin, URL whitelist validation, digital signature checks
  2. Document classification — document type, issuing authority, regulatory domain
  3. Language detection — fastText identification, routing to language-specific processing
  4. Structural parsing — extracting titles, chapters, articles, paragraphs, recitals, annexes with hierarchical relationships
  5. Temporal extraction — parsing enactment dates, application dates, sunset clauses from both metadata and text
  6. Cross-reference extraction — structured links to referenced legislation enabling graph traversal
  7. Chunking and embedding — legal-boundary chunking followed by BGE-M3 dense and sparse embedding on GPU
  8. Deduplication and conflict detection — exact hash matching plus semantic similarity (cosine > 0.98), with conflicts flagged for human review

Approximately 8% of ingested content requires manual review, primarily due to parsing failures on non-standard NCA document formats.
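Stage 8's two-tier check can be sketched in a few lines — the 0.98 threshold is from the pipeline above; the hash choice and return values are illustrative:

```python
import hashlib
import math

def near_duplicate(vec_a, vec_b, threshold=0.98):
    """Semantic duplicate check: cosine similarity above the threshold."""
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    den = math.sqrt(sum(a * a for a in vec_a)) * math.sqrt(sum(b * b for b in vec_b))
    return den > 0 and dot / den > threshold

def dedup_status(text, vec, seen_hashes, seen_vecs):
    """Exact hash match first (cheap), then embedding similarity;
    near-matches are flagged for human review, not silently dropped."""
    h = hashlib.sha256(text.encode()).hexdigest()
    if h in seen_hashes:
        return "exact_duplicate"
    if any(near_duplicate(vec, v) for v in seen_vecs):
        return "flag_for_review"
    return "new"

seen_h = {hashlib.sha256(b"Article 30(2) ...").hexdigest()}
print(dedup_status("Article 30(2) ...", [1.0, 0.0], seen_h, []))
print(dedup_status("Article 30 (2)...", [1.0, 0.01], set(), [[1.0, 0.0]]))
```

Routing near-matches to review rather than auto-deleting is deliberate: a corrigendum can be 99% identical to the text it corrects, and the 1% is the point.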

What's Next

Regulatory knowledge graph. A graph representation where regulations, articles, definitions, and obligations are nodes connected by typed edges (amends, implements, references, supersedes). This enables queries that pure vector similarity cannot answer.

Obligation extraction. Automatically structuring obligations from prose — who must do what, by when, under what conditions — enabling automated gap analysis and compliance checklist generation.

Graph-based retrieval expansion. When a query matches an article, automatically traversing the graph to retrieve referenced articles, implementing RTS, interpreting guidelines, and modifying amendments. This mimics how an experienced regulatory lawyer reads legislation — not as isolated provisions but as nodes in a web of interconnected instruments.
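The traversal itself is a bounded breadth-first walk over typed edges. The edges below are invented for illustration (the graph does not exist yet):

```python
from collections import deque

# Hypothetical typed edges keyed by citation string; in the planned graph the
# types would include amends, implements, references, supersedes.
EDGES = {
    "MiCAR Art. 36": [("references", "MiFID II Art. 16"),
                      ("interpreted_by", "ESMA Guideline 3")],
    "MiFID II Art. 16": [("implemented_by", "MiFID II Org. Reg. Art. 21")],
}

def expand(start, max_hops=2):
    """Breadth-first expansion from a matched article, collecting the
    instruments a lawyer would read alongside it, up to max_hops away."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue
        for _edge_type, target in EDGES.get(node, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, hops + 1))
    return seen

print(sorted(expand("MiCAR Art. 36")))
```

Bounding the hop count matters: regulatory cross-references are dense enough that an unbounded walk would pull in half the corpus.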


Try financialregulations.eu — start with 2 free regulatory queries. No credit card required.

Start Analysing — Free →
