MLPipeKit handles the full chain from EDC ingestion to CDISC-compliant dataset delivery. Each step in the pipeline is configurable per study, versioned, and auditable — because trial data cannot be a black box.
MLPipeKit pulls patient visit data from your EDC using CDISC ODM or HL7 FHIR R4 endpoints. Ingestion runs on a configurable schedule — every 4 hours during active enrollment, daily during follow-up. No manual CSV exports. No FTP directories. Data arrives with site code, visit window, and form version metadata attached.
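The phase-dependent schedule described above can be sketched as a small scheduling function. This is an illustrative sketch, not MLPipeKit's actual API — the phase names and `next_sync` helper are assumptions; only the intervals (4 hours during active enrollment, daily during follow-up) come from the text.

```python
from datetime import datetime, timedelta

# Hypothetical phase names; the 4-hour / daily intervals are from the docs.
INTERVALS = {
    "active_enrollment": timedelta(hours=4),
    "follow_up": timedelta(days=1),
}

def next_sync(last_sync: datetime, study_phase: str) -> datetime:
    """Return the next scheduled EDC pull for a study in the given phase."""
    return last_sync + INTERVALS[study_phase]

last = datetime(2025, 3, 1, 8, 0)
print(next_sync(last, "active_enrollment"))  # 2025-03-01 12:00:00
print(next_sync(last, "follow_up"))          # 2025-03-02 08:00:00
```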
Once data is ingested, the validation engine applies your rule library against it. Rules are written in a structured query format that non-programmers can maintain — clinical data managers define field-level range checks, cross-form consistency rules, and protocol-deviation flags without writing code. The engine runs all rules in parallel; a 10,000-record batch completes validation in under 90 seconds.
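A field-level range check of the kind clinical data managers would define might look like the following. The rule and record shapes here are invented for illustration — MLPipeKit's structured query format is not published in this document — but the sketch shows the key property the text describes: rules are independent, so the engine can fan them out concurrently.

```python
from concurrent.futures import ThreadPoolExecutor

# Illustrative rule shape (keys and field names are assumptions):
# a field-level range check on vital signs.
rules = [
    {"id": "VS-001", "field": "SYSBP", "min": 60, "max": 250},
    {"id": "VS-002", "field": "HR",    "min": 30, "max": 220},
]

records = [
    {"record_id": "R1", "SYSBP": 320, "HR": 72},
    {"record_id": "R2", "SYSBP": 118, "HR": 25},
]

def apply_rule(rule):
    """Return (rule id, record id) for every record violating the rule."""
    hits = []
    for rec in records:
        value = rec.get(rule["field"])
        if value is not None and not (rule["min"] <= value <= rule["max"]):
            hits.append((rule["id"], rec["record_id"]))
    return hits

# Rules touch no shared state, so they can run concurrently.
with ThreadPoolExecutor() as pool:
    flags = [hit for hits in pool.map(apply_rule, rules) for hit in hits]

print(flags)  # [('VS-001', 'R1'), ('VS-002', 'R2')]
```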
Discrepancies surface as queries routed to the relevant site coordinator. Each query carries the affected eCRF field, the violation rule, and the expected value range. Coordinators respond directly in the portal; responses are captured with timestamp and user signature per 21 CFR Part 11. Sponsors can see open query count, median response time, and resolution rate per site in real time.
After a visit window closes and queries are resolved, the transformation engine maps cleaned data to SDTM domains (DM, AE, CM, EX, VS, LB, and custom domains). ADaM datasets — ADSL, ADAE, ADLB — are generated from SDTM using sponsor-defined derivation specifications stored in the platform. Domain mappings are versioned and reusable across studies with the same protocol structure.
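At its core, a domain mapping is a rename from EDC field names to SDTM variable names plus a DOMAIN stamp. The sketch below assumes invented EDC-side column names; the SDTM variable names (USUBJID, BRTHDTC, SEX, SITEID) are standard DM variables.

```python
# Hypothetical EDC-to-SDTM DM mapping; EDC column names are invented.
dm_mapping = {
    "subject_id": "USUBJID",
    "birth_date": "BRTHDTC",
    "sex": "SEX",
    "site": "SITEID",
}

def to_sdtm(record: dict, mapping: dict, domain: str) -> dict:
    """Rename EDC fields to SDTM variables and stamp the DOMAIN."""
    out = {sdtm: record[edc] for edc, sdtm in mapping.items() if edc in record}
    out["DOMAIN"] = domain
    return out

row = {"subject_id": "STUDY01-001", "sex": "F", "site": "101"}
print(to_sdtm(row, dm_mapping, "DM"))
# {'USUBJID': 'STUDY01-001', 'SEX': 'F', 'SITEID': '101', 'DOMAIN': 'DM'}
```

Because the mapping is data rather than code, the same dictionary can be versioned and reused across studies that share a protocol structure, which is the reuse property the text describes.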
Final SDTM and ADaM datasets are exported with define.xml (CDISC Define-XML 2.1), annotated CRFs, and a full audit trail document. Pinnacle 21 Community conformance checks run automatically before export; issues are listed with fix suggestions before the package leaves the platform. The complete package targets the FDA eCTD Module 5 structure.
MLPipeKit ships pre-built integrations for Medidata Rave, Oracle Clinical One, and Veeva Vault EDC. Each connector handles authentication, session management, and incremental data sync. The Medidata Rave connector supports both the legacy SOAP API and the newer REST-based Clinical Cloud API, depending on the customer's Rave version.
Custom connectors for other EDC systems are available under the Enterprise plan. We have delivered connectors for BioClinica's CTMS and iMedidata eConsent in scoped three-week engagements.
MLPipeKit ships with 340 pre-configured edit checks covering ICH E6(R2) GCP requirements, FDA Data Standards Catalog checks, and common protocol-agnostic validations (date logic, range checks, required field compliance). These can be used as-is or modified.
Study-specific rules are added through the rule editor. Rules reference CRF fields by domain and variable name (CDISC SDTM terminology), so the same rule can apply across multiple EDC form versions without duplication. A rule impact analysis tool shows which existing data would be flagged by a new rule before it is activated.
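The impact analysis described above amounts to a dry run of the candidate rule against already-collected data. A minimal sketch, with invented rule and record shapes (MLPipeKit's internal representation is not shown in this document):

```python
# Candidate rule and a slice of already-collected data (shapes assumed).
candidate = {"id": "LB-017", "field": "GLUC", "min": 50, "max": 400}
existing = [
    {"record_id": "R1", "GLUC": 95},
    {"record_id": "R2", "GLUC": 480},
    {"record_id": "R3"},              # field not collected: not flagged
]

def impact_analysis(rule, dataset):
    """Report how many existing records a new rule would flag if activated."""
    def violates(rec):
        value = rec.get(rule["field"])
        return value is not None and not (rule["min"] <= value <= rule["max"])
    flagged = sum(1 for rec in dataset if violates(rec))
    return {"rule": rule["id"], "flagged": flagged, "total": len(dataset)}

print(impact_analysis(candidate, existing))
# {'rule': 'LB-017', 'flagged': 1, 'total': 3}
```

Running this before activation tells the data manager whether a new rule will quietly open two queries or two thousand.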
The lock readiness view aggregates open query count, unresolved protocol deviations, missing visit data flags, and SAE narrative completeness across all sites. A study is classified as lock-ready when all criteria fall below sponsor-defined thresholds. Lock readiness date is projected based on current query resolution velocity — not a manual estimate.
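The two computations behind that view — threshold classification and velocity-based projection — can be sketched as follows. Metric names, threshold values, and function names here are illustrative assumptions; the source states only that readiness is threshold-based and that the projection uses current query resolution velocity.

```python
import math
from datetime import date, timedelta

def is_lock_ready(metrics: dict, thresholds: dict) -> bool:
    """Lock-ready when every metric is at or below its sponsor threshold."""
    return all(metrics[k] <= thresholds[k] for k in thresholds)

def projected_lock_date(open_queries: int, resolved_per_day: float,
                        today: date) -> date:
    """Project the lock date from the current query resolution velocity."""
    days = math.ceil(open_queries / resolved_per_day)
    return today + timedelta(days=days)

metrics = {"open_queries": 84, "protocol_deviations": 2, "missing_visits": 0}
thresholds = {"open_queries": 0, "protocol_deviations": 5, "missing_visits": 0}
print(is_lock_ready(metrics, thresholds))               # False
print(projected_lock_date(84, 12.0, date(2025, 3, 1)))  # 2025-03-08
```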
Teams running 8–12 concurrent studies can monitor lock readiness for all of them from one dashboard without switching between EDC instances.
SDTM mapping uses a specification workbook format that statistical programmers already know. Import your existing SDTM spec directly; MLPipeKit parses column-to-domain variable mappings and applies them during transformation. Controlled terminology compliance (CDISC CT September 2024 release) is enforced at mapping time.
ADaM derivations are defined using a Python-based derivation DSL that executes inside the platform's managed environment. Derivation code is version-controlled, testable against historical data, and re-executable on any study snapshot.
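A derivation in the spirit of that DSL might look like the sketch below — an ADSL-style age derivation from the DM domain. The function name and record shape are assumptions; the SDTM variable BRTHDTC and ADaM variable AGE are standard.

```python
from datetime import date

def derive_adsl_age(dm_record: dict, ref_date: date) -> dict:
    """Derive AGE at a reference date from BRTHDTC (ISO 8601 date)."""
    brth = date.fromisoformat(dm_record["BRTHDTC"])
    # Subtract one if the birthday has not yet occurred this year.
    age = ref_date.year - brth.year - (
        (ref_date.month, ref_date.day) < (brth.month, brth.day)
    )
    return {**dm_record, "AGE": age}

rec = {"USUBJID": "STUDY01-001", "BRTHDTC": "1980-06-15"}
print(derive_adsl_age(rec, date(2024, 6, 14))["AGE"])  # 43
print(derive_adsl_age(rec, date(2024, 6, 15))["AGE"])  # 44
```

Because derivations like this are plain versioned functions, they can be re-executed against any historical study snapshot and unit-tested outside the live study, which is the testability property the text claims.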
All user actions in MLPipeKit — data edits, query responses, mapping changes, dataset exports — are recorded in an immutable audit trail stored separately from operational data. Audit records include user identity (tied to SSO/LDAP directory), session IP, timestamp, and the state of the record before and after the action.
The platform is validated under GAMP 5 Category 4. A validation package (IQ/OQ/PQ documentation, test scripts, deviation log) is available to customers for inclusion in their computer system validation documentation.
ADaM datasets are delivered to statistical programming environments through a direct mount to your organization's secure file share or through the MLPipeKit API. Each delivery includes dataset metadata in machine-readable format (XPT + YAML spec) so statistical programmers can write TLF code against a stable dataset schema without waiting for a manual data cut.
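The YAML spec accompanying each XPT file might look something like the fragment below. The key names and layout are entirely an assumption for illustration — the source says only that machine-readable dataset metadata ships alongside each delivery.

```yaml
# Hypothetical per-dataset metadata sketch; all key names are assumed.
dataset: ADSL
version: "2.3"
snapshot: "2025-03-01T06:00:00Z"
sdtm_sources: [DM, EX, DS]
variables:
  - name: USUBJID
    label: Unique Subject Identifier
    type: char
    length: 20
  - name: AGE
    label: Age at Reference Date
    type: num
```

A stable, versioned spec like this is what lets TLF programs target a fixed schema instead of a moving data cut.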
Integration with SAS Grid and Python-based analysis environments (using pyreadstat and pandas) is documented; setup scripts are provided for both.
Most teams complete setup and run their first automated validation batch within 5 business days of kickoff.