Predicting toxicity of chemicals: software beats animal testing
We previously created a large machine‐readable database of 10,000 chemicals and 800,000 associated studies by natural language processing of the public parts of Registration, Evaluation, Authorisation and Restriction of Chemicals (REACH) registrations up to December 2014. This database was used to assess the reproducibility of the six most frequently used Organisation for Economic Co‐operation and Development (OECD) guideline tests, which consume 55% of all animals in safety testing in Europe, i.e. about 600,000 animals. With 350–750 chemicals having multiple results per test, reproducibility (balanced accuracy) was 81%, and only 69% of toxic substances were identified again in a repeat experiment (sensitivity 69%). Inspired by the increasingly used read‐across approach, we created a new type of QSAR based on the similarity of chemicals rather than on chemical descriptors. A landscape of the chemical universe was calculated from 10 million structures: based on Tanimoto indices, similar chemicals lie close together and dissimilar chemicals far apart. This allows any chemical of interest to be placed into the map and the information available for the surrounding chemicals to be evaluated. In a data fusion approach taking 74 different properties into consideration, machine learning (random forest) was applied with fivefold cross‐validation to 190,000 (non‐)hazard labels of chemicals, for which nine hazards were predicted. The balanced accuracy of this approach was 87%, with a sensitivity of 89%. Each prediction comes with a certainty measure based on the homogeneity of the data and the distance of neighbours. Ongoing developments and future opportunities are discussed.
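The Tanimoto index underlying the similarity map measures fingerprint overlap between two chemicals. A minimal sketch, assuming each chemical is represented as a set of structural fingerprint bits (the abstract does not specify which fingerprint type was used, so the bit sets below are purely illustrative):

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto (Jaccard) index of two fingerprint bit sets:
    |intersection| / |union|. 1.0 means identical fingerprints,
    0.0 means no shared structural features."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

# Hypothetical fingerprints: sets of bit positions that are "on"
chem_a = {1, 4, 7, 9}
chem_b = {1, 4, 7, 12}   # shares 3 of 5 distinct bits with chem_a
chem_c = {20, 21, 22}    # no overlap with chem_a

print(tanimoto(chem_a, chem_b))  # → 0.6
print(tanimoto(chem_a, chem_c))  # → 0.0
```

In the mapped chemical universe, pairs with a high Tanimoto index end up as close neighbours, which is what makes read‐across from surrounding chemicals possible.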
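The two performance measures reported above, balanced accuracy and sensitivity, follow directly from a binary confusion table. A sketch with illustrative counts chosen only to reproduce the 81%/69% reproducibility figures, not the study's actual confusion matrix:

```python
def sensitivity(tp: int, fn: int) -> float:
    """True positive rate: fraction of toxic chemicals correctly flagged."""
    return tp / (tp + fn)

def specificity(tn: int, fp: int) -> float:
    """True negative rate: fraction of non-toxic chemicals correctly cleared."""
    return tn / (tn + fp)

def balanced_accuracy(tp: int, fn: int, tn: int, fp: int) -> float:
    """Mean of sensitivity and specificity; unlike plain accuracy it is
    robust to class imbalance, which matters when non-toxic labels dominate."""
    return (sensitivity(tp, fn) + specificity(tn, fp)) / 2

# Illustrative counts only (tp=toxic found again, fn=toxic missed, etc.)
tp, fn, tn, fp = 69, 31, 93, 7
print(sensitivity(tp, fn))                        # → 0.69
print(round(balanced_accuracy(tp, fn, tn, fp), 2))  # → 0.81
```

Balanced accuracy is the natural choice here because it treats the rarer toxic class and the abundant non‐toxic class symmetrically.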
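The read‐across idea, predicting a hazard label for a query chemical from its nearest neighbours in the similarity map and attaching a certainty based on neighbour homogeneity and distance, can be sketched as a similarity‐weighted vote. This is an illustrative scheme under assumed weighting choices, not the paper's actual model, which fuses 74 properties in a random forest:

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto index: |intersection| / |union| of fingerprint bit sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def read_across(query: set, neighbours: list, k: int = 3):
    """Predict a binary hazard label for `query` from its k most similar
    neighbours, given as (fingerprint_set, is_toxic) pairs. Returns
    (label, certainty), where certainty combines label agreement among
    neighbours (homogeneity) with mean similarity (closeness)."""
    ranked = sorted(neighbours, key=lambda n: tanimoto(query, n[0]),
                    reverse=True)[:k]
    sims = [tanimoto(query, fp) for fp, _ in ranked]
    if sum(sims) == 0:
        return None, 0.0  # no similar chemicals nearby: no prediction
    toxic_weight = sum(s for s, (_, lab) in zip(sims, ranked) if lab)
    frac_toxic = toxic_weight / sum(sims)
    label = frac_toxic >= 0.5
    agreement = max(frac_toxic, 1 - frac_toxic)  # homogeneity of labels
    closeness = sum(sims) / len(sims)            # mean neighbour similarity
    return label, agreement * closeness

# Hypothetical mini-dataset of fingerprints with known toxicity labels
data = [({1, 2, 3, 4}, True), ({1, 2, 3, 5}, True), ({8, 9, 10}, False)]
label, certainty = read_across({1, 2, 3, 6}, data)
print(label)  # → True
```

The certainty term mirrors the abstract's description: predictions among many close, like‐labelled neighbours score high, while predictions from distant or conflicting neighbours score low.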