It was good news for budding legal AI researchers this week as Harvard announced its "Caselaw Access Project". The project makes 6.4 million US case reports freely available from the Harvard Law School Library. The disparate nature of state-level case reporting had previously made it difficult for the public (or anyone without commercial tools) to access consolidated data. Since data is the fuel of ML models, public access means more people will be able to explore the data and develop their own models.
What's key is that this is not just a data dump: users can query the data through an API, or bulk download it for processing. Adding the metadata to make this possible is a huge task and has taken the Harvard team years.
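To give a feel for what "querying through an API" means in practice, here is a minimal sketch of building a full-text search request. The base URL, the function name `build_case_query`, and the parameter names (`search`, `jurisdiction`, `page_size`) are all placeholders chosen for illustration, not the project's confirmed interface, so check its API documentation before relying on any of them.

```python
from urllib.parse import urlencode

# Placeholder endpoint: an assumption for illustration, not the real API address.
BASE_URL = "https://api.example.org/v1/cases/"


def build_case_query(search, jurisdiction=None, page_size=10):
    """Build a search URL for case reports (parameter names are hypothetical)."""
    params = {"search": search, "page_size": page_size}
    if jurisdiction is not None:
        # Optionally restrict results to a single jurisdiction.
        params["jurisdiction"] = jurisdiction
    # urlencode escapes spaces and special characters in the query string.
    return BASE_URL + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_case_query("adverse possession", jurisdiction="ill"))
```

The resulting URL could then be fetched with any HTTP client. A structured response (rather than a pile of scanned pages) is what makes the data usable for model building.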
The UK also has good form on making legal data easily available. Legislation.gov.uk has, for years now, made UK legislation freely and easily available. As with the Caselaw Access Project, the magic lies not simply in making the raw documents available (although the original print PDFs are available) but in providing invaluable metadata that lets users query and manipulate the data.
For obvious reasons, publicly available databases of contracts are a little harder to come by. An exception is the "material contracts" which companies with US securities are required to file with the SEC. The SEC's EDGAR database does not make it easy to single out "material contracts" from other types of filings but, fortunately, the enterprising souls at Law Insider have done the job for you. The result is a frankly astonishing publicly available clause bank. Free access is for personal use only but, if you happen to be personally interested in what US 'market' clauses look like, it's a fascinating place to explore.
On the scale of the Caselaw Access Project, the Harvard team writes: "We created this data by digitizing roughly 40 million pages of court decisions contained in roughly 40,000 bound volumes owned by the Harvard Law School Library."