Tech giants duck questions on LLM copyright rules

In the UK’s Parliament this week, Microsoft and Meta ducked the question of whether creators should be paid when their copyrighted material is used to train large language models.

The tech titans, with combined revenues well in excess of $200 billion, were being grilled by the House of Lords Communications and Digital Committee when the copyright question came into focus.

In September, the Authors’ Guild, a trade association for published writers, and 17 authors filed a class-action lawsuit in the US over OpenAI’s use of their material to create its LLM-based services.

OpenAI CEO Sam Altman has since said the company would cover its clients’ legal costs for copyright infringement suits rather than remove the material from its training sets.

Microsoft has invested $13 billion in OpenAI. It has an extended partnership with the machine learning developer, powering its workloads on the Azure cloud platform and using its models to run the Copilot automated assistant.

Speaking to the Lords yesterday, Owen Larter, director of public policy at Microsoft’s Office of Responsible AI, said: “It’s important to appreciate what a large language model is. It’s a large model trained on text data, learning the associations between different ideas. It’s not necessarily sucking anything up from underneath.”

He said there should be a "framework" to provide some protection for copyrighted material, and that Microsoft would assume responsibility for any infringement by its LLM-based systems. But he also said Microsoft supports the recent Vallance report into "pro-innovation" AI law in the UK, which advocates for text and data mining exceptions for training models.

But Donald Michael, Lord Foster of Bath, pressed Larter on whether he would accept that if a company uses copyrighted material to build an LLM for profit, the copyright owner should be reimbursed.

The Microsoft director said: “It’s really important to understand that you need to train these large language models on large data sets if you’re going to get them to perform effectively, if you’re going to allow them to be safe and secure … There are also some competition issues [in making sure] that training of large models is available to everyone. If you go too far down a path where it’s very hard to obtain data to train models, then all of a sudden, the ability to do so will only be the preserve of very large companies.”

Litigation is already under way over how the training data sets Books1, Books2, and Books3, which contain pirated copies of copyrighted material, have been used to help build popular LLMs.

Meta is behind the Llama 2 LLM, which scales up to 70 billion parameters. The social media giant has promoted the model as open source, although FOSS purists point to some caveats in its approach.

Speaking to the Lords, Rob Sherman, vice president and deputy chief privacy officer for policy at Meta, said the company would comply with the law.

But he added: "Maintaining broad access to information on the internet, including for use in innovation like this, is quite important. I do support giving rights holders the ability to manage how their information is used.

"I'm a little bit cautious about the idea of forcing companies that are building AI to enter into bespoke agreements with individual rights holders, or otherwise to pay for content that doesn't have economic value for them."

Last week, Dan Conway, CEO of the UK’s Publishers Association, told the committee that large language models were infringing copyrighted content on an “absolutely massive scale.”

"We know this in the publishing industry because of the Books3 database, which lists 120,000 pirated book titles, which we know have been ingested by large language models," he said. "We know that the content is being ingested on an absolutely massive scale by large language models. LLMs do infringe copyright at multiple parts of the process in terms of when they collect this information, how they store it, and how they handle it. Copyright law is being broken on a massive scale."

At the same hearing, Dr Hayleigh Bosher, reader in intellectual property law at Brunel University London, said she represented neither tech firms nor content creators, and offered a neutral perspective.

"The principle of when you need a licence and when you don't is clear," she said, "and to make a reproduction of a copyright-protected work without permission would require a licence or would otherwise be infringement. That's what AI does at different steps of the process: the ingestion, the running of the program, and potentially even the output.

"Some AI and tech developers are arguing for a different interpretation of the law. I don't represent either of those sides. I'm a copyright expert, and from my position and understanding of what copyright is supposed to achieve and how it achieves it, you would require a licence for that activity." ®
