COMMENTARY: Since the Copyright Act of 1976, the principle of "fair use" has allowed individuals to gather information from publicly available content without infringing on intellectual property (IP) rights. That balance between open access and protection for creators has been a cornerstone of copyright law for decades. However, artificial intelligence (AI) complicates the concept by operating on a vastly different scale. Its ability to scan and synthesize millions of articles in seconds, often producing content that closely mirrors its sources, raises concerns about the future of data privacy.

Rather than passively analyzing large amounts of data, these models scrape, digest, and repurpose valuable information, presenting a new challenge to the traditional understanding of "fair use." These systems can ingest megabytes of text per second, while the human brain processes roughly 50 bits per second. Synthesizing information at that magnitude and then producing analyses from it poses an ethical challenge, and the sheer difference in scale prompts the question: Is it still "fair use" when AI models can absorb and repurpose information on such a massive scale?

The concern with AI's synthesis of public information is that it transcends traditional data analysis: the models infer things the public never explicitly shared. In extreme cases, the technology predicts personal circumstances with remarkable accuracy. More than a decade ago, for example, Target used purchasing patterns to infer that a teenage customer was pregnant and began sending her micro-targeted baby advertisements before her family knew. Today, similar tactics drive inference-based marketing, which analyzes purchase history, social media posts, and even the tone of emails to predict consumer behavior in ways that feel invasive.

With that context in mind, AI can mine even mundane personal data without violating any existing privacy law. Regulations such as the General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA) offer some protection, but they do little to manage AI's ability to infer sensitive information. Without a broad standard that defines the scope of "fair use" for large language models (LLMs), many organizations are left without an answer.

While the issues surrounding "fair use" are far from resolved, there are early signs that major players acknowledge the need to license certain types of IP. Large licensing agreements, such as those between Reddit and Google and between OpenAI and The Financial Times, show that companies find it necessary, and advantageous, to seek explicit licenses for data rather than simply scrape information and claim "fair use." While some private companies are taking matters into their own hands, many are looking to regulators for clearer standards.

Questions remain: Where do regulators draw the line when AI can predict information that humans have not explicitly shared? Where does fair use end and exploitation begin when AI can absorb an entire content library and repurpose it without regard for the original creators? These questions highlight the need to re-evaluate the traditional frameworks that govern data usage and IP.
The line between ownership and responsible attribution in creative and informational ecosystems has blurred. Although regulators are only beginning to tackle these questions, organizations must do their part to address the challenges AI presents. They should ensure that any use of LLMs within their organization includes:

- Clear communication: Offer users clear policies about sharing personal or sensitive information with LLMs, and make sure they understand the implications of doing so.
- Data sanitization: Implement filters that remove personal information and other sensitive data before it reaches an LLM. Stripping this information protects companies from exposing private details or violating privacy standards (a minimal sketch follows this list).
- Temporary memory: Automatically erase sensitive information once a session ends, so there is no long-term retention of personal data. This protects user privacy and reduces the risk of that information being accessed later.
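To make the data-sanitization and temporary-memory recommendations concrete, here is a minimal sketch of how they might fit together, assuming a simple regex-based redaction step and a hypothetical send_to_llm function standing in for whatever model API an organization actually uses. The patterns, class names, and call are illustrative assumptions, not a reference to any particular vendor's product; a production filter would rely on named-entity recognition and far broader rule sets.

```python
import re

# Illustrative redaction patterns only; real systems need much broader coverage.
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "phone": re.compile(r"\b(?:\+?\d{1,3}[ .-]?)?\d{3}[ .-]?\d{3}[ .-]?\d{4}\b"),
}


def sanitize(text: str) -> str:
    """Redact likely personal identifiers before text leaves the organization."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[REDACTED-{label.upper()}]", text)
    return text


def send_to_llm(prompt: str, history: list[str]) -> str:
    """Hypothetical stand-in for a real model API call."""
    return f"(model response to: {prompt!r})"


class EphemeralSession:
    """Session-scoped memory: nothing survives past close()."""

    def __init__(self) -> None:
        self._history: list[str] = []

    def ask(self, user_prompt: str) -> str:
        clean = sanitize(user_prompt)   # data sanitization before the model sees it
        self._history.append(clean)     # retained only for the life of this session
        return send_to_llm(clean, self._history)

    def close(self) -> None:
        self._history.clear()           # temporary memory: erase on session end


if __name__ == "__main__":
    session = EphemeralSession()
    print(session.ask("My email is jane.doe@example.com and my cell is 555-867-5309."))
    session.close()  # no long-term retention of the user's personal details
```

The point of the design is that nothing a user types is forwarded or retained in raw form: redaction happens before the model call, and the session history disappears when close() runs.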
AI's growing ability to predict and infer beyond what consumers explicitly share demands stricter guidelines and safeguards. Until standard regulations exist, companies should take proactive steps to protect data privacy: protective measures such as robust data sanitization, transparent user policies, and memory-limiting practices can restrict the amount of personal information exposed to LLMs.

The concern around "fair use" is not only a question of legal implications. As regulators address these questions, businesses and individuals alike will feel the impact of where the limits on AI's reach into personal privacy are ultimately drawn.

Steve Wilson, chief product officer, Exabeam

SC Media Perspectives columns are written by a trusted community of SC Media cybersecurity subject matter experts. Each contribution has a goal of bringing a unique voice to important cybersecurity topics. Content strives to be of the highest quality, objective and non-commercial.