Postechian Column: AI and Morals: Ethical Challenges in Training Models
  • Kim Seong-joo (Mueunjae 22)
  • Published 2023.03.01 21:07


 In late 2022, the public conversation around artificial intelligence (AI) was rekindled with the introduction of ChatGPT, an AI chatbot developed by OpenAI. While natural language models are not a recent invention, the capabilities of ChatGPT surprised many. Notably, ChatGPT has passed US medical, law, and business school exams without any additional training; in a New York Times article conducting a "Turing test" of the AI, not even experts could distinguish between AI-generated writing and a genuine human's work with absolute certainty. But I am not writing this article to bemoan or extol the competence of ChatGPT; rather, I wish to discuss the ethical concerns surrounding AI, specifically in terms of copyright.
 Currently, large datasets are needed to train a usable model; GPT-3—on which ChatGPT is based—was trained on Common Crawl, Wikipedia, and other unspecified datasets. Collected since 2008, the Common Crawl dataset contains petabytes of data, some of which are copyrighted works. Due to the nature of the Common Crawl dataset, the dataset itself is mostly considered fair use. However, the question of whether it is fair use to monetize a model partially trained on copyrighted material remains open.
 Recently, Getty Images—a large provider of paid stock images—sued the creators of the AI image generator Stable Diffusion, claiming copyright infringement. Some images generated by the AI recreated the Getty Images watermark, indicating that the AI was trained on publicly available watermarked images. Although copying the watermark is itself trademark infringement, the dispute raises broader questions about the use of copyrighted material in AI training. Should everything on the internet be considered fair game for AI training, or should there be restrictions on the use of material that is clearly marked as copyrighted (e.g. with a watermark)?
 To explore this topic further, consider GitHub Copilot, a programming AI capable of parsing natural language, which was trained on public code. However, Copilot—like Stable Diffusion—sometimes copies entire sections of code and outputs them without proper attribution. As such, there is ongoing litigation over Copilot's use of licensed public code without adequate accreditation. Yet I imagine most readers have used code snippets from Stack Overflow or similar resources in the past. Is there a higher standard AI must uphold compared to humans, i.e., must all AI-generated content be original? Or would proper attribution and citation solve the problem?
 Bringing this all together, where do we draw the line in training an AI? Is it ethical to scrape the internet to train one? Is the line crossed when commercializing something built on the work of others? Or can everything on the internet, regardless of the original uploader's intent, be used? We must debate these questions today, before AI inevitably becomes ubiquitous. We—individuals, governments, and corporations—must all work together to establish a legal framework that reflects a consensus on ethical AI. Of course, copyright is merely a portion of the vast ethical discourse regarding AI. To prevent a future where ethics are disregarded in pursuit of profit, it is imperative that we start the discussion on ethical AI immediately.