DeepSeek warns of ‘jailbreak’ risks for its open-source models

DeepSeek has revealed details about the risks posed by its artificial intelligence models for the first time, noting that open-source models are particularly susceptible to being “jailbroken” by malicious actors.

The Hangzhou-based start-up said it had evaluated its models using industry benchmarks as well as its own tests in a peer-reviewed article published in the academic journal Nature.

American AI companies often publish research about the risks of their rapidly improving models and have introduced risk mitigation policies in response, such as Anthropic’s Responsible Scaling Policy and OpenAI’s Preparedness Framework.

Chinese companies have been less open about risks, even though their models are only a few months behind their US equivalents, according to AI experts. However, DeepSeek had conducted such risk evaluations before, including of the most serious “frontier risks”, the Post reported earlier.

The Nature paper provided more “granular” details about DeepSeek’s testing regime, said Fang Liang, an expert member of China’s AI Industry Alliance (AIIA), an industry body. These included “red-team” tests based on a framework introduced by Anthropic, in which testers try to get AI models to produce harmful speech.

According to the paper, DeepSeek found that its R1 reasoning model and V3 base model – released in January 2025 and December 2024, respectively – had slightly higher-than-average safety scores across six industry benchmarks than OpenAI’s o1 and GPT-4o, both released last year, and Anthropic’s Claude-3.7-Sonnet, released in February.

However, it found that R1 was “relatively unsafe” once its external “risk control” mechanism was removed, following tests on its own in-house safety benchmark consisting of 1,120 test questions.

AI companies typically try to prevent their systems from generating harmful content by “fine-tuning” the models themselves during the training process or adding external content filters.

However, experts have warned that such safety measures can be easily bypassed by techniques such as “jailbreaking”. For example, rather than asking a model to generate an instruction manual for creating a Molotov cocktail, a malicious user asks for a detailed history of the weapon.

DeepSeek found that all tested models exhibited “significantly increased rates” of harmful responses when faced with jailbreak attacks, with R1 and Alibaba Group Holding’s Qwen2.5 deemed most vulnerable because they are open-source. Alibaba owns the Post.

Open-source models are released free on the internet to anyone who wants to download and modify them. While this is beneficial for adoption of the technology, it can also make it possible for users to remove the model’s external safety mechanisms.

“We fully recognise that, while open source sharing facilitates the dissemination of advanced technologies within the community, it also introduces potential risks of misuse,” said the paper, which listed DeepSeek CEO Liang Wenfeng as the corresponding author.

“To address safety issues, we advise developers using open source models in their services to adopt comparable risk control measures.”

DeepSeek’s warning comes as Chinese policymakers stress the need to balance development and safety in China’s open-source AI ecosystem.

On Monday, a technical standards body associated with the Cyberspace Administration of China warned of the heightened risk of model vulnerabilities transmitting to downstream applications through open-sourcing.

“The open-sourcing of foundation models … will widen their impact and complicate repairs, making it easier for criminals to train ‘malicious models’,” the body said in a new update to its “AI Safety Governance Framework”.

The Nature paper also revealed for the first time R1’s compute training cost of US$294,000 – the subject of much speculation following the model’s high-profile release in January, because it was significantly lower than the reported training costs of US models.

The paper also refuted accusations that DeepSeek “distilled” OpenAI’s models, referring to the controversial practice of training a model using the outputs of a competitor’s model.

Meanwhile, news of DeepSeek being featured on the front page of the prestigious Nature journal has been widely celebrated in China. On social media, the news quickly began trending, with DeepSeek referred to as the “first LLM company to be peer-reviewed”.

According to Fang, this peer-review recognition might encourage other Chinese AI companies to be more transparent about their safety and security practices, “as long as companies want to get their work published in world-leading journals”.