It’s now undeniable that large language models (LLMs) have paved the way for groundbreaking advancements in many fields, including code generation. However, these machine learning innovations do not come without inherent risks. We’ve learned that these models can generate code with hardcoded secrets, such as API keys or database credentials. This practice stands in stark contrast to the recommended way of managing secrets – through a secrets manager.
Understanding hardcoded secrets
Hardcoded secrets refer to sensitive data that are directly embedded in the source code, including database credentials, API keys, encryption keys, and other types of private information. While this may seem like a convenient method of storing and accessing this data, it poses significant security risks.
If this code were to fall into the wrong hands, those secrets would be exposed, and attackers could compromise the associated services. Furthermore, hardcoding secrets in source code can cause issues if the company needs to rotate keys or change passwords. Doing so would require changing the code itself, then recompiling and redeploying the application.
The role of LLMs
LLMs such as ChatGPT-4 have exhibited an impressive ability to generate code snippets based on given prompts. Although they are designed to understand the context and generate code that aligns with best practices, they may occasionally produce code with hardcoded secrets because of the nature of the training data they were fed.
For instance, if the training data includes numerous code snippets with hardcoded secrets, the LLM might replicate that pattern. It’s important to note that the model doesn’t inherently understand the concept of “secrets” or their security implications. It merely mimics the patterns it has observed during training.
The influence of documentation on LLMs
ChatGPT-4 has been trained on a diverse range of data, which can include public code repositories, technical blogs, forums, and, importantly, documentation. When these models encounter repeated patterns across the training data – like hardcoded secrets in code snippets – they learn to replicate those patterns. The code generated by LLMs reflects what they have “seen” during training. So, if the training data includes examples of hardcoded secrets in code snippets, it’s possible the LLM will generate similar code.
While it’s easier to hardcode secrets in example code, it’s crucial that documentation writers balance simplicity with responsible coding practices. It’s possible to add a disclaimer to indicate that the code was developed for illustrative purposes only and that in a real-world scenario, we should never hardcode secrets.
However, it would be even better if the documentation offered examples of how to use secrets managers or environment variables for handling sensitive data. This way, readers would learn how to apply best practices from the very start, and LLMs trained on these examples would generate more secure code.
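As one illustration of what such a documentation example could look like, here is a minimal Python sketch that reads an API key from an environment variable instead of embedding it in the source. The variable name `EXAMPLE_API_KEY` and the helper `load_api_key` are hypothetical, chosen only for this sketch.

```python
import os


def load_api_key(var_name: str = "EXAMPLE_API_KEY") -> str:
    """Return the API key stored in the given environment variable.

    The variable name is a placeholder for this sketch; use whatever
    name your deployment defines.
    """
    api_key = os.environ.get(var_name)
    if not api_key:
        # Fail fast with a clear message rather than falling back to a
        # default value embedded in the code.
        raise RuntimeError(f"Environment variable {var_name} is not set")
    return api_key
```

The value itself would be set outside the code, for example in the shell or in the deployment configuration, so the snippet can be copied from documentation without carrying a credential along with it.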
The importance of secrets management
Secrets management refers to the process of securely storing, managing, and accessing sensitive data such as API keys, passwords, and tokens. Secrets managers like HashiCorp Vault, AWS Secrets Manager, or Azure Key Vault offer a more secure and scalable way to handle secrets. They typically offer features such as automatic encryption, access control, auditing, and secrets rotation.
By using a secrets manager, sensitive data never gets embedded in the code, thereby mitigating the risk of exposing secrets. Additionally, when the secret values need changing, the developers can do it directly in the secrets manager without touching the codebase.
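To make the idea concrete, the following Python sketch fetches a database password at runtime from AWS Secrets Manager. It assumes a boto3 Secrets Manager client is created elsewhere (with `boto3.client("secretsmanager")`) and passed in, and that the secret is stored as a JSON string. The secret name `example/db-credentials`, the JSON key `password`, and the helper `get_db_password` are hypothetical.

```python
import json
from typing import Any


def get_db_password(client: Any, secret_id: str = "example/db-credentials") -> str:
    """Fetch a database password from AWS Secrets Manager at runtime.

    `client` is expected to be a boto3 Secrets Manager client, e.g.
    boto3.client("secretsmanager"). The secret name and the "password"
    JSON key are assumptions made for this sketch.
    """
    response = client.get_secret_value(SecretId=secret_id)
    # Assumes the secret was stored as a JSON string, not as binary data.
    secret = json.loads(response["SecretString"])
    return secret["password"]
```

Because the value is looked up when the function is called, a password rotated in the secrets manager is picked up on the next call, with no change to the codebase.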
Mitigating the risk
Here are several strategies to mitigate the risk of hardcoded secrets in code generated by LLMs:
- Post-generation review: Perform a thorough review of the generated code. This should include manually checking for hardcoded secrets and using automated tools that can scan for potential issues.
- Training data sanitization: If possible, sanitize training data to exclude code snippets with hardcoded secrets to reduce the likelihood of the LLM replicating this insecure practice.
- Prompt optimization: By optimizing their prompts, developers can explicitly request code that uses secrets management. This could lead to the LLMs generating code that follows this best practice.
- Model tuning: If the team has control over the model training process, consider tuning the model to penalize the generation of hardcoded secrets.
- Secret detection: Apply secrets detection tools to code generated by LLMs at various points in the workflow, from PR to pre-commit and pre-receive. This mitigates the issue the same way it would if a developer had made the mistake.
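The review and detection steps above can be partly automated. The Python sketch below is a deliberately simplified scanner written for illustration, not a substitute for a dedicated secrets detection tool. It flags lines where a secret-like variable name is assigned a quoted literal. The regular expression, the `find_hardcoded_secrets` helper, and the sample snippet are all assumptions of this example, and a pattern this simple will miss many real secrets and raise false positives.

```python
import re
from typing import List, Tuple

# Flags assignments such as api_key = "..." or password: '...'.
# The name list and the 8-character minimum are arbitrary choices
# made for this sketch.
SECRET_ASSIGNMENT = re.compile(
    r"""(api[_-]?key|secret|token|password|passwd)\w*\s*[:=]\s*(["'])[^"']{8,}\2""",
    re.IGNORECASE,
)


def find_hardcoded_secrets(source: str) -> List[Tuple[int, str]]:
    """Return (line_number, line) pairs that look like hardcoded secrets."""
    findings = []
    for line_number, line in enumerate(source.splitlines(), start=1):
        if SECRET_ASSIGNMENT.search(line):
            findings.append((line_number, line.strip()))
    return findings


# Sample "generated" code: line 2 embeds a literal, line 3 reads
# the value from the environment and should not be flagged.
generated_code = (
    "import os\n"
    'API_KEY = "example-not-a-real-key-123"\n'
    'db_password = os.environ["DB_PASSWORD"]\n'
)

for line_number, line in find_hardcoded_secrets(generated_code):
    print(f"line {line_number}: possible hardcoded secret: {line}")
```

A check like this could run over LLM output before it reaches a commit, with a dedicated scanner still applied later in the pipeline.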
Although LLMs are powerful tools for code generation, it’s crucial to stay aware of the potential security risks, such as hardcoded secrets. Ensuring good practices in handling sensitive information has become a fundamental part of responsible AI use and development.
Lotem Guy, vice president, product, Cycode