Enhancing Data Granularity: Data Cleaning
Learn data-cleaning techniques to enhance the quality and relevance of your indexed data.
Building on chunking, this lesson focuses on data-cleaning techniques that further improve the quality and relevance of indexed data.
Data cleaning
Data cleaning involves removing irrelevant information and inconsistencies from your data. This improves the focus of each data chunk and ensures the LLM receives high-quality information for response generation. Here’s why data cleaning is crucial:
Improved retrieval accuracy: By removing irrelevant information, the retrieval process can focus on the most relevant data chunks that accurately match the user's query, leading to more precise responses.
Better context understanding: Cleaned data provides a clearer view of the context surrounding the information. This allows the LLM to understand the relationships between concepts and generate responses that are more coherent and relevant to the overall topic.
Reduced system bottlenecks: Removing unnecessary information leaves less text to index, retrieve, and pass to the LLM, which can improve processing efficiency during the retrieval and generation stages.
Data cleaning techniques
The following techniques are commonly used to clean data:
Stop words removal
Special character removal
Text normalization
Fact-checking and updating information
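The first three techniques above can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not a production pipeline: the function names, the tiny `STOP_WORDS` set, and the choice of NFKC normalization are all assumptions made for this example. Fact-checking and updating information usually needs human review or external sources, so it is not shown here.

```python
import re
import unicodedata

# Hypothetical, deliberately tiny stop-word list for illustration only;
# real pipelines typically use a much larger, language-specific list.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "and", "in", "on", "for"}


def normalize_text(text: str) -> str:
    """Text normalization: unify Unicode forms, lowercase, collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)  # e.g. the ligature "ﬁ" becomes "fi"
    text = text.lower()
    return re.sub(r"\s+", " ", text).strip()


def remove_special_characters(text: str) -> str:
    """Special character removal: replace anything that is not a letter,
    digit, or whitespace (underscores included) with a space."""
    return re.sub(r"[^\w\s]|_", " ", text)


def remove_stop_words(text: str) -> str:
    """Stop words removal: drop tokens found in STOP_WORDS and rejoin."""
    return " ".join(word for word in text.split() if word not in STOP_WORDS)


def clean_chunk(text: str) -> str:
    """Apply the three steps in order to a single data chunk."""
    text = normalize_text(text)
    text = remove_special_characters(text)
    return remove_stop_words(text)


if __name__ == "__main__":
    print(clean_chunk("The RAG pipeline   is fast & accurate!!"))
```

Normalization runs first in this sketch so that the stop-word comparison is case-insensitive, and special characters are replaced with spaces rather than deleted so that adjacent words do not merge.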