Traditional smart home systems often rely on heterogeneous sensors and rigid rule-based pipelines, which limit scalability, flexibility, and ease of deployment. We propose a highly extensible smart home framework driven by multimodal large language models (MLLMs), using a single camera as the minimal sensing unit for unified home-environment perception, structured labeling, and autonomous decision-making. The system automatically generates standardized tags describing, for example, room cleanliness, object locations, and human activities, which in turn enable functions such as cleaning scheduling, health reporting, and object tracking. The same tag schema can be extended to organize and integrate additional data sources, supporting comprehensive and scalable smart home management. Preliminary results demonstrate the feasibility and promise of this single-camera, MLLM-powered approach to smart home automation.
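To make the tag-based design concrete, the sketch below shows one way such standardized tags might be represented and routed to downstream functions. It assumes a tag carries a category, a free-text label, an MLLM confidence score, and a timestamp; the names `SceneTag`, `TagCategory`, and `route_tag` are hypothetical illustrations, not the paper's actual implementation.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class TagCategory(Enum):
    """Tag categories named in the abstract; enum values are illustrative."""
    CLEANLINESS = "cleanliness"
    OBJECT_LOCATION = "object_location"
    HUMAN_ACTIVITY = "human_activity"


@dataclass
class SceneTag:
    """One standardized tag emitted by the MLLM for a camera frame (assumed schema)."""
    category: TagCategory
    label: str         # e.g. "floor cluttered", "keys on kitchen table"
    confidence: float  # assumed MLLM-reported confidence in [0, 1]
    timestamp: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


def route_tag(tag: SceneTag) -> str:
    """Map a tag category to one of the downstream functions the abstract lists."""
    routes = {
        TagCategory.CLEANLINESS: "cleaning_scheduler",
        TagCategory.OBJECT_LOCATION: "object_tracker",
        TagCategory.HUMAN_ACTIVITY: "health_reporter",
    }
    return routes[tag.category]


if __name__ == "__main__":
    tag = SceneTag(TagCategory.CLEANLINESS, "floor cluttered", 0.82)
    print(route_tag(tag))  # -> cleaning_scheduler
```

Keeping the tag structure flat and enumerable in this way is what would let new data sources plug in by adding categories and routes, consistent with the extensibility claim above.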